Deciphering the variability of goal-directed and habitual decision-making
Brenton Keller
Faculty of the Graduate School
University of Southern California
Doctor of Philosophy
Neuroscience
December 2018
Contents

1 Introduction
   1.1 An extended example
   1.2 Relationship to behavioral theory
   1.3 Computational modeling of reinforcement learning
   1.4 Parameter estimation and hierarchical Bayesian statistics
   1.5 Individual differences
   1.6 Intelligence, transfer learning, awareness, and mindfulness
   1.7 My goals and the bigger picture
2 Computational task history
   2.1 A task disassociating model-free and model-based learning
   2.2 Validation
   2.3 Stress
   2.4 Psychological populations
   2.5 Neuroscience
   2.6 Critiques: Task complexity, reward rate, and cognitive effort
   2.7 Reducing the complexity
3 Reinforcement learning model
   3.1 Introduction
   3.2 Methodology
      3.2.1 A sequential two-stage task
   3.3 Reinforcement learning model
      3.3.1 Basic Theory: Initialization
      3.3.2 Values versus choice preference
      3.3.3 Prediction error
      3.3.4 Model-free reinforcement learning
      3.3.5 Model-based reinforcement learning
      3.3.6 Hybrid reinforcement learning
      3.3.7 Choice selection
   3.4 Other parameters of interest
      3.4.1 Forgetting to default
      3.4.2 Perseveration
   3.5 Useful Parameterizations
      3.5.1 The categorical distribution and reinforcement strategy processes
      3.5.2 Learning rate
4 Parameter estimation
   4.1 Introduction
   4.2 Point-estimation
      4.2.1 Maximum likelihood estimation
   4.3 Hierarchical pooling
   4.4 Maximum a posteriori estimation
   4.5 Expectation-maximization
   4.6 Hierarchical Bayesian
      4.6.1 Posterior created via sampling
      4.6.2 Model comparison
   4.7 Estimating the parameters
      4.7.1 Precondition-postcondition fitting
   4.8 Bayesian hypothesis testing
5 Counterfactual
   5.1 Introduction
   5.2 Methods
      5.2.1 Participants
      5.2.2 Experimental design
      5.2.3 Data analysis
      5.2.4 Reinforcement learning model
      5.2.5 Model comparison
      5.2.6 Linear modeling
   5.3 Results
      5.3.1 General findings
      5.3.2 Punishment versus reward
      5.3.3 Counterfactual task
      5.3.4 Comparative processes and model-free valuation
   5.4 Discussion
6 Observational learning
   6.1 Introduction
   6.2 Methods
      6.2.1 Participants
      6.2.2 Two-stage valuation task
      6.2.3 Observational prediction task
      6.2.4 Questionnaires
      6.2.5 Data Analysis
   6.3 Results
      6.3.1 General findings
      6.3.2 Understanding the differences following experience-based learning and observational prediction
      6.3.3 Effects of observational prediction on model-free and model-based responding
   6.4 Discussion
7 Mindfulness
   7.1 Introduction
   7.2 Methods
      7.2.1 Participants
      7.2.2 Sequential learning task
      7.2.3 Toronto Mindfulness Scale
      7.2.4 Raven
      7.2.5 Self-control
      7.2.6 Trait mindfulness
      7.2.7 Mindfulness experience
   7.3 Results
      7.3.1 Manipulation check
      7.3.2 Model fitting
      7.3.3 Mindfulness increases model-based choice consistency
      7.3.4 IQ, mindfulness, and choice consistency
   7.4 Discussion
Bibliography
Preface
Neuroeconomics. Value-based reinforcement learning. Artificial intelligence. Machine learn-
ing. Hierarchical Bayesian modeling. All buzzwords I can use to describe the various re-
search areas that needed to come together to produce this dissertation. Honestly, I had no
idea what I was getting myself into when I started the Doctoral Program. I had a Bachelor’s
in Neuroscience and Philosophy, but I really knew nothing.
I had taken statistics classes, but I did not fully understand just what was being done
to the numbers to lead to a conclusion of statistically different. I had taken neuroscience
classes, but I was far away from having a fundamental grasp of just how important each
link of a complicated biological pathway is. I had learned many topics in my psychology
classes, but in a way I differentiated between them and did not recognize how some topics
were referring to the same thing in the external world. This dissertation involves learning,
and I have been learning my whole life. But just because I had been doing that task for so
long did not mean that I understood what was actually going on.
And that is the first idea that I want to put into your head. Typically, our beliefs are
stronger than the evidence supporting them. Sometimes we think we know something de-
spite little evidence. Other times we act like we know something but never question
whether we in fact know it to be true – we just act as if it is, without a second thought.
One way in which we learn is by comparing what happened in the external world against
a prediction of what we thought would happen, and then we adjust our beliefs accordingly.
But if our beliefs are too strong, we might not budge from our stance despite evidence
against it. Worse, we might not even make the comparison; we just think that that is the way
it has to be because that is the way it has been in the past.
As you read this, I aim to keep your mind open not with a mountain of references and
scientific jargon, but with examples – ways in which I can relate some of the more complicated
algorithms and intricate discoveries to something familiar to everyone. From there, I hope
that you can abstract from these examples and see just how universal the topics, concepts,
methods, and philosophy are.
Chapter 1
Introduction
1.1 An extended example
Life involves decisions, yet in some situations we use different mental processes when se-
lecting one action over another. For example, consider driving down a two-lane freeway.
Without much thought, you find yourself in the right-hand lane. You do not remember con-
sciously thinking that the right-hand lane was better than the left-hand lane, or even com-
paring between the two choices. Rather, you have learned from past experience that if you
are not passing another car, it is best to stay in the right-hand lane. This heuristic was learned
from past experiences of driving on similar roads, and is now embedded into an automatic
context-dependent behavior.
While on the same two-lane freeway, you notice that you are approaching another vehi-
cle. You consider whether or not to pass the car. After realizing that you are not in a rush
and that you are already above the speed limit, you decide to slow down and stay behind the
vehicle. Here, the action of slowing down was effortful; you were cognizant of the internal
decision process. You considered your main goal – i.e., driving safely to your destination –
along with other factors – e.g., not being in a rush – and selected an action that best aligned
with those goals – i.e., slowing down and following the car from a safe distance.
Obviously, if the situation were different, your resulting action might also change, and
these two ways of deciding upon an action may produce different responses. For
example, in regard to the first situation, it is unlikely that you keep the same behavioral
policy – i.e., remain in the right-hand lane – when in a city. If you do, then you are going to
be waiting quite often, as the cars turning right must yield for crossing pedestrians. Again,
you are not actively thinking, 'on the highway stay right, in the city avoid the right lane.' In
fact, if you are like my fellow Angelenos, you probably are in the correct lane, but your
attention is directed towards the conversation you're having on your cellphone and the dog
in your passenger seat.
Whereas in the second situation, say the approaching car was driving erratically. You still
realize that you are speeding and that you are not in a rush, but you also see the potential
danger of remaining near the seemingly reckless driver. You have not witnessed many acci-
dents first-hand, but the behavior of the approaching vehicle worries you and you decide to
pass when given an opportune moment. Again, in this example, this type of process remains
effortful. You considered many different factors and weighed each piece of information in-
teractively with your goals, thereby bringing you to perform the action which you believe is
most optimal considering all the various circumstances.
Perhaps unintuitively, there are times when these two ways of processing information
can be at odds. For example, the garage at my apartment complex requires me to turn left
when parking my fiancee’s car and right when I am parking my car. When parking my car, I
am almost always by myself. Whereas the times I drive my fiancee’s car, she is often in the
car with me.
In this one situation, I internalized many different pieces of information. For example,
her car should be parked by turning left, thus if I find myself in her car, I should turn left.
Similarly, if I find myself in my car, then I should turn right. This requires state-information
( ’where am I currently’ ) to be translated into action preferences ( ’given where I am, what
should I do’ ). This goal-oriented process requires consideration of multiple associative con-
nections driven via forward thinking – i.e., action-outcome understanding.
Whereas, another way I internalized my action preferences was via context-based auto-
maticity – i.e., the actions were habitualized. Here, after performing an action many times
in a specific context, I have developed a seemingly automatic action preference. For exam-
ple, the action of turning left was habitualized after many repetitions of turning left when in
her car, which was often paired with her being in the car. And rather than having to think
to myself, ’Where am I?’ to determine the best action, somewhat automatically I find myself
instinctively turning left in this context. Furthermore, the action of turning right was habit-
ualized after many repetitions of turning right in my car, and this automaticity was learned
with its own unique sensory contexts – i.e., a lower profile to the ground and often blasting
trance music.
On this one rare occasion that she was with me in my car while parking, I mistakenly
turned left. In one sense, I knew my goals – park the car in the correct spot. But life can
be distracting. I am not continuously asking myself which car am I in. However, if I was
asked which car I was in, I would be able to respond correctly without hesitation. But even
then, that doesn't necessarily mean that I would have performed the correct action (i.e., turn
right). Being able to mentally consider the various actions and their immediate and delayed
interactions with the world is difficult. Even though I realize that I am in my car, the fact that
when my fiancée is in the car with me I almost always turn left bears some "non-conscious"
influence on my action selection, and affects my ability to select the correct response.
And there are reasons for this automaticity. To begin, it requires less effort and often pro-
duces a favorable action. This habitual process in a sense is saying, "in this situation, I'm
going to do this because it has worked before." And because it has been shown to work quite
well previously, we continue to respond in that way rather than reevaluate whether there is
a better option. Eventually, this gets to a point where we forget we are performing the action
altogether, and it is just ingrained in an almost context-dependent stimulus-response behavior,
allowing us to focus on other tasks while doing actions that were once effortful.
There are also obvious advantages to the goal-directed way of thinking. It allows flexibil-
ity in our behavioral policy. Unlike habitualized actions, we do not have to rely only on what
has worked before. With goal-directed processing, we consider not just which actions have
previously been shown to be good, but in which situations they worked and what potential
factors allowed for that action to be successful in that specific situation. As we consider more
and more information, we are better at accurately selecting the optimal action given the cur-
rent state of the world because our understanding of the world (i.e., the model) is more richly
detailed.
At the heart of this dissertation, we examine these two types of on-going learning pro-
cesses. Ultimately, information processing is on a continuum. There are times when we are
goal-oriented, times when we are habitual, and times when we show a consideration of both
factors. And this dissertation examines the variability relating to these processes, what types
of manipulations affect goal-directed and habitual behavior, and in which types of people.
1.2 Relationship to behavioral theory
This dissertation begins with an overview of computationally measuring goal-directed versus
habitual action selection through value-based learning tasks. In these types of decision tasks,
these two learning mechanisms have been modeled through model-based and model-free
reinforcement learning algorithms. Specifically, these sequential tasks contain a number of
states and a reward scheme that allows us to differentiate between subjects who may
be responding using supposed goal-directed and/or habitual valuation. Unlike previous ex-
periments, this has allowed us to disassociate and quantify goal-directed against habitual
behavior.
In order to construct computational models of decision-making, it is essential to have a
strong foundation in exactly what previous experiments in behavioral psychology, neuro-
science, and economics have been able to show us, recognizing both the findings and
limitations of those studies.
1.3 Computational modeling of reinforcement learning
After setting a context for my research, the dissertation focuses on the field of reinforcement
learning from computer science as the primary model for human on-going valuation learn-
ing. During these chapters, I examine how reinforcement learning models are designed to
model the specific mental processes which the tasks were designed to capture.
Specifically, I trace the history of a sequential two-step task developed by Daw and colleagues
[26]. Beginning with the task’s structure, I outline how each element of the task relates to
concepts of goal-directed and habitual decision-making as described in reference to behav-
ioral psychology. From there, we will discuss the limitations and extensions of the task.
Ultimately, the sequential two-step task has allowed us to computationally describe sub-
ject learning behavior. Yet there are questions as to whether model-free responding is syn-
onymous with habitual responding.
1.4 Parameter estimation and hierarchical Bayesian statis-
tics
Afterwards, the dissertation focuses on computational methods for estimating subject-/group-
level parameters of the reinforcement learning model. During this section, I outline the com-
putational basis for the workings of the reinforcement-learning model used to quantify be-
havior, and the various decisions made during the free parameter estimation process.
This chapter is very important to me. At the beginning of my dissertation, I had no
knowledge of model fitting. My parameter fits were noisy and my conclusions inconclusive. My
advisers were supportive and allowed me to grow with my data, not forcing me to publish
following my initial analysis. They allowed me to have additional time to fully comprehend
what I was doing along each step of the way.
I found that there is an art to statistics, just as there is an art to photography or music.
There are many ways to an end product – whether it be a completed statistical model, pho-
tograph, or song. There may be no universal absolute, i.e., we cannot say "this photograph
is better than this photograph". Because what is better? Similarly, I can construct multiple
models with very small differences, and this leads to slightly different conclusions. Can I re-
ally say "this model is better that model"? In one sense, I can say that after examining the
data, this model was better able to explain the variation in the acquired data than this other
model. Similarly, for the photograph example, I might be able to say, "after acquiring choice
preference data, I can confidently say that the general population prefers this image to that."
So, while many different modalities have "artistic features" – i.e., the end product is the
result of many, many small decisions that form an overall end product – there is some truth
to be extracted from these end products. It just requires a context along with proper forms of
analysis. There is some meaning behind the fact that more people prefer this image to that
image. Just as there must be some meaning that the data was better explained via this model
compared to that model.
1.5 Individual differences
In three separate experiments, I examine how a number of experimental conditions affect
performance in this task used for disassociating between goal-directed and habitual pro-
cessing.
In each of my experiments, I begin with a precondition task preceding and a postcon-
dition task following some experimental manipulation – e.g., counterfactual learning, ob-
servational learning, mindfulness. Although each experiment has its own unique research
question, there were consistent truths behind each dataset. To begin, subjects' performance
in the task can typically be classified in one of three ways: goal-directed, model-free, and
poor performers. This subject clustering was reliably identified by using the fitted subject-
level parameter estimates in a cluster analysis consisting of principal component analysis and
hierarchical clustering.
Given this finding, it was important to not only examine what a specific manipulation
does to behavioral policy, but also in which types of subjects. For example, if some sub-
jects do not show goal-directed processing in the precondition task, it would be concerning
if the model did not examine if the experimental manipulation affected these types of sub-
jects differently – i.e., the model should test for block × goal-directedness and block × goal-
directedness:condition interactions. Puzzlingly, in my field very few studies involve within-
subject test-retest comparisons, and even fewer focus on whether there were these
subject-type:task interactions.
Aligning with my previous beliefs, as I modeled this possibility, I was able to chip away
at the ambiguity of my previous findings. Despite adding complexity to the model, these
new explanatory terms allowed a closer examination to the data’s latent truths. Throughout
this dissertation, each model presented has gone through countless revisions. Some modeled
the data better, some worse, but the data the models were tested against remained the same,
and so too did the goal of wanting to best understand what is contained within the obtained
data.
1.6 Intelligence, transfer learning, awareness, and mindful-
ness
With those topics and chapters outlined, I want to bring up something that I have observed
in my life as a result of my research.
When I discuss topics of intelligence, or more specifically fluid intelligence, people of-
ten misunderstand exactly what is discussed and often respond with some level of emotion
clouding over rationality. In my studies, fluid intelligence was examined via a pattern recog-
nition task (Raven [95]).
Personally, I have learned that one critical aspect of intelligence is association learning.
Being able to take what one has learned in one context, and infer how what has been learned
can be applied to other seemingly unrelated contexts. And this is something that is shared
between my studies.
With that in mind, I think there is something to learn from just about everything. As
previously mentioned, this dissertation will also cover the concepts of cognitive effort. Nat-
urally, we want to reduce cognitive effort. And in today's day and age, we are only beginning to
see the effects of this. Yet practices such as mindfulness aim at restoring our cognition of the
events around us.
1.7 My goals and the bigger picture
It would be wrong to not state my goals in a dissertation concerning goal-directedness. Of
course, there are many goals. To begin, I need to graduate by a certain date so that I can start
my next position - and this shapes what I can and cannot accomplish.
Continuing forward, I believe that science should be repeatable and easily understand-
able. Obviously, this cannot be done with every topic; some things are just extremely abstract
and require a strong foundation to grasp the full meaning behind a study. But something
such as decision-making? It falls on my shoulders to make sure that no matter how abstract I
make my tasks, I am able to easily relate the tested concepts to a general audience. If I cannot
explain these topics, then how could I design a task that truly responded to some variation in
human behavior? And in this dissertation, I hope to respond to the abstractness of these
types of tasks and provide insight into how they attach to some general aspects of human
behavioral responding.
Similarly, science should be easily accessible. I must be confident in my analyses, open-
minded and open to criticism, and make the science accessible (easily understandable, ex-
act statistical methodology, willingness to express limitations). I proudly present how I con-
ducted each statistical analysis in my research because that is what is important.
Researchers may claim that this significant difference represents that this group showed
a greater increase of this process than a control group. But unless I see every step of the way,
I can only take their word that their model was constructed appropriately. Seemingly small
statistical choices can have rather big implications with even simple statistical models.
Furthermore, by making my data open source, and representing every transformation in
my data analyses, I allow others to take what I have accomplished and further expand on
it. This is how we progress forward. By taking the best of the previous and expanding on
those methodologies. I am not saying that my methods are best. Rather, in order for me to
take my current analyses to where they are now, I went through many, many papers using
similar analyses. I tested each model and learned the limitations of each type of model, just
how they differ, and why one may choose to use one or the other. Ultimately, the process I
settled on was the one which best aligned with my goal of understanding human behavior –
the process that provided the most accurate representation of what is truly represented in
the data given the chosen model.
Although this may be difficult with some more abstract topics, it is obvious how decision-
making is entwined into our everyday lives. However, finding a balance of science, philos-
ophy, and statistics is extremely challenging. Ultimately, my goal is to present what I have
learned in an easy to relate manner. There will be some subjects irrelevant to some audi-
ences, but I hope that even in those irrelevant sections, I explain my methodology in such a
way that it showcases the complexity behind each decision – and, furthermore, that decisions
can be broken down into choices, and choices can be represented by some value, thereby
influencing choice behavior. Using this framework, we can reformulate almost any process
affecting behavior.
What I learned the most during this Ph.D. involving reinforcement learning was how to
learn. Regardless of whether we are talking about reinforcement learning or statistics, we
have beliefs, e.g., a belief about an action's current value, a belief about whether a value is credible
given some data. These beliefs are often completely unknown at first, and through evidence
and our own natural tendencies we act in specific ways.
With that being said, I believe that this can be further extended into the importance of
two concepts: awareness and relatedness. Goal-directed action cannot occur without aware-
ness of a goal. Without a goal in mind, the subject is acting instinctively. On the concept of
relatedness, one thing I discuss is how goal-directed behavior involves simulation, which re-
quires intelligence (non-directly experienced learned information). Overall our actions are
the result of a multiple-consideration, probabilistically weighted algorithm (we consider many
different things in unequal ways). Although we cannot predict actions all the time, there is
some average behavior that we can identify for the different groups of individuals.
Chapter 2
Computational task history
With the advancement of statistical and computational methods, behavioral research has
transitioned from simple summary statistics, such as probability to select from an advanta-
geous deck [11], to reinforcement learning models with a number of free parameters quan-
tifying supposed mental processes in controlled task environments [26][90][109][29][133].
However, it is important to understand the way in which these computational tasks work,
and what exactly has been shown through the studies that have utilized them. Although the
field is relatively young, with key studies first appearing in 2011, new statistical techniques
have developed since then (e.g., expectation maximization [58] and rStan [120]).
One reason new methods emerge is to respond to previous limitations, such as limita-
tions in the ability to find a maximum to an equation and limitations in our current under-
standing of related topics. These advances allow construction of new experiments to further
push our understanding. But we must also look back at what was done before – not only
to build a strong foundation, but to evaluate what has stood the test of time, to introspect
about the potential flaws of the past and how we responded to them, and to ask what still
remains ambiguous.
2.1 A task disassociating model-free and model-based learn-
ing
In two pivotal papers, Daw, Glascher, and colleagues [26][51] translated the concepts of goal-
directed and habitual behavior into a sequential learning task, thereby allowing quantifica-
tion of each behavioral process.
Goal-directed and habitual behavior were measured concurrently in both tasks. That is,
the authors quantified two separate processes per subject during the task. Habitual process-
ing was equated to model-free reinforcement learning, whereas goal-directed computation
was equated to model-based reinforcement learning. These two ways of responding differ in how trial
outcome is interpreted. With different interpretations of reward, beliefs concerning each ac-
tion’s value and subsequent choice selection differ.
In this task, each trial began with a selection between two stimuli – originally represented
by Tibetan symbols. Each of the two first-stage symbols probabilistically transitioned to one
of two second-stage states. After transitioning to a second-stage state, the subject selects
from another set of two symbols. Following this second selection, the participant was either
rewarded or received no reward (Figure 2.1).
Here, reward was directly affected by the second decision, but the acquired second-stage
state (which contained the two potential second-stage choices) was probabilistically depen-
dent on the participant's choice in the first stage. That is, one first-stage choice was more
likely to transition to one second-stage state, which itself contained another binary deci-
sion, with each second-stage choice having its own latent value (i.e., reward probability),
whereas the other first-stage action led more commonly to the other second-stage state.
Thus, each of the two first-stage choices infrequently transitioned to the second-stage state
that the other option more commonly transitioned to. It is following these rare transition
trials that model-free and model-based valuation most strongly differ.
As previously mentioned, habitual action is repeating an action because it has been pre-
viously shown to be rewarding, whereas goal-directedness involves using a model to calcu-
late what is the most advantageous option given the current state and understanding (e.g.,
action-outcome contingencies) of the world. With this in mind, model-free and model-
based reinforcement learning algorithms differ in the way first-stage action values are calculated.

Figure 2.1: Two-stage task. In this task, subjects made two choices. In the first stage, they decided
between two symbols which probabilistically transitioned to one of two second-stage states. Participants
then made yet another decision, which directly affected the probability of reward.
In model-free reinforcement learning, the selected first-stage choice is reinforced by ei-
ther the acquired second-stage choice (model-free SARSA(0)) or by the trial’s outcome (model-
free SARSA(1)). In a way, model-free learning is reactionary [130]. An action is performed, an
outcome is eventually realized, and the action’s value is updated based on the newly sampled
information.
Whereas, model-based reinforcement learning involves forward computation. The opti-
mal first-stage action is determined by each choice’s probability to transition to each second-
stage state and the value of the optimal choice in each of those states. The model-based
learner is commonly modeled as performing a tree search (i.e., mental simulation, planning,
etc.) over all possible scenarios [24][61]. This type of learner considers how each action
transitions to a second-stage state, and which of the second-stage actions is most favorable.
Ultimately, it selects the first-stage action that is more likely to transition to the second-stage state
containing the currently believed best second-stage choice.
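To make the difference concrete, the sketch below contrasts the two first-stage valuations in a stripped-down form. It is illustrative only: the learning rate, the 70/30 transition matrix, and the variable names are assumptions for this example, not the model fitted later in this dissertation.

```python
# Minimal sketch (assumed names/values) of how model-free and model-based learners
# value the two first-stage actions of a Daw-style two-step task.
import numpy as np

alpha = 0.3                       # learning rate (assumed)
q_mf = np.zeros(2)                # cached model-free values of the first-stage actions
q_stage2 = np.zeros((2, 2))       # values of the two choices within each second-stage state
trans = np.array([[0.7, 0.3],     # P(second-stage state | first-stage action), assumed 70/30
                  [0.3, 0.7]])

def model_free_update(a1, s2, c2, reward):
    """Reinforce the chosen actions directly by the trial outcome (SARSA(1)-style)."""
    q_stage2[s2, c2] += alpha * (reward - q_stage2[s2, c2])
    q_mf[a1] += alpha * (reward - q_mf[a1])

def model_based_values():
    """Plan forward: weight the best second-stage value by the transition probabilities."""
    best_second_stage = q_stage2.max(axis=1)
    return trans @ best_second_stage

# Example: first-stage action 0 rarely transitions to state 1, where choice 0 is rewarded.
model_free_update(a1=0, s2=1, c2=0, reward=1)
print(q_mf)                  # [0.3, 0.0]  -> model-free credits the action just taken
print(model_based_values())  # [0.09, 0.21] -> model-based credits the other action,
                             #                 which commonly reaches the rewarded state
```

Following a rewarded rare transition, the model-free values favor the action just taken, while the model-based values favor the other first-stage action, which is exactly the behavioral signature the task is designed to expose.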
Daw, Glascher, and colleagues tested numerous models to quantify the participants’ choices,
including pure model-free or model-based valuation learning. However, the best fitting
model was one that included consideration from both valuation processes. The model considered
both model-free and model-based processing (model-based weight, $\omega = \frac{\beta_{mb}}{\beta_{mf} + \beta_{mb}}$)
to best fit the subjects' choices in the task, along with functional MRI data acquired concurrently.
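For reference, hybrid models of this form are commonly written with each valuation stream entering a softmax choice rule under its own inverse temperature; the notation below is the generic formulation, not necessarily the exact parameterization fitted in this dissertation:

\[ P(a \mid s_1) \propto \exp\big(\beta_{mf}\, Q_{mf}(s_1, a) + \beta_{mb}\, Q_{mb}(s_1, a)\big), \qquad \omega = \frac{\beta_{mb}}{\beta_{mf} + \beta_{mb}} \]

Here $\omega$ summarizes the relative contribution of model-based valuation, with $\omega = 1$ indicating purely model-based responding and $\omega = 0$ purely model-free responding.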
2.2 Validation
The two-stage task has been replicated in a number of other studies, which often come to the
conclusion that the population is best represented by a valuation learning model that con-
siders a weighting of both model-free and model-based valuation when making choices.
Gillan and colleagues [50] provide a connection between model-based responding and
devaluation sensitivity. Devaluation sensitivity relates to withholding a previously rewarding
action, given newly acquired information learned in a separate circumstance. Most com-
monly, a learned rewarded action (e.g., pressing a lever for a candy treat) has the outcome
devalued in a separate circumstance (e.g., freely given candy that is poisoned in a different
room). Then the animal is tested in the original environment to see if the action is still per-
formed (i.e., does the animal still press the lever).
In their experiment, an individual utilizing goal-directed processing will be able to de-
value the action because they understand that the action is a means to a specific desired
outcome. That is, they are able to realize that the action no longer will result in the reward-
ing experience that was previously learned through direct experience. Whereas, an indi-
vidual responding habitually will continue to select the devalued option because they have not
yet experienced the devaluation in the context where the action was performed. Although
they learned in a separate context that the outcome was devalued, their action-value repre-
sentation does not involve the outcome (candy); rather, just that the action was previously
rewarding (pressing the lever led to satisfaction/reward).
Figure 2.2: To implement a devaluation paradigm, the to-be-devalued option was main-
tained at a 90 % chance of reward, whereas the option that would remain valued was
maintained at a 10 % chance of reward. Following the devaluation, those responding
with goal-directedness would be more likely to select from the option that had a lower
chance of reward because that outcome still had value.
Devaluation sensitivity was tested at the end of their reinforcement-learning task. One
choice option was stabilized at a 90 % chance of receiving a silver coin, whereas the other was
stabilized at 10 % chance of receiving a gold coin. During the task, gold and silver coins were
equally valued; therefore, at the end of the experiment, subjects developed a preference for
the symbol that most commonly led to reward, i.e., selecting the action that 90 % of the time
resulted in a silver coin, rather than a 10 % chance of resulting in a gold coin. After 50 trials
with these contingencies, the subject was then informed that the silver coin was devalued,
and that they would no longer receive feedback concerning the trial outcome (Figure 2.2).
Participants behaving in a goal-directed manner should be able to update their action-
values through an action-outcome understanding. Since the silver coin has been devalued,
the first-stage action that was previously most rewarding is no longer optimal. This real-
ization is made via simulation, i.e., the subject has yet to encounter a situation where they
selected the previously advantageous choice and the silver coin was devalued; rather, they
updated the value of selecting that symbol based upon task instructions.
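In expected-value terms (a simplified illustration; treating each coin as worth 1 before devaluation and the silver coin as worth 0 afterwards is my assumption, not a detail stated by the study), the goal-directed preference should flip:

\[ \text{before: } 0.9 \times 1 = 0.9 > 0.1 \times 1 = 0.1, \qquad \text{after: } 0.9 \times 0 = 0 < 0.1 \times 1 = 0.1, \]

so a goal-directed learner now favors the option with only a 10 % chance of reward, because that outcome (the gold coin) is the only one that retains value.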
The researchers found that the subjects who continued to select the action that pro-
duced a silver coin 90 % of the time (i.e., did not show reward devaluation) had significantly
reduced model-based responding during the reinforcement learning task compared to the
subjects that showed reward devaluation. Furthermore, no model-free effect on reward de-
valuation was found in either experiments, suggesting that model-free responding does not
21
Figure 2.3: Model-based, but not model-free, processing was associated with devalua-
tion sensitivity. Suggesting that model-based processing is protective of habit formation,
as defined by continued selection of a previously learned but now devalued outcome.
influence reward devaluation (i.e., high model-free processing does not necessarily lead to
worse devaluation abilities). Instead, model-based processing was found to be associated
with reward devaluation, i.e., goal-directness allows for increased flexibility in modifying a
previously learned action preference.
2.3 Stress
Stress has been previously shown to be detrimental to prefrontal cortex function [53]. Since
the development of this task a number of experiments have examined the relationship be-
tween stress and goal-directed behavior.
To begin, Otto and colleagues examined the effect of stress on goal-directed and habitual
processing by administering the cold pressor test before the two-stage RL task. Ultimately, they
found that stress, represented via the change in cortisol, caused a reduction in model-based
processing, but not model-free processing. Furthermore, the decrease in model-based processing
was attenuated as working memory (measured via the operation span test) increased, suggesting
that working-memory capacity can reduce the effects of stress on goal-directed processing [85].
Radenbach and colleagues, using a within-subjects design to examine stress (Trier Social
Stress Test), found that acute stress lowered choice consistency but did not alter model-
based weighting for all subjects. Rather, there was an interaction between chronic and acute
stress, such that subjects with high chronic stress levels exhibited reduced model-based con-
trol in response to acute psychosocial stress [93].
Figure 2.4: Stress (induced by the cold pressor task) was found to interact with reward
× transition, overall suggesting a reduction of model-based processing following acute
stress. However, working memory (as measured by OSPAN) was found to be protective of
this reduction.

Figure 2.5: In a test-retest design, increased cortisol (a measure of acute stress) was found
to be negatively associated with model-based weighting, whereas arousal was associated
with increased model-based weighting. Furthermore, chronic stress (as measured by the
Life Stress Scale score) further influenced the effect of cortisol: as chronic stress increased,
the negative association between cortisol and model-based weighting grew stronger.

Yet this differed for Park and colleagues, who compared stress against control using the
socially evaluated cold pressor test. They found that stress decreased the contributions of
negative model-free reinforcement learning, i.e., model-free learning following no reward.
Furthermore, stress resulted in less overall choice consistency with the valuation process,
suggesting that the stress condition affected the ability to utilize task information to select
advantageous choices. However, in their design model-based processing was not found to be
significant in either the control or the stress condition, bringing about methodological questions
– i.e., why did the population not show a significant effect for model-based processing, given
previous studies showing that choice behavior utilizes a combination of both valuation
processes?
However, stressing cognitive resources [84] has differential effects on model-based decision
making. According to the RL model, the model-based learner computes at the beginning of
each trial which option is the most advantageous, a strenuous task, and acute stress affects
goal-directed but not necessarily habitual valuation. Yet individuals with both increased
processing speed and above-average working memory show a shift from model-free to
model-based choice [99].

Figure 2.6: Binge drinking was found to be associated with decreased model-based
weighting and learning rate and increased perseveration.

Figure 2.7: Days since the last binge-drinking episode was found to be positively associated
with model-based weighting and processing and negatively associated with model-free
processing in a binge-drinking population.
2.4 Psychological populations
The task has been shown to be consistent with many previous studies that have found a link
between habitual and goal-directed processing and disorders of compulsivity and dopamine.
For example, Robbins and colleagues focus on impulsivity and compulsivity as key examples
of cognitive neural system deficits [96].
Alcohol consumption has been shown to have mixed effects on goal-directed behavior.
Binge drinking may have differential effects depending on the age of the drinker, their typical
drinking behavior, and their intentions [35]. Other studies suggest that binge drinking is
associated with an overall decrease of goal-directed valuation [103] and that those more likely
to relapse following drinking abstinence have lower goal-directed control than those less
likely to relapse [102].

Figure 2.8: Alcohol expectancies, as measured by the Alcohol Expectancy Questionnaire,
were found to be positively associated with model-based weighting in healthy controls,
but negatively associated in a group of alcohol-dependent patients who relapsed. Inter-
estingly, patients who did not relapse (abstainers) did not show an association between
alcohol expectancies and model-based weighting.

Figure 2.9: Following alcohol infusion, those measured as low-risk drinkers (e) showed a
decrease of model-based weighting, whereas those measured as high-risk drinkers
(AUDIT >= 8; f) showed an increase of model-based weighting.
To add more complexity to the situation, other studies have found that alcohol admin-
istration reduced habitual processing, while goal-directed decision making was increased in
high-risk drinkers (AUDIT >= 8) but decreased in low-risk drinkers following alcohol admin-
istration [80]. This overall leads us to question how these systems interact to influence action
preference.
Figure 2.10: Model-based weighting was found to be significantly decreased in obese in-
dividuals with binge eating disorder, obsessive compulsive disorder, and amphetamine
addiction.
Using voxel-based morphometry to contrast obesity by the presence of binge-eating dis-
order highlighted volumetric differences in the caudate and medial orbitofrontal cortex.
Figure 2.11: Levodopa was found to increase model-based weighting in a within-subjects
test-retest design.
Other impulsive populations with reduced goal-directed control include binge eaters.
To begin, high BMI in men is associated with lower behavioural adaptation with respect to
changes in motivational value of food – leading to automatic overeating patterns [57]. And
this was corroborated by studies showing that obese individuals with binge eating (but not
those who are obese without binge eating) show reduced goal-directed control [130].
2.5 Neuroscience
Previously, dopamine has been found to be involved in the (model-free) reward system [101],
but more recent investigations have shown how dopamine not only plays a role in action
value, but in more complex understandings, such as goal-directed consideration [138] and
value-neutral sensory features [117].
This was further investigated by Doll and colleagues. In their study, prefrontal function
was indexed by a polymorphism in the COMT gene, whereas striatal function was indexed
by a gene coding for DARPP-32. The study found that prefrontal dopamine relates to model-
based learning, whereas striatal dopamine relates to model-free learning [34].

Figure 2.12: Genes relating to frontal (COMT) and striatal (DARPP-32) dopaminergic
functioning were found to affect model-based weighting. More specifically, DARPP-32
was found to affect both model-free and model-based processing, whereas COMT was
found to affect only model-based processing.
Worbe and colleagues investigated the contribution of serotonin to model-based choice
in both rewarding and punishment contexts. Specifically in two separate sequential tasks
(one solely with either reward or no reward, the other punishment or no punishment), they
demonstrated that reduced serotonin neurotransmission (via tryptophan depletion) resulted
in impaired goal-directed behavior in the reward condition, but promoted goal-directed be-
havior in the punishment condition [137].
2.6 Critiques: Task complexity, reward rate, and cognitive ef-
fort
Akam, Costa, and Dayan further investigated the two-step task as originally formulated, and
concluded that increased model-based usage does not significantly increase reward [2]. Although
the task is designed in a way that should lead to increased reward for model-based compared
to model-free processing, ultimately, due to the stochastic nature of the random walks and the
binary sampling of reward, typical subject responding does not show increased reward with
increased model-based weighting.

Figure 2.13: Following tryptophan depletion, we see a decrease of model-based weight
in a reward-only version of the two-stage task. Model-based weighting did not differ in
the punishment-only version of the task.

Figure 2.14: In the reduced task, the second-stage decision is replaced with a state value
that directly influences trial outcome. This does not affect differentiation of model-based
and model-free decision-making, but does increase the association between model-based
weight and reward.

Figure 2.15: In the original task, increased model-based weighting was not associated
with an increase of reward due to the task's stochastic nature. Through various modifi-
cations (change of reward payoffs and no second choice), we see a strong change in the
association of increased reward with increased model-based weighting.
The group redesigned the two-step task so that increased model-based weight was as-
sociated with increased reward. In this task, the second-stage choice is removed; therefore,
each first-stage choice probabilistically transitions to one of two second-stage states, from which
a binary reward (0 or 1) is generated given that second-stage state's current reward probability. In the
paper, Akam describes shifting between reward rates of 20% and 80% for the two second-
stage states, while maintaining a consistent 80% - 20% transition probability for the first-
stage choices.
This critique responds to many difficulties with the task. For example, some papers
fail to show model-based processing [89], and personally, in one task my own subjects reported
that the rewards did not seem to be affected in any way by the first-stage choices. This difficulty
may in some way stem from how separate the first-stage choices are from the latent reward
value of the second-stage choices when using only reward/no-reward sampling.
Kool, Cushman, and Gershman came to similar conclusions as Akam and colleagues,
specifically that increased model-based weighting does not lead to increased reward in the
original task, and redesigned the two-step task to relate model-based usage to the concepts
of cognitive control and mental effort [64]. In response, Kool and colleagues compared a
novel task that also disassociates model-based and model-free learning against the original
formulation.
In their new task, there are two potential first-stage states (sets of spaceships), $s_{1,a}$ and
$s_{1,b}$, from which the trial can begin. On each first-stage state there are two stimuli (spaceships),
which deterministically transition to one of two second-stage states, $s_{2,a}$ or $s_{2,b}$. From there,
reward is generated by the acquired second-stage state's current value, producing a reward
between +1 and +5.
However, this version of goal-directed behavior has theoretical differences from the previous
tasks. In the previous tasks, the goal-directed learner's value update considers both the previous
trial's transition and the current second-stage value estimates. If a rare transition were to occur,
the goal-directed learner utilizes an internal map of the task to perform the action that is optimal
according to this calculation.
In comparison, in the Kool two-step task, the goal-directed learner must only remember
that choices $c_{1,1}$ and $c_{2,1}$ both deterministically transition to $s_{2,a}$. Therefore, if $s_{2,a}$ is more
greatly valued than $s_{2,b}$, then the subject should choose $c_{x,1}$, depending upon which first-stage
state they find themselves in. Model-free behavior, in contrast, updates the value of a choice only
through previous selections of that choice and the resulting reward.
Since this task is deterministic, the model-free learner does not update the value of the
equivalent first-stage action; e.g., the value of $c_{2,1}$ is not updated if the model-free subject
selects $c_{1,1}$, transitions to $s_{2,a}$, and receives reward. Therefore, in the previous task designs,
model-based learners had to realize that on some trials the trial's outcome should more greatly
affect the non-selected action. In Kool's design, the model-based learner had to only remember
that, for each first-stage state, one action leads to one second-stage state, whereas the other
action leads to the other second-stage state. Thus, if one action is rewarding, the adjacent
action in the other first-stage state is too.
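A small sketch may make the contrast clearer. Under the assumptions below (the names, the learning rate, and the reward value are illustrative, not taken from the published task), the model-based learner only needs a value per second-stage state, which automatically generalizes across the two equivalent first-stage actions, while the model-free learner keeps a separate cached value per action.

```python
# Illustrative sketch of the Kool-style deterministic task logic (assumed names/values).
# Action 0 in either first-stage state leads to second-stage state 0; action 1 leads to state 1.
alpha = 0.5
q_mf = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}  # cached per-action values
v_state = [0.0, 0.0]                                          # model-based state values

def update(first_state, action, reward):
    s2 = action                       # deterministic mapping: action index = landing state
    q_mf[(first_state, action)] += alpha * (reward - q_mf[(first_state, action)])
    v_state[s2] += alpha * (reward - v_state[s2])

update(first_state=0, action=1, reward=4)

# The model-based value generalizes to the equivalent action in the *other* first-stage
# state, but the model-free cached value for that action does not:
print(v_state[1])      # 2.0 -> credited regardless of which first-stage state was visited
print(q_mf[(1, 1)])    # 0.0 -> untouched; only (0, 1) was directly reinforced
```

The design choice this illustrates is that value generalizes through the shared second-stage state for the model-based learner, while the model-free learner must experience each first-stage action for itself.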
2.7 Reducing the complexity
Keramati and colleagues increased the task's complexity by having multiple levels of tran-
sitive action-outcome choices, thereby allowing them to examine the extent to which goal-
directed tree search is performed. They found that participants showed a mixture of pure
Figure 2.16: One issue with pure planning is the sheer number of potential transitions
between actions and outcomes. Here, the plan-until-habit strategy aims to model how
we may simplify this process, by allowing for goal-directed consideration of habitualized
action sequences.
planning against a strategy in which planning occurred using cached second-stage state
representations, or as they described it plan-until-habit. Furthermore, decreased resources
(modeled as reduced choice decision time) decreased the usage of pure planning, causing
an increase in plan-until-habit [61].
Cushman and Morris utilized a variation of the deterministic two-stage task [64] to in-
vestigate the way in which we prioritize certain goals. In this task, the subject learns that
two options are associated with one state, e.g., choices 1 and 3 transition to the blue state,
whereas the two other options are associated with another state, i.e., 2 and 4 transition to
the red state [24]. The value of these end states change throughout the experiment and the
subject only ever selects between 1 and 2 or 3 and 4.
Their experiment was most concerned with how we construct the value of one goal state
versus another. To examine this, on certain setup trials, a selection had a rare transition to
a third green state, and this was met with high reward. The critical test was whether the subject
would be more likely to select the equivalent action in the adjacent state.
Figure 2.17: The group tested goal selection by having a setup trial precede a critical
trial. On the setup trial, a selection between two actions which commonly lead to one
goal (blue or red goal) was followed by a rare transition (green state) and high reward.
Afterwards, the critical trial involved choices which lead to similar goals (blue or red) but
utilized different symbols from the setup trial.
Chapter 3
Reinforcement learning model
3.1 Introduction
In the previous chapter, we examined computational tasks targeting goal-directed and ha-
bitual decision-making. As shown, there are a myriad of ways to analyze these types of tasks.
For example, goal-directedness can be evaluated through overall reward rate [40], stay prob-
ability on a given trial type [34][106][83], or model-free versus model-based choice consis-
tency [29].
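As a concrete illustration of the second of these options, the stay-probability analysis reduces to a conditional mean over the previous trial's reward and transition type; a sketch of that computation, with assumed column names, is shown below.

```python
# Sketch of the standard stay-probability analysis (column names are assumed):
# the probability of repeating the previous first-stage choice, split by whether the
# previous trial was rewarded and whether its transition was common or rare.
import pandas as pd

def stay_probabilities(df: pd.DataFrame) -> pd.Series:
    """df needs columns: 'choice1' (0/1), 'reward' (0/1), 'transition' ('common'/'rare')."""
    df = df.copy()
    df["stay"] = (df["choice1"] == df["choice1"].shift(1)).astype(float)
    df["prev_reward"] = df["reward"].shift(1)
    df["prev_transition"] = df["transition"].shift(1)
    return (df.iloc[1:]                                  # first trial has no previous trial
              .groupby(["prev_reward", "prev_transition"])["stay"]
              .mean())

# A purely model-free pattern shows a main effect of previous reward; a model-based
# pattern shows a reward-by-transition interaction (rewarded rare trials reduce staying).
```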
However, no matter how simple or complex the analysis, the model used to investigate
hypotheses says something about our beliefs. In all of these analyses, we are using some
metric (e.g., reward rate / stay probability) to describe some latent belief, i.e., our statistical
measurement has some valid representation for one aspect of that subject’s abilities. Obvi-
ously then, some metrics of analysis will be a better representation for the latent abilities we
wish to investigate. For example, comparing subjects on reward performance may be less
robust than trial type stay probability because of the randomness associated with reward in
the task. Then comes the issue of determining which analyses are a better representation
of the data, and furthermore, how can we discover just what one metric is telling us in the
context of the model and/or the population we are examining – e.g., does reward rate really
tell us something about subject behavior, and if so, what?
My work has focused on how to mathematically formulate the processes relating to habit-
ual versus goal-directed behavior through a dynamic valuation task. As previously discussed,
this type of task models habitual behavior as an action policy with choices being directly re-
inforced by trial outcome (model-free RL). In comparison, goal-directed behavior utilizes a
model of the task environment to determine how previous information (previously made
choices along with the resulting trial transition and outcome) should guide future decisions
(model-based RL).
3.2 Methodology
In the next two chapters, we will begin by discussing the elements comprising the hybrid
reinforcement learning model. We will then move on with how to estimate the parameters
of the learning model given a behavioral choice dataset.
3.2.1 A sequential two-stage task
To explain the reinforcement learning model, first we describe our task in detail. As with
previous designs [27][37][99][103][34], we used a sequential Markov decision task to exam-
ine goal-directed and habitual processing. This sequential decision task consisted of 175
trials, and utilized one choice per trial to disassociate goal-directed and habitual processing
modeled via model-based and model-free reinforcement learning.
In this task, each trial began with a selection between the two first-stage symbols – repre-
sented as two locations. This first-stage choice was always between the same two locations.
Participants were given 2 seconds to make a response using the 'v' and 'n' keys to select the symbol on the left or right of the screen, respectively.
Following selection of a first-stage symbol, participants transitioned into a second-stage state denoted by an alien. In our experiment, there are two aliens representing the two potential second-stage states, $s_{2a}$ and $s_{2b}$.
Afterwards, the outcome of the trial was displayed, which was either reward (+1 to +3
points), punishment (-1 to -3 points), or no change (0 points). To allow for a disassociation
of model-free and model-based learning, trial outcome was directly affected by the current
value of the acquired second-stage state, and each action preferentially (75 %) transitioned
to one second-stage state, and less commonly (25 %) to the second-stage state more often
transitioned to by the other location.
34
Figure 3.1: Progression of a trial in the normal task.
In our design, each second-stage state had a latent value from 1 to 7 that probabilistically influenced trial outcome (Figure 3.2). To encourage on-going learning, these values pseudo-randomly and independently changed every 10-20 trials.
Thus, the selected action did not have a direct effect on reward probability. Rather, the selected action probabilistically influenced the trial's acquired second-stage state, which directly affected the trial's reward probability.
Given this design, model-free and model-based reinforcement learning action policies
differ. Model-free reinforcement learning is modeled as holding a cached representation for
each action’s value. This value is updated by that trial’s outcome, without consideration of
the second-stage state. In comparison, a model-based reinforcement learner maintains a
model of the task environment and a value representation for the second-stage states. Thus,
the model-based learner is thought to use previous trial outcomes to plan which action is
optimal given the task’s action-outcome contingencies.
As previously described, a model-based strategy is advantageous to a model-free strategy
[Plots: second-stage alien value across trials; reward probability by trial outcome for each latent value 1-7.]
Figure 3.2: The latent value of the acquired second-stage state directly affected trial outcome. Second-stage states' latent values varied between 1 and 7, and the two values were carefully pseudo-randomized to encourage on-going learning. We balanced the average value, value variance, and other factors such that a random policy throughout the whole block results in an average outcome of 0.
[Scatter plot: average block reward (-3 to +3 points) versus model-based weight.]
Figure 3.3: The task was designed in such a way that model-based weight
should increase the acquired reward. We find that reward is positively corre-
lated with model-based weight.
(Figure 3.3). Yet, the population is characterized by having variation as to which strategy
(among other factors) is utilized.
Participants were given a basic competency quiz about the elements of the task – e.g., alien versus location meaning, factors influencing reward, etc. Participants did not move on to testing until they correctly answered 8 of the 11 questions (on average 1 attempt was necessary, max = 3 attempts).
3.3 Reinforcement learning model
The reinforcement learning model is a flexible valuation policy that mathematically describes how new information is integrated to affect future choice behavior.
In each decision, choice options are represented by some latent value that probabilis-
tically influences choice. These action values are modulated by specific processes – e.g.,
goal-directed computation, habitual action selection, consideration of many versus only the
most recent past trials, stochastic behavior, etc. These processes are typically the variables
of interest, represented as free parameters in the model, and usually vary between subjects
or groups.
In this section, we will describe the basic way in which experience can modulate value
using a reinforcement learning valuation model that allows for the consideration of multiple
valuation processes.
3.3.1 Basic Theory: Initialization
Before we begin, since this work deals with goal-directed versus habitual behavior, it is im-
portant to consider our goal with regard to the hybrid RL model. In one sense, we are using
mathematics and probability theory to translate choice preferences and experience in the
task into values representing quantifications of mental processes. We take the 200 trial task
experience (composed of choice, second-stage state, and reward) and compress this into 4-
6 parameters, which are represented by 1 (or many) numerical values. Ultimately, credible
values for these estimated parameters are the ones that best align with the actual observed
choice on trial $t+1$ when inputted into the hybrid RL model.
However, our goal as data scientists is to parameterize the mental processes using nu-
merical values that can be used for hypothesis testing. There can be many ways to mathe-
matically describe choice behavior, but because we wish to compare between subjects and
groups to assert claims, it is important to not only best fit the choice data, but to do so in a
way that permits and facilitates such comparisons.
And this brings us to yet another intersection, one between probability theory, neuroeconomics, behavioral psychology, and the art of research. Assuming data was properly collected, then whatever lies in the data contains some truth. Models are created to explain variations in behavior to decipher some latent truth within the data. We make choices as to which analyses we run, and whether we are aware of it or not, we accept the limitations of those analyses. As we become better at modeling our beliefs and as new technology is developed allowing quick estimation of these processes, we can paint a clearer picture as to how, e.g., our experimental manipulation actually caused a difference between two groups.
Sometimes, a difference in parameterization will not fit a model better one way compared to another. And yet, you must select one model to use for further analyses. In such a case, the selected model should align with the scientific theory at the foundation of what is being explored – in our case, estimates relating to latent processes of goal-directedness [9][21][134], as modeled by a reinforcement learning framework [115][26].
In terms of where to begin, it is best to start simple and work the model up from there. I have coded the choice model in various programming languages and estimation methods, each slightly different from one another. Even when I had developed a very technical model using a more basic estimation method, when testing this model with a more advanced method I found myself wanting to understand what each coded decision imparted on estimation. For this, a building-up process proved to be vital.
Furthermore, various types of mistakes occurred when I used a more complex version
of the model without testing one that was more basic. Sometimes they were actual mis-
takes, such as inputting an incorrect parameter somewhere in the model. At other times, it
was a misunderstanding as to how my model would interact with the task. For example, if
two parameters have extremely similar effects and produce similar resulting behavior, then
estimation is slow and potentially inaccurate. The estimation algorithm is unable to deter-
mine whether to assign the variance to the first or second parameter because their role in
the model is such that they both produce behavior that is similar most of the time.
Lastly, it is important to realize that you can only squeeze out so many details about
the subject (i.e., free parameters) from their experience in the task. We are using a limited
number of trials to estimate latent truths about their abilities. As more and more parameters
are added to the model, the ability to estimate previously well estimated variates is reduced.
Some of the previously assigned variance is reassigned to other newly estimated parameters,
because they, e.g., share slight behavioral correlations between one another. Given that it
Figure 3.4: As the absolute difference (normalized) between $Q_{av}(1,c_{1,t})$ and $Q_{av}(1,c_{2,t})$ increases, the subject is more likely to select the action which has greater value, and therefore less likely to commit an "exploratory" action. In this case, exploratory action refers to selecting the action not currently estimated to be of the greatest value.
takes time to estimate the model, care needs to be taken in terms of which parameters to
include and how to improve the model based on previous model-fitting analyses.
3.3.2 Values versus choice preference
On each trial, a decision between two choices is made. Specifically, the participant made a decision in state 1, between choices 1 and 2. This process is modeled by each potential action having some latent value $Q_{av}(state, c_{i,t})$. Thus, the choice on trial $t+1$, $P(c_{t+1})$, was modeled via a comparison of two values: $Q_{av}(1,c_{1,t})$ and $Q_{av}(1,c_{2,t})$, with $av$ representing action value. A softmax function (Equation 3.1) transforms the linear difference between these two values into a probabilistic value $\in (0,1)$, representing the choice probability on trial $t+1$. As the difference between the two choice values increases, so does the probability that the participant will select the choice that is currently estimated by the model to be of greater value (Figure 3.4).
\[
P_{i,t+1} = \frac{\exp\left(Q_{av}(1,c_{i,t})\right)}{\exp\left(Q_{av}(1,c_{1,t})\right) + \exp\left(Q_{av}(1,c_{2,t})\right)} \tag{3.1}
\]
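To make the softmax rule concrete, the following is a minimal Python sketch (illustrative only; the function name and example values are mine, not part of the fitted model code):

```python
import numpy as np

def choice_probabilities(q_values):
    """Softmax over the two first-stage action values (Equation 3.1).

    q_values: array of Q_av(1, c) for the two first-stage choices.
    Returns the probability of selecting each choice on trial t+1.
    """
    # Subtract the max for numerical stability; this does not change the result.
    shifted = q_values - np.max(q_values)
    exp_q = np.exp(shifted)
    return exp_q / exp_q.sum()

# Example: a value difference of 0.8 in favor of choice 1
print(choice_probabilities(np.array([0.8, 0.0])))  # approximately [0.69, 0.31]
```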
This is the foundation of our choice model. We fit the model based on the observed choice data (choice on $t+1$) and their past experiences, which together are encoded into the two continuously updated action values. The parameter estimation process tries to align the difference between $Q_{av}(1,1_t)$ and $Q_{av}(1,2_t)$ with choice on $t+1$. Ultimately, the model is tested on the predictive performance of the parameters given the data [128].
Assuming trial outcome is standardized and center transformed (e.g., transformed to $\in \{-1,+1\}$), it is reasonable to assume that on trial 0 (in which trial 1 is predicted), the participant begins with a value of 0 for both choices. The resulting decision is a 50% - 50% random choice between the two potential actions. But from then on, experience within the task causes the subject to preferentially select one option over the other. This experience-driven choice preference is modeled via changes in $Q_{av}(1,1)$ and $Q_{av}(1,2)$, with these updates being refined by free parameters representing subject-/group-level variations as to how experience influences choice preference.
We now have a probabilistic model for estimating a model’s predictive ability. On any
decision, there are two potential actions. We assume that if the participant believes one
option to be of greater value than the other, then they are more likely to select that option
(i.e., they can act upon their beliefs). Through variations in responding, we can disentangle
learning and strategy differences between subjects.
3.3.3 Prediction error
Following experience, the initial values of 0 for $Q_{av}(1,1)$ and $Q_{av}(1,2)$ change according to the reinforcement learning model. RL describes the learning process via prediction error theory [52][97]. In this sense, presumed value is compared to what actually happened – in this case, trial outcome (Formula 3.2).
\[
\delta = r - Q \tag{3.2}
\]
According to this theory, learning happens when our expectations are not met. We have
some internal belief about the external world and via interaction (sampling) we modulate
this value representation to account for new, unexpected outcomes.
However, this value comparison differs between model-free and model-based valuation.
In model-free RL, first-stage action value is directly influenced by trial outcome. Whereas,
in model-based RL, trial outcome influences the value representation of the trial’s acquired
Figure 3.5: When we examine the subjects fitted as using only model-free processing, we see that the probability of staying with the same choice is clearly affected by whether the trial was rewarded or punished. There is no obvious difference in stay probability following common versus rare transition trials.
second-stage state – i.e., trial outcome directly affects the state value of the alien that was
seen, rather than the action selected.
3.3.4 Model-free reinforcement learning
Model-free learning uses a cached-learning process that updates first-stage action value via
a direct comparison with trial outcome (Equation 3.4).
\[
\delta_{mf}(1,t) = r_t - Q_{mf}(1,c_t) \tag{3.3}
\]
\[
Q_{mf}(1,c_{t+1}) = Q_{mf}(1,c_t) + \alpha \times \delta_{mf}(1,t) \tag{3.4}
\]
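A minimal sketch of this cached update, assuming outcomes have already been standardized and choices are indexed 0/1 (names and example values are illustrative):

```python
import numpy as np

def model_free_update(q_mf, choice, reward, alpha):
    """Model-free update of the chosen first-stage action (Equations 3.3-3.4).

    q_mf   : array of cached values Q_mf(1, c) for the two choices
    choice : index of the selected first-stage action (0 or 1)
    reward : standardized trial outcome r_t
    alpha  : learning rate
    """
    q_mf = q_mf.copy()
    delta = reward - q_mf[choice]          # prediction error (Eq. 3.3)
    q_mf[choice] += alpha * delta          # cached-value update (Eq. 3.4)
    return q_mf

# A rewarded trial pulls the chosen action's value toward the outcome.
print(model_free_update(np.array([0.0, 0.0]), choice=0, reward=1.0, alpha=0.5))
```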
Using this form of learning, positive prediction error will cause an increase in the prob-
ability of staying with the same choice, whereas negative prediction error will result in de-
creased stay probability. An easy way to visualize this process is by examining the stay-switch
proportions of our model-free subjects (Figure 3.5).
As the actual participant stay-switch behavior reveals, those identified as using exclusively model-free valuation (here with model-based weight $\omega < 0.25$) show a significant effect for reward on stay probability – i.e., an increased probability to stay with the same choice following reward compared to punishment, regardless of transition.
Yet, the first-stage choice does not directly affect the trial’s reward, rather the selected
action affects the probability as to which second-stage state will appear, and this acquired
second-stage state probabilistically influences trial outcome. Despite this task contingency,
model-free learning is modeled without reference to the second-stage state, and the first-
stage decision only requires consideration of previous actions and their resulting conse-
quences.
3.3.5 Model-based reinforcement learning
In comparison, model-based learning tracks the value of the two second-stage states and
relays this information to first-stage action values by also tracking each choice’s transitional
probabilities – i.e., the probability of each location to transition to each alien (Equation 3.5).
\[
Q_{mb}(1,i) = P(s_{2a} \mid i) \times Q(s_{2a}) + P(s_{2b} \mid i) \times Q(s_{2b}) \tag{3.5}
\]
There are various ways to encode the learning of the transitional probability matrix (i.e., $P(s_2 \mid i)$). In the original formulation, model-based participants were assumed to believe that there was a 70% / 30% common/rare probability split for the two first-stage choices. However, which choice led to which second-stage state was learned via experience by tracking the number of transitions from each choice to each second-stage state.
\[
P(s_{2a} \mid i)_{t+1} = (1 - \alpha_{trans}) \times P(s_{2a} \mid i)_t + \alpha_{trans} \tag{3.6}
\]
Another way we can model this is via a transitional probability learning algorithm (Equation 3.6). This has been further examined by Akam and colleagues [3]. Ultimately, a more complex model that contains the learning of the transitional probability matrix is viable if the task was constructed in such a way as to bring out differential behavior; an example would be to have the transitional properties change throughout the testing block [70]. Given that the transitional learning rate is usually quite low (both are task dependent, but $\alpha_{trans}$ is typically 0.05 to 0.15, whereas $\alpha$ can be found between 0.40 and 0.90) and only appli-
Figure 3.6: Examining the stay probability of our purely model-based
subjects, we see a clear outcome:transition interaction. Stay probability
is increased either when there is a common-transitioned reward or rare-
transitioned punishment.
cable if the participant is utilizing some model-based processing, it is understandable how
this parameter is poorly estimated – with an exception being if the task was specifically de-
signed to bring out this type of differential behavior.
In our models, the learning of the transitional probability matrix was a Bayesian updating process that determined, for each first-stage action, the associated probabilities of transitioning to either of the second-stage states following that action's selection. Thus, one symbol could be represented, e.g., by a 70%-30% split, and the other by a 25%-75% split.
Model-based valuation utilizes the transitional matrix and learned second-stage state
values to determine which action is better. Here, better is the action that is more likely to
transition to the alien which the participant believes is currently more rewarding (Formula
3.7). Ultimately, this produces a stay-switch pattern that is guided by the interaction between
previous trial transition and previous trial outcome (Figure 3.6).
\[
Q_{mb}(1,a) = P(s_{2a} \mid a) \times Q(s_{2a},1) + P(s_{2b} \mid a) \times Q(s_{2b},1) \tag{3.7}
\]
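As an illustration of how the transition matrix and second-stage values combine, the following sketch assumes a fixed 75%/25% transition structure and illustrative state values (not estimates from our data):

```python
import numpy as np

def model_based_values(trans_probs, q_stage2):
    """Model-based first-stage values (Equations 3.5 / 3.7).

    trans_probs : 2x2 matrix, trans_probs[a, s] = P(second-stage state s | action a)
    q_stage2    : learned values of the two second-stage states (aliens)
    Returns Q_mb(1, a) for both first-stage actions.
    """
    return trans_probs @ q_stage2

# Action 0 commonly (75%) leads to the currently better alien, so it gets the higher value.
trans = np.array([[0.75, 0.25],
                  [0.25, 0.75]])
print(model_based_values(trans, np.array([0.6, -0.2])))  # [0.4, 0.0]
```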
On a fundamental level, this suggests that model-free and model-based valuation are performing two very different processes. It is true that model-free and model-based valuations are usually quite similar to one another – i.e., the valuation policies do not strongly differ following common transitions, which account for approximately 75% of choices. Yet, despite frequently selecting the same action, they do so in different ways. Model-free valuation does not require tracking of second-stage state values, only a memory of whether an action was previously rewarding or punishing. Whereas, model-based valuation considers the con-
struction of the task environment (a model of the task’s action-outcome contingencies) and
tracks second-stage state values in a behavioral policy that calculates the best action given
their current understanding of the task, rather than only what was directly experienced in
the past.
3.3.6 Hybrid reinforcement learning
The hybrid reinforcement learning model allows for mixed valuation schemes by using a weighting term (model-based weight, $\omega \in [0,1]$) that determines the balance of model-free and model-based valuation learning.
\[
Q_{lv} = (1 - \omega) \times Q_{mf} + \omega \times Q_{mb} \tag{3.8}
\]
With this formulation, values nearing 0 force the value representation to be dominated by $Q_{mf}$, thereby representing complete model-free valuation, whereas values nearing 1 represent an action policy that only considers model-based valuation learning. Interestingly, values near 0.5 represent a consideration of both model-free and model-based processes (Formula 3.8).
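A minimal sketch of the weighting in Equation 3.8, using arbitrary illustrative action values:

```python
import numpy as np

def hybrid_values(q_mf, q_mb, w):
    """Hybrid valuation (Equation 3.8): a weighted mix of the two learners.

    w = 0 gives pure model-free valuation, w = 1 pure model-based,
    and intermediate values mix the two value representations.
    """
    return (1.0 - w) * q_mf + w * q_mb

q_mf = np.array([0.5, -0.1])   # cached model-free action values (illustrative)
q_mb = np.array([0.1, 0.4])    # planned model-based action values (illustrative)
for w in (0.0, 0.5, 1.0):
    print(w, hybrid_values(q_mf, q_mb, w))
```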
Mixed (hybrid) valuation is marked with both a significant effect of reward and reward × transition on stay probability. The hybrid learner shows model-based tendencies – i.e., increased stay probability following rare punishment trials in comparison to common punishment trials, and decreased stay following rare rewarded trials in comparison to common rewarded trials. However, they also show model-free tendencies. In this case, stay probability following rare rewards is often equivalent to stay probability following rare punishment trials, whereas in complete model-based dominance, rare rewarded trials have a stay probability nearly equivalent to that of common punishment trials (Figure 3.7).
[Panels: All participants, Model-free, Mixed, Model-based; stay percentage by outcome (punishment/reward) and transition (common/rare).]
Figure 3.7: Mixed valuation is marked by a combination of model-free and
model-based valuation. This action policy is characterized by the presence
of an outcome:transition effect, along with an effect for outcome. Specifi-
cally, we can see how rare transition stay probability shifts between valuation
schemes.
Furthermore, we find a difference between hybrid learner stay probability
compared to the full experimental sample. When examining the full cohort,
we see a stay probability that best matches mixed valuation; however, there
is a difference in how rare rewarded compared to rare punishment trials af-
fected stay behavior.
3.3.7 Choice selection
So far, we have considered how different learning strategies encode trial outcome differentially, thereby influencing first-stage action values. To encode choice consistency (or behavior consistent with reinforcement learning) we can multiply this learned value difference by a reinforcement learning inverse temperature parameter, $\beta_{rl}$.
As mentioned, the softmax equation transforms a linear distance between values into
a probabilistic one. Therefore, increases to the inverse temperature parameter cause the
learned difference to be magnified, whereas decreases cause the difference between action
values to shrink in magnitude (Equation 3.10).
\[
Q_{av} = \beta_{rl} \times Q_{lv} \tag{3.9}
\]
\[
\pi_{1,a} = \frac{\exp\left(Q_{av}(1,a)\right)}{\exp\left(Q_{av}(1,1)\right) + \exp\left(Q_{av}(1,2)\right)} \tag{3.10}
\]
[Plot: exploration percentage versus absolute value difference for high, average, and low inverse temperature.]
Figure 3.8: The inverse temperature parameter can be thought of as a measure of choice consistency. The inverse temperature parameter scales the distance between $Q_{lv}(1,c_{1,t})$ and $Q_{lv}(1,c_{2,t})$. Therefore, increased inverse temperature represents that the participant's choices were more consistent with their estimated learning strategy. Participants with lower inverse temperature show greater stochasticity when making choices, denoted here as increased exploration percentage.
However, the inverse temperature parameter does not change which action is currently estimated to be of greater value. Parameters such as learning rate and model-based weight change the type of strategy used by the participant. In comparison, the RL inverse temperature parameter magnifies the probability that they will make the choice consistent with the valuation strategy they are utilizing (Figure 3.8).
Increases to this parameter lead to more deterministic decision-making. Whereas, de-
creases represent increased choice stochasticity, potentially representing an inability to align
behavior with how that participant is believed to have interpreted the task (i.e., despite being
fit with a specific strategy of responding, their choices were still quite random).
3.4 Other parameters of interest
3.4.1 Forgetting to default
Typically, reinforcement learning consists of experience-driven updating – i.e., changes in
the selected choice or acquired second-stage state via some actualized result (e.g., trial out-
come). However, forgetting has been found to be an important element in this type of task
[84][122].
Forgetting (i.e., resetting of learned values) can be modeled by having the unselected choice ($\neg c_{1,t}$) and unsampled second-stage state ($\neg s_{2,t}$) default back to some reference point ($\mu$) with some forgetting rate, $\alpha_F$.
\[
Q_{mf}(1,\neg c_{1,t}) = (1 - \alpha_F) \times Q_{mf}(1,\neg c_{1,t}) \tag{3.11}
\]
\[
Q(\neg s_{2,t},1) = (1 - \alpha_F) \times Q(\neg s_{2,t},1) \tag{3.12}
\]
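A sketch of the forgetting step, assuming the reference point is 0 (as in Equations 3.11-3.12) and using illustrative values:

```python
import numpy as np

def apply_forgetting(q_mf, q_stage2, choice, state, alpha_f):
    """Decay the unchosen action and unvisited second-stage state (Eqs. 3.11-3.12).

    Values drift back toward the initial reference point of 0 at rate alpha_f,
    while the chosen action / visited state are updated by the usual RL rules.
    """
    q_mf, q_stage2 = q_mf.copy(), q_stage2.copy()
    unchosen = 1 - choice
    unvisited = 1 - state
    q_mf[unchosen] *= (1.0 - alpha_f)
    q_stage2[unvisited] *= (1.0 - alpha_f)
    return q_mf, q_stage2

print(apply_forgetting(np.array([0.8, 0.4]), np.array([0.6, -0.3]),
                       choice=0, state=1, alpha_f=0.2))
```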
Toyama, Katahira, and Ohira [122] have examined the effects of having a forgetting rate equal to the learning rate, whereas Otto and colleagues [84] have implemented a forgetting rate equal to 1 minus the learning rate. We have found the forgetting rate to be significantly reduced compared to the remembrance rate. This signifies that the participant more readily updates their selected choice based on the trial outcome, compared to how the unselected choice is updated via the default value $\mu$.
More accurately, by examining the stay-switch behavior we see that forgetting increases the influence of the last trial, especially concerning the contrast between stay and switch decisions. That is, for the model-free subjects, with increased $\alpha_F$ we see an increase in the difference between stay probability following rewarded (stay) compared to punishment (switch) trials.
On a related note, as $\alpha_F$ increases, the difference between 1-back model-based stay and switch probabilities also increases. More specifically, as forgetting increases, the model-based learner more strongly considers the events of the previous trial, causing a stronger 1-back outcome:transition effect.
Finally, hybrid valuation shows a mixture of both effects. For these participants, increased forgetting led to decreased staying, except following commonly rewarded trials, perhaps relating to previous accounts of the innate difference in encoding positive versus negative prediction error, or punishment versus reward learning.
[Panels: low versus high forgetting for model-free, hybrid, and model-based subjects; stay percentage by outcome (punishment/no change/reward) and transition (common/rare).]
Figure 3.9: High forgetting increases the effect of the previous trial on stay probability given the indi-
vidual’s valuation policy.
3.4.2 Perseveration
Additional tendencies or strategies can be added to this learned value. For example, human
choices are usually perseverative [98][126] – i.e., they show continued selection of the same
choice, regardless of predictions made through value-learning.
\[
Q_{tv}(1,c_t) = Q_{av}(1,c_t) + ps \tag{3.13}
\]
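A sketch of the perseveration bonus from Equation 3.13; the values are illustrative:

```python
import numpy as np

def add_perseveration(q_av, previous_choice, ps):
    """Add the perseveration bonus to the previously selected action (Equation 3.13).

    ps > 0 biases the model toward repeating the last choice; ps < 0 biases
    toward switching. The bonus is reapplied each trial rather than learned.
    """
    q_tv = q_av.copy()
    q_tv[previous_choice] += ps
    return q_tv

print(add_perseveration(np.array([0.2, 0.3]), previous_choice=0, ps=0.4))  # [0.6, 0.3]
```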
The perseveration parameter varies freely, with positive values representing increased staying with the previously made choice, whereas negative values represent a tendency to switch choices. Furthermore, this value, $ps$, is not a learned value per se; i.e., after each trial the additional value given to the previously selected choice is reset, and the perseveration parameter applies to whichever choice they selected on the most recent trial.
Similar to the forgetting parameter, we see that perseveration modulates stay probabil-
ity, while still maintaining characteristics of model-free and model-based valuation (Figure
[Panels: low, normal, and high perseveration for all participants, model-free, hybrid, and model-based subjects; stay percentage by outcome (punishment/no change/reward) and transition (common/rare).]
Figure 3.10: Perseveration was coded as an increased probability of staying with the previously selected choice. Yet because this parameter combines with reinforcement learning, we see that perseveration had the greatest effect on trials that did not have a common-transition reward. For example, in model-free subjects, we see that increased perseveration caused the largest increase in stay probability following no-change and punishment trials. We see similar trends in both hybrid and model-based participants.
3.10).
Perseveration has a prominent effect on trials in which the subject received neither re-
ward nor punishment. As the parameter increases, so does the probability that the partici-
pant will stay following a trial with no change.
3.5 Useful Parameterizations
The model as presented above allows us to model subjects’ decision-making tendencies. Yet,
we wish to do more than only model subjects’ choices; we wish to do so with parameter
values that are comparable between subjects and facilitate further hypothesis testing.
This may sound puzzling at first: why would parameters not be comparable between subjects if the same model was used? And what types of modifications must be made so that parameters are comparable?
In one sense, if parameters are intrinsically correlated, then we will be unable to reliably trust our parameter estimates. For example, imagine that as $\omega$ (model-based weight) increased from 0, so too did the difference between $Q_{av}(1,1)$ and $Q_{av}(1,2)$, regardless of previous trial transition history. Keep in mind that the purpose of the $\omega$ parameter is to provide an estimate of the ratio of model-based to model-free valuation, whereas a parameter affecting the scaling between $Q_{av}(1,1)$ and $Q_{av}(1,2)$ would affect choice consistency (i.e., stochasticity of decisions).
This correlation will not cause a problem in estimating any single subject's data; i.e., the model-fitting procedure will find a value of $\omega$ and $\beta_{rl}$ such that it maximizes the log-likelihood of the choice data given the (incorrectly) designed model. Given that, in this example, model-based weight is positively correlated with inverse temperature, those with increased model-based weight would be estimated with reduced $\beta_{rl}$ compared to if they responded with the same amount of choice consistency but with less model-based consideration. As $\omega$ increases, it functionally accounts for some of the role that is meant to be captured by the $\beta_{rl}$ parameter; thus, for those subjects $\beta_{rl}$ is reduced compared to what we would expect for an estimate representing the process of "goal-directed decision-making."
As we will later discuss, this problem as formulated will not affect the ability to produce free parameter estimates for any one subject. Rather, it would cause a difference in how different parameter values compare between subjects. For example, we may find that those with increased model-based decision-making have reduced reinforcement learning inverse temperature compared to those that utilized primarily model-free decision-making. Yet that would be an artifact of how $\omega$ interacted with $\beta_{rl}$, in this case stemming from a miscoded inherent correlation between parameters.
Given this information, there are a number of parameterizations to consider when mod-
eling this task. Some of these modifications may seem small or inconsequential, but just as
there is some truth within the acquired data, there is some truth to properly representing
the model that quantifies our beliefs concerning value-based learning. As we improve the model by investigating ambiguities from past analyses, so too do we move closer to this truth hidden within the data. Even if the new model is worse than the previous model, that too says something about the direction we are heading in our analysis.
3.5.1 The categorical distribution and reinforcement strategy processes
Previously, we examined how model-based weight modulated the consideration of model-free versus model-based valuation, specifically by weighing the value estimates of both processes ($Q_{mf}$ and $Q_{mb}$) by $(1-\omega)$ and $\omega$, respectively, to produce a weighted value representation $Q_{lv}$ for both choices. And the $\beta$-scaled difference between the two action values $Q_{av}(1,1)$ and $Q_{av}(1,2)$ reflects how strongly the choices align with the fitted RL model.
Another way we can think of the decision process is not as a comparison between two learned action values, but rather as multiple sources affecting an overall value representation of each choice. Here we instead consider choice consistency relating to the different valuation strategies, rather than choice consistency relating to the overall reinforcement learning valuation strategy.
\[
\omega = \frac{\beta_{mb}}{\beta_{rl}} \tag{3.14}
\]
\[
\beta_{rl} = \beta_{mf} + \beta_{mb} \tag{3.15}
\]
Here, reinforcement learning choice consistency is instead broken up into its model-free and model-based inputs (Equation 3.15). We refer to these terms as model-free ($\beta_{mf}$) and model-based ($\beta_{mb}$) processing, whereas model-based weighting ($\omega$) is the ratio of model-based processing to reinforcement learning choice consistency (Equation 3.14).
\[
P_{1,a} = \frac{\exp\left(\beta_{mf} \times Q_{mf}(1,a) + \beta_{mb} \times Q_{mb}(1,a)\right)}{\sum_{i=1}^{2} \exp\left(\beta_{mf} \times Q_{mf}(1,i) + \beta_{mb} \times Q_{mb}(1,i)\right)} \tag{3.16}
\]
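A sketch of the expanded softmax in Equation 3.16, with illustrative $\beta$ values, showing how the implied model-based weight can be recovered from the two processing terms:

```python
import numpy as np

def choice_prob_two_betas(q_mf, q_mb, beta_mf, beta_mb):
    """Choice probabilities under the beta_mf / beta_mb parameterization (Eq. 3.16).

    Instead of a single inverse temperature scaling a w-weighted value,
    each valuation system contributes its own weighted input to the softmax.
    """
    logits = beta_mf * q_mf + beta_mb * q_mb
    logits = logits - logits.max()              # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Equivalent quantities from this fit: w = beta_mb / (beta_mf + beta_mb)
beta_mf, beta_mb = 1.0, 3.0
print(choice_prob_two_betas(np.array([0.5, -0.1]), np.array([0.1, 0.4]), beta_mf, beta_mb))
print("implied model-based weight:", beta_mb / (beta_mf + beta_mb))
```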
In relation to the softmax equation, we see that it is now expanded, allowing for easy visualization of the multiple sources of value that go into each action value representation (Equation 3.16). While this does not affect value-estimation per se, we will see later that when linking subject-level and group-level hierarchical parameters, it is easier to estimate the model via model-free and model-based processing compared to a model utilizing reinforcement learning choice consistency and model-based weighting. One reason for this is that model-based weighting is a beta-distributed parameter $\in (0,1)$, thereby complicating the estimation process, especially in repeated-measures experimental designs [81].
3.5.2 Learning rate
Previously, value updating was modeled in accordance with prediction error theory as:
\[
\delta = r - Q \tag{3.17}
\]
\[
Q = Q + \alpha \times \delta \tag{3.18}
\]
Expanded out, this produces:
\[
Q_{1,c_t} = (1-\alpha) \times Q_{1,c_t} + \alpha \times r_t \tag{3.19}
\]
However, with this formulation, trial outcome ($r$) is weighed by $\alpha$ and the previous value estimate ($Q$) by $1-\alpha$. This causes learning rate to be intrinsically linked with choice consistency, i.e., for a given inverse temperature, as learning rate decreases, so too does the difference between, e.g., $Q_{mf}(1,1)$ and $Q_{mf}(1,2)$, by a factor of $\frac{1}{\alpha}$.
Ultimately, we found that this caused participants with extremely low choice consistency (i.e., the participants who were most likely guessing randomly) not to be fitted with a low value of $\beta$. Rather, due to this intrinsic correlation, those participants were fit with an extremely low learning rate and high inverse temperature (e.g., $\alpha$ = 0.05, $\beta$ = 10).
We can see this effect quite simply. Imagine the value estimate at the start of the experiment for all subjects, i.e., $Q(1,1) = Q(1,2) = 0$. Now imagine that the participant was rewarded +1.0 (i.e., +3 points out of a maximum of 3 points). Using the previous model, a subject with an $\alpha$ = 0.05 would have a resulting $Q(1,1) - Q(1,2)$ difference equaling 0.05, whereas a subject with an $\alpha$ = 1.00 would have a value difference equaling 1.00.
This causes an extreme correlation between learning rate and the $Q$-value difference. Essentially, as learning rate decreases, the participant is given more flexibility in their choice consistency, given the same value of $\beta_{rl}$.
We can eliminate the effect learning rate has on choice consistency by reformulating [84][48] the prediction update as:
\[
Q_{1,c_t} = (1-\alpha) \times Q_{1,c_t} + r_t \tag{3.20}
\]
The principal change is that trial outcome, $r_t$, is no longer weighed by learning rate. Now, regardless of learning rate, value is increased by $r_t$, whereas the amount remembered is weighed by $1-\alpha$. These two values are summed together to produce the updated value representation $Q_{t+1}$.
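The following sketch contrasts the two update rules (Equations 3.19 and 3.20) with illustrative values, showing how the standard rule ties the resulting value difference to the learning rate:

```python
def update_weighted_outcome(q, r, alpha):
    """Standard delta-rule update (Equation 3.19): outcome weighed by alpha."""
    return (1.0 - alpha) * q + alpha * r

def update_unweighted_outcome(q, r, alpha):
    """Reformulated update (Equation 3.20): outcome enters at full strength."""
    return (1.0 - alpha) * q + r

# After a single +1 outcome from Q = 0, the standard rule leaves a value
# difference of alpha, confounding learning rate with choice consistency;
# the reformulated rule leaves a difference of 1 regardless of alpha.
for alpha in (0.05, 1.0):
    print(alpha, update_weighted_outcome(0.0, 1.0, alpha),
          update_unweighted_outcome(0.0, 1.0, alpha))
```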
Chapter 4
Parameter estimation
4.1 Introduction
The previous chapter examined how the hybrid reinforcement learning model flexibly en-
codes previous task experience thereby allowing variation between participants. In this sec-
tion, we continue forward with the process of estimating these latent processes given some
set of behavioral data – i.e., subject, trial, choice, second-stage state, outcome, and next
choice.
We begin with the principles found in maximum likelihood estimation and move forward to hierarchical models containing group-level (mixed) effects, ultimately progressing to a completely hierarchical Bayesian approach that fully approximates both subject-level and group-level parameters of interest.
In accord with the previous section, the parameter estimation process has a goal: to estimate parameters relating to the subjects' and groups' latent behavioral characteristics in a way that facilitates further comparisons. In a way, we start from ambiguity; for example, in a completely uninformed state, subject 1's model-based weight could be anywhere between 0 and 1. The acquired data chips away at this ambiguity according to how our model is formulated [114]. With more acquired data, we become more sure of that subject's model-based weight¹. However, different estimation methodologies can result in very different outputs. And these types of analyses (i.e., where a fitted model estimates subject-level parameters that are then used for group comparisons) are extremely susceptible to artifacts that arise during the model-fitting process. Thus, caution needs to be taken when considering the types of inferences that can be made from each type of parameter estimation method.
4.2 Point-estimation
Point-estimation is a type of parameter estimation in which the free parameters are esti-
mated and represented by one numerical value. This value is often the one that maximizes
the given mathematical model. In our case, point-estimation deals with estimating the as-
sumed mental processes previously described via production of one numerical represen-
tation for each parameter. For example, subject 1 could be estimated to have a $\beta^{1}_{mf}$ = 1.2 and $\beta^{1}_{mb}$ = 2.4 because those values produce $Q_{av}(1,1)$ and $Q_{av}(1,2)$ values that (typically) best align with that subject's actual choice behavior on trial $t+1$ during the first block when inputted into the reinforcement learning model, given their task experience (trial, choice, transition, outcome).
There are limitations when using only one value to represent each free parameter. For example, we used the 200 trials of choice behavior to estimate $\beta^{1}_{mf}$ as being 1.2, but does that truly represent our beliefs? How sure are we that $\beta^{1}_{mf} \neq 1.19$ or $\beta^{1}_{mf} \neq 1.21$? If we were to continue with further analysis utilizing these point-estimates (e.g., linear regression on $\beta^{1}_{mf}$), we are stating that there is 0% probability that $\beta^{1}_{mf}$ = 1.19 or 1.21, and that there is 100% probability that the value is 1.2. This affects the ability of the linear regression to find associations hidden in the choice data.
Despite these limitations, basic point-estimation provides us with the tools and back-
bone for more complex designs. Statistical theory is not rewritten because we select a Bayesian
method versus a more simplistic frequentist method of parameter estimation. Rather, dif-
ferent methods respond to the problem of estimating latent parameters uniquely. But these
methodological differences are lessened given that there is some universal truth to the data,
and that these methods are supposed to allow us to peer into these hidden associations ly-
ing within the raw choice data. While a Bayesian analysis may lead to a richer understanding
than one that utilizes only point-estimates, there should exist some consistency between the
various approaches.
¹ Assuming that the subject stays relatively consistent in the way they make choices. This is generally true; however, we do find some slight improvement in choice consistency following the first 20 trials.
The parameter estimation techniques described all follow the same process: maximize the likelihood of the acquired data (choice on trial $t+1$) under some mathematical equation that allows for variability between subjects (the hybrid learning model with its numerous free parameters). The hybrid RL model does not differ between model-fitting approaches; however, the relationships of parameter estimates between subjects and groups can be refined via hierarchical (mixed) modeling approaches.
4.2.1 Maximum likelihood estimation
When fitting a reinforcement learning model to choice data, we are searching for the set of free parameters $\theta_{MLE}$ that maximizes the likelihood of the observed choices – i.e., the free parameters that best align $Q_{tv}(1,1)$ and $Q_{tv}(1,2)$ with the observed choice on trial $t+1$.
\[
\theta_{MLE} = \operatorname*{argmax}_{\theta} P(c_{t+1} \mid \theta) \tag{4.1}
\]
\[
\theta_{MLE} = \operatorname*{argmax}_{\theta} \log\left(P(c_{t+1} \mid \theta)\right) \tag{4.2}
\]
Remember that for each decision there are two values, $Q_{tv}(1,1)$ and $Q_{tv}(1,2)$, that shift according to the model, the current free parameters, and the actual experienced outcomes. In maximum likelihood estimation, we are searching the parameter space for the set of parameters that maximizes choice likelihood (i.e., the set of reinforcement learning parameters that best aligns the softmax decision rule with how the participant actually behaved).
Here, we often work with the natural logarithm of the likelihood because low probabilities risk numerical underflow (Equation 4.2). However, this does not affect the parameter estimation process and is a common principle shared between all model-fitting techniques.
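As an illustration of the search in Equation 4.2, the following sketch fits a deliberately simplified, model-free-only likelihood to randomly generated placeholder data using scipy; it is not the full hybrid model or our actual fitting code:

```python
import numpy as np
from scipy.optimize import minimize

def negative_log_likelihood(params, choices, rewards):
    """Negative log-likelihood of first-stage choices under a simplified,
    model-free-only stand-in for the hybrid model (illustrative only)."""
    alpha, beta = params
    q = np.zeros(2)
    nll = 0.0
    for c, r in zip(choices, rewards):
        logits = beta * q
        logp = logits - np.logaddexp(logits[0], logits[1])   # log softmax
        nll -= logp[c]
        q[c] = (1.0 - alpha) * q[c] + r                       # Eq. 3.20-style update
    return nll

# Placeholder data: choices coded 0/1 and standardized outcomes in [-1, 1].
rng = np.random.default_rng(0)
choices = rng.integers(0, 2, size=175)
rewards = rng.uniform(-1, 1, size=175)

fit = minimize(negative_log_likelihood, x0=[0.5, 1.0],
               args=(choices, rewards), bounds=[(0.0, 1.0), (0.0, 20.0)])
print(fit.x, fit.fun)   # theta_MLE and the minimized negative log-likelihood
```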
During this process, a number of other quantities can also be estimated. For example, after finding the set of parameters that maximize the log-likelihood of the data, the log-likelihood itself is an important metric. The log-likelihood is used for model comparison to determine if a newly fitted model actually performs better than some previously fit model.
Although a parameter may be identified by the model as being significant, we must further test its necessity for explaining the observed outcomes. For example, a more complex model may have an additional term which is estimated to be significantly different from 0, whereas in the simpler model this parameter is equal to zero. Just because the model was fit with this additional term, and it was found to be significantly different from 0, does not mean that the parameter itself is required for explaining the data. To do so, we must actually compare the posterior predictive performance of both models [129]. If in fact the more complex model explains more of the acquired data (despite controlled reductions for the additional free parameter [128]), then there is increased evidence to prefer that model.
Furthermore, we can also estimate a Hessian matrix for the free parameters to provide the standard error of the estimated maximized parameters. In a way, this displays our confidence in the parameters of interest. In accordance with what was previously discussed, for a participant estimated to have a $\beta^{1}_{mf}$ = 1.2, we now have a way to express our confidence in that estimate. But incorporating the standard error estimate into further analyses (e.g., post-hoc regression) is not well addressed, nor is it commonly done.
However, researchers often cite difficulties with MLE parameter estimation – estimates
are often described as noisy [123], there are issues with modeling non-normality [39], and
estimation is strongly affected by outliers [63]. On the one hand, MLE and model-fitting
itself is a lossy process: a dataset (consisting of many trials with choice, second-stage state,
outcome, and next choice) is translated to 4-6 free parameters. More specifically in the case
of frequentist approaches, 4-6 values representing 4-6 different aspects of behavior.
By reducing the quantity of information, we reduce some of the informativeness pertaining to our current understanding of the data. For example, let's say we perform MLE on all 52 subjects, thereby producing 1 estimate per mental process per subject, and this 1 value is the value that maximizes that subject's log-likelihood. Now let's say we ask whether subject 2 has significantly increased model-based processing compared to subject 1. Using point-estimation, we would have two values, subject 1's $\beta_{mb}$ and subject 2's $\beta_{mb}$, but what then defines a significant difference in model-based processing? A value greater than 1.0? 2.0? Some analysis that also includes the standard error estimates?
In a way, despite using the same model to fit subject 1's and subject 2's task experience, we are leaving something out of the model, i.e., the relationship between subject 1 and subject 2. In this case, both are individuals from the same population, and this population may have some latent characteristics pertaining to the fitted choice model. Furthermore, the hypotheses we wish to answer usually involve the characteristics of the latent population, and not of the sampled participants.
This also targets another issue, the effect of outlier fits. As we have discussed, we are using a limited number of trials to estimate latent behavioral processes. Let's say a subject was fit with a rather large estimate for $\beta^{1}_{mf}$, e.g., $\beta^{1}_{mf}$ = 6.0. This large estimate is the value that maximizes the log-likelihood for their 175-choice task experience, but it is vastly greater than the rest of the subjects. In fact, we would argue that this high estimate is an artifact of potentially not having enough trials. And ideally, if we were to include beliefs obtained by looking at the other subjects (the rest of the population having $\beta^{1}_{mf}$ ranging from 0.5 to 4.0), we might want to change the estimated value to be slightly lower (e.g., $\beta^{1}_{mf}$ = 4.5).
These topics are examined by further expanding the maximized parameter estimation
process to include relationships pertaining to the experimental design and tested group
comparisons.
4.3 Hierarchical pooling
There is a relationship between subjects that is not modeled with MLE. Here, we describe
MLE as an estimation process that searches for the set of free parameters that maximizes the
likelihood of a subject’s acquired data given only the RL valuation model.
As previously described, when model-fitting we are using a limited number of trials to
produce numerical estimates representing the usage of some mental process. Maximum
likelihood only considers the subject’s next choice in estimating the best fitting parameters:
\[
\theta_{MLE} = \operatorname*{argmax}_{\theta} P(c_{t+1} \mid \theta) \tag{4.3}
\]
But as mentioned, the subjects represent samples of a population, e.g., they are all college-
aged and participants in the study. And we wish for our hypotheses to typically measure
truths concerning the population and not just something relating to that specific sample.
In hierarchical pooling or mixed modeling, the equation that we try to maximize not only considers choice on $t+1$, but also the relationship that the free parameters have with one another. For example, each subject could have their own estimate of $\beta^{1}_{mf}$ (as in the MLE model), but there could be an additional parameter relating the subject-level estimates to, e.g., a normal distribution that is representative of beliefs concerning the population's model-free processing.
In doing this, we estimate the parameters from two sources: the choice likelihood and
the hierarchical (prior) likelihood. Thus, the outputted maximized values between subjects
all share something alongside the fact that they use the same flexible reinforcement learn-
ing model. Specifically, they share a relationship with the defined hierarchical distribution,
relating subject-level estimates to beliefs held concerning our experimental design (e.g., the
latent population or a between-group hierarchical comparison).
4.4 Maximum a posteriori estimation
Maximum a posteriori estimation (MAP) can be viewed as an extension of MLE. Bayes' rule mathematically describes how prior beliefs and the log-likelihood are weighed:
\[
P(\theta \mid D) = \frac{P(D \mid \theta) \times P(\theta)}{P(D)} \tag{4.4}
\]
\[
P(\theta \mid D) \propto P(D \mid \theta) \times P(\theta) \tag{4.5}
\]
The posterior is the product of the choice likelihood and prior beliefs. In MAP estimation,
we replace the likelihood with the posterior estimate:
\[
\theta_{MAP} = \operatorname*{argmax}_{\theta} P(X \mid \theta) P(\theta) \tag{4.6}
\]
\[
\theta_{MAP} = \operatorname*{argmax}_{\theta} \log\left(P(X \mid \theta) \times P(\theta)\right) \tag{4.7}
\]
Therefore, MAP differs from MLE by the inclusion of a prior, causing the parameter estimates to be the maximization of the choice likelihood together with the defined priors. Given that the outcome of MLE is a point-estimate, MAP tries to refine this parameterized compression through mixed modeling, causing the point-estimate to consider both the subject's choices according to the reinforcement learning model and beliefs concerning what we think the estimate should be.
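A sketch of the MAP objective in Equation 4.7, adding a Gaussian log-prior to an arbitrary stand-in likelihood; the prior means and standard deviations shown are hypothetical, not empirical values from our data:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def map_estimate(nll_fn, x0, prior_mu, prior_sd, bounds):
    """MAP point-estimate: minimize the negative log-likelihood plus the
    negative Gaussian log-prior (the negated objective of Equation 4.7).
    `nll_fn` can be any choice negative log-likelihood, e.g., the one
    sketched for MLE above."""
    def objective(params):
        log_prior = norm.logpdf(params, loc=prior_mu, scale=prior_sd).sum()
        return nll_fn(params) - log_prior
    return minimize(objective, x0=x0, bounds=bounds)

# Hypothetical priors, e.g., taken from fits to a separate dataset, and a
# quadratic stand-in likelihood purely for demonstration.
toy_nll = lambda p: 0.5 * np.sum((p - np.array([0.6, 2.5])) ** 2)
fit = map_estimate(toy_nll, x0=[0.5, 1.0],
                   prior_mu=np.array([0.5, 2.0]), prior_sd=np.array([0.2, 1.5]),
                   bounds=[(0.0, 1.0), (0.0, 20.0)])
print(fit.x)
```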
Ultimately, the priors exert less of an effect on the estimation of parameters when the parameters are well identified by the likelihood. Whereas, if the estimates are not resolved through the choice data, then the parameter estimates are more greatly affected by the priors and collapse upon characteristics of the prior distribution (e.g., the mean of the distribution).
One common criticism with Bayesian methods is the selection of appropriate priors. But
maximum likelihood estimation can be thought of as a form of MAP estimation (i.e., one with
uniform priors on the parameters). And even using uniform priors is a decision itself [1], and
often a uniform distribution does not resemble our beliefs, therein producing parameter estimates not in line with our beliefs as analysts.
Now the question becomes what should the priors be set to? Previous studies have used
lightly defined priors to shape the maximized estimates [46][64]. For example, Gershman
examines using maximum likelihood on a separate dataset, D2, to estimate parameter pop-
ulation means and standard deviations, which are then used as the empirical priors for MAP
estimation on D1.
In empirical Bayesian methods, these hierarchical priors are estimated via the fitted subject-level estimates (i.e., the choice likelihood). At the same time, the hierarchical terms exert an influence on the subject-level parameter fits by serving as the prior against which the subject-level fits are weighed.
4.5 Expectation-maximization
The expectation-maximization procedure [58] is an empirical Bayes method in which hierarchical parameters are estimated alongside subject-level parameters and used to influence subjects' free parameter point-estimates. The process contains two steps in an iterative fashion and quickly converges upon a local maximum.
This estimation process specifies a vector of parameters $h$ and begins an iterative process. This process begins by selecting initial values for the Gaussian prior distributions, $\theta$, such that $P(h \mid \theta)$. This is used to find the maximum a posteriori estimate $m_i$ for each subject $i$.
From there, the maximization step of the EM procedure sets the parameters of the prior distribution $\theta$ to their maximum likelihood values given the current set of subject-level parameter estimates.
The hyperparameters $\theta_h$ are estimated by setting the mean $\mu_h$ and the variance $v^2_h$ of the prior distribution to:
\[
\mu^{k} = \frac{\sum_i m^{k}_{i}}{N} \tag{4.8}
\]
\[
(v^{k})^{2} = \frac{1}{N} \sum_i \left[ (m^{k}_{i})^{2} + \Sigma^{k}_{i} \right] - (\mu^{k})^{2} \tag{4.9}
\]
This procedure repeats itself until some convergence criterion is met, e.g., $\theta_{iter} - \theta_{iter-1} \le 10^{-4}$, i.e., until the total difference in estimated values across all the parameters stabilizes.
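A sketch of the maximization step in Equations 4.8-4.9, using hypothetical subject-level MAP estimates and posterior variances:

```python
import numpy as np

def em_hyperparameter_update(m, sigma_diag):
    """One maximization step of the EM procedure (Equations 4.8-4.9).

    m          : N x K matrix of per-subject MAP estimates m_i
    sigma_diag : N x K matrix of per-subject posterior variances (e.g., from
                 the inverse Hessian of each subject's MAP fit)
    Returns the updated prior means and variances for each of the K parameters.
    """
    n = m.shape[0]
    mu = m.mean(axis=0)                                     # Eq. 4.8
    var = (m ** 2 + sigma_diag).sum(axis=0) / n - mu ** 2   # Eq. 4.9
    return mu, var

# Hypothetical subject-level fits for two parameters across five subjects.
m = np.array([[0.4, 1.2], [0.6, 2.0], [0.5, 1.5], [0.7, 2.4], [0.3, 1.1]])
sigma_diag = np.full_like(m, 0.05)
print(em_hyperparameter_update(m, sigma_diag))
```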
Expectation-maximization can be thought of as a form of MAP estimation. However, the distributions that describe the group-level hierarchical priors are not predefined, but continuously estimated from the dataset (in a rather unique way). This ultimately allows estimation of the means and standard deviations for the normal distributions that are used as priors against the choice likelihood.
4.6 Hierarchical Bayesian
Bayesian analysis produces a posterior distribution rather than a single maximized value per estimated parameter [30]. This distribution describes the likelihood of a value being the true parameter estimate given the subject's data, the RL choice model, and the constructed hierarchical model. In this case, the output is, e.g., 4,000 values per estimated parameter, rather than the single value that maximizes the model.
Again, the choice model (choice likelihood) does not change from the frequentist design
– in both cases, the same hybrid RL model is used. Rather, we can design more accurate
and detailed experimental models. This is because of the sampling method used for many
Bayesian analysis tools [120], rather than an analytic gradient function that finds a local max-
imum.
4.6.1 Posterior created via sampling
Bayesian analysis implements full statistical inference via Markov chain Monte Carlo [110][54]. Here, the estimated $\theta$ is the result of a random, iterative process. Each iteration produces a single set of values, specifically one value per estimated parameter. Here, an iteration can be
[Trace plots: bMB1_mu and bMB1_sd across 1,000 iterations for 7 chains.]
Figure 4.1: Bayesian estimation results in a posterior distribution for each parameter, rather than a single maximized value. Typically, multiple chains are run that randomly walk through the parameter space according to the log density.
The shaded region represents a warm-up phase. During this phase, the MCMC process is finding the high-density region of the posterior distribution following the process' initiation. Afterwards, the remaining iterations are utilized for further analyses, assuming the model is identified and the different chains are exploring a similar region of the parameter space.
thought of as a potentially valid estimate given the model and acquired data. At the end of the parameter estimation process, these many sets of parameter values (i.e., from each iteration) are combined into a posterior distribution representing credible values for that parameter, thereby preserving our uncertainty about $\theta$.
MCMC Bayesian statistical techniques utilize a number of randomly initialized chains, similar to how it is recommended to run a point-estimate analysis numerous times. However, with methods that use point-estimates, typically we keep the estimated values of the analysis that results in the maximum likelihood and discard estimates with reduced log-likelihood. Whereas with MCMC, multiple chains are run, and the analyst checks that they are exploring a similar region of the parameter space. We know that within each chain a number of values will be created for each estimated parameter, and by running multiple chains, we check that our model is correctly identifying the high-density region of the posterior from a variety of initialization points.
Running multiple chains makes the process of identifying misidentified models easier than with frequentist approaches. For example, parameters that cause a model to misbehave will often produce posterior distributions with little variation, or unmistakably high Rhat values.
4.6.2 Model comparison
Models often differ by the inclusion of one or more flexible parameters that are encoded as a static baseline value in more basic models. When these flexible parameters are found to be significantly different from an assumed baseline, some researchers take this as a sign that the newly created free parameter is essential in explaining the variance found in the choice data, and thus, that the more complex model is the more accurate representation.
However, to determine if a parameter is significant we must first determine that the more
complex model is a better representation of the data. To do this we test the predictive perfor-
mance of each model by estimating the pointwise out-of-sample prediction accuracy. Tech-
niques such as leave-one-out cross-validation (LOO-CV) [128] and the widely applicable in-
formation criterion (WAIC) [129] measure the posterior predictive performance of the fitted
models [91] with consideration of the number of parameters.
4.7 Estimating the parameters
The focus of many published Bayesian works involve limited data entries, and in such cases,
parameter estimation can require prior information for the iterative MCMC process to prop-
erly lock onto a single posterior-distribution for each of model’s free parameters. However,
as the number of samples increase typically so too does the identifiability of the model’s pa-
rameters.
Specifically, our experiments have focused on 30-50 subjects per group, repeated mea-
sures (precondition-postcondition), with 175+ trials per block. We have found both subject-
and group- level estimates to be extremely well-identified with this construction. However,
we find that a far lesser number can of both subjects and trials will still allow for proper model
identification. Rather, we stress the importance of model parameterization in the estimation
of these parameters.
Because of this identifiability, we use uniform prior distributions on all model pa-
rameters. Despite having no prior beliefs on subject- and group-level parameters, we do
use a hierarchical construction during the estimation process. Specifically, our hierarchical
(mixed) model focuses on using the parameters of the reinforcement learning model in the
estimation of a repeated measures experimental design.
4.7.1 Precondition-postcondition fitting
For our experiments, we were mostly concerned with how some experimental manipulation changed behavior, and this was examined by running the same participants in the RL task both before and after the experimental condition.
There are a number of ways to design this change. For example, subjects could begin with some free parameter that changes with block, e.g., $ps^2_{sub} = ps^1_{sub} + ps^2_{change}$, and thus there is some $\mu_{ps^1}$ and $\mu_{ps^2_{change}}$. However, with this construction there is now a correlation between parameters and their effect on the estimation of the choice data. That is, an increase in the tendency to repeat selections in the postcondition can be caused by either an increase of $ps^1$ or of $ps^2_{change}$.
\[
\mathcal{N}(y \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^{k}}}\,\frac{1}{\sqrt{|\Sigma|}}\,\exp\!\left(-\frac{1}{2}(y-\mu)^{\top}\Sigma^{-1}(y-\mu)\right) \tag{4.10}
\]
\[
\mu \in \mathbb{R}^{k} \tag{4.11}
\]
\[
\Sigma \in \mathbb{R}^{k \times k} \tag{4.12}
\]
Rather than using change parameters, we model the hierarchical distribution such that each unique process has its own free parameter, i.e., $ps^1_{sub}$ uniquely affects choice perseveration in the precondition and $ps^2_{sub}$ does so in the postcondition. However, we relate the parameters via a multivariate normal distribution. The multivariate normal distribution is the extension of the normal distribution in which the correlations between the normally distributed variates are also estimated.
\[
\mathcal{N}(y \mid \mu, LL^{\top}) = \mathcal{N}(y \mid \mu, \Sigma) \tag{4.13}
\]
Similar to the normal distribution, the estimation process of the multivariate normal distribution includes parameters relating to the distribution's mean μ and standard deviation σ. However, the multivariate normal distribution also considers a correlation matrix describing the relationship the different parameters have with one another.
To do this, we use the Cholesky parameterization of the multivariate normal distribution.
Figure 4.2: Not only are the mean and standard deviation of each parameter considered when utilizing a multivariate normal distribution, but also the relationships between parameters.
Here, we use the Cholesky factor $L$ of the $k \times k$ correlation matrix $\Sigma$. With this parameterization, the diagonal of $LL^{\top}$ is the unit $k$-vector.
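As a small numerical illustration (in R, with hypothetical values; a sketch of the parameterization rather than the estimation code itself), the correlation and covariance matrices implied by a Cholesky factor can be reconstructed as follows:

# Hypothetical group-level standard deviations for k = 3 parameters.
sigma <- c(0.8, 1.2, 0.5)

# Cholesky factor L of a k x k correlation matrix (lower triangular), built
# here from an example correlation matrix for illustration.
Omega <- matrix(c(1.0, 0.3, -0.2,
                  0.3, 1.0,  0.1,
                 -0.2, 0.1,  1.0), nrow = 3)
L <- t(chol(Omega))   # chol() returns the upper factor, so transpose

# L %*% t(L) recovers the correlation matrix, whose diagonal is all ones.
all.equal(diag(L %*% t(L)), rep(1, 3))

# Scaling by the standard deviations gives the implied covariance matrix,
# mirroring the diag_pre_multiply(sigma, L) construction commonly used in Stan.
Sigma <- diag(sigma) %*% (L %*% t(L)) %*% diag(sigma)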
One benefit of this parameterization is how simple it is to program between-group comparisons. For example, assume that two groups have the same precondition and postcondition tasks but differ in some experimental condition between the two tasks. Here we would assume that both groups are drawn from the same population for the precondition task, thus the hierarchical terms (e.g., $\mu_{\beta^1_{mf}}$ and $\sigma_{\beta^1_{mf}}$) would be equivalent between groups. Furthermore, correlations of precondition parameters would be expected to be the same between groups (e.g., $\rho_{\beta^1_{mf},\beta^1_{mb}}$). Yet the hierarchical postcondition parameters and correlations (e.g., $\mu_{\beta^2_{mf}}$, $\sigma_{\beta^2_{mf}}$, and $\rho_{\beta^1_{mf},\beta^2_{mf}}$) would differ between groups. Construction of the model is made simple by having the two groups' covariance matrices be equivalent for the precondition-precondition parameter correlation estimates (and other precondition hierarchical terms) but differ for the postcondition-postcondition and postcondition-precondition correlation terms.
When comparing the posterior predictiveness of models that differ only in their hierarchical terms, we see a slight increase in leave-one-out cross-validation out-of-sample prediction accuracy for the more intricately designed hierarchical models. Specifically, using a hierarchical normal distribution results in a slightly better fit than a model lacking any hierarchical terms (elpd = 16.6, se = 7.0), and using one hierarchical multivariate normal distribution fits slightly better than multiple separate normal distributions (elpd = 26.5, se = 14.3). These small increases in explaining participant choice data show that the subject-level fits were driven mostly by the subjects' actual choice likelihood, and that the hierarchical terms exerted only a small influence after being weighted by the characteristics of the group's subject-level estimates. Furthermore, this suggests that the parameter distributions estimated with the multivariate normal distribution did not latch onto something spurious during the estimation process (which would be a concern if the elpd difference were, e.g., negative and five times greater than its standard error).
4.8 Bayesian hypothesis testing
We have now outlined a detailed methodology for acquiring both subject-level and group-level parameter estimates in a way that preserves our uncertainty. During this process, each iteration produces a credible estimate for each modeled parameter (i.e., a single value for each subject-level and group-level parameter). Ultimately, the posterior distribution is the collection (i.e., the outputs from all iterations) of these credible parameter sets. Given this information, we utilize post-hoc linear regression on the outputted posterior distributions to test for main effects.
\[
\mathcal{N}(y \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^{k}}}\,\frac{1}{\sqrt{|\Sigma|}}\,\exp\!\left(-\frac{1}{2}(y-\mu)^{\top}\Sigma^{-1}(y-\mu)\right) \tag{4.14}
\]
\[
\mu \in \mathbb{R}^{k} \tag{4.15}
\]
\[
\Sigma \in \mathbb{R}^{k \times k} \tag{4.16}
\]
Previously, Kruschke discussed a Bayesian version of the t-test [66][67]. Here, we extend the methodology to allow for linear regressions, with the understanding that the inputted dataset is a posterior distribution consisting of many iterations.
In Kruschke's analysis, the Bayesian t-test was essentially produced by running a t-test on each single iteration, thereby creating a posterior distribution for the outputted statistics. We use a similar process, but in a linear regression model that utilizes the full posterior distribution.
We acknowledge other papers that warn against a two-step analysis [13]. However, in those analyses, the focus is on using the posterior means as a representation of the subject-level parameters. Rather than use the means or medians of the posterior distribution, here we perform linear regression using the posterior distribution itself.
Since we are using the posterior distribution, each subject will be represented multiple times, once for each of the numerous iterations of the MCMC process. As the number of iterations included in the mixed-effects linear regression model increases, so too does the accuracy of the estimates (i.e., a reduction in the standard error of the mean). We code the multiple representations of each subject as mixed effects, with not only a varying intercept, but also the other parameters of interest.
We begin with a simple example: do subjects show a change in model-free processing?

value = {bMF1, bMF2}
value ~ block + (block | sub) + (block | iteration)
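A minimal sketch of how such a regression could be run with lme4 (the data frame post_long and its columns value, block, sub, and iteration are assumptions about how the posterior draws are stacked into long format):

library(lme4)

# post_long: one row per subject x iteration x block, containing the posterior
# draw of the parameter of interest in `value` and a 0/1 `block` indicator.
m_change <- lmer(value ~ block
                 + (block | sub)
                 + (block | iteration),
                 data = post_long)

# The fixed effect of block summarizes the pre-to-post change in model-free
# processing across subjects and iterations.
summary(m_change)$coefficients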
Using a repeated-measures approach, we can say the dependent variables are $\beta^1_{mf}$ and $\beta^2_{mf}$. Here, $\beta^1_{mf}$ is the variance associated with the (Intercept), since we are concerned with the change of the parameter. In this model, we attempt to explain the change between $\beta^1_{mf}$ and $\beta^2_{mf}$, where $\beta^1_{mf}$ and $\beta^2_{mf}$ consist of many values (i.e., a posterior distribution rather than a single value). To address this, we allow the main effect of block to vary between individuals, along with the intercept. Thus, not only is each explanatory variable's mean (and associated standard error) estimated, but also the variance within the population.
We can extend the equation to include standardized performance measures from the
precondition. First we transform the estimated free parameters into iteration-specific z-
scores.
\[
z_{i} = \frac{x_{i} - \mu_{i}}{\sigma_{i}} \tag{4.17}
\]
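In R, this iteration-specific standardization might look like the following (assuming a hypothetical data frame draws with one row per subject and iteration and columns for the precondition parameters; here the group mean and sd are computed empirically within each iteration, though the group-level posterior draws of μ and σ could equally be used):

library(dplyr)

# Z-score each subject's draw within its own MCMC iteration, so that the
# standardization uses that iteration's group mean and standard deviation.
draws <- draws %>%
  group_by(iteration) %>%
  mutate(bMF1.z = (bMF1 - mean(bMF1)) / sd(bMF1),
         bMB1.z = (bMB1 - mean(bMB1)) / sd(bMB1)) %>%
  ungroup()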
With this construction, a participant's z-scored free parameter estimate will vary between iterations due to the nature of the MCMC process. That is, both the subject-level and group-level estimates change each iteration, thereby representing a slight decorrelation between $x_i$ and $z_i$. Moving forward, we are concerned with questions such as: do those who show increased model-free and/or model-based processing show more or less of a change in model-free processing? The previous linear model can be extended to address this question.
value = {bMF1, bMF2}
value ~ block + block:bMF1.z + block:bMB1.z
      + (block + block:bMF1.z + block:bMB1.z | sub)
      + (block + block:bMF1.z + block:bMB1.z | iteration)
With this parameterization, the z-scored precondition parameters only affect the change in the estimated dependent variable, i.e., the change in model-free processing from precondition to postcondition.
A concern might be the correlation between, e.g., bMF1.z and bMB1.z. Yet by using a mixed-effects model and the posterior distribution, rather than the means or medians of those distributions, we not only estimate those terms as fixed effects, but also as subject- and iteration-level mixed effects. This produces a correlation matrix and allows the model to correctly decorrelate such terms.
Chapter 5
Counterfactual
5.1 Introduction
Direct experience is but one way that we learn about the external world. According to prediction error theory, we have some expectation as to what will happen, and learning occurs when what happens deviates from those learned expectations [79]. Yet direct experience is limited [132]. Not only do we actually have to perform the action and endure whatever the outcome might be, but there is little reference as to how that action compared to the other potential actions.
Take, for example, driving in a busy city before the development of real-time traffic updates. One might take a route to work, consider events relevant during that drive (e.g., traffic due to a random accident), and evaluate whether it was faster or slower than expected, ultimately determining whether to take that same route again in the future. However, we also learn from actions not taken, i.e., learning from counterfactuals [14]. In this case, one might check Google Maps afterwards to see if an alternative route would have been faster.
Unique to situations with counterfactual learning is the comparison between the actual outcome and what would have happened if one had acted otherwise [56]. With this formulation, there is a difference between value-tracking and a factual-versus-counterfactual outcome comparison process. In value-tracking, each option has some value, and sampled information modulates this stored value representation. In a counterfactual comparative process, by contrast, there is less of a focus on the latent value of each option. Rather, the individual is behaving on the basis of what produced a better outcome. In the case of rejoice, a behavioral tendency is initiated in favorable situations, whereas regret follows nonfavorable situations (i.e., selecting another option would have resulted in increased reward).
In this experiment, we focus on counterfactual learning utilizing the two-step task, which
has been shown to disambiguate habitual and goal-directed processes through a hybrid
(model-free/model-based) reinforcement learning model. Specifically, we test participants
over two days via a repeated measures task design to gauge how decision-making changes
with consideration as to how the participant behaved in the original precondition task.
5.2 Methods
5.2.1 Participants
Fifty-two healthy participants (32 women, age - mean = 20.2, sd = 2.4) from the University of
Southern California and surrounding area participated in this experiment. All participants
provided written informed consent in accordance with procedures approved by the Institu-
tional Review Board at the University of Southern California.
Data from all subjects were included in the final analyses.
5.2.2 Experimental design
Participants were tested in two sessions, separated by two to three days.
During the first session, participants began with basic instructions outlining the two-stage task and a short 30-trial example task using the symbols that were utilized during the instructions. After the practice session, participants performed 175 trials in the two-stage task, which lasted 24 minutes. Afterwards, participants were given a five-minute break and ended the session with the Raven's progressive matrices.
The second session began with instructions outlining the counterfactual task and a 30-trial practice task. Afterwards, participants completed 250 trials in the counterfactual two-stage task, lasting 30 minutes. This was followed by a short ten-minute break and then the final 24-minute, 175-trial two-stage task. Each new block of the two-stage task utilized a unique set of symbols representing choices and acquired second-stage states.
Figure 5.1: When we examine the subjects fitted as using only model-free processing, we see that the probability of staying with the same choice is clearly affected by whether the trial was rewarded or punished. There seem to be no obvious differences in stay probability following common or rare transition trials.
Two-stage
Counterfactual two-stage
The counterfactual two-stage task was identical to the two-stage task, except that the counterfactual outcome was also displayed on all trials – i.e., what would have happened if the participant had selected the other location.
In our design, the unselected first-stage location followed identical rules as if it were the selected option. Therefore, it was possible for both first-stage options to transition to the same second-stage state, i.e., one location has a common transition and the other a rare transition. In such a case, the acquired second-stage alien was sampled twice. Therefore, it was possible for the acquired and counterfactual second-stage states to be the same alien, yet for the resulting outcomes to differ.
Raven progressive matrices
Intelligence Quotient (IQ) was approximated via Raven’s progressive matrices in a 20-minute
(maximum) session. This measure was used to explain the variance in some of the fitted parameters of interest, e.g., precondition model-based processing.

Figure 5.2: When we examine the subjects fitted as using only model-free processing, we see that the probability of staying with the same choice is clearly affected by whether the trial was rewarded or punished. There seem to be no obvious differences in stay probability following common or rare transition trials.
5.2.3 Data analysis
5.2.4 Reinforcement learning model
In modeling this task, we use the same learning model as discussed in the reinforcement learning model chapter.
5.2.5 Model comparison
To compare models, we used leave-one-out cross validation to measure the posterior pre-
dictiveness of each model [129] [128].
5.2.6 Linear modeling
To examine the effects of intelligence on model-free and model-based learning, we used mixed-effects linear modeling (lmer, package: lme4 [10]) on the iterations from the Bayesian MCMC process. Unlike typical linear regression, which focuses on significance via a p-value, we focused on the posterior distribution produced after running the linear model on each iteration. If an estimate's 95% highest posterior density interval did not contain 0, then the covariate was deemed significant.
5.3 Results
5.3.1 General findings
In our experiment, we ran subjects in three separate reinforcement learning blocks. In the
first and final block, subjects participated in a two-stage task that has been shown to dis-
ambiguate goal-directed and habitual action selection through a hybrid reinforcement val-
uation learning model. During the second block, subjects participated in the same rein-
forcement learning task, however, the counterfactual outcome was also displayed – i.e., the
second-stage state and outcome that would have resulted if the participant had selected the
other location.
Overall, the hybrid reinforcement learning model was well estimated via a hierarchical Bayesian parameter estimation process [120]. Specifically, all beta temperature weights – e.g., $\beta^1_{mf}$, $\beta^1_{mb}$, $ps^1$, $\beta^2_{mf}$, $\beta^2_{mb}$, $ps^2$, $\beta^{cf}_{mf}$, $\beta^{cf}_{mb}$, $ps^{cf}$, etc. – were estimated with a multivariate normal distribution, with each parameter's mean, standard deviation, and correlation coefficient matrix estimated completely from the acquired data – i.e., the priors on the hierarchical terms were uniform and the posterior density represents the Bayesian log-likelihood.
In the first block, 17.9% of actions were deemed exploratory – i.e., trials in which the participant selected the action of lesser value according to their fitted valuation model. We found that reaction time was significantly increased on trials in which subjects were modeled to have explored (Table 5.1).
5.3.2 Punishment versus reward
Central to our task design was the element of both reward and punishment. Previous stud-
ies have found differences in positive and negative prediction error learning [88][87]. In re-
sponse, we designed the task so that trial outcome could be either reward (trial outcome
= +1, +2, +3), punishment (trial outcome = -1, -2, -3), or no change. Furthermore, the la-
tent second-stage state values pseudo-randomly changed throughout the block so that there
were periods where both states resulted in reward or punishment, to disambiguate pure valuation learning from tendencies relating specifically to either reward or punishment.

Regression predicting reaction time
Predictor           coefficient    se        t-stat      p-value
(Intercept)         0.5976         0.0153    39.0424     0.0000
explore             0.0559         0.0099    5.6346      0.0000
r                   -0.0032        0.0107    -0.2992     0.9997
trans:r             -0.0198        0.0117    -1.6870     0.4065
explore:r           -0.0626        0.0260    -2.4065     0.0872
explore:trans:r     0.0629         0.0277    2.2712      0.1226
Table 5.1: Exploration was defined as selecting the action that was estimated to be of lesser value according to the fitted choice model. Reaction time was significantly increased on trials in which the participant explored.
To examine differences in punishment and reward learning, we constructed three vari-
ations of the reinforcement learning model. In the most basic model (model.basic), the re-
inforcement learning model was constructed without a difference in behavior following re-
ward and punishment. In another model (model.lr-valence), we had separate value learning
following reward and punishment.
\[
Q = (1-\alpha) \times Q + r_{t} \times 2 \times rewPunDiff \qquad \text{(following reward)} \tag{5.1}
\]
\[
Q = (1-\alpha) \times Q + r_{t} \times 2 \times (1 - rewPunDiff) \qquad \text{(following punishment)} \tag{5.2}
\]
\[
rewPunDiff \in [0, 1] \tag{5.3}
\]
Here, the rewPunDiff parameter modulates how much information was learned following reward compared to punishment. As the parameter surpasses 0.50, more value-based information is learned following rewards than following punishments, whereas values less than 0.50 denote increased learning following punishment. Values near 0.50 represent neutrality in valence-dependent value-based learning. With this parameterization, rewPunDiff does not influence reinforcement choice consistency – for example, two subjects differing on rewPunDiff would differ in how much they learn following reward versus punishment, but overall choice consistency (i.e., $\beta_{rl} \times Q$) is not directly affected.
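A small sketch of this valence-dependent update (a hypothetical helper function written in R; the actual model is estimated in Stan):

# Valence-dependent Q update following equations 5.1-5.2: rewPunDiff in [0, 1]
# shifts learning toward rewards (> 0.5) or punishments (< 0.5) without
# changing overall choice consistency, which is governed by the inverse
# temperature applied to Q at the choice stage.
update_q <- function(q, r_t, alpha, rewPunDiff) {
  w <- if (r_t > 0) 2 * rewPunDiff else 2 * (1 - rewPunDiff)
  (1 - alpha) * q + r_t * w
}

# Example: a punishing outcome under reward-biased learning (rewPunDiff = 0.6)
# is down-weighted relative to a valence-neutral learner (rewPunDiff = 0.5).
update_q(q = 0.3, r_t = -1, alpha = 0.2, rewPunDiff = 0.6)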
In the final model (model.punMF), we started with the model.basic and included an ad-
ditional punishment switching parameter, $\beta_{pun}$. Functionally, this parameter modeled a tendency to switch choices following punishment, on top of what had been learned through reinforcement learning.

Figure 5.3: Alongside reinforcement learning and perseveration, a model-free punishment switching tendency was found. This behavioral tendency was found to influence choice behavior to a similar degree as perseveration during the precondition task.
\[
Q_{av}(1, c) = Q_{av}(1, c) + \beta_{pun} \tag{5.4}
\]
With this parameterization, punishment switching is not affected by magnitude (e.g., following trial.outcome = -1, -2, or -3, a static value of $\beta_{pun}$ was added to that choice's $Q_{av}$ on the next trial) and, similar to perseveration, is reset following each trial. Therefore, although this parameter deals with punishment, it is not a learning tendency per se, but a behavioral tendency associated with punishing circumstances.
Using leave-one-out cross-validation, the model that best explained the acquired data was model.punMF. This model's out-of-sample prediction accuracy was significantly increased compared to model.basic (elpd = 253.2, se = 35.3) and slightly better than model.lr-valence (elpd = 45.8, se = 14.4).
Examining the parameter estimates relating to punishment switching revealed that this behavioral strategy affected behavior to a similar degree as perseveration, and was significantly reduced compared to model-based valuation (Figure 5.3). We found both model-free and model-based processing (est = -0.0390, se = 0.0078, t = -4.9726, p = 0.000) to be significantly negatively associated with punishment switching, whereas perseveration had a strong positive association (est = 0.2505, se = 0.0116, t = 21.6199, p = 0.000) (Table 5.2).
Regression predicting $\beta^1_{pun}$
Predictor           coefficient    se        t-stat      p-value
(Intercept)         0.6350         0.0460    13.8170     0.000
bMF1.z              -0.0179        0.0063    -2.8265     0.022
bMB1.z              -0.0390        0.0078    -4.9726     0.000
ps1.z               0.2505         0.0116    21.6199     0.000
bMF1.z:bMB1.z       0.0212         0.0040    5.2767      0.000
Table 5.2: Regressing $\beta^1_{pun}$ revealed significant associations with a number of the precondition parameters. Specifically, increased punishment switching was strongly associated with increased perseveration, along with weaker negative associations with both model-free and model-based RL. Interestingly, the interaction of model-free and model-based processing was positively associated with punishment switching. In this case, those with low levels of both model-free and model-based processing demonstrated a further increase of punishment switching.
5.3.3 Counterfactual task
During the second day of testing, subjects were tasked with 200 trials in a counterfactual
modification of the previous reinforcement learning task. We found that participants re-
ceived significantly increased reward in the counterfactual task (0.0202, 0.0043, 4.628 2.08e-
05) compared to the standard two-stage task. Furthermore, participant’s choices were less
stochastic, as determined by a decrease in the exploration rate (est = -0.3423, se = 0.0758, t
= -4.5129, p = 0.0000) , but also more perseverative (est = 0.4399, se = 0.0990, t = 4.4425, p =
0.0000) .
To unpack this, we constructed a mixed-effects logistic regression on the probability of exploring. We found that as the difference between the Q-values increased in magnitude, exploration decreased. Furthermore, exploration was greatly decreased in the counterfactual task, and also slightly reduced in the postcondition task compared to the precondition. Surprisingly, although the exploration percentage decreased, the effect of the absolute difference in Q-values also decreased in magnitude in the counterfactual and postcondition tasks, suggesting that the reduction in exploration was accompanied by a decreased reliance on reinforcement valuation tracking. In this case, it could reflect, e.g., increased usage of perseveration or other strategies.
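A sketch of such a regression with lme4 (variable names such as explore, q_abs_diff, and task are assumptions about how the trial-level data would be coded):

library(lme4)

# trials: one row per trial, with a 0/1 explore indicator, the absolute
# difference between the two Q-values on that trial, and a task factor
# (precondition, counterfactual, postcondition).
m_explore <- glmer(explore ~ q_abs_diff * task
                   + (q_abs_diff | sub),
                   data = trials, family = binomial)

summary(m_explore)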
In terms of valuation learning, participants demonstrated counterfactual learning, i.e.,
learning from the unselected location’s acquired second-stage state and resulting counter-
factual trial reward. We tested multiple models, including models without counterfactual
learning, along with models that had different learning rates for factual and counterfactual
learning. However, the counterfactual learning rate was not significantly different from fac-
tual learning, and the best-fitting model was one containing only one learning rate.
5.3.4 Comparative processes and model-free valuation
Given previous accounts of regret and rejoice consideration [22], we tested if participants’
behavior was affected by these comparative processes. We defined regret as trials in which
the participant’s acquired reward was less than the outcome of the unselected choice, and
rejoice as instances in which the participant’s acquired reward was greater than that of the
unselected choice.
We tested multiple models that coded combinations of model-free and model-based re-
gret and rejoice consideration. Here, model-free regret/rejoice affected the selected action,
with model-free rejoice leading to an increased probability of selecting the same action,
whereas model-free regret caused a decrease of selecting the same action. In comparison,
model-based regret/rejoice only occurred on trials in which both actions resulted in a com-
mon or rare transition. In such instances, model-based regret/rejoice consideration affected
the value of the action that commonly led into the acquired second-stage state. With this
construction, model-based regret/rejoice did not occur, e.g., if the selected choice had a
common transition and unselected choice had an uncommon transition. In such a case,
both actions transitioned to the same second-stage state. However, given that reward was
probabilistically affected by the second-stage state, it was possible that the two outcomes
differed. On such trials, there was model-free regret/rejoice consideration, but not model-
based regret/rejoice.
Ultimately, the best fitting model was one that included model-free regret, along with
model-based regret and rejoice consideration (i.e., significance was not found for a model-
free rejoice effect).
Examining the population means for the parameters of the counterfactual task revealed that choice responding was significantly altered compared to the normal task.

Figure 5.4: During the counterfactual task, participants' choice behavior demonstrated a lesser consideration of model-free and model-based processing than in the normal two-stage task, but an increase of perseveration. Furthermore, there was increased behavioral consideration for model-free regret compared to model-based regret and rejoice.

Specifically,
we see a reduction of model-based processing (est = -1.5132, se = 0.1779, t = -8.5067, p = 0.0000), and almost a complete lack of model-free processing (est = -0.9815, se = 0.0919, t = -10.6739, p = 0.0000). Yet, despite the reduction of valuation tracking, we do see that, e.g., increased precondition model-free processing was associated with increased model-free processing during the counterfactual task (est = 0.0074, se = 0.0008, t = 9.4530, p = 0.0000), and similarly for precondition model-based processing on counterfactual model-based processing (est = 0.0921, se = 0.0119, t = 7.7101, p = 0.0000).
We find that the reduction of model-free and model-based reinforcement valuation is replaced by regret and rejoice comparative considerations. Specifically, we see similar levels of behavioral consideration for model-free regret as we do for model-based processing (Figure 5.4) – which was the primary driver of behavior in the precondition.
When examining the relationships between the parameters of the counterfactual task, we find increased regret processing (both model-free and model-based) to be associated with increased rejoice processing (regretMFC.z: est = 0.0723, se = 0.0055, t = 13.2183, p = 0.0000; regretMBC.z: est = 0.0793, se = 0.0041, t = 19.2116, p = 0.0000). Furthermore, model-free regret processing was significantly negatively associated with model-based regret processing (est = -0.0871, se = 0.0054, t = -16.2468, p = 0.0000).
Next, we examined relationships between value-learning and regret/rejoice considera-
tions. Despite the extremely low estimates of model-free processing (est = 0.1183, se =
0.0024, t = 48.8772, p = 0.0000) , we found it to be significantly positively associated with
both model-free regret consideration (est = 0.0053, se = 0.0011, t = 4.9175, p = 0.0000) and
model-based regret consideration (est = 0.0025, se = 0.0007, t = 3.7542, p = 0.0009) . No sig-
nificant relationship was found between model-free processing in the counterfactual task
and model-based rejoice processing (est = 0.0006, se = 0.0006, t = 0.9714, p = 0.8482) .
In comparison, counterfactual model-based processing was found to have negative rela-
tionships with most of the comparative parameters. Specifically, we find strongly significant
negative associations between counterfactual model-based processing and model-free re-
gret consideration (est = -0.0951, se = 0.0096, t = -9.8920, p = 0.0000) and model-based
rejoice consideration (est = -0.0156, se = 0.0037, t = -4.1739, p = 0.0001), along with a slight negative association with model-based regret consideration (est = -0.0150, se = 0.0055, t = -2.7127, p = 0.0311).
Last, we relate the precondition parameters to behavioral tendencies in the counterfactual task. We begin by examining the relationships between value-tracking parameters. Precondition model-free processing was found to be associated with increased model-free processing (est = 0.0279, se = 0.0038, t = 7.2840, p = 0.0000) and decreased model-based processing (est = -0.0476, se = 0.0088, t = -5.4430, p = 0.0000) in the counterfactual task. Punishment switching had similar relationships, having a positive association with model-free processing (est = 0.0128, se = 0.0018, t = 7.2394, p = 0.0000) and a negative association with model-based processing (est = -0.0232, se = 0.0040, t = -5.8361, p = 0.0000) in the counterfactual task. In comparison, precondition model-based processing showed the opposite relationship, having a negative association with model-free processing (est = -0.0124, se = 0.0034, t = -3.6334, p = 0.0018) and a positive association with model-based processing (est = 0.1440, se = 0.0180, t = 7.9918, p = 0.0000).
5.4 Discussion
In the current study, we examined additional influences affecting choice behavior in a task
that was originally derived to disambiguate model-free and model-based processing. By uti-
lizing versions of the task with reward/punishment and counterfactual outcomes, we quan-
tified a number of additional considerations that influence future behavior.
In the first task, we identified a model-free punishment switching consideration. Specifically, we found that participants were more likely to switch choices following punishment, regardless of the trial's transition. This tendency was found to be decreased in individuals utilizing either model-free or model-based valuation, but increased in those who selected perseveratively. Interestingly, this differs from some other papers that have found a decrease of learning following negative prediction error [72]. We note a slight difference in our task's formulation: we focus on a distinction between reward and punishment, with a tendency to switch choices following punishment, whereas work by other groups has focused on the effect of positive and negative prediction error on valuation learning. Thus in one case a simple behavioral tendency is examined, and in the other a question concerning how prediction error modulates valuation learning.
In the counterfactual task, we found significance for both model-free and model-based
regret processing along with model-based rejoice processing. It was relatively unsurprising
not to find significance for the model-free rejoice processing because the population showed
far less model-free processing in the precondition task compared to model-based.
Unexpectedly, we found decreased model-based and model-free processing in the counterfactual task. This suggests that participants' choices were less based on a value learning process, and instead focused more on the comparison between the outcome of the selected choice and that of the unselected choice. Critically, with our model, these comparative processes only considered the previous trial and were reset on each new trial, whereas value learning (through the RL portion of the model) is a continuous value-tracking process.
Although this parameterization is slightly different from other previous works [86][88],
it captures the population’s decision tendency and is still quite consistent with previous re-
search.
Previous works have found that the orbitofrontal cortex is utilized in a number of counterfactual processes, including regret [17][112]. Further investigations should examine how orbitofrontal usage varies in a similar goal-directed task with counterfactuals.
Chapter 6
Observational learning
6.1 Introduction
How we select actions lies on a spectrum. On one end, there is goal-directed decision mak-
ing. At the core of this form of behavioral control is the goal itself. Once a goal is real-
ized, many different potential actions are considered, along with factors uniquely relevant
to the current situation [31][24][61]. Rarely is one action always best, and when acting goal-
directed, we consider which action is most likely to produce the outcome that is presently
desired.
At the other end of this behavioral spectrum is habitual action selection [32]. This action preference, unlike goal-directed processing, is much less about thinking, and more about efficiency [60] and reducing cognitive load [64]. Being goal-directed is effortful, and we can only divide our attention in so many ways [68][12]. Habitual action selection aims at easing this cognitive load by allowing us to prefer certain actions in a given type of situation [135]. We reduce a complex situation and perform the action which we have learned over time to be best, most of the time. In doing so, what was once a difficult, complex task is transformed into a seemingly automatic stimulus-response behavior [28].
Computational methods allow us to examine the variability of both habitual and goal-directed processing concurrently through a task that models these processes using model-free and model-based reinforcement learning algorithms [116][51]. Here habitual responding is associated with an increased probability of selecting a choice if it has previously been shown to lead to reward (learning occurs without reference to a model of the environment, i.e., model-free). Goal-directed responding, in contrast, is modeled as a way of selecting actions which considers the way in which the task is constructed (e.g., unique response-outcome associations that cause participants to interpret reward differently than under model-free consideration). It is believed this type of behavioral selection utilizes a sort of cognitive map [121] that relates actions with outcomes [8][139][140] along with previously learned information (e.g., trial outcome) to determine which action is truly best.
Yet, the dichotomy of goal-directed versus habitual behavior has most often been studied under a lens of direct experience [40][106][29]. It has previously been found that model-based processing seems to improve slightly with increased task experience [38]. But often we learn not only from our continued efforts in performing some task, but also from how others do it. And there are many questions concerning the various ways in which observational learning differs from direct experience in shaping future behavior.
Previous studies have shown that the two types of learning utilize different neural regions [16]. Observational learning has some additional unique interactions with the mirror neuron system [127][23], and beliefs expressed following observational learning can largely depend upon the beliefs the observer holds about the observed individual [75][105].
On a theoretical level, there are many reasons as to why there are these differences. In
experience-based learning, currently held beliefs are transferred to the environment (i.e.,
your beliefs cause you to prefer one action over another, which you then perform). And
hence, the outcome has a rather meaningful connection with the beliefs held when making
a decision to select that action. For example, imagine being honked at for running a red light
because you were in a rush to get to work. Being honked at was very much connected to
your action of selecting to run a red light. In a way, your action (running a red light) brought
about a response (being honked at) that had some valence (negative due to being embar-
rassed) thereby creating a learning experience (do not be late, keep better track of time).
Whereas in observational learning, there is more of a disconnect between yourself and the
action which brings about a response that initiates learning. In this sense, learning can oc-
cur in situations that you yourself would not involve yourself in. In reference to the previous
example, imagine being another car at the intersection and witnessing a seemingly erratic
driver run the red light. From this perspective, there are many potential ways to interpret the situation. One could reflect in a similar way as the driver, i.e., 'I should be careful to plan my time, otherwise I can make my life a dangerous rush.' Or you may think something entirely different: 'I have no idea what would cause a person to run a red light. They must be completely unaware of what they are doing. I should check my surroundings for careless drivers more often.' This relates to phenomena such as confounded learning, in which our beliefs shape how outcomes are interpreted [15].
In the current study, we examined how observational prediction differentially affects fu-
ture goal-directed and habitual action selection compared to direct experience-based learn-
ing. Specifically, we utilized a variation of the previously described task to measure habit-
ual and goal-directed action selection via model-free/model-based (hybrid) reinforcement
learning. This task was applied in a precondition and postcondition to examine how the
following experimental task conditions affected valuation processing.
Our experiment consisted of two core groupings (experience-based learning and obser-
vational learning), and three subgroups within the observational learning group. During the
experimental task, the experience-based group participated in yet another block of the main
reinforcement learning task, whereas the observational prediction groups observed a previous subject in the task, with the goal of correctly predicting that previous subject's next response. Critically, the three subgroups differed in the type of previous subject they predicted: one group predicted a model-free learner, another a model-based learner, and the final group predicted a previous learner that was estimated to be using a mixed valuation model (containing elements of both model-free and model-based processing).
Given the current understanding of goal-directed versus habitual processing, we expected good within-subject consistency between blocks, but hypothesized that predicting a model-free learner may increase model-free processing in participants, in comparison to experience-based learning, especially in participants who had lower choice consistency in the precondition. This is because we expect that those types of subjects have the ability to use model-free processing, but they might not have been able to demonstrate this ability without first observing and predicting a previous learner. In the current experiment, although the previously observed learners differed in their valuation strategy (e.g., model-free versus model-based learner prediction), they were all quite consistent in their choices – for example, we did not use previous participants who showed only model-free processing but with a low estimate, which would suggest highly random behavior. So if our population contains subjects with low estimates of both model-free and model-based processing, we expect that model-free observation will increase their model-free processing, more so than for similar types of subjects following experience-based learning.
We further hypothesize that predicting a model-based learner will increase model-based
processing in comparison to experience-based learning, especially in subjects with above
average model-based abilities. We believed this because we give ample instructions to par-
ticipants about habitual versus goal-directed strategies, and they are specifically informed
that a goal-directed strategy will result in more reward. Thus, if the participants were unable
to show model-based processing in the precondition, then they might be completely unable
to demonstrate that form of behavior responding. Whereas, those that demonstrated model-
based processing during the precondition may be better able to consider model-based in-
formation (an increase of model-based processing) following the observation of another
model-based learner than if they just had continued experience with the main task. Here
it is important to note that while two participants using model-based processing will gen-
erally show a similar response (since they both are considering model-based information),
they will still differ in their choices due to differences in, e.g., how many previous trials they
consider (learning rate) or fluctuations of random behavior (i.e., when they choose to make
an exploratory decision or a when they, for whatever reason, preform an action they did not
mean to). So despite both using a goal-directed strategy, it might be that those subjects will
be better able to be goal-directed following observation of another goal-directed individual
because they will be able to contrast their form of model-based processing versus another
individual using a slightly different model-based variation. This is coupled with difference of
goal between the two main experimental task conditions (i.e., predict the observed learner
who is trying to maximize reward, compared to being the one who is attempting to maximize
reward).
Lastly, we cite recent literature that has focused on a revised computational definition of
habits [76]. In this sense, we also focus strongly on perseveration, i.e., the tendency to select
an action solely because it had been previously selected (without reference to if that action
had been shown to be good or bad in the past).
We believe that participants with high perseveration will show less perseveration following observational prediction than following experience-based learning. We expect this because per-
severation, like model-free and model-based processing, is a heuristic we use to give pref-
erence to some potential action. Specifically, we believe this tendency is proportionally in-
creased in subjects with decreased model-free and model-based valuation processing. Such
subjects are not able to use reward in a way to guide future decisions (i.e., low model-free
and model-based value learning), so instead we believe that they are mainly repeating their
previous choices. This is in comparison to, e.g., a positive correlation, which might suggest
that subjects with a decreased ability to use previous reward information show reduced per-
severation, and in the case of our model, it would mean that their choices are just highly
random. Given that the goal will be different in the observational prediction task, we pre-
dict that those types of subjects, i.e., those with increased perseveration, will show reduced
perseveration and potentially increased model-free or model-based responding, depending
on which type of learner they predicted. If there is in fact a negative correlation between
perseveration and reinforcement learning choice consistency, then those types of subjects
may develop specifically increased model-free processing, given their low ability to use either model-free or model-based processing in the precondition and previous reports on the association between intelligence and goal-directed valuation [48].
6.2 Methods
6.2.1 Participants
One hundred and eighty-three healthy subjects (95 males, age mean = 20.4, sd = 2.4) par-
ticipated in a three-part experiment at the University of Southern California. Subjects were
all current residents of the Southern California area and/or current psychology students at
the University of Southern California. Participants received either psychology course credit plus a monetary reward or, for the non-psychology students, an increased monetary reward. Overall,
we found no significant difference in behavior between USC and non-USC students, nor be-
tween subjects of different monetary payouts.
All subjects participated in the three blocks of the experiment, and no subject data was
excluded.
Figure 6.1: Participants in each group were tested in three separate blocks. While the first and third blocks were identical across all groups, the experimental learning task (block 2) differed. Specifically, those in the experience-based learning group participated in the same task as in the first and third blocks, whereas those in the observational prediction group observed and predicted a previous participant who was estimated to be using either model-free, model-based, or a mixture (hybrid) of both processes.
6.2.2 Two-stage valuation task
In two separate blocks (precondition and postcondition, and the experimental manipulation
for the experience-based learning group), participants performed 175 trials in a two-stage
valuation task that has been shown to disambiguate model-free and model-based respond-
ing. Specifically, model-free and model-based responding are ways in which reinforcement
learning compares a cached value representation (habitual action preference), against an
action preference that calculates the best action via an understanding of the task’s contin-
gencies (in this case, action-outcome associative learning, an element of goal-directed pro-
cessing [49]).
Each trial began with a selection between the two first-stage symbols. Subjects had 2 seconds to select the symbol on the left or the right of the screen using the v or n key, respectively. The position of the symbols varied between trials, and participants were informed that this was done to keep them attentive.
Following selection of one symbol, one of two potential symbols appears on the screen,
denoting the acquired second-stage state. In our design, each of the two first-stage sym-
bols probabilistically transitioned to one of the two second-stage states with 75% probabil-
ity (common transition) and 25% to the second-stage state that the other first-stage symbol
more commonly transitions to (rare transition).

Figure 6.2: Normal two-stage task used to disassociate habitual and goal-directed processes affecting action selection. Critically, first-stage actions probabilistically transitioned to second-stage states, and the resulting second-stage state affects reward probability. Therefore, ideal action selection involves consideration of not only the previously selected actions and their resulting reward, but the transition type (second-stage state).
Participants are informed that the acquired second-stage state (and not their actual se-
lected action) directly affects the trial’s reward probability. However, unknown to them, each
second-stage state was independently either in a state of low (20%), moderate (50%), or high
(80%) reward probability. To encourage on-going learning, second-stage state value pseudo-
randomly changed approximately every seventeen trials.
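A minimal simulation sketch of this transition and reward structure (the probabilities follow the description above; the function and variable names are illustrative):

set.seed(1)

# First-stage option 1 commonly (75%) leads to state A, option 2 to state B.
common_state <- c("A", "B")

simulate_trial <- function(choice, reward_prob) {
  # 75% common / 25% rare transition.
  state <- if (runif(1) < 0.75) common_state[choice] else common_state[3 - choice]
  # Reward depends only on the acquired second-stage state.
  reward <- as.integer(runif(1) < reward_prob[state])
  list(state = state, reward = reward)
}

# Example: state A currently in high (80%) and state B in low (20%) reward.
simulate_trial(choice = 1, reward_prob = c(A = 0.8, B = 0.2))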
With this construction, a model-free valuation scheme uses only the acquired reward to influence the believed value of the selected first-stage symbol, ultimately selecting an action based on how often it usually produces reward. A model-based valuation scheme, in contrast, considers not only the acquired reward, but also the second-stage state which affected the reward's probability. By considering the second-stage state, a goal-directed preference for an action is created by considering which action is most likely to transition to the second-stage state believed to be of greater value.
For example, consider receiving reward following a rare transition. A model-free learner
would be more likely to select the same first-stage symbol because performing that action
was just rewarded. In comparison, a model-based learner would be more likely to select
the unselected first-stage symbol. The model-based learner considers the model of the task
and produces a preference for the unselected action because that action is more likely to
transition to the recently rewarded second-stage state.
Ultimately, given that subjects are well informed of the task’s contingencies, model-based
processing should be preferred to model-free processing.

Figure 6.3: In the observational prediction task, participants observed the first five trials, and for the remainder of the trials were rewarded for correctly predicting the observed learner's next choice. Critically, following prediction the participant saw the actual choice and the learner's acquired second-stage state, i.e., the participant's actions in no way affected the observed learner's actions or environment.

Yet, given that model-based pro-
cessing has a hidden intrinsic cost (cognitive effort [64]), it is important that increased model-
based valuation actually causes an increase to reward in the task. For example, certain ran-
dom walks may produce environments in which goal-directed behavior does not actually
improve reward – the two second-stage states have the same reward probability or the walk
is too stochastic. This was a key factor in why we used the task's current design, i.e., to ensure that increases to model-based weighting actually produce increased reward when controlling for overall reinforcement learning choice consistency.
6.2.3 Observational prediction task
During the experimental learning task block, subjects in the observational prediction group
(n = 123, whereas experience-based learning n = 59) participated in an observational pre-
diction-modified version of the two-stage valuation task. These subjects were tasked with, and rewarded for, correctly predicting the choices made by a previous participant (observed learner) from a previous control experiment that utilized the two-stage valuation task, administered in a similar way to this experiment's precondition task.
During this 200-trial task block, subjects began by observing the first five trials of the ob-
served learner’s task experience. Following the fifth trial and from then on, each trial began
89
with a two second period for the participant to predict the observed learner’s next choice.
Following a correct prediction, a green checkmark was displayed (+$0.05 per correct predic-
tion). Whereas both incorrect prediction or failure to make a prediction resulted in a red X
over the symbol unselected by the learner (+$0 no reward). Following both outcomes, the
learner’s actual choice moved to the center of the screen.
Therefore regardless of the prediction results (i.e., correct or incorrect prediction), the
observed learner’s actual choice, resulting second-stage state, and trial outcome were then
displayed. With this design, the participant experiences the previous reinforcement learning
task in a defined way – the way in which the observed learner experienced it. The partici-
pant’s actions do not affect the observed learner’s choices, thus the participant is in a way
disconnected from the environment's response to the learner. The only difference between participants viewing the same learner was whether their predictions were rewarded or unrewarded.
As mentioned, participants were informed that the observed learner was a previous sub-
ject that had participated in the same task as was done in the precondition, and that this
observed learner was given identical background information concerning the tasks’ contin-
gencies. In order to examine the potential interactive effects of observational prediction, subjects were unknowingly tasked with predicting the actions of a previous participant who used either purely model-free learning (n = 42), model-based learning (n = 39), or a mixture of both learning schemes (hybrid, n = 42).
6.2.4 Questionnaires
After each block, subjects answered a few short questions concerning the task. Following the precondition and postcondition tasks, subjects were asked how they processed previous task information; specifically, they selected one of three responses that best described the three different valuation schemes (i.e., model-free, model-based, and a mixture of the two). They were also asked how many trials back they considered when making choices (a five-option response, i.e., only the previous trial, the past couple of trials, etc.). These questions were carefully worded so that they reiterated aspects that were previously present in the instructions.
6.2.5 Data Analysis
Trial-by-trial learning and overall task performance were primarily analyzed through a reinforcement learning model [26] that considered model-free and model-based influences on action selection. However, we also relate these parameters back to a stay-switch probability analysis to both: 1) ensure that our fitted model parameter estimates were consistent between modalities and 2) simplify the conclusions from the computational analysis.
Reinforcement learning model
We represented goal-directed and habitual behavior through a hybrid valuation reinforce-
ment learning model [26][2]. In this model, habitual and goal-directed processes both in-
fluence action selection. A softmax decision rule does this by ultimately summing valuation
(which influences action selection) from three sources: model-free RL (habitual), model-
based RL (goal-directed), and value-independent previous choice history (perseveration).
The full model is outlined in the earlier sections; however, since this task only contained
reward or no reward, task outcome was recentered to {-1, 1} and we did not include the pun-
ishment switching parameter.
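As a schematic illustration of this combination (a hypothetical helper function in R, not the estimation code; the three weighted terms correspond to the model-free, model-based, and perseveration influences described in the text):

# Probability of choosing each first-stage option given model-free values
# (q_mf), model-based values (q_mb), and a perseveration indicator
# (rep = 1 if the option was chosen on the previous trial, else 0).
choice_prob <- function(q_mf, q_mb, rep, b_mf, b_mb, ps) {
  # Net preference for each option is a weighted sum of the three sources.
  v <- b_mf * q_mf + b_mb * q_mb + ps * rep
  # Softmax over the two options.
  exp(v) / sum(exp(v))
}

# Example: option 1 has the higher model-based value and was chosen last trial.
choice_prob(q_mf = c(0.2, 0.4), q_mb = c(0.7, 0.1), rep = c(1, 0),
            b_mf = 1.0, b_mb = 2.5, ps = 0.5)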
Bayesian parameter estimation
To estimate the parameters in the reinforcement learning tasks, we used rStan [120] to perform a hierarchical empirical Bayesian analysis. Specifically, we estimated model-free inverse temperature, model-based inverse temperature, and perseveration between blocks all together in one multivariate normal distribution that utilized a Cholesky correlation matrix between the parameters within-subjects. Furthermore, we separately estimated selected and unselected action learning rates as separate parameters ∈ (0, 1) via separate beta distributions per block.
Since our groups consisted of individuals from the same population, for the precondition all groups were governed by the same hierarchical terms. For example, $\mu_{bMF1}$, $\sigma_{bMF1}$, and the associated first-block correlations $r$ were identical between groups, but $\mu_{bMF2}$, $\sigma_{bMF2}$, and the remaining correlation coefficients $r$ differed between groups.
In short, there are many advantages to this methodology in comparison to an analysis that estimates the reinforcement learning free parameters separately per subject. Perhaps the largest advantage is that each free parameter is represented by a distribution (in this case 4000 credible values) rather than a single maximized value (i.e., what would be produced by most frequentist methods – MLE, MAP, and EM).
However, this design also permits us to construct the model so that the estimation process is completely empirical Bayesian. By this we mean that the population terms (e.g., $\mu_{bMF1}$) are estimated during the model-fitting process and these hierarchical factors both: 1) impact the values of the subject-level parameter estimates (i.e., they become a prior on the subject-level fits) and 2) are impacted by the subject-level parameter estimates (i.e., the likelihood that arises from the subjects' choices shapes the population-level terms).
Generalized mixed linear model
The benefit of Bayesian parameter estimation, compared to point-estimates (MLE, MAP, EM), is that the resulting output is a posterior distribution rather than a single value, allowing us to have a measure of confidence for where the true parameter value lies given the data and model. However, our analyses are most concerned with hypotheses that follow from the parameter estimation. For example, while the hierarchical parameter fits are important (e.g., $\mu_{\beta^2_{mf}}$ between the model-free and model-based learner prediction groups), we are also interested in questions such as which types of subjects showed change. For example, do participants measured as being strongly model-based in the precondition show more model-based processing following experience-based learning or observational prediction? Is this change similar for participants that showed dominantly model-free processing during the precondition?
To conduct such analyses, we use the outputted subject-level posterior distributions in a mixed-effects repeated-measures generalized linear model. The benefit of this procedure is that we are able to first estimate the parameters in a completely Bayesian fashion (which takes a long period of time), and then use the outputted distributions in a model that correctly accounts for the relationship between posterior iterations. Furthermore, compared to a fully frequentist approach (e.g., using maximized values rather than posterior distributions), we are able to obtain variance measurements on the covariates of interest, along with standardized effect sizes.
In this basic construction, we use mixed regression on our parameter of interest (e.g., β¹mf and β²mf), such that each subject is included 8000 times (4000 iterations x 2 parameters) in the model. To relate precondition with postcondition values, the block term ∈ {0, 1} was coded so that each subject's baseline was the estimate of the precondition parameter (β¹mf), and the postcondition parameter was the precondition parameter plus the effect of block (β²mf = β¹mf + block).
To account for group differences, the terms block × task and block × task × agent allowed for examination of how the different experimental task manipulations affected future task behavior: a block × task effect denotes a difference between experience-based learning (task = 0) and observational prediction (task = 1), which was further broken down via the observed-agent subgroups with block × task × agent.
To investigate which types of subjects were affected by each type of experimental manipulation, we constructed linear regressions that included block × z-scored precondition interaction terms. Here, significance for a block × z-scored precondition interaction term denotes that as the z-scored precondition measure changed linearly in value, so too did the change in the parameter of interest (the effect of block).
As a natural extension, the block × z-scored precondition interaction with task and task × agent represents a between-group difference in the effect of the precondition estimate. For example, a positive value for block × β¹mf.z × task would signify that as z-scored β¹mf increased, so too did the parameter of interest in participants of the observational learning task, compared to those of the experience-based learning manipulation.
Ultimately, to specify the mixed effects, each subject was estimated to have their own unique contribution to the different population-estimated parameters, whereas each iteration contributed to each of the group interaction terms.
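As a simplified sketch of this procedure (the data frame and column names below are illustrative, not those of the analysis code, and the full model additionally accounts for the structure across posterior iterations), the posterior draws can be stacked into a long data frame with one row per subject, block, and iteration, and submitted to a mixed-effects regression:

library(lme4)

# 'post' is assumed to be a long data frame with columns:
#   subject, iteration, block (0 = precondition, 1 = postcondition),
#   task, agent, beta_mf (the posterior draw for that block), and the
#   z-scored precondition covariates bmf1.z, bmb1.z, bps1.z.
fit <- lmer(
  beta_mf ~ block * task * factor(agent) * (bmf1.z + bmb1.z + bps1.z) +
    (1 + block | subject),
  data = post
)
summary(fit)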
6.3 Results
6.3.1 General findings
To analyze task behavior, we fit subjects’ choices to a hybrid valuation reinforcement learn-
ing model [26][27]. More specifically, this model estimated the separate contributions of
both model-free and model-based learning on action selection.
We found high validity for our computational model given our task’s formulation. When
stratifying subjects by their model-based weight parameter estimates, the resulting stay-switch behavior plots (Figure 6.4) are consistent with what has been commonly reported in previous literature [138][78].

Figure 6.4: Stratifying subjects based on their model-based weight as estimated by the reinforcement learning model, we see the typical stay-switch patterns for each method of responding.
Subjects identified as using primarily model-free valuation show an effect of only reward on stay probability, whereas those with model-based valuation show only a reward × transition effect. Finally, those with mixed valuation show a positive main effect for both reward and reward × transition.
Consistent with our task design, we found a significant positive effect of inverse temperature and model-based weight on reward. Previous studies have found that model-based valuation requires more effort than model-free valuation [2][64][61]; thus our finding that increased model-based weighting (i.e., the proportion of model-based processing against the sum of the model-free and model-based terms) resulted in increased reward suggests that there is reason to prefer model-based over model-free valuation in our task.
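Using ω to denote this weight, as in the regression tables below, the definition above amounts to

$$ \omega = \frac{\beta_{mb}}{\beta_{mf} + \beta_{mb}}, $$

so that ω = 1 corresponds to purely model-based valuation and ω = 0 to purely model-free valuation.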
Further in support of our parameter estimates, we found high correlations between the associated inverse temperature measurements within individuals between blocks. Specifically, the correlation matrix was estimated during the Bayesian model-fitting procedure via its Cholesky factorization. For example, β¹mb and β²mb positively correlated (mean = 0.3549, ci min = 0.1756, ci max = 0.5117) in the experience-based learning group, as did β¹mf and β²mf (mean = 0.4343, ci min = 0.1841, ci max = 0.6528). Similar values were also found for each of the observational groups, suggesting some trait-like, rigid qualities to model-based weighting.
Yet contrary to one of our original beliefs, we see that perseveration positively correlates
with both model-free and, even more so, model-based processing during the precondition.
Ultimately, this causes us to question just how increased perseveration will affect future anal-
yses concerning the change of these parameters.
Following the first reinforcement learning task, we asked the subjects a few questions about their experience. One question asked which valuation scheme they believed they had used; for this question, we had previously described the different types of valuation and allowed for a three-choice response. We found that model-based processing, β¹mb, explained a significant amount of the variance in the subjects' beliefs about which strategy they used, but β¹mf did not.
Regression predicting subjects' self-believed strategy usage (i.e., ω¹)

Significant predictor      coefficient        se      t-stat    p-value
(Intercept)                     0.5975    0.0019    306.9547     0.0000
β¹mf.z                         −0.0219    0.0265     −0.8266     0.9142
β¹mb.z                          0.1393    0.0297      4.6946     0.0000
β¹ps.z                          0.0054    0.0094      0.5788     0.9800
β¹mf.z : β¹mb.z                 0.0372    0.0309      1.2022     0.7019

Table 6.1: Estimated model-based, but not model-free, processing was associated with the valuation strategy the participant believed they used during the precondition.
Participants were also asked how many trials back they considered when updating their current beliefs. Responses included "only the previous trial", "the previous 2 trials", and so on (a 5-choice response). We found that learning rate in the precondition significantly correlated with this response, but inverse temperature, model-based weight, and their interactions did not.
Observational prediction
In our experiment, groups differed in the experimental task block that followed the first two-stage task. Specifically, experience-based learning was contrasted with observational prediction in the ability to induce a change of behavior from what was measured in the precondition, with observational prediction being further subdivided by which type of learner was observed and predicted.
We found that, among those who participated in the observational prediction task, there was an increased ability to correctly predict an observed model-free learner compared to a model-based learner. However, we found significance for a factor(agent)1 × β¹mb interaction, signifying that as precondition model-based processing increased, so too did the ability to correctly predict the model-based learner. We found a trending positive effect for the factor(agent)0.5 × β¹mb interaction. Overall, this reflects the increased difficulty for those with low model-based processing to predict a learner using goal-directed choice selection.
Using a computational theory-based approach, we fit a modified hybrid reinforcement learning model to explain prediction behavior. Our model was designed such that the participant utilized some combination of model-free and model-based valuation processing when interpreting the observed learner's cumulative experience and predicting their next response. Thus, the observed learner was utilizing some unknown weighting of model-free and model-based processing, and the participant in turn utilized some weighting of model-free and model-based processing in comprehending (i.e., simulating) how the observed learner would act. As the participant's model-based weighting neared that of the observed learner's, prediction performance increased.
Despite the task experience differing between groups only in which perspective was taken – for example, one group experienced the task from a model-free choice perspective, whereas another group did so from a model-based perspective – we found markedly different levels of model-free and model-based processing between groups, as evidenced by the hierarchical mean terms of the associated model-free and model-based estimates. This suggests that the participants in each observational prediction group were doing something slightly different from one another.
We defined an alignment parameter as the difference between the ideal model-based weight given the participant's condition and the model-based weight that the participant was estimated to utilize. For example, during the model-free observational prediction task, we considered the ideal model-based weight to be 0, and participant model-based weight deviations away from 0 would reduce alignment. For experience-based learning and model-based prediction, the ideal model-based weight was 1. For the hybrid learner, we set the ideal model-based weight to 0.5, and deviations toward 0 or 1 reduced alignment to 0.
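A hypothetical helper illustrating this measure is sketched below; the exact scaling used in the fitted analysis is not restated here, so the function simply implements one version consistent with the description (alignment is maximal when the participant's weight matches the ideal weight and falls to 0 at the maximal possible deviation):

# w_participant: the participant's estimated model-based weight (0 to 1)
# w_ideal: the ideal weight for their condition (0 = model-free learner,
#          0.5 = hybrid learner, 1 = model-based prediction / experience-based)
alignment <- function(w_participant, w_ideal) {
  max_dev <- max(w_ideal, 1 - w_ideal)        # largest possible deviation
  1 - abs(w_participant - w_ideal) / max_dev  # 1 = perfect alignment, 0 = worst
}

alignment(0.2, 0)    # predicting a model-free learner with a weight of 0.2
alignment(0.6, 0.5)  # predicting the hybrid learner with a weight of 0.6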
Consistent with the previous finding, we found that participants were better able to align their model-based weight to the model-free learner. Surprisingly, we found that model-free processing in the precondition negatively impacted alignment for all groups, suggesting that increased model-free processing did not allow for increased prediction of a model-free learner. In terms of precondition model-based processing, increases to this parameter caused a decrease of alignment in the experience-based learning group and an even further decrease when predicting a model-free learner, but an increase of alignment during model-based and hybrid prediction. In a way, these complex relationships suggest that during model-free prediction those with decreased model-free and model-based processing (i.e., participants who made very stochastic decisions) were better able to align their choices with the model-free valuation strategy, whereas model-based processing allowed for an increasing ability to align valuation with model-based and hybrid observed learners.
Regression predicting alignment

Significant predictor                    coefficient        se      t-stat    p-value
(Intercept)                                   0.4912    0.0331     14.8519     0.0000
factor(agent)0                                0.2534    0.0565      4.4859     0.0002
β¹mf.z                                       −0.0161    0.0041     −3.9143     0.0018
β¹mb.z                                       −0.0132    0.0042     −3.1016     0.0345
β¹mb.z × task                                 0.0231    0.0067      3.4385     0.0111
β¹mb.z × task × factor(agent)0               −0.0356    0.0071     −5.0155     0.0000
β¹mf.z : β¹mb.z                               0.0115    0.0038      3.0457     0.0414
β¹ps.z × task × factor(agent)0.5             −0.0302    0.0093     −3.2456     0.0213

Table 6.2: We defined alignment as the ability to align prediction model-based weight with that of the observed participant. Here we found that increased model-free processing led to a decreased ability to align valuation strategy, whereas precondition model-based processing was associated with an increased ability to align behavior with a model-based and hybrid learner, but a decreased ability to align with the model-free learner.
Figure 6.5: The hierarchical terms representing the mean and standard deviation of model-free processing in the postcondition (μ and σ of β²mf) between the different groups. Participants that predicted the model-free and hybrid learners showed increased model-free inverse temperature. We also see slight differences in perseveration between the same groups; despite having significantly reduced perseveration, the experience-based group is also characterized by increased variance.
6.3.2 Understanding the differences following experience-based learning
and observational prediction
To first compare experience-based learning and observational prediction, we examine the hierarchical population terms for the parameters of interest. As mentioned, during our Bayesian analysis we calculated the latent means, standard deviations, and correlations for the group-level estimates using a multivariate normal distribution over the subject-level parameters that affect the softmax choice rule (i.e., the free parameters βmf, βmb, and βps).
We found that the groups that predicted either a model-free or mixed valuation learner had increased model-free responding in the postcondition compared to both the experience-based learning group and model-based learner prediction. Furthermore, the experience-based and model-based prediction groups were estimated to have significantly greater postcondition model-free processing variance than the model-free and hybrid prediction groups, suggesting that there was more to the story than just an overall difference between groups in the postcondition model-free processing population means.
Increased perseveration was found for both the model-free and hybrid prediction groups compared to the experience-based learning group, but, again, the experience-based learning group had significantly increased variance. We also found that the model-free prediction group had significantly increased perseveration compared to the model-based prediction group, but the groups did not differ in variance (Table 6.5).
In terms of model-based valuation, we found no significant differences when examin-
ing only the fitted hierarchical means governing the postcondition parameters of interest.
However, we did find increased variance in postcondition model-based processing for the
experience-based group compared to the model-free prediction group, despite the similar
central tendencies.
Given these results, we wished to further understand how observational prediction af-
fected value-based decision making, with reference to which types of subjects were affected
in the different manipulations and in which ways. For this, we used mixed effects linear
regression on the fitted parameter posterior distributions.
6.3.3 Effects of observational prediction on model-free and model-based
responding
We first examine model-free and model-based processing with the following covariates: group, block, and the iteration-dependent z-scored precondition covariates (β¹mf.z, β¹mb.z, β¹ps.z, and β¹mf.z × β¹mb.z), each entering a × block interaction.
We begin our analysis with the change of model-free processing. A significant negative estimate for block is found, along with significant positive estimates for the interactions block × task × factor(agent)0 and block × task × factor(agent)0.5. These effects relate back to our previous findings – i.e., those who predicted model-free or hybrid learners showed increased model-free processing in the postcondition compared to both model-based prediction and experience-based learning. Furthermore, the effect for block and the lack of significance for block × task suggest that there was an overall decrease in model-free processing following both experience-based learning and model-based observational prediction, but a slight increase following model-free and hybrid prediction.
Significance was found for block × β¹mf.z × task, suggesting that model-free processing in the precondition task affected the change in model-free processing for those in the observational prediction task, compared to experience-based learning. However, this was further complicated by the presence of both block × β¹mf.z × task × factor(agent)0 and block × β¹mb.z × task × factor(agent)0 terms. In this case, increased precondition model-free processing led to an increase in model-free processing for the observational group. Furthermore, those who predicted a model-based agent had this effect intensified (i.e., precondition model-free processing more heavily affected the change of model-free processing for those in the model-based prediction group than in all other groups).
We also see a protective effect of model-based processing on the change of model-free processing in the experience-based group (a negative block × β¹mb.z interaction). However, this effect is reversed for the model-based prediction group, intensified for the model-free prediction group, and reduced in magnitude for hybrid prediction. What this suggests is that for experience-based learning, model-free prediction, and hybrid prediction, increased precondition model-based processing led to a decrease in model-free processing. But following model-based observational prediction, this effect is either not present or slightly positive, i.e., those with increased model-based processing showed a slight increase of model-free processing following model-based observational prediction.
Lastly, we see that perseveration had a protective effect on model-free processing in all
groups except model-free prediction. We find that this effect is further magnified following
model-based prediction.
Regression predicting β¹mf → β²mf

Significant predictor                          coefficient        se      t-stat    p-value
(Intercept)                                         0.6083    0.0502     12.1114     0.0000
block                                              −0.2206    0.0446     −4.9502     0.0000
block × task × factor(agent)0                       0.3671    0.0761      4.8225     0.0000
block × task × factor(agent)0.5                     0.3880    0.0761      5.0972     0.0000
block × β¹mf.z × task                               0.1149    0.0090     12.7160     0.0000
block × β¹mf.z × task × factor(agent)0             −0.0943    0.0098     −9.5753     0.0000
block × β¹mf.z × task × factor(agent)0.5           −0.0398    0.0099     −4.0097     0.0015
block × β¹mb.z                                     −0.0304    0.0061     −4.9917     0.0000
block × β¹mb.z × task                               0.0411    0.0095      4.3371     0.0004
block × β¹mb.z × task × factor(agent)0             −0.0562    0.0101     −5.5908     0.0000
block × β¹mb.z × task × factor(agent)0.5           −0.0338    0.0103     −3.2729     0.0231
block × β¹ps.z                                     −0.0210    0.0054     −3.8868     0.0023
block × β¹ps.z × task                              −0.0483    0.0087     −5.5737     0.0000
block × β¹ps.z × task × factor(agent)0              0.0580    0.0093      6.2524     0.0000
block × β¹ps.z × task × factor(agent)0.5            0.0308    0.0095      3.2280     0.0268
Repeating the regression analysis on model-based processing revealed changes that were not as readily apparent from the previous comparisons, which utilized only each group's estimated hierarchical postcondition terms.
Overall, participants showed increased model-based processing. Furthermore, following experience-based learning, both increased model-free processing and increased model-based processing were associated with an increase of model-based processing in the postcondition. In contrast, following observational prediction, we see a negative effect of model-free processing on model-based processing, with the magnitude of this effect being greatest following model-based observational prediction. This suggests that following experience-based learning, model-free participants are able to show an even greater increase of model-based processing, whereas the previously identified increase of model-based processing (the effect of block) is reduced with increasing model-free processing following observational prediction. Furthermore, we see a negative block × β¹mb.z × task interaction. Given the previously identified significant terms, increased precondition model-based processing is associated with a further increase of model-based processing, and the magnitude of this effect is increased following experience-based learning in comparison to observational prediction.
Again, we find that perseveration has an effect on reinforcement learning; specifically, increased perseveration was associated with an increase of model-based processing for all groups except model-free prediction. This effect is further increased following model-based prediction, suggesting that those types of subjects were better able to increase model-based responding after observing a model-based learner than following direct experience. We also find a slight effect for block × β¹mf.z : β¹mb.z × task following both hybrid and model-based prediction. Together these terms indicate how subjects with decreased model-free and model-based processing end up having an increased change of model-based processing following model-based-oriented observational prediction, suggesting that observational prediction has some effect on subjects who performed poorly (high choice stochasticity) in the precondition task.
Regression predicting β¹mb → β²mb

Significant predictor                                coefficient        se      t-stat    p-value
(Intercept)                                               0.5898    0.0743      7.9396     0.0000
block                                                     0.1583    0.0488      3.2405     0.0256
block × β¹mf.z                                            0.0344    0.0047      7.2751     0.0000
block × β¹mf.z × task                                    −0.0984    0.0079    −12.5226     0.0000
block × β¹mf.z × task × factor(agent)0                    0.0433    0.0086      5.0505     0.0000
block × β¹mf.z × task × factor(agent)0.5                  0.0310    0.0087      3.5573     0.0086
block × β¹mb.z                                            0.1268    0.0089     14.2926     0.0000
block × β¹mb.z × task                                    −0.0839    0.0139     −6.0320     0.0000
block × β¹ps.z                                            0.0212    0.0061      3.4819     0.0111
block × β¹ps.z × task                                     0.0584    0.0110      5.3109     0.0000
block × β¹ps.z × task × factor(agent)0                   −0.0799    0.0118     −6.7796     0.0000
block × β¹ps.z × task × factor(agent)0.5                 −0.0511    0.0120     −4.2595     0.0004
block × β¹mf.z : β¹mb.z × task                           −0.0254    0.0074     −3.4126     0.0143
block × β¹mf.z : β¹mb.z × task × factor(agent)0           0.0254    0.0079      3.2238     0.0274

In terms of model-based weight (i.e., the ratio of model-based processing to the sum of the model-free and model-based processing scores), we find results consistent with the previous analyses. Specifically, there was an overall increase of model-based weighting following both experience-based learning and model-based prediction, whereas there was a slight decrease of model-based weighting following model-free and hybrid prediction.
We see a positive block × β¹mb.z term, suggesting that increased model-based processing was associated with an increase of model-based weighting; however, this effect is reduced following observational prediction. Uniquely, we see a negative block × β¹mf.z × task interaction following observational prediction, with this effect being greatest in magnitude following model-based prediction, suggesting that precondition model-free processing was strongly associated with a decrease of model-based weighting following observational prediction but had no effect on the change of model-based weighting following experience-based learning. Lastly, we find that increased precondition perseveration is associated with an increase of model-based weighting following experience-based learning, hybrid prediction, and, even more so, model-based prediction. We do not find an effect of perseveration on the change of model-based weighting following model-free prediction.
Regression predicting ω¹ → ω²

Significant predictor                          coefficient        se      t-stat    p-value
(Intercept)                                         0.4323    0.0352     12.2791     0.0000
block                                               0.1133    0.0251      4.5082     0.0002
block × task × factor(agent)0                      −0.1400    0.0429     −3.2624     0.0240
block × task × factor(agent)0.5                    −0.1390    0.0429     −3.2370     0.0261
block × β¹mf.z × task                              −0.0776    0.0054    −14.2960     0.0000
block × β¹mf.z × task × factor(agent)0              0.0612    0.0059     10.3565     0.0000
block × β¹mf.z × task × factor(agent)0.5            0.0446    0.0060      7.4307     0.0000
block × β¹mb.z                                      0.0592    0.0052     11.4907     0.0000
block × β¹mb.z × task                              −0.0324    0.0080     −4.0218     0.0014
block × β¹ps.z                                      0.0137    0.0036      3.8023     0.0031
block × β¹ps.z × task                               0.0399    0.0058      6.8276     0.0000
block × β¹ps.z × task × factor(agent)0             −0.0528    0.0063     −8.4325     0.0000
block × β¹ps.z × task × factor(agent)0.5           −0.0388    0.0064     −6.0328     0.0000
block × β¹mf.z : β¹mb.z                            −0.0149    0.0028     −5.2233     0.0000

Examining reinforcement learning choice consistency (β¹mf + β¹mb), we see that for the experience-based and model-based prediction groups, increased model-free and increased model-based processing were both associated with a further increase of reinforcement learning choice consistency. However, we see the opposite effect of model-free processing on RL choice consistency following model-free prediction, i.e., those with decreased model-free responding in the precondition showed an increase of choice consistency following model-free prediction. Furthermore, the block × β¹mb.z × task × factor(agent)0 interaction signified that the effect of increased model-based processing on the change of RL choice consistency was significantly reduced, or almost completely gone, in the model-free prediction group, and the block × β¹mb.z × task × factor(agent)0.5 term suggests this effect was reduced following hybrid prediction compared to experience-based learning and model-based prediction.
The change in perseveration is marked by a between-block increase for participants with increased model-free or model-based processing following experience-based learning. However, we see the opposite effect of model-free processing following model-free and hybrid prediction, i.e., increased precondition model-free processing was associated with a decrease of perseveration following model-free and hybrid prediction, but an increase of perseveration following experience-based learning and model-based prediction. Furthermore, we see a strongly negative block × β¹mb.z × task term, along with further negative block × β¹mb.z × task × factor(agent)0 and block × β¹mb.z × task × factor(agent)0.5 terms. Ultimately, this suggests that for the experience-based learning group, precondition model-based processing was associated with an increase of perseveration, whereas following observational prediction (and especially model-free and hybrid prediction), precondition model-based processing was associated with a decrease of perseveration in the postcondition RL task.
Regression predicting ps¹ → ps²

Significant predictor                          coefficient        se      t-stat    p-value
(Intercept)                                         0.8583    0.0896      9.5823     0.0000
block × β¹mf.z                                      0.0190    0.0056      3.4104     0.0142
block × β¹mf.z × task × factor(agent)0             −0.0645    0.0103     −6.2873     0.0000
block × β¹mf.z × task × factor(agent)0.5           −0.0952    0.0105     −9.0337     0.0000
block × β¹mb.z                                      0.0619    0.0095      6.5351     0.0000
block × β¹mb.z × task                              −0.0897    0.0148     −6.0671     0.0000
block × β¹mb.z × task × factor(agent)0             −0.0759    0.0156     −4.8518     0.0000
block × β¹mb.z × task × factor(agent)0.5           −0.0751    0.0161     −4.6639     0.0001
block × β¹ps.z                                      0.0800    0.0106      7.5399     0.0000
block × β¹ps.z × task × factor(agent)0              0.0987    0.0182      5.4328     0.0000
block × β¹ps.z × task × factor(agent)0.5            0.1220    0.0184      6.6148     0.0000
6.4 Discussion
In this study, we examined how observational prediction, in comparison to increased task experience, affected future habitual and goal-directed tendencies. Subjects' choices were analyzed using a reinforcement learning model [116] that allows for consideration of both model-free and model-based processes [27]. With a precondition-postcondition repeated-measures regression analysis on the parameter posterior densities, we found numerous differential effects of the two forms of learning on future behavior.
Overall, we found that observational prediction of model-free and hybrid learners elicited an increase of model-free processing and a decrease in model-based weighting, compared to model-based observational prediction and experience-based learning. In those groups, we see a decrease of model-free processing following the experimental learning task and an increase of model-based weighting.
Alongside the group effects, we see various interactions on the change of parameters depending upon the behavioral tendencies displayed in the precondition task. For the experience-based learning group, we see a best-get-better effect of standardized precondition model-based processing on the change of model-based processing and on the change of model-based weighting, whereas this effect is decreased in the observational prediction groups. Surprisingly, we see a positive effect of precondition model-free processing on the change of model-based processing for the experience-based learning group, but this relationship is reversed (negative) in the observational prediction groups, especially following model-based prediction. When also including the results from the model-based weighting regression, we find that for experience-based learning, increased model-free responding was associated with increased model-based processing, but not necessarily increased model-based weighting. Following observational prediction, by contrast, those with increased model-free responding in the precondition show increased model-free responding, decreased model-based responding, and decreased model-based weighting in the postcondition. It might therefore be the case that those who display model-free tendencies in the precondition have the ability to show increased model-based processing, but for whatever reason they capture model-free tendencies (rather than model-based tendencies) following observational prediction, regardless of the observed learner's valuation strategy.
When we factor in perseveration, we see that following experience-based learning, precondition model-free and model-based processing were associated with an increase of perseveration, whereas model-based processing was associated with a decrease of perseveration following observational prediction. We also see that increased perseveration led to a further increase of perseveration in the postcondition. Given the interactions of increased perseveration on model-based weighting, we see differing profiles of behavior following experience-based learning and model-based prediction. Experience-based learning seemed to lead those with increased reinforcement learning choice consistency to show increasing model-based processing, whereas model-based prediction seemed to increase reinforcement learning choice consistency itself (i.e., increased precondition model-free processing led to a decrease of model-based processing and an increase of model-free processing following model-based prediction). Yet following model-based prediction, those who showed high perseveration in the precondition were better able to show increased model-based weighting and model-based processing. This suggests that observational prediction of a similar, well-trained other (in this case one with high model-based weighting) is able to aid those who showed increased perseverative tendencies toward increased goal-directed consideration. Previous reports have found that increases to perseveration are associated with increased fatigue [126]; thus the interaction of model-based prediction with precondition model-based processing on the change of perseveration may suggest that having a different task goal allowed these participants not to show a fatigue-related increase of perseverative behavior.
Previous studies have examined many modulations of model-free and model-based weight [83][37][61][24]. However, many of those studies used manipulations inaccessible to the general public, e.g., L-DOPA administration [138] or transcranial direct current stimulation [111]. Here we demonstrate how different types of learning can affect future model-free and model-based processing considerations. While there were many interesting interactions between the different observational groups, one common finding was that the worst performing subjects (i.e., those with both low model-free and low model-based precondition estimates) showed some form of improvement following observational prediction (either model-free or model-based) compared to experience-based learning.
We also stress the importance of perseveration for the various parameters of interest, especially how the parameter positively affected model-based processing and model-based weighting for the experience-based learning and model-based observational learning groups. On one hand, selecting a choice only on the basis of it having been previously selected is a primitive yet important heuristic for how humans generally select actions [98]. In one way it lessens our cognitive load: we don't have to consider the uniqueness of the situation; instead, our action is what we have done before. Yet as this parameter increased in the precondition, it seemed to positively affect the within-subjects measurement of the change in the model-based parameters. We reason that some subjects showed reduced model-based processing due to inexperience with the task in the precondition, and that this was coupled with increased perseveration, as we did see a positive correlation between perseveration and model-based processing in the precondition. Then, following experience-based learning and model-based observational learning, these subjects were able to show within-subject increases of model-based processing and weighting, yet following model-free observational prediction these subjects showed decreased model-based weighting. This suggests that perseveration plays a large role in how future learning will change valuation strategy.
Chapter 7
Mindfulness
7.1 Introduction
Goal-directed consideration is only one factor affecting the way in which we decide what action to perform. This form of action selection requires consideration of our current goals along with the presently relevant circumstances that may affect each potential action's likelihood of success. However, there are intrinsic costs associated with this form of control [64][71], and it is more often studied in the context of reduction (e.g., in adolescents [29], clinical populations [130][49][131], or when other stressors are present [83][93]) than enhancement.
Standing in opposition is habitual action selection. In this form of control, action selection is not primarily guided by current beliefs or goals, nor is it a matter of performing the best action given all of the relevant environmental information. Instead, it is a learned preference for performing an action in a specific context [135][77]. Often these habitualized actions were once goal-directed [94]. However, following many repetitions, the action is no longer performed because of some goal-related desire, but rather through a learned automaticity [5][104][73], permitting attention elsewhere [136] through a reduction of cognitive load [64].
In terms of factors that may affect behavioral action selection, mindfulness-based interventions have demonstrated rather unique shifts in reward processing. Previous papers have shown that this practice leads to a reduction of drug-induced cue saliency [45][44][118] and improved emotional well-being [47], along with various neural modulations in the default mode network and medial prefrontal regions [33][113][42]. These changes seem to be induced through an enhancement of self-regulation [43][82][107] and self-awareness [125][55]. In terms of reward learning, mindfulness potentially plays an important role in shaping how attention is directed [4] and emotion is regulated [65][20].
However, many questions remain concerning the effects of mindfulness on reward learning. Specific to our interests is whether mindfulness promotes goal-directed versus habitual responding. Probabilistic reward learning tasks [51][26] have been used to differentiate goal-directed and habitual action selection through a model-free/model-based reinforcement learning model [116]. These studies have found that the general population uses a combination of both model-free (habitual) and model-based (goal-directed) processes when deciding between potential actions [108]. Here, we ask if a short mindfulness intervention shifts the weighting of these processes towards goal-directed rather than habitual control in a test-retest experimental design.
In this task, goal-directed control is advantageous compared to habitual processing, yet the general population selects choices in a way that considers both modes of responding. Previous studies have found that goal-directed behavior is associated with intelligence [48][36][92], and that both processes are associated with a number of genetic factors, especially those relating to striatal (model-free) and prefrontal (model-based) dopaminergic networks [34]. Furthermore, there have been a number of studies on the interaction between model-based processing and stress [93][84] for this type of probabilistic task. Given previous studies on the effects of mindfulness and stress reduction [18][59][62], we hypothesize that mindfulness will increase model-based processing (goal-directed control).
We are unsure of how mindfulness will affect model-free (habitual) processing. Since both processes are measured concurrently, it may be the case that both systems are increased or decreased. Model-based weighting captures the balance of model-based relative to model-free processing, and, similarly, we are unsure how mindfulness will affect the consideration of model-based versus model-free valuation. A finding that mindfulness increases model-based weighting would support the idea that mindfulness shifts action selection from habitual to goal-directed, whereas, e.g., only a mindfulness-induced increase of model-based processing may not necessarily support a shift from habitual to goal-directed processing.
Unique to our task, participants were either rewarded, punished, or received no change following each decision. Ultimately, this experimental design choice played a role in how the task was interpreted by participants. The best fitting model was one that included a model-free punishment-switching parameter (i.e., a parameter that coded a tendency to switch choices following punishment, in addition to what is calculated by reinforcement learning). Although we had no strict hypotheses as to how mindfulness would affect this parameter, we find a web of interactions between model-free and model-based reinforcement learning and punishment switching on mindfulness-induced changes of behavior.
Along with examining how a short mindfulness intervention affected changes of behav-
ior compared to control, we also examine how traits affected performance measures dur-
ing the precondition task. Given previous studies, we expected intelligence (as measured
by Raven’s progressive matrices) to be associated with increased model-based weighting
[48], but wanted to further examine how it related to model-free processing. Mindfulness
was measured as both a state-like quality (to see if the intervention induced a change be-
tween groups) and trait-like quality. We used the five facet mindfulness scale [6][7] to ex-
amine mindfulness traits (acting with awareness, non-reactive, non-judgmental, observing,
describing) for associations with precondition behavioral characteristics.
7.2 Methods
In this experiment, Inna Arnaudova aided with the experimental design and data collection.
7.2.1 Participants
The study consisted of eighty-one participants (45 female, mean age = 23.16, SD = 3.97) divided into two groups: control (n = 40) and mindfulness intervention (n = 41). All subjects completed the full experimental design and no subject data was omitted from the final analysis.
The study was approved by the University of Southern California Institutional Review
Board. All participants provided written informed consent prior to the start of the experi-
ment.
Figure 7.1: Habitual versus goal-directed processing was tested in a task in which a decision in the first stage of the trial directly affected the transition properties to the second-stage state. Ultimately, trial reward was dependent upon which second-stage state was present in the trial.
Here, model-free learners are more likely to repeat a choice if it was rewarded and less likely if the choice led to a punishment. In contrast, a model-based learner considers the trial's transition, which could be either common or rare. Critically, on rare transitions, model-based learners are able to correctly reevaluate both first-stage actions given their transition properties to the second-stage states.
Figure 7.2: Throughout the block (200 trials), the acquired second-stage states had a latent value which probabilistically affected trial reward, whereas the first-stage choice influenced the transition probability to second-stage states.
7.2.2 Sequential learning task
The two-step task [26][2] was programmed in MATLAB 2017a (MathWorks) with the Psy-
chophysics Toolbox 3beta extension.
In total, participants completed two 25-minute blocks, each consisting of 200 trials. Despite utilizing the same task construction, the two blocks used different first-stage and second-stage symbols, along with unique second-stage reward probability random walks.
Each trial began with presentation of the same two locations, representing the two first-stage choices. Participants had 2 seconds to respond with the left or right buttons - v or n - to select a location. Following selection of a location, an alien appeared on screen denoting the acquired second-stage state. Following this transition, the participant was either rewarded (+1 to +3 points, represented by gold coins), punished (-1 to -3 points, represented by gold coins superimposed with a red X), or received no points (empty circle).
The trial's outcome (number of points received) was directly related to the acquired alien's current value. Each alien had a value that varied from +1 to +7 and pseudo-randomly changed every 10-20 trials, with values closer to 7 relating to increased reward probability and values closer to 1 to increased punishment probability. These changing state values were carefully programmed such that the locations had approximately the same mean and standard deviation throughout the whole block, but changed in a way such that increased model-based processing would result in increased reward (thereby incentivizing model-based valuation).
In reference to the acquired second-stage state, each first-stage choice location predominantly (p = 0.75) transitioned to the alien that the other location less commonly (p = 0.25) transitioned to. With this construction, model-free and model-based learning differ in their behavioral policy. Specifically, a model-free learner will reinforce actions directly by the trial's outcome, whereas model-based valuation considers not only trial outcome, but also which second-stage state that outcome was obtained in and how the second-stage states relate to first-stage actions. For example, following a trial with an uncommon transition but a highly rewarding outcome, a model-free learner would be more likely to select the same choice because it had just resulted in high reward. A model-based learner would be more likely to select the unselected first-stage choice, because that location is more likely to transition to the alien that gave the participant the high reward.
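To make the trial structure concrete, the sketch below simulates a single trial under the structure just described. It is only an illustration: the function name, the coding of choices and states, and in particular the mapping from an alien's 1-7 value to reward and punishment probabilities are assumptions rather than the task's actual generative code.

# One simulated trial: a first-stage choice (1 or 2) transitions to a
# second-stage alien (common with p = 0.75), and the alien's current value
# biases the outcome toward reward (high values) or punishment (low values).
simulate_trial <- function(choice, alien_values, common_p = 0.75) {
  common <- runif(1) < common_p
  state  <- if (common) choice else 3 - choice      # the non-common alien otherwise
  p_reward <- alien_values[state] / 8               # illustrative mapping only
  p_punish <- (8 - alien_values[state]) / 8 * 0.5
  u <- runif(1)
  outcome <- if (u < p_reward) {
    sample(1:3, 1)                                  # reward of +1 to +3 points
  } else if (u < p_reward + p_punish) {
    -sample(1:3, 1)                                 # punishment of -1 to -3 points
  } else {
    0                                               # no change
  }
  list(state = state,
       transition = if (common) "common" else "rare",
       outcome = outcome)
}

# e.g., one trial after choosing location 1 when the alien values are 6 and 2
simulate_trial(choice = 1, alien_values = c(6, 2))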
7.2.3 Toronto Mindfulness Scale
Mindfulness contains both state and trait-like qualities. The Toronto Mindfulness Scale (TMS) has been shown to be a reliable and valid measure of mindfulness as a state during a short period of time [69][25]. This 13-item scale asks participants to indicate the degree to which each question describes what they just experienced during the preceding audio exercise. Specifically, these questions focus on whether the participant related to the experience without judgment, but with acceptance, openness, and curiosity.
The scale contains two indexes, measuring curiosity and decentering. Together these measures were used to confirm the effect of the brief mindfulness intervention.
7.2.4 Raven
Raven’s advanced progressive matrices [95] was administered in a 20-minute session. This
test is considered a reliable test of nonverbal, analytical intelligence [19]. Specifically, partic-
ipants selected 1 of 8 potential figures to fill in a missing section from the reference matrices.
7.2.5 Self-control
Self-control was evaluated with the 13-item Brief Self-Control Measure [74][119]. Partici-
pants rated their ability to inhibit actions that might not be beneficial in the long term on a
5-point scale. Higher sum scores represent higher perceived inhibitory control.
7.2.6 Trait mindfulness
The Five Facet Mindfulness Questionnaire [7] consists of 39-items assessing five factors of
mindfulness: observe, describe, act with awareness, non-judging, and non-reactivity. In-
creases to each facet relate to an increase of mindfulness. The participants rated each ques-
tion on a 5-point Likert scale from 1 (never/very rarely true) to 5 (very often/always true).
7.2.7 Mindfulness experience
Prior mindfulness experience was assessed with a single question regarding the participant’s
history with various mindfulness-based practices [124]. Participants responded on a 6-point
scale, rating the number of times they have engaged in these practices from zero, never, to
five, once or more per week in the last six weeks or more.
7.3 Results
7.3.1 Manipulation check
The Toronto Mindfulness Scale was used to examine if the brief mindfulness intervention affected state mindfulness. This scale measures aspects of curiosity and decentering. Both measures relate to awareness of the present, but curiosity relates to being curious about the present, while decentering is doing so with distance or disidentification [69]. We found increases to both curiosity (ctrl = 9.825 ± 0.939, mind = 14.146 ± 1.321, t value = 3.271, p = 0.00159) and decentering (ctrl = 13.20 ± 0.773, mind = 15.634 ± 1.087, t value = 2.238, p = 0.028) in the mindfulness group.
7.3.2 Model fitting
We examined the effects of mindfulness on model-free and model-based valuation using a
two-stage task applied precondition and postcondition in a test-retest design. To analyze the
trial-by-trial behavioral data, we first used hierarchical Bayesian mcmc [120] to fit a hybrid
reinforcement learning model. This process creates a posterior distribution for the subject-
level reinforcement learning model parameters and group-level hierarchical parameters.
At first we were primarily interested in the parameters relating to goal-directed versus ha-
bitual decision-making (model-free and model-based choice consistency and model-based
weighting). However, after fitting numerous models, the model that fit best included not only
model-free and model-based valuation learning, but also perseveration (tendency to stay
with the previous choice regardless of trial outcome) and punishment switching. While we
expected significance for perseveration, we did not for the punishment switching parame-
ter. In our formulation, punishment switching is modeled as an increased tendency to switch
choices following punishment, on top of what is calculated by reinforcement learning. Fur-
thermore, this tendency does not respect the environment’s contingencies (i.e., the relation-
ship between transition and outcome), and thus, this switching tendency can be thought of
as model-free (punishment leads to an increased probability of switching).
We demonstrate the role of the softmax coefficients (β¹mf, β¹mb, β¹ps, β¹pun) by plotting the relationship between a trial's absolute estimated value difference (the value for choice 1 versus choice 2) and the probability that the participant explores (i.e., selects the choice to which that subject's fitted RL model assigns less value). On a given trial, as the absolute value difference increases, the chance of exploring (making a choice inconsistent with the fitted valuation learning model) decreases. However, this decrease is modulated by the softmax coefficients: as their sum (β¹mf + β¹mb + β¹ps + β¹pun) increases, the subject's choices become more consistent with the fitted model than those of a participant with a lesser sum. In another sense, those with decreased values are more stochastic in their decisions, and potentially unable to properly utilize task information to impact their decision-making process.

Figure 7.3: As the value difference between the two choices increased, the chance of exploring decreased. As inverse temperature increased, choices became more deterministic.

Figure 7.4: Precondition stay probabilities after grouping subjects by their associated precondition model-based weight.
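The sketch below illustrates this relationship for the two-choice case; all names are illustrative, and the inputs q_mf, q_mb, rep_ind, and pun_ind stand for the (choice 1 minus choice 2) differences of the model-free values, model-based values, previous-choice indicator, and punishment-switch term on a given trial.

# Two-choice softmax implied by the text: larger coefficients make the same
# value difference translate into a more deterministic (less exploratory) choice.
p_choose_1 <- function(b_mf, b_mb, b_ps, b_pun, q_mf, q_mb, rep_ind, pun_ind) {
  net <- b_mf * q_mf + b_mb * q_mb + b_ps * rep_ind + b_pun * pun_ind
  1 / (1 + exp(-net))   # probability of selecting choice 1
}

# identical value differences, but different inverse temperatures
p_choose_1(0.5, 0.5, 0, 0, q_mf = 0.4, q_mb = 0.4, rep_ind = 0, pun_ind = 0)  # ~0.60
p_choose_1(2.0, 2.0, 0, 0, q_mf = 0.4, q_mb = 0.4, rep_ind = 0, pun_ind = 0)  # ~0.83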
However, depending on the ratio of the coefficients (e.g., model-based weighting reflects β¹mb relative to β¹mf), we will see a difference in valuation strategy. To visualize these differences in strategy, we can use stay probabilities; unlike the reinforcement learning model, stay probability only considers the previous trial as influencing choice [78].
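As a sketch of this descriptive analysis (the data frame and column names are assumptions), stay probabilities can be tabulated by lagging each subject's choices and grouping on the previous trial's outcome and transition:

library(dplyr)

# 'trials' is assumed to hold one row per trial with columns:
#   subject, block, trial, choice, outcome_type (punishment/no change/reward),
#   and transition (common/rare).
stay_table <- trials %>%
  arrange(subject, block, trial) %>%
  group_by(subject, block) %>%
  mutate(stay            = as.integer(choice == lag(choice)),  # repeated first-stage choice?
         prev_outcome    = lag(outcome_type),
         prev_transition = lag(transition)) %>%
  ungroup() %>%
  filter(!is.na(stay)) %>%
  group_by(prev_outcome, prev_transition) %>%
  summarise(stay_pct = mean(stay), .groups = "drop")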
By stratifying subjects based on their model-based weighting (Figure 7.4), model-free valuation is characterized by a stay probability that is directly associated with trial outcome, whereas model-based valuation shows a transition × outcome interaction, representative of how these participants consider the relationship between the previous trial's outcome and the transitioned second-stage state. Lastly, mixed valuation shows elements of both model-free and model-based responding. For these participants, there is decreased stay probability following rare rewarded trials compared to common rewards, but this is still increased compared to common punishment trials. Similarly, rare punishment trials have increased stay probability compared to common punishment trials, but decreased compared to common rewarded trials.

Figure 7.5: As punishment switching increased, there was a decreased probability of staying with the same choice following punishment. This is often correlated with increased stay probability on no change trials. While increased punishment switching was associated with increases to stay probability following both rewarded and no change trials, increases to perseveration cause an increase to stay probability following all trial types.
Also stratifying participants on punishment switching, there is a decrease in staying following punishment as punishment switching increases. More specifically, as punishment switching increases, so does the contrast between staying following reward, no change, and punishment trials. High punishment switching is also characterized by a high stay percentage on no change trials, which differs from perseveration, in which stay percentage is increased for all trial types (Figure 7.5).
Examining the precondition parameters revealed that our subjects' choices were more consistent with model-based valuation (mean 2.2347 {1.8888, 2.5862}) than model-free valuation (mean 0.9421 {0.7852, 1.1012}). Furthermore, we found a significant negative correlation between precondition model-free and model-based processing (mean -0.3686 {-0.5585, -0.1653}). Punishment switching (mean 0.8459 {0.6978, 1.0019}, sd 0.5127 {0.3893, 0.6553}) positively correlated with both precondition model-free processing (mean 0.4753 {0.1915, 0.7256}) and perseveration (mean 0.4780 {0.2347, 0.6982}).
7.3.3 Mindfulness increases model-based choice consistency
To examine how mindfulness affected the within-subjects change for the parameters of interest, we conducted post-hoc mixed-effects repeated-measures regression [10] on the subject-level posterior distributions. Here, we use the term post-hoc to denote that the subject-level and group-level hierarchical parameters were first estimated via the Bayesian MCMC estimation process [120], producing a posterior distribution for each parameter. From there, the resulting parameter posterior distributions¹ were inputted as the raw data in a mixed-effects repeated-measures model that treated the dataset as consisting of posterior distributions (rather than single values) representing our confidence as to what the true parameter values are.

¹ Typically we utilize 500 iterations. Assuming that the Bayesian MCMC process properly explored the posterior, increasing the number of iterations only increases the resolution of the post-hoc estimation process. We find that 150 iterations is typically sufficient to extract the larger effects (t value > 4).
We did this because we were interested not only in the differences associated with a brief mindfulness intervention, but also in examining which types of subjects were affected (e.g., whether mindfulness induced change depending on model-free tendencies displayed in the precondition task). To do this, we included standardized precondition regression terms that interact with block. For example, block × β¹mf.z represents a within-subject change from precondition to postcondition (block) that is explained by standardized model-free processing estimated from the precondition. Furthermore, with this methodology we can investigate participants that varied from the model's predictions, e.g., whether a specific subject was estimated to deviate in terms of their block × β¹mf.z estimate. In comparison, a repeated-measures analysis that utilized point estimates, as opposed to posterior distribution inputs, would be unable to identify whether a specific subject's block × β¹mf.z estimate deviated from the population estimate; rather, such an analysis would only produce an overall residual for each participant.
With each regression, we checked whether the model correctly explained the variation relative to a simpler formulation. Most often, the best fitting linear model was one which included the full list of precondition z-scored softmax coefficients and the model-free × model-based processing interaction, unless otherwise noted.
We begin by examining the change of model-free processing (Table 7.1). Overall, par-
ticipants decreased model-free processing during the postcondition compared to the pre-
condition. However this decrease was reduced as precondition z-scored model-free pro-
1
Typically we utilize 500 iterations. Assuming that the Bayesian mcmc process properly explored the posterior,
then increasing the number of iterations only increases the resolution of the post-hoc estimation process. We find
that 150 iterations is typically sufficient to extract the larger effects (t value > 4).
117
Regression predicting¯
1
m f
¯
2
m f
Significant Predictor coefficients sigma tstat pvalue
(Intercept) 0.9394 0.0918 10.2331 0.0000
block ¡0.2548 0.0649 ¡3.9234 0.0011
block£¯
1
m f
.z 0.1217 0.0081 15.0127 0.0000
block£¯
1
mb
.z ¡0.0457 0.0102 ¡4.4825 0.0001
mind£block£¯
1
mb
.z ¡0.0606 0.0155 ¡3.9133 0.0011
block£¯
1
m f
.z:¯
1
mb
.z 0.0170 0.0053 3.1764 0.0177
block£¯
1
pun
.z ¡0.0601 0.0063 ¡9.5929 0.0000
mind£block£¯
1
pun
.z 0.1162 0.0113 10.2553 0.0000
mind£block£¯
1
ps
.z ¡0.0382 0.0116 ¡3.2877 0.0124
Table 7.1: Model-free processing decreased between blocks for subjects of both groups. However,
we see that following mindfulness, those with increased model-based processing showed further de-
creased model-free responding compared to those in the control condition.
Furthermore, those with increased precondition punishment switching show increased postcondition
model-free processing following mindfulness, but this is decreased following control.
cessing increased (a positive block × β¹mf.z interaction). In contrast, increased precondition model-based processing was associated with a further reduction of postcondition model-free processing (a negative block × β¹mb.z interaction), and those in the mindfulness group saw an even further decrease through this interaction (mind × block × β¹mb.z). Thus, with increased experience in the task, participants showed decreased model-free consideration, and increased precondition model-based tendencies were associated with a further reduction of model-free processing following control and, even more so, following mindfulness.
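As a reader's worked example of how the coefficients in Table 7.1 combine (assuming block and mind are 0/1 indicators, the terms add as in a standard linear model, and all other covariates sit at their mean of z = 0), consider a subject one standard deviation above the mean in precondition model-based processing (β¹mb.z = 1):

predicted change in βmf, control:      block + block × β¹mb.z = -0.2548 + (-0.0457)(1) ≈ -0.30
predicted change in βmf, mindfulness:  block + block × β¹mb.z + mind × block × β¹mb.z
                                       = -0.2548 + (-0.0457)(1) + (-0.0606)(1) ≈ -0.36

The portion of the change attributable to β¹mb.z alone (about -0.05 versus -0.11) corresponds to the spread between the two condition lines in the left panel of Figure 7.6.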
Interestingly, we also see other effects, such as a decrease in model-free processing when the participant had increased model-free punishment switching (block × β¹pun.z). Yet the opposite effect appears in the mindfulness condition, suggesting that punishment switching differentially affected the change in model-free processing depending on condition. Furthermore, we see a reduction in model-free processing in the mindfulness group as perseveration increases. We examine these findings further in the coming analyses.
For the change of model-based processing, we see a strong positive effect for precondi-
[Figure 7.6 plot panels: left, Δβmf explained by β¹mb.z versus standardized precondition model-based processing (β¹mb.z); right, Δβmf explained by β¹pun.z versus standardized precondition punishment switching (β¹pun.z); both by condition (Control, Mindfulness).]
Figure 7.6: Mindfulness increased the protective effect of model-based processing on the change of model-free processing. In contrast, control and mindfulness differed completely in how precondition punishment switching influenced the change of model-free processing.
Regression predicting β¹mb and β²mb

Significant predictor       Coefficient   sigma    t stat     p value
(Intercept)                  2.0807       0.2866    7.2597    0.0000
mind × block × β¹mf.z        0.1181       0.0136    8.6912    0.0000
block × β¹mb.z               0.2294       0.0440    5.2126    0.0000
mind × block × β¹mb.z        0.2602       0.0621    4.1923    0.0003
block × β¹mf.z × β¹mb.z     -0.0197       0.0054   -3.6273    0.0034
block × β¹ps.z               0.0289       0.0083    3.4707    0.0057
mind × block × β¹ps.z       -0.0419       0.0120   -3.4958    0.0054
block × β¹pun.z             -0.0870       0.0099   -8.8059    0.0000
mind × block × β¹pun.z       0.0569       0.0143    3.9765    0.0007
Table 7.2: Those with increased precondition model-based processing showed a further increase of model-based processing, and this effect was further increased following mindfulness. Unique to mindfulness, increased precondition model-free processing was associated with an increase in model-based processing.
tion model-based processing (Table 7.2), i.e., as precondition model-based processing increased, it was further increased in the postcondition. Yet we also see a significant positive interaction between precondition model-based processing and mindfulness, suggesting that these participants showed a larger increase following mindfulness than following control.
Uniquely, we see a positive interaction between precondition model-free processing and
[Figure 7.7 plot panels: left, Δβmb explained by β¹mf.z versus standardized precondition model-free processing (β¹mf.z); right, Δβmb explained by β¹pun.z versus standardized precondition punishment switching (β¹pun.z); both by condition (Control, Mindfulness).]
Figure 7.7: Only in the mindfulness condition do we see a change in model-based processing associated with precondition model-free processing. Decreased model-free processing is associated with a decrease in the change of model-based processing, which supports later conclusions that mindfulness affected RL choice consistency.
Precondition punishment switching was associated with a decrease of model-based processing; however, the estimate of this effect was significantly reduced for the mindfulness condition.
mindfulness. This suggests that mindfulness allowed participants with increased model-
free processing to have an increase of model-based processing during the postcondition. An
effect of precondition model-free processing on the change of model-based processing is
not found in the control condition.
We see two other interactions that differ between groups. Slight positive significance is found for perseveration on the change of model-based processing in the control group, but the opposite effect is found following mindfulness. In terms of the perseveration interactions, in previous studies we have found increased perseveration to lead to an increase of model-based processing, as was found in our control condition. We often think of these subjects as being slightly unsure of their choices during the precondition, so they show excessive reliance on what they had previously selected. But with increased task experience, they usually end up understanding the workings of the task (an increase of model-based processing). However, the interaction with mindfulness complicates this picture.
Finally, we see a decrease of model-based processing as model-free punishment switching increased, but this decrease was lessened following the mindfulness intervention. Along with the effect of precondition punishment switching on model-free processing, we see that those with increased punishment switching showed decreased reinforcement learning (both model-
[Figure 7.8 plot: Δω explained by β¹pun.z versus standardized precondition punishment switching (β¹pun.z), by condition (Control, Mindfulness).]
Figure 7.8: For the control condition, there is a slight positive association between punishment switching and the change in model-based weighting. However, there is a strong interaction with mindfulness: those who used less punishment switching in the precondition saw an increase in the change of model-based weighting following mindfulness, whereas those whose precondition strategy had increased punishment switching showed a decrease of model-based weight.
free and model-based RL) with increased task experience (control), but following mindful-
ness, these subjects show an increase of model-free processing and a lesser reduction of
model-based processing (compared to control).
Regression predicting ω¹ and ω²

Significant predictor       Coefficient   SE       t stat     p value
(Intercept)                  0.5708       0.0481   11.8722    0.0000
block × β¹mf.z              -0.0342       0.0027  -12.8949    0.0000
block × β¹mb.z               0.0443       0.0054    8.2754    0.0000
block × β¹pun.z              0.0073       0.0020    3.6860    0.0028
mind × block × β¹pun.z      -0.0212       0.0030   -7.0701    0.0000
Table 7.3: Mindfulness did not appear to interact with precondition model-free or model-based tendencies on the change of model-based weight. Instead, we found that as precondition punishment switching increased, model-based weight decreased following mindfulness, whereas there was a slight positive association following control.
In line with the previous analyses, when regressing model-based weight, we see that the
change between blocks decreased as precondition model-free processing increased and in-
creased as precondition model-based processing increased.
We found a small effect for model-free punishment switching, i.e., as punishment switch-
ing increased, model-based weight increased. But we see a strong negative interaction with mindfulness, i.e., as precondition model-free punishment switching increased, model-based weight was reduced following mindfulness. This is consistent with the previous analyses, given the strongly positive coefficient of mind × block × β¹pun.z on model-free processing, and with the analyses that follow.
Regression predicting β¹ and β²

Significant predictor       Coefficient   SE       t stat     p value
(Intercept)                  3.0099       0.2503   12.0262    0.0000
block                       -0.5245       0.1298   -4.0399    0.0007
block × β¹mf.z               0.0752       0.0151    4.9649    0.0000
mind × block × β¹mf.z        0.1514       0.0213    7.1018    0.0000
block × β¹mb.z               0.1737       0.0398    4.3647    0.0002
mind × block × β¹mb.z        0.1662       0.0560    2.9672    0.0329
block × β¹mf.z × β¹mb.z     -0.0578       0.0087   -6.6142    0.0000
block × β¹pun.z             -0.1449       0.0130  -11.1836    0.0000
mind × block × β¹pun.z       0.1828       0.0181   10.0950    0.0000
mind × block × β¹ps.z       -0.0917       0.0225   -4.0799    0.0006
Table 7.4: Overall, there was decreased RL choice consistency in the postcondition. Increased model-free and model-based processing were more resistant to this decrease of choice consistency, and following mindfulness these participants showed an even smaller decrease in choice consistency.
For the control condition, punishment switching was associated with a further decrease in RL choice consistency, whereas following mindfulness it was associated with an increase in RL choice consistency but a decrease of model-based weight.
Thus, we did not find any strong effects of mindfulness, or interactions between mindfulness and model-free and/or model-based processing, on the change of model-based weight. Instead, we see that mindfulness caused the strongest changes in the ability to align choice behavior with previous task information according to a reinforcement learning model (i.e., β¹mf + β¹mb). Here, we find positive effects for both precondition model-free and model-based processing on the change of choice consistency, which suggests that participants with increased model-free or model-based processing showed a further increase in the ability to align their choices with RL theory, with both effects being further increased in the mindful-
[Figure 7.9 plot panels: Δβrl explained by β¹mf.z versus standardized precondition model-free processing; Δβrl explained by β¹pun.z versus standardized precondition punishment switching; and Δβrl explained by β¹ps.z versus standardized precondition perseveration; all by condition (Control, Mindfulness).]
Figure 7.9: Mindfulness seems to have affected choice consistency with reinforcement
learning by causing increases based upon precondition tendencies (model-free, model-
based, and punishment switching). However, we found a negative interaction for perse-
veration following mindfulness.
ness condition.
Yet overall, we see a decrease in inverse temperature. Thus, precondition model-free and model-based processing were protective against a reduction of reinforcement learning choice consistency, and this protective effect was further increased following mindfulness. This reinforces that mindfulness did not increase model-based weighting, but rather allowed participants to avoid a behavioral decrease of choice consistency (i.e., an inability to correctly use reward information to guide choices).
There was a strong negative effect of standardized precondition model-free punishment switching on the change of reinforcement learning choice consistency. Coinciding with the previous analyses, this effect is reversed following mindfulness, resulting in a slightly positive effect of increased precondition model-free punishment switching leading to increased postcondition RL choice consistency. Specifically, given the previous analyses, we find this mindfulness-induced increase of RL inverse temperature with punishment switching to be proportionately more model-free than model-based.
We find an overall increase of model-free punishment switching between blocks. Furthermore, we see positive terms for both mind × block × β¹mf.z and mind × block × β¹mb.z, suggesting that precondition reinforcement learning processing was associated with a modulation of the overall increase in punishment switching. More specifically, we see that subjects with increased RL processing showed more punishment switching following mindfulness, while the control condition has a significant negative β¹mb interaction.
Thus, it seems that participants naturally show increased punishment-switching with
increased task experience. However, while precondition model-based processing was found
Regression predicting punMF¹ and punMF²

Significant predictor       Coefficient   SE       t stat     p value
(Intercept)                  0.9100       0.0693   13.1216    0.000
block                        0.1436       0.0432    3.3237    0.01
mind × block × β¹mf.z        0.0622       0.0109    5.6829    0.000
block × β¹mb.z              -0.0348       0.0119   -2.9104    0.038
mind × block × β¹mb.z        0.0993       0.0183    5.4126    0.000
block × β¹pun.z              0.1845       0.0076   24.2542    0.000
Table 7.5: Overall, subjects showed increased punishment switching. This may relate to a reduction of cognitive load, i.e., a behavioral strategy that is easier to perform than model-free or model-based weighting, but with reduced efficiency (i.e., lower odds of being rewarded).
In the control condition, increased model-based processing was associated with a reduction of the increase in punishment switching. However, following mindfulness, both precondition model-free and model-based processing were associated with a further increase in punishment switching.
to be protective against this increase following control, increases to either precondition model-free or model-based processing were met with further increases to punishment switching following mindfulness. The significance of these terms, along with the effects of mindfulness and punishment switching on reinforcement learning choice consistency, demonstrates the unique changes concerning mindfulness and future valuation behavior.
In terms of perseveration, we see that for the control condition, precondition model-free processing was positively associated with an increased tendency to stay with the same choice in the postcondition, whereas the opposite is found following mindfulness, i.e., as
[Figure 7.10 plot: Δβpun explained by β¹mb.z versus standardized precondition model-based processing (β¹mb.z), by condition (Control, Mindfulness).]
Figure 7.10: Mindfulness interacted with precondition model-based processing, resulting in a significant positive association with the change in punishment switching.
Regression predicting ps¹ and ps²

Significant predictor       Coefficient   SE       t stat     p value
(Intercept)                  1.1386       0.0942   12.0837    0
block × β¹mf.z               0.0338       0.0042    8.1425    0
mind × block × β¹mf.z       -0.0615       0.0063   -9.7514    0
mind × block × β¹mb.z       -0.0334       0.0095   -3.5237    0.0053
block × β¹ps.z               0.1381       0.0078   17.7697    0
Table 7.6: While the previous analyses suggest that mindfulness is associated with increased RL choice consistency and punishment switching for participants with increased precondition model-free or model-based tendencies, we also find that such participants show decreased perseveration following mindfulness.
[Figure 7.11 plot: Δβps explained by β¹mf.z versus standardized precondition model-free processing (β¹mf.z), by condition (Control, Mindfulness).]
Figure 7.11: Effect of the mindfulness interaction with precondition model-free processing on the change of perseveration.
precondition model-free processing increased, within-subject postcondition perseveration decreased. We also see a slight significant negative effect of mind × block × β¹mb, signifying that as precondition model-based processing increased, postcondition perseveration decreased following mindfulness. These results suggest that the previously identified mindfulness × precondition model-free and model-based processing interactions on the increase of punishment switching may be further characterized by decreases to perseveration compared to control.
7.3.4 IQ, mindfulness, and choice consistency
Along with the effects of the mindfulness intervention on the change of valuation processing, we also sought to explain the variance in the precondition parameter fits. To do this, we obtained
[Figure 7.12 plot panels: left, β¹mb explained by IQ.z versus IQ.z; right, β¹mb explained by FAA.z versus FAA.z, by gender (Female, Male).]
Figure 7.12: Significant effects on precondition model-based processing estimates.
measures of intelligence (Raven's Progressive Matrices), self-control (13-question bSCM), age, gender, mindfulness traits (FFMQ [7]), and mindfulness experience (5-point Likert scale).
Regression predicting β¹mf

Significant predictor       Coefficient   SE       t stat     p value
(Intercept)                  1.0033       0.0747   13.4246    0.0000
ravenScore¹                 -0.2317       0.0756   -3.0635    0.0044
ravenScore²                 -0.0931       0.0259   -3.5980    0.0007
Table 7.7: Intelligence was measured to have both negative first-order and second-order effects on model-free processing, suggesting that reductions in intelligence were positively associated with increases in precondition model-free processing, but that those with extremely decreased intelligence showed less model-free consideration.
Investigating the variance in precondition model-free processing, the best-fitting model was one that included both the first- and second-order effects of IQ, both being significantly negative. Essentially, this suggests that as IQ decreased, model-free processing increased, yet as IQ was further from the mean, model-free processing decreased.
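A minimal sketch of this kind of fit (using point estimates for simplicity, whereas the analysis above carries forward the full posteriors; the data frame and column names subjects, beta_mf1, and IQ_z are hypothetical):

# Illustrative sketch only: first- and second-order (quadratic) effects of
# standardized IQ on precondition model-free processing (cf. Table 7.7).
fit_iq <- lm(beta_mf1 ~ IQ_z + I(IQ_z^2), data = subjects)
summary(fit_iq)   # both the linear and the quadratic coefficients would be negative here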
Regression predicting β¹mb

Significant predictor       Coefficient   SE       t stat     p value
(Intercept)                  2.2841       0.1857   12.3018    0
ravenScore                   0.7479       0.1763    4.2421    1 × 10⁻⁴
FAA:gender                   0.6907       0.0982    7.0332    0
Table 7.8: Intelligence was found to positively correlate with precondition model-based processing. Furthermore, in males, acting with awareness was positively associated with increased model-based processing.
In terms of model-based processing, we find a previously documented [48] positive association with IQ. We also find an FAA:gender interaction, i.e., for males, the standardized acting-with-awareness facet of mindfulness was strongly positively associated with model-based processing.
For model-free punishment switching, we see a negative relationship with the second-
order effect of IQ. Furthermore, we found a negative effect of mindfulness experience, sug-
gesting that as mindfulness experience increased, the participant was better able to withhold
the tendency to switch following punishment.
Regression predicting β¹pun

Significant predictor       Coefficient   SE       t stat     p value
(Intercept)                  0.8829       0.0514   17.1782    0.0000
ravenScore²                 -0.0331       0.0075   -4.4156    0.0000
mind.exp                    -0.0965       0.0397   -2.4290    0.0387
For perseveration, we were unable to find any specific co-measures that explained the
variance in the precondition fit better than an intercept model.
7.4 Discussion
In the current experiment, we examined how a brief mindfulness intervention affected valuation processing in a task that differentiates between habitual and goal-directed action selection. To estimate these processes, we used a hybrid (model-free/model-based) reinforcement learning model [26][51][2][122] and Bayesian hierarchical parameter estimation [120]. Ultimately, we did not find a specific main effect of mindfulness on the within-subject change between the control and mindfulness groups; rather, the variance in the changes of behavior was best explained by various interactions with aspects of the behavior expressed during the precondition task.
For the most part, we see that as experience in the task increased, subjects became more adjusted and settled into the valuation strategies they displayed in the precondition. For example, we see both a positive effect of precondition model-free processing on the within-subject change of model-free processing and a positive effect of precondition model-based processing on the change of model-based processing. Rather interestingly, these effects are further increased following mindfulness, suggesting that mindfulness allowed those subjects to further home in on their natural internal valuation strategy (i.e., an intensification of choice consistency with the choice strategy utilized during the precondition). Given previous studies that have examined the variability in performance of reinforcement learning tasks [100][41], we find that mindfulness specifically affected participants that showed elevated levels of some form of reinforcement learning processing; those with both low model-free and low model-based processing did not see an increase of RL choice consistency with the mindfulness intervention.
Yet mindfulness was shown to have other interactions affecting both model-free and model-based processing: specifically, a negative relationship between precondition model-based processing and the change of model-free processing, and a positive relationship between precondition model-free processing and the change of model-based processing. These changes of valuation processing were not seen in the control group, and they suggest that mindfulness did more than simply amplify behavioral tendencies found in the precondition.
Unique to our task is the element of both reward and punishment. The best-fitting model was one that also included a tendency to switch choices following punishment, regardless of the transition. Between blocks, we found that participants increased in this tendency. Given previous accounts of cognitive load reduction [64], we believe that one way in which subjects reduced the complexity of the task was by demonstrating an aversive reaction to punishment (increased switching). It is a simple behavioral heuristic that is relevant most of the time (e.g., following common trials). Furthermore, given the larger reduction of reinforcement learning choice consistency in the control group, it seems that participants generally reduced cognitive load via a behavioral strategy that depended less on reinforcement learning (the tracking of value), showing instead an increased tendency to switch choices following punishment.
Yet there is more to this increase of punishment switching between blocks. Specifically, in the control condition model-based processing was protective against this increase, whereas, following mindfulness, model-based processing (and model-free processing) was associated with a further increase of punishment switching. In a rather complex web of interactions, it seems that mindfulness helped participants that acted deterministically in the precondition (high β¹mf and/or high β¹mb) resist a behavioral decrease of choice consistency, but that this was met with increased irrationality (switching) following punishment. Uniquely,
mindfulness also helped participants who were measured to have high punishment switching in the precondition to show increased model-free reinforcement learning choice consistency, whereas following control those participants saw a reduction of reinforcement learning consideration. As a last piece of this puzzle, we also found that increased mindfulness experience was significantly associated with a reduction of model-free punishment switching in the precondition.
We note some limitations of the present study. With the current design, subjects showed rather high levels of model-based processing (2.08 on average, compared to values of roughly 0.75 to 1.20 in most other papers). Given the current task's design (i.e., no second-stage choice), the task may be too simple to tax some participants, and they show nearly complete model-based dominance of choice strategy (model-based weight estimates around 0.80). This could cause problems with the ability to fully estimate a change in model-based weight, because those subjects cannot improve in terms of the ratio of model-based to model-free processing. One way to amend this would be to make the task slightly more complex while allowing for more potential ways of responding to previous task information.
Bibliography
[1] Part iii - introduction. In J. K. Kruschke, editor, Doing Bayesian Data Analysis (Second
Edition), page 417. Academic Press, Boston, second edition edition, 2015.
[2] T. Akam, R. Costa, and P . Dayan. Simple plans or sophisticated habits? state, transition
and learning interactions in the two-step task. PLoS Computational Biology, 11(12):1
– 25, 12 2015.
[3] T. Akam, I. Rodrigues-Vaz, X. Zhang, M. Pereira, R. Oliveira, P . Dayan, and R. M.
Costa. Single-trial inhibition of anterior cingulate disrupts model-based reinforce-
ment learning in a two-step decision task. bioRxiv, 2017.
[4] B. A. Anderson. The attention habit: How reward learning shapes attentional selection.
Annals of the New York Academy of Sciences, 1369(1):24 – 39.
[5] F . G. Ashby, B. O. Turner, and J. C. Horvitz. Cortical and basal ganglia contributions to
habit learning and automaticity. Trends in Cognitive Sciences, 14(5):208 – 215, 2010.
[6] R. A. Baer, G. T. Smith, J. Hopkins, J. Krietemeyer, and L. Toney. Using self-report as-
sessment methods to explore facets of mindfulness. Assessment, 13(1):27 – 45, 2006.
Pmid: 16443717.
[7] R. A. Baer, G. T. Smith, E. Lykins, D. Button, J. Krietemeyer, S. Sauer, E. Walsh, D. Dug-
gan, and J. M. G. Williams. Construct validity of the five facet mindfulness question-
naire in meditating and nonmeditating samples. Assessment, 15(3):329 – 342, 2008.
Pmid: 18310597.
[8] B. W. Balleine and A. Dickinson. Goal-directed instrumental action: Contingency and
incentive learning and their cortical substrates. Neuropharmacology, 37(4):407 – 419,
1998.
[9] B. W. Balleine and J. P . O’Doherty. Human and rodent homologies in action control:
Corticostriatal determinants of goal-directed and habitual action. Neuropsychophar-
macology, 35(1):48 – 69, 2010.
[10] D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting linear mixed-effects models
using lme4. Journal of Statistical Software, 67(1):1 – 48, 2015.
[11] A. Bechara, H. Damasio, D. Tranel, and A. R. Damasio. Deciding advantageously before
knowing the advantageous strategy. Science, 275(5304):1293 – 1295, 1997.
[12] G. Ben-Shakhar and L. Sheffer. The relationship between the ability to divide atten-
tion and standard measures of general cognitive abilities. Intelligence, 29(4):293 – 306,
2001.
[13] U. Boehm, M. Marsman, D. Matzke, and E.-J. Wagenmakers. On the importance of
avoiding shortcuts in applying cognitive models to hierarchical data. Behavior Re-
search Methods, 50(4):1614 – 1631, August 2018.
[14] E. D. Boorman, T. E. Behrens, and M. F . Rushworth. Counterfactual choice and learning
in a neural network centered on human lateral frontopolar cortex. PLOS Biology, 9(6):1
– 13, 06 2011.
[15] R. Breen. Beliefs, rational choice and bayesian learning. Rationality and Society,
11(4):463 – 479, 1999.
[16] C. J. Burke, P . N. Tobler, M. Baddeley, and W. Schultz. Neural mechanisms of observa-
tional learning. Proceedings of the National Academy of Sciences of the United States of
America, 107(32):14431 – 14436, 2010.
[17] N. Camille, G. Coricelli, J. Sallet, P . Pradat-Diehl, J.-R. Duhamel, and A. Sirigu.
The involvement of the orbitofrontal cortex in the experience of regret. Science,
304(5674):1167 – 1170, 2004.
[18] J. Carmody and R. A. Baer. Relationships between mindfulness practice and levels of
mindfulness, medical and psychological symptoms and well-being in a mindfulness-
based stress reduction program. Journal of Behavioral Medicine, 31(1):23 – 33, Febru-
ary 2008.
[19] P . A. Carpenter, M. A. Just, and P . Shell. What one intelligence test measures: A theo-
retical account of the processing in the raven progressive matrices test. Psychological
review, 97(3):404, 1990.
[20] A. Chiesa, A. Serretti, and J. C. Jakobsen. Mindfulness: Top-down or bottom-up emotion regulation strategy? Clinical Psychology Review, 33(1):82 – 96, 2013.
[21] L. H. Corbit. Understanding the balance between goal-directed and habitual behav-
ioral control. Current Opinion in Behavioral Sciences, 20:161 – 168, 2018. Habits and
Skills.
[22] G. Coricelli, R. J. Dolan, and A. Sirigu. Brain, emotion and decision making: The
paradigmatic example of regret. Trends in Cognitive Sciences, 11(6):258 – 265, 2007.
[23] E. S. Cross, D. J. M. Kraemer, A. F . de C. Hamilton, W. M. Kelley, and S. T. Grafton.
Sensitivity of the action observation network to physical and observational learning.
Cerebral Cortex, 19(2):315 – 326, 2009.
[24] F . Cushman and A. Morris. Habitual control of goal selection in humans. Proceedings
of the National Academy of Sciences, 112(45):13817 – 13822, 2015.
[25] K. M. Davis, M. A. Lau, and D. R. Cairns. Development and preliminary validation of
a trait version of the toronto mindfulness scale. Journal of Cognitive Psychotherapy,
23(3):185 – 197, 2009. Copyright - Copyright Springer Publishing Company 2009; Last
updated - 2012-03-19.
[26] N. D. Daw, S. J. Gershman, B. Seymour, P . Dayan, and R. J. Dolan. Model-based influ-
ences on humans’ choices and striatal prediction errors. Neuron, 69(6):1204 – 1215,
2011.
[27] N. D. Daw and J. P . O’Doherty. Multiple Systems for Value Learning, pages 393 – 410.
Elsevier Inc., "9" 2013.
[28] G.-J. de Bruijn, S. P . J. Kremers, A. Singh, B. van den Putte, and W. van Mechelen. Adult
active transportation: Adding habit strength to the theory of planned behavior. Amer-
ican Journal of Preventive Medicine, 36(3):189 – 194, 2009.
[29] J. H. Decker, A. R. Otto, N. D. Daw, and C. A. Hartley. From creatures of habit to goal-
directed learners: Tracking the developmental emergence of model-based reinforce-
ment learning. Psychological Science, 27(6):848 – 858, 2016. Pmid: 27084852.
[30] P . Dellaportas, J. J. Forster, and I. Ntzoufras. On bayesian model and variable selection
using mcmc. Statistics and Computing, 12(1):27 – 36, January 2002.
[31] A. Dezfouli and B. W. Balleine. Actions, action sequences and habits: Evidence that
goal-directed and habitual action control are hierarchically organized. PLoS Compu-
tational Biology, 9(12):1 – 14, 12 2013.
[32] R. J. Dolan and P . Dayan. Goals and habits in the brain. Neuron, 80(2):312 – 325, 2013.
[33] A. Doll, B. K. Hölzel, C. C. Boucard, A. M. Wohlschläger, and C. Sorg. Mindfulness is
associated with intrinsic functional connectivity between default mode and salience
networks. Frontiers in Human Neuroscience, 9:461, 2015.
[34] B. B. Doll, K. G. Bath, N. D. Daw, and M. J. Frank. Variability in dopamine genes disso-
ciates model-based and model-free reinforcement learning. Journal of Neuroscience,
36(4):1211 – 1222, 2016.
[35] N. Doñamayor, D. Strelchuk, K. Baek, P . Banca, and V . Voon. The involuntary nature of
binge drinking: Goal directedness and awareness of intention. Addict Biol, 23(1):515 –
526, 2018.
[36] J. Duncan, H. Emslie, P . Williams, R. Johnson, and C. Freer. Intelligence and the frontal
lobe: The organization of goal-directed behavior. Cognitive Psychology, 30(3):257 –
303, 1996.
[37] M. Economides, M. Guitart-Masip, Z. Kurth-Nelson, and R. J. Dolan. Arbitration be-
tween controlled and impulsive choices. NeuroImage, 109:206 – 216, 2015.
[38] M. Economides, Z. Kurth-Nelson, A. Lübbert, M. Guitart-Masip, and R. J. Dolan.
Model-based reasoning in humans becomes automatic with training. PLoS Compu-
tational Biology, 11(9):1 – 19, 09 2015.
[39] C. K. Enders. The impact of nonnormality on full information maximum-likelihood
estimation for structural equation models with missing data. Psychological methods,
6(4):352, 2001.
[40] B. Eppinger, M. Walter, H. Heekeren, and S.-C. Li. Of goals and habits: Age-related and
individual differences in goal-directed decision-making. Frontiers in Neuroscience,
7:253, 2013.
[41] I. Erev and G. Barron. On adaptation, maximization, and reinforcement learning
among cognitive strategies. Psychological review, 112(4):912, 2005.
[42] G. Falcone and M. Jerram. Brain activity in mindfulness depends on experience: A
meta-analysis of fmri studies. Mindfulness, February 2018.
[43] M. Friese and W. Hofmann. State mindfulness, self-regulation, and emotional experi-
ence in everyday life. Motivation Science, 2(1):1, 2016.
[44] E. L. Garland. Restructuring reward processing with mindfulness-oriented recovery
enhancement: Novel therapeutic mechanisms to remediate hedonic dysregulation in
addiction, stress, and pain. Annals of the New York Academy of Sciences, 1373(1):25 –
37, 2016.
[45] E. L. Garland and M. O. Howard. Opioid attentional bias and cue-elicited craving pre-
dict future risk of prescription opioid misuse among chronic pain patients. Drug and
Alcohol Dependence, 144:283 – 287, 2014.
[46] S. J. Gershman. Empirical priors for reinforcement learning models. Journal of Math-
ematical Psychology, 71:1 – 6, 2016.
[47] N. Geschwind, F . Peeters, M. Drukker, J. van Os, and M. Wichers. Mindfulness training
increases momentary positive emotions and reward experience in adults vulnerable
to depression: A randomized controlled trial. Journal of consulting and clinical psy-
chology, 79(5):618, 2011.
[48] C. M. Gillan, M. Kosinski, R. Whelan, E. A. Phelps, and N. D. Daw. Characterizing
a psychiatric symptom dimension related to deficits in goal-directed control. eLife,
5:e11305, March 2016.
[49] C. M. Gillan, S. Morein-Zamir, G. P . Urcelay, A. Sule, V . Voon, A. M. Apergis-Schoute,
N. A. Fineberg, B. J. Sahakian, and T. W. Robbins. Enhanced avoidance habits
in obsessive-compulsive disorder. Biological Psychiatry, 75(8):631 – 638, 2014.
Obsessive-Compulsive Disorder and the Connectome.
[50] C. M. Gillan, A. R. Otto, E. A. Phelps, and N. D. Daw. Model-based learning protects
against forming habits. Cognitive, Affective, and Behavioral Neuroscience, 15(3):523 –
536, September 2015.
[51] J. Gläscher, N. Daw, P . Dayan, and J. P . O’Doherty. States versus rewards: Dissocia-
ble neural prediction error signals underlying model-based and model-free reinforce-
ment learning. Neuron, 66(4):585 – 595, 2010.
[52] P . W. Glimcher. Understanding dopamine and reinforcement learning: The dopamine
reward prediction error hypothesis. Proceedings of the National Academy of Sciences,
108(Supplement - 3):15647 – 15654, 2011.
[53] R. A. Gotink, R. Meijboom, M. W. Vernooij, M. Smits, and M. G. M. Hunink. 8-week mindfulness based stress reduction induces brain changes similar to traditional long-term meditation practice – a systematic review. Brain and Cognition, 108:32 – 41, 2016.
[54] P . J. Green. Reversible jump markov chain monte carlo computation and bayesian
model determination. Biometrika, 82(4):711 – 732, 1995.
[55] J. Gu, C. Strauss, R. Bond, and K. Cavanagh. How do mindfulness-based cognitive ther-
apy and mindfulness-based stress reduction improve mental health and wellbeing? a
systematic review and meta-analysis of mediation studies. Clinical Psychology Review,
37:1 – 12, 2015.
[56] X. Gu, U. Kirk, T. Lohrenz, and P . Montague. Cognitive strategies regulate fictive, but
not reward prediction error signals in a sequential investment task. 35, 08 2014.
[57] A. Horstmann, A. Dietrich, D. Mathar, M. Pössel, A. Villringer, and J. Neumann. Slave
to habit? obesity is associated with decreased behavioural sensitivity to reward deval-
uation. Appetite, 87:175 – 183, 2015.
[58] Q. J. M. Huys, R. Cools, M. Gölzer, E. Friedel, A. Heinz, R. J. Dolan, and P . Dayan. Disen-
tangling the roles of approach, activation and valence in instrumental and pavlovian
responding. PLoS Computational Biology, 7(4):1 – 14, 04 2011.
[59] C. G. Jensen, S. Vangkilde, V. Frokjaer, and S. G. Hasselbalch. Mindfulness training affects attention – or is it attentional effort? Journal of Experimental Psychology: General, 141(1):106, 2012.
[60] M. Keramati, A. Dezfouli, and P . Piray. Speed/accuracy trade-off between the habitual
and the goal-directed processes. PLoS Computational Biology, 7(5):1 – 21, 05 2011.
[61] M. Keramati, P . Smittenaar, R. J. Dolan, and P . Dayan. Adaptive integration of habits
into depth-limited planning defines a habitual-goal–directed spectrum. Proceedings
of the National Academy of Sciences, 113(45):12868 – 12873, 2016.
[62] B. Khoury, M. Sharma, S. E. Rush, and C. Fournier. Mindfulness-based stress reduction
for healthy individuals: A meta-analysis. Journal of Psychosomatic Research, 78(6):519
– 528, 2015.
[63] A. Klein and H. Moosbrugger. Maximum likelihood estimation of latent interaction
effects with the lms method. Psychometrika, 65(4):457 – 474, December 2000.
[64] W. Kool, F . A. Cushman, and S. J. Gershman. When does model-based control pay off?
PLoS Computational Biology, 12(8):1 – 34, 08 2016.
[65] S. L. Koole. The psychology of emotion regulation: An integrative review. Cognition
and Emotion, 23(1):4 – 41, 2009.
[66] J. K. Kruschke. Bayesian estimation supersedes the t test. Journal of Experimental
Psychology: General, 142(2):573, 2013.
[67] J. K. Kruschke and T. M. Liddell. The bayesian new statistics: Hypothesis testing, esti-
mation, meta-analysis, and power analysis from a bayesian perspective. Psychonomic
Bulletin and Review, 25(1):178 – 206, February 2018.
[68] M. Lansman, S. E. Poltrock, and E. Hunt. Individual differences in the ability to focus
and divide attention. Intelligence, 7(3):299 – 312, 1983.
[69] M. Lau, S. R. Bishop, Z. V . Segal, T. Buis, N. D. Anderson, L. Carlson, S. Shapiro, J. Car-
mody, S. Abbey, and G. Devins. The toronto mindfulness scale: Development and
validation. Journal of Clinical Psychology, 62(12):1445 – 1467, 2006.
[70] J. J. Lee and M. Keramati. Flexibility to contingency changes distinguishes habitual
and goal-directed strategies in humans. bioRxiv, 2017.
[71] S. W. Lee, S. Shimojo, and J. P . O’Doherty. Neural computations underlying arbitration
between model-based and model-free learning. Neuron, 81(3):687 – 699, 2014.
[72] G. Lefebvre, M. Lebreton, F . Meyniel, S. Bourgeois-Gironde, and S. Palminteri. Opti-
mistic reinforcement learning: Computational and neural bases. bioRxiv, 2016.
[73] M. Liljeholm and J. P . O’Doherty. Contributions of the striatum to learning, motivation,
and performance: An associative account. Trends in Cognitive Sciences, 16(9):467 –
475, 2012.
[74] P . W. Maloney, M. J. Grawitch, and L. K. Barber. The multi-factor structure of the brief
self-control scale: Discriminant validity of restraint and impulsivity. Journal of Re-
search in Personality, 46(1):111 – 115, 2012.
[75] K. Matheson, J. G. Holmes, and C. M. Kristiansen. Observational goals and the in-
tegration of trait perceptions and behavior: Behavioral prediction versus impression
formation. Journal of Experimental Social Psychology, 27(2):138 – 160, 1991.
[76] K. Miller, A. Shenhav, and E. Ludvig. Habits without values. bioRxiv, 2018.
[77] K. J. Miller, M. M. Botvinick, and C. D. Brody. Value representations in orbitofrontal
cortex drive learning, not choice. bioRxiv, 2018.
[78] K. J. Miller, C. D. Brody, and M. M. Botvinick. Identifying model-based and model-free
patterns in behavior on multi-step tasks. bioRxiv, 2016.
[79] Y. Niv. Reinforcement learning in the brain. Journal of Mathematical Psychology,
53(3):139 – 154, 2009. Special Issue: Dynamic Decision Making.
[80] E. Obst, D. J. Schad, Q. J. M. Huys, M. Sebold, S. Nebe, C. Sommer, M. N. Smolka, and
U. S. Zimmermann. Drunk decisions: Alcohol shifts choice from habitual towards
goal-directed control in adolescent intermediate-risk drinkers. Journal of Psychophar-
macology, 32(8):855 – 866, 2018. Pmid: 29764270.
[81] I. Olkin and T. A. Trikalinos. Constructions for a bivariate beta distribution. Statistics
& Probability Letters, 96:54 – 60, 2015.
[82] B. D. Ostafin, M. D. Robinson, and B. P . Meier. Handbook of Mindfulness and Self-
regulation. Springer, 2015.
[83] A. R. Otto, S. J. Gershman, A. B. Markman, and N. D. Daw. The curse of planning:
Dissecting multiple reinforcement-learning systems by taxing the central executive.
Psychological Science, 24(5):751 – 761, 2013. Pmid: 23558545.
[84] A. R. Otto, C. M. Raio, A. Chiang, E. A. Phelps, and N. D. Daw. Working-memory capac-
ity protects model-based learning from stress. Proceedings of the National Academy of
Sciences, 110(52):20941 – 20946, 2013.
[85] A. R. Otto, A. Skatova, S. Madlon-Kay, and N. D. Daw. Cognitive control predicts use
of model-based reinforcement learning. Journal of Cognitive Neuroscience, 27(2):319
– 333, 2015. Pmid: 25170791.
[86] S. Palminteri, M. Khamassi, M. Joffily, and G. Coricelli. Contextual modulation of value
signals in reward and punishment learning. Nature communications, 6:8096, 2015.
[87] S. Palminteri, E. J. Kilford, G. Coricelli, and S.-J. Blakemore. The computational devel-
opment of reinforcement learning during adolescence. PLoS Computational Biology,
12(6):1 – 25, 06 2016.
[88] S. Palminteri, G. Lefebvre, E. J. Kilford, and S.-J. Blakemore. Confirmation bias in
human reinforcement learning: Evidence from counterfactual feedback processing.
PLoS Computational Biology, 13(8):1 – 22, 08 2017.
[89] H. Park, D. Lee, and J. Chey. Stress enhances model-free reinforcement learning only
after negative outcome. PLOS ONE, 12(7):1 – 12, 07 2017.
[90] E. Payzan-LeNestour and P . Bossaerts. Risk, unexpected uncertainty, and estimation
uncertainty: Bayesian learning in unstable settings. PLoS Computational Biology,
7(1):1 – 14, 01 2011.
[91] J. Piironen and A. Vehtari. Comparison of bayesian predictive methods for model se-
lection. Statistics and Computing, 27(3):711 – 735, May 2017.
[92] T. C. S. Potter, N. V . Bryce, and C. A. Hartley. Cognitive components underpinning the
development of model-based learning. Developmental Cognitive Neuroscience, 25:272
– 280, 2017. Sensitive periods across development.
[93] C. Radenbach, A. M. F . Reiter, V . Engert, Z. Sjoerds, A. Villringer, H.-J. Heinze, L. De-
serno, and F . Schlagenhauf. The interaction of acute and chronic stress impairs model-
based behavioral control. Psychoneuroendocrinology, 53:268 – 280, 2015.
[94] C. H. Rankin, T. Abrams, R. J. Barry, S. Bhatnagar, D. F . Clayton, J. Colombo, G. Coppola,
M. A. Geyer, D. L. Glanzman, S. Marsland, F . K. McSweeney, D. A. Wilson, C.-F . Wu,
and R. F . Thompson. Habituation revisited: An updated and revised description of
the behavioral characteristics of habituation. Neurobiology of Learning and Memory,
92(2):135 – 138, 2009. Special Issue: Neurobiology of Habituation.
[95] J. Raven et al. Raven progressive matrices. In Handbook of nonverbal assessment, pages
223 – 237. Springer, 2003.
[96] T. W. Robbins, C. M. Gillan, D. G. Smith, S. de Wit, and K. D. Ersche. Neurocognitive
endophenotypes of impulsivity and compulsivity: Towards dimensional psychiatry.
Trends in Cognitive Sciences, 16(1):81 – 91, 2012. Special Issue: Cognition in Neuropsy-
chiatric Disorders.
[97] M. F . S. Rushworth, R. B. Mars, and C. Summerfield. General mechanisms for making
decisions? Current Opinion in Neurobiology, 19(1):75 – 83, 2009. Cognitive neuro-
science.
[98] J. Sandson and M. L. Albert. Varieties of perseveration. Neuropsychologia, 22(6):715
– 732, 1984. A special issue of Neuropsychologia dedicated to the memory of Henry
Hecaen.
[99] D. J. Schad, E. Jünger, M. Sebold, M. Garbusow, N. Bernhardt, A.-H. Javadi, U. S. Zim-
mermann, M. N. Smolka, A. Heinz, M. A. Rapp, and Q. J. M. Huys. Processing speed en-
hances model-based over model-free reinforcement learning in the presence of high
working memory functioning. Frontiers in Psychology, 5:1450, 2014.
[100] T. Schönberg, N. D. Daw, D. Joel, and J. P . O’Doherty. Reinforcement learning signals in
the human striatum distinguish learners from nonlearners during reward-based deci-
sion making. Journal of Neuroscience, 27(47):12860 – 12867, 2007.
[101] W. Schultz. Updating dopamine reward signals. Current Opinion in Neurobiology,
23(2):229 – 238, 2013. Macrocircuits.
[102] M. Sebold, S. Nebe, M. Garbusow, M. Guggenmos, D. J. Schad, A. Beck, S. Kuitunen-
Paul, C. Sommer, R. Frank, P . Neu, U. S. Zimmermann, M. A. Rapp, M. N. Smolka,
Q. J. M. Huys, F . Schlagenhauf, and A. Heinz. When habits are dangerous: Alcohol
expectancies and habitual decision making predict relapse in alcohol dependence.
Biological Psychiatry, 82(11):847 – 856, 2017. Learning Theory, Neuroplasticity, and
Addiction.
[103] M. Sebold, D. J. Schad, S. Nebe, M. Garbusow, E. Jünger, N. B. Kroemer, N. Kathmann,
U. S. Zimmermann, M. N. Smolka, M. A. Rapp, A. Heinz, and Q. J. M. Huys. Don’t think,
just feel the music: Individuals with strong pavlovian-to-instrumental transfer effects
rely less on model-based reinforcement learning. Journal of Cognitive Neuroscience,
28(7):985 – 995, 2016. Pmid: 26942321.
[104] C. Seger and B. Spiering. A critical review of habit learning and the basal ganglia. Fron-
tiers in Systems Neuroscience, 5:66, 2011.
[105] I. Selbing, B. Lindström, and A. Olsson. Demonstrator skill modulates observational
aversive learning. Cognition, 133(1):128 – 139, 2014.
[106] M. E. Sharp, K. Foerde, N. D. Daw, and D. Shohamy. Dopamine selectively remediates 'model-based' reward learning: A computational approach. Brain, 139(2):355 – 364, 2016.
[107] M. M. Short, D. Mazmanian, L. J. Ozen, and M. Bédard. Four days of mindfulness med-
itation training for graduate students: A pilot study examining effects on mindfulness,
self-regulation, and executive function. The Journal of Contemplative Inquiry, 2(1),
2015.
[108] A. Skatova, P . Chan, and N. Daw. Extraversion differentiates between model-based
and model-free strategies in a reinforcement learning task. Frontiers in Human Neu-
roscience, 7:525, 2013.
[109] V . Skvortsova, S. Palminteri, and M. Pessiglione. Learning to minimize efforts versus
maximizing rewards: Computational principles and neural correlates. Journal of Neu-
roscience, 34(47):15621 – 15630, 2014.
[110] A. F . M. Smith and G. O. Roberts. Bayesian computation via the gibbs sampler and
related markov chain monte carlo methods. Journal of the Royal Statistical Society.
Series B (Methodological), pages 3 – 23, 1993.
[111] P . Smittenaar, G. Prichard, T. H. B. FitzGerald, J. Diedrichsen, and R. J. Dolan. Tran-
scranial direct current stimulation of right dorsolateral prefrontal cortex does not af-
fect model-based or model-free reinforcement learning in humans. PLOS ONE, 9(1):1
– 8, 01 2014.
[112] A. P . Steiner and A. D. Redish. Behavioral and neurophysiological correlates of regret in
rat decision-making on a neuroeconomic task. Nature neuroscience, 17(7):995, 2014.
[113] I.-W. Su, F .-W. Wu, K.-C. Liang, K.-Y. Cheng, S.-T. Hsieh, W.-Z. Sun, and T.-L. Chou.
Pain perception can be modulated by mindfulness training: A resting-state fmri study.
Frontiers in Human Neuroscience, 10:570, 2016.
[114] J. W. Suchow, D. D. Bourgin, and T. L. Griffiths. Evolution in mind: Evolutionary dy-
namics, cognitive processes, and bayesian inference. Trends in Cognitive Sciences,
21(7):522 – 530, 2017.
[115] M. Sutter, M. Kocher, and S. Strauß. Bargaining under time pressure in an experimen-
tal ultimatum game. Economics Letters, 81(3):341 – 347, 2003.
[116] R. S. Sutton and A. G. Barto. Introduction to reinforcement learning, volume 135. MIT
press Cambridge, 1998.
[117] Y. K. Takahashi, H. M. Batchelor, B. Liu, A. Khanna, M. Morales, and G. Schoenbaum.
Dopamine neurons respond to errors in the prediction of sensory features of expected
rewards. Neuron, 95(6):1395 – 1405 – e3, 2017.
[118] Y.-Y. Tang, R. Tang, and M. I. Posner. Mindfulness meditation improves emotion reg-
ulation and reduces drug abuse. Drug and Alcohol Dependence, 163:S13 – S18, 2016.
Emotion Regulation and Drug Abuse: Implications for Prevention and Treatment.
[119] J. P . Tangney, A. L. Boone, and R. F . Baumeister. High self-control predicts good adjust-
ment, less pathology, better grades, and interpersonal success. In Self-Regulation and
Self-Control, pages 181 – 220. Routledge, 2018.
[120] S. D. Team. RStan: The R interface to Stan, 2018. R package version 2.17.3.
[121] E. C. Tolman. Cognitive maps in rats and men. Psychological review, 55(4):189, 1948.
[122] A. Toyama, K. Katahira, and H. Ohira. A simple computational algorithm of model-
based choice preference. Cognitive, Affective, and Behavioral Neuroscience, 17(4):764
– 783, August 2017.
[123] L. R. Tucker and C. Lewis. A reliability coefficient for maximum likelihood factor anal-
ysis. Psychometrika, 38(1):1 – 10, March 1973.
[124] M. Ussher, A. Spatz, C. Copland, A. Nicolaou, A. Cargill, N. Amini-Tabrizi, and L. M.
McCracken. Immediate effects of a brief mindfulness-based body scan on patients
with chronic pain. Journal of Behavioral Medicine, 37(1):127 – 134, February 2014.
[125] D. R. Vago. Mapping modalities of self-awareness in mindfulness practice: A potential
mechanism for clarifying habits of mind. Annals of the New York Academy of Sciences,
1307(1):28 – 42.
[126] D. van der Linden, M. Frese, and T. F . Meijman. Mental fatigue and the control of cog-
nitive processes: Effects on perseveration and planning. Acta Psychologica, 113(1):45
– 65, 2003.
[127] T. van Gog, F . Paas, N. Marcus, P . Ayres, and J. Sweller. The mirror neuron system and
observational learning: Implications for the effectiveness of dynamic visualizations.
Educational Psychology Review, 21(1):21 – 30, March 2009.
[128] A. Vehtari, J. Gabry, Y. Yao, and A. Gelman. Loo: Efficient leave-one-out cross-
validation and waic for bayesian models, 2018. R package version 2.0.0.
[129] A. Vehtari, A. Gelman, and J. Gabry. Practical bayesian model evaluation using leave-
one-out cross-validation and waic. Statistics and Computing, 27:1413 – 1432, 2017.
[130] V . Voon, K. Derbyshire, C. Rück, M. A. Irvine, Y. Worbe, J. Enander, L. R. N. Schreiber,
C. Gillan, N. A. Fineberg, B. J. Sahakian, et al. Disorders of compulsivity: A common
bias towards learning habits. Molecular psychiatry, 20(3):345, 2015.
[131] V . Voon, A. Reiter, M. Sebold, and S. Groman. Model-based control in dimensional
psychiatry. Biological Psychiatry, 82(6):391 – 400, 2017. Computational Psychiatry.
[132] M. M. Walsh and J. R. Anderson. Learning from experience: Event-related potential
correlates of reward processing, neural adaptation, and behavioral choice. Neuro-
science & Biobehavioral Reviews, 36(8):1870 – 1884, 2012.
[133] Y. Wang, N. Ma, X. He, N. Li, Z. Wei, L. Yang, R. Zha, L. Han, X. Li, D. Zhang, Y. Liu,
and X. Zhang. Neural substrates of updating the prediction through prediction error
during decision making. NeuroImage, 157:1 – 12, 2017.
[134] P . Watson and S. de Wit. Current limits of experimental research into habits and future
directions. Current Opinion in Behavioral Sciences, 20:33 – 39, 2018. Habits and Skills.
[135] W. Wood and D. T. Neal. A new look at habits and the habit-goal interface. Psychol Rev,
114(4):843 – 863, 2007.
[136] W. Wood and D. Rünger. Psychology of habit. Annual Review of Psychology, 67(1):289
– 314, 2016. Pmid: 26361052.
[137] Y. Worbe, S. Palminteri, G. Savulich, N. D. Daw, E. Fernandez-Egea, T. W. Robbins, and
V . Voon. Valence-dependent influence of serotonin depletion on model-based choice
strategy. Molecular psychiatry, 21(5):624, 2016.
[138] K. Wunderlich, P . Smittenaar, and R. J. Dolan. Dopamine enhances model-based over
model-free choice behavior. Neuron, 75(3):418 – 424, 2012.
[139] H. H. Yin, B. J. Knowlton, and B. W. Balleine. Blockade of nmda receptors in the dor-
somedial striatum prevents action–outcome learning in instrumental conditioning.
European Journal of Neuroscience, 22(2):505 – 512, 2005.
[140] H. H. Yin, B. J. Knowlton, and B. W. Balleine. Inactivation of dorsolateral striatum enhances sensitivity to changes in the action–outcome contingency in instrumental conditioning. Behavioural Brain Research, 166(2):189 – 196, 2006.