EVALUATING ALEATORY UNCERTAINTY ASSESSMENT
_______________________
A Dissertation
Presented to
Faculty of the Graduate School
University of Southern California
Los Angeles, California
______________________
In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy in Psychology
_______________________
by
Kenneth D. Nguyen
August 2018
Kenneth Dac Nguyen 2018
Keywords: subjective probability, meta-analysis, Brier, sport forecasts, individual differences
COMMITTEE MEMBERS
Committee Chair: Richard S. John, Ph.D.
Associate Professor in Psychology
Committee Member: John Monterosso, Ph.D.
Associate Professor in Psychology
Committee Member: Morteza Dehghani, Ph.D.
Assistant Professor in Psychology
Committee Member: Detlof von Winterfeldt, Ph.D.
Professor of Industrial and Systems Engineering
ABSTRACT
This dissertation aims to address two broad research questions. The first, and a
fundamental one, is to what extent subjective assessments of uncertainty are informative and
useful. The second, and a practical one, is what factors determine probability assessment
performance. The first question is addressed in a meta-analysis designed to test whether the
performance of judges in 69 different empirical studies on probability assessment is better than
the performance that would be obtained when judges make uninformative guesses. The meta-
analysis also provides a platform to examine the effects of various important methodological
factors on probability assessment performance including the effects of expertise, elicitation
methods, and assessment contexts. The results indicated that the Brier score, a performance
measure of probability judgments, was statistically better than a score that would be obtained
when judges made uninformative guesses, although the overall quality was quite modest.
Additionally, judgments from experts were significantly better calibrated and more resolute than
judgments from lay individuals. In addition, aleatory uncertainty assessment resulted in a better
Brier score than epistemic uncertainty assessment. Interestingly, judges performed better when
they were asked to estimate probability for a single hypothesis compared to when they were
asked to estimate probabilities for mutually exclusive outcomes.
The second question is addressed in two forecasting studies. In particular, these
experiments were designed to evaluate alternative methods to elicit subjective probability as well
as to explore the effects of individual differences on probability assessment performance.
Extensive efforts were made to recruit individuals knowledgeable in assessment domains to
participate in the studies. The results indicated that numeric representation of probability was
more extreme but less overconfident and more resolute than verbal judgments. In addition, there
were no apparent benefits of using a gamble method to elicit probability. Equally interesting,
judges who received incentives based on their assessment performance did not perform better
than those who did not receive such performance-based incentives. The ability to think
reflectively emerged as a consistent predictor of (good) probability judgments across assessment
contexts. Numeracy and actively open-minded thinking were also predictors of (good) probability
judgment performance.
There are three major contributions from this dissertation. First, this dissertation is the
first attempt to quantify probability assessment performance from 69 primary studies on
probability judgments. Second, this dissertation is one of the first to assess verbal aleatory
uncertainty assessments. Third, this dissertation demonstrates an innovative quantification
method for verbal expressions of uncertainty. These contributions provide further insights and
tools for researchers to continue examining various aspects of subjective assessments of
uncertainty.
ACKNOWLEDGMENTS
There are a number of individuals to whom I would like to express my gratitude for helping
me with this dissertation. I truly appreciate the valuable feedback from my committee members:
Prof. Detlof von Winterfeldt, Prof. John Monterosso, Prof. Morteza Dehghani, and Prof.
Richard John. I am especially thankful for having Richard as my graduate advisor. He has
provided me with excellent opportunities to pursue my research interests, to work on important
and interesting projects, to present at numerous conferences, and to publish my works. He is
always generous with his time to help me develop my research and writing skills. Most
importantly, he has taught me how to make good decisions by ‘dividing and conquering’
complex problems, focusing on my values, and appreciating the role of uncertainty. These are
invaluable skills that would have a lasting impact on my personal life. I want to let him know
that he has played an instrumental role in my academic and professional successes. I also want to
thank many individuals affiliated with the Center for Risk and Economic Analysis of Terrorism
Events (CREATE) for their support and encouragement throughout my graduate career. I also
wish all of my graduate friends to succeed in their academic and professional journeys, and I
hope we can keep in touch.
There are simply too many members of my extended family to whom I need to express my
deep gratitude. My mother always does what she thinks would be best for me. I hope I have
made her proud. The Dao family provided me with generous support for my studies in
Vietnam. The Nguyen family gave me an opportunity to come to the States. Stephanie Dao and
Danny Nguyen have provided me with generous support for my undergraduate study. If Dr.
Richard John plays an instrumental role in my professional success, Helen Thanh Nguyen plays an
indispensable role in my well-being and personal development. She takes good care of me, so I
can focus on my research and career. She makes me pay for her stuff (rightly so), so I have become
better at managing my finances. She helped me recover from an ACL injury, so I can enjoy playing
soccer again. Put simply, she has been the source of my joy in the last seven years, and I hope we
can continue sharing memorable moments in the years to come.
TABLE OF CONTENTS
ABSTRACT ................................................................................................................................ 3
ACKNOWLEDGMENTS ............................................................................................................. 5
LIST OF TABLES ......................................................................................................................... 8
LIST OF FIGURES ..................................................................................................................... 10
CHAPTER 1: OVERVIEW ......................................................................................................... 11
CHAPTER 2: A BRIEF OVERVIEW ON PROBABILITY JUDGMENT RESEARCH .......... 16
CHAPTER 3: MEASURING SUBJECTIVE PROBABILITY JUDGMENTS .......................... 32
CHAPTER 4: A META-ANALYSIS ON THE QUALITY OF
PROBABILITY ASSESSMENT......................................................................... 39
CHAPTER 5: VERBAL PROBABILITY EXPRESSIONS ....................................................... 67
CHAPTER 6: EVALUATING VERBAL ASSESSMENT OF
ALEATORY UNCERTAINTY........................................................................... 77
CHAPTER 7: COMPARING ALTERNATIVE METHODS TO
ELICIT SUBJECTIVE PROBABILITY .......................................................... 104
CHAPTER 8: GENERAL DISCUSSION ................................................................................. 131
REFERENCES .......................................................................................................................... 141
APPENDIX A: CODING MANUAL ........................................................................................ 184
APPENDIX B: MATERIALS IN STUDY 1............................................................................. 189
APPENDIX C: MATERIALS IN STUDY 2............................................................................. 196
LIST OF TABLES
Table 1. Variables coded in the meta-analysis...............................................................................55
Table 2. Descriptions of primary studies in the meta-analysis ......................................................58
Table 3. Correlations among the covariates in the meta-regression model ..................................61
Table 4. Means and SDs of numeric values of the verbal expressions in the
ambiguity aversion study ................................................................................................76
Table 5. Psychological measures of individual differences in the NFL study ..............................90
Table 6. Characteristics of the MIN sample ..................................................................................92
Table 7. Descriptive statistics of the numeric translations of verbal expressions
in the NFL study .............................................................................................................93
Table 8. Correlations among performance indexes in the MIN sample in the NFL study ............93
Table 9. Performance measures in the MIN and most informative samples .................................95
Table 10. Measures of individual differences in the NBA study ................................................120
Table 11. Means and SDs of the numeric translations of
verbal expressions in the NBA study ............................................................................123
Table 12. Correlations among performance measures in the MIN sample in the NBA study…123
Table 13. Means and SDs of the performance measures in the MIN sample ..............................124
Table 14. Regression results in the MIN sample in the NBA study...…………………..……...124
Table 15. Means and SDs of the performance measures in the most
informative sample in the NBA study………………………………………………….126
Table 16. Comparing probability values of verbal expression of uncertainty... ……………….138
LIST OF FIGURES
Figure 1. Signal and noise distributions in SDT analyses. ............................................................37
Figure 2. An example of the receiver operating characteristic curve. .........................................38
Figure 3. Literature search process. ...............................................................................................51
Figure 4. Forest plot with a summary effect size ...........................................................................58
Figure 5. Cumulative meta-analysis ...........................................................................64
Figure 6. An example of a binary choice displayed to respondents ..............................................72
Figure 7. Logical sequence of the quantification methodology………………………………….72
Figure 8. RED vs. BLUE conditions in the ambiguity study ………………….………………...73
Figure 9. Flow of respondents in the main and the follow-up studies ……………...……….…..92
Figure 10. Covariance graph in MIN sample………………………. ……………...……………94
Figure 11. An example of how the Brier score was described to experts in the
Incentivized/Number condition……………….……………………………………...117
Figure 12. An example of how the Brier score was described to experts in the
Incentivized/Verbal condition……………………………………………………...…118
Figure 13. Distributions of numeric values of verbal probability expressions…………………123
CHAPTER 1
OVERVIEW
Probability is the language of uncertainty, and it provides a mathematical vehicle to
quantify the degree of incertitude. There are at least three different views on probability. In the
classical interpretation, the probability of an event is defined as the ratio of the number of
outcomes comprising that event to the total number of possible outcomes (Winkler, 1996).
According to the frequentist point of view, the probability of an event is defined as the
proportion of times that the event occurs over many identical trials. As the number of trials gets
larger, the proportion of times the event occurs approaches its true probability value. In contrast,
the Bayesian school of thought defines probability as the expression of subjective belief (Savage,
1954).
The notion that probability is subjective may be unfamiliar and perhaps counterintuitive.
Yet, subjective probability plays a central role in many real-life decisions. Although relevant
historical data should be used whenever possible to inform judgments and choices, situations
arise such that data are limited, not in usable forms, or even risky to collect. In these
circumstances, probability estimates can be obtained via experts’ judgments. Data represent only
a fraction of knowledge about a future event, and decision makers must combine what they learn
from the data with other sources of information to assess uncertainty (e.g. their beliefs about the
world). Even though the Bayesian perspective has been criticized for its subjectivity, it has been
argued that subjective judgments are unavoidable in assessing uncertainty (Winkler, 1996). For
example, unless assumed, subjective judgments are needed to determine whether all outcomes
are equally likely in the classical interpretation. Similarly, it is difficult, if not impossible, to
meet the frequentist’s assumption that individual trials are replications of one another unless
subjective judgments are used.
Because probability, according to the Bayesian view, is subjective, it must be elicited.
This raises the question of how subjective probability can be elicited from individuals. Numerous
approaches exist to elicit subjective probability estimates (for a review, see Galway, 2007).
However, the unresolved question is whether there are substantial differences among these
approaches that would make one method better than the others. A critical gap in prior research is
the tendency to generalize results from studies that examine epistemic uncertainty to assessment
of aleatory uncertainty. The former involves an assessment of knowledge whereas the latter
assessment is concerned with stochastic behaviors. Recent research reveals important
psychological and behavioral distinctions between the two types of assessments (Tannenbaum,
Fox, & Ulkumen, 2016), rendering it untenable to generalize results from one context to another
(Ronis & Yates, 1987). In addition to this generalizability issue, little research has been
conducted to examine the effects of alternative representations of uncertainty on aleatory
probability assessment. For example, although numeric probabilities are preferred for
computational purposes, probabilities can also be expressed verbally. Indeed, research shows that
individuals prefer to use verbal expressions of uncertainty over numeric representation of
probability (Erev & Cohen 1990). Yet, there is limited research to evaluate the quality of non-
numeric representation of probability (see Wallsten, Budescu, & Zwick, 1993 for an example).
A related, and perhaps more fundamental, question is whether subjective estimates are
coherent and useful. Subjective probabilities are coherent when they follow principles of
probability theory such as the sum of probabilities for all mutually exclusive outcomes must be
one (Savage, 1954). In addition, subjective probability should help to inform decisions better
than random guesses. Decades of behavioral research suggest that individuals rely on heuristics
or mental shortcuts to make judgments and choices (see Kahneman, Slovic, & Tversky, 1982).
Although these heuristics are efficient and often sufficient, they can lead to inconsistent, biased,
and undesirable outcomes (see Gilovich, Griffin, & Kahneman, 2002). The abundance of
evidence that human judgments are prone to error raises a critical question: Are subjective
probability estimates reliable, coherent, and useful? Given the extensive literature on probability
judgments, it is not a surprise that several excellent reviews on probability judgments have been
written over the years (Alpert & Raiffa, 1982; Lichtenstein, Fischhoff, & Phillips, 1981;
Wallsten & Budescu, 1983). Nonetheless, many of these reviews are concerned with specific
aspects of probability judgments such as calibration (Keren, 1991) and do not address the
empirical question whether individuals can make informative probability judgments.
One of the biggest challenges to address this question is the fact that primary studies on
probability judgments differ widely in their objectives, designs, elicitation methodologies,
assessment contexts, and samples. As a result, a qualitative summary of the literature, albeit
providing useful information, hardly provides a conclusive answer for the proposed question.
What is needed is a systematic approach that allows researchers to synthesize quantitative results
from prior research and to model the effects of various factors related to probability judgments.
This dissertation is an attempt to address the two proposed research questions. To address
the first question whether individuals can provide informative probability estimates, a meta-
analysis has been conducted to quantify probability judgment performance in prior research. The
objective is to evaluate whether the quality of subjective probability estimates elicited from
human judges is better than the result from uninformative guesses. To address the second
question regarding the effects of alternative elicitation methods, two experiments were conducted
to compare different approaches to assess aleatory uncertainty. In particular, the focus is on non-
numeric methods to elicit probability including verbal assessment of uncertainty and inferences
of probability values from choices between gambles.
Results from the dissertation studies revealed that 1) the Brier score, a performance measure
of probability judgments, was better than a score that would have been obtained when human
judges made uninformative guesses, although the overall quality was quite modest; 2) Numeric
representation of probability was more extreme but less overconfident and more resolute than
verbal judgments; 3) There were no apparent benefits of incentivizing judges when they
completed a probability judgment task; 4) There were significant effects of individual
differences, especially the ability to think reflectively, on distinct attributes of probability
judgments.
This dissertation not only makes significant substantive contributions to the current
research literature on probability judgments but also contributes two important methodological
innovations. This dissertation presents one of the first experiments to evaluate verbal assessments of
aleatory probability judgments. In addition, this dissertation also presents an innovative
quantification method for verbal expressions of uncertainty. This method is not only
conceptually sound as it is based on the long-held notion that subjective probability can be
inferred from choices between gambles (Edwards, 1954; Savage, 1954), but it can also be easier
for respondents to complete as it simply requires choices between pairs of gambles.
This dissertation is organized as follows. Chapter 2 provides a brief overview of research
on subjective probability judgments, and Chapter 3 explains quantitative measures of probability
judgments. Chapter 4 describes and presents results from a meta-analysis designed to quantify
probability judgment performance. In subsequent chapters, a novel method to quantify verbal
probability judgments is introduced. Then, the results from two forecasting experiments designed
to explore the effects of alternative representations of uncertainty (verbal probability and choices
in gambles) on aleatory probability judgments are described.
CHAPTER 2
A BRIEF OVERVIEW ON PROBABILITY JUDGMENT RESEARCH
The study of probability judgments spans multiple disciplines including, but not limited
to, psychology, economics, probability and statistics, decision and risk analysis, and philosophy.
The goal of the current review is to focus on one important issue: The extrapolation of research
findings on probability judgments in laboratory research (or lab study) to studies conducted in
natural settings (or expert elicitation). Probability judgment research can be broadly categorized
into two groups. The first category includes laboratory research in which respondents are asked
to provide confidence ratings for answers to general knowledge questions. The respondents are
typically undergraduates and have little incentive to perform well on the given tasks. In contrast,
the second category consists of research in natural settings in which probability distributions are
elicited from subject-matter experts. The samples of judges are usually very knowledgeable in the
domains they are asked about and often have strong incentives to perform well.
Results from lab studies suggest that probability judgments are inconsistent with
principles of probability theory and human judges often rely on irrelevant cues to make
judgments (Gigerenzer & Gaissmaier, 2011; Gilovich, Griffin, & Kahneman, 2002; Kahneman,
Slovic, & Tversky, 1982). These findings clearly challenge the efficacy of relying on experts'
subjective probability estimates in the absence of relevant “hard” data. Even though laboratory
research can be useful in informing expert elicitation by highlighting conditions under which
probability judgments can be inconsistent, there are clear distinctions between expert elicitation
and lab studies. Expert elicitation is not only different from studies conducted in laboratory
settings (hence “lab studies”) in terms of research objectives but the former also requires a more
robust and elaborate procedure to elicit probability from experts. These distinctions raise the
question to what extent findings from lab research can be generalized to expert elicitation. As a
result, if the goal of lab research on probability judgments is to become more practically
relevant, lab studies should be tied more closely to the practice of expert elicitation.
In this chapter, probability judgments in expert elicitation and laboratory research are
discussed with a focus on identifying gaps in research that limit the generalization of research
findings from one domain to another. The first section of this chapter highlights key
experimental findings that have practical implications for expert elicitation. In the second
section, several distinctions between expert elicitation and laboratory research on probability
judgments are discussed. The third section addresses the effects of individual differences on
probability judgments. The chapter concludes with a summary of key research questions that this
dissertation attempts to address.
Cognitive Biases in Probability Judgments
Results from experimental psychology can inform expert elicitation by highlighting
conditions under which probability judgments are suboptimal. Decades of research suggest that
human judgments are fallible (Gigerenzer & Gaissmaier, 2011; Gilovich, Griffin, & Kahneman,
2002; Kynn, 2008). Individuals are often overconfident in their judgments (Ferrell & McGoey,
1980), do not update their beliefs as much as they should in light of new data (Rapoport,
Wallsten, Erev, & Cohen, 1990), make incoherent judgments (Tversky & Koehler, 1994), to
name just a few biases. Montibeller and von Winterfeldt (2015) provided an excellent summary
of cognitive and motivational biases in expert elicitation. The authors argued that cognitive
biases such as ambiguity aversion, conjunction fallacy, base rate fallacy, endowment effect,
sample size insensitivity, gambler's fallacy, non-regressive predictions, subadditivity, and
superadditivity are relatively easy to attenuate or even eliminate. This is because individuals are
willing to make corrections once those inconsistencies are identified. In contrast, motivational
biases including scaling, anchoring, availability heuristics, equalizing, desirability, affect
heuristics, and overconfidence are more difficult to reverse.
The last bias involves an important property of probability judgments, namely
calibration. Indeed, a substantial amount of research has been devoted to this topic (see Keren,
1991 for a review). The following key experimental results can be summarized from this
research program: 1) Individuals tend to be more confident than they should be in their
judgments; 2) as the proportion of correct responses decreases, overconfidence is exaggerated
(the so-called 'hard-easy effect'); 3) there is no difference in calibration between experts and
laypeople; and 4) calibration can improve with training (Ferrell & McGoey, 1980; Keren, 1991).
Although overconfidence is a fairly robust experimental phenomenon, several authors are
skeptical about this effect and have proposed alternative explanations. Gigerenzer, Hoffrage, and
Kleinbölting (1991) disputed the effect of overconfidence when experimental subjects were
asked to provide confidence for answers to general knowledge questions. These authors argued
that because few people know the answers to difficult questions, their probability judgments are
bound to be overconfident when the elicitation task includes more difficult questions. Stated
differently, if general knowledge questions were a representative sample from the knowledge
domain, zero overconfidence would be expected (Gigerenzer, 1993, p. 304).
Dawes and Mulford (1996) correctly pointed out that
whether judgments are overconfident or underconfident depends on how researchers analyze
data. The relationship between accuracy and confidence rating is expressed in a calibration
curve. In typical behavioral research, accuracy is plotted as a function of confidence rating, and
the general finding is that judgments are often overconfident. However, confidence can also be
plotted as a function of accuracy. Because the slope of a calibration curve would be different
when confidence is treated as either a dependent or an independent variable, different
conclusions can be drawn from the calibration analysis, i.e. judgments can be either
overconfident or underconfident. However, later research provided evidence contradicting
these claims.
Brenner, Koehler, Liberman, and Tversky (1996) provided their subjects with different
personality profiles and asked them to predict whether those who fit a particular profile would
engage in certain behaviors. In contrast to the representativeness and regression artifact
arguments, the authors found that overconfidence was not eliminated by randomly selecting
questions and could not be treated as merely a regression artifact. In addition, Keren (1991)
correctly pointed out that the overconfidence effect has been generalized to tasks in which
respondents are asked to provide likelihood assessments of future events, even though the effect
has mostly been found when respondents were asked to provide confidence ratings for general
knowledge questions. This distinction between confidence ratings and likelihood judgments is
important because people use different psychological strategies to make judgments in these tasks.
Peterson and Pitz (1988) provided empirical evidence that judgments in likelihood-assessment
tasks were determined by the number of different predictions that could be generated,
whereas judgments in confidence-rating tasks were influenced by salient factors that
people believed affected the accuracy of their assessments.
Experimental results also suggest that probability judgments are incoherent. Coherence is
determined by comparing the properties of subjective probability estimates against the calculus
of probability theory (Savage, 1954). For example, the sum of the probabilities of two
complementary hypotheses should equal one, and the probability of the intersection of two
hypotheses should not exceed the probability of either hypothesis. In a series of experiments,
Kahneman and Tversky (1982) demonstrated incoherent probability judgments. One example is
the Linda bank-teller problem, in which subjects judged the conjunction probability of Linda’s
being a feminist and a bank teller to be greater than the probability of Linda's being a bank
teller alone.
Tversky and Koehler (1994) demonstrated another violation of probability judgment,
which the authors termed “subadditivity.” They showed that the total probability of all
hypotheses increased and exceeded unity when the level of detail was increased in the
partitioning of the event space into mutually exclusive and exhaustive events. Another
interesting violation comes from Bayesian reasoning studies. Bayes’ rule suggests that
probability should be updated in light of new information. Early empirical research suggested
that individuals were conservative Bayesian thinkers. Although experimental subjects adjusted
their probabilities when receiving new information, their adjustments were too conservative
when compared to results from Bayesian analyses (Edwards, 1968).
However, these violations are specific to how probability responses are presented. For
example, when the Linda problem was presented in a frequency format, only a small group
of subjects committed the error (Gigerenzer & Hoffrage, 1995). In addition, Teigen and Brun
(1999) showed that using verbal expressions of uncertainty may reduce the effect of conjunction
fallacy. The authors reasoned that positive verbal expressions of uncertainty such as “probable”
invite subjects to think of reasons for the proposition (e.g. what is more probable: Linda is a bank
teller or Linda is a bank teller and a feminist). On the other hand, negative terms such as “less
likely” invite subjects to think in the opposite direction (e.g. what is least likely to be true: Linda
is a bank teller or Linda is a bank teller and a feminist). Thus, framing Linda's problem using
negative terms can reduce the conjunction bias. Similarly, using the frequency format of
probability reduced the effect of subadditivity (Tversky & Koehler, 1994) and improved Bayesian
reasoning (Ayton & Wright, 1994; Betsch, Biel, Eddelbuttel, & Mock, 1998).
Distinctions between Expert Elicitation and Laboratory Research
An interesting and important question is to what extent cognitive and motivational biases
affect expert elicitation. Because experts’ judgments are not immune from many biases that
affect lay judgments (Tetlock, 2005), experts’ judgments may not be significantly better than lay
judgments. On the other hand, the process of expert elicitation has many built-in mechanisms
(e.g. consistency check, decomposition, and structured analyses) to safeguard experts’ judgments
against the effects of cognitive and motivational biases. Hence, it is reasonable to expect better
judgments from experts. Indeed, there are a number of differences between expert elicitation and
lab studies on probability judgments.
One of the goals of descriptive research in judgment and decision-making is to
characterize how people make decisions. As such, experimental psychologists often design
studies within a context that increases the chances that their respondents make judgments
inconsistent with normative standards (Stanovich & West, 2000). Psychologists, then, can
systematically investigate how and why those judgments deviate from the normative standard. In
contrast, one of the primary goals in expert elicitation is to ensure that the probabilities elicited
from experts are at the highest standard. As such, probability elicitors design their assessments to
minimize errors and maximize their experts’ performance.
More importantly, the process of expert elicitation differs substantially from procedures
to obtain probability judgments used in psychological experiments. In a typical lab study, college
students are asked to indicate how confident they are in their answers to general knowledge
questions. The student respondents often have little opportunity to learn about (or refresh their
memory of) principles of probability theory and have little incentive to make careful
assessments. On the other hand, expert elicitation is a labor-intensive process that involves
multiple stages and complex procedures. Keeney and von Winterfeldt (1991), drawing on their
experience in conducting a complex nuclear risk assessment, identified essential components of
eliciting probability from experts. These components are: 1) identification and selection of
issues, 2) identification and selection of experts, 3) discussion and refinement of the issues, 4)
training for elicitation, 5) elicitation, 6) analysis, aggregation, and resolution of disagreement, 7)
documentation and communication. In addition to different research focuses, there are several
key differences between probabilities elicited from experts and those elicited from experimental
subjects. As a result, it is difficult to generalize results from experimental research to expert
elicitation.
One of the most distinctive differences between expert elicitation and lab research is the
type of probability judgments. In a typical psychology study, undergraduate students are
presented with general knowledge questions. They are asked to select correct answers, usually
from binary choices, and provide ratings to indicate how confident they are in their answers. This
is a confidence rating task. On the other hand, experts are often asked to assess probability
distributions over a range of possible outcomes that could happen in the future. This is a
likelihood assessment task. The difference between a likelihood assessment task and a
confidence rating task underscores the distinction between aleatory and epistemic uncertainty.
Aleatory uncertainty stems from the stochastic process of an event whereas epistemic uncertainty
entails missing information, facts, or expertise concerning a knowable event (Fox & Ulkumen,
2011). Aleatory uncertainty is irreducible, but it can be estimated via empirical observations. In
contrast, epistemic uncertainty can be reduced through expertise and/or information gathering.
Both types of uncertainty are present in many risk analysis applications (Hora, 1996).
Consider an example in which an engineer is asked to quantify the probability of a system
failure. The engineer may be unsure whether he has a correct understanding of how the system
operates. A correct epistemic representation of the system allows the engineer to understand the
root cause when the system fails. This uncertainty can be reduced by having the engineer study
the system more carefully and consult with other experts. The engineer may also be concerned
about the chance that the system still fails even when all safety measures function properly. This
aleatory uncertainty is irreducible, but the failure rate can be estimated by examining historical
data. This example also illustrates another distinction between the two sources of uncertainty.
Whereas aleatory uncertainty is naturally measured by relative frequency (e.g. historical data),
epistemic uncertainty is often quantified from subjective probability judgments (e.g. how much
judges know about what they know).
Recent research conducted by Ülkümen, Fox, and Malle (2016, Experiment 5) reveals
intriguing psychological differences between aleatory and epistemic uncertainty assessments.
They found that their subjects were more attentive to frequency information when they were
primed with a likelihood-statement response, but they paid more attention to feeling-of-knowing
information when they were primed with a confidence-statement response. The same authors
also showed that the distinction between aleatory and epistemic uncertainty was reflected in
everyday use of language (2016, Experiments 1 and 2). Because epistemic uncertainty, in
principle, is knowable, it would be more natural for individuals to use confidence statements (e.g.
“I am sure that X is true”) to reflect the degree of (epistemic) uncertainty when they are asked
about past or current events. On the other hand, future events, by definition, have not yet
occurred. Therefore, they are unknowable and have multiple possibilities. As a result, speakers
tend to use likelihood statements (e.g. “It is likely that X will occur”) to capture the degree of
(aleatory) uncertainty.
Examining the articles published in the New York Times between 2008 and 2009, the
authors found that speakers were more inclined to use confidence than likelihood statements
when adopting a first-person perspective, expressing some feeling of control over the outcome,
relying on intuition to make predictions, speaking about events in the present or past, and when
uncertainty could be attributed to internal sources such as the lack of knowledge. These
contextual characteristics are representative of epistemic uncertainty. On the other hand,
speakers tended to use likelihood statements when they adopted a third-person perspective,
appeared to have limited control over the outcome, relied on logic and calculation to make
predictions, and spoke about events in the future. These linguistic qualities map onto
characteristics associated with aleatory uncertainty.
Using a different paradigm, Carlson (1993) asked subjects to assess the probabilities of
events that happened in the last two weeks and events that would occur in the next two weeks.
Judgments of past events require an assessment of epistemic uncertainty whereas judgments of
future events assess aleatory uncertainty. The author found that the Brier score, calibration score,
and noise were better when subjects assessed future events compared to past events.
Interestingly, discrimination, or the ability to distinguish when an event happens, was better in
past assessments compared to future assessments.
The enhanced effect of calibration in aleatory judgment as opposed to the increased
discrimination in epistemic judgments is probably due to the difference in the level of extremity
in probability responses across tasks. Questions concerning epistemic uncertainty, in principle,
are knowable. Thus, it is expected that individuals either know or do not know the answers.
Hence, their probability responses should be close to 0 or 1. Indeed, Tannenbaum, Fox, and
Ülkümen (2014) found that their subjects made more extreme probability judgments for events
that they viewed as being more epistemic in nature, which in effect, increased the discrimination
score. Nevertheless, the trade-off was a decrease in calibration score, which is expected as
calibration and discrimination are inversely related (Yates, 1990).
Another key difference between expert elicitation and lab studies on probability judgment
is the level of expertise. Subjects in typical lab studies are undergraduate students while
participants in expert elicitations studies are knowledgeable individuals with mastery in specific
domains. Interestingly, the effect of expertise on probability judgments is less clear. Even though
it is reasonable to expect that forecasts by experts are more accurate than those from lay
individuals, experts are susceptible to many of the same cognitive and motivational biases that
affect lay judgments (Tetlock, 2005). Even though probability judgments from experts in certain
fields such as meteorology are very well calibrated, there is evidence that probability judgments
made by experts in many professional settings show poor calibration (see Shanteau, 1992). It is
important to note that the effect of expertise is often confounded with the effect of assessment
context as experts are often asked to assess aleatory uncertainty whereas lay individuals are often
requested (in research studies) to assess epistemic uncertainty.
There are also major differences between expert elicitation and lab research in the
procedures used to elicit probability. Experts are usually allowed to carefully study assessment
problems before they are tasked with providing probability estimates. In contrast, experimental
respondents are often asked to generate probability estimates for events for which they often
have limited knowledge (e.g. almanac questions). Importantly, respondents in psychological
experiments are often asked to directly assess the probability for problems (or events) of interest
whereas knowledgeable individuals in expert elicitation tasks are often asked to provide
estimates for component events.
The divide and conquer principle in decision analysis presumes that a probability
estimate for an event would be more accurate when it is aggregated from probabilities of
component events that make up the target event (Raiffa, 1968). For example, the probability that
the Belgium soccer team wins the coming 2018 World Cup can be decomposed into the
conjunction of conditional probabilities, including chance of advancing at several stages of play:
group stage, round of 16, quarterfinal, semifinal, and ultimately winning the final game. The
target probability can be computed by having these conditional probabilities elicited and
combined according to probability rules for a conjunctive event. Interestingly, evidence for the
effectiveness of decomposition is mixed. Decomposition leads to better calibrated judgments in
some contexts (Armstrong, Deniston, & Gordon, 1975; MacGregor, Lichtenstein, & Slovic,
1988; Alpert & Raiffa, 1982), but not in others (Henrion, Fischer, & Mullin, 1993).
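To make the decomposition arithmetic concrete, the sketch below recombines a set of purely illustrative stage probabilities for the conjunctive event described above; the numbers are invented for the example and are not elicited estimates.

```python
# Hypothetical conditional probabilities of clearing each successive stage
# (illustrative values only, not elicited estimates).
stage_probabilities = {
    "advance from group stage": 0.80,
    "win round of 16": 0.60,
    "win quarterfinal": 0.50,
    "win semifinal": 0.45,
    "win final": 0.50,
}

# For a conjunctive event, the target probability is the product of the
# conditional probabilities of advancing at each stage.
p_win_cup = 1.0
for stage, p in stage_probabilities.items():
    p_win_cup *= p

print(f"Recomposed P(win the World Cup) = {p_win_cup:.3f}")  # 0.054
```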
Interestingly, even though numeric probabilities are required in practical applications of
risk and decision analysis, experimental researchers have examined the possibility of using non-
numeric representation of probability to quantify uncertainty. For example, because verbal
expressions of uncertainty are often the preferred communication mode, a number of studies
have been conducted to assess properties of verbal assessment of uncertainty (Brun & Teigen,
1988; Erev & Cohen, 1990; Olson & Budescu, 1997; Wallsten, Budescu, & Zwick, 1993).
Despite the apparent difference in response mode, verbal assessments of uncertainty are more
similar to than different from numeric expressions of uncertainty (Wallsten, Budescu, & Zwick,
1993; cf. Windschitl & Wells, 1996).
Individual Differences
Even though there is an enormous body of literature on cognitive and motivational biases,
research that examines the factors associated with good probability assessment skills is limited
(Kynn, 2008). Identifying individual factors that predict good probability judgments is a crucial
task as it enables decision makers to foster these positive attributes. Results from prior research
suggest the existence of a distinctive set of skills that predicts good judgments.
Cognitive ability, especially intelligence, has been thought to be the most powerful
predictor of judgment performance. Accordingly, intelligent people would be more likely to use
cost-benefit reasoning, “…because intelligence is generally regarded as being the set of
psychological properties that makes for effectiveness across environments. . . [Thus,] intelligent
people should be more likely to use the most effective reasoning strategies than should less
intelligent people” (Larrick, Nisbett, & Morgan, 1993, p. 333). Yet, the relationships between
cognitive ability and performance in decision-making tasks are somewhat inconsistent. Although
SAT scores, a commonly used measure of cognitive ability, were positively correlated with
better Bayesian reasoning and the tendency to avoid the conjunction fallacy (Stanovich & West,
1998), general intelligence did not help to improve cost-benefit reasoning (Stanovich,
Grunewald, & West, 2003). Furthermore, the magnitude of myside bias, the tendency to generate
and evaluate evidence consistent with one’s own hypotheses, has little relationship with
intelligence (Stanovich, West, & Toplak, 2013).
The inconsistent relationship between cognitive ability and judgment performance
requires an alternative explanation to fully account for the effect of individual differences on
judgment and decision making. Stanovich, West, and Toplak (2018) reasoned that traditional
measures of cognitive ability such as intelligence tests or academic aptitude have little impact on
judgment performance because these measures are incomplete metrics of the type of cognitive
control pertinent to rational thought. These authors distinguished between two cognitive
constructs thought to have relationships with judgment performance: an algorithmic mind and a
reflective mind. An algorithmic mind concerns the information processing steps that facilitate the
analysis and synthesis of information for thinking, reasoning, and decision-making. Cognitive
psychologists characterize a typical algorithmic mind as including the following hardware: input-
output mechanisms, perceptual registration mechanisms, long-term memory storage, and short-
term working memory among others. In contrast, a reflective mind directs the operations of an
algorithmic mind to achieve systemic goals given one’s beliefs. In other words, a reflective mind
is concerned with the why. That is, why does an algorithmic mind need to process information
the way it does? Cognitive styles and thinking dispositions are probably the more popular terms
to describe a reflective mind in the psychology literature.
Another important distinction between an algorithmic mind and a reflective mind is how
they are measured. The former is typically reflected in cognitive ability measures such as
standardized tests whereas the latter is typically characterized by factors related to, but distinct
from, general cognitive ability. These include factors such as actively open-minded thinking, the
need for cognition, probabilistic thinking, need for closure, and avoidance of miserly information
processing, among others (see Stanovich, West, & Toplak, 2011 for a review). Interestingly, even though
measures of cognitive ability and thinking dispositions are often positively correlated, the
relationships between performance in decision-making tasks and thinking disposition remain
significant even when the effect of cognitive ability is partialled out (Bruine de Bruin, Parker, &
Fischhoff, 2007; Klaczynski & Lavallee, 2005). This suggests that elements of a reflective mind
can be important predictors of performances on decision-making tasks, many of which concern
probabilistic reasoning.
A limitation of prior research is the use of a task-based paradigm to assess the
relationship between cognitive factors and judgment performance. There is an array of
performance tasks that can be used to assess judgment performance, and many of these tasks
concern probabilistic reasoning (Stanovich, West, & Toplak, 2018). While this task-based
paradigm is useful, it does not address whether elements of a reflective mind relate to judgments
performed in natural settings. For example, there is limited research that examines elements of a
reflective mind in relation to probabilistic forecasts (cf. Mellers et al., 2014).
Research Questions
The review underscores several research gaps that must be addressed. First, while the
presence of cognitive and motivational biases may challenge the quality of probability estimates
elicited from experts, prior research provides inconsistent results due to differences in
methodologies to elicit and assess subjective judgments. These inconsistent findings provide
little confidence to conclude whether subjective judgments are informative and useful for making
decisions. Thus, a systematic review is warranted to evaluate the performance of probability
judgments. Importantly, such a review should allow for a rigorous study of the effects of various
methodological factors on the performance of subjective judgments.
Second, the literature review also highlights several distinctions between expert
elicitation and experimental studies on probability judgments. If the goal of experimental
research is to become more relevant to practical applications of probability elicitation, lab-based
studies should closely follow procedures and protocols in expert elicitation. In particular, the
focus should be on expert selection (or at least selecting knowledgeable individuals) and
assessment of aleatory uncertainty, as these are probably the most important factors to determine
the quality of judgments. On the other hand, experimentation is an excellent tool to examine
critical issues in expert elicitation. For instance, practitioners may be interested in knowing what
elicitation methods they should use for particular problems. Furthermore, experimental research
can also evaluate elicitation methods that have been traditionally ignored in expert elicitation,
such as the use of verbal expressions of uncertainty. Third, there is a need to explore the effects
of individual differences in aleatory probability assessments.
This dissertation is an attempt to close these research gaps. First, a meta-analysis is
conducted to synthesize results from prior studies on probability judgments. The objectives were
to assess whether subjective judgments would be better than uninformative guesses, and to
model the effects of various methodological factors on probability judgments. Second, two
forecasting experiments were utilized to compare alternative methods to elicit probability with an
eye toward identifying methods that can improve subjective judgments. The current study
evaluated the performance of verbal expressions of uncertainty in tasks that required judges to
forecast future events (aleatory uncertainty) and evaluated a probability elicitation method that
infers probabilities from choices between gambles. The two experiments also involve analyses of
the relationships between elements of a reflective mind (e.g. actively open-minded thinking,
numeracy, and cognitive reflection) and aleatory probability judgments. However, before
presenting results from the meta-analysis and the two forecasting experiments, it is necessary to
understand different performance metrics of probability judgments.
CHAPTER 3
MEASURING SUBJECTIVE PROBABILITY JUDGMENTS
There is a need to derive metrics that assess the correspondence between subjective
probability forecasts and reality. Because forecasters may favor certain outcomes, they can hedge
their estimates. For example, a sports editor might forecast an 80% chance that his favorite team
will win the next game even though the evidence suggests a much more modest probability. A
good evaluation metric should avoid this problem by discouraging forecasters from gaming the
metric. There is a class of measures that fulfills this requirement. These measures are known as
proper scoring rules because forecasters can only maximize their performance scores by reporting
their honest beliefs (Lichtenstein & Fischhoff, 1980). The Brier
score (Brier, 1950) is a special case of a proper quadratic scoring rule for events with binary
outcomes, and can be expressed as,
\overline{PS} = N^{-1} \sum_{j=1}^{N} (p_j - d_j)^2 \qquad (1)

where N is the total number of predicted events, p_j is the subjective probability estimate that
event X_j occurs, and d_j equals 1 when the event occurs and 0 when it does not. The best
possible score is 0.0 and the worst possible score is 1.0. Equation 1 also indicates that, for each
probability judge, her Brier score, denoted \overline{PS} for probability score, is the mean of the squared
deviations between her assessments of uncertainty and the true state of the world.
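As a minimal illustration of Equation 1, the following Python sketch computes the Brier score for a handful of hypothetical forecasts and outcomes (the values are invented for the example):

```python
def brier_score(forecasts, outcomes):
    """Mean squared deviation between probability forecasts and outcomes (Equation 1)."""
    return sum((p - d) ** 2 for p, d in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical forecasts p_j and outcome indicators d_j (1 = event occurred).
forecasts = [0.9, 0.7, 0.3, 0.8, 0.5]
outcomes = [1, 1, 0, 0, 1]
print(brier_score(forecasts, outcomes))  # 0.0 is the best possible score, 1.0 the worst
```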
The Brier score is, indeed, a strictly proper scoring rule because forecasters obtain the best
possible expected score if and only if they report their honest beliefs. A comparison of the
expected score under the Brier score and under a linear scoring rule (the difference between a
probability forecast and the true state of the world, 0 or 1) shows that the Brier score is strictly
proper whereas the linear rule is not. The following notation is needed:
p: forecaster’s true belief (true probability)
f: forecaster’s reported probability
x: outcome of a binary event (0 = does not occur, 1 = occurs)
A generic form of the scoring rule (Brier vs. linear) is:
B = (f - x)^n \qquad (2)
where n = 1 represents the linear scoring rule and n = 2 represents the Brier score. For
simplicity, B represents a score when judges make a single forecast.
The expected value (score) of the forecast is:
E(B) = p(f - 1)^n + (1 - p)f^n \qquad (3)
The problem becomes finding the value of f that makes E(B) an extremum (a minimum for a
penalty score). Thus, we set E'(B) = 0; in other words, we set the derivative of E(B) with respect
to f equal to zero. We have:
pn(f - 1)^{n-1} + (1 - p)nf^{n-1} = 0 \qquad (4)
Rearranging the terms so that p is expressed as a function of f yields

p = \frac{f^{n-1}}{f^{n-1} + (1 - f)^{n-1}} \qquad (5)
If the linear scoring rule is used (n = 1), Equation 5 yields p = 1/2 regardless of the reported
value f, so the optimal report lies at a boundary (f = 0 or f = 1) rather than at the forecaster's true
belief. However, if the Brier score is used (n = 2), the best possible expected score is obtained
when f = p.
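The propriety argument can also be checked numerically. The sketch below evaluates the expected penalty over a grid of possible reports f for an arbitrarily chosen true belief p; for the linear rule it uses the absolute difference |f − x| so that both rules act as penalties, an assumption made here only for the illustration.

```python
# Expected penalty E(B) = p*|f - 1|**n + (1 - p)*|f|**n; with n = 2 this equals
# Equation 3 (the Brier score), with n = 1 it is the (absolute) linear rule.
def expected_score(p, f, n):
    return p * abs(f - 1) ** n + (1 - p) * abs(f) ** n

p = 0.7  # forecaster's true belief (arbitrary illustrative value)
grid = [i / 100 for i in range(101)]  # candidate reported probabilities f

best_linear = min(grid, key=lambda f: expected_score(p, f, n=1))
best_brier = min(grid, key=lambda f: expected_score(p, f, n=2))

print(best_linear)  # 1.0 -> the linear rule is optimized at a boundary, not at p
print(best_brier)   # 0.7 -> the Brier score is optimized by honest reporting, f = p
```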
Because the Brier score is a summary score, it indicates the overall goodness of fit
between judgments and realized outcomes. Although the Brier score is a useful index to gauge
probability assessors’ performance, it reveals little about how judgments differ in terms of
calibration and discrimination (also known as resolution), which are the two important attributes
of probability judgments (Stone & Opel, 2000). A judge is said to be well-calibrated when his or
her probability response to a target event matches the proportion of times that the class of the
target event occurs. On the other hand, a judge with a high discrimination skill can indicate when
a target event occurs versus when it does not occur. There is a trade-off between calibration and
discrimination such that well-calibrated judgments are often less resolute and vice versa. Thus,
two judges can have the same Brier score, but their calibration and resolution scores can be very
different.
To examine the calibration and resolution properties of subjective judgments for events
with binary outcomes, Murphy (1973) provided a decomposition rule,
PS = \bar{C}(1 - \bar{C}) + N^{-1} \sum_{j=1}^{J} N_j (p_j - \bar{C}_j)^2 - N^{-1} \sum_{j=1}^{J} N_j (\bar{C}_j - \bar{C})^2 \qquad (6)

where \bar{C} is the overall proportion of correct predictions, \bar{C}_j is the proportion of correct
predictions in the jth probability interval, N_j is the number of judgments falling in that interval,
and p_j is the probability estimate associated with the jth interval. The first term reflects the
variance of the outcomes, the second is Murphy's calibration index, and the third is his resolution
(discrimination) index.
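A minimal sketch of Equation 6, using hypothetical data and binning judgments by their stated probability, shows how the uncertainty, calibration, and resolution components recombine into the overall Brier score:

```python
from collections import defaultdict

# Hypothetical forecasts and outcome indicators (1 = the judged event occurred).
forecasts = [0.9, 0.9, 0.7, 0.7, 0.7, 0.5, 0.3]
outcomes = [1, 1, 1, 0, 1, 0, 0]
N = len(forecasts)

base_rate = sum(outcomes) / N              # C-bar: overall proportion correct
uncertainty = base_rate * (1 - base_rate)  # first term: outside the judge's control

bins = defaultdict(list)                   # group judgments by stated probability
for p, d in zip(forecasts, outcomes):
    bins[p].append(d)

calibration = sum(len(ds) * (p - sum(ds) / len(ds)) ** 2 for p, ds in bins.items()) / N
resolution = sum(len(ds) * (sum(ds) / len(ds) - base_rate) ** 2 for p, ds in bins.items()) / N

brier = sum((p - d) ** 2 for p, d in zip(forecasts, outcomes)) / N
print(brier, uncertainty + calibration - resolution)  # equal up to floating-point error
```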
In his covariance framework, Yates (1990) decomposed the Brier score further:
\overline{PS} = \text{Variability} + \text{Bias}^2 + \text{Variability} \times \text{Slope} \times (\text{Slope} - 2) + \text{Scatter} \qquad (7)
Because Yates's decomposition approach provides the most comprehensive
examination of probability judgments, it is described here in more detail. The mathematical
expressions for the indexes on the right-hand side can be found in Yates (1990) and will be
presented in later chapters; this section focuses on the substantive interpretations of these indexes. The
variability index simply reflects the proportion of times the target events occur, and this index is
outside the assessor’s control. In behavioral research, this index is often taken as a measure of
difficulty of an assessment task. The bias term reflects the difference between the mean of the
subjective probability estimates and the proportion of times the target events (or rather the class
of the target events) happen. A positive score indicates an upward bias in which respondents’
probability responses, on average, are higher than the base rate. A squared bias score, known as
reliability-in-the-small, is conceptually related to Murphy’s calibration index (Murphy, 1973). A
lower calibration score is better. The term slope measures resolution—the ability of an assessor
to use appropriate labels to discriminate an occurrence from a non-occurrence event—and this is
conceptually related to Murphy’s discrimination index. A higher resolution score is desirable.
Finally, scatter reflects the degree of “noise” in judgments. Conceptually, scatter is similar to the
standard deviation of one's subjective probability estimates. Thus, less scatter is usually more
desirable.
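The sketch below computes these components for hypothetical data under one common formulation of the indexes (bias as the mean forecast minus the base rate, slope as the difference between the mean forecasts given occurrence and non-occurrence, scatter as the pooled within-outcome variance of the forecasts) and verifies that Equation 7 recovers the Brier score; the exact definitions used in this dissertation appear in later chapters, so treat this as an approximation.

```python
from statistics import mean, pvariance

# Hypothetical forecasts and outcome indicators (1 = the target event occurred).
forecasts = [0.9, 0.9, 0.7, 0.7, 0.7, 0.5, 0.3]
outcomes = [1, 1, 1, 0, 1, 0, 0]
N = len(forecasts)

d_bar = mean(outcomes)
variability = d_bar * (1 - d_bar)            # outcome variance; task difficulty
bias = mean(forecasts) - d_bar               # positive = responses above the base rate

f_when_1 = [f for f, d in zip(forecasts, outcomes) if d == 1]
f_when_0 = [f for f, d in zip(forecasts, outcomes) if d == 0]
slope = mean(f_when_1) - mean(f_when_0)      # resolution: separation of conditional means
scatter = (len(f_when_1) * pvariance(f_when_1) + len(f_when_0) * pvariance(f_when_0)) / N

brier = sum((f - d) ** 2 for f, d in zip(forecasts, outcomes)) / N
print(brier, variability + bias ** 2 + variability * slope * (slope - 2) + scatter)
```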
Area under the Receiver Operating Characteristic Curve
Signal Detection Theory (SDT) provides an alternative framework to analyze subjective
probability judgments (Levis, 1985). SDT has been applied to a variety of detection problems in
medical decision-making, forensic sciences, clinical psychology, and many other fields (see
Swets et al., 2000 for a review). A primer on SDT is necessary to discuss its application in this
research.
SDT quantifies the ability of a human (or machine) detector to distinguish signal from
noise. In a prediction problem, this means a forecaster is asked to predict whether an event will
occur by a certain date. If the prediction is binary, there are four possible outcomes resulting
from the combination of the forecaster’s prediction and the true state of the world. For example,
the forecaster can predict that country X will soon have enough uranium to build nuclear
weapons. The decision is considered a “hit” if this is true and a “false alarm” if this is not true. In
contrast, if the forecaster does not issue a warning, and it is true that country X is building up its
uranium, the prediction is regarded as a “miss.” The case of binary prediction belongs to a
general class of SDT problems in which signals and noises come from two distributions. Because
the noise and signal distributions overlap, the detector uses a cut-off or a threshold to
decide whether the evidence comes from a signal or a noise distribution. An observation is
judged to be a “signal” when its value is greater (or lower) than a decision threshold. Figure 1
illustrates this idea.
In a prediction problem, this is tantamount to asking a forecaster for his probability
estimate that an event will occur by a certain date. For example, a policymaker requests an
intelligence officer to predict whether a coup will occur in the next three months in country Z.
The officer has observed some evidence suggesting a potential coup. Yet, the “evidence” could
be a sign of a military coup (“signal”), but it could also be due to other reasons (“noise”). The
policymaker believes that a coup would soon happen when the officer’s probability estimate of
the event exceeds the policymaker’s threshold. For example, when the estimate is 0.9 and the
policy maker’s threshold is 0.7, the policy maker would believe that the coup is unavoidable.
Note that the policymaker’s threshold is unrelated to the probability estimates. Yet, the
threshold determines the utility of the forecast. The decision threshold (at p = 0.7 in Figure 1)
divides the two distributions into four areas, each corresponding to one of the decision outcomes
discussed earlier. The policymaker can shift his or her decision threshold left or right. A forward
shift reduces the number of hits by requiring a larger probability to declare a potential coup, but
this also reduces the number of false alarms. A backward shift has the opposite effect, increasing
the number of hits at the cost of an increase in the number of false positives. Figure 1 provides a
visualization of the growth and shrinkage of the hit and false positive areas. By employing a
decision threshold, the continuous case of SDT is reduced to a binary forecasting problem.
[Figure 1 about here]
There are several metrics that evaluate how well a forecaster performs a forecasting task.
These metrics are: total accuracy, hit/true positive, miss/false negative, false alarm/false positive,
and correct rejection/true negative. Total accuracy refers to the probability of making correct
predictions and a hit/true positive indicates the probability of recognizing signal. A miss/false
negative is the probability of mistaking signal for noise, while a false alarm or false positive is
the probability of mistaking noise for signal. Finally, correct rejection (or true negative) is the
probability of recognizing noise. Although these metrics can be used for performance evaluation, Swets and Pickett (1982) pointed out several shortfalls of using these metrics.
Chief among them is the fact that none of these reflect the inherent trade-offs between true
positive and false positive or the freedom of a decision maker to choose the decision threshold—
his or her “bias.”
Recall that in the previous example, the policymaker can choose any points along the
abscissa as a cut-off point. Each of these choices will result in a pair of true positive and false
positive values. When these values are traced in a two-dimensional coordinate, a receiver
operating characteristics or ROC curve is created. Figure 2 is an example of a ROC curve. Point
C represents a conservative threshold that results in a small number of true positives, whereas
point L represents a liberal threshold that results in a higher number of hits. The area under the
curve (AUC) is a metric that captures the accuracy of the prediction performance. The higher the
curve toward the upper left, the higher the level of accuracy. The diagonal reflects chance
performance which is making uninformative random predictions. Another measure that is often
used together with AUC is d’ (d-prime). This index reflects how well a forecaster discriminates a
signal distribution from a noise distribution. In other words, it measures the degree of separation
between the masses of the two distributions.
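The sketch below (hypothetical Python with invented forecasts and outcomes) illustrates these quantities: the confusion-matrix rates at one possible decision threshold, and the AUC computed as the probability that a randomly chosen occurrence receives a higher forecast than a randomly chosen non-occurrence (ties counted as one half).

```python
import numpy as np

# Hypothetical probability forecasts (p) and realized outcomes (1 = event occurred).
p = np.array([0.95, 0.80, 0.75, 0.60, 0.55, 0.40, 0.30, 0.20, 0.15, 0.05])
y = np.array([1,    1,    0,    1,    0,    1,    0,    0,    0,    0   ])

threshold = 0.70                              # the decision maker's cut-off ("bias")
warn = p >= threshold

hits            = np.sum(warn & (y == 1))
false_alarms    = np.sum(warn & (y == 0))
misses          = np.sum(~warn & (y == 1))
correct_rejects = np.sum(~warn & (y == 0))

hit_rate = hits / (hits + misses)                              # true-positive rate
fa_rate  = false_alarms / (false_alarms + correct_rejects)     # false-positive rate

# AUC via pairwise comparison of forecasts for occurrences versus non-occurrences.
pos, neg = p[y == 1], p[y == 0]
diff = pos[:, None] - neg[None, :]
auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print(f"hit rate = {hit_rate:.2f}, false-alarm rate = {fa_rate:.2f}, AUC = {auc:.2f}")
```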
[Figure 2 about here]
There is a conceptual relationship between the AUC and behavioral measures of
resolution, such as Yates’ slope and Murphy’s discrimination index (Keren, 1991). These
measures quantify a forecaster’s ability to assign a higher probability to an occurrence event and
a lower probability to a non-occurrence event. Yet, there is a fine distinction. A steeper slope
depends not only on the ability to recognize occurrence events, but also on the ability to use
appropriate labels to describe the uncertainty. In contrast, the AUC simply quantifies the ability
to recognize signals from noises and does not measure how well judges can use different
probability labels (e.g. numbers, words, odds, etc.) to describe uncertainty. This chapter provides
a summary of key performance measures of probability judgments. The next chapter addresses
the first research question: Can individuals make informative probability judgments?
CHAPTER 4
A META-ANALYSIS ON THE QUALITY OF PROBABILITY ASSESSMENT
Subjective probability judgments are essential for decision-making. While data should be
used whenever possible to inform decisions, in certain circumstances data collection can be
extremely difficult, expensive, and risky. Moreover, results from different statistical models can
be inconsistent or even conflicting, creating ambiguity and confusion, and decreasing confidence
in the use of statistical information. Subjective probability judgments can address these issues by
supplying necessary and relevant estimates and by providing a mechanism to reconcile any
differences in the formal analyses of probabilities. Because subjective probabilities must be
elicited from individuals, it is important to evaluate how well people can assess uncertainty.
Although numerous studies have examined various distinct properties of subjective probabilities
(Kahneman & Tversky, 1972; Lyon & Slovic, 1976; Tversky & Kahneman, 1973; Bar-Hillel,
1980), there has not been a systematic review to assess the performance of probability
assessment.
The current study attempted to close this research gap by quantitatively synthesizing 60
years of research on probability judgments. The focus in this meta-analysis is on probability
judgments for events with discrete outcomes (as opposed to continuous variables). The objective
was to obtain an effect size that measures probability judgment performance. If individuals can
provide meaningful assessments of uncertainty, the effect size would be significantly better than
a score that would be obtained when judges make uninformative guesses. This quantification of
the effect size is important because it provides a reference for future research on probability
judgment. Such reference can be used, for example, to set performance goals in probability
training studies. This meta-analysis also investigated the moderating effects of various
methodological factors on probability judgments. This investigation could provide researchers
with a better understanding of how differences in the elicitation methods, assessment contexts,
and expertise levels can affect the quality of probability judgments.
The remaining chapters first briefly characterize the research literature on probability
judgments and describe a quantitative framework to evaluate subjective probability assessments.
Several hypotheses regarding the moderating effects of some important predictors of subjective
probability are then identified. The second section provides the sampling frame of this meta-
analysis, the data collection procedures and inclusion criteria used, and the effect size coding
procedure. In the following section, the results are presented as follows: (a) results on different
effect sizes that assess distinct qualities of probability judgments, (b) the moderating effects of
various methodological variables on the effect sizes, and (c) results from a publication bias
analysis. Finally, the main conclusions of this research are provided.
Probability Judgment Research
Past research has revealed the complexity of probability judgments but has failed to
provide an unequivocal answer to the question whether individuals can accurately assess
uncertainty. Decades of behavioral research on decision-making underscore the fallibility of
human judgments (Kahneman, Slovic, & Tversky, 1982). Research on cognitive heuristics and
biases has dominated the field of judgment and decision-making for many years, and a number
of excellent reviews have been written on the topic (Gigerenzer & Gaissmaier, 2011; Gilovich,
Griffin, & Kahneman, 2002). Perhaps most notably, Ferrell and McGoey (1980) reviewed the
literature on probability judgments and summarized the following key experimental results: 1)
individuals were overconfident in their estimates—meaning that they tend to be more confident
than they should be in their judgments, 2) overconfidence was exaggerated as the proportion of
correct responses decreased, 3) calibration could be improved with training, and 4) there was no
difference in calibration between experts and laypeople. The last point is noteworthy because it
suggests that experts’ judgments may not be better than lay judgments. This conjecture has some
support from research that shows that political experts committed a wide range of cognitive
biases when making geopolitical predictions including overconfidence, hindsight bias, and self-
serving bias in counterfactual reasoning (Tetlock, 2005). In addition, Stanovich and West (2008)
discovered that individuals were equally susceptible to the same cognitive biases regardless of
level of intelligence. More recently, Montibeller and von Winterfeldt (2015) provided an
excellent summary of common cognitive and motivational biases in expert elicitation. The
authors argued that while it may be easy to correct for the effects of cognitive biases, it is much
more difficult to control for the effects of motivational biases.
However, Stanovich and West (2000) criticized the research program on cognitive
heuristics and biases. The authors highlighted four (contested) reasons to explain why subjects in
psychological experiments often fail to make “correct” choices consistent with normative
standards set by the experimenters. These reasons are computational power limitation,
performance error, experimenters’ failure to apply correct normative rules, and subjects’
misinterpretation of the experimental tasks. Thus, when these experimental artifacts are
controlled, there is less room for errors in judgments. For example, Keren (1991) pointed out that
although the overconfidence effect is often generalized to tasks in which respondents are asked
to provide probability estimates of future events, it is mostly found when respondents are asked to
provide probability estimates for current or past events such as those in general knowledge
question tasks. In his earlier work, Keren (1988) showed that respondents were well calibrated
when assessing uncertainty for a set of psychophysical stimuli, whereas the same respondents
were poorly calibrated when assessing uncertainty in general knowledge tasks (see Chapter 2 for
a detailed discussion of this issue).
Furthermore, there are effective mechanisms to minimize the impact of cognitive biases
and improve the quality of subjective judgments. For example, results from the Good Judgment
Project (GJP) uncover several psychological strategies that can be used to enhance probability
judgments (Mellers et al., 2014). GJP was the winning team in a large-scale geo-political
forecasting tournament sponsored by the U.S. intelligence community. GJP researchers
identified several factors associated with more accurate probabilistic forecasts, and they
incorporated these factors to boost their experts’ judgment accuracy (as measured by the Brier
score). The successful factors are: teaming, recruiting, and training (Mellers et al., 2015). Each of
these strategies was found to account for a 10% improvement in prediction accuracy.
Prior research on probability judgments is fraught with inconsistencies that result from
variation in the methodological strategies employed (measurement of probability judgments,
elicitation methodologies, designs, sample sizes, etc.). For example, researchers have employed
an array of methodologies to elicit subjective judgments for a number of different tasks in
distinct assessment contexts. Respondents have been asked to complete a variety of tasks such as assessing probabilities of future events (McClish & Powell, 1989), providing confidence ratings for correct answers in general knowledge questions (Ronis & Yates, 1987), and assessing how jurors
responded in legal cases (Sieck & Arkes, 2005). Moreover, an array of probability elicitation
methods has been used in prior research. In some cases, judges were asked to simply state
probabilities for the events of interest (the full-range approach) while in others they were asked
to provide their confidence on the answers they have selected (the half-range approach)
(Lichtenstein, Fischhoff, & Phillips, 1981). Judges have also been asked to provide probabilities for
conditional events so that the probability for a target event could then be computed (Doyle,
2011).
All of these (and other) factors make it difficult to compare and qualitatively synthesize
results between studies. The solution to this heterogeneity problem is to quantitatively synthesize
the results in primary studies. Such meta-analytic approach allows for a synthesis of evidence to
address the proposed question. In addition, the effects of various methodologies in primary
studies can be modeled within a meta-analytic framework, potentially providing a richer
understanding why performance of probability judgments differ vastly between studies.
A Quantitative Approach to Assess Subjective Probability Judgments
This section reviews the key ideas regarding the application of proper scoring rules in
evaluating subjective judgments that were discussed in depth in Chapter 3. The Brier score, a
special case of the quadratic scoring rule, has been used in a number of studies on probability
judgments (see Chapter 1 for a definition and Chapter 3 for an in depth discussion of the Brier
score). As such, the Brier score can be considered a standardized measure that allows researchers
to evaluate probability assessments across (many) studies. It is a summary score that indicates
the overall goodness of fit between judgments and realized outcomes. Yates (1990) suggested that
although the Brier score reflects the overall performance, it reveals little information on how
assessors make judgments. Thus, researchers should look beyond the Brier score and examine its
various components.
Murphy (1973) presented an approach that has been used widely to decompose the Brier
score into three components: outcome index, calibration, and resolution (see Chapter 2 for
detailed discussion). The calibration index reflects the external correspondence of probability
judgments. An assessor is said to be perfectly calibrated when, for example, he or she predicts a
70% chance of rain, and it rains 70% of the times. Similar to the Brier score, a lower calibration
score is better than a higher calibration score. The discrimination/resolution index reflects how
well assessors can distinguish “successful” events from “unsuccessful” events. For example, a physician is said to have a high level of discrimination skill when he or she can distinguish malignant from benign tumors. In the current research, the Brier score is used as the primary effect size.
Because authors differed widely in their reporting practices, calibration and resolution scores were reported in some articles but not others. Thus, effect sizes based on these component scores are reported for only a subgroup of studies.
The question remains, however, what is a good Brier score? A naïve strategy is to simply
report 50%. This approach would lead to a score of 0.25 given that the Brier score is scored
conventionally, a lower score is better, and the score is normalized between 0 and 1. Individuals
can make use of the 50% category for a variety of reasons. They may choose this category because they genuinely believe the target event has a 50% chance, or they simply do not have adequate knowledge to make a more informative forecast (de Bruin et al., 2000). Regardless of the reasons, a 50% forecast for an event with binary outcomes provides decision makers with little confidence to make decisions. Another benchmark often used to evaluate probabilistic forecasts is the score assessors would obtain when they simply report the base rate. For instance, a soccer
fan may believe that the LA Galaxy soccer team has a 60% chance of winning an MLS match
next year because the team has won 60% of the games in their last season. Indeed, this is a reasonable strategy that would return a somewhat better score than what would be obtained from the naïve strategy. For instance, using equation (1) an assessor would obtain a score of 0.24 when making 60% forecasts for 10 binary events that occur six out of 10 times. On the other hand, he would obtain a score of 0.25 when he applies the naïve strategy. Although the base-rate strategy can be
a reasonable benchmark to evaluate probabilistic forecasts, the problem is that individuals may
not be aware of relevant base rates when making judgments. Indeed, empirical evidence suggests
that individuals tend to ignore base rates in making probability judgments (Kahneman &
Tversky, 1973).
In this meta-analysis, judgments with the Brier score of 0.25 or greater are considered
uninformative. These judgments are not informative because 50-50 forecasts (or forecasts close to 50-50)
provide decision makers with little confidence to make decisions. Thus, the ‘informative
judgment’ hypothesis is examined by testing whether the summary Brier score would be
significantly smaller than 0.25.
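Both benchmarks can be checked directly. The sketch below (a hypothetical Python example, assuming the conventional binary Brier score used throughout this chapter) computes the score of the naïve 50% strategy and of the base-rate strategy for ten events with a 0.6 base rate.

```python
import numpy as np

# Ten hypothetical binary events, six of which occur (base rate 0.6).
outcomes = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

def brier(forecast, outcomes):
    """Mean squared difference between a constant forecast and the outcomes."""
    return np.mean((forecast - outcomes) ** 2)

print(f"naive 50% strategy : {brier(0.5, outcomes):.2f}")   # 0.25
print(f"base-rate strategy : {brier(0.6, outcomes):.2f}")   # 0.24
```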
Some Predictors of Probability Judgments
When there is evidence (or the lack thereof) to support the claim that individuals can
make informative probability judgments, the next logical step is to identify key predictors of
probability judgments. A review of the published studies suggests that the key differences among
primary studies on probability judgments are the choice of an elicitation technique, the
assessment context, and expertise. Each of these factors may account for variability in the
summary Brier score.
The criteria to determine expertise must depend not only on the characteristics of the
probability assessors but also on the assessment domain (the context of a study). For example,
football fans are considered experts in football when there are indications that they have
specialized knowledge about football. The same fans, however, would not be considered experts
when asked about other sports. Most psychological studies make use of a convenience sample
(e.g., students), whereas studies in professional settings (e.g. economic forecasts) have subject
matter experts (SMEs) provide probabilistic forecasts. Note that there may be a trade-off such
that researchers tend to obtain a smaller sample when they recruit SMEs to participate in their
research, whereas they would get a much larger sample size when they rely on a convenience
sample.
Results from prior studies seem to suggest contradicting evidence regarding the effect of
expertise. Experts are susceptible to the same cognitive biases that affect lay assessors without
domain expertise (Tetlock, 2005). For example, Mellers et al. (2014) found that the forecasting
performance of intelligence analysts of geo-political events was significantly worse than the
performance of those who had knowledge about geo-political events but were not considered experts. Yet, weather-related probability judgments are so reliable that they have become a standard
against which judgments in other domains should be evaluated (see Stewart, Heideman,
Moninger, Reagan-Cirincione, 1992 for an example). Thus, it is unclear whether experts’
judgments are better than judgments from individuals with little or no knowledge in the
assessment domains.
The choice of an elicitation method can also be an important predictor of judgment accuracy. In
primary studies, researchers can elicit subjective probability using an array of available methods
and techniques. For example, subjective probability can be inferred from choices in gambles or
can be elicited directly by asking judges to indicate their responses on a probability scale. When
the latter approach is adopted, there are two general methods of eliciting subjective probability.
On one hand, judges are presented with different possibilities of a target event and asked to
select an outcome. They are then asked to assess the probability that their chosen answer is
correct (for a general knowledge task) or is going to happen (for a prediction/forecasting task).
Such an approach is referred to as a “half-range” method. On the other hand, in the “full-range”
approach, judges are not required to choose an answer but to simply report their probability.
Further, judges can be asked to assess probability for a single outcome of interest (single-
judgment approach) or to assess probabilities for mutually exclusive outcomes (multiple-
judgment approach). An example of the single-judgment approach would be to ask judges to
assess the probability that it was going to rain in Los Angeles by a certain date. In contrast, the
multiple judgment approach involves asking judges for both probabilities of rain and no-rain.
Various versions of these elicitation methods have been used in prior research. In this meta-
analysis, the interest was to examine whether judgment accuracy would be improved when
judges were asked to assess single or multiple probability judgments. This comparison is
important because both the single and the multiple judgment approaches have been used in past
research even though it is unclear which method leads to better judgment accuracy.
Eliciting multiple judgments may improve accuracy. When judges are asked to assess
probabilities for mutually exclusive outcomes, they may seek both confirmatory and contrasting
evidence, which may lead to better judgments. On the other hand, when they are asked to assess
the probability of a single outcome, they may be more likely to seek only confirmatory evidence.
Research on individual differences and decision-making supports this hypothesis. Baron (2008)
describes a “good thinker” as someone who seeks both evidence supporting his or her conclusion
and evidence that challenges such conclusion. Past research reveals that individuals with higher
scores on an actively open-minded thinking scale (or its variant) performed better in an array of
decision-making tasks (Kokis, Macpherson, Toplak, West, & Stanovich, 2002; Sá & Stanovich,
2001). This meta-analysis examined whether having judges assess probabilities for mutually
exclusive outcomes would lead to better judgments.
Assessment context is another important variable (see Chapter 2). In
forecasting/prediction studies, judges are asked to assess probability estimates for some events of
interest, while in feeling-of-knowing studies, judges are typically asked to indicate their
confidence in an answer they have chosen to respond to a general knowledge question. Although
both tasks require judges to assess uncertainty, research suggests that the approach to doing so
varies depending on the task. Subjects completing the confidence rating task approach the task as
if they had been asked to assess epistemic uncertainty, whereas those completing the likelihood
assessment task utilized a different strategy to assess aleatory uncertainty (Ülkümen, Fox, &
Malle, 2016).
Aleatory uncertainty stems from the stochastic process of an event whereas epistemic
uncertainty entails missing information, facts, or expertise concerning a knowable past event or
proposition (Fox & Ulkumen, 2011). Aleatory uncertainty is irreducible, but it can be estimated
via empirical observations (e.g. frequencies of target events). In contrast, epistemic uncertainty
can be reduced through gaining more knowledge or expertise in a subject. Since different types
of uncertainty appear to activate distinct psychological mechanisms, there has been a call for
researchers to exert greater caution when generalizing findings from one context to another
(Ülkümen, Fox, & Malle, 2016; Keren, 1991). Although the distinction between the two sources
of uncertainty was noted more than 20 years ago (Keren, 1991), this distinction has only recently
been examined empirically. Thus, the present study explored the effect of different sources of
uncertainty on probability judgments.
The Current Research
In this meta-analysis, a summary probability score for discrete events was quantified. A
small value of the Brier score, ideally statistically smaller than 0.25, provides support for the
claim that human judges can make informative probability judgments. Importantly, the impact of
various methodological variables on the effect size is examined. Particularly, this study
examined the relationships between the Brier score and each of the following variables:
Assessment context, elicitation method, and expertise.
Method
Inclusion Criteria
Primary studies were included in the final analyses if they met the following inclusion
criteria: having respondents make probability judgments for events with discrete outcomes,
reporting at least the Brier score, reporting the formula of the Brier score or providing
descriptions how the Brier score was computed, and reporting enough information to compute
the study-level effect size including mean(s), standard deviation(s), and sample size(s). Studies
must have been written between January 1st, 1950 and January 1st, 2017. The year of 1950 was
chosen because the Brier score was introduced at this time (Brier, 1950). Only studies written in
English were included.
Literature Search
Figure 3 presents a schematic approach of the search strategy. A total number of 69
studies were coded in this meta-analysis. A combination of database and manual searches was
conducted to retrieve relevant records. The digital search process began in ProQuest, a hosting
service that comprises 113 different databases across disciplines, including databases for dissertations and theses. The search was restricted to records in English dated between January 1st, 1950 and January 1st, 2017. The following search query was
used to retrieve digital records,
((AB(probabil* PRE/0 assess*) OR AB(probabil* PRE/0 judg*) OR AB(probabil*
PRE/0 elicit*) OR AB(probabil* PRE/0 predict*) OR AB(probabil* PRE/0 forecast*)
OR AB(confiden* PRE/2 judg*)) AND ((Brier PRE/0 scor*) OR (quadratic PRE/0 scor*)
OR calibrat* OR resolut*) AND (AB(subjective) OR AB(subjects) OR AB(respondents)
OR AB(participants))) NOT (ccl.exact("Arts, entertainment & recreation" OR "Civil
engineering" OR "Curricula" OR "Educational software" OR "Issues in Sustainable
Development" OR "Literacy" OR "Mass media" OR "Network design" OR
"Neurosciences" OR "Philosophy" OR "Software & systems" OR " 721.1 Computer
Theory (Includes Formal Logic, Automata Theory, Switching Theory, Programming
Theory)" OR " 902.2 Codes and Standards" OR " Aerospace Engineering (General)
(MT)" OR "Academic guidance counseling" OR "Academic Learning & Achievement"
OR "Agricultural economics" OR "Agriculture") AND la.exact("ENG") AND
stype.exact("Scholarly Journals" OR "Dissertations & Theses" OR "Books" OR
"Reports" OR "Working Papers" OR "Conference Papers & Proceedings" OR "Other
Sources")).
The notation “PRE/0” was used to indicate that the term on its left should immediately precede the term on its right. For instance, the expression “probabil* PRE/0 assess*” requests
ProQuest to search for all records that contain either one of the following phrases: probability
assessment(s), probabilistic assessment(s), or probabilistically assessing. The query instructed
ProQuest to search for the keywords in the articles’ abstracts. A preliminary exploration of the
records suggested that records in certain fields were clearly irrelevant. Thus, these records were
excluded prior to initial screening (“NOT” in the query). The specific search strategy was
designed to minimize the number of false alarms because the term “probability” is used
ubiquitously in sciences.
The query returned a total of 466 records. Records were screened by reviewing the
abstract and skimming the entire article to ascertain whether the effect size was reported.
Records with adequate relevant information (i.e., those that met the inclusion criteria) were advanced to the coding phase. Some records included multiple studies or experiments. For purposes of this meta-
analysis, the results from these studies were considered independent and each nested
experiment/study was counted as a standalone record. A total of 38 records from this initial
search were included in the coding phase.
[Figure 3 about here]
Additional efforts were made to supplement the ProQuest results. First, an ancestry
search was conducted based on the reference sections from the ProQuest-retrieved records.
Second, additional articles were retrieved after searching specific journals that have a strong
focus on publishing judgment and decision-making studies. The keyword “Brier” was used to
search for relevant records in the following journals: Judgment and Decision Making, Decision,
Decision Analysis, Risk Analysis, Journal of Risk and Uncertainty, Management Science,
International Journal of Forecasting, and Journal of Medical Decision Making. An additional 29
records were included in the coding phase.
An important requirement, if not a standard in meta-analysis, is the collection of
unpublished works. Several ‘listserv’ calls have been made to invite researchers to submit their
unpublished manuscripts or raw data. Contacts have been made to members of the following
societies and organizations: Society of Judgment and Decision Making, Society of Mathematical
Psychology, Society of Risk Analysis, Decision Analysis Group, European Association for
Decision Making, and Medical Decision Making Society. Despite these efforts, no single
unpublished study was obtained. However, two unpublished experiments from this dissertation
were included.
Effect Size Computation
Brier scores, standard deviations, and sample sizes from primary studies were extracted
to compute the summary effect size (ES). Study-level effect size was denoted as Brier* to
indicate that the Brier score was weighted (w) by its inverse variance. The following formula was used to compute the study-level effect size, or the weighted Brier score, from each study:
$$Brier^{*} = Brier \times w \qquad (8)$$
where
$$w = \frac{1}{V} \qquad (9)$$
and V is the variance of the Brier score, computed from the sample size and standard deviation as
$$V = \frac{SD^{2}}{N} \qquad (10)$$
A total of 69 studies were included in the analyses. The summary effect size was then computed:
$$ES = \frac{\sum_{i=1}^{69} Brier_{i}^{*}}{\sum_{i=1}^{69} w_{i}} \qquad (11)$$
The effect sizes for calibration and resolution scores were computed using the same
approach.
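A minimal sketch of this computation (hypothetical Python with made-up study-level values; the actual analyses were run in dedicated meta-analysis software) implements equations (8) through (11).

```python
import numpy as np

# Hypothetical study-level data: mean Brier score, standard deviation, and sample size.
brier = np.array([0.18, 0.24, 0.21, 0.30, 0.15])
sd    = np.array([0.10, 0.12, 0.08, 0.15, 0.09])
n     = np.array([120,   45,  200,   30,   80])

v = sd ** 2 / n                       # variance of each study's Brier score (Eq. 10)
w = 1.0 / v                           # inverse-variance weight (Eq. 9)
es = np.sum(w * brier) / np.sum(w)    # weighted summary effect size (Eqs. 8 and 11)

print(f"summary Brier score: {es:.3f}")
```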
Effect Size Coding
Brier scores from primary studies were linearly transformed to the conventional standard
(between 0 and 1 where lower scores are better).¹ I was able to obtain mathematical formulas or
descriptions of how the Brier scores were computed in most of the primary studies. When
mathematical formulas or descriptions were not provided, the primary authors were contacted for
additional information. Records were not coded when it could not be determined how to scale the
Brier score. Calibration and resolution scores were coded according to Murphy’s decomposition
formula (see Chapter 3). Means, standard deviations, and sample sizes were coded to compute
the three summary effect sizes: the Brier score, a resolution (or discrimination) score, and a
calibration score.
The following coding procedure was established. A standard coding form and manual
(see the Appendix) were created to facilitate the coding process. The coding manual has detailed
instructions on how to code the effect sizes. A second-year PhD student in quantitative
psychology trained in Bayesian data analyses and decision analyses assisted with the coding
process. The same five studies were coded by the two researchers every week. Consultation
sessions were held to discuss coding practices and resolve any issues. After coding 20 studies
and holding four consultation sessions, both of us coded the same number of independent
records. We continued holding weekly consultation sessions to discuss any coding issues. Any
differences were resolved by consensus.
Because studies vary greatly in design and methodology, the following decision rules
were developed to ensure uniformity in effect size coding. When a primary study employed an
experimental design, the results of the control group (if provided) were coded. When it was impossible to determine a control group, the experiment was not coded. When a study employed a within-subject pre-post design, results were coded in the pre-treatment phase. When a study had both experts and laypeople make judgments, results were coded for each subgroup separately as if they were independent studies. When a study reported results for different non-experimental subgroups (such as males versus females), results were combined. Pooled standard deviations were used for results combined from multiple groups or when group-specific standard deviations were not available.

¹ Variances of transformed scores were also transformed where appropriate.
Thirty-six studies did not have standard deviations. These studies’ missing standard
deviations were computed using comparable values found in similar studies. There are multiple
methods to impute missing values in meta-analyses (Chowdhry, Dworkin, & McDermott, 2016).
A simple approach is to match studies with missing values (e.g. variances) with similar studies
that report the missing information. In the current research, studies were matched based on the
expertise of the sample, the assessment context, and the elicitation method. When there were
multiple matches, median values were used to impute the missing information.
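The matching-and-median step might look like the following sketch (hypothetical Python/pandas with invented rows; the actual coding sheet and moderators differed in detail), where a study missing a standard deviation receives the median of studies matched on expertise, assessment context, and elicitation method.

```python
import pandas as pd

# Hypothetical coding sheet with two missing standard deviations.
studies = pd.DataFrame({
    "expertise": ["expert", "lay", "lay", "expert", "lay"],
    "context":   ["predict", "FOK", "FOK", "predict", "FOK"],
    "method":    ["single", "single", "single", "single", "single"],
    "sd":        [0.08, 0.12, 0.10, None, None],
})

keys = ["expertise", "context", "method"]
group_medians = studies.groupby(keys)["sd"].transform("median")  # median SD of matched studies
studies["sd_imputed"] = studies["sd"].fillna(group_medians)
print(studies)
```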
Variable Coding
Table 1 provides a list of substantive predictors that were coded and their definitions.² I
described the coding procedure for the three predictors of interest: Assessment context,
elicitation method, and expertise. An assessment context was coded “prediction” when
respondents in a primary study were asked to predict/forecast events that would occur in future
time. The context was coded “feeling-of-knowing” (FOK) when the prediction condition was not
met. As such, FOK studies included subjective probabilities assessed in a variety of contexts such as general knowledge question tasks, psychophysical perception tasks, and pedagogical settings. All these tasks were coded as ‘feeling-of-knowing’.

² Meta-information variables (e.g. author name, record type, published years, etc.) are not shown.
Generally, judges were considered ‘experts’ when authors had a criterion (or criteria) to
examine their respondents’ expertise and the included respondents met this criterion (or criteria).
In addition, probability assessors were considered experts when they were asked questions about
topics for which they had substantial knowledge; this information was easily determined from
the primary texts. For example, medical students would be considered “experts” when they were
asked to answer medical exam questions. Likewise, sports editors would be considered “experts”
when they were asked to predict outcomes of their sports. When either one of the two coders
coded the sample as “experts,” a discussion was held to further examine the result.
Disagreements in coding were resolved through deliberation and consensus. When such
disagreements persisted, the studies were excluded from the analyses.
Finally, elicitation method was coded “single” when respondents were asked to provide
probabilities for single outcomes. The method was coded “multiple” when respondents were
asked to provide probabilities for mutually exclusive outcomes. For example, we would code a
method “single” when judges are asked to assess the probability of rain the next day. We would
code the method “multiple” when judges are asked to assess both the probability of raining and
the probability of not raining.
[Table 1 about here]
Analytical Methods
The current meta-analysis followed an approach described by Borenstein, Hedges,
Higgins, & Rothstein (2009). As described earlier, Brier scores from primary studies were
weighted by their respective inversed variance. A summary Brier score or the effect size was
then computed. A random-effect meta-analysis was performed to test whether the summary
effect size was different from the reference point 0.25. The random-effect model was chosen
because it makes a more realistic assumption (compared to the fixed-effect model) that true effect sizes are normally distributed and that each individual study estimates a value drawn from this distribution. The 95% confidence intervals were reported. Tau statistics were reported to
represent the standard deviation of the true effect sizes in the random-effect model. The
DerSimonian-Laird method (1986) was used to estimate the between-study variance.
Subgroup analyses were conducted to examine the effects of the proposed predictors (expertise, assessment context, and elicitation method). Mixed-effect models were used to compare the effect sizes
between subgroups. The mixed-effect model presumes that variation in effect size between
studies is random, while the difference between subgroups is fixed. Since the interest was to
examine differences between subgroups, the mixed-effect model was appropriate. The number of
studies for each subgroup analysis was different because studies differed in their reported
information. The notation k (instead of N) was used to indicate the number of studies included in
each analysis (with N demarcating the sample size of any of the k studies included in the
analysis). A meta-regression analysis was also conducted to examine the effects of various
predictors while controlling for the effects of other covariates. The Knapp-Hartung method was
used to estimate the between-study variance (Tau squared) in meta-regression analyses (Knapp
& Hartung, 2003). The analyses were conducted using the software Comprehensive Meta
Analyses.
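As a transparency aid, the sketch below (hypothetical Python with made-up inputs; the reported analyses used the dedicated software named above) illustrates the DerSimonian-Laird estimate of the between-study variance and the resulting random-effects summary, including a z-test against the 0.25 reference point.

```python
import numpy as np

# Hypothetical study-level Brier scores and within-study variances.
y = np.array([0.18, 0.24, 0.21, 0.30, 0.15, 0.27])
v = np.array([0.0008, 0.0032, 0.0003, 0.0075, 0.0010, 0.0020])

# DerSimonian-Laird estimate of the between-study variance (tau^2).
w_fixed = 1.0 / v
y_fixed = np.sum(w_fixed * y) / np.sum(w_fixed)
q = np.sum(w_fixed * (y - y_fixed) ** 2)
c = np.sum(w_fixed) - np.sum(w_fixed ** 2) / np.sum(w_fixed)
tau2 = max(0.0, (q - (len(y) - 1)) / c)

# Random-effects summary, its standard error, a 95% CI, and a z-test against 0.25.
w_re = 1.0 / (v + tau2)
mu = np.sum(w_re * y) / np.sum(w_re)
se = np.sqrt(1.0 / np.sum(w_re))
z = (mu - 0.25) / se
print(f"tau^2 = {tau2:.4f}, summary = {mu:.3f} "
      f"[{mu - 1.96 * se:.3f}, {mu + 1.96 * se:.3f}], z vs 0.25 = {z:.2f}")
```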
Results
Description of Studies
A total of 69 records were included in this meta-analysis, 49 of which were peer-
reviewed articles and 20 were dissertations. Overall, probability assessments of 99,997 questions
from 24,667 individuals were collected. While 16 studies elicited judgments from experts, 51
studies elicited judgments from non-experts. The coders could not agree on how to code the type
of assessor for two studies. In the first, respondents were asked to assess prognostic estimates of
their treatments (Arkes et al, 1995) and in the second study, business students were asked to
forecast stock market behaviors (Önkal, Yates, Simga-Mugan, & Öztin, 2003). Consequently,
these were excluded from the predictor analysis of expertise. There were 35 records coded as
FOK studies and 25 coded as prediction studies. The coders disagreed on how to code the
assessment context of nine studies. The assessment contexts in these studies were predicting
personality (Greblo, Mirels, & Herbert, study 1 and 2, 1997), predicting roommates’ preference
(Brake, 1998), judging whether drawings were completed by European or Asian children
(Lichtenstein & Fischhoff, study 1, 1977), judging whether handwritten phrases were completed
by European or American adults (Lichtenstein & Fischhoff, study 3, 1977), assessing confidence
in a perceptual (vision) task (Kvidera & Koutstaal, study 1, 2, and 3, 2008), and judging the
periods of artworks (Stone & Opel, 2000). As a result, these studies were excluded from the
analysis of assessment context.
A majority of the studies included in this meta-analysis asked respondents to provide probabilities for
single outcomes (k=61). Only eight studies had respondents make judgments for mutually
exclusive outcomes. Reporting practices also varied across studies. Murphy’s calibration and
resolution indexes were reported in 23 studies while all of Yates’ performance indexes were
reported in only two studies. Researchers in eight studies reported indexes from both
decomposition approaches. A calibration score (based on either Murphy’s or Yates’ decomposition) was reported in 30 studies, whereas a discrimination/resolution score (based on either approach) was reported in 23 studies. The top three journals in the dataset were:
Journal of Behavioral Decision Making (k=20), Organizational Behavior and Human Decision
Processes (k=16), and Medical Decision Making (k=6). Researchers in 32 studies reported
standard deviations for the Brier scores or provided enough information that standard deviations could be computed, leaving 37 records without standard deviations. Missing values were imputed using the matching approach described above.
Performance of Probability Judgments
A random-effect model was conducted to estimate the summary Brier score. The effect
size was $\overline{PS}$ = 0.22 (N = 69, 95% CI [0.20, 0.24]). The standard deviation of the effect size was T = 0.09. The null hypothesis that $\overline{PS}$ = 0.25 was rejected, Z = -3.15 (p = .002). Yet, there was large heterogeneity, or systematic variability, left to be explained. Indeed, about 99% of the variability in the effect size was not accounted for ($I^2$ = 99.92%) and the test of heterogeneity was significant, Q(68) = 82189.43 (p < .001). Figure 4 displays a forest plot of the effect sizes.
Note that the studies were sorted based on the Brier score in an ascending order. Table 2
provides a summary of the effect sizes.
[Figure 4 about here]
[Table 2 about here]
Another random-effect model was conducted to estimate the summary calibration score.
The effect size was $\overline{C}$ = 0.07 (k = 30, 95% CI [0.06, 0.08]). The standard deviation of the effect size was T = .03. About 98% of the variability in the effect size was not accounted for ($I^2$ = 98.48%) and the test of heterogeneity was significant, Q(29) = 1909.31 (p < .01). Another random-effect model was conducted to estimate the summary resolution/discrimination score. The effect size was $\overline{R}$ = 0.06 (k = 23, 95% CI [0.03, 0.08]). The standard deviation of the effect size was T = .05. About 99% of the variability in the effect size was not accounted for ($I^2$ = 99.75%) and the test of heterogeneity was significant, Q(22) = 8959.23 (p < .001).
Subgroup Analyses
The results reveal a significant difference in $\overline{PS}$ between experts and laypeople ($Q_{between}(1)$ = 13.21, p < .001, with $\overline{PS}_{expert}$ = 0.16, 95% CI [0.12, 0.20], $T_{expert}$ = 0.07, $k_{expert}$ = 16, whereas $\overline{PS}_{lay}$ = 0.24, $T_{lay}$ = 0.09, 95% CI [0.22, 0.265], $k_{lay}$ = 51). Thus, the null hypothesis that judgments made by experts and laypeople have the same Brier score was rejected. How probability is elicited also had a significant effect on the Brier score ($Q_{between}(1)$ = 4.23, p = .04, with $\overline{PS}_{single}$ = 0.21, 95% CI [0.19, 0.23], $T_{single}$ = 0.085, $k_{single}$ = 61, and $\overline{PS}_{multiple}$ = 0.30, $T_{multiple}$ = 0.13, 95% CI [0.22, 0.39], $k_{multiple}$ = 8). Surprisingly, eliciting multiple judgments worsened the Brier score. Note that even though the null hypothesis was rejected, the small number of studies that employed multiple judgments rendered the estimated effect size less precise under this condition. Indeed, the 95% CIs overlapped. There was also a significant difference in the assessment context ($Q_{between}(1)$ = 7.82, p < .005, with $\overline{PS}_{FOK}$ = 0.23, 95% CI [0.21, 0.26], $T_{FOK}$ = 0.076, $k_{FOK}$ = 35, whereas $\overline{PS}_{predict}$ = 0.19, $T_{predict}$ = 0.09, 95% CI [0.16, 0.21], $k_{predict}$ = 25). Thus, the null hypothesis that assessment context has no effect on the overall judgment accuracy was rejected. Judgments from prediction studies were much better than judgments from FOK studies.
Subgroup analyses also reveal a significant difference in $\overline{C}$ between experts and laypeople ($Q_{between}(1)$ = 9.17, p = .002, with $\overline{C}_{expert}$ = 0.04, 95% CI [0.01, 0.06], $T_{expert}$ = 0.03, $k_{expert}$ = 6, whereas $\overline{C}_{lay}$ = 0.08, $T_{lay}$ = 0.03, 95% CI [0.06, 0.09], $k_{lay}$ = 24). Thus, expert judgments were twice as well calibrated as lay judgments. However, there was no significant difference in the assessment context (p > .05). The effect of elicitation method (single vs. multiple) was not examined because multiple judgments were only elicited in a single study.
Subgroup analyses also reveal a significant difference in $\overline{R}$ between experts and laypeople ($Q_{between}(1)$ = 5.67, p = .002, with $\overline{R}_{expert}$ = 0.10, 95% CI [0.06, 0.14], $T_{expert}$ = 0.04, $k_{expert}$ = 5, whereas $\overline{R}_{lay}$ = 0.04, $T_{lay}$ = 0.05, 95% CI [0.02, 0.07], $k_{lay}$ = 18). Thus, expert judgments were more resolute, or discriminating, than lay judgments. However, although judgments in FOK studies were more resolute than judgments in prediction studies, the effect of the assessment context just missed the significance level. The effect of elicitation method (single vs. multiple) was not examined because multiple judgments were elicited only in a single study.
As a reliability check, the results were compared across coders. The results suggested that
the identity of the coder had no effect on performance scores (p > .05).
Meta-Regression Analyses
There could be associations among the covariates. For example, studies that evaluate
judgments in a prediction context may be more likely to recruit experts to participate than studies
that evaluate judgments in FOK tasks. Thus, a multiple meta-regression was conducted to
examine the effects of several predictors while controlling for the effects of other covariates. In
addition, a new variable was added into the regression analyses that represented the number of
elicitation questions. The effect size was regressed on the following variables: expertise (expert
vs. lay), assessment context (FOK vs. prediction), elicitation method (single vs. multiple), and
the number of elicitation questions.
The test that all coefficients (except the intercept) were zero was significant (F(4, 53) = 6.24, p < .001, k = 58³). The predictors accounted for a modest 13% of the total systematic between-study variance. Judgments from experts were better than those from laypeople such that $\overline{PS}_{expert}$ was 0.08 lower than $\overline{PS}_{laypeople}$ (95% CI [0.03, 0.13], t(53) = 2.78, p = .003). Similarly, judges performed better when they were asked to provide probability judgments for a single outcome such that $\overline{PS}_{single}$ was 0.11 lower than $\overline{PS}_{multiple}$, 95% CI [0.04, 0.18], t(53) = 3.24, p = .002.
Interestingly, while the significant effects of expertise and elicitation method remained,
the effect of the assessment context disappeared. This suggested that the effect of assessment
context was confounded by at least one of the covariates. Table 3 displays the correlations
among the covariates. Compared to FOK studies, prediction/forecasting studies were more likely
to have respondents make multiple probability judgments (r = .34). To further examine the
results, stepwise meta-regression analyses were conducted. After the effect size was regressed on
the variable ‘Assessment Context,’ one of the three potential confounders (expertise, elicitation
choice, and the number of questions) was included. A total of three stepwise regression models
were run with respect to the three potential confounders. In the first model, the effect size was
regressed on the assessment context. The model was significant (F(1, 56) = 4.91, p = .03, k =
58). The Brier score in prediction/forecasting studies was 0.05 units lower than the Brier score
in FOK studies, t(56) = -2.22, p = .030. In the three stepwise regression models, the effects of
assessment context were, at best, marginally significant.
[Table 3 about here]
³ Only studies with complete information were included.
A separate regression model was conducted to examine the effects of the number of
questions, expertise (expert vs. lay), and assessment context (FOK vs. prediction) on calibration
score. The test that all coefficients (except the intercept) were zero was not significant (k = 23, p >
.05). The model for resolution was also non-significant (k = 17, p > .05).
Publication bias
Because the current research did not include any unpublished studies, the effect of
publication bias remained unknown. However, it is important to point out that the conventional
logic of publication bias may not apply to the current research. In a typical meta-analysis,
publication bias exists when there are missing studies with small sample sizes, hence large
standard errors. Importantly, this missingness is systematic such that only (small) studies with
large effect sizes are reported whereas similar studies with small effect sizes are omitted.
Conventional methods to examine publication bias such as the funnel plot, the Egger’s test, and
the trim-and-fill methods are based on this logic (Borenstein, Hedges, Higgins, & Rothstein,
2009). However, these methods are applicable to cases in which a larger effect size is better, whereas in the current research a smaller effect size is desirable. Thus, it would be illogical to
apply the conventional methods.
The real concern about publication bias is the possibility that there is a systematic
difference in the effect size between published and non-published studies. More accurately,
publication bias refers to the systematic difference in the effect size between studies published in
peer-reviewed journals and studies that do not pass the peer-reviewed process. Although
dissertations and master’s theses are usually accessible via online databases, they are not considered published papers. In fact, dissertations and master’s theses are frequently classified into a category known as the ‘gray research literature,’ which includes dissertations, theses, technical reports, book chapters, conference proceedings, etc. (Hartling et al., 2017).
Thus, a possible method to examine the effect of publication bias is to compare the effect size in
peer-reviewed articles versus those in dissertations. In fact, subgroup analyses revealed that there
were no significant differences in the summary Brier scores, calibration scores, and resolution
scores reported in dissertation versus those reported in peer-reviewed journals (ps > .05).
Another method to examine the effect of publication bias is to conduct a cumulative
meta-analysis. A cumulative meta-analysis is a meta-analysis in which each study is added
sequentially until all studies are added in the analyses. Typically, studies are sorted based on an
ascending order of standard errors (or a descending order of sample sizes), and a cumulative
analysis is conducted. When the summary effect size does not change much with the inclusion of
smaller studies (large standard errors), there is prima facie evidence that the effect of publication
bias is minimal and would not change the results substantially (Borenstein et al., 2009).
The same logic could be applied here except that studies would be sorted based on the
effect size in a sequence from smallest to largest. This was due to a concern that studies with
large values of the Brier score were omitted. When the inclusion of additional studies with large
values of the Brier score did not substantially change the summary effect size, there would be
prima facie evidence to suggest that the effect of publication bias was minimal. Figure 5 displays
the forest plot and descriptive statistics of the cumulative meta-analyses. The forest plot revealed
that the cumulative effect size moved closer to the point 0.25 when studies with larger effect
sizes were included, but the shift was not dramatic. The inclusion of additional studies with large
values of the Brier score gradually shifted the effect size. Indeed, whereas the Brier score in the
study with the best performance (Merkle, Steyvers, Mellers, & Tetlock, 2017) was 10 times
better than the Brier score in the study with the worst performance (Gurcay, study 1B, 2016), the
change in the cumulative effect size was only half of that magnitude.
[Figure 5 about here]
Conclusion
A meta-analysis was conducted to examine the hypothesis that individuals can make
informative probability judgments. Even though statistical tests revealed that the summary Brier
score across 69 studies was less than 0.25, the quality of subjective probability judgments was
modest ($\overline{PS}$ = 0.220). Equally interesting was the fact that there was large systematic variability to be explained in the effect size. All the selected variables (assessment context, expertise, and elicitation method) were significantly related to the effect size.
Despite efforts to collect unpublished studies, no single unpublished report was obtained
besides the two experiments presented (later) in this dissertation. In a typical meta-analysis,
publication bias implies that small sample size studies are unlikely to be accepted for
publications, mainly due to the null effects and/or modest effect sizes. Thus, a meta-analysis
based solely on published studies is likely to yield a large summary effect size. When this
principle is applied to the current research, this means that studies with Brier scores equal to or
greater than 0.25 (after being scaled according to the conventional standard) are less likely to be
published because they suggest uninformative judgments. However, there was a sizable portion
of studies (30%) with poor Brier scores included in this meta-analysis.
More importantly, there are reasons to believe that the effect of publication bias could be
minimal. The Brier score is often used as a measure to examine the associations between
psychological factors and probability judgments. Thus, researchers and journal editors are
probably less interested in the Brier score per se. This means that journal editors may be less
likely to reject a study with a poor Brier score. In addition, an argument could be made that the
effect of publication bias would eventually be cancelled out when the current meta-analysis
included missing studies from fields that require judgments to be made in natural settings. For
example, the Brier score has been one of the key performance metrics in evaluating weather
forecasts for many decades, but only a single study involving weather forecasts was included in
this meta-analysis. Anecdotal and empirical evidence seems to suggest well-calibrated
probability judgments in weather-related events. Thus, the summary Brier score would likely have been smaller had the (huge) number of weather-related studies been included.
In this chapter, I presented results from a meta-analytical study to examine the
fundamental question: Can individuals make informative judgments of uncertainty? Although the
results suggested evidence supporting the informative judgment hypothesis, the performance of
probability judgments was relatively modest. Thus, further research needs to examine methods to improve the accuracy of subjective probability judgments and factors associated with it. In fact,
there are a number of studies that explore the effects of alternative elicitation methods on
probability judgments that were not included in the meta-analysis. These studies evaluated
various strategies that can improve the performance of probability assessment such as
decomposition (e.g. Alpert & Raiffa, 1982; Armstrong, Deniston, & Gordon, 1975; MacGregor,
Lichtenstein, & Slovic, 1988), incentivization (e.g., Hollard, Massoni, & Vergnaud, 2016; Phillips & Edwards, 1966), and verbal expressions of uncertainty (Wallsten, Budescu, Rapoport,
Zwick, & Forsyth, 1986).
Even though there has been a strong interest in exploring verbal assessments of
uncertainty, none of the studies in this research program were included in the current meta-
analysis. In fact, verbal assessment of uncertainty is an important topic in probability judgment
research and has been studied extensively in the last several decades (Rapoport et al., 1990;
Renooij & Witteman, 1999; Wallsten, Budescu, Rapoport, Zwick, & Forsyth, 1986; Witteman &
Renooij, 2003). However, verbal assessment of uncertainty is not used very frequently, if at all, in
expert elicitation due to the concern that it is ambiguous and not as precise as numeric
probability (von Winterfeldt & Edwards, 1986). Instead, there has been a strong research focus on evaluating different methods to transform probability words (or phrases) into numbers (Windschitl & Wells, 1996), but relatively few empirical studies have been conducted
to evaluate the performance of verbal aleatory probability judgments (cf. Friedman, Lerner, &
Zeckhauser, 2017), although such studies would be pertinent to applications of verbal probability
in expert elicitation. In the remaining chapters of this dissertation, I attempted to evaluate the
effects of verbal responses on aleatory probability judgments. This evaluation, however, is
incomplete without first discussing methods to translate probability words into numbers. The
next chapter presents an innovative method based on probability theory to transform verbal
expressions of uncertainty into numeric values.
CHAPTER 5
VERBAL PROBABILITY EXPRESSIONS
While numeric probability estimates are desired for computations, verbal expressions of
uncertainty are often the preferred communication mode (Brun & Teigen, 1988; Erev & Cohen,
1990; Olson & Budescu, 1997). People rarely speak of uncertain events with precise estimates,
but they rather use words such as improbable, could, likely, maybe, doubtful, etc., to express their
uncertainties. Even professional organizations have guidelines on how to communicate
probability estimates verbally (e.g., the intelligence community, the Intergovernmental Panel on Climate Change, medical professional guidelines, etc.). From a technical perspective, an
interesting question arises regarding how to best convert verbal expressions of uncertainty into
numerical forms for computation purposes. In this section, I briefly present key findings on
research that dissects applications of verbal probability expressions. Importantly, I propose a
novel method to quantify numeric values associated with verbal expressions of uncertainty.
Verbal Expressions of Uncertainty Are Ambiguous
Teigen and Brun (1999) discussed the communicative function of probability words and
pointed out that words used to describe uncertainties often have multiple meanings. They can
refer to the degree of uncertainty, to the positive or negative valence of an outcome, to the
complete lack of uncertainty in an outcome, and/or to the locus of uncertainty. Nevertheless,
despite the inherent ambiguity in probability words, many people are resistant to the idea of
learning and using numbers to calibrate their probability estimates. Indeed, Wallsten and
colleagues (1993) found that 77% of their respondents preferred to express uncertainty verbally in everyday life. Interestingly, while people may prefer to verbally express uncertainty, they prefer to receive numeric probability information (Erev & Cohen, 1990).
Several explanations have been proposed to account for why people prefer to express
uncertainty through words. Zimmer (1983) argued that because natural language evolved much
earlier than probability theory, people have become accustomed to and more adept at using their
language to express uncertainty. In fact, children grow up using words to describe uncertainty
and they only learn to use numbers when they learn more about math, which may account for a
general preference for using verbal probability (Gelman, 1990). Additionally, expressing
uncertainty numerically is a cognitively challenging task. Erev and Cohen (1990) proposed a
distinction between spontaneous and controlled behavior to account for why people prefer to
receive numerical estimates but prefer to express uncertainty in verbal terms. When people are
asked to provide an assessment of uncertainty, they are probably not aware of a decision on
which mode (verbal or numerical) to use, and they spontaneously communicate their estimate
with the mode that they believe is easier to understand. However, when asked which mode of
probability expression they prefer to receive, they prefer to receive information in the mode that
they believe can help them make more “accurate” decisions. Others argued that a numeric
representation of probability conveys a sense of confidence, precision, and authority that people
do not necessarily have in carrying out daily conversations (Rapoport et al., 1990).
Although there are many competing reasons to explain the general preference for using
verbal probability, verbal expressions of uncertainty are ambiguous. The same person can use an
identical expression such as “likely” to describe a 60% or an 80% chance. Furthermore, two
individuals can have different interpretations of the same probability expression. Indeed, the
between-person variability has been found to be greater than the within-person variability in
ranking the orders of different probability terms (Budescu & Wallsten, 1985). For example,
although different individuals ranked the orders of distinct verbal probability terms differently,
the same individuals were consistent in ranking the orders of the same set of probability
expressions (Clark, 1990; Wallsten & Budescu, 1990). A lack of symmetry is another interesting property of verbal probability: symmetrical verbal probability expressions did not have symmetrical numeric probability values. For example, research revealed that the combined numeric values of the terms “quite likely” and “quite unlikely” covered a range of probabilities from 0.01 to 0.99 (Lichtenstein & Newman, 1967).
Methods for Quantifying Probability Words
Because verbal probability expressions are ambiguous and overlapping, it is important to
disambiguate meanings of verbal probability expressions. Wallsten et al. (1986) developed a
methodology to quantify probability words by characterizing the vagueness of verbal
probabilities by membership functions. This Membership Function (MF) approach has its
theoretical foundation in Fuzzy Set Theory (e.g. Norwich & Turksen, 1982). The key idea is that
the meaning of each probability word is fuzzy, and it can be represented by a membership
function, or a distribution of possible numeric probability values between 0 and 1. The
membership function is derived based on respondents’ judgments of how well a probability
value represents a probability term. In the earlier version of the MF methodology, respondents
were presented with a verbal probability expression and two probability wheels where each
represents a specific probability level. Using a rating scale, the respondents ranked how much
better the right-side spinner represented the value of the expression than the left-side spinner.
Respondents continued the ranking with different pairs of probability wheels for the same
expression. The procedure was repeated until respondents completed the rankings for all verbal
expressions that cover the interval [0, 1] of probability values. Scale values were then assigned to
the rankings in a way that they captured the magnitude in the differences between the left and the
right spinners. These scale values were then normalized on the scale from 0 to 1, and a
membership function for each probability term is established (Wallsten et al., 1986; Rapoport et
al., 1990). An expected value for a probability term is then computed:
$$ W_v = \frac{\sum_{i=1}^{m} \mu(p_i)\, p_i}{\sum_{i=1}^{m} \mu(p_i)} \qquad (8) $$

where i = 1, . . ., m indexes the candidate probability values in the membership function for a specific expression, $p_i$ denotes a candidate probability value, $\mu(p_i)$ is the membership value assigned to $p_i$, and $W_v$ is the expected numerical value of the expression.
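As a rough illustration of Equation 8, the following sketch (Python) computes the expected numeric value of a phrase from a membership function; the candidate probability values and membership values are made up for illustration and are not taken from any study.

```python
# Minimal sketch of Equation 8: the expected numeric value of a verbal
# probability expression, computed from its membership function.
# The candidate probabilities and membership values below are hypothetical.

def expected_value(probabilities, memberships):
    """Membership-weighted mean of the candidate probability values."""
    numerator = sum(mu * p for p, mu in zip(probabilities, memberships))
    denominator = sum(memberships)
    return numerator / denominator

# Candidate probability values p_i and membership values mu(p_i) for a
# phrase such as "good chance" (illustrative numbers only).
p = [0.4, 0.5, 0.6, 0.7, 0.8]
mu = [0.1, 0.4, 1.0, 0.8, 0.2]

print(round(expected_value(p, mu), 3))  # 0.624
```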
The MF methodology has been used to quantify probability words in numerous studies
that compare verbal versus numeric expressions of uncertainty (Budescu, Karelitz & Wallsten,
2003; Budescu & Wallsten, 1995; Rapoport et al., 1990). Using this approach, researchers found
that there are more similarities than differences between verbal and numeric probability. For
example, individuals were equally susceptible to judgment biases irrespective of the response
modes (Erev & Cohen, 1990). Similar to the numeric representation of probability, verbal
expressions of uncertainty are also directional. Optimistic expressions (e.g. likely) suggest a
tendency for an event to happen whereas pessimistic expressions (e.g. unlikely) indicate the
opposite (Teigen & Brun, 1999). The two response modes also led to similar epistemic
probability judgments (Wallsten, Budescu, & Zwick, 1993).
Savage’s Approach to Quantify Verbal Expressions of Uncertainty
The membership function is a sophisticated approach to capture the vagueness of
probability words. Nevertheless, this methodology is based on Fuzzy Set Theory that has little
connection to probability theory. This dissertation proposes an alternative quantification
methodology that has its root in subjective probability theory. This ‘new’ methodology is both
simple (for subjects to complete) and theoretically sound. It is based on the long-held notion that
subjective probabilities can be inferred from choices between gambles (Edwards, 1954; Savage,
1954). For example, if a gamble pays out $100 when the Los Angeles Rams win the 2019 Super
Bowl and takes away the same amount of money when the Rams do not win, then a person who
accepts the gamble is said to believe that the probability of the Rams’ winning the Super Bowl is
greater than 0.5. This assumes that the decision maker is risk neutral and they are seeking to
maximize expected value.
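To spell out the inference, letting p denote the assessor’s subjective probability that the Rams win, accepting the gamble under risk neutrality implies a positive expected value:

$$ EV = p(100) + (1 - p)(-100) = 100(2p - 1) > 0 \iff p > 0.5 $$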
A modified titration procedure of the previous example can be used to quantify verbal
probability expressions. Figure 6 illustrates a choice between a pair of gambles presented to
assessors. In the first gamble, a blindfolded person will draw a ball from a container that contains
100 red and blue balls, the exact proportions of which are communicated to the respondent. If a
red ball is drawn, respondents win $100, and they receive nothing otherwise. The second gamble
is similar to the first, such that the assessor does not know the proportion of red and blue balls.
However, a blindfolded person who knows the exact ratio of red and blue balls in the container
will describe his or her uncertainty before drawing a ball. Importantly, the uncertainty is
expressed in natural language. As in the first gamble, drawing a red ball results in a $100 win,
and drawing a blue ball returns nothing.
[Figure 6 about here]
The key difference between the two gambles is the format of the information describing
the gamble made available to the respondent (probability assessor). In the first gamble, the
blindfolded person drawing the ball provides a numeric estimate of the probability of a red ball (p = 0.5)
to the respondent, whereas in the second gamble, the respondent receives a verbal description of
the chances of drawing a red ball (“slight chance”). Respondents can choose to play either of the
two gambles or they can choose “indifferent.” When the latter choice occurs, the assessment is
completed. When a respondent chooses to play either of the two gambles, a subsequent choice
between a new pair of gambles is presented, in which the numeric value in the first gamble is
changed to make this game more attractive. The probability of winning (red ball) is increased
when respondents chose the verbal gamble and decreased when they chose the numeric gamble
in the prior round. The procedure is repeated until respondents indicate “indifferent,” at which
point the numeric value for a specific verbal probability term can be matched with an
“equivalent” numeric probability. In principle, the procedure is continued until the respondent
becomes indifferent. Figure 7 displays the sequential logic of the gambles using the titration
methodology described above.[4]
[Figure 7 about here]
The proposed procedure is conceptually sound as it allows an inference of probability
values from choices between gambles. The procedure is simple because it relies strictly on
respondents’ binary choices between gambles and it does not require any scaling procedures.
[4] The steps in the titration methodology are arbitrary. In principle, this should not be a problem, as the gambles will continue until a respondent becomes indifferent.
Yet, a possible issue with the titration (gamble) methodology is the effect of ambiguity aversion
(Ellsberg, 1961). In the illustrative example above, an ambiguity-averse decision maker is more
likely to choose the numeric gamble because the verbal gamble is inherently more ambiguous.
As seen in Figure 6, this bias leads to an underestimation of the probability value. An ideal
correction for such ambiguity requires ground-truth knowledge of the numeric value for each
verbal expression, which defeats the purpose of the methodology itself.
A more sensible approach is to ‘neutralize’ the effect of ambiguity. A simple fix for this
bias is to change the description of the gambles to assess the complementary probabilities of
drawing a blue ball. Instead of describing the gambles in terms of drawing a red ball (“RED
frame”), the gambles are now described in terms of drawing a blue ball (“BLUE frame”). These
two frames are illustrated in Figure 8. Ambiguity aversion predicts that respondents would
choose the numeric gambles regardless of how the gambles are described. Consistent choices of
the numeric gambles in the RED frame lead to smaller values for the verbal expressions, whereas
the same pattern in the BLUE frame leads to greater values for the verbal expressions. Thus, for
each verbal expression, the average value between the RED and BLUE frames could be used to
‘correct’ for ambiguity bias. Note that the term ‘correction’ is used here to indicate that the effect
of ambiguity bias has been adjusted and does not imply any normative numeric values for the
verbal expressions.
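A minimal sketch of this frame-averaging correction (Python) follows, with hypothetical elicited values; the phrases and numbers below are illustrative only.

```python
# Sketch of the frame-averaging correction for ambiguity aversion.
# The elicited values below are hypothetical, not the study's estimates.
red_frame = {"slight chance": 0.12, "likely": 0.62}    # elicited under the RED frame
blue_frame = {"slight chance": 0.16, "likely": 0.66}   # elicited under the BLUE frame

corrected = {phrase: round((red_frame[phrase] + blue_frame[phrase]) / 2, 3)
             for phrase in red_frame}
print(corrected)  # {'slight chance': 0.14, 'likely': 0.64}
```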
[Figure 8 about here]
The Current Research
Savage’s gamble methodology provides an innovative approach to quantifying verbal expressions of uncertainty, one that is more consistent with probability theory. Yet, a possible issue with this
method is the effect of ambiguity bias. Thus, the current study was conducted to examine the
applicability of Savage’s methodology in quantifying verbal expressions of uncertainty. The goal
of this so-called ambiguity study was to estimate the effect of ambiguity bias, if any, on the
numeric estimates of verbal probability. These estimates provide reference values for future
studies on quantifying verbal expressions of uncertainty.
Method
Sample
A hundred and twenty-eight Mechanical Turk workers were invited to participate in the
ambiguity study. Respondents were asked to play a series of gambles from which the implicit numeric values of the verbal expressions could be quantified. Respondents could make up to 3 choices in each of 16 blocks, one block per expression, for a total of 48 trials.
Respondents were randomized into one of the two conditions: BLUE (n = 63) and RED groups
(n = 65). The only difference between the two groups was the description of the gambles. BLUE
respondents played gambles that were described in terms of drawing a blue ball whereas RED
respondents played gambles that were described in terms of drawing a red ball.
Procedure
All respondents were asked to play a series of gambles constructed based on the method
described earlier. A working example helps to illustrate the methodology. A respondent is
presented with two gambles. In the first one, the respondent receives a hypothetical prize of $100
when a red ball is drawn from a container that has 50 red and 50 blue balls (see Figure 4). In the
second, the respondent receives the same prize when a red ball is drawn from a container that has
some blue balls and some red balls. The respondent is told that there is a ‘slight chance’ a red
ball is drawn. If the respondent chooses to play the numeric gamble, a new pair of gambles is
provided such that the chance of drawing a red ball is reduced to 0.1 or 10% (i.e., the number of
red balls in the container is 10). On the other hand, the chance of drawing a red ball is increased
to 0.9 or 90% in a new pair of gambles when the respondent chooses to play the verbal gamble.
This sequential logic is illustrated in Figure 7. The starting value is at 0.5, where the upper arrow
pointing to 0.9 suggests that a verbal gamble is chosen whereas a lower arrow pointing to 0.1
suggests the opposite choice. The respondent continues choosing their preferred option in a
series of binary gambles until they become indifferent, meaning that they are equally happy
choosing either option, or until they reach the third trial.
When the respondent becomes indifferent, the value of the expression ‘slight chance’ is
set equivalent to the value in the numeric gamble. When a third trial is reached, and the
respondent still does not choose the option ‘indifferent,’ the numeric value of the expression is
set to the value at the end of the logic tree in Figure 7. These end-point values are derived from
the midpoints in the presumed ranges based on the respondent’s previous choices. For instance,
if a respondent has the choice pattern verbal-verbal-numeric, the presumed range of values for
the expression ‘slight chance’ is [0.9, 0.99], and the midpoint of this range is 0.945.
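A simplified sketch of this three-step logic in code (Python) follows. Only the starting value (0.5), the second-step values (0.1 and 0.9), and the verbal-verbal-numeric end range of [0.9, 0.99] come from the text; the remaining branch values are hypothetical placeholders for the end points shown in Figure 7.

```python
# Simplified sketch of the titration logic. NEXT_VALUE gives the numeric gamble
# offered after each sequence of choices; FINAL_RANGE gives the presumed range
# when a respondent makes three choices without indicating indifference.
# Values not stated in the text (marked "hypothetical") are placeholders for
# the end points shown in Figure 7.

NEXT_VALUE = {
    (): 0.5,                        # starting numeric gamble (from the text)
    ("numeric",): 0.1,              # preferred the numeric gamble (from the text)
    ("verbal",): 0.9,               # preferred the verbal gamble (from the text)
    ("verbal", "verbal"): 0.99,     # hypothetical third-step value
    ("verbal", "numeric"): 0.7,     # hypothetical third-step value
    ("numeric", "verbal"): 0.3,     # hypothetical third-step value
    ("numeric", "numeric"): 0.01,   # hypothetical third-step value
}

FINAL_RANGE = {
    ("verbal", "verbal", "numeric"): (0.90, 0.99),  # from the worked example above
    # ... the remaining three-choice patterns would be filled in from Figure 7
}

def titrate(choices):
    """choices: up to three 'verbal'/'numeric' picks, optionally ending with
    'indifferent'. Returns the numeric value assigned to the expression."""
    history = []
    for choice in choices:
        if choice == "indifferent":
            return NEXT_VALUE[tuple(history)]   # value of the current numeric gamble
        history.append(choice)
    low, high = FINAL_RANGE[tuple(history)]
    return round((low + high) / 2, 3)           # midpoint of the presumed range

print(titrate(["verbal", "verbal", "numeric"]))  # 0.945, as in the worked example
print(titrate(["verbal", "indifferent"]))        # 0.9
```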
All respondents in the study were asked to complete 16 different sessions, and each
session was designed to quantify a specific verbal expression of the 16 phrases/expressions used
in subsequent studies. The expressions/phrases were: almost impossible, highly improbable, very
unlikely, very doubtful, improbable, unlikely, doubtful, slight chance, probable, no doubt, likely,
good chance, pretty sure, very likely, highly probable, almost certain. Note that the use of
qualifiers such as ‘almost’ and ‘very’ was intended to distinguish meanings of the expressions
(Budescu, Broomell, & Por, 2009). Importantly, the descriptions of the gambles in the RED
versus BLUE conditions were different. The winning possibility in the RED gambles was
described in terms of drawing a red ball whereas the same possibility in the BLUE gambles was
described in terms of drawing a blue ball. Figure 6 displays this logic.
Results and Conclusions
Table 4 provides a summary of the mean values associated with the probability
expressions in the two conditions, the average values between the conditions, and the correction
values for ambiguity bias. Examining the table suggested that values in the RED frame were
smaller than values in the BLUE frame although the differences were relatively small. In fact,
the MANOVA test indicated that there was evidence of ambiguity bias for the values of the
probability expressions (Pillai = 0.24, F(16, 126) = 2.13, p = .01). There were significant
differences due to ambiguity bias for eight of the sixteen verbal probability expressions: Very
unlikely (F(1, 126) = 4.35, p = .04), very doubtful (F(1, 126) = 13.23, p < .01), improbable (F(1,
126) = 5.69, p = .02), doubtful (F(1, 126) = 6.20, p = .01), slight chance (F(1, 126) = 6.2, p =
.05), probable (F(1, 126) = 5.72, p = .02), likely (F(1, 126) = 5.40, p = .02), and pretty sure (F(1,
126) = 5.92, p = .02). As anticipated, numeric values associated with these expressions were
higher in the BLUE frame than in the RED frame, but the differences were relatively small,
between 2% and 5%.
In this chapter, a new method to quantify verbal probability expressions was proposed.
The proposed method adds an extra tool to help researchers and analysts to disambiguate vague
meanings of verbal probability expressions. Importantly, the effect of ambiguity aversion was
estimated, and this information could be useful for subsequent research employing the proposed
method. In the next chapter, I describe and present results from a forecasting experiment
designed to evaluate verbal forecasts of future events.
[Table 4 about here]
CHAPTER 6
EVALUATING VERBAL ASSESSMENT OF ALEATORY UNCERTAINTY
Accurate risk assessment helps organizations not only to manage undesirable risks but also to recognize opportunities. Yet, while numeric probability estimates are desired for
computations, verbal expressions of uncertainty are often the preferred communication mode.
For example, the U.S. National Intelligence Council (2007) recommended that intelligence
analysts use only five specific verbal expressions (almost certainly, probably/likely, even chance,
unlikely, and/or remote) to characterize the degree of uncertainty in their intelligence
assessments. However, verbal probability expressions are inherently ambiguous. Interpretations
of verbal expressions of uncertainty vary greatly across individuals and contexts, creating
ambiguity, misunderstanding, and a loss of accountability in risk communication (National
Research Council, 2014). A more troubling concern is that verbal probability judgments may be
more biased and less accurate than numerical probability judgments.
Evidence is mixed regarding the effects of communication mode on probability
judgments. Early empirical research suggests little difference between numerical and verbal
expressions of uncertainty (e.g., Budescu & Wallsten, 1995; Huizingh & Vrolijk, 1997). Yet,
recent research reveals important differences between verbal and numeric expressions of
uncertainty. For example, individuals were less cautious and less likely to seek additional
information when responding to verbal probabilistic forecasts (as opposed to numeric
judgments). Yet, their verbal probability estimates were less extreme compared to their numeric
counterparts (Friedman et al., 2017). These findings depict the nuances of verbal probability
assessment, requiring additional research to further examine the quality of verbal probability
judgments.
However, evaluating verbal assessment of uncertainty is a challenging task. Although
verbal and numeric assessments of uncertainty have been compared in prior research (Wallsten,
Budescu, & Zwick, 1993), such a comparison was made when the task was to assess epistemic
uncertainty, the type of incertitude that is knowable in principle but is not in practice due to
insufficient knowledge. Nonetheless, individuals and organizations are probably more concerned
about the likelihoods of events that would happen in the future, or aleatory uncertainty. Because
there are fundamental differences in judgments and behaviors when individuals are asked to
assess different categories of uncertainty (Fox & Ülkümen, 2011), it is unknown whether the
effect of verbal mode on probability judgments remains the same when individuals are asked to
assess aleatory uncertainty. In addition, most measures of probability judgments are numerical.
Thus, verbal expressions of uncertainty must be transformed into numerical values. Yet, there is
no standard methodology to achieve this purpose.
The objective of the current research is to understand whether and how verbal expression
of uncertainty differed from the conventional use of numeric probability. The next section first
reviews prior research on verbal probability judgments and then provides the hypotheses. The
chapter concludes with a presentation and discussion of the results.
The Effects of Response Modes on Probability Judgments
Previous attempts to evaluate verbal expressions of uncertainty suffer from
methodological limitations that preclude a definitive conclusion about the effect of verbal versus
numeric response mode on probability judgments. Wallsten, Budescu, & Zwick (1993) were
amongst the first to assess verbal probability judgments. The authors asked a sample of college
students to evaluate the accuracy of 300 general knowledge propositions. Students were asked to
rate the extent to which they believed each of the 300 statements were true. Importantly, students
completed the task using either a numerical probability value or by assigning one of the verbal
probability words they selected in advance to describe a full range of uncertainty. By way of a
fuzzy-set membership function, the researchers quantified numeric meanings of the verbal
expressions at an individual level so that probability scores could be computed. The results
suggested little difference between the two response modes in the spherical probability score. It
is important to note that this study concerned verbal assessments of only epistemic uncertainty,
which refers to uncertainty in the state of knowledge, and did not address aleatory uncertainty,
which refers to stochastic properties of future events. Chapter 2 presents an extensive discussion
on this issue. Because the distinction between aleatory and epistemic uncertainty is well
documented in the literature (see Keren, 1988, for a review), it is unclear whether results from
Wallsten et al. (1993) would be generalizable to studies that assess aleatory uncertainty.
A recent forecasting study addressed the effect of the assessment context. Friedman et al.
(2017) asked a sample of professional intelligence analysts to make forecasts for a wide range of
geopolitical events that would happen by specified dates. They found that those who used
probability words to make geopolitical forecasts achieved better performance (better Brier
scores) than those that used numbers to make the same forecasts, and the difference was
especially large among low performers. Yet, there was a methodological issue with this study.
Instead of translating verbal expressions into numerical values at an individual level, the
researchers codified numeric values for the verbal expressions based on results from prior
research. This is important methodologically because the meanings of verbal expressions of
uncertainty are ambiguous. Research suggests that the between-person variability in the
interpretations of verbal probabilities was greater than the within-person variability (Budescu &
Wallsten, 1985). Although individuals were consistent in ordering probability expressions over
multiple occasions, different individuals ordered the same expressions differently (Clark, 1990;
Wallsten & Budescu, 1990). Chapter 5 presents an extensive discussion on this issue.
Importantly, because the Friedman et al. (2017) study did not employ a personalized approach to quantify verbal probability, the extent to which the results of that experiment generalize to instances in which verbal probability expressions are uniquely mapped onto numerical probabilities is unknown.
Even when these methodological gaps are resolved as they are in the present study, the
effect(s) of verbal mode on probability judgments is still unclear. In fact, there are two
conflicting hypotheses regarding the effect of verbal probability judgments. Because verbal
expressions of uncertainty share many characteristics with numeric representation of probability
(Budescu et al., 2003; Budescu, Weinberg, & Wallsten, 1988), it is expected that there is little
difference in performance between verbal and numeric response modes for probability
assessment. Because probability scores are often used to evaluate subjective probability assessments, this expectation suggested that there would be no significant difference between the
response modes in a chosen probability score. On the other hand, because most people acquire
language skills before they learn about numbers and math, using numbers to express uncertainty
is like using a different language to express belief. Indeed, both laypeople and professional experts expressed a preference for using words to express uncertainty because words are more natural for
many (Moxey & Sanford, 2000). Likewise, numeric expressions of uncertainty could create less
accurate forecasts because the translation between the first (verbal) and the second (number)
languages could introduce errors. This number-as-a-second-language hypothesis suggests that
numeric forecasts would be less accurate (i.e., verbal assessment of uncertainty would lead to a
better probability score) (Friedman et al., 2017). Given the contradicting rationale, the extant
literature provides little guidance as to the nature of the effect of different response modes on
probability elicitation performance in an aleatory uncertainty task.
Whereas there may be no significant difference between the effects of response mode on
the probability score, there can be important differences between the two response modes on the
component scores of probability judgments. A probability score such as the Brier score can be
decomposed into different components, and each captures a unique aspect of probability
judgments (Yates, 1990). To get a good probability score, forecasters must balance calibration
and discrimination/resolution, two important and distinct aspects of judgment accuracy (Stone &
Opel, 2000). There is a trade-off between calibration and discrimination (also known as
resolution) such that well-calibrated judgments are often less discriminating and vice versa. This
implies that two different judges can have the same probability score, but their calibration and
resolution scores can be very different.
Wallsten, Budescu, and Zwick (1993) found verbal judgments to be more overconfident
than numeric judgments. Judges are overconfident when they indicate that an event happens
more frequently than it does. However, since epistemic uncertainty was assessed, it is unknown
whether and how findings from that study can be generalized when judges are asked to make
forecasts of future events. In fact, Friedman et al. (2017) found that professional intelligence
analysts made more extreme probability judgments when they used numbers compared to when
they used words to estimate uncertainty. A possible consequence of this behavior is that numeric
judgments could be less calibrated compared to verbal judgments. Additionally, because
probability assessors are generally overconfident, more extreme judgments may exacerbate the
overconfidence effect. Nevertheless, Friedman et al. (2017) measured neither calibration nor
overconfidence. In this research, it is expected that numeric judgments would be more extreme,
leading to lower calibration and greater overconfidence.
In contrast, verbally assessing probabilities may lead to lower resolution. Since resolution
measures the ability of an assessor to distinguish occurrence from non-occurrence, an assessor
with better resolution skills should use different (and appropriate) labels, either numbers or
words, to describe different degrees of uncertainty. That is, an assessor who uses more response
categories is likely to have a higher resolution score. Because the meanings of probability words
are fuzzy and overlapping, it may be difficult for assessors to use different verbal labels to
describe different degrees of uncertainty. As a result, it is expected that verbal judgments will be
less resolute or discriminatory compared to numeric judgments. Indeed, this might explain
Wallsten et al.’s (1993) findings. In their research, more than half of the respondents chose
between 11 and 15 words to describe a full range of uncertainty, although they were given a
much larger number of possible probability expressions. Additionally, because resolution is
partly driven by the variability in probability responses, the resolution bias hypothesis also
implied that the degree of scatter in verbal responses should be less than the degree of variability
in numeric judgments.
The Effects of Individual Differences
Because substantial individual differences are expected in forecasting accuracy (Mellers
et al., 2015), another aim of this research is to explore the effects of individual differences on
probability judgments. Expertise could be a variable that predicts quality of probability
judgments. Specifically, those who have knowledge of an assessment topic may perform better
than those who do not show the same level of knowledge. Knowledgeable judges may be better
at assessing when an event happens, and accordingly, their resolution scores would be higher
than those with less extensive knowledge. Indirect evidence supporting this hypothesis comes
from studies that compared judgments related to general knowledge questions and forecasts of
future events. For example, because answers to general knowledge questions are known, whereas
outcomes of future events are unpredictable, judgments in the former task were much more
resolute than judgments from the latter task (Tannenbaum, Fox, & Ulkumen, 2016). Similarly,
less knowledgeable judges could perceive the forecasting events as being unpredictable, partly
because of their limited knowledge, whereas ‘true’ experts could be more certain with their
assessments because they know more. As a result, it is expected that judgments from
knowledgeable judges will be more resolute.
The ability to think reflectively and analytically may also be an important predictor of the
quality of probability judgment. It is well established that individuals often rely on heuristics or
System 1 to make decisions (Kahneman, 2011). While these heuristics can serve well in many
situations, they often lead to systematic judgment errors. For example, judges may be
overconfident in their assessments when they focus their attention solely on one of the
hypotheses (Tversky & Koehler, 1994). As a result, their assessments can be less well calibrated.
On the other hand, judges who engage in more reflective thinking can mitigate biases introduced
by System 1 thinking, hence performing better (Mellers et al., 2015). Thus, it is expected that
those who are more adept at resisting the influences of the automatic System 1 thinking would
perform better, at least in terms of overconfidence, than those who are less skillful at controlling
the influences of heuristics-based thinking system.
Another factor that may relate to forecasting performance is numeracy. Research suggests
that numerate individuals are less susceptible to cognitive biases (Patalano, Saltiel, Machlin, & Barth, 2015; Peters et al., 2006; Schwartz, 1997). Thus, it is expected that those with a higher
level of numeracy skill perform better than those with a lower level of numeracy skill.
Specifically, there is an expected relationship between numeracy and one of the aspects of
probability judgment, namely consistency. A judge is said to be consistent when her probability
assessment of an event is larger than her assessment of another event whose occurrence depends on the original event (i.e., the chance of winning a conference championship is larger than the chance of
winning a Super Bowl). Because this property requires an understanding of probability rules, or
even simply a preference for using numeric probability, those with a higher level of numeracy
skill would be more likely to make consistent judgments than those with a lower level of
numeracy.
Summary of Research Questions and Hypotheses
In this research, I explore differences between numeric and verbal judgments of aleatory
uncertainty. It is expected that numeric judgments will be less calibrated, more overconfident,
but more discriminatory than verbal judgments. It is also expected that knowledgeable
respondents are more resolute than less-knowledgeable respondents. In addition, more numerate
respondents and respondents who are able to resist the influences of heuristic thinking would
perform better than those less numerate and less capable of resisting heuristic thinking.
Method
Design Overview
Football experts were recruited from Amazon Mechanical Turk to participate in an “NFL
Prediction Tournament.” Experts were selected based on their responses to a screening
instrument (described below). Half of the experts were randomized into a numeric response
condition in which they used a numeric probability scale to make judgments for 50 different
binary events after the fourth week of the NFL 2016-2017 regular season. The questions span a
variety of league events, such as outcomes of individual games, rankings in divisions, playoff results, etc. About one-third (32%) of the 50 binary target events (e.g., Team A defeats Team B
during the season) were realized (occurred) at the end of the season. The other half of the experts
expressed their uncertainties about the target events by choosing their responses from a list of 11
different verbal expressions of uncertainty.
Experts in the verbal condition were later invited to participate in a separate study,
designed to quantify the numeric values of verbal probability expressions at an individual level.
A gamble methodology (described later) was used to quantify the implicit numeric values
associated with eight probability expressions that the verbal response mode experts had used in
the judgment study (the 3 anchor terms impossible, toss-up, and certain were set at p = 0.0, 0.5,
and 1.0, and they were not quantified individually). Data were also collected from a separate study to estimate the effect of ambiguity bias on the translated probability values, and these estimates were applied to correct the elicited numeric values associated with the eight verbal expressions of uncertainty (described below). This is referred to as the “ambiguity bias study” hereafter.
All experts were paid $2.00 for their participation in the main study, $1.00 for
participation in the follow-up word-to-number transformation study, and $1.50 for participation
in the ambiguity bias study. In the main study, they were told that their performance would be
scored, and the three persons with the highest scores would receive a special bonus of $50, $20,
and $5, respectively. Only a qualitative description of the Brier Score formula was provided to
respondents. Specifically, the experts were told:
We will use something called the Brier scoring rule to evaluate your performance.
You do not need to worry about the technical details. The most important thing to
remember is that the Brier score is designed to prevent people from "gaming the
system." Thus, TO GET THE BEST SCORE, YOU NEED TO REPORT YOUR
TRUE BELIEF ABOUT THE UNCERTAINTY OF AN EVENT.
Data collection was between the fourth (September 29, 2016) and the fifth (October 6, 2016) weeks of the regular NFL season.
Experimental Manipulation
The experimental variable in the main study is the response mode. Half of the
respondents used an 11-point numeric scale whereas the other half used 11 different verbal
expressions of uncertainty to describe their uncertainty about future NFL outcomes. The numeric
scale is from 0% to 100%. The experts were told:
In the next section, you will be asked to make a number of predictions
about the NFL 2016-2017 football season. We are interested in your best
estimate of the chance from 0% to 100% that each event will happen. If
you are sure that the event will happen, choose 100%. If you are sure that
the event will not happen, choose 0%. If you feel the event as likely to
happen as it will not happen, choose 50%. In all other cases, select an
option between 0% and 100%.
The verbal list contains precisely 11 different probability words. The probability
expressions were adopted from Wallsten et al. (1993) to allow a direct comparison of results.
These words/expressions are: impossible, improbable, unlikely, doubtful, slight chance, tossup,
good chance, probable, likely, pretty sure, and certain. Wallsten et al. (1993) reported that the
meanings of the anchoring terms (impossible, toss-up, and certain) were well agreed upon. The
instructions for the verbal experts were almost identical to the instructions for the numeric
experts, except that the numeric anchors were replaced with the verbal anchors.
Quantifying Verbal Probability Expressions
Prior to computing the performance metrics, verbal responses were converted into
numerical values. Experts in the verbal group were invited to participate in a follow-up study one
week after the main forecasting experiment had finished. They were asked to play a series of
binary gambles (as described earlier) and were not told about the (true) purpose of the research.
The experts were asked to simply indicate their gamble preference in up to 8 (expressions) x 3
(steps) = 24 trials. Using the titration (gamble) methodology described in Chapter 5, each of the
eight probability expressions used in the main forecasting study (not including the anchor terms) was converted into a numerical value at an individual level. Results from the ambiguity study
(see Chapter 5) were used to adjust for the ambiguity effect. A correction was applied to each
translated value.
Dependent Variables
The Brier score was used to assess the overall accuracy of probability judgments, and it
was computed for each expert as:
$$ \overline{PS} = N^{-1} \sum_{j=1}^{N} (p_j - d_j)^2 \qquad (9) $$
where N is the total number of predicted events, $p_j$ is the subjective probability estimate that an event $X_j$ occurs, and $d_j$ equals 1 when the event occurs and 0 when it does not occur. The best
possible score is 0.0 and the worst possible score is 1.0. Calibration and resolution scores were
computed based on Yates’ (1990) decomposition approach:
$$ \overline{PS} = \text{Variability} + \text{Bias}^2 + \text{Variability} \times \text{Slope} \times (\text{Slope} - 2) + \text{Scatter} \qquad (2) $$
Accordingly, bias score is computed as the difference between average probabilities and the base
rate of the target events. The formula for Bias is:
$$ \text{Bias}_i = N^{-1}\left[ \sum_{j=1}^{N=50} p_{ij} - \sum_{j=1}^{N=50} d_j \right] \qquad (10) $$
Note that because the base rate is a constant across all experts, the expert’s bias score is a
linear transformation of his or her average probability rating. A positive score indicates that an
assessor believes, on average, an event happens more frequently than it does, and vice versa
when there is a negative score. Squaring the bias score returns a Reliability-in-the-Small or a
Calibration index.
The resolution score or slope is the difference between two mean conditional
probabilities. Thus, resolution (slope) is the difference between the mean probability of the
events that happened and the mean probability of the events that did not happen. The larger the
slope is, the more discriminatory an expert’s judgments are:
$$ \text{Slope}_i = \bar{p}_{i \mid d=1} - \bar{p}_{i \mid d=0} \qquad (11) $$
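For concreteness, the following sketch (Python/NumPy) shows how the Brier score and the Yates components defined above might be computed for a single expert; the judgments and outcomes are made-up toy data, not results from the study.

```python
import numpy as np

# Minimal sketch of the Brier score (Eq. 9) and its Yates (1990) components
# for a single expert. The judgments p and outcomes d are made-up toy data.
p = np.array([0.9, 0.7, 0.3, 0.1, 0.6, 0.2])   # subjective probabilities p_j
d = np.array([1, 1, 0, 0, 0, 1])               # outcome indicators d_j

brier = np.mean((p - d) ** 2)                  # Eq. 9
bias = p.mean() - d.mean()                     # Eq. 10 (positive = events judged too likely)
slope = p[d == 1].mean() - p[d == 0].mean()    # Eq. 11 (resolution)

# Variability is the variance of the outcome index; scatter is the
# outcome-weighted within-category variance of the judgments.
base_rate = d.mean()
variability = base_rate * (1 - base_rate)
scatter = base_rate * p[d == 1].var() + (1 - base_rate) * p[d == 0].var()

# Check the decomposition:
# PS = Variability + Bias^2 + Variability * Slope * (Slope - 2) + Scatter
reconstructed = variability + bias ** 2 + variability * slope * (slope - 2) + scatter
print(round(brier, 4), round(reconstructed, 4))  # both 0.2 for this toy data
```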
Computation of the AUC is more involved. In this study, the AUC for each expert respondent was computed using the R package ROCR (Sing, Sander, Beerenwinkel, & Lengauer, 2005). One can approximate the value of the AUC by dividing the U statistic from the Mann-Whitney test by the product of the sample sizes of the two comparison groups.[5]

[5] http://blog.revolutionanalytics.com/2017/03/auc-meets-u-stat.html
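A short sketch of this rank-based approximation (Python/NumPy, made-up data) follows; the pairwise-wins calculation is equivalent to dividing the Mann-Whitney U by the product of the two group sizes.

```python
import numpy as np

# Sketch of the rank-based AUC approximation noted above: the proportion of
# (occurred, not-occurred) event pairs in which the occurred event received the
# higher probability (ties count one half); this equals U / (n1 * n0).
# The judgments and outcomes are made-up toy data.
p = np.array([0.9, 0.7, 0.3, 0.1, 0.6, 0.2])
d = np.array([1, 1, 0, 0, 0, 1])

occurred, did_not = p[d == 1], p[d == 0]
wins = sum((x > y) + 0.5 * (x == y) for x in occurred for y in did_not)
auc = wins / (len(occurred) * len(did_not))
print(round(auc, 3))  # 0.778 for this toy data
```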
Other performance indexes. Other measures were also computed to further examine the
quality of the experts’ judgments. Because we did not ask the experts to pick an option and then assign a probability to that option, the proportion of “correct predictions” is defined as the proportion of times an expert indicates a probability greater than 0.5 for an event that occurred or a probability smaller than 0.5 for an event that did not occur.
There are 11 questions designed to check for coherent judgments.[6] These questions pertained to the chances that specific NFL teams would win their conferences and win the Super Bowl. A response was considered consistent when a subjective judgment of p(winning the Super Bowl) was less than a subjective judgment of p(winning the conference).[7] The consistency measure was simply the proportion of consistent answers (per expert).

[6] There are many ways to define "coherent judgments." I used this term loosely to describe whether judgments follow a specific probability rule.

[7] Strictly speaking, one could believe a team is "built for the playoffs" and has a better chance of winning the Super Bowl if they get to the playoffs, even as a wild-card team. This is unlikely, but it is possible to have this belief and still be coherent.
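As a small illustration, the sketch below (Python) computes the consistency proportion for one expert; the team labels and probabilities are hypothetical.

```python
# Sketch of the consistency measure: the proportion of team-level judgment pairs
# in which p(win the Super Bowl) is judged less than p(win the conference).
# The team labels and probabilities are hypothetical.
judgments = {
    # team: (p_win_conference, p_win_super_bowl)
    "Team A": (0.40, 0.25),
    "Team B": (0.30, 0.35),   # inconsistent: the Super Bowl is judged more likely
    "Team C": (0.10, 0.05),
}

consistent = [sb < conf for conf, sb in judgments.values()]
proportion_consistent = sum(consistent) / len(consistent)
print(round(proportion_consistent, 2))  # 0.67
```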
An overconfidence index was computed as:

$$ OC = \frac{1}{2}\left( \sum\left[ \bar{C}_{p<0.5} - \bar{p}_{p<0.5} \right] + \sum\left[ \bar{p}_{p>0.5} - \bar{C}_{p>0.5} \right] \right) \qquad (12) $$
This index is the average difference between the mean probability ratings and the
proportion of correct answers. A positive score indicates overconfidence, meaning that judges
believe they got more answers correct than they actually did. “Correct answers” were defined earlier. I computed the index for each expert and took the average across the
sample.
Psychological and behavioral measures. Respondents concluded the study by providing
responses to various questionnaires, including the eight-item subjective numeracy scale (Zikmund-Fisher, Smith, Ubel, & Fagerlin, 2007), the three-item cognitive reflection test
(Frederick, 2005), a self-reported 5-item measure of interests in NFL football, and a 10-item
measure of football-related fan activities. Previous research demonstrates a strong correlation
between the subjective numeracy scale and objective measures of numeracy (Fagerlin et al.,
2007). The NFL measures were different from the screening measure. Whereas the screening
measure contained factual questions, the additional NFL scales contained items that measure
interests and behavioral engagement in NFL football. These scales served as proxy measures for
expertise because those who show more interest in football and engage in more football-related
behaviors are likely to know more about football. Table 5 provides a summary of the
psychological measures and their reliabilities.
[Table 5 about here]
Sample
Recruitment strategy. I recruited Mechanical Turk workers who responded to a call for
individuals interested in American football to participate in a “sport and decision making” study.
A total of 376 volunteered for the study. These volunteers completed a screening instrument
designed to assess football expertise, comprised of four multiple-choice questions: 1) Which of
the following two teams played in the 50th Super Bowl in early 2016? 2) How many teams are
there in the NFL? 3) What are the two conferences in the NFL? 4) Do you consider yourself to
be an NFL fan? Only those who correctly answered the first three screening questions and self-identified as fans were deemed eligible to participate in the main study.[8] Two hundred and thirty-six respondents were classified as NFL experts based on their responses to the four screening questions. These respondents were deemed experts for purposes of the current study and invited to complete the main NFL judgment study. Experts in the verbal condition were invited to complete a follow-up study a week later. A total of 97 verbal experts completed the follow-up word-to-number transformation study.

[8] Answers to the factual questions were timed (< 5 seconds) to prevent respondents from consulting other sources to answer the screening questions.
Response screening. Prior to conducting statistical analyses, I checked for unreliable
responses. Forecasters can repeatedly choose the probability of 0.50 to describe their
uncertainties for all of the target events, which results in PS
̅̅ ̅
= 0.25. Exploratory analyses
suggested that the sample mean (PS
̅̅ ̅
= 0.28) was significantly greater than 0.25 (t(214) = 8.17 p
< .01), suggesting that the experts’ performances, on average, were worse than no-skill forecasts.
However, there were great discrepancies in performance, with the Brier score falling between
0.18 and 0.45.
Given the heterogeneity in performance, the following approach was used to test the
research questions. First, I removed respondents associated with a slope equal to or less than 0 or
associated with an AUC measure equal to or less than 0.50. These resolution values suggest an
inconsistency between probability responses and beliefs. For example, assessors may indicate
that a target event has a 90% chance of occurring when they actually believed that the event
would not happen. I considered these responses unreliable because these experts might not have
paid enough attention, or they might have simply made random responses to complete the study
quickly (and get paid). Thus, I removed these responses from subsequent analyses. The sample
that excludes those who did not provide reliable responses is referred to as the MIN sample to
indicate that the included experts passed the minimum screening criteria of having reasonable
resolution scores (i.e., slope is greater than 0 and AUC is greater than 0.50).
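A small sketch of this screening rule applied to a table of per-expert scores (Python/pandas, with hypothetical expert IDs and values):

```python
import pandas as pd

# Sketch of the minimum screening rule described above: keep experts whose
# judgments show at least some resolution (slope > 0 and AUC > 0.50).
# The expert IDs and scores below are hypothetical.
experts = pd.DataFrame({
    "expert_id": [1, 2, 3, 4],
    "slope":     [0.10, -0.05, 0.20, 0.00],
    "auc":       [0.62, 0.48, 0.71, 0.50],
})

min_sample = experts[(experts["slope"] > 0) & (experts["auc"] > 0.50)]
print(min_sample["expert_id"].tolist())  # [1, 3]
```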
For the MIN sample, data from 53 experts with slopes equal to or less than 0 and/or with
AUC measures equal to or less than 0.50 were removed (21 experts rejected on both criteria).
The final sample included 101 experts in the numeric condition and 82 experts in the verbal
condition. In other words, 14.4% and 15.5% of participants were removed from the number and
verbal conditions, respectively. There was no statistically significant dependency between the
experimental groups and the removed responses (χ²(1) = 1.95, p = 0.16). Table 6 displays the
characteristics of the MIN sample.
[Table 6 about here]
In summary, I collected data from two separate studies: the main judgment study and the
follow-up word-to-number transformation study. Data from the follow-up word-to-number study
allowed for a numeric translation of the verbal probabilities in the main judgment experiment at
an individual level. These numeric estimates of verbal probabilities were then corrected for the
ambiguity bias using results from the ambiguity study described in Chapter 5. Figure 9 displays
the flow of respondents in this study.
[Figure 9 about here]
Results
Quantifying probability words
Table 7 provides the means and the standard deviations of numerically translated
probability words. These estimates were corrected for ambiguity bias at the individual level.
Note that the values of the anchor terms (impossible, toss-up, and certain) were not included as
they were anchored at 0%, 50%, and 100%, respectively. Table 4 indicates that some of the symmetric verbal expressions of uncertainty are not symmetric in numeric value around 0.50
(i.e., do not sum to 1.0). For example, the numeric sum of “likely” and “unlikely” was not close
to 1.0, nor was the sum of “probable” and “improbable.” In addition, the medians of the
optimistic expressions were relatively low and close to 0.50. Finally, examining Table 4 reveals
small distinctions among verbal probability expressions in the same direction at the group level.
Examining the results at the individual level reveals greater heterogeneity. For example, the IQR
of “improbable” was between 0.06 and 0.10, suggesting that the meaning of this expression was
well agreed upon. However, the IQR of “doubtful” was somewhat wider, between 0.06 and 0.15.
[Table 7 about here]
Correlations among Performance Measures
Table 8 displays the correlations among the assessment performance measures. Highlighted in grey are correlations in the Numeric condition. In general, Brier scores were
significantly associated with their component scores including slope, scatter and bias. Some of
the component scores were significantly associated with two additional measures of probability
judgments: proportion of correct answers and the proportion of consistent judgments. Because
both AUC and slope measure resolution, they were significantly associated. Indeed, the magnitudes of the correlations across samples and groups were large, ranging from 0.67 to 0.89.
[Table 8 about here]
Group Comparison
A common method to visualize probability judgments is the covariance graph (Yates, 1990). Figure 10 displays the covariance graph of probability judgments. A total of 183 experts × 50 judgments/expert = 9,150 judgments,[9] each represented by a separate data point,
were plotted. Data points from the verbal group were represented by a triangular shape whereas
data points from the numeric group were represented by a square. The cluster of points at x = 0 (along the vertical axis) represents subjective probability judgments for target events that did not occur.
[9] Because many data points overlapped, we jittered the data points to make the distributions of probability responses recognizable.
Similarly, the cluster of points at x = 1 represents the probability judgments of the
events that occurred. The vertical dotted black line crosses the horizontal axis at the value that
indicates the proportion of times the target events occur (32%). The intersection between the
dotted black line and the diagonal represents perfectly calibrated probability judgments. The
horizontal colored (grey versus black) and dashed lines cross the vertical axis at the means of the
experts’ probability judgments, each representing a specific group. The slopes of the solid gray
and black regression lines represent the degrees of resolution in the number and verbal
conditions whereas the slope of the diagonal represents the best possible resolution score (S=1).
The intersections between the colored regression lines and the horizontal colored lines indicate
the degree of bias. When an intersection point is above the reference solid black point, judgments
are said to be biased upward, meaning that the mean probability responses are greater than the
base rate (32%).
[Figure 10 about here]
Several insights become immediately clear upon visually inspecting the covariance
graph. First, because both the intersection points of the two experimental groups were above the
perfect judgment point, experts in both groups showed an upward bias—they believed a class of
target events happened more frequently than it did. Indeed, numeric experts showed a stronger bias
compared to verbal experts. Second, because the two regression slopes were relatively flat,
experts in both groups showed a low level of resolution in their judgments. In addition, because
the regression lines were almost parallel, there was little difference in the degree of resolution
between the experimental groups. Finally, there was also little difference in the variability of
probability judgments between the two groups.
The means and standard deviations of the performance measures were reported in the
first two columns in Table 9. A series of t-tests revealed no significant effects of response mode
on the Brier score, degree of scatter, calibration, or resolution, including both the slope index and
AUC (ps > .05). However, t-tests revealed significant effects of the response mode on the degree
of bias, the proportion of consistent judgments, and the confidence index. Specifically, verbal
judgments were less biased than numeric judgments (t(142) = -3.22, p < .01) and this was
because the mean probability response was higher in the numeric condition than in the verbal
condition. Yet, verbal judgments were less consistent than numeric judgments (t(181) = -4.23, p
< .01). Both verbal and numeric judgments were overconfident, evidenced in the positive
confidence indexes. However, verbal judgments were more overconfident than numeric
judgments (t(160) = 2.05, p = .04). The effect of the response mode on proportion of correct
answers just missed significance at the .05 level (t(161) = 1.8, p = .06).
[Table 9 about here]
Examining the Effects of Individual Characteristics
To examine the effects of cognitive ability, numeracy, and interests and behavioral
engagement in NFL football on probability assessment, I conducted a series of multiple
regressions. Specifically, the Brier score, bias, slope, AUC, scatter, overconfidence index,
proportion of correct answers, and proportion of consistent judgments were regressed on the
predictors and the experimental group. Although the effect of response mode has been discussed
thoroughly, it was included in the models for statistical control.
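As a sketch of how one of these regressions might be fit (Python/statsmodels), with hypothetical column names and made-up per-expert data standing in for the study’s measures:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Sketch of one of the multiple regressions described above. The column names
# (brier, crt, numeracy, nfl_interest, nfl_activity, condition) are hypothetical
# stand-ins for the study's measures, and the rows are made-up data.
df = pd.DataFrame({
    "brier":        [0.21, 0.27, 0.24, 0.30, 0.19, 0.26, 0.23, 0.29],
    "crt":          [3, 0, 2, 1, 3, 1, 2, 0],
    "numeracy":     [5.5, 3.0, 4.5, 2.5, 6.0, 4.0, 5.0, 3.5],
    "nfl_interest": [4.0, 2.5, 3.5, 2.0, 4.5, 3.0, 4.0, 2.0],
    "nfl_activity": [6, 2, 5, 1, 7, 3, 6, 2],
    "condition":    ["numeric", "verbal", "numeric", "verbal",
                     "numeric", "verbal", "numeric", "verbal"],
})

model = smf.ols(
    "brier ~ crt + numeracy + nfl_interest + nfl_activity + C(condition)",
    data=df,
).fit()
print(model.params.round(3))  # regression coefficients (illustrative only)
```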
The multiple regression model predicting Brier score was significant, F(5, 174) = 2.86, p = .01. The five-predictor model accounted for a modest amount of the variance, R² = 8%. A unit
increase in cognitive reflection score significantly predicted a decrease of 0.03 units in the Brier
score, b(SE) = -0.03(0.01), p = .01, holding other predictors constant at their mean levels. This
means that those who scored high in this cognitive ability measure performed significantly better
than those who scored low in the measure.
The multiple regression model predicting the bias score was significant, F(5, 174) = 2.52, p = .01. The five-predictor model accounted for a modest amount of the variance, R² = 7%. Consistent with the t-test, numeric experts were more biased in their judgments than verbal experts, b(SE) = 0.04(0.01), p < .01, holding other predictors constant at their mean levels. The scatter model was statistically significant, F(5, 174) = 2.43, p = .01. The five-predictor model accounted for a modest amount of the variance, R² = 7%. A unit increase in cognitive reflection score significantly predicted a decrease of 0.02 units in the degree of scatter, b(SE) = 0.02(0.01), p < .01, holding other predictors constant at their mean levels. A unit increase in numeracy score significantly predicted an increase of 0.01 units in the degree of scatter, controlling for other predictors, b(SE) = 0.01(0.00), p = .02.
Finally, the regression model predicting the proportion of consistent judgments was
significant, F(5, 174) = 6.33, p < .01. The five-predictor model accounted for a moderate amount of the variance, R² = 15%. Numeric judgments were 20% more consistent than verbal judgments, b(SE) = -0.20(0.05), p < .01, holding other predictors constant at their mean levels. A unit increase in numeracy score significantly predicted a 6% increase in the proportion of consistent judgments, controlling for other predictors, b(SE) = 0.06(0.03), p = .05. The multiple
regression models predicting the two measures of resolution, slope and AUC, calibration,
confidence, and the proportion of correct answers were not statistically significant at p = .05.
Analyses of Performance among the Most Informative Performers
An interesting question was whether the effects of the response modes on probability
judgments were the same among the best forecasters. Thus, a subgroup of forecasters whose
Brier scores were less than 0.25 was identified. I called this sub-sample most informative experts
because their judgments were better than chance (i.e., an expert who repeatedly used the 50% (or toss-up) response category for all questions would get a Brier score of 0.25). For the most
informative experts or MAX sample, 116 expert respondents were removed from the MIN
sample because their Brier scores were greater than 0.25. Specifically, 62.37% and 64.63% were
removed from the number and verbal conditions, respectively. The final most informative MAX
sample included 38 experts in the number condition and 29 experts in the verbal condition. There
was no statistical dependency between the experimental groups and the most informative sample
group, χ²(1) = 0.03, p = .87.
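A brief sketch of how this screening step and independence check might be implemented (the data are simulated and purely illustrative; scipy's chi2_contingency function performs the test):

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical per-expert results from the MIN sample; values are illustrative.
rng = np.random.default_rng(0)
min_sample = pd.DataFrame({
    "condition": rng.choice(["number", "verbal"], size=183),
    "brier": rng.uniform(0.10, 0.45, size=183),
})

# Judgments better than an uninformative 50% guess (Brier = 0.25) define the MAX sample.
min_sample["informative"] = min_sample["brier"] < 0.25
max_sample = min_sample[min_sample["informative"]]

# Check that retention in the MAX sample does not depend on experimental condition.
table = pd.crosstab(min_sample["condition"], min_sample["informative"])
chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(max_sample.groupby("condition").size())
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.2f}")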
The last two columns in Table 9 summarize the performance measures among the most
informative (MAX) sample. A series of t-tests revealed that there were no significant effects of
response mode on the Brier score, degree of scatter, proportion of correct answers, calibration
and the slope index, ps > .05. However, t-tests revealed significant effects of response mode on
the degree of bias, the proportion of consistent judgments, and AUC. Specifically, verbal
judgments were less biased than numeric judgments, t(42) = -2.21, p = .03, because the mean probability response was higher in the numeric condition than in the verbal condition. Verbal judgments were less consistent than numeric
judgments, t(60) = -4.72, p < .01. Interestingly, when resolution was measured in terms of the
area under the ROC curve (AUC), numeric judgments were more resolute than verbal judgments,
t(49) = -2.61, p = .01.
The multiple regression models predicting the Brier score, degree of scatter, slope, proportion of correct answers, confidence, and calibration were not significant. The multiple regression model predicting the bias score was significant, F(5, 61) = 3.67, p < .01. Numeric judgments were more biased than verbal judgments, b(SE) = -0.03(0.01), p = .03. A one unit increase in cognitive reflection score significantly predicted an increase of 0.05 units in the bias score, b(SE) = 0.05(0.02), p = .04. The model predicting AUC was significant, F(5, 61) = 2.86, p = .04. The five-predictor model accounted for a moderate amount of the variance, R² = 19%. Numeric judgments were more resolute than verbal judgments, b(SE) = -0.04(0.01), p < .01, holding other predictors constant at their mean levels.
Finally, the regression model predicting the proportion of consistent judgments was
significant, F(5, 61) = 7.61, p < .01. The five-predictor model accounted for a moderate-to-large amount of the variance, R² = 33%. Numeric judgments were 35% more consistent than verbal judgments, b(SE) = -0.35(0.07), p < .01, holding other predictors constant at their mean levels. A one unit increase in cognitive reflection score significantly predicted a 28% increase in the proportion of consistent judgments, controlling for other predictors, b(SE) = 0.28(0.11), p = .01.
Conclusions
The current research evaluates the effects of different response modes on forecasting
performance and examines the roles of individual characteristics on probability assessments.
Verbal and numeric forecasts of various outcomes in the 2016-2017 NFL football season were
compared. Results revealed that compared to verbal probability expressions, numeric probability
judgments produced superior performance. Numeric judgments were more consistent and less
overconfident than verbal judgments. Interestingly, when analyses were restricted to the most
informed respondents and when resolution was measured in terms of the area under the ROC
curve, results indicated that numeric judgments were more resolute than verbal judgments. The
only advantage that verbal judgments had over numeric judgments was that the former were less
extreme, hence less biased, compared to the latter.
The current findings offer mixed support for the proposed hypotheses. Based on
Friedman et al.’s (2017) findings, it was expected that numeric judgments would be more
extreme than verbal judgments, which could lead numeric judgments to be less calibrated and
more overconfident. Although the results indicated no significant difference between the
response modes in calibration score, our data were consistent with Friedman et al.’s (2017)
finding, such that numeric judgments were more extreme compared to verbal judgments. In fact,
our findings extend the previous results in several ways. First, verbal expressions of uncertainty
were quantified at an individual level. This is important given that the meanings of probability
words/phrases are ambiguous and specific to each individual. Second, Friedman et al.’s (2017)
study considered extreme responses as ‘biased responses.’ However, extreme responses may not
be biased as long as they are consistent with the base rate (i.e., extreme positive responses are
desirable when the base rate is high). Using Yates’ (1990) bias index as a measure of bias, I
extended Friedman et al.’s (2017) findings by showing that numeric responses were much higher
than the base rate compared to verbal responses. Substantively, this means that compared to
respondents who used words, respondents who used numbers to make forecasts believed the
target events would occur more often than they actually did.
The finding that verbal expressions led to greater overconfidence is noteworthy. A
possible account for this difference is the effect of myside bias (Stanovich, West, & Toplak,
2013). Myside bias is a tendency for individuals to evaluate or generate evidence biased toward
their own opinions. Myside bias could operate in the current assessment when respondents
assigned unjustifiably high probabilities to outcomes that they desired to happen (e.g. their
favorite team wins a difficult game). Consequently, the effect of myside bias could be greater in
the verbal condition because there are no numeric values associated with verbal terms, leaving
respondents with more freedom to interpret values of the expressions. This possibility is
supported by research showing that subjects were motivated to interpret meanings of verbal
probability terms to be consistent with their desired outcomes (Piercey, 2009). Nevertheless, this
might not be the case in this research because respondents in the verbal condition completed the
quantification study about a week after the assessment study. This procedure, in effect, reduced
any biases associated with the interpretation of the verbal terms at the moment judgments were
made.
The previous interpretation also implies that verbal epistemic uncertainty assessment
would be less prone to the effect of myside bias because respondents probably have weaker desires for particular historical outcomes to have happened. However, the effect of
overconfidence has also been found in research that compares numeric versus verbal assessment
of epistemic uncertainty (Wallsten, Budescu, & Zwick, 1993). On the surface, this finding seems
to run counter to the result that numeric judgments were more extreme. However, measures of
bias and overconfidence were mathematically different. The latter is the difference between the mean probability rating and the proportion of correct responses, whereas the former is the difference between the mean probability rating and the base rate. As such, overconfidence is more like a measure of meta-cognition, knowledge about one's own knowledge, while Yates' bias is simply an indicator of a tendency to over- or underestimate the frequency of an event in
reality. Thus, it was entirely possible that respondents who used words to make forecasts were
more confident in their belief that they correctly answered more questions (hence
overconfidence), but at the same time their mean probability ratings were closer to the actual frequency of the target events than were those of their counterparts (hence less biased).
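A small hypothetical illustration may make the distinction concrete, using the conventions above (bias compares the mean rating with the base rate; the confidence index compares the mean rating with the proportion correct, with positive values indicating overconfidence). The judges and numbers below are invented for illustration only.

# Hypothetical summary statistics for two judges.
base_rate = 0.32  # proportion of target events that actually occurred

# Judge A (numeric condition): high mean rating, many correct answers.
# Judge B (verbal condition): lower mean rating, fewer correct answers.
judges = [("A", 0.55, 0.52), ("B", 0.40, 0.30)]

for name, mean_rating, prop_correct in judges:
    bias = mean_rating - base_rate               # distance from the base rate
    overconfidence = mean_rating - prop_correct  # positive = overconfident
    print(f"Judge {name}: bias = {bias:+.2f}, overconfidence = {overconfidence:+.2f}")

# Judge B is less biased (closer to the base rate) yet more overconfident,
# mirroring the verbal-versus-numeric pattern discussed in the text.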
The effects of individual characteristics are also important. A higher score on the
Cognitive Reflection Test (CRT) was associated with a better (lower) Brier score and less
scattered and more consistent judgments; these findings are consistent with Mellers et al. (2015).
On the other hand, interests and behavioral engagement in an assessment topic (football) and
proxy measures of NFL knowledge were not related to any of the performance measures. Put
simply, the ability to engage in deeper reflection and overcome instinctive responses (i.e., higher
CRT score) was more predictive of better probability assessments than domain (NFL) knowledge
and involvement. These findings are potentially useful to practitioners who select respondents for judgment tasks, as they highlight the utility of using a series of measures of cognitive function to
identify individuals with good probability forecasting skills. Those with higher CRT scores (or
comparable measures) should perhaps be given more weight when aggregating probabilities from individual judges (see Cooke, 1990, for a review).
It is also important to note that the average Brier scores for the original sample and the
MIN sample were worse than no-skill probability judgments (reporting p=0.50 for every
question). This was a surprise given our efforts to screen for judges with relevant domain
knowledge and involvement. One possible explanation is that the screening instrument was too
brief to provide a reliable measure of NFL knowledge and involvement that is important to
predicting season events. However, a more valid measure of expertise does not necessarily
guarantee better performances. For example, Ronis and Yates (1987) found that self-reported
expertise was not predictive of the Brier score.
The modest Brier scores were probably due to the difficulty of the task of predicting NFL
season outcomes in Week 3 of the season. While sports broadcasters seem to have a great deal of
expertise, most of their expertise comes from the benefit of hindsight. That is, they are always
able to find a post hoc explanation that makes the outcome of any sporting event seem as though
it was inevitable. In a domain such as the NFL, in which one team can defeat another “on any
given Sunday,” the task is extremely difficult. The task can be particularly hard when the events
are far in the future, and many relevant uncertainties apply, such as injuries to key players.
Benson and Onkal (1992) showed that judgment accuracy measured by the Brier score worsened
for events far in the future compared to events closer in time. Some of the questions in this study required assessors to judge probabilities of events that would not be resolved for several weeks,
which might have been the reason for attenuated Brier scores. However, it is important to note
that the average values of the Brier scores found in this study were on par with values reported in
prior research on NFL game outcomes (Boulier & Stekler, 2003; Doyle, 2010), and the fact that
the Brier score was modest does not affect the comparison of different response modes.
The current study fills a gap in research comparing verbal versus numeric assessments of
uncertainty by quantifying probability judgments of aleatory uncertainty and by translating
verbal expressions into numbers at an individual level. Whereas Wallsten et al. (1993) found
little difference between the two response modes, verbal versus number, in an assessment of
general knowledge questions (epistemic uncertainty), findings from the current research suggest
several advantages of using numbers for probability assessment when the assessment context is
aleatory in nature (i.e., when assessors are asked to predict the future). Additionally, by using
multiple metrics to evaluate and compare different modes of probability judgments, findings
were both contradictory and consistent with previous research. Whereas Friedman et al. (2017)
found numeric forecasts were less accurate (larger Brier score) than verbal forecasts, I found no
such evidence. In fact, our results are more consistent with Wallsten et al.’s findings (1993). This
could be attributed to the fact that verbal expressions were translated into numerical values at an
individual level. Moreover, consistent with Wallsten et al. (1993), I found that verbal forecasts
were more overconfident. In addition, I extended Friedman et al.’s finding (2017) by showing
that numeric forecasts were more biased relative to the base rate, yet also more resolute than verbal forecasts.
While the results suggest that numerical probability assessment offers more advantages
than verbal assessment of uncertainty, there are interesting questions awaiting future research
efforts. For instance, the effect of the selection of verbal expressions on probability forecasts was
not studied in this research. In the current experimental setup, respondents in both conditions
received an 11-point probability scale. While this presentation helps to control for free selection
of probability words by respondents, it does not permit an empirical test of the effect of using a personalized vocabulary of uncertainty. Having respondents choose their own sets of verbal
probabilities might have helped them to improve their verbal forecasts, perhaps because it makes
it easier for people to express uncertainty using their own language. Another interesting question
is the effect of using modifiers on verbal forecasts. The use of modifiers (e.g. “very,” “much,”
“extremely,” etc.) may help to distinguish different degrees of certitude. Because this research
did not use any modifiers, one could argue that this decision led to the substantial overlap in the meanings (numeric values) of the verbal expressions. The next study attempts to explore these
issues further. In addition, other elicitation approaches are explored in an effort to improve the
performance of probability judgments.
CHAPTER 7
COMPARING ALTERNATIVE METHODS TO ELICIT SUBJECTIVE PROBABILITY
Results from the NFL study reveal important differences between verbal and numeric
assessment of probability. Nevertheless, both approaches require direct probability judgments.
Alternatively, probability can be elicited from having respondents choose between different
gambles. However, it is unknown whether this approach offers any advantages over the direct
elicitation method. In addition, while performance incentive is often used to encourage
probability assessors to think carefully about their responses, there is still limited research that
evaluates the effectiveness of this approach. The goal of the second experiment in this
dissertation is to explore how different methods of probability elicitation affect the quality of
subjective forecasts. Specifically, the effects of direct versus indirect probability elicitation
methods are compared. Furthermore, this research also explored whether using a performance-
based incentive strategy can improve probabilistic forecasts. Lastly, to resolve an outstanding
issue from the previous study, this study tested whether simply increasing the number of verbal
categories can improve the quality of verbal forecasts. This chapter begins by first proposing the
research hypotheses and presenting the methodology. The results from a forecasting experiment
designed to test the proposed hypotheses are then reported and discussed.
The Effect of Response Categories in Verbal Probability Assessments
Verbal expressions of uncertainty are ambiguous. When individuals are asked to interpret
meanings of verbal expressions of uncertainty, there is greater variability between subjects than
within subjects (Clark, 1990; Wallsten & Budescu, 1990). Budescu and Wallsten (1985)
recommended that subjects in verbal probability studies should be allowed to freely choose their
own vocabulary of uncertainty. When subjects have freedom to select their own expressions,
they may become better at expressing their uncertain beliefs, presumably because they have the
“right” language to describe their feelings of incertitude. This conjecture suggests a possible
explanation of why verbal judgments were less resolute and more overconfident than numeric
judgments in the NFL study (see Chapter 6).
Respondents in the previous study were asked to make forecasts using either a verbal or a
numeric scale. Importantly, the number of categories was the same in both conditions.
Nonetheless, the pre-defined verbal scale might have impeded the performance of those in the
verbal condition by making it difficult for them to express their belief without having the “right”
language of uncertainty. An alternative approach is to have respondents select their own set of
verbal expressions of uncertainty and later convert these expressions into numerical values at an
individual level. However, this approach may prove to be logistically difficult to scale, especially
when it is desirable to collect judgments from hundreds of respondents. More importantly,
because many organizations have guidelines on the use of verbal probability (e.g. National
Intelligence Council, 2007), probability assessors may not have the freedom to choose their own
expressions when they make forecasts for their organizations.
A compromise approach is to provide respondents with many verbal expressions from
which they can choose to make forecasts. This solution provides respondents with many
expressions to choose from and makes it feasible to scale. Another advantage of this approach is
that it may provide respondents with a richer set of vocabulary. Because numeric probability
may be “a foreign language” for some (Friedman et al., 2017), it can be challenging for
individuals to create a list of verbal expressions that covers different values in the [0, 1] interval.
For example, subjects were inclined to use the "fifty-fifty" expression when they were prompted to provide predictions even though they did not really mean a 50% chance, suggesting that respondents had difficulty expressing uncertainty verbally (de Bruin, Fischhoff, Millstein, & Halpern-Felsher, 2000).
refined verbal probability scale. For instance, the National Intelligence Estimate guideline
suggests a 5-point verbal probability scale (National Intelligence Council, 2007) and the
Intergovernmental Panel on Climate Change recommended a 7-point scale (Budescu, Por,
Broomell, & Smithson, 2014) to describe the full spectrum of probability values.
With a larger set of probability words to choose from, it may be possible for respondents
to improve the resolution of their judgments. Recall that the measure of slope quantifies the
ability of an assessor to choose and assign appropriate response categories to different target
events. The assessor should assign a higher probability value to an event that is going to happen
and a lower probability value to an event that is not going to happen. Thus, verbal assessors may
do better when they can choose from a larger pool of verbal expressions that capture more levels
of uncertainty.
Nevertheless, this possibility depends on the assumption that respondents are capable of
distinguishing different levels of probability when using verbal expressions of uncertainty.
Because a higher resolution score reflects not only the ability of an assessor to properly use
different response categories but also his or her skill in distinguishing signal from noise (Yates, 1990), the increase in the number of response categories may help an assessor to improve his or
her labeling skill by providing more options to choose from. At the same time, however, it may
do little to enhance his or her discrimination skill. This means that when assessors are given
more probability categories, their responses are likely to be more scattered, but this does not
necessarily imply a better resolution score. In fact, because the meanings of verbal probability
expressions overlap, providing individual judges with more verbal options can have a reactive
effect that may be confusing due to greater complexity, which may actually decrease the quality of their judgments. Given the contrasting hypotheses, the goal of this study is to examine whether
the quality of verbal probability judgments, particularly the degree of resolution, is a function of
the number of verbal response categories. It is expected that the resolution score from verbal
judgments would be better than, or at least comparable to, that from numeric judgments.
The Effects of Direct versus Indirect Probability Elicitation on Future Forecasts
Another class of probability elicitation is based on inference from a subjective expected
utility model (Edwards, 1954; Savage, 1954). The basis of this model is that subjective
probability can be inferred or revealed by observing how people choose between different
gambles. The so-called lottery method is an implementation of this approach. In a typical
application, probability assessors are asked to consider a pair of gambles in which they would
win an amount X when an event E occurs, or they would win the same amount of money when an event F with a known probability P occurs. The probability of event E is set equal to P when probability assessors become indifferent between the two gambles. The probability
wheel method is a classic implementation of the lottery approach. One of the key differences
between the direct judgment method and the lottery approach is that the former method requires
subjects to come up with an estimate of uncertainty, in either a numeric, verbal, or odds form,
whereas the latter method requires subjects to indicate their preferences for gambles. As such,
the lottery method could be cognitively easier for subjects to complete. The lottery method can
also reduce judgment errors.
Erev, Wallsten, and Budescu (1994) described the process of formulating probability
judgments. First, assessors construct internal beliefs about the uncertainties of target events.
Second, the assessors map these beliefs onto an artificial confidence scale, with the mapping
being dependent on the nature of the task. Importantly, random errors can occur at either or both
steps. In the lottery method, however, assessors are simply asked to choose between options and
are not required to map their feeling of incertitude on an artificial probability scale. As a result,
their probability judgments may be less prone to errors.
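A rough simulation, under assumed parameters, illustrates how error at the mapping stage alone can degrade accuracy even when the underlying beliefs are perfectly formed. The noise level and the Brier-score summary below are assumptions made purely for illustration, not part of Erev et al.'s model.

import numpy as np

rng = np.random.default_rng(42)
n_events = 10_000

# True event probabilities and realized outcomes.
true_p = rng.uniform(0.05, 0.95, n_events)
outcomes = rng.binomial(1, true_p)

# Stage 1: internal beliefs are assumed to match the true probabilities.
beliefs = true_p

# Stage 2: mapping beliefs onto the response scale adds random error.
mapping_noise = rng.normal(0.0, 0.10, n_events)      # assumed noise level
reports = np.clip(beliefs + mapping_noise, 0.0, 1.0)

brier = lambda f, d: np.mean((f - d) ** 2)
print(f"Brier with error-free reports: {brier(beliefs, outcomes):.3f}")
print(f"Brier with noisy mapping:      {brier(reports, outcomes):.3f}")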
There are numerous variations of the lottery approach and its related methods (see O'Hagan et al., 2006, for a review). Although a large body of research has been devoted to assessing
various aspects of using the lottery approach or its variants (Gavasakar, 1988; Merkhofer, 1987;
Price, 1988; Savage, 1971; Shepherd & Kirkwood, 1994), relatively few experiments have been
conducted to compare probability elicited from gambles and from direct judgments. The limited
evidence seems to suggest little difference among alternative elicitation methods. Chesley (1978)
compared different qualities of probability judgments for a continuous variable elicited from
three methods, odds, direct estimation, and the lottery conceptualization of probability, and
found little difference. More recently, Abbas, Budescu, Yu, and Haggerty (2007) also found little
difference when comparing fractiles of probability distributions elicited from two distinct
methods. In the fixed probability method, subjects were asked to indicate a value of a variable
such that the probability of the variable exceeding this value V equals a fixed probability P. In the fixed value
method, subjects were asked to indicate a cumulative probability for a fixed value of a variable
of interest. Both methods were implemented by having subjects simply choose between binary
gambles. Abbas et al. (2007) found little difference between the two methods over a range of
performance evaluation criteria. The primary advantage of the fixed value approach was that it
was completed faster. Nevertheless, these studies elicited probability distributions for continuous
variables. It is unknown to what extent these results can be generalized to probability distributions of discrete variables. In addition, Abbas et al. (2007) asked respondents to match
values rather than to indicate preference for gambles. Probabilities elicited from this so-called
matching method, a variant of the gamble methodology, can be different from values inferred
from choices between gambles. Indeed, there can be systematic differences when respondents
respond to matching tasks versus when they are asked to choose between options (Tversky,
Sattath, & Slovic, 1988).
The current experiment explored the application of the lottery method in a forecasting
task. Specifically, I used the lottery method to elicit probability estimates. Because there is a
perception that the lottery method is easier for individuals to complete compared to the direct
estimation method, I expect probability estimates obtained from the lottery method to be better
than estimates obtained from a direct estimation approach over a range of evaluation criteria.
Examining the Effect of Incentivization
The focus of the discussion so far has been on methods to obtain subjective probability
estimates. Yet, elicitation method is one of many factors that contribute to the quality of
subjective judgments. Another important contributor is the effect of cognitive heuristics. As
discussed previously (see Chapter 2), decision makers tend to rely on heuristics or mental
shortcuts to make decisions. While these heuristics or System 1 operations are often adequate in
certain situations, they can lead decision makers to make use of non-normative and irrelevant
information, leading to inaccurate judgments. As such, strategies that encourage probability
assessors to think more deliberately can mitigate some of the undesirable effects of
cognitive heuristics and enhance the accuracy of subjective probability judgments.
In practice, analysts can encourage careful thinking and honest responses from
probability assessors by incentivizing performance with respect to a proper scoring rule. A
proper scoring rule such as the Brier score acts as a truth serum to motivate probability assessors
to provide their earnest beliefs and prevent them from hedging their estimates. Careful
assessments are encouraged by rewarding correct judgments with higher scores based on an
employed scoring rule. Hedging is discouraged by making the cost of incorrect judgments larger
than the rewards of correct assessments (i.e., a penalty for an incorrect judgment is larger than a
reward for a correct judgment at the same probability level) (Carvalho, 2016). Because any
proper scoring rule is designed to penalize extreme judgments when such judgments are not
warranted, performance can be improved when judges respond in accordance with the employed scoring rule.
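The "truth serum" property can be checked numerically: under a quadratic (Brier-type) rule, the expected penalty is minimized when the reported probability equals the judge's true belief, so neither hedging nor exaggeration pays off in expectation. A short sketch (the belief value of .70 is arbitrary):

import numpy as np

def expected_brier(report, belief):
    """Expected Brier penalty for a binary event when the judge's true belief
    is `belief` and he or she reports `report`."""
    return belief * (report - 1) ** 2 + (1 - belief) * report ** 2

belief = 0.7                                    # assumed true belief
reports = np.linspace(0, 1, 101)
losses = expected_brier(reports, belief)
best_report = reports[np.argmin(losses)]
print(f"Expected penalty is minimized at a report of {best_report:.2f}")  # ~0.70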
However, empirical research suggests mixed evidence regarding the efficacy of using
such performance-based incentive approach to improve probability judgments. For example,
Phillips and Edwards (1966) compared the performance of respondents incentivized by the
strictly proper logarithmic and quadratic scoring rules with that of respondents incentivized by
the improper linear scoring rule or by no scoring rule at all. Respondents were assessed on how
well they could apply the Bayesian updating process in the classic “bookbags and poker chips”
task. The authors found that performance was best (close to an optimal measure derived from
Bayes’ rules) when assessors were incentivized by the linear rule and was worst when there was
no scoring rule. Yet, also using a Bayesian revision task as the main performance measure of
probability judgments, Schum, Goldstein, Howell, and Southard (1967) found that respondents
receiving logarithmic payoffs performed substantially better than those receiving linear or all-or-
none payoffs.
Fischer (1982) asked a sample of college students to predict the freshman grade point
averages (GPA) of 40 other college students using students' genders, SAT scores, and high
school GPA. The students were paid a flat fee plus or minus a bonus pay contingent upon a
logarithmic scoring rule. The author found that respondents who were incentivized by a
logarithmic scoring rule performed better than those who did not receive any incentives in terms of
the log score, but the incentive payment scheme did not affect the measure of confidence.
Interestingly, despite the use of incentivization, the average log scores of respondents in the
experiment were still worse than the log scores that respondents would have obtained had they
used a naïve strategy of reporting 50-50 for all questions or had they simply reported the base
rate. Moreover, the effect of the incentive payment scheme on the log score did not generalize to other types of scoring rules such as the spherical score. In addition, Hollard, Massoni, and
Vergnaud (2016) found little improvement in either calibration or discrimination when
respondents were incentivized and paid according to a quadratic scoring rule versus when they
were paid a flat fee.
Previous efforts to incentivize performance may have failed because subjects were not given
adequate opportunities to demonstrate that they understood the scoring procedures. In fact,
incentivization may work only when the following conditions are met. First, assessors should
clearly understand how their performance is evaluated, or at least, develop an intuition of how
their judgments are scored. Note that an understanding of the mathematical formulation of a
proper scoring rule may not be necessary. Developing an intuition for how the score applies may
be more fruitful. Second, assessors must actively use the scoring rule (presented by researchers)
to calibrate their judgments. The former requirement could be more important than the latter. Once respondents understand how their performance is evaluated, it is difficult to see why they would not behave in a manner that maximizes their incentives. As such, the
focus of this experiment would be on the first requirement.
There are different methods to communicate a proper scoring rule or the payoff scheme
to subjects. For example, subjects could simply learn about the mathematical formula of an
employed proper scoring rule (Fischer, 1982), or they could be educated on strategies to
maximize their score without knowing anything about the employed scoring rule (Mellers et al.,
2014). However, these approaches have limitations. Displaying mathematical formulas probably
causes confusion whereas providing decision strategy instruction does not guarantee that
respondents understand how their performance is evaluated. The second study of this dissertation
explored a third approach to communicate performance incentives to probability assessors.
In this approach, judges were provided with a payoff table that demonstrates how their
judgments were scored based on their assessments and the outcomes. Importantly, the payoffs
were scored based on a proper scoring rule, but assessors were not required to understand the
mechanics of the scoring procedure. In addition, judges were given opportunities to examine and apply the payoff tables to their judgments in a series of examples so that they could develop an intuition about how the scoring rule operates prior to making "real" assessments. It is expected that those who underwent the brief training and had access to the payoff table would perform better than those who did not participate in the training and did not have access to the payoff table.
Importantly, because the payoffs were based on a proper scoring rule that penalizes extreme incorrect judgments heavily, it is expected that those who have access to the payoff table will make less extreme judgments. As a result, their judgments should be better calibrated but less resolute compared to those of judges who do not receive the payoff table.
The Effects of Individual Differences
The current study continues exploring the roles of individual differences. In the NFL
study, I have found that the ability to think reflectively, measured by the Cognitive Reflection
Test (Frederick, 2005), significantly predicted the Brier score and the degree of consistency in
judgments. One of the goals of the current study was to generalize this finding to a similar but
different assessment context. Motivation can also be an important factor to encourage judges to
perform better. Baron (2008) described that good decision makers are those who are not only
receptive and open to new information but also motivated to seek and examine contrasting
hypotheses to their conclusions. A Bayesian decision maker should be able to update old
information with new data and revise their opinions according to Bayes' rule. Certainly, this
requires decision makers to seek new information to revise their conclusions. Importantly,
information contrasting the initial conclusion probably provides the most useful insight. Thus, it
is expected that those who score high on a scale that measures a proclivity to seek out and
examine contrasting information would perform better in probability assessment tasks.
Specifically, it is expected that those who more actively seek out evidence inconsistent with
their (initial) beliefs would obtain a better resolution score.
Summary of Research Questions
The current study attempts to achieve the following goals. First, I examine whether
individuals who use verbal probability expressions perform better when given more verbal
response options compared to those who make numeric judgments. Second, probability
judgments elicited from a lottery method are compared to those from a direct estimation method.
Third, I investigate whether providing respondents with clear instructions on how their
performance is evaluated would improve their performance. Finally, the effects of active-open
minded thinking and the ability to think reflectively on probability assessment performance are
examined.
Method
Design Overview
Individuals knowledgeable in NBA basketball (or ‘experts’) were recruited to participate
in a “Sport and Decision Making” study. Experts were selected based on their responses to a
screening instrument (described below), and they were randomized into one of four experimental
conditions. Experts in the incentivized/number condition used a numeric probability scale to
assess the outcomes of target events whereas experts in the incentivized/verbal condition selected
their responses from a set of 19 different verbal probability terms. Experts in these conditions
received bonuses contingent upon their performance. Experts in the direct/numeric condition
used a probability response scale to assess the likelihoods of the target events. Experts in the
gamble condition were asked to choose between a series of pairs of gambles from which I could
infer their probability responses. Experts in the incentivized/verbal condition were later invited
to participate in a separate study, designed to quantify the meanings of the verbal probability
expressions. I used the gamble methodology (described in Chapter 5 and applied in the NFL
study) to quantify the implicit numeric values associated with the 16 probability expressions that
the experts had used in the main assessment study (the 3 anchor terms, absolutely impossible,
fifty-fifty, and absolutely certain, were not quantified individually).
All experts were asked to assess the outcomes of 46 events related to the NBA 2017
playoffs. The questions span a variety of events including individual player performances, team
performances, game outcomes, and specific events during games. About one-half (52%) of
the binary target events (e.g., Team A defeats Team B during a playoff game) occurred. All
experts were paid $2.00 for their participation in the main study and $1.00 for the follow-up
word-to-number transformation study. Experts in the incentivized conditions received extra
bonuses contingent upon their performance (everyone was rewarded with the same bonus that was obtained by the best performer). Data collection was conducted and completed one week prior to the start of the NBA 2017 playoffs (April 15, 2017).
Experimental Manipulation
Experts were randomized into one of the four experimental conditions: Direct/Number,
Incentivized/Number, Incentivized /Verbal, and Gamble. These conditions were a subset of a 2
(Direct x Lottery) by 2 (Verbal x Number) by 2 (Incentive x No Incentive) between-subjects
factorial design. These conditions were specifically selected because they allowed for empirical
tests of the hypotheses and because they served well for the purpose of exploratory analyses. I
compared the performances between Direct/Number experts and Incentivized/Number experts to
examine the question whether motivating experts with a performance-based incentive scheme
enhances aleatory uncertainty assessment. These two experimental conditions were identical in
virtually all aspects except that experts in the Incentivized/Number condition received additional
bonuses contingent upon their performances. The performances of Incentivized/Number experts
and Incentivized/Verbal experts were compared to examine the question whether verbal
judgments become more resolute compared to numeric judgments. Finally, I compared the
performances of experts in the Gamble condition against those of experts in the Direct/Number
condition to test whether using the gamble methodology to elicit probabilities results in better
probability assessment performance. Each of the four experimental conditions is described next.
Direct/number condition. Experts in the Number group used a numeric response scale
with 11 categories from 0% to 100% to make probability assessments. Experts were told:
We are interested in your best estimate of the chance from 0% to 100%
that each event will happen. If you are sure that the event will happen, choose
100%. If you are sure that the event will not happen, choose 0%. If you feel the
event as likely to happen as it will not happen, choose 50%. In all other cases,
select an option between 0% and 100%.
Incentivized/number condition. Experts in the Incentivized/Number group used a
numeric response scale with 11 categories from 0% to 100% to make probability assessments.
Importantly, experts in this condition were incentivized based on their performance. They were
told that their assessments would be scored based on a scoring rule, but they were not told about
the specifics of the scoring procedure. They were given a payoff table illustrating their scores
with respect to their judgments as well as the outcomes.
Figure 11 displays how the payoff table was presented to respondents. Note the use of different colors and a gradient to further emphasize different payoff values. Instructions on how to
read the table were provided. Scores in the table were computed based on the Brier score. The
Brier score is conventionally bounded between 0 and 1, with lower scores being better. The Brier score was transformed such that the score for the 50% category is 0, using the following formula: 25 - Brier*100. Note that any linear transformation of a proper scoring rule is also a proper scoring rule. The transformation also reversed the conventional direction, such that a score of 25 is the best possible score and a score of -75 is the worst possible score.
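The payoff table can be reconstructed from this transformed rule; the following sketch tabulates the score 25 - 100*(p - outcome)^2 for the 11 response categories. The tabulation is illustrative and is not the exact table shown to respondents.

# Illustrative reconstruction of the payoff table: score = 25 - 100 * Brier,
# where the Brier score for one binary event is (p - outcome)^2.
responses = [i / 10 for i in range(11)]        # 0%, 10%, ..., 100%

print(f"{'Response':>9} {'Event occurs':>13} {'Event does not':>15}")
for p in responses:
    score_if_true = 25 - 100 * (p - 1) ** 2    # the proposition turns out true
    score_if_false = 25 - 100 * (p - 0) ** 2   # the proposition turns out false
    print(f"{p:>8.0%} {score_if_true:>13.0f} {score_if_false:>15.0f}")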
Experts also completed a series of four examples to ensure that they understood the
information presented in the table. The examples were provided in the Appendix. Experts were
told that the higher their total score was, the more money they could earn. In fact, the experts were paid in proportion to their transformed Brier score (i.e., transformed score × $0.05). In addition, experts were told that the person with the highest score would receive a special bonus of $50.
[Figure 11 about here]
Incentivized/verbal condition. Experts in the Incentivized/Verbal group made
assessments by choosing from a list of 19 different verbal probability expressions. The
expressions were: absolutely impossible, almost impossible, highly improbable, very unlikely,
very doubtful, improbable, unlikely, doubtful, light chance, fifty-fifty, probable, no doubt, likely,
good chance, pretty sure, very likely, highly probable, almost certain, absolutely certain. The list
was ordered such that pessimistic expressions such as unlikely were on the left side of the anchor
term fifty-fifty and optimistic expressions such as likely were on the right side of the anchor term
fifty-fifty. The two extreme anchor terms absolutely impossible and absolutely certain were
positioned at the two ends of the list, and all other expressions were located in between the
extreme terms and the midpoint anchor. Respondents were not told about the order of the non-
anchoring expressions. Note that the use of qualifiers (e.g. almost, very, and absolutely) was
intended to create more differentiation among the verbal terms.
Respondents were told that their assessments would be scored based on a scoring rule,
but they were not told about the specifics of the scoring procedure. They were given a payoff
table illustrating their scores with respect to their judgments as well as the outcomes. Figure 12
displays the payoff table. Importantly, the table displays only the anchor verbal terms, Absolutely Impossible, Fifty-Fifty, and Absolutely Certain, and their respective payoffs. This was intended to avoid prescribing a specific ordering of verbal expressions to the experts.
The experts were further told that their performance would be scored accordingly when they
used other verbal expressions. Detailed explanations were provided. Experts also completed a
series of four examples to ensure that they understood the information presented in the table.
[Figure 12 about here]
Gamble condition. Experts in the Gamble condition were presented with series of binary
choices among gambles and asked to choose their preferred options. The following example
provides a concrete demonstration of the gamble methodology:
Proposition: Steve Curry will make at least two three-point attempts per
game throughout the playoffs.
Gamble A: If the proposition is true, you will get $50. If the proposition is
false, you will get nothing.
Gamble B: A ball will be drawn from a container that consists of 100 red
and blue balls. There are 50 red and 50 blue balls in the container. If a red ball is
drawn, you will get $50. If a blue ball is drawn, you will get nothing.
Which gamble do you prefer to play?
If an expert chooses to play Gamble A, Gamble B is altered to become more attractive by increasing the number of red balls in the container (i.e., there are now 90 red balls and 10 blue balls). If the expert chooses to play Gamble B, Gamble B is altered to become less attractive (i.e., there are now 10 red balls and 90 blue balls). The goal of this methodology is to make the expert indifferent between the two gambles. This indifference point allows the probability estimate for the target event to be set. If an expert did not choose the "indifferent" option after the third trial, the indifference point was set equal to the midpoint of the remaining range of probability values. Experts played up to 46 events (propositions) × 3 possible binary gambles for a total of 138 trials. Importantly, experts were not told about the purpose of the gambles.
Indeed, they were told that, “researchers are interested in learning how people choose between
different gambles.” All experts completed three examples to ensure that they understood the
procedure (see Appendix C).
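Because the probe sequence beyond the 50%, 90%, and 10% splits described above is not fully specified here, the following is only a rough sketch of a three-trial staircase consistent with that description; the function names and the example expert are hypothetical.

def elicit_probability(prefers_event_gamble, indifferent):
    """Rough sketch of the three-trial staircase (probe values beyond the
    50/90/10 splits mentioned in the text are assumptions).

    prefers_event_gamble(p) -> True if the expert prefers betting on the
        proposition over a lottery that pays off with probability p.
    indifferent(p) -> True if the expert reports indifference at probability p.
    Returns the inferred probability estimate for the proposition.
    """
    lo, hi = 0.0, 1.0
    probe = 0.5
    for _ in range(3):                          # at most three trials per event
        if indifferent(probe):
            return probe
        if prefers_event_gamble(probe):
            lo = probe                          # event judged more likely than probe
        else:
            hi = probe                          # event judged less likely than probe
        # Next probe: 90% or 10% after the first trial, then bisect (assumed).
        probe = 0.9 if (lo, hi) == (0.5, 1.0) else 0.1 if (lo, hi) == (0.0, 0.5) else (lo + hi) / 2
    # No indifference after three trials: use the midpoint of the remaining range.
    return (lo + hi) / 2

# Hypothetical expert who believes the proposition has about a 75% chance.
estimate = elicit_probability(lambda p: p < 0.75, lambda p: abs(p - 0.75) < 0.06)
print(f"Inferred probability: {estimate:.2f}")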
Quantifying Probability Words
Experts in the verbal condition were invited to participate in a follow-up study. The
procedure in this follow-up study was identical to the procedure in the follow-up study in the
NFL study (see Chapter 6). The only difference is that the invited experts in the current follow-
up study had to complete the translations of 16 different verbal probability expressions. Three
anchor terms, absolutely impossible, fifty-fifty, and absolutely certain were not assessed.
Corrections for the ambiguity bias were completed using results from the ambiguity study (see
Chapter 5).
Dependent Measures
Key performance indices. Various performance indices were computed, including the Brier score, resolution score, discrimination score, scatter score, AUC, the proportion of correct answers, and an overconfidence index. The formulas for these indices were provided in Chapter 6. Each index was computed for each expert and then averaged across the sample.
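A compact sketch of how the main indices used in these analyses might be computed for one expert from paired forecasts and outcomes. The 0.5 cutoff for a "correct" answer and the sign convention for the confidence index are stated assumptions, and scikit-learn's roc_auc_score is used for the AUC; this is not the study's analysis code.

import numpy as np
from sklearn.metrics import roc_auc_score

def performance_indices(f, d):
    """f: probability forecasts in [0, 1]; d: 0/1 outcome indices."""
    f, d = np.asarray(f, float), np.asarray(d, float)
    correct = (f >= 0.5) == d                       # assumed 0.5 cutoff
    return {
        "brier": np.mean((f - d) ** 2),
        "bias": f.mean() - d.mean(),                # mean forecast minus base rate
        "slope": f[d == 1].mean() - f[d == 0].mean(),
        # Scatter: weighted average of within-outcome forecast variances.
        "scatter": d.mean() * f[d == 1].var() + (1 - d.mean()) * f[d == 0].var(),
        "auc": roc_auc_score(d, f),
        "prop_correct": correct.mean(),
        # Assumed convention: positive values indicate overconfidence.
        "overconfidence": f.mean() - correct.mean(),
    }

# Hypothetical expert with 10 forecasts.
print(performance_indices(
    f=[0.7, 0.6, 0.8, 0.4, 0.3, 0.9, 0.5, 0.6, 0.2, 0.7],
    d=[1,   0,   1,   0,   1,   1,   0,   1,   0,   0],
))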
Psychological and behavioral measures. Respondents concluded the study by providing
responses to various questionnaires, including the Cognitive Reflection Test (Frederick, 2005)
and the Active Open-Minded Thinking scale (Baron, 1993). Table 10 provides a summary of the
psychological measures and their reliabilities.
[Table 10 about here]
Sample
Recruitment strategy. I recruited Mechanical Turk workers who responded to a call for individuals interested in NBA basketball to participate in a "sport and decision making" study. A total of 1581 volunteers signed up for the study. These volunteers completed a screening instrument designed to assess their expertise in NBA basketball; the instrument comprised four multiple-choice questions: 1) Who won the
NBA championship last year? (Cavaliers or Warriors); 2) Who was named the NBA MVP last
year? (Lebron James or Steve Curry); 3) The NBA seeds playoff teams solely by their record.
Teams in each conference will be seeded from one to eight by their won-loss record (true or
false); 4) Have you followed the current NBA season? (yes or no). All questions were timed such
that respondents with excessive response times (> 5 seconds) would not be qualified to continue, in order to prevent cheating. Only those who correctly answered all four screening questions were
qualified to proceed.
There were 437 respondents who were classified as NBA experts based on their responses
to the four screening questions (27.6% passing rate). Due to a technical issue, IDs of 116
workers in the Incentivized/Verbal condition were not collected. Workers’ IDs were particularly
important in the Incentivized/Verbal condition because these workers were invited to participate
in the word-to-number transformation follow-up study. Of the 116 IDs that were lost, 85 worker IDs were successfully recovered (some were retrieved based on the workers' IP addresses). As a result, the original Qualtrics data had 406 responses after
excluding 31 responses missing from the Incentivized/Verbal condition.
The 406 experts were randomly assigned into one of the four experimental conditions. Of
85 experts in the Incentivized/Verbal condition who were invited to participate in the word-to-
number transformation study, 64 (75.3% of the sample) completed their responses. Thus, the
sample was decreased to 385 experts. Ninety-six experts were randomized into the Number
condition; 110 were randomly assigned into the Incentivized/Number condition; 64 were in the
Incentivized/Verbal condition; and 115 were randomly assigned into the Gamble condition.
Response screening. Prior to conducting statistical analyses, the researcher checked for
unreliable responses. No-skill forecasters who repeatedly chose the probability of 0.5 to describe their uncertainties for all of the target events would obtain a mean Brier score of 0.25. Exploratory analyses suggested that the sample mean (M = 0.27) was significantly greater than 0.25, t(317) = 4.61, p < .01, suggesting that the experts' performances, on average, were worse than no-skill
forecasts. However, there were great heterogeneities in performance, with the Brier score falling
between 0.13 and 0.59. In addition, some experts had negative slopes and AUC values below
0.5. These resolution values suggest negative predictors, that is, judges who indicate the opposite
of their beliefs. For example, assessors may indicate that a target event has a 90% chance of
occurring when they actually believed that the event would not happen. These responses were
considered unreliable because respondents might simply not have paid enough attention to the
assessment task. Given the heterogeneity in performance, I removed responses associated with a
slope equal to or less than 0 and responses associated with an AUC measure equal to or less than
0.5, as these respondents’ performance suggests either careless responding or a misunderstanding
of the task. I referred to the remaining sample of experts as the MIN sample to indicate that the
included experts passed the minimum screening criteria of having reasonable resolution scores
(i.e., slope is greater than 0 and AUC is greater than 0.5).
For the MIN sample, data from 71 experts with slopes equal to or less than 0 or with
AUC measures equal to or less than 0.5 were removed (60 experts met both criteria). In other
words, 22.9%, 15.5%, 18.8%, and 17.4% of the responses were removed from the Number,
Incentivized/Number, Incentivized/Verbal, and Gamble conditions, respectively. A concern is
that the number of removed responses could depend on the experimental group. Yet, there was no statistical dependency between the experimental groups and the removed responses, χ²(3) = 2.02, p > .05. The final breakdown in the MIN sample (N = 314) was as follows: 74 experts were
randomized into the Number condition; 93 were randomly assigned into the
Incentivized/Number condition; 52 were assigned into the Incentivized/Verbal condition; 95
were randomly assigned into the Gamble condition.
To reiterate, I collected data from two separate studies: the main assessment study and
the follow-up word-to-number transformation study. Data from the follow-up word-to-number
study allowed for a numeric translation of the verbal probabilities in the main prediction
experiment. These numeric estimates of verbal probabilities were then corrected for the
ambiguity bias (see also Chapter 5 and Chapter 6). Finally, I removed some responses based on
the MIN criteria.
Results
Quantifying Probability Words
Table 11 provides the means and the standard deviations of the numerically translated
probability words. These values were corrected for ambiguity bias. Consistent with findings in
the NFL study, there was only modest discrimination among the mean numeric values of the verbal
expressions. Inspecting Figure 13 revealed that there were large between-subject variations in the
interpretations of the verbal expressions of uncertainty. These heterogeneities suggested that
extreme numeric values were cancelled out, which in effect reduced the mean probability values
of the expressions.
[Table 11 about here]
[Figure 13 about here]
Correlations Among Performance Measures
Table 12 displays the correlations among the performance measures. The Brier score was
significantly associated with its component scores, including slope, scatter, and bias, regardless of the type of sample. Interestingly, coherent judgments were significantly associated
with the Brier and component scores. As expected, the AUC and the slope measures were
significantly associated regardless of the sample.
[Table 12 about here]
Group Comparison
Linear regressions were conducted to examine the proposed research questions.
Specifically, I regressed each of the performance measures (the Brier score, bias, overconfidence, calibration, slope, AUC, and scatter) on three contrasts, scores on the Cognitive Reflection Test, and scores on the Active Open-Minded Thinking scale. I made the following
(planned) comparisons: Direct/Number vs. Incentivized/Number to test the effect of incentivization, Incentivized/Number vs. Incentivized/Verbal to test the effect of different response modes, and Direct/Number vs. Gamble to test the effect of different probability elicitation methods. In addition, the two
psychological scales Active Open-Minded Thinking (AOT) and the Cognitive Reflection Test
(CRT) were included as predictors to examine the effects of individual differences on probability
judgments.
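One way to implement these planned comparisons, assuming a per-expert data frame whose column names and values are hypothetical, is to fit the model with the condition factor coded against different reference groups so that the relevant coefficients correspond to the three contrasts:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical per-expert data; all names and values are illustrative.
rng = np.random.default_rng(1)
conditions = np.repeat(["direct_number", "incent_number", "incent_verbal", "gamble"], 20)
df = pd.DataFrame({
    "condition": conditions,
    "aot": rng.normal(4.5, 0.8, conditions.size),   # Active Open-Minded Thinking
    "crt": rng.integers(0, 4, conditions.size),     # Cognitive Reflection Test (0-3)
})
df["brier"] = 0.30 - 0.02 * df["crt"] + rng.normal(0, 0.05, len(df))

# Reference = Incentivized/Number: coefficients give the incentivization contrast
# (Direct/Number vs Incentivized/Number) and the response-mode contrast
# (Incentivized/Verbal vs Incentivized/Number), adjusting for AOT and CRT.
m1 = smf.ols("brier ~ C(condition, Treatment(reference='incent_number')) + aot + crt",
             data=df).fit()

# Reference = Direct/Number: the gamble coefficient gives the elicitation contrast.
m2 = smf.ols("brier ~ C(condition, Treatment(reference='direct_number')) + aot + crt",
             data=df).fit()
print(m1.params, m2.params, sep="\n")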
Table 13 displays the means and standard deviations of the performance measures across
groups, and Table 14 provides the results of the regression models. This section focuses on the
substantive interpretations. The results revealed interesting effects of both the experimental
conditions and the individual characteristics on the performance measures (seven observations were removed due to missingness). The Brier score
model was significant, F(5, 301) = 7.55, p < .01, and the model explained a modest amount of
variance (11.14%). The Brier score of numeric judgments from incentivized experts was significantly lower (better) than the Brier score of verbal judgments from incentivized experts, holding the effects of other predictors constant. A one unit increase in the AOT score was significantly associated with a 0.01 decrease in the Brier score, controlling for the effects of other variables. Similarly, a unit increase in the CRT score was significantly associated with a 0.03 decrease in the Brier score, holding other variables constant. There were no other significant effects of the experimental conditions on the Brier score.
[Table 13 about here]
[Table 14 about here]
The bias model was significant, F(5, 301) = 9.07, p < .01, and the model accounted for a
modest amount of variance (13.1%). The Direct/Number and Gamble conditions were
significantly different (statistics here). Direct probability judgments were biased upward,
meaning that the experts believed that the target events happened more frequently than they
actually did. However, probability judgments elicited from the gambles were biased downward,
meaning that the experts believed that the target events happened less frequently than they
actually did.
The overconfidence model was also significant, F(5, 301) = 5.43, p < .01, and the model
explained a modest amount of the variance (8.41%). The overconfidence score for numeric judgments from incentivized experts was significantly lower (better) than the overconfidence score for verbal judgments from incentivized experts, holding the effects of other predictors constant. This means that incentivized experts were less overconfident in their judgments when they used a numeric response scale than when they selected probability words from a list of verbal probability expressions. The calibration model was also significant, F(5, 301) = 9.42, p < .01, and the model accounted for 13.5% of the variance. Direct numeric judgments were also significantly better calibrated (lower score) than probability judgments elicited from the gambles, holding other predictors constant. A one unit increase in the AOT score was associated with a 0.01 decrease in the calibration score.
The slope model was significant, F(5, 301) = 9.42, p < .01. The model accounted for 11.9% of the variance. Both of the individual difference measures were significantly associated with the slope measure. One unit increases in the AOT and the CRT significantly predicted 0.02 and 0.06 increases in the resolution score, respectively, controlling for the experimental effects. Similarly, the AUC model was significant, F(5, 301) = 6.42, p < .01. The model accounted for 9.6% of the variance. Controlling for the experimental effects, a one unit increase in the AOT and in the CRT significantly predicted 0.02 and 0.05 increases in the AUC, respectively. The scatter model was also significant, F(5, 301) = 11.78, p < .01. The model accounted for 9.40% of the variance. The degree of scatter of numeric judgments from incentivized experts was significantly lower (better) than the degree of scatter of verbal judgments from incentivized experts, holding the effects of the other predictors constant.
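For reference, the sketch below shows how summaries of this kind can be computed from paired forecasts and binary outcomes. It is an illustrative implementation, not the exact scoring code used for these analyses, and the scatter and calibration quantities shown are simplified versions of the components reported above.

```python
import numpy as np

def performance_measures(f, o, bins=np.linspace(0.0, 1.0, 11)):
    """Illustrative Brier-score summaries for forecasts f in [0, 1]
    and binary outcomes o in {0, 1} (1-D arrays of equal length)."""
    f, o = np.asarray(f, dtype=float), np.asarray(o, dtype=float)

    brier = np.mean((f - o) ** 2)                    # overall accuracy (lower is better)
    bias = np.mean(f) - np.mean(o)                   # over-/under-estimation of the base rate
    slope = np.mean(f[o == 1]) - np.mean(f[o == 0])  # resolution: separation of forecasts
    scatter = (np.var(f[o == 1]) + np.var(f[o == 0])) / 2  # within-outcome noisiness

    # Calibration index: squared gap between mean forecast and hit rate within bins.
    idx = np.digitize(f, bins[1:-1])                 # assign each forecast to a probability bin
    calib = 0.0
    for k in np.unique(idx):
        in_bin = idx == k
        calib += in_bin.sum() * (f[in_bin].mean() - o[in_bin].mean()) ** 2
    calib /= len(f)
    return {"brier": brier, "bias": bias, "slope": slope,
            "scatter": scatter, "calibration": calib}

# Example with made-up forecasts and outcomes.
rng = np.random.default_rng(0)
outcomes = rng.integers(0, 2, size=200)
forecasts = np.clip(0.5 * outcomes + rng.normal(0.25, 0.2, size=200), 0, 1)
print(performance_measures(forecasts, outcomes))
```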
Analyses with Most Informed Responses
I identified a subgroup of the most informed forecasters whose Brier scores were less than 0.25. Specifically, 164 expert respondents were removed because their Brier scores were greater than 0.25: 44.6%, 51.6%, 65.4%, and 51.7% of the experts were removed from the Number, Incentivized/Number, Incentivized/Verbal, and Gamble conditions, respectively. There was no statistical dependency between the experimental groups and the number of (removed) responses with higher Brier scores, χ²(3) = 5.36, p > .05. The final most informed (MIN) sample included 150 experts with the following distribution: 41 in the Number condition, 45 in the Incentivized/Number condition, 18 in the Incentivized/Verbal condition, and 46 in the Gamble condition.
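The subgroup construction and the independence check can be sketched as follows, again against a hypothetical data frame; the 0.25 cutoff is the one stated above, while the file and column names are placeholders.

```python
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("experts.csv")        # hypothetical: condition, brier, ...

# Flag forecasters whose overall Brier score exceeds the 0.25 chance benchmark.
df["removed"] = df["brier"] > 0.25

# Test whether removal rates depend on the experimental condition.
table = pd.crosstab(df["condition"], df["removed"])
chi2, p, dof, _ = chi2_contingency(table)
print(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.3f}")

# Keep only the most informed (MIN) subsample for the follow-up regressions.
min_sample = df[~df["removed"]]
```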
Table 15 displays the means and standard deviations of the performance measures across
groups. The results revealed interesting effects of both the experimental groups and the individual characteristics on the performance measures among the most informative experts (three observations were removed due to missingness). The calibration score model was significant, F(5, 141) = 2.91, p = .01. Direct numeric judgments were significantly better calibrated (lower calibration score) than probability judgments elicited from the gambles, holding the other predictors constant. The slope model was significant, F(5, 141) = 3.51, p < .01. A one unit increase in the CRT score was significantly associated with an increase of 0.05 in the measure of slope, holding other predictors constant. Had a more lenient significance threshold been used, incentivized verbal judgments would have been significantly more resolute than incentivized numeric judgments, p = .052. The scatter model was also significant, F(5, 141) = 2.33, p = .05. A one unit increase in the CRT score was significantly associated with a 0.01 increase in the degree of scatter in judgments, holding other predictors constant.
[Table 15 about here]
Conclusion
The second forecasting study in this dissertation was designed to examine strategies to
improve probabilistic forecasts. Overall, the results portray a pessimistic outlook on the
applications of various techniques to improve verbal probability judgments. Results suggest that
increasing the number of available verbal phrases to express probability did not increase the
quality of verbal probability judgments. In fact, the Brier score from verbal judgments was
worse than the Brier score from numeric judgments. Verbal judgments were more overconfident
and more scattered than numeric judgments. In addition, providing incentives had no apparent
effect on performance when making either verbal or numeric judgments. Interestingly, although
the direct estimation method and the gamble methodology both led to biased judgments, the direction of the bias differed: direct probability elicitation led to an upward bias, whereas the
lottery approach led to a downward bias. Importantly, direct probability judgments were better
calibrated than probabilities inferred from gambles. These findings contradicted the proposed
hypotheses, but they revealed significant insights.
The data did not support the hypothesis that using a performance-based incentive scheme
would lead to better probabilistic forecasts. There are two possible explanations. The first is that respondents may not have actively used the payoff table to guide their assessments. Recall that there are two conditions to be met for this intervention to work. Respondents not only have to understand, or at least have an intuition about, how their performance is evaluated but also need to actively apply a prescribed scoring rule. While extensive efforts were made to ensure respondents understood their performance evaluation criteria, this did not guarantee that respondents would factor the scoring instruction into their judgments. The alternative explanation is that the difficulty of the forecasting questions created a floor effect that masked any improvement. Indeed, the outcome index was 0.25, suggesting that the questions were, by conventional standards (Yates, 1991), very difficult for respondents to assess. This does not necessarily mean that the performance-based incentive strategy does not work, but rather that it should be explored with a different set of forecasting questions. Future research needs to
explore alternative approaches to actively engage respondents to apply scoring rules when
making probability assessments.
Another aim of this research was to examine the effect of increasing the number of
probability expressions on forecasting accuracy. I hypothesized that resolution scores for verbal
judgments would be better when judges could select from a large list of verbal expressions of
uncertainty. Surprisingly, the results did not support this hypothesis. Indeed, expert respondents
using verbal probability expressions performed worse than experts using a numeric probability
scale in terms of overconfidence, overall accuracy (Brier score), and the degree of scatter. Note
that the effect of overconfidence was consistent with the result in the NFL study as well as the
results in Wallsten et al. (1993).
The effect of the response modes on the degree of scatter was expected because expert
assessors in the verbal condition could choose from a large list of uncertainty descriptors
whereas experts in the Number condition had to use an 11-point probability scale. Because the
meanings of verbal probability expressions are overlapped, providing judges with more options
to choose from might have inadvertently made the assessment task more difficult and confusing.
This could explain why the performance of experts in the verbal condition suffered. Note that this finding implies that using a short verbal probability scale seems to be preferable to using a long and extensive scale.
It was also expected that the lottery (gamble) method would enhance the quality of subjective
probability estimates. Yet, the evidence did not provide support for this hypothesis. Direct
probability assessments led to better calibrated judgments as compared to probabilities elicited
from gambles. Indeed, the calibration score for direct probability assessments was roughly ten times better (smaller) than the calibration score for probabilities inferred from gambles. It is possible, though, that expert respondents in the two conditions used different judgment strategies. O'Donnell and Evers (2017) showed that even though both tasks were
designed to assess the same objective, their subjects relied on the affective mode (System 1)
when they made choices, but they leaned on the analytical thinking mode (System 2) when they
estimated values of a variable. Following this logic, it could be that those who played gambles
relied more on intuitions whereas those who directly made probability estimates depended on
their analytical thinking to assess the uncertainty of the target events. Yet, because the previous finding did not address probability estimation (O'Donnell & Evers, 2017), the proposed possibility
awaits future research.
Consistent with the previous findings in the NFL study, scores in the Cognitive
Reflection Test were predictive of the Brier score and both measures of resolution in the MIN
sample. This finding established the external validity of the effect of cognitive reflection on
probability judgments. In addition, scores on the Active Open-Minded Thinking scale were also
predictive of the Brier score, both measures of resolution, and calibration in the MIN sample.
These individual effects were significant while controlling for the effect of the other variable,
suggesting that they measure unique aspects of cognitive ability, and that they independently
contribute to the quality of probability judgments.
A second study was conducted to explore alternative approaches to elicit subjective
probabilities with an eye toward identifying methods that improve the quality of probability
judgments. In general, the results suggested that different approaches designed to ‘help’
forecasters to improve their performance have failed to achieve their purposes. Nevertheless, the
results from this second study offer important methodological lessons for future research to
continue exploring strategies to improve probabilistic forecasts.
CHAPTER 8
GENERAL DISCUSSION
An accurate assessment of uncertainty is essential for decision making under risk. This
dissertation is an attempt to evaluate the accuracy of subjective probability judgments and to
compare alternative methods to elicit subjective probability with a focus on identifying methods
to improve the accuracy of probabilistic forecasts. This final chapter addresses how the current
research helps to inform choices of elicitation methods and discusses some of the applications
of the findings on individual differences. It also highlights some methodological contributions of
the current research.
On the Accuracy of Subjective Probability
This dissertation is an attempt to address the question: Can individuals make informative
probability judgments? Results indicated that the summary Brier score from 69 probability
judgment studies was significantly smaller (better) than the score that would have been obtained had judges simply provided uninformative guesses (i.e., a Brier score of 0.25). Yet, the effect size was relatively modest (P̄S = 0.22), and there was a large amount of heterogeneity to be explained. The
effects of the predictors were equally interesting and important. Results from this meta-analysis
unequivocally affirmed the superiority of experts' probability judgments. Additional analyses
revealed that experts’ judgments were not only better calibrated but also more resolute (see
Chapter 4). Surprisingly, having respondents make multiple probability judgments, which was
expected to lead to better assessments, actually deteriorated performance. However, the number
of studies that required respondents to make multiple judgments was relatively small, and the
effect size in this subgroup was not as precise as the estimate in the other subgroup (single
judgment). As a result, the difference in the effect size when judgments were elicited by the
single-judgment strategy versus the multiple-judgment strategy was relatively uncertain (i.e.,
there was a wide confidence interval). The results also suggested interesting findings regarding
the effect of the assessment context. Whereas the average Brier score was better (lower) when
respondents were asked to assess aleatory uncertainty, this effect became non-significant when
other factors such as the number of questions, expertise, and the elicitation method were
statistically controlled. Indeed, further analyses revealed that the effect of the assessment context
was confounded with other covariates. This interesting phenomenon should be the subject of
future experimental studies. Importantly, although some have suggested that epistemic judgments would be more resolute than aleatory judgments (e.g., Tannenbaum, Fox, & Ülkümen, 2016; Wallsten et al., 1993), the results showed that this difference was not statistically significant even though
effect was in the expected direction. Nonetheless, this is not a meta-analysis on the resolution
score. A different meta-analysis is needed to further examine this possibility.
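For reference, the 0.25 benchmark used throughout follows directly from the definition of the Brier score: a judge who always reports 0.5 for a binary event earns the same quadratic penalty whichever way the event resolves. This is a minimal derivation, not a new result.

```latex
\[
\mathrm{BS} \;=\; \frac{1}{N}\sum_{i=1}^{N}\left(f_i - o_i\right)^{2}, \qquad
f_i = 0.5 \;\Rightarrow\; (0.5 - 1)^{2} = (0.5 - 0)^{2} = 0.25 .
\]
```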
On the Selection of Probability Elicitation Methods
Another primary objective in this dissertation was to explore alternative methods to
improve the performance of probability assessment. A possible approach is to make use of verbal
expressions of uncertainty as most people may find it more natural to express uncertainty
verbally. The results indicated that compared to verbal expression of uncertainty, numeric
assessment of probability was the better mode for eliciting subjective probability judgments.
Numeric judgments were more consistent and less overconfident than verbal judgments.
Numeric judgments were also more resolute than verbal judgments when the comparison was
made among the most informed forecasters. Interestingly, increasing the number of verbal
descriptors decreased the quality of judgments. When forecasters had access to more verbal
probability expressions, their forecasting performance suffered as their Brier scores became
worse than those who used numeric probability to make forecasts. These results suggest that
numeric judgments should be used whenever possible. However, when natural language is the
primary vehicle to quantify probability (as part of an organizational practice), probability
assessors should be allowed to select their own vocabulary of uncertainty and the quantitative
meanings of these expressions need to be estimated at an individual level. This is because there
are typically large between-subject variations in the interpretations of the given verbal
expressions (Clark, 1990; Wallsten & Budescu, 1990). Indeed, this suggestion is further
strengthened by the results in the two quantification studies, which underscore the large
between-subject variations in the numeric translations of verbal probability expressions.
Alternatively, when this is not possible, assessors should be given a limited and structured verbal
response scale. Indeed, using a hybrid scale that contains both verbal and numeric
representations of probability could also be another approach (Renooij & Witteman, 1999; Witteman & Renooij, 2003).
On the other hand, there were mixed results regarding the application of the lottery
method to elicit probabilistic forecasts. Even though the lottery method and the direct estimation
approach led to biased judgments, the direction of the bias differed: the latter led to an upward bias, whereas the former led to a downward bias. Even though there is no distinct advantage of using the lottery approach for eliciting subjective forecasts, the lottery method may be easier
for subjects to complete since it only requires subjects to indicate their preference. Further
research needs to explore this possibility.
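As an illustration of how a probability can be read off from gamble choices, the following is a minimal sketch of a simple bisection on the win probability of a reference lottery, run until the respondent is indifferent between betting on the target event and betting on the lottery. The procedure and the prefers_event callback are hypothetical and are not the exact task respondents completed in this study.

```python
def elicit_probability(prefers_event, lo=0.0, hi=1.0, rounds=10):
    """Infer a subjective probability from repeated binary gamble choices.

    prefers_event(q) should return True if the respondent would rather bet
    on the target event than on a reference lottery that pays off with
    probability q (same prize either way).
    """
    for _ in range(rounds):
        q = (lo + hi) / 2.0
        if prefers_event(q):
            lo = q   # event judged more likely than q: raise the lottery odds
        else:
            hi = q   # lottery preferred: lower the lottery odds
    return (lo + hi) / 2.0

# Hypothetical respondent whose true subjective probability is 0.7.
print(elicit_probability(lambda q: q < 0.7))   # converges near 0.7
```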
It was expected that providing assessors with incentives based on their performance could
improve their forecasting performance. Yet, the results suggested that forecasters who received
performance-based incentives did not perform better than those who were simply paid a flat fee for
their participation. Although this finding calls into question the efficacy of using a performance-
based incentive strategy, it may generalize only to research that employs the Brier score (or a
quadratic scoring rule). This is because there are many proper scoring rules that researchers can
choose from (Steyvers, Wallsten, Merkle, & Turner, 2014), and the current research does not
compare the efficacy of using different scoring rules to incentivize performance. The logarithmic
scoring rule can be a reasonable choice because of its steep loss function. When performance is
coded according to the logarithmic rule, assessors would incur hefty penalties when they make
extreme (more certain) but incorrect forecasts. This question, however, awaits future research to
explore.
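The contrast between the two rules is easy to see numerically; the comparison below is a textbook property of the scoring rules, not a result from the present studies. The quadratic (Brier) penalty for an extreme but wrong forecast is bounded at 1, whereas the logarithmic penalty grows without bound:

```latex
\[
L_{\mathrm{quad}}(f,o) = (f-o)^{2} \le 1, \qquad
L_{\mathrm{log}}(f,o) = -\bigl[o\ln f + (1-o)\ln(1-f)\bigr] \to \infty
\ \text{as } f \to 1,\ o = 0 .
\]
```

For instance, a forecast of f = 0.99 for an event that does not occur incurs a quadratic loss of 0.9801 but a logarithmic loss of −ln(0.01) ≈ 4.61.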
On the Effects of Individual Differences
There were also interesting findings regarding the effects of individual differences. In the
two forecasting experiments, scores in the Cognitive Reflection Test (CRT) significantly
predicted different aspects of subjective judgments. In the NFL study, a higher score on the CRT
was associated with a better (lower) Brier score, less scattered and more consistent judgments.
Likewise, in the NBA study, CRT scores were found to be significantly associated with lower
Brier scores and both measures of resolution. A higher score in a numeracy scale also
significantly predicted a higher number of consistent judgments although it was associated with
more scattered judgments. Higher scores in the Active Open-Minded Thinking scale were also
correlated with lower (better) Brier scores, lower (better) calibration scores, and higher
resolution scores.
These results have both practical and theoretical implications. Theoretically, these
findings contribute to the growing so-called ‘rationality quotient’ literature (Stanovich & West,
2018). Stanovich and West (1998, 2000, 2008) showed evidence that many thinking skills are
predictive of good judgment performance (including probability assessment) even when
cognitive ability is controlled for. An important contribution of this research is that the results
revealed the correlations between several thinking skills and probability forecasts for future
events. Researchers in prior studies typically employ a task-based paradigm to study individual
differences by having subjects perform tasks designed to measure some aspects of judgments
(see Stanovich & West, 2008 for an example). The limitation of this approach is that it does not
reveal the impact of individual differences on actual realistic performances. This dissertation
addressed this issue by asking respondents to forecast events that have meaningful consequences
(see also Mellers et al., 2014).
The significant effects of individual differences on probability judgments suggest the
utility of using a battery of measures to assess thinking skills (Stanovich, West, & Toplak, 2016).
This battery can be used to assess whether and to what extent job applicants possess core thinking skills that would enable them to perform well at their jobs. Such a battery can be a cost-
effective approach to improve the selection process for positions that require extensive judgment
skills. For example, CRT scores can be used to assess reflective thinking skills among applicants
who apply for an intelligence analyst position. In addition, judgments from individuals with good thinking skills can be weighted more heavily than judgments from those who are less adept at reasoning and thinking. When there is an interest in utilizing crowdsourcing as a mechanism to gather predictions or forecasts (e.g., the Good Judgment Project), probability estimates obtained from forecasters with strong reasoning skills may be weighted more heavily when aggregating probability estimates from individual forecasts (see Cooke, 1991, for a review).
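A minimal sketch of what such skill-weighted aggregation could look like is shown below, assuming hypothetical per-forecaster probability estimates and CRT scores; weighting in proportion to the CRT score is only one of many possible schemes and is not prescribed by the results reported here.

```python
import numpy as np

def weighted_forecast(probs, skill):
    """Aggregate individual probability estimates with skill-based weights."""
    probs, skill = np.asarray(probs, float), np.asarray(skill, float)
    w = skill / skill.sum()          # normalize skill scores into weights
    return float(np.dot(w, probs))   # weighted linear opinion pool

# Hypothetical forecasters: estimates for one event and their CRT scores (0-7).
estimates = [0.60, 0.75, 0.40, 0.85]
crt = [2, 6, 1, 7]
print(weighted_forecast(estimates, crt))   # higher-CRT forecasters count more
```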
On the Values of Verbal Expressions of Uncertainty
Although results from this research suggest few benefits of using verbal expressions of uncertainty, the practice of using verbal probability to characterize uncertainty is expected to continue. In fact, even though problems associated with words of estimative probability have long been recognized (Kent, 1964), applications of natural language in probability assessment continue to expand. Verbal expressions of uncertainty remain standard practice within the U.S. intelligence community (National Intelligence Council, 2007) as well as
the primary communication mode in several influential reports such as ones published by the
Intergovernmental Panel on Climate Change (Budescu, Por, Broomell & Smithson, 2014).
Because verbal probability is expected to continue being relevant in many professional practices,
the practical challenge is to establish “best-practice” recommendations for the applications of
natural language in probability assessment and risk communication.
Results from this dissertation offer several contributions that can facilitate the
development of such a best-practice guideline. First, because the quantitative meanings of verbal
expressions are likely to overlap, it is desirable to quantify these expressions with the sample of
individuals who actually use the expressions. An interesting possibility that has not been
explored in this research but deserves further attention in future studies is the effect of context on
numeric values of verbal probability. In both the NFL and NBA studies, respondents were asked
to quantify various verbal probability terms in an isolated context irrelevant to the forecasting
domains. It is possible that the meanings of verbal probability are highly context-dependent. In
fact, Juanchich, Teigen, & Gourdon (2013) found that their subjects believed probability terms
indicating high uncertainty (e.g. possible) suggest maximal consequence whereas terms
indicating low uncertainty (e.g. certain) suggest less severe outcomes. In addition, Piercey (2009)
demonstrated that his subjects were motivated to interpret verbal probability expressions to be
consistent with their desired outcomes. Whether it is more desirable to quantify numeric
meanings of verbal expressions in an isolated context or to assess these values in
domains consistent with their applications remains unknown. This can be a topic to explore in
future research.
The second contribution, which is related to the first, is the gamble methodology. This
method can be used for the quantification of verbal probability expressions. The gamble
methodology is not only conceptually sound as it is based on the well-accepted idea that
subjective probability can be inferred from choices in gambles (Savage, 1954; Edwards, 1954),
but it also generates results consistent with previous research. Table 16 displays the mean
numeric values associated with the verbal expressions of uncertainty used in this research. For comparison purposes, minimum and maximum values for the same expressions in prior studies (Witteman & Renooij, 2003; Wallsten, Budescu, & Zwick, 1993; Hamm, 1991; Theil, 2002) are
reported. The numeric values for most of the expressions in this research fall within the
boundaries of previous studies although there are some exceptions. Importantly, the gamble methodology can be cognitively easier for respondents to complete. Unlike the direct estimation approach, in which respondents are asked to provide numeric estimates for verbal descriptors, the gamble method does not require respondents to hold and map two separate mental representations of probability (words versus numbers). Unlike the membership function approach (Budescu, Karelitz, & Wallsten, 2003), it does not require respondents to complete a laborious scaling procedure, which could be cognitively taxing.
Nevertheless, because there is no standard methodology to convert probability
words/phrases into numerical values, an open area of research is to examine whether and how
different quantification methodologies may give rise to different values of probability words. An
alternative approach is to assess numeric values associated with verbal probability in natural
language uses. Since verbal probability is used extensively in numerous contexts (e.g. sports,
economic and political analyses, weather, etc.) and in a variety of platforms (e.g. social media,
newspapers, webpages, etc.), corpora of texts containing verbal expressions of uncertainty can be
compiled. Methods in text analysis, then, can be applied to estimate the implied numeric values
of the expressions. Note that this approach can be used to assess the validity of the results from
the quantification studies.
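One very rough way such corpus-based estimates could be obtained is sketched below: scan sentences for a probability phrase that co-occurs with an explicit percentage and average the co-occurring values. This is purely illustrative; the example sentences and the extraction rule are invented, and serious work would require far more careful text analysis.

```python
import re
from collections import defaultdict

# Invented example sentences; a real corpus would be compiled from news,
# sports commentary, social media, and similar sources.
corpus = [
    "It is very likely (around 80%) that the team advances.",
    "A win is unlikely, maybe a 20% chance at best.",
    "Rain is very likely tomorrow, roughly 85 percent.",
]
phrases = ["very likely", "unlikely"]

values = defaultdict(list)
for sentence in corpus:
    match = re.search(r"(\d{1,3})\s*(?:%|percent)", sentence)
    if not match:
        continue
    for phrase in phrases:
        if phrase in sentence.lower():
            values[phrase].append(int(match.group(1)) / 100)

for phrase, nums in values.items():
    print(phrase, sum(nums) / len(nums))   # average implied numeric value
```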
[Table 16 about here]
Beyond Accuracy: An Alternative Framework to Evaluate Probability Judgments
Several metrics have been used to evaluate subjective probability judgments in this
research, each measuring a distinct quality of subjective forecasts. Because these measures assess
some aspects of judgment accuracy, they provide useful feedback for probability assessors to
improve their predictions and/or forecasts. Yet, from the perspective of a decision maker,
forecasting accuracy is only one piece of the puzzle that he or she needs to resolve to make
informed decisions based on probabilistic information. The other piece involves the values
decision makers (or organizations) ascribe to the different consequences resulting from the forecasts
and the reality. In medicine, for example, doctors often provide their patients with information
on their diagnoses, and this information is usually probabilistic in nature (e.g., X% chance of
having a disease). The patients, in turn, have to act on this probability and decide on the best course of action. The best decision strategy depends on the patients' values. If the patients are more
concerned about the consequence of a false negative—they are sick, but the doctors’ judgments
indicate otherwise, then they are likely to complete a treatment even though X% is small. In
contrast, if the patients are more concerned about the consequence of a false positive—they are
healthy, but the professional judgments indicate the opposite, they are unlikely to complete a
treatment despite a high X%. In other words, the patients must choose a decision threshold to decide when it is (or is not) worthwhile to complete a treatment.
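This trade-off can be made explicit with the standard treatment-threshold calculation, which is a textbook derivation rather than a result of the present studies. Writing U(s, a) for the patient's utility of being in state s (sick or healthy) under action a (treat or do not treat), treatment is worthwhile whenever the assessed probability p of disease exceeds the threshold p* at which the two expected utilities are equal:

```latex
\[
p\,U(\text{sick},\text{treat}) + (1-p)\,U(\text{healthy},\text{treat})
\;\ge\;
p\,U(\text{sick},\text{no treat}) + (1-p)\,U(\text{healthy},\text{no treat})
\]
\[
\Longleftrightarrow\quad
p \;\ge\; p^{*} =
\frac{U(\text{healthy},\text{no treat}) - U(\text{healthy},\text{treat})}
     {\bigl[U(\text{healthy},\text{no treat}) - U(\text{healthy},\text{treat})\bigr]
      + \bigl[U(\text{sick},\text{treat}) - U(\text{sick},\text{no treat})\bigr]} .
\]
```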
This dissertation explored methods to improve judgment accuracy and did not examine
how individuals’ values drive decisions under uncertainty. When there is an interest in modeling
the role of values in risky decision-making, some of the metrics used in this research may be
insufficient for such an evaluation purpose. Indeed, the loss function in the Brier score does not
consider different utilities and disutilities ascribed to alternative consequences. Thus, the Brier
score is an inadequate measure to assess the quality of decisions based on probabilistic
information. Note that, for clarity, I have made a distinction between forecasters and decision makers (e.g., doctors and patients); in practice, forecasters and decision makers can be the same persons.
Levi (1985) described a framework to evaluate probabilistic forecasts based on signal
detection theory. The author suggested that a comparison between forecasting performances
could be evaluated using three criteria: the area under the curve, face value expected utility, and
optimal expected utility (EU). A central piece to understanding the differences among the three
evaluation approaches is the notion of dominance. Dominance occurs when an ROC curve from
one of the two forecasters dominates the curve of the other (i.e., the dominating curve lies above and does not cross the dominated curve). When this is the case, the utility approaches are
essentially equivalent to the AUC approach. Nonetheless, when the ROC curves cross, the utility
approaches are preferred because they guarantee maximization of expected utility. Particularly,
when there are reasons to believe forecasters are well-calibrated, the face-value EU is
recommended. In contrast, the optimal EU is preferred when it is reasonable to believe
forecasters would hedge their estimates.
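To illustrate the dominance check at the heart of this comparison, the sketch below computes empirical ROC curves and AUCs for two hypothetical forecasters judging the same events and flags whether one curve lies above the other at every false-alarm rate on the grid. The forecasts are simulated, and the curve construction is a bare-bones approximation of what dedicated packages such as ROCR (Sing et al., 2005) provide.

```python
import numpy as np

def roc_points(scores, outcomes, grid=np.linspace(0, 1, 101)):
    """Hit rate (TPR) at thresholds chosen to give the false-alarm rates in grid."""
    scores, outcomes = np.asarray(scores, float), np.asarray(outcomes, int)
    neg, pos = scores[outcomes == 0], scores[outcomes == 1]
    thresholds = np.quantile(neg, 1 - grid)   # threshold at each false-alarm level
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    return grid, tpr

def auc(fpr, tpr):
    # Trapezoidal area under the curve.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

rng = np.random.default_rng(1)
truth = rng.integers(0, 2, 300)
fc_a = np.clip(truth * 0.35 + rng.normal(0.35, 0.2, 300), 0, 1)  # more resolute forecaster
fc_b = np.clip(truth * 0.15 + rng.normal(0.45, 0.2, 300), 0, 1)  # less resolute forecaster

fpr, tpr_a = roc_points(fc_a, truth)
_, tpr_b = roc_points(fc_b, truth)
print("AUC A:", round(auc(fpr, tpr_a), 3), "AUC B:", round(auc(fpr, tpr_b), 3))
print("A dominates B on this grid:", bool(np.all(tpr_a >= tpr_b)))
```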
The SDT framework clearly separates two elicitation processes. The first deals with
probability elicitation: Probability assessors are asked to estimate chance events. The second
deals with eliciting subjective values attached to different consequences: Decision makers are
asked to quantify the importance of different consequences. Both of these elicitations are needed to compute a decision threshold that allows decision makers to maximize their expected utility (see Steyvers, Wallsten, Merkle, & Turner, 2014, for an application). In future research, I
plan to apply the SDT framework to evaluate subjective probability judgments in high-stake
decision-making contexts such as those in intelligence analysis and medical decision-making.
The results in this dissertation help to close some existing research gaps by fulfilling three
objectives: 1) synthesizing prior research to quantify the performance of subjective probability
judgments, 2) comparing alternative methods of probability elicitations, and 3) examining the
effects of individual differences. Because subjective probability estimates are fundamental to
both individual and organizational decision-making, findings from this research provide
important suggestions for the continuing search for approaches to improve subjective judgments.
References

Note: Studies included in the meta-analysis are denoted with an asterisk (*).
Abbas, A., Budescu, D., Yu, H., & Haggerty, R. (2008). A comparison of two probability encoding methods: Fixed probability vs. fixed variable values. Decision Analysis, 5(4), 190-202.
*Arkes, H. R., Dawson, V., Speroff, T., …, Connors, F. (1995). The covariance decomposition of the probability score and its use in evaluating prognostic estimates. Medical Decision Making, 15(2), 120-131. doi:10.1177/0272989X9501500204
Alpert, M., & Raiffa, H. (1982). A progress report on the training of probability assessors. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under uncertainty: Heuristics and biases (pp. 294-305). Cambridge, England: Cambridge University Press.
Armstrong, Denniston, & Gordon. (1975). The use of the decomposition principle in making
judgments. Organizational Behavior and Human Performance, 14(2), 257-263.
Ayton, P., & Wright, G. (1994). Subjective probability: what should we believe? In G. Wright
& P. Ayton (Eds.), Subjective probability (pp. 163–183). Chichester: Wiley
Bar-Hillel, M. (1980). The base-rate fallacy in probability judgments. Acta Psychologica, 44,
211–233.
Baron, J. (2008). Thinking and deciding (fourth edition). New York: Cambridge University
Press
*Benson, P. G., & Önkal, D. (1992). The effects of feedback and training on the performance of probability forecasters. International Journal of Forecasting, 8(4), 559-573.
*Bersabé Morán, Rosa, Martínez Arias, María del Rosario, & Tejeiro Salguero, Ricardo.
(2003). Risk-takers: Do they know how much of a risk they are
taking? Psychology in Spain, 7, 3-9.
Betsch, T., Biel, G., Eddelbuttel, C., & Mock, A. (1998) Natural sampling and base rate
neglect. European Journal of Social Psychology., 28, 269–273.
Borenstein, M., Hedges, L., Higgins, J., Rothstein, H. (2009). Introduction to meta-analysis.
Chichester, U.K.: John Wiley & Sons.
Boulier, & Stekler. (2003). Predicting the outcomes of National Football League games.
International Journal of Forecasting, 19(2), 257-270.
*Brake, G. L. (1998). Calibration of probability judgments: Effects of number of focal
hypotheses and predictability of the environment. ProQuest Dissertations &
Theses
Brenner, Koehler, Liberman, & Tversky. (1996). Overconfidence in Probability and Frequency
Judgments: A Critical Examination. Organizational Behavior and Human
Decision Processes, 65(3), 212-219.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly
Weather Review, 78(1), 1-3. doi:10.1175/1520-
0493(1950)078<0001:VOFEIT>2.0.CO;2
Bruine de Bruin, W., Parker, A. M., & Fischhoff, B. (2007). Individual differences in adult
decision-making competence. Journal of Personality and Social Psychology,
92, 938–956.
Brun, W., & Teigen, K. H. (1988). Verbal probabilities: Ambiguous, context-dependent, or both? Organizational Behavior and Human Decision Processes, 41, 390-404.
Budescu, D. V., & Wallsten, T. S. (1985). Consistency in interpretation of probabilistic
phrases. Organizational Behavior and Human Decision Processes, 36(3), 391-
405. doi:10.1016/0749-5978(85)90007-X
Budescu, D. V., & Wallsten, T. S., (1995). Processing linguistic probabilities: General
principles and empirical evidence. In J. Busemeyer, R. Hastie, & D. L. Medin
(Eds.), Decision making from a cognitive perspective, psychology of learning
and motivation: Advances in research and theory (Vol. 32, pp. 275–318). San
Diego, CA: Academic Press
Budescu, D. V., Weinberg, S., & Wallsten, T. S. (1988). Decisions based on numerically and
verbally expressed uncertainties. Journal of Experimental Psychology: Human
Perception and Performance, 14, 281-294.
Budescu, D., Karelitz, T., & Wallsten, T. (2003). Predicting the directionality of probability
words from their membership functions. Journal of Behavioral Decision
Making, 16(3), 159-180.
Budescu, D., Por, H., Broomell, S. B., & Smithson, M. (2014). The interpretation of IPCC
probabilistic statements around the world. Nature Climate Change, 4(6), 508-
512.
Budescu, David V., Broomell, Stephen, & Por, Han-Hui. (2009). Improving Communication
of Uncertainty in the Reports of the Intergovernmental Panel on Climate
Change Report. Psychological Science, 20(3), 299-308.
*Carlson, B. W. (1993). The accuracy of future forecasts and past judgments. Organizational
Behavior and Human Decision Processes, 54(2), 245-276.
10.1006/obhd.1993.1012
Carvalho, A. (2016). An Overview of Applications of Proper Scoring Rules. Decision
Analysis, 13(4), 223-242.
Chesley, G. (1978). Subjective Probability Elicitation Techniques: A Performance
Comparison. Journal of Accounting Research, 16(2), 225-241.
Clark, H. H. (1990). Comment on Mosteller and Youtz's Quantifying Probabilistic
Expressions. Statistical Science, 5, 12-16.
*Clemen, R. T. (1985). Extraneous expert information. Journal of Forecasting, 4(4), 329-348.
10.1002/for.3980040403
Cooke, R.M. (1991). Experts in Uncertainty, Oxford University Press
*Crown, M. D. (2012). Validation of the NOAA space weather prediction center's solar flare
forecasting look-up table and forecaster-issued probabilities. Space
Weather, 10(6)
Dawes, & Mulford. (1996). The False Consensus Effect and Overconfidence: Flaws in
Judgment or Flaws in How We Study Judgment? Organizational Behavior and
Human Decision Processes, 65(3), 201-211
Bruine de Bruin, W., Fischhoff, B., Millstein, S. G., & Halpern-Felsher, B. L. (2000). Verbal and numerical expressions of probability: "It's a fifty-fifty chance". Organizational Behavior and Human Decision Processes, 81(1), 115-131. doi:10.1006/obhd.1999.2868
DerSimonian, R., & Laird, N. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7(3), 177-188.
*Dolan, J. G., Bordley, D. R., & Mushlin, A. I. (1986). An evaluation of clinicians' subjective
prior probability estimates. Medical Decision Making : An International
Journal of the Society for Medical Decision Making, 6(4), 216.
Doyle, P. (2011). Probability Assessment: Continuous Quantities and Probability
Decomposition, ProQuest Dissertations and Theses.
Edwards, W. (1954). The theory of decision making. Psychological Bulletin, 51(4), 380-417.
Edwards, W. (1968). Conservatism in Human Information Processing. In B. Kleinmuntz
(Ed.), Formal Representation of Human Judgment (pp. 17-52). New York:
Wiley.
Ellsberg, D. (1961). Risk, ambiguity, and the Savage axioms. The Quarterly Journal of
Economics, 75(4), 643-669.
Erev, I., & Cohen, B. (1990). Verbal versus numerical probabilities: Efficiency, biases, and the
preference paradox. Organizational Behavior and Human Decision Processes,
45, 1-18.
Erev, I., Wallsten, T., & Budescu, D. (1994). Simultaneous Over- and Underconfidence: The
Role of Error in Judgment Processes. Psychological Review, 101(3), 519-27
Fagerlin, A., Zikmund-Fisher, B., Ubel, P., Jankovic, A., Derry, H., & Smith, D. (2007).
Measuring Numeracy without a Math Test: Development of the Subjective
Numeracy Scale. Medical Decision Making, 27(5), 672-680.
Ferrell, W. R., & McGoey, P. J. (1980). A model of calibration for subjective probabilities. Organizational Behavior and Human Performance, 26, 32-53. doi:10.1016/0030-5073(80)90045-8
Fischer, G. (1982). Scoring-rule feedback and the overconfidence syndrome in subjective
probability forecasting. Organizational Behavior and Human Performance,
29(3), 352-369.
Fox, C. R., & Ulkümen, G. (2011). Distinguishing two dimensions of uncertainty. In W. Brun,
G. Kirkebøen, & H. Montgomery (Eds.), Essays in judgment and decision
making. Oslo, Norway: Universitetsforlaget.
Frederick, S. (2005). Cognitive Reflection and Decision Making. The Journal of Economic
Perspectives, 19(4), 25-42.
Friedman, J., Lerner, J., & Zeckhauser, R. (2017). Behavioral Consequences of Probabilistic
Precision: Experimental Evidence from National Security Professionals.
International Organization. 71(4), 803-826.
Galway, L. (2007). Subjective probability distribution elicitation in cost risk analysis: A
review. (Report No. TR-410-AF). Santa Monica, CA: RAND.
Gavasakar, U. (1988). A comparison of two elicitation methods for a prior distribution for a
binomial parameter. Management Science, 34, 784-790
Gelman, R. (1990). First Principles Organize Attention to and Learning About Relevant Data:
Number and the Animate‐Inanimate Distinction as Examples. Cognitive
Science, 14(1), 79-106.
Gigerenzer, G., Hoffrage, U., & Kleinbölting, H. (1991). Probabilistic mental models: A Brunswikian theory of confidence. Psychological Review, 98(4), 506-528.
Gigerenzer, G. (1993). The bounded rationality of probabilistic mental models. In K. I.
Manktelow & D. E. Over (Eds.), Rationality Psychological and philosophical
perspectives. London: Routledge
Gigerenzer, G. & Hoffrage, U. (1995). How to improve Bayesian reasoning without
instruction: frequency format. Psychological. Review, 102, 684–704
Gigerenzer, G., & Gaissmaier, W. (2011). Heuristic decision making. Annual Review of
Psychology, 62(1), 451-482. doi:10.1146/annurev-psych-120709-145346
Gilovich, T., Griffin, D. W., & Kahneman, D. (2002). Heuristics and biases: The psychology
of intuitive judgment. Cambridge; New York: Cambridge University Press.
Gonzalez-Vallejo, C., Erev, I., & Wallsten, T. S. (1994). Do decision quality and preference
order depend on whether probabilities are verbal or numerical? American
Journal of Psychology, 107,157-172.
*Greblo, P. (1997). Dispositional factors influencing confidence in and beliefs about the
accuracy of one's own and others' judgment. ProQuest Dissertations & Theses
Full Text
*Gurcay-Morris, B. (2016). The Use of Alternative Reasons in Probabilistic
Judgment. ProQuest Dissertations and Theses
Hamm, R. (1991). Selection of verbal probabilities: A solution for some problems of verbal
probability expression. Organizational Behavior and Human Decision
Processes, 48(2), 193-223.
*Hanea, A. M., McBride, M. F., Burgman, M. A., Wintle, B. C., Fidler, F., Flander, L., . . . Mascaro, S. (2017). InvestigateDiscussEstimateAggregate for structured expert judgement. International Journal of Forecasting, 33(1), 267-279. doi:10.1016/j.ijforecast.2016.02.008
Henrion, Fischer, & Mullin. (1993). Divide and Conquer? Effects of Decomposition on the
Accuracy and Calibration of Subjective Probability Distributions.
Organizational Behavior and Human Decision Processes, 55(2), 207-227.
Hollard, G., Massoni, S., & Vergnaud, J. (2016). In search of good probability assessors: An
experimental comparison of elicitation rules for confidence judgments. Theory
and Decision, 80(3), 363-387.
Hora, S., Dodd, N., & Hora, J. (1993). The use of decomposition in probability assessments of
continuous variables. Journal of Behavioral Decision Making, 6(2), 133-147.
Hora, S. C. (1996). Aleatory and epistemic uncertainty in probability elicitation with an
example from hazardous waste management. Reliability Engineering and
System Safety, 54(2-3), 217-223.
Huizingh, E. K. R. E., & Vrolijk, H. C. J. (1997). A comparison of verbal and numerical
judgments in the analytic hierarchy process. Organizational Behavior and
Human Decision Processes, 70(3), 237–247.
Hartling, L., Featherstone, R., Nuspl, M., Shave, K., Dryden, M. D., & Vandermeer, B. (2017).
Grey literature in systematic reviews: A cross-sectional study of the
contribution of non-English reports, unpublished studies and dissertations to the
results of meta-analyses in child-relevant reviews. BMC Medical Research
Methodology, 17(1), 1-11.
Juanchich, M., Teigen, K. H., & Gourdon, A. (2013). Top scores are possible, bottom scores are
certain (and middle scores are not worth mentioning): A pragmatic view of
verbal probabilities. Judgment and Decision Making, 8(3), 345.
Kahneman, D., Slovic, P., Tversky, A. (1982). Judgment under uncertainty: Heuristics and
biases. Kahneman, D., Slovic, P., Tversky, A (Ed.). Cambridge, New York:
Cambridge University Press.
Kahneman, D. & Tversky, A. (1982). On the study of statistical intuitions. Cognition, 11, 123–
141.
Kahneman, D. (2011). Thinking, fast and slow (1st ed.). New York: Farrar, Straus and Giroux
Kahneman, D., & Tversky, A. (1972). Subjective probability: A judgment of
representativeness. Cognitive Psychology, 3, 430–454
*Karvetski, C. W., Olson, K. C., Mandel, D. R., & Twardy, C. R. (2013). Probabilistic
coherence weighting for optimizing expert forecasts. Decision Analysis, 10(4),
305-326.
Keeney, R., & Von Winterfeldt, D. (1991). Eliciting probabilities from experts in complex
technical problems. IEEE Transactions on Engineering Management, 38(3),
191-201.
Kent, S. (2008, July 07). Words of estimative probability. Retrieved March 17, 2018, from https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/sherman-kent-and-the-board-of-national-estimates-collected-essays/6words.html
Keren, G. (1988). On the ability of monitoring non-veridical perceptions and uncertain
knowledge: Some calibration studies. Acta Psychologica, 67(2), 95-119.
doi:10.1016/0001-6918(88)90007-8.
Keren, G. (1991). Calibration and probability judgments: Conceptual and methodological
issues. Acta Psychologica, 77(3), 217.
Klaczynski, Paul A., & Lavallee, Kristen L. (2005). Domain-Specific Identity, Epistemic
Regulation, and Intellectual Ability as Predictors of Belief-Biased Reasoning:
A Dual-Process Perspective. Journal of Experimental Child Psychology, 92(1),
1-24.
*Kleinmuntz, Fennema, & Peecher. (1996). Conditioned Assessment of Subjective
Probabilities: Identifying the Benefits of Decomposition. Organizational
Behavior and Human Decision Processes, 66(1), 1-15.
Knapp G, Hartung J. (2003). Improved tests for a random effects meta-regression with a single
covariate. Statistical Method. 22, 2693-2710.
*Koehler, D. J., & Harvey, N. (1997). Confidence judgments by actors and observers. Journal
of Behavioral Decision Making, 10(3), 221-242. 10.1002/(SICI)1099-
0771(199709)10:33.0.CO;2-C
*Koehler, D., & Tversky, Amos. (1994). Hypothesis Generation and Confidence in
Judgment, ProQuest Dissertations and Theses.
Kokis, J. V., Macpherson, R., Toplak, M. E., West, R. F., & Stanovich, K. E. (2002). Heuristic
and analytic processing: Age trends and associations with cognitive ability and
cognitive styles. Journal of Experimental Child Psychology, 83 (2002), 26-52
*Kvidera, S., & Koutstaal, W. (2008). Confidence and decision type under matched stimulus
conditions: Overconfidence in perceptual but not conceptual decisions. Journal
of Behavioral Decision Making, 21(3), 253-281. 10.1002/bdm.587
Kynn, M. (2008). The ‘heuristics and biases’ bias in expert elicitation. Journal of the Royal
Statistical Society: Series A (Statistics in Society), 171(1), 239-264.
*Lahiri, K., & Wang, J. G. (2006). Subjective probability forecasts for recessions. Business
Economics, 41(2), 26-37 .
Larrick, R. P., Nisbett, R. E., & Morgan, J. N. (1993). Who uses the cost-benefit rules of
choice? Implications for the normative status of microeconomic theory.
Organizational Behavior and Human Decision Processes, 56, 331–347.
*Lechuga Espino. (2008). The Cross Cultural Variation of Probability Judgment Accuracy:
The Influence of Reasoning Style, ProQuest Dissertations and Theses.
Levi, K. (1985). A signal detection framework for the evaluation of probabilistic forecasts.
Organizational Behavior & Human Decision Processes, 36, 143.
Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1981). Calibration of probabilities: The state of the art to 1980 (Report 1092-81-6). Eugene, OR: Decision Research. Retrieved from http://www.ccnss.org/ccn_2014/materials/pdf/sigman/callibration_probabilities_lichtenstein_fischoff_philips.pdf
*Lichtenstein, S., & Fischhoff, B. (1977). Do those who know more also know more about
how much they know? Organizational Behavior and Human
Performance, 20(2), 159.
Lichtenstein, S., & Fischhoff, B. (1980). Training for calibration. Organizational Behavior
and Human Performance, 26(2), 149.
Lyon, D., & Slovic, P. (1976). Dominance of accuracy information and neglect of base rates in
probability estimation. Acta Psychologica, 40, 287–298.
Macgregor, Lichtenstein, & Slovic. (1988). Structuring knowledge retrieval: An analysis of
decomposed quantitative judgments. Organizational Behavior and Human
Decision Processes, 42(3), 303-323.
*McClish, D. K., & Powell, S. H. (1989). How well can physicians estimate mortality in a
medical intensive care unit? Medical Decision Making, 9(2), 125-132.
10.1177/0272989X8900900207
Mellers, B., Stone, E., Atanasov, P., Rohrbaugh, N., Metz, S., Ungar, L., . . . Tetlock, P.
(2015). The psychology of intelligence analysis: Drivers of prediction accuracy
in world politics. Journal of Experimental Psychology: Applied, 21(1), 1-14.
Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K ., & Tetlock, P. E. (2014).
Psychological strategies for winning a geopolitical forecasting tournament.
Psychological Science, 25(5), 1106-1115. doi:10.1177/0956797614524255
Merkhofer, M. W. (1987). Quantifying judgmental uncertainty: Methodology, experiences,
and insights. IEEE Trans. Systems, Man, Cybernetics, 17, 741–752.
*Merkle, E. C., Steyvers, M., Mellers, B., & Tetlock, P. E. (2017). A neglected dimension of
good forecasting judgment: The questions we choose also matter. International
Journal of Forecasting, 33(4), 817-832. 10.1016/j.ijforecast.2017.04.002
Montibeller, G., & Winterfeldt, D. (2015). Cognitive and motivational biases in decision and
risk analysis. Risk Analysis, 35(7), 1230-1251. doi:10.1111/risa.12360
Moxey, Linda M., & Sanford, Anthony J. (2000). Communicating Quantities: A Review of
Psycholinguistic Evidence of How Expressions Determine Perspectives.
Applied Cognitive Psychology, 14(3), 237-55.
Murphy, A. H. (1973). A new vector partition of the probability score. Journal of Applied
Meteorology, 12(4), 595-600. doi:10.1175/1520-
0450(1973)012<0595:ANVPOT>2.0.CO;2
National Intelligence Council. (2007). Terrorist Threat to the US Homeland. Retrieved from
http://www.dni.gov/press_releases/20070717_release.pdf.
National Research Council, Committee on Behavioral and Social Science Research to Improve
Intelligence Analysis for National Security. (2014). Intelligence analysis for
tomorrow: Advances from the behavioral and social sciences. Washington,
D.C: National Academies Press.
Norwich, A. M., & Türkşen, I. B. (1982). The fundamental measurement of fuzziness. In R. R. Yager (Ed.), Fuzzy sets and possibility theory: Recent developments (pp. 49-60). New York: Pergamon Press.
O'Donnell, M., & Evers, E. (2017). Elicitation-based preference reversals in consumer good (Working paper). Retrieved from SSRN: https://ssrn.com/abstract=2959609 or http://dx.doi.org/10.2139/ssrn.2959609
*O'Keefe, K., & Wildemuth, Barbara M. (2000). The Quality of Medical Students' Confidence
Judgments When Using External Information Resources: The Effects of
Different Media Formats, Source of Questions, and Question
Formats, ProQuest Dissertations and Theses.
O’Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P. H., Jenkinson, D. J.,
Oakley, J. E., & Rakow, T. (2006). Uncertain judgments: eliciting experts’
probabilities. West Sussex, England: Wiley
*Önkal, D., Yates, J. F., Simga-Mugan, C., & Öztin, Ş. (2003). Professional vs. amateur
judgment accuracy: The case of foreign exchange rates. Organizational
Behavior and Human Decision Processes, 91(2), 169-185. 10.1016/S0749-
5978(03)00058-X
Olson, M. J., & Budescu, D. V. (1997). Patterns of preference for numerical and verbal
probabilities. Journal of Behavioral Decision Making, 10, 117-13
*Paese, P. W., & Feuer, M. A. (1991). Decisions, actions, and the appropriateness of
confidence in knowledge. Journal of Behavioral Decision Making, 4(1), 1-16.
10.1002/bdm.3960040102
*Parker, A. M., Bruin, W. B., Yoong, J., & Willis, R. (2012). Inappropriate confidence and
retirement planning: Four studies with a national sample. Journal of Behavioral
Decision Making, 25(4), 382-389. 10.1002/bdm.745
Patalano, A., Saltiel, L., Machlin, J., & Barth, R. (2015). The role of numeracy and
approximate number system acuity in predicting value and probability
distortion. Psychonomic Bulletin & Review, 22(6), 1820-1829.
Peters, E. et al. (2006). Numeracy and decision making. Psychological Science, 17(5), pp.407–
413.
Peterson, D., & Pitz, G. (1988). Confidence, Uncertainty, and the Use of Information. Journal
of Experimental Psychology: Learning, Memory, and Cognition, 14(1), 85.
Phillips, L. D., & Edwards, W. (1966). Conservatism in a simple probability inference task. Journal of Experimental Psychology, 72, 346-354.
Piercey, M. (2009). Motivated reasoning and verbal vs. numerical probability assessment:
Evidence from an accounting context. Organizational Behavior and Human
Decision Processes, 108(2), 330-341.
*Poses, R. M., Bekes, C., Copare, F. J., & Scott, W. E. (1990). What difference do two days
make? the inertia of physicians' sequential prognostic judgments for critically
III patients. Medical Decision Making, 10(1), 6-14.
10.1177/0272989X9001000103
Price, P. (1998). Effects of a Relative-Frequency Elicitation Question on Likelihood Judgment
Accuracy: The Case of External Correspondence. Organizational Behavior and
Human Decision Processes, 76(3), 277-297.
*Quadrel, M. J. (1990). Elicitation and evaluation of adolescents' risk perceptions:
Quantitative and qualitative dimensions. ProQuest Dissertations & Theses
Raiffa, H. (1968). Decision analysis. Reading, MA: Addison-Wesley.
*Rakow, T., Vincent, C., Bull, K., & Harvey, N. (2005). Assessing the likelihood of an
important clinical outcome: New insights from a comparison of clinical and
actuarial judgment. Medical Decision Making, 25(3), 262-282.
10.1177/0272989X05276849
Rapoport, A., Wallsten, T. S., Erev, I., & Cohen, B. L. (1990). Revision of opinion with
verbally and numerically expressed uncertainties. Acta Psychologica, 74, 61–
79.
Renooij, S., & Witteman, C. (1999). Talking probabilities: Communicating probabilistic
information with words and numbers. International Journal of Approximate
Reasoning, 22(3), 169-194. doi:10.1016/S0888-613X(99)00027-4
*Ronis, D. L., & Yates, J. F. (1987). Components of probability judgment accuracy: Individual
consistency and effects of subject matter and assessment method.
Organizational Behavior and Human Decision Processes, 40(2), 193-218.
10.1016/0749-5978(87)90012-4
Sá, W. C., & Stanovich, K. E. (2001). The domain specificity and generality of mental
contamination: Accuracy and projection in judgments of mental content. British
Journal of Psychology, 92 (2), 281-302.
Suantak, L., Bolger, F., & Ferrell, W. (1996). The Hard–Easy Effect in Subjective Probability
Calibration. Organizational Behavior and Human Decision Processes, 67(2),
201-221.
*Satopää, V. A., Baron, J., Foster, D. P., Mellers, B. A., Tetlock, P. E., & Ungar, L. H. (2014).
Combining multiple probability predictions using a simple logit model.
International Journal of Forecasting, 30(2), 344-356.
10.1016/j.ijforecast.2013.09.009
Savage, L.J. (1954). The foundations of statistics (2d rev. ed.). New York: Dover Publications.
Savage, L.J. (1971). Elicitation of Personal Probabilities and Expectations. Journal of the
American Statistical Association, 66, 783-801
Schum, D. A., Goldstein, I. L., Howell, W. C., & Southard, J. F. (1967). Subjective probability
revisions under several cost-payoff arrangements. Organizational Behavior &
Human Performance, 2(1), 84-104
Schwartz, L. M. (1997). The Role of Numeracy in Understanding the Benefit of Screening
Mammography. Annals of Internal Medicine, 127(11), 966. doi:10.7326/0003-
4819-127-11-199712010-00003
Shanteau, J. (1992). Competence in experts: The role of task characteristics. Organizational
Behavior and Human Decision Processes, 53, 252- 266
*Sharp, G. L., Cutler, B. L., & Penrod, S. D. (1988). Performance feedback improves the
resolution of confidence judgments. Organizational Behavior and Human
Decision Processes, 42(3), 271
Shepherd, G.G. & Kirkwood, C.W. (1994). Managing the Judgmental Probability Elicitation
Process: A Case Study of Analyst/Manager Interaction. IEEE Transactions on
Engineering Management, 41, 414-425
*Sieck, W. R., & Arkes, H. R. (2005). The recalcitrance of overconfidence and its contribution
to decision aid neglect. Journal of Behavioral Decision Making, 18(1), 29-53.
10.1002/bdm.486
Sing, T., Sander, O., Beerenwinkel, N., & Lengauer, T. (2005). ROCR: visualizing classifier
performance in R. Bioinformatics, 21(20), 7881.
Stanovich, K. E., & West, R. F. (1998). Individual differences in rational thought. Journal of
Experimental Psychology: General, 127, 161–188
Stanovich, K. E., & West, R. F. (1999). Discrepancies between normative and descriptive
models of decision making and the understanding/acceptance principle.
Cognitive Psychology, 38, 349–385
Stanovich, K. E., & West, R. F. (2000). Individual differences in reasoning: Implications for
the rationality debate? Behavioral and Brain Sciences, 23(5), 645-665.
doi:10.1017/S0140525X00003435
Stanovich, K. E., & West, R. F. (2008). On the relative independence of thinking biases and
cognitive ability. Journal of Personality and Social Psychology, 94(4), 672-
695. doi:10.1037/0022-3514.94.4.6
Stanovich, K. E. (2009). What intelligence tests miss: The psychology of rational thought. New Haven, CT: Yale University Press.
Stanovich, K. E., West, R. F., & Toplak, M. E. (2011). Intelligence and rationality. In R. J.
Sternberg & S. B. Kaufman (Eds.), Cambridge handbook of intelligence. New
York: Cambridge University Press.
Stanovich, K., West, R., & Toplak, M. (2013). Myside Bias, Rational Thinking, and
Intelligence. Current Directions in Psychological Science, 22(4), 259-264.
Stanovich, K. E., West, R., & Toplak, M. (2016). Rationality Quotient: Toward a test of
rational thinking. Cambridge, MA.: MIT Press
*Stewart, Heideman, Moninger, & Reagan-Cirincione. (1992). Effects of improved
information on the components of skill in weather forecasting. Organizational
Behavior and Human Decision Processes, 53(2), 107-134
Steyvers, M., Wallsten, T., Merkle, E., & Turner, B. (2014). Evaluating Probabilistic Forecasts
with Bayesian Signal Detection Models. Risk Analysis, 34(3), 435-452.
Stone, E. R., & Opel, R. B. (2000). Training to improve calibration and discrimination: The
effects of performance and environment feedback. Organizational Behavior
and Human Decision Processes, 83(2), 282-309.
*Stone, E. R., Dodrill, C. L., & Johnson, N. (2001). Depressive cognition: A test of depressive
realism versus negativity using general knowledge questions. The Journal of
Psychology, 135(6), 583-602.
Swets, J. A., Dawes, R. M., Monahan, J., & Bjork, R. A. (2000). Psychological science can
improve diagnostic decisions. Psychological Science in Public Interest, 11(3),
1-26.
Swets, J., & Pickett, R. (1982). Evaluation of diagnostic systems: Methods from signal
detection theory. New York: Academic Press
Tannenbaum, D., Fox, C., & Ülkümen, G. (2016). Judgment extremity and accuracy under epistemic vs. aleatory uncertainty. Management Science, 63(2), 497-518.
Tanner, W. P., & Swets, J. A. (1954). A decision-making theory of visual detection.
Psychological Review, 61(6), 401-409. doi:10.1037/h0058700
Teigen, K. H., & Brun, W. (1999). The directionality of verbal probability expressions: Effects on decisions, predictions, and probabilistic reasoning. Organizational Behavior and Human Decision Processes, 80(2), 155-190.
Tetlock, P. E. (2005). Expert political judgment: How good is it? How can we know?
Princeton, NJ: Princeton University Press.
Toplak, M. E., West, R. F., & Stanovich, K. E. (2011). The Cognitive Reflection Test as a
predictor of performance on heuristics and biases tasks. Memory & Cognition,
39, 1275–1289
Toplak, M. E., West, R. F., & Stanovich, K. E. (2014). Assessing miserly processing: An
expansion of the Cognitive Reflection Test. Thinking & Reasoning, 20, 147–
168
Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and
probability. Cognitive Psychology, 5, 207-232
Tversky, A., Sattath, S., & Slovic, P. (1988). Contingent weighting in judgment and choice.
Psychological Review, 95(3), 371-384.
Tversky, A, & Koehler, J. (1994). Support Theory: A Nonextensional Representation of
Subjective Probability. Psychological Review, 101(4), 547-67.
Ulkumen, G., Fox, C., & Malle, B. (2016). Two dimensions of subjective uncertainty: Clues
from natural language. Journal of Experimental Psychology: General, 145(10),
1280
*Vescio, M. D., & Thompson, R. L. (2001). Subjective tornado probability forecasts in severe
weather watches. Weather and Forecasting, 16(1), 192-195.
von Winterfeldt, D. & Edwards, W. (1986). Decision analysis and behavioral research. New
York: Cambridge University Press.
Wallsten, T. S., & Budescu, D. V. (1983). State of the art--encoding subjective probabilities: A
psychological and psychometric review. Management Science, 29(2), 151-173.
doi:10.1287/mnsc.29.2.15
Wallsten, T. S., Budescu, D. V., & Zwick, R. (1993). Comparing the calibration and coherence
of numerical and verbal probability judgments. Management Science, 39(2),
176-190. doi:10.1287/mnsc.39.2.176
Wallsten, T. S., Budescu, D. V., Rapoport, A., Zwick, R., & Forsyth, B. (1986). Measuring the
vague meanings of probability terms. Journal of Experimental Psychology:
General, 115(4), 348-365. doi:10.1037/0096-3445.115.4.348
Wallsten, T., & Budescu, D.V. (1990). Comment on Mosteller and Youtz' Quantifying
Probabilistic Expressions. Statistical Science, 5, 23-26.
*Wang, G., Kulkarni, S., Poor, H., & Osherson, D. (2011). Aggregating large sets of probabilistic
forecasts by weighted coherent adjustment. Decision Analysis, 8(2), 128-144.
10.1287/deca.1110.0206
Windschitl, P. D., & Wells, G. L. (1996). Measuring psychological uncertainty: Verbal versus
numeric methods. Journal of Experimental Psychology: Applied, 2(4), 343-364.
doi:10.1037/1076-898X.2.4.343
Winkler, R. (1996). Uncertainty in probabilistic risk assessment. Reliability Engineering and
System Safety, 54(2), 127-132.
Witteman, C., & Renooij, S. (2003). Evaluation of a verbal–numerical probability scale.
International Journal of Approximate Reasoning, 33(2), 117-131.
doi:10.1016/S0888-613X(02)00151-2
*Yates, J. F. (1982). External correspondence: Decompositions of the mean probability score.
Organizational Behavior and Human Performance 30(1): 132-156.
Yates, J. Frank. (1990). Judgment and decision making. Englewood Cliffs, N.J: Prentice Hall.
Zikmund-Fisher, B., Smith, D., Ubel, P., & Fagerlin, A. (2007). Validation of the Subjective
Numeracy Scale: Effects of low numeracy on comprehension of risk
communications and utility elicitations. Medical Decision Making, 27(5), 663-
671.
Zimmer, A. C. (1983). Verbal vs. numerical processing of subjective probabilities. In R.
Scholz (Ed.), Decision making under uncertainty (pp. 159-182). Amsterdam:
NorthHolland.
Table 1.
Variables coded in the meta-analysis
Variable | Description
Female | Proportion of female respondents
Assessor | Expert vs. lay
Elicitation | Method used to elicit probability: single judgment vs. multiple judgments for mutually exclusive outcomes
Response mode | Format of probability representation (e.g., numbers, verbal expressions, odds)
Assessment context | Probability assessment task (prediction vs. feeling-of-knowing)
Prediction domain | Domain of the prediction task
Number of questions | Total number of questions judges were asked in a study
Association with psychological and behavioral measures | Whether a record reports any association between the accuracy of probability judgments and psychological/behavioral/personality measures
Intervention | Whether researchers employed a method to improve the quality of probability judgments
Indexes reported | List of performance indexes reported in each record
Base rate | Proportion of target events that actually happened in a prediction task
Proportion of correct predictions | Proportion of correct predictions in a prediction task
Proportion of correct answers | Proportion of correct answers in a feeling-of-knowing task
Table 2.
Descriptions of primary studies in the meta-analysis
Study Name | Sample | Brier | Standard Deviation | Record | Assessor | Elicitation | # Questions | Context
Espino_2008 430 0.242 0.001 dissertation/thesis Laypeople Single 100 FOK
Brake_1998_study2 20 0.139 0.015 dissertation/thesis Experts Single 300 FOK
Ronis_1987_condition1 128 0.256 0.0247 dissertation/thesis Laypeople Single 51 Prediction
Brake_1998_study1 20 0.152 0.0291 dissertation/thesis Experts Single 201 FOK
Greblo_1997_study1_condition1 62 0.218 0.033 dissertation/thesis Laypeople Single 115 FOK
Greblo_1997_study2_condition1 70 0.116 0.036 journal article Laypeople Single 77 FOK
Benson_1992 10 0.208 0.037 dissertation/thesis Laypeople Single 55 Prediction
Sieck_2005_study3 37 0.23 0.041 dissertation/thesis Laypeople Single 60 FOK
OKeefe_2000 154 0.068 0.042 dissertation/thesis Experts Multiple 71 FOK
Sieck_2005_study2 45 0.234 0.042 dissertation/thesis Laypeople Single 60 Prediction
Sharp_1988 54 0.218 0.044 dissertation/thesis Laypeople Single 60 FOK
Sieck_2005_study1 29 0.265 0.044 dissertation/thesis Laypeople Single 60 Prediction
Moran_2003 218 0.227 0.045 dissertation/thesis Experts Multiple 15 FOK
Ronis_1987_condition2 128 0.229 0.0476 dissertation/thesis Laypeople Single 51 FOK
Carlson_1993_study2_condition1 30 0.251 0.0523 journal article Laypeople Single 60 Prediction
Parker_2012_study2 1114 0.07 0.06 journal article Laypeople Single 14 FOK
Parker_2014_study4 556 0.21 0.06 journal article Laypeople Single 70 FOK
Greblo_1997_study1_condition2 62 0.333 0.073 journal article Laypeople Single 55 Others
Gurcay_2016_study6 83 0.33 0.073 journal article Laypeople Multiple 20 FOK
Carlson_1993_study2_condition2 31 0.274 0.0777 journal article Laypeople Single 60 FOK
Carlson_1993_study1 35 0.229 0.08 journal article Laypeople Single 46 Prediction
Carlson_1993_study1_condition1 35 0.229 0.08 journal article Laypeople Single 46 Prediction
Carlson_1993_study1_condition2 35 0.303 0.081 journal article Laypeople Single 46 FOK
Gurcay_2016_study1A 74 0.386 0.083 journal article Laypeople Multiple 20 FOK
Koehler_1997 35 0.292 0.09 journal article Laypeople Single 80 FOK
Gurcay_2016_study1B 93 0.441 0.093 journal article Laypeople Multiple 20 FOK
Gurcay_2016_study2 86 0.336 0.104 journal article Laypeople Multiple 20 FOK
Gurcay_2016_study4 63 0.332 0.125 journal article Laypeople Multiple 10 FOK
Greblo_1997_study2_condition2 70 0.418 0.129 dissertation/thesis Laypeople Single 77 Others
Brake_1998_study5 30 0.208 0.145 dissertation/thesis Experts Single 100 Others
Carlson_1993_study2 35 0.251 0.2 dissertation/thesis Laypeople Single 60 Prediction
Doyle_2011 115 0.261 0.261 journal article Laypeople Single 394 Prediction
Crown_2012 32 0.045 0.0356 dissertation/thesis Experts Single 94452 Prediction
Lahiri_2006 between 15-60 0.071 0.0356 journal article Experts Single 143 Prediction
Vescio_2000 2 0.155 0.0356 journal article Experts Single 322 Prediction
Rakow_2005 24 0.152 0.0356 journal article Experts Single 40 FOK
Dolan_1986 104 0.31 0.0356 journal article Experts Multiple 7 FOK
Katzman_1989 6 0.124 0.0356 journal article Experts Single 523 Prediction
Poses_1990 77 0.115 0.0356 journal article Experts Single 8 Prediction
Arkes_1995_condition1 692 0.134 0.0356 journal article Experts Single 1 Prediction
Arkes_1995_condition2 692 0.142 0.0356 journal article Cannot determine Single 1 Prediction
Stewart_1992 99 0.2335 0.0356 journal article Experts Single 99 FOK
Onkal_2003_condition1 40 0.195 0.0356 journal article Experts Single 50 Prediction
Onkal_2003_condition2 57 0.225 0.0356 journal article Cannot determine Single 50 Prediction
Paese_1991 49 0.202 0.06 journal article Laypeople Single 49 FOK
Lichtenstein_1977_study1 92 0.293 0.0777 journal article Laypeople Single 12 Others
Lichtenstein_1977_study2 63 0.291 0.0777 journal article Laypeople Single 12 FOK
Lichtenstein_1977_study3 57 0.281 0.0777 journal article Laypeople Single 10 Others
Lichtenstein_1977_study4 120 0.233 0.0777 journal article Laypeople Single 75 FOK
Lichtenstein_1977_study5 50 0.161 0.0777 journal article Laypeople Single 100 FOK
Kvidera_2008_study3 93 0.209 0.0777 journal article Laypeople Single 92 Others
Kvidera_2008_study2 36 0.201 0.0777 journal article Laypeople Single 94 Others
Kvidera_2008_study1 36 0.213 0.0777 journal article Laypeople Single 104 Others
Merkle_2017 771 0.043 0.08 journal article Laypeople Single 3 Prediction
Hanea_2017 161 0.19 0.08 journal article Laypeople Single 11 Prediction
Wang_2011 15940 0.105 0.08 journal article Laypeople Single 28 Prediction
Yates_1982 38 0.253 0.06 journal article Laypeople Single 20 Prediction
Kohler_1993 32 0.14 0.044 journal article Laypeople Single 96 FOK
Quadrel_1990_study1 85 0.164 0.06 dissertation/thesis Laypeople Single 30 FOK
Quadrel_1990_study2 45 0.176 0.06 journal article Laypeople Single 100 FOK
Kavertsky_2013_study1 30 0.224 0.06 journal article Laypeople Single 60 FOK
Kavertsky_2013_study2 28 0.202 0.06 journal article Laypeople Single 60 FOK
Kavertsky_2013_study1 32 0.317 0.06 journal article Laypeople Single 256 FOK
Kavertsky_2013_study2 29 0.341 0.06 journal article Laypeople Single 256 FOK
Stone_2000 84 0.234 0.06 journal article Laypeople Single 100 Others
Stone_2000 123 0.236 0.08 journal article Laypeople Single 70 FOK
Satopaa_2014 235 0.1415 0.08 journal article Laypeople Single 69 Prediction
Nguyen_2018_condition_study1 101 0.28 0.05 Unpublished Laypeople Single 50 Prediction
Nguyen_2018_condition_study2 46 0.24 0.03 Unpublished Experts Single 46 Prediction
Table 3
Correlations among the covariates in the meta-regression model
Expertise (Expert = 1): r = -0.178 with Elicitation (Multiple = 1), -0.209 with Number of questions, -0.183 with Assessment (Prediction = 1)
Elicitation (Multiple = 1): r = 0.039 with Number of questions, 0.341 with Assessment (Prediction = 1)
Number of questions: r = -0.112 with Assessment (Prediction = 1)
Table 4.
Means and SDs of numeric values of the verbal expressions in the ambiguity aversions study
Expressions | Average Unbiased Estimates (SD) | Estimates in the RED Frame | Estimates in the BLUE Frame | Correction Values (= Unbiased - Biased)
Almost impossible 0.14(0.15) 0.12(0.18) 0.17(0.29) 0.02
Very unlikely 0.2(0.2) 0.16(0.15) 0.25(0.27) 0.04
Very doubtful 0.22(0.23) 0.13(0.15) 0.3(0.34) 0.08
Improbable 0.22(0.23) 0.17(0.17) 0.28(0.27) 0.05
Highly improbable 0.23(0.24) 0.22(0.26) 0.25(0.29) 0.01
Unlikely 0.23(0.24) 0.19(0.17) 0.26(0.22) 0.04
Doubtful 0.23(0.24) 0.18(0.19) 0.29(0.25) 0.05
Slight chance 0.26(0.26) 0.22(0.23) 0.31(0.26) 0.04
Probable 0.61(0.62) 0.55(0.23) 0.66(0.25) 0.06
Good chance 0.61(0.62) 0.59(0.24) 0.63(0.28) 0.02
Likely 0.65(0.65) 0.6(0.25) 0.7(0.27) 0.05
Pretty sure 0.69(0.67) 0.63(0.28) 0.76(0.27) 0.06
Highly probable 0.69(0.67) 0.66(0.26) 0.71(0.31) 0.03
Very likely 0.71(0.69) 0.67(0.25) 0.74(0.26) 0.04
Almost certain 0.74(0.75) 0.73(0.29) 0.75(0.31) 0.01
No doubt 0.75(0.73) 0.72(0.32) 0.77(0.35) 0.03
Table 5
Psychological measures of individual differences in the NFL study
Measure Question Reliability
Relative Expertise
in Football (1-
Extremely
Disagree, 7-
Extremely Agree)
I believe I know more about NFL football than most of my friends and
family members
I am extremely knowledgeable about NFL football
I have been following the NFL for a long time.
I would feel very negative when I miss an important game of my favorite
team
I would feel very sad when my favorite team loses
Cronbach’s alpha =
0.88
Football-Related
Behaviors
(Yes/No)
During the last NFL season, did you typically watch Monday night games?
During the last NFL season, did you typically watch Thursday night
games?
During the last NFL season, did you watch any replay games?
Do you currently have Direct TV NFL Sunday Ticket?
Will you watch an NFL game which does not have your favorite team?
Have you ever bet money on an NFL game?
Have you ever attended an NFL game?
Do you have at least one of the following NFL merchandise: jerseys,
footballs, caps, helmets, flags?
Did you follow news about the NFL at least once a week during the last
year NFL season?
Do you plan to follow news about the NFL at least once a week during this
coming NFL season
Cronbach’s alpha =
0.87
Cognitive
Reflection Scale
A bat and ball together cost $1.10. The bat costs $1.00 more than the ball.
How much does the ball cost (in dollars)?
In a lake, there is a patch of lily pads. Every day, the patch doubles in size.
If it takes 48 days for the patch to cover the entire lake, how long would it
take (in days) for the patch to cover half the lake?
If it takes 5 machines 5 minutes to make 5 widgets, how long would it take
(in minutes) 100 machines to make 100 widgets?
Cronbach’s alpha =
0.78
Subjective
Numeracy Scale
First four questions (1-Not At All Good, 7-Extremely Good):
How good are you at working with fractions?
How good are you at working with percentages?
How good are you at calculating a 15% tip?
How good are you at figuring out how much a shirt will cost if it is 25%
off?
When reading the newspaper, how helpful do you find tables and graphs
that are part of a story? (1-Not At All Useful, 7-Extremely Useful)
When people tell you the chance of something happening, do you prefer
that they use words ("it rarely happens") or numbers ("there's a 1%
chance")? (1-Always Prefer Words, 7-Always Prefer Numbers)
When you hear a weather forecast, do you prefer predictions using
percentages (e.g., “there will be a 20% chance of rain today”) or predictions
using only words (e.g., “there is a small chance of rain today”)? (1-Always
Prefer Percentage, 7-Always Prefer Words)
How often do you find numerical information to be useful? (1-Never, 7-
Very Often)
Cronbach’s alpha =
0.92
Table 6.
Characteristics of the MIN sample
Measure | Number | Verbal
Sample size | 101 | 82
Proportion of males | 65.66% | 67.00%
Mean completion time (minutes) | 18.27 | 16.85
Mean rating of the difficulty level of the prediction questions (0-100) | 62.29 | 60.48
Mean score of self-reported expertise (1-7) | 4.59 | 4.34
Mean proportion of football-related behaviors in which experts engaged | 72.40% | 70.36%
Table 7.
Descriptive statistics of the numeric translations of verbal expressions
in the NFL study (after correcting for ambiguity aversion)
Expression | Mean | SD | 25th percentile | Median | 75th percentile
Improbable 0.109 0.122 0.055 0.100 0.100
Unlikely 0.135 0.147 0.055 0.100 0.138
Doubtful 0.145 0.182 0.055 0.100 0.152
Slight Chance 0.164 0.156 0.100 0.100 0.175
Good Chance 0.551 0.215 0.500 0.625 0.625
Probable 0.568 0.219 0.500 0.625 0.750
Pretty Sure 0.590 0.252 0.500 0.625 0.750
Likely 0.663 0.216 0.625 0.625 0.825
Table 8.
Correlations among performance indexes in the MIN sample in the NFL study
Brier Bias Calibration Slope AUC Scatter Overconfidence Consistency "Correct"
Prediction
Brier
0.62* 0.62* -0.53* -0.69* 0.47* 0.76* - -0.24 -0.78*
Bias 0.49*
0.96* -0.22* -0.18 -0.14 0.11 -0.12 -0.73*
Calibration 0.68* 0.65*
- 0.23* - .14 - 0.19 0.1 -0.18 0.19
Slope -0.26* 0.16 -0.09
0.89* 0.21* - 0.32* 0.34* 0.69*
AUC -0.48* 0.11 -0.14 0.87*
0.05 - 0.61* 0.27* 0.68*
Scatter 0.63* 0.27* 0.12 0.40* -0.18
0.76* 0.08 0.06
Overconfidence 0.78* 0.15 0.29* -0.18 - 0.44* 0.69*
- 0.02 - 0.45*
Consistency 0.06 0.29* 0.19 0.25* 0.23* 0.12 - 0.04
-0.22
"Correct"
Prediction
-0.58* -0.61* - 0.57* 0.30* 0.29* -0.15 - 0.48* 0.19
* p < .05. The upper half of the matrix (shaded in the original table) shows correlations in the Numeric condition; the lower half shows correlations in the Verbal condition.
Table 9.
Performance measures in the MIN and most informative samples
Measure | MIN SAMPLE (N = 183) | MOST INFORMATIVE SAMPLE (N = 67)
Group | Numeric Judgment | Verbal Judgment | Numeric Judgment | Verbal Judgment
Brier 0.28 (0.05) 0.27 (0.05) 0.23 (0.05) 0.22 (0.02)
Bias 0.17 (0.08) 0.12 (0.12) 0.11 (0.08) 0.08 (0.08)
Calibration 0.03 (0.03) 0.03 (0.03) 0.02 (0.01) 0.01 (0.01)
Resolution (Slope) 0.12 (0.07) 0.12 (0.07) 0.17 (0.07) 0.15 (0.15)
Resolution (AUC) 0.63 (0.07) 0.62 (0.6) 0.69 (0.07) 0.65 (0.06)
Scatter 0.07 (0.03) 0.08 (0.04) 0.06 (0.03) 0.05 (0.02)
Proportion of correct predictions 0.58 (0.08) 0.60 (0.07) 0.65 (0.08) 0.67 (0.09)
(Over) Confidence 0.19 (0.09) 0.22 (0.11) 0.12 (0.06) 0.13 (0.10)
Proportion of consistent responses 0.62 (0.35) 0.42 (0.29) 0.76 (0.35) 0.39 (0.32)
Table 10.
Measures of individual differences in the NBA study
Measure Question Reliability
Cognitive Reflection
Scale (CRT)
A bat and ball together cost $1.10. The bat costs $1.00 more
than the ball. How much does the ball cost (in dollars)?
In a lake, there is a patch of lily pads. Every day, the patch
doubles in size. If it takes 48 days for the patch to cover the
entire lake, how long would it take (in days) for the patch to
cover half the lake?
If it takes 5 machines 5 minutes to make 5 widgets, how long
would it take (in minutes) 100 machines to make 100 widgets?
Cronbach’s alpha = 0.74
Active Open-Minded
Thinking (1-Completely
Agree to 5-Completely
Disagree) (AOT)
Allowing oneself to be convinced by an opposing argument is
a sign of good character.
People should take into consideration evidence that goes
against their beliefs.
People should revise their beliefs in response to new
information or evidence.
Changing your mind is a sign of weakness.
Intuition is the best guide in making decisions.
It is important to persevere in your beliefs even when evidence is
brought to bear against them.
One should disregard evidence that conflicts with one's
established beliefs.
People should search actively for reasons why their beliefs
might be wrong.
Cronbach’s alpha = 0.84
Table 11.
Means and SDs of the numeric translations of verbal expressions in the NBA study
Probability Expressions | Estimates in the Follow-Up Study | Estimates After Correcting for Ambiguity Aversion
Almost Impossible 0.11 (0.18) 0.11 (0.18)
Highly Improbable 0.19 (0.26) 0.21 (0.26)
Very Unlikely 0.12 (0.15) 0.17 (0.15)
Very Doubtful 0.13 (0.15) 0.22 (0.15)
Improbable 0.17 (0.2) 0.21 (0.18)
Unlikely 0.17 (0.18) 0.21 (0.17)
Doubtful 0.18 (0.19) 0.23 (0.2)
Slight Chance 0.22 (0.25) 0.26 (0.24)
Probable 0.55 (0.23) 0.61 (0.23)
No Doubt 0.78 (0.33) 0.81 (0.32)
Likely 0.61 (0.24) 0.64 (0.24)
Good Chance 0.56 (0.23) 0.58 (0.23)
Pretty Sure 0.61 (0.27) 0.68 (0.27)
Very Likely 0.66 (0.24) 0.7 (0.24)
Highly Probable 0.68 (0.25) 0.71 (0.25)
Almost Certain 0.75 (0.27) 0.75 (0.27)
Table 12.
Correlations among performance measures in the MIN sample in the NBA study
Brier Bias Overconfidence Calibration Slope AUC Scatter
Brier 0.05 -0.29* 0.8* 0.54* -0.66* -0.81* 0.3*
Bias -0.01 -0.55* 0.19* 0.17* 0.21*
Overconfidence 0.14 -0.34* -0.6* 0.65*
Calibration -0.24* -0.26* -0.24*
Slope 0.9* 0.32*
AUC 0.03
Scatter
*p < .01
Table 13.
Means and SDs of the performance measures in the MIN sample
Group Gamble Direct/Number Number/Brier Verbal/Brier
Brier 0.25 (0.06) 0.24 (0.03) 0.25 (0.04) 0.27 (0.06)
Bias -0.06 (0.12) 0.01 (0.05) 0.00 (0.05) 0.001 (0.08)
Calibration 0.02 (0.04) 0.003 (0.004) 0.004 (0.01) 0.01 (0.03)
Overconfidence 0.14 (0.11) 0.12 (0.08) 0.14 (0.09) 0.22 (0.10)
Slope 0.16 (0.11) 0.15 (0.08) 0.17 (0.10) 0.20 (0.12)
AUC 0.66 (0.10) 0.66 (0.07) 0.66 (0.08) 0.66 (0.09)
Scatter 0.06 (0.03) 0.06 (0.02) 0.07 (0.03) 0.10 (0.04)
Table 14.
Regression results in the MIN sample
Performance Measures | Number vs. Number and Incentivized | Number and Incentivized vs. Verbal and Incentivized | Number vs. Gamble | Active Open-Minded Thinking | Cognitive Reflection Test | R² | Model Statistics
Brier b 0.004 0.02 -0.01 -0.01 -0.03 0.11 F(5, 301) = 7.55
t(SE) 0.53 (0.01) 2.48 (0.01) -1.32 (0.01) -2.6(0.004) -3.17(0.01)
Bias b 0.00 -0.01 0.07 0.01 0.02 0.13 F(5, 301) = 9.07
t(SE) -0.34 (0.01) -0.82 (0.02) 5.17 (0.01) 1.43 (0.01) 1.55 (0.01)
Overconfidence b 0.02 0.05 0.00 -0.01 -0.03 0.08 F(5, 301) = 5.43
t(SE) 1.04 (0.02) 3.06 (0.02) -0.15 (0.02) -0.96 (0.01) -1.96 (0.02)
Calibration b 0.00 0.01 -0.02 -0.01 -0.01 0.14 F(5, 301) = 9.42
t(SE) 0.48 (0.004) 1.06 (0.01) -4.36 (0.004) -3.21 (0.002) -1.93 (0.004)
Slope b 0.02 0.03 -0.01 0.02 0.06 0.12 F(5, 301) = 8.1
t(SE) 1.26 (0.02) 1.62 (0.02) -0.77 (0.03) 2.82 (0.01) 3.93 (0.02)
AUC b 0.01 0.00 0.00 -0.02 0.05 0.10 F(5, 301) = 6.42
t(SE) 0.39 (0.01) -0.04 (0.02) -0.35 (0.01) 2.43 (0.01) 3.95 (0.01)
Scatter b 0.01 0.03 0.00 0.01 0.01 0.16 F(5, 301) = 11.78
t(SE) 1.77 (0.01) 4.68 (0.01) 0.69 (0.01) 1.76 (0.003) 1.3 (0.01)
Note: In the original table, significant results (p < .05) are highlighted in gray. All models were significant.
Table 15
Means (Standard Deviations) of the performance measures in the most informative sample
Group Gamble Direct/Number Incentivized/Number Incentivized/Verbal
Brier 0.21 (0.03) 0.22 (0.02) 0.21 (0.03) 0.21 (0.03)
Bias -0.02 (0.07) 0.01 (0.05) 0 (0.05) 0.02 (0.06)
Calibration 0.01 (0.01) 0.002 (0.003) 0.002 (0.004) 0.004 (0.005)
Overconfidence 0.05 (0.08) 0.08 (0.07) 0.08 (0.08) 0.1 (0.06)
Slope 0.25 (0.1) 0.21 (0.08) 0.23 (0.11) 0.29 (0.11)
AUC 0.75 (0.07) 0.71 (0.05) 0.73 (0.06) 0.74 (0.08)
Scatter 0.06 (0.03) 0.06 (0.02) 0.06 (0.03) 0.08 (0.03)
Table 16.
Comparing probability values of verbal expressions of uncertainty
Expressions | Low | High | Mean (NFL Study) | Mean (NBA Study) | Mean (Ambiguity Study)
Almost Impossible | 0 | 0.1 | - | 0.11 | 0.14
Very Unlikely | 0.02 | 0.28 | - | 0.17 | 0.2
Unlikely | 0.02 | 0.3 | 0.14 | 0.21 | 0.23
Improbable | 0.05 | 0.23 | 0.11 | 0.21 | 0.22
Highly Improbable | - | - | - | 0.21 | 0.23
Very Doubtful | - | - | - | 0.22 | 0.22
Doubtful | - | - | 0.15 | 0.23 | 0.23
Slight Chance | - | - | 0.16 | 0.26 | 0.26
Good Chance | 0.71 | 0.82 | 0.57 | 0.58 | 0.61
Probable | 0.51 | 0.96 | 0.66 | 0.61 | 0.61
Likely | 0.63 | 0.85 | 0.55 | 0.64 | 0.65
Pretty Sure | - | - | 0.59 | 0.68 | 0.69
Very Likely | 0.75 | 0.92 | - | 0.7 | 0.71
Highly Probable | - | - | - | 0.71 | 0.69
Almost Certain | - | - | - | 0.75 | 0.74
No Doubt | - | - | - | 0.81 | 0.75
Possible | 0.01 | 0.55 | - | - | -
Very Low Chance | 0.05 | 0.15 | - | - | -
Rare | 0.05 | 0.14 | - | - | -
Low Chance | 0.1 | 0.2 | - | - | -
Medium Chance | 0.4 | 0.6 | - | - | -
Even Chance | 0.45 | 0.55 | - | - | -
Frequent | 0.56 | 0.81 | - | - | -
Very Possible | 0.7 | 0.01 | - | - | -
Usually | 0.72 | 0.77 | - | - | -
Very Probably | 0.75 | 0.9 | - | - | -
High Chance | 0.8 | 0.92 | - | - | -
Very High Chance | 0.85 | 0.99 | - | - | -
Very Improbable | 0.01 | 0.15 | - | - | -
Almost Certain | 0.87 | 0.99 | - | - | -
Figure 1. Signal and noise distributions in SDT analyses
Figure 2. An example of a receiver operating characteristic (ROC) curve.
C (hollow) = conservative threshold, L (black) = liberal threshold
Figure 3. Literature search process
Figure 4. Forest plot with a summary effect size
Figure 5. Cumulative meta-analysis. Studies were first sorted by the Brier scores
in an ascending order, and a cumulative meta-analysis was conducted.
Figure 6. An example of a binary choice displayed to respondents
Figure 7. Logical sequence of the quantification methodology. Choosing the numeric gamble in
Lottery 1 (go down) leads to a lower probability value in the numeric gamble in Lottery 2. In
contrast, choosing the verbal gamble in Lottery 1 (go up) leads to a higher probability value in the
numeric gamble in Lottery 2. If respondents do not indicate ‘indifference’ in the third lottery, we
used the probabilities in the last column (Final) to set the numeric values for the verbal expressions.
Figure 8. RED vs. BLUE conditions in the ambiguity study
Figure 9. Flow of respondents in the main and the follow-up studies
Figure 10. Covariance graph in the MIN sample
Figure 11. An example of how the Brier score was described to experts in the
Incentivized/Number condition
Figure 12. An example of how the Brier score was described to experts in the
Incentivized/Verbal condition
Footnote: The color gradients were slightly different from the numeric table due to a technical issue.
Figure 13. Distributions of numeric values of verbal probability expressions
APPENDIX A
CODING MANUAL
1. How to code effect sizes when a study reports results from multiple conditions or
subgroups?
• The ideal approach is to code the "control" group. The definition of "control" can vary from study to study. At an abstract level, we define "control" as the condition that does not have any experimental treatment that aims to alter the quality of probability judgments. "Control" can also be thought of as the condition under which a typical experiment of this kind is conducted.
• There may be instances in which a primary study's authors define "control" in an idiosyncratic manner. These will be discussed case by case.
• When an experiment does not have a control group, or when you are not sure how to define the control group, tag the study at the end (in the "Note" page) so we can revisit it later.
• When studies report results from multiple subgroups, such as males vs. females, older vs. younger people, or Asians vs. Westerners, we combine the results across subgroups. The template "combining effects across subgroups" in Drive is designed to help you combine the effect sizes in these instances; a computational sketch is also given at the end of this question's notes.
• In general, “subgroups” are defined as different levels of a factor that does not
involve any sorts of “experimental treatments”. Thus, demographic variables such
as sex, age, and race are considered subgroups.
• There is an EXCEPTION to the subgroup rule, when studies report different
results for experts vs. lay people, code the effect sizes separately for each
subgroup. This is because we want to test the effect of expertise on judgments.
• In studies that compare the effects of having people make multiple probability
judgments versus single judgments for a particular outcome, we should code the effect sizes separately for each condition. This is because we want to compare the effect of making multiple judgments on the effect sizes.
• There may be a concern about the dependency in this decision rule
because two effect sizes (for two conditions) are coded for the
same study. Yet, the “treatment” here, the choice of having
assessors make a single or multiple judgments on the same
outcome, is not really a manipulation that is unique to such
experiment. Rather it is a design question that researchers should
think of when designing probability judgment studies. In other
words, such experiment (if randomization is used) can be
replicated in two different independent experiments in which the
first has subjects make single judgments whereas the second has
subjects make multiple judgments.
• In contrast, studies that experimentally manipulate different levels
of a factor are dependent and cannot be decomposed into two
independent studies. This is because under a usual circumstance of
an experiment of this kind, researchers do not manipulate a
specific level of the factor.
• Likewise, when studies report different results for different tasks (meta-
cognition, prediction, or general knowledge questions), we will code the effect
sizes separately.
• In pre-post design experiments, we will code the pre-condition effect sizes.
• In within-subjects designs, or when there is dependency between groups (see pq_record20_study4 for an example), we can obtain the effect size by simply averaging the effect sizes across conditions. However, computing the pooled variance requires knowing the correlation between groups and the variance in each group (see BHHR, page 227). A minimal sketch of the independent-subgroup case is given below.
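The following is a minimal sketch of the subgroup-combining rule described above, assuming each subgroup reports a mean effect size (e.g., a mean Brier score), a standard deviation, and a sample size. The function name and the example numbers are purely illustrative and do not come from the coded studies, and the correlated within-design case (which also needs the between-group correlation) is not handled here.

```python
import math

def combine_subgroups(groups):
    """Combine independent subgroup statistics into a single record.

    `groups` is a list of (mean, sd, n) tuples, one tuple per subgroup
    (e.g., males and females). The combined mean is the n-weighted average;
    the combined SD pools the within-subgroup sums of squares plus the
    spread of the subgroup means around the combined mean.
    """
    n_total = sum(n for _, _, n in groups)
    grand_mean = sum(m * n for m, _, n in groups) / n_total
    # Total sum of squares = within-subgroup SS + between-subgroup SS
    ss = sum((n - 1) * sd ** 2 + n * (m - grand_mean) ** 2 for m, sd, n in groups)
    sd_combined = math.sqrt(ss / (n_total - 1))
    return grand_mean, sd_combined, n_total

# Hypothetical example: two subgroups reporting mean Brier scores
print(combine_subgroups([(0.24, 0.05, 60), (0.28, 0.06, 40)]))
```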
2. What to code when a study does not provide the needed information?
• Simply put NA if there is a text box.
• If you are not sure how to code, put down explanations
3. What is the rule to code numeric value?
• Effect sizes, standard deviations, and proportions of females < 1 with a maximum
of 3 decimals
• Sample size: rounded integer
4. How to determine sample size?
• Sample size is defined as the number of subjects with unique responses that are
included for the final analyses.
• If a study provides only the overall N, does not provide n per group, and uses randomization, then take the average n per group as N / (number of conditions).
5. How is “expertise” defined?
• "Experts" are defined as those with substantive knowledge in the domains about which they are questioned. Examples include nuclear engineers making probability judgments in nuclear risk assessment, seasoned sport fans making predictions, medical students taking a medical exam, students taking an exam (we assume they have studied for the exam), and weather forecasters making forecasts.
• "Lay people" are defined as people without substantive knowledge of the questions that are asked. Examples include subjects responding to general knowledge questions and college students in typical feeling-of-knowing tasks (e.g., is this word in the list you saw earlier?).
6. What is the difference between single-judgment versus multiple-judgment studies?
• An example illustrates this difference. In a single-judgment study, subjects are
asked to make a single judgment for an event of interest. For example, providing a
probability judgment (p-judgment) for a future event (e.g., rain or no rain), or providing a p-judgment for a possible outcome of an event of interest (e.g., choosing an alternative, True or False, and providing a p-judgment for the selected option).
• In multiple-judgment study, subjects provide multiple judgments corresponding to
multiple possible outcomes. For example, subjects choose the most likely answer
for a question and provide p-judgments for some or all of the answers including
the answers that were not selected.
7. How to count the number of judgments?
• This is the total number of questions that subjects answer. If a study clearly
indicates how many judgments are used to compute the effect sizes, use this number instead.
8. Why do we tag studies with "intervention", "feedback", and "incentivization"?
• We can revisit these studies and do second or third meta analyses
• Here are the definitions of these terms. "Intervention": any treatment that aims to alter the Brier score or any of the effect sizes of interest; this excludes studies that use feedback or incentivization. "Feedback" means subjects are told how well their judgments performed before proceeding to make further judgments. "Incentivization" means it is explained to subjects that their performance will be scored and evaluated with the Brier score.
9. I don’t understand the “base rate” in the “Difficulty” block in the coding form.
• In non-prediction studies such as in general knowledge question tasks, base rate is
simply 1/k where k is the number of possible answers (provided to subjects). In
other tasks, it is the proportion of successful events where “success” is defined by
the primary studies’ authors.
10. How do we compute standard deviations?
• See Lipsey and Wilson's appendix for explanations; a minimal sketch of two common conversions is given below.
• There is a spreadsheet named “compute standard deviation” in Drive for this
purpose
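As an illustration of the kind of conversion involved, the sketch below shows two standard textbook ways of recovering a standard deviation from reported statistics (a standard error, or a 95% confidence interval around a mean). These are generic formulas, not a description of the actual spreadsheet; the function names and example numbers are illustrative only.

```python
import math

def sd_from_se(se, n):
    """Recover a standard deviation from a reported standard error."""
    return se * math.sqrt(n)

def sd_from_ci(lower, upper, n, z=1.96):
    """Recover a standard deviation from a reported 95% confidence interval
    around a mean, assuming the interval is mean +/- z * SE."""
    return math.sqrt(n) * (upper - lower) / (2 * z)

print(sd_from_se(0.01, 100))        # 0.10
print(sd_from_ci(0.22, 0.26, 100))  # about 0.10
```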
APPENDIX B
MATERIALS IN STUDY 1
“Expert” Screening surveys
Instruction: In the following questions, we are interested in learning how much you know about
sports. Your responses are anonymous, so please answer as truthfully as possible. We are ONLY
interested in what you know and your behaviors. There is no need to do outside research to
answer the questions.
NFL [in parentheses are choices]
o Which of the following two teams played in last year's 50th Super Bowl?
[Denver Broncos & Seattle Seahawks,
Denver Broncos & Carolina Panthers,
New England Patriots & Seattle Seahawks,
Dallas Cowboys & Pittsburgh Steelers]
o How many teams are there in the NFL?
[24, 28, 30, 32, 34]
o What are the two conferences in the NFL?
[American Football Conference & National Football Conference,
American Football Conference & North American Football Conference,
Pacific Coast Conference & Eastern Conference,
South-West Conference & North-East conference]
Prediction Questions
Number condition
In the following pages, you will be asked to make a number of predictions about the 2016 NFL
season. We are interested in your best estimate of the chances from 0% to 100% that each event
will happen. If you are sure that the event will happen, indicate this by using the 100% option. If
you are sure that the event will not happen, use the 0% option. If you feel the event is as likely to happen as not to happen, use the 50% option. In all other cases, use a number between 0% and 100% to indicate your estimate of the likelihood that the event will happen.
Word condition
In the following pages, you will be asked to make a number of predictions about the 2016 NFL
season. We are interested in your best estimate of the chance that each event will happen. You will provide this estimate by selecting the most appropriate phrase on a provided scale. If you are sure that the event will happen, indicate this by selecting CERTAIN. If you are sure that the event will not happen, select IMPOSSIBLE. If you feel the event is as likely to happen as not to happen, select TOSS UP. In all other cases, select a phrase between IMPOSSIBLE and TOSS UP, or between TOSS UP and CERTAIN, to indicate your estimate of the likelihood that the event will happen.
• The AFC-North division includes the following teams: Baltimore Ravens, Cincinnati
Bengals, Cleveland Browns, Pittsburgh Steelers
What is the chance that the Cincinnati Bengals wins this division in the 2016 regular
season?
Out of the 16 games that the Cincinnati Bengals plays during the 2016 regular season,
what is the chance that they will win 12 games?
• The AFC-South division includes the following teams: Texans, Indianapolis Colts,
Jacksonville Jaguars, Tennessee Titans
What is the chance that the Texans wins this division in the 2016 regular season?
Out of the 16 games that the Texans plays during the 2016 regular season, what is the
chance that they will win 9 games?
• The AFC-East division includes the following teams: Buffalo Bills, Miami Dolphins,
New England Patriots, New York Jets
What is the chance that the New England Patriots wins this division in the 2016 regular season?
Out of the 16 games that the New England Patriots plays during the 2016 regular
season, what is the chance that they will win 12 games?
• The AFC-West division includes the following teams: Denver Broncos, Kansas City
Chiefs, Oakland Raiders, San Diego Chargers
What is the chance that the Denver Broncos wins this division in the 2016 regular season?
Out of the 16 games that the Denver Broncos plays during the 2016 regular season, what
is the chance that they will win 12 games?
• The NFC-North division includes the following teams: Bears, Detroit Lions, Green Bay
Packers, Minnesota Vikings
What is the chance that the Minnesota Vikings wins this division in the 2016 regular
season?
Out of the 16 games that the Minnesota Vikings plays during the 2016 regular season,
what is the chance that they will win 11 games?
• The NFC-South division includes the following teams: Atlanta Falcons, Carolina
Panthers, New Orleans Saints, Tampa Bay Buccaneers
What is the chance that the Carolina Panthers wins this division in the 2016 regular season?
Out of the 16 games that the Carolina Panthers plays during the 2016 regular season,
what is the chance that they will win 15 games?
• The NFC-East division includes the following teams: Dallas Cowboys, New York Giants,
Philadelphia Eagles, Washington Redskins
Out of the 16 games that the Washington Redskins plays during the 2016 regular
season, what is the chance that they will win 9 games?
• The NFC-West division includes the following teams: Arizona Cardinals, St. Louis
Rams, San Francisco 49ers, Seattle Seahawks
Out of the 16 games that the Arizona Cardinals plays during the 2016 regular season,
what is the chance that they will win 13 games?
• What is the chance that the New England Patriots is the winner of the AFC conference?
• What is the chance that the Denver Broncos is the winner of the AFC conference?
• What is the chance that the Cincinnati Bengals is the winner of the AFC conference?
• What is the chance that the Houston Texans is the winner of the AFC conference?
• What is the chance that the Kansas City is the winner of the AFC conference?
• What is the chance that the Pittsburgh Steelers is the winner of the AFC conference?
• What is the chance that the Carolina Panthers is the winner of the NFC conference?
• What is the chance that the Arizona Cardinals is the winner of the NFC conference?
• What is the chance that the Minnesota Vikings is the winner of the NFC conference?
• What is the chance that the Washington Redskins is the winner of the NFC conference?
• What is the chance that the Green Bay Packers is the winner of the NFC conference?
• What is the chance that the Seattle Seahawks is the winner of the NFC conference?
• What is the chance that the New England Patriots is the winner of the AFC conference?
• What is the chance that the Denver Broncos is the winner of the AFC conference?
• What is the chance that the Kansas City is the winner of the AFC conference?
• What is the chance that the Pittsburgh Steelers is the winner of the AFC conference?
• What is the chance that the Carolina Panthers is the winner of the NFC conference?
• What is the chance that the Arizona Cardinals is the winner of the NFC conference?
• What is the chance that the Green Bay Packers is the winner of the NFC conference?
• What is the chance that the Seattle Seahawks is the winner of the NFC conference?
• What is the chance that the losing team trails more than 14 points behind the winning team during the 51st Super Bowl?
• What is the chance that the losing team trails more than 7 points behind the winning team during the 51st Super Bowl?
• During week 9 of the 2016 regular NFL season, the Indianapolis Colts will play an away
game against the Green Bay Packers. What is the chance that the Green Bay Packers will
win?
• During week 7 of the 2016 regular NFL season, the New York Giants will play against
the Los Angeles Rams in London. What is the chance that the Rams will win?
• During week 13 of the 2016 regular NFL season, the Dallas Cowboys will play an away
game against the Minnesota Vikings. What is the chance that the Vikings will win?
• During week 13 of the 2016 regular NFL season, the Carolina Panthers will play an away
game against the Denver Broncos. What is the chance that the Broncos will win?
• During week 14 of the 2016 regular NFL season, the Oakland Raiders will play an away
game against the Kansas City Chiefs. What is the chance that the Chiefs will win?
• During week 6 of the 2016 regular NFL season, the Dallas Cowboys will play an away
game against the Green Bay Packers. What is the chance that the Packers will win?
• During week 16 of the 2016 regular NFL season, the Arizona Cardinals will play an away
game against the Seattle Seahawks. What is the chance that the Seahawks will win?
• During week 8 of the 2016 regular NFL season, the Arizona Cardinals will play an away
game against the Carolina Panthers. What is the chance that the Panthers will win?
• During week 10 of the 2016 regular NFL season, the Dallas Cowboys will play an away
game against the Pittsburgh Steelers. What is the chance that the Steelers will win?
• During week 7 of the 2016 regular NFL season, the New England Patriots will play an
away game against the Pittsburgh Steelers. What is the chance that the Steelers will win?
• During week 13 of the 2016 regular NFL season, the Carolina Panthers will play an away
game against the Seattle Seahawks. What is the chance that the Seahawks will win?
• During week 15 of the 2016 regular NFL season, the Pittsburgh Steelers will play an
away game against the Cincinnati Bengals. What is the chance that the Bengals will win?
• During week 4 of the 2016 regular NFL season, the Cleveland Browns will play an away
game against the Washington Redskins. What is the chance that the Redskins will win?
• During week 1 of the 2016 regular NFL season, the New England Patriots will play an
away game against the Arizona Cardinals. What is the chance that the Cardinals will
win?
• During week 7 of the 2016 regular NFL season, the Houston Texans will play an away
game against the Denver Broncos. What is the chance that the Broncos will win?
• During week 10 of the 2016 regular NFL season, the Seattle Seahawks will play an away
game against the New England Patriots. What is the chance that the Patriots will win?
APPENDIX C
MATERIALS IN STUDY 2
Practice Questions in the Incentivized/Number condition
Let's answer some questions so we know that you understand how your performance is
evaluated.
Question 1
When you assess a 50% chance of rain and it actually rains, your points would be ___
according to the payoff table.
Options
• 0
• 25
• -75
• 9
If option “0” is not selected, the following message is shown:
When you make a 50% judgment, your point would always be 0 regardless of the status
of the event.
Question 2
When you assess a 100% chance of rain and it actually does not rain, your points would
be ___ according to the payoff table:
Options
• 0
• 25
• -75
• 9
If option “-75” is not selected, the following message is shown:
Your answer is wrong. When you assess a 100% chance of rain and it actually does not
rain, your points would be -75.
Question 3
When you assess a 0% chance of rain and it actually rains, your points would be ___
according to the payoff table
Options
• 0
• 25
• -75
• 9
If option “-75” is not selected, the following message is shown:
Your answer is wrong. When you assess a 0% chance of rain and it actually rains, your
points would be -75
Question 4
When you assess a 55% chance of rain and it actually does not rain, your points would be
between___ according to the payoff table.
Options
• 0 and -11
• 0 and 11
• -39 and -56
If option “0 and -11” is not selected, the following message is shown:
Your answer is wrong. When you assess a 55% chance of rain and it actually does not
rain, your points would be between 0 and -11.
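For reference, the point values used in these practice questions (0 for any fifty-fifty judgment, +25 for a correct certain forecast, -75 for a wrong certain forecast, and a small loss for a wrong 55% forecast) are consistent with a payoff that is a linear transformation of the quadratic (Brier-type) score. The sketch below reproduces those example values; it is an inference from the practice questions, not a reproduction of the actual payoff table shown to respondents.

```python
def payoff(p, occurred):
    """Quadratic-score payoff consistent with the practice-question examples:
    25 points for a correct certain forecast, -75 for a wrong certain
    forecast, and 0 for any fifty-fifty judgment."""
    outcome = 1.0 if occurred else 0.0
    return 25 - 100 * (p - outcome) ** 2

print(payoff(0.50, True))    # 0.0
print(payoff(1.00, False))   # -75.0
print(payoff(0.00, True))    # -75.0
print(payoff(0.55, False))   # about -5.25 (between 0 and -11)
print(payoff(0.60, True))    # about 9
```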
Practice Questions in the Incentivized/Verbal condition
Let's answer some questions so we know that you understand how your performance is
evaluated.
Question 1
When you indicate a fifty-fifty chance of rain and it actually rains, your points would be
___ according to the payoff table.
Options
• 0
• 25
• -75
• 9
If option “0” is not selected, the following message is shown:
When you make a “Fifty-Fifty” judgment, your point would always be 0 regardless of the
status of the event.
Question 2
When you indicate that it is “Absolutely Certain” to rain and it actually does not rain,
your points would be ___ according to the payoff table:
Options
• 0
• 25
• -75
• 9
If option -75 is not selected, the following message is shown:
Your answer is wrong. When you indicate that it is “Absolutely Certain” to rain and it
actually does not rain, your points would be -75.
Question 3
When you indicate that it is “Absolutely Impossible” to rain and it actually rains, your
points would be ___ according to the payoff table
Options
• 0
• 25
• -75
• 9
If option “-75” is not selected, the following message is shown:
Your answer is wrong. When you indicate that it is “Absolutely Impossible” to rain and it
actually rains, your points would be -75
Question 4
When you indicate that it is “Very Likely” to rain and it actually does not rain, your
points would be___ according to the payoff table.
Options
• Positive
• Negative
• Zero
If option “Negative” is not selected, the following message is shown:
Your answer is wrong. When you indicate that it is “Very Likely” to rain and it actually
does not rain, your points would be negative.
Examples Shown to Respondents in the Gamble condition
NBA game: Kawhi Leonard will win the NBA MVP this year.
If the event occurs, you get $100.
If the event does NOT occur, you get $0
Ball game: A ball will be randomly drawn from a container that has 50 red balls
and 50 blue balls.
If a red ball is drawn, you get $100.
If a blue ball is drawn, you get $0.
Which game would you choose to play?
Options
A. Play the NBA game
B. Play the Ball game
C. Indifference
If option B is chosen, the following pair of games is shown:
NBA game: Kawhi Leonard will win the NBA MVP this year.
If the event occurs, you get $100.
If the event does NOT occur, you get $0
Ball game: A ball will be randomly drawn from a container that has 10 red balls
and 90 blue balls. <-NOTE THAT THE NUMBER OF RED AND BLUE BALLS
HAS BEEN CHANGED
If a red ball is drawn, you get $100.
If a blue ball is drawn, you get $0.
Which game would you choose to play?
Options
A. Play the NBA game
B. Play the Ball game
C. Indifference
If option A is chosen, the following pair of games is shown:
NBA game: Kawhi Leonard will win the NBA MVP this year.
If the event occurs, you get $100.
If the event does NOT occur, you get $0
Ball game: A ball will be randomly drawn from a container that has 90 red balls
and 10 blue balls. <-NOTE THAT THE NUMBER OF RED AND BLUE BALLS
HAS BEEN CHANGED
If a red ball is drawn, you get $100.
If a blue ball is drawn, you get $0.
Which game would you choose to play?
Options
A. Play the NBA game
B. Play the Ball game
C. Indifference
If option “Indifference” is chosen, the following message is shown: “You chose
‘Indifference’, this means that you equally prefer to play either one of the two provided games”.
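The branching above works like a staircase search: each choice moves the ball-game (urn) probability toward the respondent's point of indifference, which is then taken as the numeric probability for the event. The sketch below illustrates that logic with a simple bisection; the actual sequence of urn compositions used in the study (e.g., 50 red balls, then 10 or 90) and its stopping rule may differ, and the respondent-choice function and step values here are purely illustrative.

```python
def elicit_by_gamble(prefers_event_game, low=0.0, high=1.0, tol=0.05):
    """Bisection sketch of the gamble elicitation.

    `prefers_event_game(p)` should return True if the respondent prefers
    betting on the event over an urn with win probability `p`, False if
    they prefer the urn, and None to indicate indifference.
    """
    while high - low > tol:
        p = (low + high) / 2
        choice = prefers_event_game(p)
        if choice is None:   # indifference: p is the elicited probability
            return p
        if choice:           # event preferred -> event judged more likely than p
            low = p
        else:                # urn preferred -> event judged less likely than p
            high = p
    return (low + high) / 2

# Hypothetical respondent whose underlying judgment is about 0.70
respondent = lambda p: None if abs(p - 0.70) < 0.01 else (p < 0.70)
print(elicit_by_gamble(respondent))
```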
Prediction Questions
• What is the chance that the Eastern Conference finals will require game 7th to determine
the champions?
• What is the chance that at least one game in the Western Conference finals will require an
overtime play?
• What is the chance that Stephen Curry will score 12 points or more from three-point
shots per game throughout the playoffs?
• During the Eastern Conference finals, what is the chance that the team with home-court
advantage will win at least two home games?
• What is the chance that Russell Westbrook will win the NBA MVP this year?
• What is the chance that each of the seven games in the Western Conference finals will
finish with both teams scoring more than 90 points?
• What is the chance that each of the seven games in the NBA finals will finish with both
teams scoring more than 90 points?
• What is the chance that each of the seven games in the NBA finals will finish with both
teams scoring fewer than 90 points?
• What is the chance that at least one game during the NBA finals will require an overtime
play?
• What is the chance that Russell Westbrook will average a triple-double through the
playoffs?
• What is the chance that the top two seeds in the Western conference will advance to the
conference finals?
• What is the chance that a team from the Western conference will win the NBA 2017
Championship?
• What is the chance that the Western Conference finals will require game 7th to determine
the champions?
• What is the chance that Stephen Curry will win the NBA MVP this year?
• What is the chance that the NBA finals will require game 7th to determine the
champions?
• What is the chance that LeBron James will make fewer than two offensive rebounds per
game throughout the playoffs?
• What is the chance that a player other than Stephen Curry will win the NBA MVP this
year?
• During the Western Conference finals, what is the chance that the team with home-court
advantage will win at least two home games?
• What is the chance that the NBA finals will finish with a result other than a 4-0 record?
• What is the chance that the Western Conference finals will finish with a 4-0 record?
• In the NBA finals, what is the chance that the team with home-court advantage will win
at least one home game?
• What is the chance that the Eastern Conference finals will finish with a result other than a
4-0 record?
• What is the chance that the NBA finals will finish with a 4-0 record?
• What is the chance that James Harden will win the NBA MVP this year?
• In the NBA finals, what is the chance that the team with home-court advantage will score
at least 30 points in the 1st quarter in the first away- game?
• What is the chance that each of the seven games in the Eastern Conference finals will
finish with both teams scoring more than 90 points?
• What is the chance that James Harden will score 10 points or more per game throughout
the playoffs?
• What is the chance that Stephen Curry will makes fewer than two three-point attempts
per game throughout the playoffs?
• What is the chance that the Western Conference finals will require game 7th to determine
the champions?
• What is the chance that Lebron James will make at least two successful dunks per game
throughout the playoffs?
• What is the chance that a player or a coach will get ejected in the NBA finals?
• What is the chance that a player other than James Harden will win the NBA MVP this
year?
• What is the chance that John Wall will have an Assist to Turnover ratio of at least 2.5 per
game throughout the playoffs?
• What is the chance that LeBron James will make two offensive rebounds or more per
game throughout the playoffs?
• What is the chance that the top two seeds in the Eastern conference will advance to the
conference finals?
• What is the chance that the Western Conference finals will NOT require game 7th to
determine the champions?
• What is the chance that Stephen Curry will make two three-point attempts or more per
game throughout the playoffs?
• In the Western Conference finals, what is the chance that the team with home-court
advantage will win fewer than two home games?
• What is the chance that a player or a coach will get ejected during the Western
Conference finals?
• What is the chance that each of the seven games in the NBA finals will finish with both
teams scoring more than 90 points?
• If the Golden State Warriors play against the Portland Trail Blazers in the Western
Conference quarterfinals, what is the chance that the Warriors will advance to the semi-
finals?
• If the San Antonio Spurs play against Memphis Grizzlies in the Western Conference
quarterfinals, what is the chance that the Spurs will advance to the semi-finals?
• If the Utah Jazz plays against the Los Angeles Clippers in the Western Conference
quarterfinals, what is the chance that the Jazz will advance to the semi-finals?
• If the Houston Rockets play against the Oklahoma City Thunder in the Western
Conference quarterfinals, what is the chance that the Rockets will advance to the semi-
finals?
• If the Cleveland Cavaliers play against the Miami Heat in the Eastern Conference
quarterfinals, what is the chance that the Cavaliers will advance to the semi-finals?
• If the Washington Wizards play against the Milwaukee Bucks in the Eastern Conference
quarterfinals, what is the chance that the Wizards will advance to the semi-finals?
• If the Washington Wizards play against the Atlanta Hawks in the Eastern Conference
quarterfinals, what is the chance that the Wizards will advance to the semi-finals?
• If the Toronto Raptors play against the Atlanta Hawks in the Eastern Conference
quarterfinals, what is the chance that the Raptor will advance to the semi-finals?
• If the Toronto Raptors play against the Milwaukee Bucks in the Eastern Conference
quarterfinals, what is the chance that the Raptor will advance to the semi-finals?
• If the Boston Celtics play against the Chicago Bulls in the Eastern Conference
quarterfinals, what is the chance that the Celtics will advance to the semi-finals?
• What is the chance that the Golden State Warriors will win the NBA title this year?
• What is the chance that the Cleveland Cavaliers will win the NBA title this year?
• What is the chance that the NBA finals will go to game 17?
• What is the chance that the NBA finals will have two teams from the same conference?
• What is the chance that the Cleveland Cavaliers will play against the Boston Celtics in
the NBA finals?