PROBABILITY ASSESSMENT: CONTINUOUS
QUANTITIES AND PROBABILITY DECOMPOSITION
by
Patrick J. Doyle
________________________________________________________________
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(PSYCHOLOGY)
May 2011
Copyright 2011 Patrick J. Doyle
DEDICATION
This dissertation is dedicated to my parents, Thomas and Joanna Doyle.
Thank you for all of your love, encouragement, and support of my education
throughout the years. Most of all, thank you for being such wonderful parents
in everything that you have done and continue to exemplify.
Without you, none of this would have ever been possible. To make you both proud
means the world to me. I would also like to dedicate this dissertation to the rest of
my family, especially my brother Tom, who instilled in me the value of higher
education and learning from a very young age. I would like to express gratitude to
my committee members, Rand Wilcox, Steve Read, David Walsh, and Jim Moore
for their support through this process. I would particularly like to thank Rand
Wilcox for furthering my knowledge of statistics through robust evaluation, which
would not have been achievable if not for his tutelage. Consummate appreciation
goes to Richard John, my committee chair, for his unconditional support,
advisement, mentoring and friendship throughout the years. Thank you for
supporting me through the trials and tribulations of life in pursuit of my degree.
Special tribute goes to Andrew Barclay, my undergraduate mentor from Michigan
State University, who through the power of perspective taught me to perceive life
differently. To believe there “are no rocks, but only water” has changed my life
significantly and still defines the person that I am today. Finally, I would like to
sincerely extend appreciation to my friend Melanie James, who has provided me
with endless support, encouragement, and motivation to succeed in this endeavor and
beyond.
TABLE OF CONTENTS
Dedication
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Method
Chapter 3: Design and Analysis
Chapter 4: Results
Chapter 5: Discussion
References
Appendices
Appendix A: Information/Facts Sheet for Non-Medical Research
Appendix B: Sample Survey Questions
LIST OF TABLES
Table 1. Demographic Characteristics of Sample
Table 2. Sports Knowledge and Wagering Characteristics
Table 3. Summary of Match-ups by Week and Original Responses Collected for NFL and NCAA Season Games
Table 4. Point Spreads and Outcomes for Assessed Games (With Point Spread in Parentheses)
Table 5. Results for the Accuracy of Independent Events (With Standard Deviations in Parentheses)
Table 6. Results for the Robust Dependent Groups Analysis of Brier Scores
Table 7. Results of Consistency Analysis
Table 8. Results for Accuracy for Probability Decompositions
LIST OF FIGURES
Figure 1. Cumulative Distribution Function for Total Points Scored in NBA Game
Figure 2. Example Calibration Graphs (Ronis & Yates, 1987)
Figure 3. Probability Plot of the Sequential 2-Interval Method
Figure 4. Probability Plot of the Simultaneous 4-Interval Forced-Consistency Method
Figure 5. Consistency Analysis
Figure 6. Kernel Density Estimates of the Distribution of Scores for each Variable
Figure 7. Calibration Graphs for Prediction that Favorite Wins and Favorite Wins by the Point Spread / 2
Figure 8. Calibration Graphs for Prediction that Favorite Wins by Less than the Point Spread and Favorite Wins by Point Spread
Figure 9. Calibration Graphs for Prediction that Favorite Wins by Point Spread + 7 and that the Dog Wins by the Point Spread / 2
Figure 10. Comparison of Continuous Distributions: Sequential 2-Interval versus Simultaneous 4-Interval Forced-Consistency Method
Figure 11. The Distribution of Signed Differences for the Decomposition that the P (favorite wins the game) Decomposed into the P (favorite wins by point spread) + the P (favorite wins by less than the point spread)
Figure 12. The Distribution of Signed Differences for the Decomposition that the P (favorite wins the game) Decomposed into the P (favorite wins by 7 points or less) + the P (favorite wins by 8 points or more)
Figure 13. The Distribution of Signed Differences for the Decomposition that the P (favorite wins by the point spread) Decomposed into the P (favorite wins the game) - the P (favorite wins by less than the point spread)
ABSTRACT
There has been a great deal of research conducted in the areas of probability
assessment and calibration. However, little research has been conducted using
continuous random variables for which participants construct subjective
probability distributions (SPD’s), and there has been a dearth of research exploring
probability decomposition. The current research had three main goals: 1) to examine accuracy
for various predictions of NFL and NCAA football game outcomes, 2) to compare
different methods of elicitation for SPD’s, and 3) to examine probability
decomposition using stimuli from a real-world domain. It was hypothesized that
participants would demonstrate the most accuracy in prediction for questions
regarding game winners and for questions about winners against the point spread,
since such events are the most common for those who regularly follow sports. It was
also hypothesized that participants would be more accurate for SPD’s based on a
simultaneous 4-interval forced-consistency method versus a sequential 2-interval
method, since all possible outcomes are presented at once using the 4-interval
method. Participants were recruited online through various sources and responded to
online surveys regarding games in the 2009 NFL and NCAA football season.
Results demonstrated that participants were not very accurate in assessing game
winners or in picking winners against the point spread. Also, participants were more
accurate constructing SPD’s using the sequential method. Additionally, probability
decomposition did not lead to more accurate assessments. These results are most
likely due to the difficulties in conducting probability assessment via online survey.
Implications, potential limitations, and recommendations for future research are
discussed.
CHAPTER 1
INTRODUCTION
Individuals regularly make informal judgments about uncertainty (e.g. “It is
likely that the Democrats will win the next Presidential election” or “It is likely that
the U.S. economy will improve within the next year”). In decision analysis,
subjective assessments of uncertainty can be made more formally than general
statements of likelihood and expressed in terms of numerical probability. The notion
that these assessments can be made in probabilistic terms is fundamental to
decision analysis (Clemen, 1991). When decision problems include uncertainty,
obtaining subjective probabilities is necessary in order to assess the likelihood of
unknown outcomes. Probability is usually expressed in terms of long-term
frequency (e.g. the probability of rolling a certain value if a die is thrown several
times or the number of times the flip of a coin will turn up “heads” or “tails” if
tossed many times). However, when it comes to many events, assessing long-term
frequency is not possible because the events will only occur once (such as the unique
outcome of a particular sporting event or movement of a particular stock in the stock
market). Thus, for unique events obtaining the probability in terms of long-term
frequency is not possible. Subjective probability is the “degree of belief” that an
individual holds that a certain event will occur, and many decision problems require
individuals to convert these beliefs into numbers (Clemen, 1991). It is this
transformation from the personal belief of likelihood that a particular event will
occur to actual numbers that is the process of probability assessment and elicitation.
Probability assessment can be made with both discrete variables and
continuous random variables. A discrete variable is one that can only take on certain
values (e.g. rain versus no rain, win versus no win) and a continuous variable is a
variable that can take on any value between two specified values (Clemen, 1991).
The majority of research conducted in the area of subjective probability assessment
has been done using discrete variables (e.g. general-knowledge questions with two-
choice outcomes) and there has been a dearth of research assessing continuous
random variables (Hora & Hora, 1992; Hora, 2004). Part of the reason why this is
the case is that it is a more straightforward task to assess discrete probability because
this requires few judgments (and only one judgment if given dichotomous
assessments). Conversely, in order to assess continuous probability, several
judgments are usually made to obtain probabilities related to points along the
distribution. Thus, respondents usually must provide many more assessments to
construct continuous distributions, and they must also think more carefully
about their respective probabilities given certain points on a continuous probability
distribution.
Some common ways of assessing discrete probabilities include directly
asking subjects to provide a probability that a specific event will or will not occur,
asking subjects about various gambles that they would be willing to make (with the
goal of finding the point where the decision maker is indifferent between two bets),
or asking participants to choose between different lotteries, each of which can result
in a specific prize (Clemen, 1991). When it comes to the assessment of continuous
probabilities, a common approach is to assess several cumulative probabilities (the
probability above or below a specific point on the distribution) and then use these
values to plot (and estimate) a cumulative distribution function (CDF) across a range of
values (Clemen, 1991; Hora & Hora, 1992). Uncertain continuous quantities can
take on any value within some range, and the goal is to obtain several interval
probabilities that a specific event will occur. For instance, consider the following
hypothetical example for obtaining probability estimates of the total points scored by
both teams in an NBA basketball game. In this example, it may be believed that the
total points scored is somewhere between 110 and 220, and the uncertainties about
the specific points could be expressed by the following probability statements:
P (Total Points ≤ 110) = 0.00
P (Total Points ≤ 128) = 0.05
P (Total Points ≤ 138) = 0.15
P (Total Points ≤ 147) = 0.30
P (Total Points ≤ 153) = 0.40
P (Total Points ≤ 175) = 0.78
P (Total Points ≤ 192) = 0.91
P (Total Points ≤ 230) = 1.00
These probabilities can be displayed graphically and are represented in Figure 1.
This graph represents a CDF for total points scored in an NBA game by both teams,
and allows us to calculate the probability for any interval. It is important to note that
other judgments could be made and the CDF would be even smoother if
additional assessments are obtained, but here the above judgments are adequate for
the purposes of illustration. The CDF should always be sloping upward or
monotonically increasing because as more extreme values are assessed, there is a
greater probability that the interval below contains the actual value. Ultimately,
obtaining a subjective probability distribution (SPD) can only be done approximately
in this manner by obtaining several intervals and inferring or fitting the distribution
through several points (Hora, Hora, & Dodd, 1992).
Figure 1. Cumulative Distribution Function for Total Points Scored in NBA Game
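The construction above can be sketched in a few lines of Python, using the assessed points from the NBA example; the piecewise-linear interpolation between assessed points is an assumption made here for illustration (any smooth monotone fit through the points would serve):

```python
# Assessed cumulative probabilities P(Total Points <= x) from the
# hypothetical NBA example above.
points    = [110, 128, 138, 147, 153, 175, 192, 230]
cum_probs = [0.00, 0.05, 0.15, 0.30, 0.40, 0.78, 0.91, 1.00]

def cdf(x):
    """Piecewise-linear CDF fitted through the assessed points."""
    if x <= points[0]:
        return 0.0
    if x >= points[-1]:
        return 1.0
    for i in range(len(points) - 1):
        x0, x1 = points[i], points[i + 1]
        if x0 <= x <= x1:
            p0, p1 = cum_probs[i], cum_probs[i + 1]
            return p0 + (p1 - p0) * (x - x0) / (x1 - x0)

def interval_prob(lo, hi):
    """Probability that the total lands in the interval (lo, hi]."""
    return cdf(hi) - cdf(lo)

print(interval_prob(138, 153))  # ≈ 0.25 (= 0.40 − 0.15)
```

Once the CDF is fitted, the probability of any interval follows by subtraction, which is exactly what the text means by being able to "calculate the probability for any interval."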
Research on Continuous Distributions
A continuous probability can either be estimated by obtaining several points
and estimating the cumulative distribution by fitting a curve, or by obtaining
probabilities for a few specific points and assuming the nature of the distribution
(e.g. normal distribution where the most extreme values of the distribution are highly
improbable). Different methods of probability elicitation exist however, and Hora et
al. (1992) looked at direct assessment methods as previously described, in
comparison to bisection methods where equally likely subintervals are assessed.
Direct assessment requires participants to provide probabilities for intervals of
uncertain quantities and is the most straightforward method of elicitation. An
example of direct assessment is simply asking a respondent a question such as “what
is the probability of interval A occurring”? Questions of this type would be repeated
over different intervals of the variable in question. This was the method used in the
hypothetical example for assessing the total points scored by both teams in an NBA
game. With bisection methods, participants are asked to provide values such that
equally likely variable values are above or below a certain point on a distribution
(Hora et al., 1992). Hence, using this method, subjects are providing values of the
variable as opposed to providing specific probabilities related to the value in an
interval. Hora et al. (1992) concluded in their study that the differences between the
techniques were minimal and any differences were attributed to the differences in
procedures used for obtaining distribution end points, which is a concern with the
assessment of continuous random variables due to potential effects of anchoring and
adjustment (Hora et al., 1992). Other studies have also found similar results
demonstrating that research participants have difficulty when assessing continuous
distributions and have a tendency to construct intervals that are too narrow, thus
leading to an overconfidence bias (Alpert & Raiffa, 1982; Lichtenstein, Fischhoff, &
Phillips, 1982). Additional research assessing methods for eliciting continuous
probabilities found that the direct elicitation of probabilities performed either as well
or better than bisection methods (Ludke, Straus, & Guftason, 1977; Seaver, von
Winterfeldt, & Edwards, 1978). In general, most of the results in the literature
suggest that direct assessment works just as well as alternative elicitation methods
that are available (von Winterfeldt & Edwards, 1986). Overall, there has been a
limited amount of research assessing continuous variables and there is a need to
further explore methods of elicitation. Hora et al. (1992) have suggested the need for
further studies exploring elicitation techniques.
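The contrast between direct assessment and bisection can be sketched in Python; the numbers are hypothetical, and the simple cumulative pairing below only illustrates what each mode elicits, not the fitting procedures used in the studies cited above:

```python
# Direct assessment: the intervals are fixed and the respondent supplies
# a probability for each (hypothetical values).
direct = {(110, 138): 0.15, (138, 153): 0.25, (153, 230): 0.60}

# Bisection: the cumulative probabilities are fixed and the respondent
# supplies variable values -- first the median, then the value splitting
# each half, yielding the 0.25 / 0.50 / 0.75 fractiles.
bisection = {0.25: 135, 0.50: 150, 0.75: 168}

# Both modes yield points on the same kind of CDF, approached from
# opposite directions: probabilities for values, or values for probabilities.
cum, direct_cdf = 0.0, []
for (lo, hi), p in sorted(direct.items()):
    cum += p
    direct_cdf.append((hi, round(cum, 10)))

bisection_cdf = sorted((v, p) for p, v in bisection.items())
print(direct_cdf)     # [(138, 0.15), (153, 0.4), (230, 1.0)]
print(bisection_cdf)  # [(135, 0.25), (150, 0.5), (168, 0.75)]
```

The symmetry between the two lists is the point: either mode produces CDF points, which is why differences between the techniques largely reduce to how the distribution's end points are obtained.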
Calibration Research
Traditionally, the most common stimulus domain for assessing discrete
probability has been with the use of general-knowledge questions with multiple-
choice outcomes (Lichtenstein et al., 1982). In this paradigm, subjects respond to
general-knowledge questions of varying difficulties, and then provide probability
estimates that they answered questions correctly. Some examples of general-
knowledge items are questions such as “What is the name of the Roman emperor
who fiddled while Rome burned?” or “What is the last name of the astronomer who
published in 1543 his theory that the earth revolves around the sun?” (Nelson &
Narens, 1980). General-knowledge questions of this type have been used in various
calibration studies and it is important to point out that asking questions of this type is
very different than asking questions about unknown events. The most common
response mode has been to have subjects answer questions and then provide
probability ratings of how confident they are that they answered a particular question
correctly. For example, in a two-choice answer format, the subjective probability of
choosing the correct answer will vary between .5 and 1.0 since guessing results in a
50% chance of answering correctly. By far the most common method that has been
used to study probability assessment is that of having subjects answer general-
knowledge questions and to give a probability that their answer was correct, although
some calibration studies have asked subjects to make predictions of future events,
such as predicting the winning team for future basketball games (Yates, 1982). The
main focus of traditional calibration research has been to study overconfidence,
which occurs when subjective probabilities regarding particular outcomes
significantly exceed the percentage of time that the outcomes actually occur (or the
percentage of time that predictions are correct).
Calibration is perfect if the assessed probabilities equal the true (outcome)
values over a series of assessments (Fischhoff, Slovic, & Lichtenstein, 1977; Yates,
1982; Ronis & Yates, 1987; Ravinder, Kleinmutz, & Dyer, 1988; Hora & Hora,
1992). For example, calibration is good if a respondent predicts that 70% of the time
something will occur and 70% of the time it actually occurs (as with weather
forecasting). A proper scoring rule is “an algorithm that assigns a payoff for
probability assessment…whereby a person can maximize his subjectively expected
score only by stating his true beliefs” (Jensen & Petersen, 1973, p. 308). Proper
scoring rule scores are maximized when, over a series of events, predicted
probabilities are close to the observed relative frequencies (Seaver et al.,
1978). Whereas accuracy is usually assessed using scoring rules to determine the
degree to which a respondent’s confidence exceeds accuracy, calibration is normally
assessed with “reliability” diagrams, where the proportion correct is plotted against
the assigned probability. In other words, if perfectly calibrated, the points in a
reliability diagram will fall on a diagonal line. A commonly used accuracy measure
in calibration research is the “Brier” score, which is also referred to as the mean
probability score (Ronis & Yates, 1987). The Brier score is an overall measure of
the accuracy of a set of probabilistic judgments and lower Brier scores are indicative
of better accuracy (Ronis & Yates, 1987). The Brier score is the mean squared
difference between a stated probability and outcome, can range between 0 and 1, and
can be denoted by the following equation:
Brier score = (1/N) Σ (fᵢ − dᵢ)², (1)
where fᵢ = the probability assigned to the target event on judgment i, and dᵢ = the
outcome for the target event, which is 1 if the event occurs and 0 if it does not (with
N equal to the number of judgments). As previously stated, most calibration studies have
been done using two-choice outcomes, so by chance, a respondent has a 50% chance
of being correct simply by guessing (i.e. respondent does not have a preference for
one choice versus another). When a respondent is guessing the outcome, and does
not have a preference for one choice versus another, this results in a Brier score of
.25 (e.g. (.5 – 0)² = .25 and (.5 – 1)² = .25). Following the Brier score equation,
when accuracy is perfect, the Brier score would be equal to 0, and poor performance
would result in a Brier score of 1. Thus, Brier scores less than .25 approaching 0 are
indicative of good accuracy, and Brier scores greater than .25 approaching 1 are
indicative of increasingly poor performance.
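Equation (1) is straightforward to compute; a minimal sketch in Python, with hypothetical judgments:

```python
def brier_score(probs, outcomes):
    """Mean squared difference between stated probabilities (f) and
    binary outcomes (d: 1 if the event occurred, 0 if it did not)."""
    return sum((f - d) ** 2 for f, d in zip(probs, outcomes)) / len(probs)

# A respondent who always guesses .5 scores .25 regardless of the outcomes:
print(brier_score([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 1]))  # 0.25

# Perfect assessments score 0:
print(brier_score([1.0, 0.0], [1, 0]))  # 0.0
```

The first call reproduces the chance benchmark of .25 discussed above; scores below it indicate genuine discrimination, scores above it worse-than-chance performance.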
In contrast to the Brier score (an accuracy measure), the most common way
that calibration is assessed is through the use of “reliability diagrams”. Reliability
diagrams are plots of the relative proportion correct against assigned probabilities,
and if perfectly calibrated, all points will fall on a diagonal line (Ronis & Yates,
1987). An example of these reliability diagrams, commonly called
“calibration graphs”, is given in Figure 2. The graphs in Figure 2 are taken from a
study by Ronis and Yates (1987), and show data for two different individuals (a and
b) that made assessments about answers to general-knowledge items. By looking at
graph (a) in Figure 2, it can be seen that this individual has more extreme judgments
that are also accurate. Thus, this person is fairly well calibrated, with a Brier score
of .218 (better than the chance level of .25). In contrast, graph (b) in Figure 2 shows an
individual that has few extreme judgments (few greater than .6 and most around
chance) with a higher Brier score of .251 (about chance). Analysis of these two
graphs would reveal that person (a) is better calibrated than person (b). Further, this
example shows the relationship between calibration and accuracy; when judgments
are better calibrated they are also likely to be more accurate. However, a judge can
also be reasonably well calibrated but not necessarily accurate if his or her
judgments are collectively closer to .5 (indicating no knowledge or “ignorance”).
Figure 2. Example Calibration Graphs (Ronis & Yates, 1987)
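The binning behind a reliability diagram can be sketched in Python; the bin width and the example judgments are assumptions made for illustration:

```python
from collections import defaultdict

def calibration_table(probs, outcomes, n_bins=10):
    """Group judgments by stated probability and compare each bin's
    mean stated probability against the observed proportion correct.
    Under perfect calibration the two values match in every bin, so
    the plotted points fall on the diagonal of a reliability diagram."""
    bins = defaultdict(list)
    for f, d in zip(probs, outcomes):
        k = min(int(f * n_bins), n_bins - 1)  # f == 1.0 joins the top bin
        bins[k].append((f, d))
    table = []
    for k in sorted(bins):
        fs = [f for f, _ in bins[k]]
        ds = [d for _, d in bins[k]]
        table.append((sum(fs) / len(fs), sum(ds) / len(ds)))
    return table  # [(mean stated probability, observed proportion), ...]

# The .6 judgments were correct only half the time (overconfident);
# the .9 judgments were always correct.
print(calibration_table([0.6, 0.6, 0.9, 0.9], [1, 0, 1, 1]))
```

Plotting the returned pairs against the diagonal gives exactly the kind of graph shown in Figure 2.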
Early calibration studies found that subjects were extremely overconfident
with both probability and odds responses, under conditions of both minimal and
extreme instructions, and with different types of questions (Fischhoff et al., 1977).
Most of the early calibration studies used either discrete propositions with one or two
alternatives, or open-ended questions with no alternatives. For example, the general-
knowledge question “What is the name of the Roman emperor who fiddled while
Rome burned?” could be asked with one choice given (Nero), two choices given
(Nero and Caesar), or with no choices given (open-ended). In the first case the
respondent would choose if the given answer is correct, in the second case the
respondent would pick one of the choices, and in the third case the respondent would
need to give an answer. For each of these formats, the respondent would give a
probability that their response was correct.
The most pervasive finding in the calibration literature is that subjects are
overconfident with general knowledge items of moderate or extreme difficulty
(Fischhoff et al., 1977; Lichtenstein et al., 1982). In addition, it was determined that
overconfidence was most extreme with hard tasks and that
overconfidence is reduced with easier tasks. This can partly be attributed to subjects
being wrong less often when completing easier tasks. This phenomenon has been
termed the hard-easy effect and was attributed to the inability of a respondent to
determine how easy or hard a particular task was (Lichtenstein et al., 1982).
Furthermore, Fischhoff et al. (1977) showed that when judging lethal events, people
engaged in extreme levels of overconfidence. Also, when assessing continuous
distributions, overconfidence could also develop due to an anchoring bias that occurs
in the assessment of SPD’s, where choosing different starting points yields
inaccurate estimates that tend to be biased towards these initial values (Tversky &
Kahneman, 1974; Lichtenstein et al., 1982). More recent research has found similar
results of extreme overconfidence when using interval estimates (Soll & Klayman,
2004; McKenzie, Liersch, & Yaniv, 2008). In that research, subjectively assessed
intervals contained the true value a low percentage of the time (e.g. a 90%
subjectively estimated interval would contain the true value only 50% of the time).
In addition, previous research has suggested that interval estimates are more prone to
overconfidence than discrete assessments and that the degree of overconfidence
exhibited depends upon how intervals are elicited (Soll & Klayman, 2004).
Historically, the main focus of calibration research has been to study overconfidence,
with the concern of assessment being the discrepancy between prediction and
outcome, or more specifically to assess whether the expressed probability exceeds
accuracy.
Criticisms of Calibration Research
Liberman and Tversky (1993) presented one of the first critiques of
how data in calibration studies are analyzed. They identified
two different types of overconfidence inherent in calibration studies, “specific” and
“generic” overconfidence. Specific overconfidence occurs when subjects
overestimate the probability of a specific hypothesis (e.g. rain) and generic
overconfidence occurs when subjects overestimate the probability of the hypothesis
they think is most likely. The same set of judgments can result in different levels of
accuracy depending on which designation is used (Liberman & Tversky, 1993). The
authors conclude that there are subtle conceptual problems inherent in the approach
to evaluating probability judgments and that it might be of benefit to consider
alternative methodological approaches to analyzing calibration data.
In an influential paper, Dawes and Mulford (1996) provide a critical review
of overconfidence and challenge the contention that this bias is an “established fact
of psychology.” The authors suggest that the empirical support for the
overconfidence effect is “inadequate and logically flawed”, and claim that
overconfidence is due to the way that items and judgments are assimilated (Dawes &
Mulford, 1996). Specifically, Dawes and Mulford (1996) contend that
overconfidence follows analytically from the functional relationship used to
demonstrate it. The functional relationship they describe here is the relationship
between increased levels of confidence and resulting overconfidence. Further,
overconfidence is primarily attributed to “regression effects” which occur as an
inevitable result of the two-choice general-knowledge paradigm with two outcomes,
where confidence levels can range between 50% and 100% (Dawes & Mulford,
1996). Regression effects can be expected when "the most accurate judgments are
the least overconfident, the most accurate judgments are most confident, and the
most confident judgments are most overconfident" (Griffin & Varey, 1996).
Despite these criticisms, Brenner, Koehler, Liberman, & Tversky (1996)
argue for overconfidence as a common and persistent bias and assert that
overconfidence cannot merely be explained as a byproduct of statistical analysis (i.e.
regression artifact). Further, the authors demonstrate that overconfidence is not
eliminated when a random selection of general-knowledge items is chosen and that it
does not disappear with estimates of relative frequency (Brenner et al., 1996). These
findings are important because the overconfidence effect has been challenged on the
grounds that it is a product of the difficulty level of questions (or that random items
are used) in the general-knowledge paradigm (Gigerenzer, 1993; Juslin, 1994;
Dawes & Mulford, 1996; Wallsten, 1996). Additionally, Klayman, Soll,
Gonzalez-Vallejo, & Barlas (1999) found evidence for the overconfidence bias in a
series of experiments that separated systematic effects from the statistically
inevitable consequences of imperfect judgment. The authors concluded that there is
considerable overconfidence when subjects are asked to construct confidence
intervals (Klayman et al., 1999). These results further suggest that varying results
can be obtained in overconfidence research depending on the method used to elicit
responses.
In a commentary by Griffin and Varey (1996), the authors point out that
although overconfidence is often found in calibration studies, under-confidence is
also found. They note that overconfidence is common with the general-knowledge
question paradigm when the stimulus materials used may not reflect “real-world
phenomena.” Also, they agree that the process of data aggregation “more
often hides the true characteristics of the data than reveals it” (Griffin & Varey,
1996, p. 228). What this means is that different methods of data aggregation and
analysis can lead to very different results regarding overconfidence. Further, the
authors acknowledge the existence of the “paradoxical findings” that can be created
by regression towards the mean, due to the imperfect correlation that exists between
confidence judgments and the accuracy of those judgments (Dawes & Mulford,
1996).
No matter what the contention or debate, most of the previous research
reported has alluded to the notion that calibration studies should turn away from
the standard use of general-knowledge questions in favor of examining real-world
phenomena such as the prediction of future events (Ayton & McClelland, 1997).
Predicting real-life events is perhaps a more substantive task since these events are
generally more familiar to respondents as opposed to answering abstract general-
knowledge items. Also, it is important to point out that the majority of calibration
research has been conducted using discrete assessment and not continuous random
variables. Logically, it would seem that assessments for discrete and continuous
variables should have similar results with regard to prediction outcomes. However,
more information is obtained when assessing continuous probability and this has
rarely been done in calibration research.
Probability Decomposition
A less studied aspect of subjective probability and probability assessment is
that of probability decomposition. Direct or holistic assessment has been the most
common approach used to assess subjective probability, whereas decomposition
involves making several sub-component judgments that are related to the same
overall probability. For example, the total points scored in an NFL football game
(holistic assessment) could be broken down into the total points scored by the Home
Team (sub-component judgment 1) and the total points scored by the Away Team
(sub-component judgment 2). The main idea behind decomposition is that breaking
down probability into sub-component judgments should lead to more accurate
estimates (and predictions) because this is a seemingly easier task for subjects to
complete. Probability decomposition is thought to be effective because it reduces the
overall complexity of a judgment task by breaking down a larger problem into
smaller, more easily assessed components. This notion, the “divide and conquer”
hypothesis, asserts that “making subjective judgments about components is a
simpler task than making judgments about the whole” and that component judgments
are less prone to error or bias as compared to overall judgments (Raiffa, 1968;
Henrion, Fischer, & Mullin, 1993). The potential usefulness of decomposition can
be demonstrated by thinking about the last presidential election. For example,
making the holistic assessment of who will win the election is intuitively a more
difficult task than making sub-component judgments about the winner when
considering who will win individual states or regions. A great deal of past research
in the area of judgmental biases and heuristics has shown that people use mental
shortcuts when making judgments about unknown quantities (Lichtenstein et al.,
1982) and it can be inferred that sub-component probability judgments would allow
respondents to focus more on the simpler event being assessed. Despite the potential
benefits of probability decomposition, there are few studies in the literature
that have explored it.
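The recomposition check implied by decomposition can be sketched in Python; the game probabilities below are illustrative assumptions, not data from this study:

```python
def recomposed(sub_probs):
    """Recompose a holistic probability from sub-component judgments
    over mutually exclusive, exhaustive events, by summation."""
    return sum(sub_probs)

# Hypothetical judgments for one game (illustrative values only):
holistic = 0.70     # P(favorite wins the game), assessed directly
sub = [0.40, 0.25]  # P(wins by the spread or more), P(wins by less than the spread)

# A nonzero gap means the holistic and decomposed judgments are not
# internally consistent; here the holistic judgment is .05 too high.
gap = holistic - recomposed(sub)
print(round(gap, 10))  # 0.05
```

Comparing the recomposed value (or its accuracy score) against the holistic one is the basic design by which the "divide and conquer" hypothesis is tested.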
The limited amount of research done in the area of probability decomposition
has yielded mixed results, with the results of some studies supporting the benefits of
decomposition and the results of others showing that it does not outperform holistic
assessment. McGregor, Lichtenstein, and Slovic (1988) found significant
improvements in accuracy due to decomposition. They compared holistic estimates
to sub-component judgments for both experimenter decompositions and subject
decompositions. “Experimenter decompositions” are decompositions where the
researcher specifies the sub-component variables and “subject decompositions” are
where the sub-components are identified and assessed by the participant. They
found that decomposition outperformed holistic assessment for decompositions
provided by the experimenter but not for subject decompositions. The authors
concluded that it was likely the increased cognitive effort exerted by participants in
the subject decomposition condition that was responsible for the obtained results
(McGregor et al., 1988). Another study conducted by Armstrong, Denniston, and
Gordon (1975) had subjects make holistic probability estimates versus decomposed
assessments for point estimates of almanac quantities (e.g. the number of students
who dropped out of high school in 1969) and found that holistic assessments were
not as accurate as decomposed assessments. Additionally, they also found that
decomposition was more beneficial when there was a higher magnitude of
uncertainty (Armstrong et al., 1975). Hence, both of these early studies
demonstrated the usefulness of decomposition; however, because both relied on
point estimates, their results gave no insight into the usefulness of constructing
full subjective probability distributions (Henrion et al., 1993).
Henrion et al. (1993) were concerned with assessing probability
decomposition in terms of assessing full subjective probability distributions (SPD’s)
for continuous quantities versus simply obtaining point estimates. They compared
“experimenter decomposition” with “subject decomposition” and, contrary to the
“divide and conquer” hypothesis, decomposition did not significantly improve
accuracy or the calibration of subjective probability distributions (Henrion et al.,
1993). Additionally, no difference was found between experimenter or subject
decompositions and both types of decomposition led to over- and underestimating
unknown almanac quantities. The authors suggested that thinking about the full
range of a distribution may compromise the benefits of decomposition; however,
they pointed out that the mixed findings obtained in decomposition studies indicate
the need for further examination of the efficacy of decomposition (Henrion et al., 1993).
Additional research aimed at providing a framework for how best to
utilize decomposition examined the error reduction attributable to decomposition
and found that it depended considerably on the quality of the conditional
probabilities used (Ravinder et al., 1988). In other words, the precision (or
wording) of the conditional probabilities was a critical component of the
usefulness of decomposition. Also, it was found that decomposition is only
beneficial up to a certain point, depending on the number of sub-component
probabilities used. In
other words, there is a diminishing return at a certain point, and increasing the
number of conditioning events is no longer useful. Hence, the utility of
decomposition may directly relate to the number of sub-component judgments used.
A problem with the previously mentioned decomposition studies was that the
majority of stimuli used were almanac questions, which are not very
intuitive. Overall, few systematic empirical studies have shown that decomposition
reliably improves accuracy when predicting unknown events. The results from
previous research have suggested that assessing decomposition using real-world
stimuli (and when constructing full SPD’s) would provide the most insight into the
usefulness of decomposition.
Purpose of Research
Traditionally, calibration studies and other research that have examined
decomposition have used stimuli that consist of factual information that is generally
unknown to subjects, such as obscure general-knowledge questions (Lichtenstein et
al., 1982; Hora & Hora, 1992; Henrion et al., 1993; Hora, 2004; Soll & Klayman,
2004; McKenzie et al., 2008). An additional criticism of general-knowledge
questions is that they are not intimately related to a participant’s level of expertise
(Hora, 2004). However, there has been a fair amount of research using predictions
of football games, a “real-world” domain considered to be more intuitive,
interesting, and motivating to most individuals (Vergin & Scriabin, 1978; Tryfos,
Casey, Cook, Leger, & Pylypiak, 1984; Dana Jr. & Knetter, 1994; Lee & Smith,
2002; Boulier & Stekler, 2003; Song, Boulier, & Stekler, 2007; Tsai, Klayman, &
Hastie, 2008). Using predictions of football games
also allows for the assessment of uncertainty since the outcomes of future events are
unknown. Further, those regularly following games or wagering on games are
thought to be “intuitive statisticians” since feedback is given quickly and repeatedly
regarding predictions and many who wager have played sports at one time or another
(Lee & Smith, 2002; Tassoni, 1996). For these reasons, the current research will be
using assessments of various outcomes for both National Collegiate Athletic
Association (NCAA) and National Football League (NFL) football games.
Additionally, for any given football game, it is important to note that the “point
spread” is established by oddsmakers and indicates the number of points by which
the favored team is expected to defeat the weaker team in the matchup
(Tassoni, 1996). In terms of wagering, the point spread is added to the
weaker (dog) team’s final point total and compared to the favored team’s final point
total to determine if the wager is won or lost.
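The wager-settlement rule just described can be expressed directly in code. This is a minimal sketch of the arithmetic; the function name and the string labels are my own, and a tie after adjustment ("push") is returned as its own case:

```python
def settle_spread_wager(favorite_points, dog_points, point_spread):
    """Settle a bet on the favorite against the spread.

    Per the rule in the text: the point spread is added to the dog's
    final point total and compared with the favorite's final total.
    Returns 'favorite covers', 'dog covers', or 'push'.
    """
    adjusted_dog = dog_points + point_spread
    if favorite_points > adjusted_dog:
        return "favorite covers"
    if favorite_points < adjusted_dog:
        return "dog covers"
    return "push"

# Super Bowl example from Table 4: the Colts were favored by 6 but the
# Saints (the dog) won 31-17, so a bet on the Colts fails to cover.
result = settle_spread_wager(17, 31, 6)
```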
The current research had three main goals: 1) to examine accuracy for
various predictions of NFL and NCAA football game outcomes, 2) to compare
sequential 2-interval and simultaneous 4-interval forced-consistency methods of
elicitation for SPD’s, and 3) to examine probability decomposition using stimuli
from a real-world domain. Thus, the purpose of the current research was to examine
calibration and accuracy for various independent prediction events based on the point
spread for NFL and NCAA football games, to examine methods of elicitation for
continuous subjective probability distributions and to examine probability
decomposition using stimuli that are well-known and intuitive to participants.
Exploring alternative ways to best construct continuous probability distributions is
paramount to calibration research. It is speculated that by using a stimulus domain
that is well-known to participants, greater insight will be gained into the usefulness
of assessing continuous quantities, and more insight will be gained into the efficacy
of probability decomposition.
Hypotheses
It was hypothesized that participants would demonstrate the most accuracy in
prediction for questions regarding game winners and for questions about winners
against the established point spread, since such events are the most common for
those who regularly follow sports. It was also hypothesized that participants would
be more accurate when assessing SPD’s based on a simultaneous 4-interval forced-
consistency method versus a sequential 2-interval method, since the simultaneous 4-
interval forced-consistency method shows respondents the full range of possible
outcomes all at once. It is of interest to explore accuracy for various quantities of
continuous probability distributions and to compare different methods of obtaining
continuous distributions. It was also hypothesized that probability decomposition
would lead to more accurate estimates of unknown events than holistic assessments.
Further, degree of expertise (experts versus non-experts) was evaluated, and
questions evaluating expertise were part of the general survey. These questions
addressed the degree to which respondents follow sports and wager on games in
general, as well as how closely they follow the specific sport being assessed
(e.g. NFL vs. NCAA football).
CHAPTER 2
METHOD
Participants
The data consisted of 115 response sets. Individuals were recruited
through the sports message boards for college and pro football on
Cbssportsline.com and Cnnsi.com, and through the Facebook community. The
demographic characteristics
of this sample are summarized in Table 1. The message boards for
cbssportsline.com and cnnsi.com are venues where college and professional football
fans regularly communicate through threads about various topics related to both
professional and college football. For example, the biggest concern with college
football today is the lack of a playoff system and reliance on the Bowl Championship
Series (BCS) to mathematically determine which two teams should play for the
national championship. Since the BCS is a flawed system, fans create new topic
threads daily during the football season to vent their frustrations and communicate
with others, including columnists who vote in the BCS process. Additionally, prior
to each football weekend, fans interact with others to get opinions about potential
wagers for that given week based on the point spread for certain games. Thus, the
message boards are a medium rich with opportunity to seek football fans who care
about the outcome of games, potentially wager on games regularly, and are
motivated to make accurate predictions. Participants were recruited who had an
interest in sporting events, and specifically in NFL and NCAA football games.
Participants were also asked questions about sports knowledge and wagering in the
survey. The results of these responses are provided in Table 2 and were used to
evaluate the level of expertise in the sample.
Table 1. Demographic Characteristics of Sample
Frequency Percent
Age (Valid N = 111; 4 Missing)
20 to 24 1 .9
25 to 34 30 27.0
35 to 44 37 33.0
45 to 54 21 18.9
55 to 64 12 10.8
65+ 10 9.0
Highest Education (Valid N = 111; 4 Missing)
Some College 16 14.4
2-year College Degree 8 7.2
4-year College Degree 35 31.5
Master’s Degree 35 31.5
Doctoral Degree 12 10.8
Professional (JD / MD) 5 4.5
Ethnicity (N = 109; 6 Missing)
White / Caucasian 100 91.7
African American 4 3.7
Hispanic 1 .9
Asian 4 3.7
Gender (N = 110; 5 Missing)
Male 78 70.9
Female 32 29.1
Table 2. Sports Knowledge and Wagering Characteristics
Frequency Percent
Follow the NFL (N=112; 3 Missing)
Never 9 8.0
Less Than Once a Month 13 11.6
Once a Month 2 1.8
2-3 Times a Month 9 8.0
Once a Week 20 17.9
2-3 Times a Week 31 27.7
Daily 28 25.0
Follow Sports (Valid N = 112; 3 Missing)
Never 7 6.3
Less Than Once a Month 4 3.6
Once a Month 3 2.7
2-3 Times a Month 6 5.4
Once a Week 16 14.3
2-3 Times a Week 23 20.5
Daily 53 47.3
Wager on NFL (Valid N = 112; 3 Missing)
Never 78 69.6
Less Than Once a Month 20 17.9
Once a Month 3 2.7
2-3 Times a Month 4 3.6
Once a Week 4 3.6
2-3 Times a Week 2 1.8
Daily 1 .9
Wager on Sports (Valid N = 112; 3 Missing)
Never 76 67.9
Less Than Once a Month 24 21.4
Once a Month 5 4.5
2-3 Times a Month 3 2.7
Once a Week 2 1.8
2-3 Times a Week 1 .9
Daily 1 .9
The exact number of subjects who participated in the study is unknown (115
response sets were collected), since individuals responded to stimuli over a period
of eight weeks of the football season and participants were not prevented from
responding to the survey multiple times from a given Internet Protocol (IP) address
(e.g. work location), for reasons of convenience. It is possible that some
individuals completed the survey for a given week more than once or responded to
surveys over multiple weeks. However, for the purposes of this research, the unit
of analysis was each game. Additionally, with each game as the unit of analysis,
responses were combined for NCAA and NFL games resulting in a total of 425
measurable responses after game responses missing the majority of data were deleted
from the data set. The week of the season, the number of games assessed per week,
the specific games used, and the total number of original responses are shown in
Table 3. The specific games that were used were those deemed to be most
interesting or most important for a given week of the 2009 season.
Table 3. Summary of Match-ups by Week and Original Responses Collected for NFL and NCAA Season Games
Week Number of Games Games Assessed Responses Total Responses
Super Bowl 1 Indianapolis Colts vs. New Orleans Saints 22 22
NFL League Championships 2 New York Jets at Indianapolis Colts
Minnesota Vikings at New Orleans Saints
18 36
NFL Wildcard Weekend 4 New York Jets at Cincinnati Bengals
Philadelphia Eagles at Dallas Cowboys
Baltimore Ravens at New England Patriots
Green Bay Packers at Arizona Cardinals
31 124
NCAA BCS Bowl Games 5 Rose: Oregon Ducks vs. Ohio State Buckeyes
Orange: Georgia Tech Yellow Jackets vs. Iowa Hawkeyes
Sugar: Florida Gators vs. Cincinnati Bearcats
Fiesta: TCU Horned Frogs vs. Boise State Broncos
Championship: Alabama Crimson Tide vs. Texas Longhorns
30 150
NFL Week 14 5 Carolina Panthers at New England Patriots
Cincinnati Bengals at Minnesota Vikings
Denver Broncos at Indianapolis Colts
San Diego Chargers at Dallas Cowboys
Philadelphia Eagles at New York Giants
14 70
NCAA Week 15 4 Arizona Wildcats at USC Trojans
Florida Gators at Alabama Crimson Tide
Texas Longhorns at Nebraska Cornhuskers
Georgia Tech Yellow Jackets at Clemson Tigers
10 40
NFL Week 13 5 New England Patriots at Miami Dolphins
Tennessee Titans at Indianapolis Colts
New Orleans Saints at Washington Redskins
Dallas Cowboys at New York Giants
Baltimore Ravens at Green Bay Packers
8 40
NFL Week 12 5 Cincinnati Bengals at Cleveland Browns
Carolina Panthers at New York Jets
Washington Redskins at Philadelphia Eagles
New England Patriots at New Orleans Saints
Chicago Bears at Minnesota Vikings
1 5
Table 4. Point Spreads and Outcomes for Assessed Games (With Point Spread in Parentheses)
Week Games Assessed Outcome
Super Bowl Indianapolis Colts (-6) vs. New Orleans Saints Saints 31-17
NFL League Championships New York Jets at Indianapolis Colts (-8) Colts 30-17
Minnesota Vikings at New Orleans Saints (-4) Saints 31-28
NFL Wildcard Weekend New York Jets at Cincinnati Bengals (-4) Jets 24-14
Philadelphia Eagles at Dallas Cowboys (-4) Cowboys 34-14
Baltimore Ravens at New England Patriots (-4) Ravens 33-14
Green Bay Packers at Arizona Cardinals (-4) Cardinals 51-45
NCAA BCS Bowl Games Rose: Oregon Ducks (-4) vs. Ohio State Buckeyes Ohio State 26-17
Orange: Georgia Tech Yellow Jackets (-4) vs. Iowa Hawkeyes Iowa 24-14
Sugar: Florida Gators (-12) vs. Cincinnati Bearcats Florida 51-24
Fiesta: TCU Horned Frogs (-8) vs. Boise State Broncos Boise State 17-10
Championship: Alabama Crimson Tide (-6) vs. Texas Longhorns Alabama 37-21
NFL Week 14 Carolina Panthers at New England Patriots (-14) Patriots 20-10
Cincinnati Bengals at Minnesota Vikings (-7) Vikings 30-10
Denver Broncos at Indianapolis Colts (-8) Colts 28-16
San Diego Chargers at Dallas Cowboys (-4) Chargers 20-17
Philadelphia Eagles at New York Giants (-4) Eagles 45-38
NCAA Week 15 Arizona Wildcats at USC Trojans (-8) Arizona 21-17
Florida Gators at Alabama Crimson Tide (-6) Alabama 32-13
Texas Longhorns (-15) at Nebraska Cornhuskers Texas 13-12
Georgia Tech Yellow Jackets (-4) at Clemson Tigers Georgia Tech 39-34
NFL Week 13 New England Patriots (-6) at Miami Dolphins Dolphins 22-21
Tennessee Titans at Indianapolis Colts (-8) Colts 27-17
New Orleans Saints (-10) at Washington Redskins Saints 33-30
Dallas Cowboys (-4) at New York Giants Giants 31-24
Baltimore Ravens at Green Bay Packers (-4) Packers 27-14
NFL Week 12 Cincinnati Bengals (-14) at Cleveland Browns Bengals 16-7
Carolina Panthers at New York Jets (-4) Jets 17-6
Washington Redskins at Philadelphia Eagles (-9) Eagles 27-24
New England Patriots at New Orleans Saints (-4) Saints 38-17
Chicago Bears at Minnesota Vikings (-11) Vikings 36-10
Materials
All responses were obtained and recorded using Qualtrics online survey
software (www.qualtrics.com). Qualtrics is a powerful tool for designing surveys
with various formats (multiple-choice, rating scales, etc.) and for collecting data in
real-time as participants complete questions about various games. An example of the
survey used for NFL (and similarly for NCAA) football games is provided in
Appendix B. All surveys for each week followed the same format and only differed
in the number of games that were assessed for a given week. When multiple games
were assessed, questions of a similar mode or type would all be asked together at the
same time (e.g. picking the favorite to win and providing a probability). The first
page of each survey was the information sheet, which included information about the
purpose of the study, participant involvement, payment/ compensation details for
participation, confidentiality, and investigator contact information. The last question
on the information sheet asked participants if they read the description of the study
and agreed to participate. In order to access the survey, the agreement to participate
was required.
The first set of questions (six total) in the survey was based on the
established point spread for a particular game and was used to assess events related
to a subjective probability distribution. This will be referred to as the sequential
2-interval method and is one of the methods used to construct an SPD, since
obtaining several points is the most common way of constructing SPD’s (Clemen,
1991; Hora & Hora, 1992; Hora, 2007). These six questions relate to discrete points
on the CDF (and also to six interval ranges for the response variable “favorite –
dog”). These points (P1 through P6) were used to estimate the CDF and are shown
in Figure 3 (Probability Plot of the Sequential 2-Interval Method). All questions in
the survey were presented in the same order and phrased in terms of the team that
was favored to win the game, except for the last of the six questions, which was
phrased in terms of the “dog,” the team that was not favored to win the game.
Thus, to reiterate, the unit of analysis or response variable was “favorite – dog” in
terms of point spread. Please see Appendix B for a sample of the survey questions
used.
Figure 3. Probability Plot of the Sequential 2-Interval Method
The first question (point P1 in Figure 3) simply asked the respondent to
provide a probability (between 0% and 100%) indicating how certain they were
that the favorite would win the game (Winner). The second question (point P2 in
Figure 3) asked for the probability that the favorite would win by the point spread
divided by two (PS / 2), the third question (point P3 in Figure 3) asked for the
probability that the favorite would win by less than the point spread (< PS), the
fourth question (point P4 in Figure 3) asked for the probability that the favorite
would win by the point spread (PS), and the fifth question (point P5 in Figure 3)
asked for the probability that the favorite would win by the point spread plus seven
points (PS + 7). The last of the six questions (point P6 in Figure 3) asked for the
probability that the dog would win by the point spread divided by two (- PS / 2).
For example, in the Super Bowl match-up between the Indianapolis Colts and
the New Orleans Saints, the Colts were favored to win the game, and the point
spread set by oddsmakers was 6 points. The first question (P1) asked the
participant to give a probability that the Colts would win the game, not considering
the spread (Winner); the second question (P2) asked for the probability that the
Colts would win the game by 3 points or more (PS / 2); the third question (P3)
asked for the probability that the Colts would win the game by 5 points or less
(< PS); the fourth question (P4) asked for the probability that the Colts would win
the game by 6 points or more (PS); and the fifth question (P5) asked for the
probability that the Colts would win the game by 13 points or more (PS + 7). The
last of the six questions (P6) asked the respondent to give a probability that the
Saints would win the game by 3 points or more (- PS / 2). It is important to point
out that for each different game assessed, the point spreads related to these six
points could differ, but the same approach was used to set up the first six
questions. To reiterate, these first six questions (as shown in Figure 3) were used
to estimate the CDF.
The next set of questions in the survey (which again was only one set for the
Super Bowl as shown in Appendix B) was used to examine the simultaneous 4-
interval forced-consistency method of elicitation. These four questions relate to
intervals (a through d) on the CDF and are shown in Figure 4 (Probability Plot of the
Simultaneous 4-Interval Forced-Consistency Method). For questions of this type,
four outcomes were presented and the respondent was to provide probabilities for the
given outcomes so that the total probability was equal to 100%. The favorite was
always listed in the first two choices and the dog was always listed in the last two
choices. The first choice (interval d) asked if the favorite would win by 8 or more
points, the second choice (interval c) asked if the favorite would win by 7 points or
less, the third choice (interval b) asked if the dog would win by 7 points or less, and
the last choice asked if the dog would win by 8 points or more (interval a). To
reiterate, the respondent was forced to provide probabilities over the four outcomes
to equal a total probability of 100%. For example, considering the Super Bowl, the
first choice asked if the Colts would win by 8 points or more (interval d), the second
(interval c) if the Colts would win by 7 points or less, the third (interval b) if the
Saints would win by 7 points or less, and the last choice (interval a) asked if the
Saints would win by 8 points or more. The final questions on the survey asked about
sports knowledge and wagering behavior, demographics, email information (if the
participant wished to participate in the lottery/incentives, as explained in the
Procedure sub-section), and for general feedback about the experience of taking
the survey and/ or ways that the survey could be improved. Again, please refer to
Appendix B, which includes an example of the survey used for the NFL Super Bowl,
since the same format was used for all football games assessed, whether it was the
NFL or college football games.
Figure 4. Probability Plot of the Simultaneous 4-Interval Forced-Consistency Method
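The forced-consistency constraint described above (four interval probabilities summing to 100%) can be sketched as a small validation helper. This is illustrative only: the function name and renormalization behavior are my own assumptions, not the mechanism the survey software actually used.

```python
def force_consistency(p_a, p_b, p_c, p_d, tol=1e-9):
    """Given probabilities (in percent) for the four intervals
    a: dog wins by 8+, b: dog wins by 7 or less,
    c: favorite wins by 7 or less, d: favorite wins by 8+,
    rescale them so the total is exactly 100%."""
    probs = [p_a, p_b, p_c, p_d]
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    total = sum(probs)
    if total <= tol:
        raise ValueError("at least one interval must receive probability")
    return [100.0 * p / total for p in probs]

# A respondent who entered 10, 20, 30, 50 (a total of 110) would be
# rescaled proportionally so the four values sum to 100.
consistent = force_consistency(10, 20, 30, 50)
```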
Procedure
Potential participants were provided with a brief description and purpose of the
study, along with motivation to participate, on the message boards on
Cbssportsline.com and Cnnsi.com and in the Facebook community during the week
prior to the weekend
games. The survey was made available until the first game for a given week
commenced (e.g. the first NFL game on Sunday or the first NCAA game on
Saturday) and then was deactivated. Similarly, for the BCS bowl games, the survey
was available to respondents until the first game of the set of bowl games was played
(e.g. the first BCS bowl game on January 1st). Games were selected based on
researcher intuition about interesting match-ups or games that had particular
consequences in terms of the NFL playoff race or interesting NCAA games (such as
bowl games). Thus, games were not randomly chosen. Obviously, for the NFL
playoffs and for the BCS bowl games, these match-ups were the only ones that were
applicable for the particular set of assessments. One advertising message was posted
on the message boards and a different but similar message was posted on Facebook.
The following is the advertising message used for Cbssportsline.com and Cnnsi.com
for the NCAA BCS bowl games (the NFL advertisement was the same except for the
direct reference to the type of game outcomes being assessed):
Hey sports fans! I am posting to let everyone know I am conducting a study
where people make various predictions of sporting event outcomes (game
winners, against the spread, etc.). For those who want to participate, you can
follow the link below to the online survey to complete various questions
predicting BCS bowl game outcomes. For your participation, you will be
entered in a drawing to win $250 and for the person who makes the most
accurate predictions, $200 will be rewarded. If you are interested, please
follow the link below. Any questions can be addressed to pjd@usc.edu.
Thanks for your participation!
For the Facebook community, a similar but condensed advertisement message was
used, since space to post messages is much more limited on that site compared to
Cbssportsline.com and Cnnsi.com. The following message was used
and posted for public display on Facebook:
I am posting to let everyone know I am conducting a study predicting
sporting event outcomes. You can follow the link below to the online survey
about BCS bowl games. For your participation, you will be entered in a
drawing to win $250 and for the person who makes the most accurate
predictions, $200 will be rewarded. Thanks!
Respondents interested in participating could click on the link provided and
were then directed to Qualtrics.com, where assessment instructions and materials
were provided in the form of the online survey. As mentioned, motivation to
participate was provided by informing subjects that, upon completion of the survey,
they would be entered into a drawing for a $250 cash reward (thus, every subject
had an equal chance of winning this prize) and, additionally, that the participant
with the most accurate assessments would receive $200.
who were interested in the above incentives were asked to leave their email address
in the online survey. The proportion of individuals who left an email address was
71.1%. Previous research with sporting events has used similar incentives for
participation and enhanced motivation (Tsai, et al., 2008). One participant was
awarded $250 through a random drawing, and one other participant was awarded
$200 based on having the most accurate assessments in terms of an overall average
of Brier scores across all six point spread predictions.
Point spreads for all games were taken from Superbook.com, an online
wagering site. Superbook.com establishes all
point spreads first based on established Las Vegas point spreads, but there may be
slight variations to lines (throughout a given week) based on wagering behavior.
These variations reflect the amount of money that individuals are wagering on a
particular side of a game (e.g. point spread, game winner), since the goal of the
oddsmakers is to keep 50% of the money on each side of the wager so that the
house always makes money for any given wagering situation (Tassoni, 1996).
Historically, point spreads for particular games usually remain in close proximity
to one another across different online betting sites, and close to the established
Las Vegas betting line. For each of the games assessed in the
survey, the opening point spread was used to set up the first six questions in the
survey.
It is important to point out that each subject made probability assessments for
each game and the method was systematic, even though point spreads were possibly
different for various games being assessed (as previously mentioned). A systematic
approach was used to obtain the six estimates or assessments for the sequential 2-
interval method (but not for the simultaneous 4-interval forced-consistency method
as those intervals were always set at the same amounts). As mentioned earlier, the
following six points were assessed for the continuous distribution based on the
sequential 2-interval method: P1 = P (favorite wins), P2 = P (favorite wins by point
spread divided by 2), P3 = P (favorite wins by less than point spread), P4 = P
(favorite wins by point spread), P5 = P (favorite wins by point spread plus 7 points),
P6 = P (dog wins by point spread divided by 2). For example, if the established
point spread for a particular game was 8 points, then P1 = (> 0 or Winner),
P2 = (4), P3 = (7 or less), P4 = (8), P5 = (15), and P6 = (- 4), or that the dog
would win by 4 points. For an established point spread of 4 points, then P1 = (> 0
or Winner), P2 = (2), P3 = (3 or less), P4 = (4), P5 = (11), and P6 = (- 2), or that
the dog would win by 2 points. It is important to note that if the oddsmakers set
any game line at less than 4 points, then 4 points was used as the basis for the
established spread so that all of the above points could be assessed (i.e. if the
established spread was 2 points, not all six points could be calculated in a way
that made logical sense). Additionally, any half-point lines (e.g. a 7.5-point
favorite) were rounded to the next highest number (if 7.5, then 8 points was used)
and assessment points were constructed in the manner outlined above. The same
approach was used for all games, whether NFL or NCAA football games were
assessed, and for all weeks that data were collected.
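The construction rules above (six points derived from the spread, a minimum spread of 4, half-point lines rounded up) can be summarized in a short sketch. The function name and return format are my own, and integer halving of odd spreads is an assumption of this sketch, since the text does not specify how odd spreads were halved:

```python
import math

def sequential_points(point_spread):
    """Derive the six sequential 2-interval assessment points (P1-P6)
    from an established point spread.

    Rules from the text: half-point lines are rounded up to the next
    whole number, and any spread below 4 points is raised to 4 so that
    all six points remain distinct and logically ordered.
    """
    ps = math.ceil(point_spread)  # e.g. a 7.5-point favorite becomes 8
    ps = max(ps, 4)               # spreads under 4 are treated as 4
    return {
        "P1": "favorite wins (> 0)",
        "P2": ps // 2,    # favorite wins by PS / 2 (integer halving assumed)
        "P3": ps - 1,     # favorite wins by less than PS
        "P4": ps,         # favorite wins by PS
        "P5": ps + 7,     # favorite wins by PS + 7
        "P6": -(ps // 2), # dog wins by PS / 2
    }

# Matches the worked example in the text for an 8-point spread:
# P2 = 4, P3 = 7 or less, P4 = 8, P5 = 15, P6 = -4.
points = sequential_points(8)
```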
CHAPTER 3
DESIGN AND ANALYSIS
Proper Scoring Rules for Accuracy
One of the most common ways of assessing the quality of probability
assessments is through the use of proper scoring rules (Jensen & Peterson, 1973). As
previously mentioned, a common scoring rule that has been used to assess accuracy
when there are two outcomes is the “Brier score” and this measure uses a quadratic
loss function to assess overall performance, where lower scores indicate greater
accuracy (Yates, 1982; Liberman & Tversky, 1993). Again, the Brier score (BS),
sometimes called the mean probability score, is given by Equation 1. The most
common proper score rule that is used to assess accuracy when there are four
outcomes in the quadratic scoring rule (QSR) (Jensen & Peterson, 1973). The
quadratic scoring rule is given by the following equation:
(2)
where pc = probability assigned to outcome that obtains, and pi = probability
assigned to each of the possible k outcomes, including the one that obtains. What is
meant by the “outcome that obtains” is the outcome that actually occurs. The QSR
and BS are related and the main difference is that the QSR is normally used when
there are four outcomes as opposed to two. For 2 intervals, the QSR reduces to the
following: , where pc is same as above. The
QSR ranges from -1 (the worst case scenario, where the probability = 0 for the event
40
that occurs) to +1 (the best case scenario, where the probability = 1 for the event that
occurs). In relation to the QSR, the BS is a special case of the QSR, and can also be
given by the following formula: , where pc is the outcome that
obtains. The BS is a linear transformation of the QSR, which is also a proper
(quadratic) scoring rule given by the following:
(3)
Thus, the BS flips the QSR so that lower scores are better, and shrinks the range to
the unit interval of 0 to 1. In contrast, for QSR scores, higher scores are better, but
since the BS multiplies by -1, lower scores are better for BS. By taking the QSR
formula for 2 outcomes, , expanding the terms,
and then doing the linear transformation:
, (4)
results in exactly the . In summary, the worst case for the QSR is a
value of –1 and for the BS a value of +1 and the best case for the QSR is a value of
+1 and for the BS a value of 0. The BS was used to assess accuracy for two-interval
outcomes and the QSR was used to assess accuracy for four-interval outcomes.
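Since the BS and QSR are connected by the linear transformation in Equation 3, the relationship can be verified numerically. The sketch below is illustrative code (not code from this study) implementing the two rules exactly as defined here:

```python
def qsr(probs, obtained):
    """Quadratic scoring rule (Equation 2): 2*pc minus the sum of squared
    probabilities; ranges from -1 (worst) to +1 (best)."""
    pc = probs[obtained]
    return 2 * pc - sum(p * p for p in probs)

def brier(probs, obtained):
    """Brier score as defined in the text: (1 - pc)**2, where pc is the
    probability assigned to the outcome that obtains; 0 is best."""
    return (1 - probs[obtained]) ** 2

# For two outcomes, BS = (1 - QSR) / 2 holds exactly (Equation 3):
p = [0.7, 0.3]
assert abs(brier(p, 0) - (1 - qsr(p, 0)) / 2) < 1e-12
```

Note that assigning .25 to each of four outcomes gives a QSR of exactly .25, the "ignorance" level mentioned for the QSR in the results.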
Accuracy and Calibration for Independent Events
Each of the six questions (those relating to P1 through P6 in Figure 3) used
to assess the continuous distribution with the sequential 2-interval method was
treated as an independent event, and accuracy and calibration were assessed for each.
It was of interest to explore how participants vary in accuracy and calibration
depending on the type of question they are responding to. For each given point
spread question (P1 through P6), individuals provided a probability that they believed
that a particular event would occur, and this probability was expressed as a
percentage (e.g., 80%). Prior to conducting statistical analyses (such as Brier score
calculations), these percentages were divided by 100 to put them into
decimal form. In the case of the six events or sequential 2-interval method
questions, the actual outcome for that event was mapped onto the prediction for that
outcome, and Brier scores were calculated for each to assess accuracy. In order to
assess calibration, reliability diagrams (plots of the proportion correct against
assessed probabilities) were created for each of the six (P1 to P6) sequential 2-interval
method questions. To summarize, accuracy was assessed with the use of Brier
scores (where lower Brier scores closer to 0 are better), and calibration was assessed
with reliability diagrams (where good calibration is indicative of points falling on a
diagonal line). Additionally, accuracy for 4-interval methods was assessed using the
QSR (where scores closer to 1 are better).
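The points on a reliability diagram of this kind can be computed by binning the forecasts and tallying the proportion correct within each bin. The following is a generic sketch (illustrative code, not the procedure used in this study; the equal-width ten-bin scheme is an arbitrary choice):

```python
from collections import defaultdict

def reliability_points(forecasts, outcomes, n_bins=10):
    """Group forecasts (0-1 scale) into equal-width bins and return, per
    bin, the mean forecast and the observed proportion correct.  Perfect
    calibration puts every point on the diagonal of the diagram."""
    bins = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        idx = min(int(f * n_bins), n_bins - 1)  # a forecast of 1.0 goes in the top bin
        bins[idx].append((f, o))
    points = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_f = sum(f for f, _ in pairs) / len(pairs)
        hit_rate = sum(o for _, o in pairs) / len(pairs)
        points.append((mean_f, hit_rate))
    return points
```

Plotting these (mean forecast, hit rate) pairs against the diagonal reproduces the kind of calibration graph shown later in Figures 7 through 9.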
Robust Statistical Analyses
Traditional methods that are based on means (such as repeated measures
analysis-of-variance) are flawed in terms of achieving power, when certain test
assumptions are violated such as non-normality, skewness, or the presence of
outliers. The finite sample breakdown point is the smallest proportion of
observations that can make a measure of location arbitrarily small or large (Wilcox,
2003). The finite sample breakdown point of the mean is 1 / N. In other words, it
would only take one extreme or unusual value of a distribution to render the estimate
of the mean meaningless. This means that outliers can have a significant deleterious
effect on statistical results when using traditional methods. Further, even under
conditions of relative normality, using traditional approaches can lead to inaccurate
results in comparison to other methods that are available (Wilcox, 2003; Wilcox,
2005). Alternative methods exist but are rarely employed in real practice. There are
many methods available for comparing dependent groups and most deal with the
problems inherent with standard approaches (Wilcox, 2005). Wilcox (2005)
explains that no single approach outperforms others (in terms of power) in all
situations, but the general recommendation is that modern methods outperform
traditional methods. Hence, for the data analysis of dependent groups, alternative
methods (i.e. dependent groups based on trimmed means) were used instead of
traditional methods. For example, a method based on 20% trimmed means would
first sort (rank order from lowest to highest) the scores in the
distribution, and then trim (remove) the top 20% and bottom 20% of scores, thus
removing outliers and reducing the overall skewness of the distribution (Wilcox,
2003). The specific methods that were used will further be explained in the results
section.
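The trimming step itself can be sketched as follows (illustrative code only; Wilcox's own software also handles the accompanying standard errors and inference):

```python
def trimmed_mean(scores, prop=0.2):
    """Trimmed mean: sort, drop the lowest `prop` and highest `prop`
    of observations, and average what remains.  With prop=0.2 this is
    the 20% trimmed mean used in the analyses described here."""
    xs = sorted(scores)
    g = int(prop * len(xs))                 # observations cut from EACH tail
    trimmed = xs[g:len(xs) - g] if g else xs
    return sum(trimmed) / len(trimmed)
```

For example, a single wild Brier score in a small sample moves the ordinary mean substantially but leaves the trimmed mean untouched, which is exactly the breakdown-point argument made above.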
Further, a kernel density estimator (KDE) is an alternative way to examine a
probability density function associated with a random variable (Wilcox, 2003). It is
closely related to a histogram of a continuous variable but involves a smoothing
process and in some cases is a much more effective estimate of the nature of the
distribution (Wilcox, 2005). KDE plots (like histograms) can be used to assess
skewness and normality. KDE plots were used to examine the nature of the
distributions for the six independent event variables (P1 to P6, or the questions used
for direct assessment).
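A minimal Gaussian-kernel version of such an estimator can be sketched as follows (illustrative code; the bandwidth is a free parameter here, and this is not the software used to produce the plots in this study):

```python
import math

def kde(xs, grid, bandwidth):
    """Gaussian kernel density estimate evaluated at each grid point:
    a smoothed alternative to a histogram for judging the skewness and
    normality of a distribution of scores."""
    n = len(xs)
    norm = n * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in xs) / norm
        for g in grid
    ]
```

Evaluating this on a fine grid over the observed Brier scores and plotting the result gives a smooth curve whose long right tail makes positive skewness easy to see.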
Consistency Analysis
Prior to conducting data analysis of the continuous distributions (and in
particular the sequential 2-interval method), it was of interest to examine the
consistency of responses for points along the distribution estimated with the
sequential 2-interval method.
Specifically, these responses should be monotonic, or increasing in value in terms of
probability along the points in the distribution. For the purposes of illustration,
please refer to Figure 5 (the plot for the Consistency Analysis). This plot is similar
to Figure 3, but is carefully structured to show how the area under the CDF should be
increasing as moving towards higher values. For example, by looking at the y-axis
of Figure 5, P6 should be less than (1 - P1), which should be less than (1 - P2), which
should be less than (1 - P4), which should be less than (1 - P5). Since the probabilities
given for P1 to P5 correspond to the area to the right of a point on the distribution, the
area to the left of that point is 1 minus the given probability.
Figure 5. Consistency Analysis
Consistency was assessed in terms of analyzing the degree to which (the
percentage value) participant responses decreased when moving from left to right
along the distribution. This was assessed by using the following decision rule: S =
Minimum [(1 - P1) - P6, ((1 - P2) - (1 - P1)), ((1 - P4) - (1 - P2)), ((1 - P5) - (1 - P4))]. These
values should be positive, so negative values would indicate inconsistency and the
minimum of these values would indicate the largest difference in the wrong
direction. The following decreasing steps for S (in terms of percentage differences)
were noted and recorded: 0-4%, 5-9%, 10-19%, and 20% or more. To be specific, if
a participant was at least 20% inconsistent over any assessments from left to right
along the distribution, then they were recorded as being 20% inconsistent overall.
These percentage values were arbitrarily chosen with the intuition that values
decreasing more than 10% were very inconsistent. Also, assessments where subjects
assigned all 0, 50, or 100 probabilities across the distribution were noted as well and
individuals responding in this manner were not used for the analyses of the
sequential 2-interval method, since these kinds of responses are indicative that
respondents did not clearly understand the questions. In terms of monotonicity,
respondents with decreasing percentage changes of more than 10% were considered
to be very inconsistent, and highly non-monotonic. For example, if a participant
responded with a probability of 50% for question P1 (favorite to win the game), then
1 - P1 would equal 50%. In other words, this would indicate that they are 50%
certain that the favorite would win the game. Now, if that same
participant responded with an 80% probability for question P4 (favorite wins by point
spread or more), then 1 - P4 would be equal to 20%. Since the two assessments
would be non-increasing, they would be considered to be non-monotonic or
inconsistent. Logically, if a participant is 50% sure that the favorite would win the
game, then it does not make sense that they would be more certain that the favorite
would win the game by the point spread (which is more of a constraint). For the
consistency analysis, individuals with missing data for any of the assessed points
were considered missing.
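The decision rule S described above translates directly into code. The sketch below is illustrative (the argument names p1 through p6 mirror P1 through P6, with probabilities already converted to the 0-1 scale):

```python
def consistency_S(p1, p2, p4, p5, p6):
    """S = minimum of the successive CDF steps.  A negative S flags a
    non-monotonic (inconsistent) set of assessments, and the minimum is
    the largest step in the wrong direction."""
    steps = [
        (1 - p1) - p6,        # P6 should not exceed (1 - P1)
        (1 - p2) - (1 - p1),  # (1 - P1) should not exceed (1 - P2)
        (1 - p4) - (1 - p2),  # (1 - P2) should not exceed (1 - P4)
        (1 - p5) - (1 - p4),  # (1 - P4) should not exceed (1 - P5)
    ]
    return min(steps)
```

The worked example above (p1 = .50 but p4 = .80) yields S = -.30, which would fall in the "20% or more" inconsistency category.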
Further, the analysis of consistency was also conducted only using questions
corresponding to the three points of P1 (probability favorite wins game), P4
(probability favorite wins by point spread or more), and P5 (probability favorite wins
by point spread plus 7 points or more). These three points were chosen with the
belief that these questions may have been easier for participants to assess because
they are similar to the common types of wagers that are available on online betting
sites. The three most common wagers that are available to bettors are wagering on
game winners (called the moneyline), betting on the established point spread, and
betting on a “juiced” spread (Tassoni, 1996). Thus, in terms of consistency, (1 - P1)
should be less than (1 - P4), which should be less than (1 - P5). This was assessed by
using the following decision rule: S = Minimum [((1 - P4) - (1 - P1)), ((1 - P5) - (1 -
P4))]. Again, these values should be positive, so negative values would indicate
inconsistency and the minimum of these values would indicate the largest difference
in the wrong direction. The same following decreasing steps for S (in terms of
percentage differences) were noted and recorded: 0-4%, 5-9%, 10-19%, and 20% or
more. In summary, consistency analyses were ultimately conducted to determine
the degree of non-monotonicity for the sequential 2-interval method. Consistency
analyses did not need to be conducted for the simultaneous 4-interval forced-
consistency method as those assessments were forced to be monotonic or consistent.
Rank-Based Analysis
Matheson and Winkler (1976) present a rank-based scoring rule (RBSR) that
takes into account each of the probabilities assessed for a continuous distribution and
the position of those probabilities within that distribution. This scoring rule is
given as follows:
(5)
which, for four intervals, breaks down algebraically into:
. (6)
In Equation 6, r1 to r4 correspond to the probabilities given to each of four different
rank-ordered intervals, and j corresponds to the rank of the interval that obtains (i.e.
the interval in which the actual outcome occurs). While the QSR is a measure of
accuracy that takes into account the four different probabilities assessed for a
continuous distribution (with four intervals), the QSR does not take into account the
position or rank order of each. Thus, this RBSR was used to assess performance and
was used to directly compare the distributions based on the sequential 2-interval
method and simultaneous 4-interval forced-consistency method of assessment. For
the above formula and ranks based on four outcomes, the RBSR could range from –2
to +1, where the worst case for the RBSR is a value of –2 and the best case for the
RBSR is a value of +1. For this RBSR based on four intervals, chance or guessing
was equivalent to an RBSR of approximately -.29. Thus, an RBSR value of -.29
corresponds to no knowledge, or when the probabilities assessed for each of four ranked
intervals are .25 (r1 = r2 = r3 = r4 = .25).
Probability Decomposition
Three distinct probability decompositions were examined and compared in
terms of the percent of positive / negative values and accuracy (in terms of Brier
scores). Please again refer to Figures 3 and 4 for clarity of the sub-component
probabilities used for each of the decompositions. For the first two decompositions,
the probability assessed for the game winner (P1, or favorite to win the game) can be
decomposed in the following two ways:
1. Decomposition 1: P1 (favorite wins the game) = P4 (favorite wins by the
point spread) + P3 (favorite wins by less than the point spread), and
2. Decomposition 2: P1 (favorite wins the game) = P (favorite wins by 7
points or less) + P (favorite wins by 8 points or more). In the second decomposition,
the two sub-component probabilities were taken from questions (intervals c and d in
Figure 4) pertaining to the simultaneous 4-interval forced-consistency method of
assessment.
Next, the probability that the favorite wins by the point spread (P4) can be
decomposed in the following manner:
3. Decomposition 3: P4 (favorite wins by the point spread) = P1 (favorite
wins the game) - P3 (favorite wins by less than the point spread). Thus, like the first
decomposition, the third decomposition would use only questions from the
sequential 2-interval method used to estimate the continuous distribution.
For the probability decompositions, it was of interest to determine for which
decompositions participants were “incoherent” or violated rules of probability when
responding (e.g. over- or under- estimated sub-component probability) and also to
see for which decompositions overall performance was best in terms of accuracy.
For accuracy, Brier scores were computed for each of the decomposition components
and then analyzed to assess whether or not decomposition improved accuracy.
This will further be explained in the results section. Since adding or subtracting sub-
component probabilities for decomposition could lead to values greater than 100 or
less than 0, these aggregated scores were bounded at 0 and 100. For example, consider
decomposition 1. It is possible that a participant could have responded as 70% sure
the favorite would win by the point spread (P4) and responded as 50% sure the
favorite would win by less than the point spread (P3). Adding these two sub-
component probabilities together would give an overall value of 120%. When such
overestimations would occur, the scores were bounded at 100% prior to the Brier
score decomposition analysis. As mentioned earlier, and to reiterate, when
calculating Brier scores the percentage probability estimates given by respondents
were divided by 100 in order to put these scores into decimal form. Thus, the
aggregated probabilities from decomposition will essentially be bounded at 0 and 1.
This also will be explained further when the decomposition results are presented.
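The recompose-then-bound step for decomposition 1 can be sketched as follows (illustrative code; the function name is invented for this example, and inputs are assumed to be already divided by 100):

```python
def decomposed_prob(p4, p3):
    """Decomposition 1: recomposed P(favorite wins the game) = P4 + P3,
    with the sum clipped to [0, 1] before Brier scoring, mirroring the
    bounding at 0 and 100 described in the text."""
    return min(1.0, max(0.0, p4 + p3))
```

The worked example above (70% plus 50%) is clipped from 1.20 down to 1.0 before the Brier score for the recomposed probability is calculated.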
CHAPTER 4
RESULTS
Accuracy for Independent Events
Brier scores were calculated for each of the six target events (P1 to P6), or
those questions used to assess the continuous distribution based on the sequential 2-
interval method. Thus, each one of these interval questions was treated as an
independent event, and the outcome for each could be determined (i.e. the outcome
would be 1 if the favorite won the game by the point spread plus seven points and
this outcome actually occurred). Each one of the six kinds of point spread questions
was coded for whether or not the outcome related to that question specifically
occurred, and Brier scores were calculated. For example, if a respondent gave a
probability of 30% that the favorite would win by the point spread plus 7 points and
this outcome occurred, the following Brier score for that response was calculated: (1
- .3)² = .490. For predictions that the favorite would win the game (P1, or favorite
wins), the mean Brier score was .261 (SD = .261) and the average forecast for game
winners was .624 (SD = .250). For the second question (P2, or the favorite to win by
the point spread / 2), the mean Brier score was .330 (SD = .290) and the average
forecast was .615 (SD = .255). For predictions that the favorite would win by less
than the point spread (P3), the mean Brier score was .235 (SD = .229) and the average
forecast for this event was .414 (SD = .256). For predictions that the favorite would
win by the point spread (P4), the mean Brier score was .321 (SD = .260) and the
average forecast was .512 (SD = .257). For the assessment that the favorite would win by
the point spread plus seven points (P5), the mean Brier score was .235 (SD = .282)
and the mean forecast was .280 (SD = .232). Finally, for the assessment that the dog
would win the game by the point spread divided by two (P6), the mean Brier score
was .271 (SD = .287), and the mean forecast was .374 (SD = .270). The Brier score
accuracy results (with mean outcomes and 20% trimmed means) for these
independent target events are summarized in Table 5. It is important to note that all
of the Brier scores are a bit worse than chance (greater than .25) but were improved
for five of the six events using 20% trimmed means (and especially for P5, the
question asking about the favorite to win by the point spread plus seven points,
which improved from .235 to .180).
Prior to data analysis, the six distributions (for P1 to P6) were analyzed using
kernel density estimators to get an idea about the shape of each of the distributions.
The KDE plots for each of the six target events (from the sequential 2-interval
method) are provided in Figure 6. The KDE plots are plots of the Brier scores (x-
axis) by the estimated density (y-axis). The KDE plots here show that all of the
variables are extremely positively skewed with more severe skewness occurring for
Brier scores for the favorite to win by the point spread plus seven points (P5) and for
the dog to win by the point spread divided by two (P6).
skewness (and non-normality) apparent in these data, a robust method for dependent
groups was used that employed 20% trimmed means (the maximum recommended
amount of trimming).
Table 5. Results for the Accuracy of Independent Events (With Standard Deviations
in Parentheses)

Target Event                               Mean Brier   20% Trimmed   Mean         Mean
                                           Score        Mean          Prediction   Outcome
1) Favorite wins                           .261         .239          .624         .540
   (N = 394; 31 Missing)                   (.261)       (.220)        (.250)       (.498)
2) Favorite wins by point spread / 2       .330         .347          .615         .485
   or greater (N = 377; 48 Missing)        (.290)       (.255)        (.255)       (.500)
3) Favorite wins by less than the          .235         .226          .414         .113
   point spread (N = 389; 36 Missing)      (.229)       (.176)        (.256)       (.317)
4) Favorite wins by point spread or        .321         .316          .512         .409
   greater (N = 396; 29 Missing)           (.260)       (.205)        (.257)       (.492)
5) Favorite wins by the point spread       .235         .180          .280         .260
   plus 7 points or greater                (.282)       (.212)        (.232)       (.439)
   (N = 400; 25 Missing)
6) Dog wins by the point spread / 2        .271         .267          .374         .443
   or greater (N = 377; 48 Missing)        (.287)       (.253)        (.270)       (.497)
Table 6. Results for the Robust Dependent Groups Analysis of Brier Scores
Event Comparison    Test Value    P-value    P-critical    se
1 vs. 2* -4.061 < .001 .004 .016
1 vs. 3 0.486 .062 .016 .023
1 vs. 4* -4.773 < .001 .003 .016
1 vs. 5 2.624 .009 .007 .022
1 vs. 6 0.120 .904 .050 .017
2 vs. 3* 3.057 .002 .006 .026
2 vs. 4 -0.557 .577 .012 .017
2 vs. 5* 4.901 < .001 .003 .026
2 vs. 6* 3.103 .002 .005 .022
3 vs. 4* -4.091 < .001 .004 .021
3 vs. 5 2.441 .015 .008 .019
3 vs. 6 -0.392 .069 .025 .023
4 vs. 5* 7.178 < .001 .003 .019
4 vs. 6* 3.857 < .001 .005 .020
5 vs. 6 -2.427 .001 .010 .023
Note: * = significant difference.
Event 1 = Favorite wins
Event 2 = Favorite wins by the point spread / 2 or greater
Event 3 = Favorite wins by less than the point spread
Event 4 = Favorite wins by the point spread or greater
Event 5 = Favorite wins by the point spread plus 7 points or greater
Event 6 = Dog wins by the point spread / 2 or greater
Figure 6. Kernel Density Estimates of the Distribution of Scores for each Variable
A robust dependent groups analysis using 20% trimmed means was
conducted on the Brier score data to determine if accuracy differed significantly for
any of the six target events assessed. When performing this type of robust analysis,
no “omnibus test” is done (overall F test), but rather the approach is simply to assess
all group differences (Wilcox, 2005). The familywise error rate (FWE) is “the
probability of making at least one Type I error when performing multiple tests”
(Wilcox, 2010). For this reason a 20% trimmed mean method was chosen that
controls for the FWE rate when conducting multiple dependent tests, based on a
sequentially rejective method developed by Rom (Wilcox, 2010). The main idea is
that this approach has more power than the Bonferroni FWE rate approach that is
traditionally done when examining multiple comparisons. Listwise deletion was
used, thus if any of the six measures were missing across questions, those games
were deleted from the analysis. With six repeated factors, a total of 15 comparisons
between each of the measures were conducted to see which target events had
significant differences in Brier scores. Eight of the fifteen comparisons were
significant and the results for the robust dependent groups analysis of Brier scores are
provided in Table 6. Significant differences were found for trimmed mean Brier
scores for the following: 1) between predictions for the favorite to win the game (P1)
and for the favorite to win by the point spread divided by two (P2), test value =
-4.061, p < .001, 2) between predictions for the favorite to win the game (P1) and for
the favorite to win by the point spread (P4), test value = -4.773, p < .001, 3) between
the favorite to win by the point spread divided by two (P2) and for the favorite to win
by less than the point spread (P3), test value = 3.057, p = .002, 4) between the
favorite to win by the point spread divided by two (P2) and for the favorite to win by
the point spread plus seven points (P5), test value = 4.901, p < .001, 5) between the
favorite to win by the point spread divided by two (P2) and for the dog to win by the
point spread divided by two (P6), test value = 3.103, p = .002, 6) between the
favorite to win by less than the point spread (P3) and for the favorite to win by the
point spread (P4), test value = -4.091, p < .001, 7) between the favorite to win by the
point spread (P4) and for the favorite to win by the point spread plus seven points
(P5), test value = 7.178, p < .001, and 8) between the favorite to win by the point
spread (P4) and for the dog to win by the point spread divided by two (P6), test value
= 3.857, p < .001. None of the other comparisons were significant. Again, these
results are summarized in Table 6 with adjusted p-values and significant differences
flagged.
Calibration for Independent Events
Calibration for each of the six target events (P1 to P6) was assessed using
reliability diagrams or calibration graphs. Again, these are plots of probability
judgments against the proportion correct, and good calibration is evident by values
falling on or near a diagonal line, particularly with more extreme predictions (closer
to 0 or 1) falling on the diagonal. Calibration graphs for the prediction that the
favorite would win the game (P1) and that the favorite would win by the point spread
divided by two (P2) are provided in Figure 7.
Figure 7. Calibration Graphs for Prediction that Favorite Wins and Favorite Wins by
the Point Spread / 2
Calibration graphs for the prediction that the favorite would win by less than
the point spread (P3) and that the favorite would win by the point spread (P4) are
provided in Figure 8. Calibration graphs for the prediction that the favorite would
win by the point spread plus seven points (P5) and for the prediction that the dog
would win by the point spread divided by two (P6) are provided in Figure 9.
Analyses of Figures 7 through 9 indicate that overall, calibration was very poor since
values did not fall near or on a diagonal line. Arguably, calibration was better for
predictions that the favorite would win the game (P1, shown in Figure 7) and for
predictions that the dog would win by the point spread divided by two (P6, shown in
Figure 9). Overall, for all of the six events, there was a great deal of scatter around
the diagonal line, indicating that participants were not very calibrated in their
assessments.
Figure 8. Calibration Graphs for Prediction that Favorite Wins by Less than the
Point Spread and Favorite Wins by Point Spread
Figure 9. Calibration Graphs for Prediction that Favorite Wins by Point Spread + 7
and that the Dog Wins by the Point Spread / 2
Consistency Analysis
Consistency analyses were conducted for assessments of the sequential 2-
interval method. Specifically, it was of interest to explore this elicitation method
since the sequential 2-interval method was not forced to be consistent and these
values should be consistently increasing. As previously mentioned in the analysis
section of the methods, P6 should be less than (1 - P1), which should be less than (1 -
P2), which should be less than (1 - P4), which should be less than (1 - P5). Again,
please refer to Figure 5 for a graph of the logic behind the consistency analysis.
Consistency was assessed by analyzing the degree to which the percentage value of
participant responses decreased when moving or stepping from left to right in the
distribution as described above (for the five points mentioned). The following
percentage interval values (in terms of S) were assessed for decreasing responses: 0-
4%, 5-9%, 10-19%, and 20% or more. For the five points, 31.4% of responses (109 /
347) had only a 0-4% non-increasing probability, 15.9% of responses (55 / 347) had
at least a 5-9% non-increasing probability, 18.7% of responses (65 / 347) had at least
a 10-19% non-increasing probability, and 30.3% of responses (105 / 347) had a non-
increasing probability of 20% or more. Also, 1.2% of responses (4 / 347) had
responses all equal to 50, and 2.6% of responses (9 / 347) had all 0 or 100 responses.
These results for five points are summarized in the top portion of Table 7.
Consistency analyses were also conducted only looking at the following three
points: P1, P4, and P5. These three points were examined because three points would
be the minimum condition necessary to estimate a continuous distribution based on
the sequential 2-interval method. For example, it is possible that a participant
believed the underdog would win the game (P6) and putting a high probability on this
assessment would undermine the analysis based on the five points above, since it
would inevitably lead to inconsistency for the assessments made along the
distribution. It is also possible that using three points that are more spread out would
lead to more consistent results. For the second consistency analysis, (1 - P1), based on
the assessment of the favorite to win the game, should be less than (1 - P4), which should
be less than (1 - P5). When considering only these three points in terms of S, 78.4% of
responses (294 / 375) only had a 0-4% non-increasing probability, 4.0% of responses
(15 / 375) had at least a 5-9% non-increasing probability, 6.9% of responses (26 /
375) had at least a 10-19% non-increasing probability, and 6.9% of responses (26 /
375) had a non-increasing probability of 20% or more. Also, 1.1% of responses (4 /
375) had responses all equal to 50, and 2.7% of responses (10 / 375) had all 0 or 100
responses. These results for three points are summarized in the bottom portion of
Table 7.
Table 7. Results of Consistency Analysis
Frequency Percent
Using 5 Points
(Valid N = 347; 78 Missing)
Minimum 0 to 4% 109 31.4
Minimum 5 to 9% 55 15.9
Minimum 10 to 19% 65 18.7
Minimum 20+% 105 30.3
All 50’s 4 1.2
All 0’s or 100’s 9 2.6
Using 3 Points
(Valid N = 375; 50 Missing)
Minimum 0 to 4% 294 78.4
Minimum 5 to 9% 15 4.0
Minimum 10 to 19% 26 6.9
Minimum 20+% 26 6.9
All 50’s 4 1.1
All 0’s or 100’s 10 2.7
Figure 10. Comparison of Continuous Distributions: Sequential 2-Interval versus
Simultaneous 4-Interval Forced-Consistency Method
To summarize, one of the main goals of this research was to directly compare
the two estimated distributions, one based on the sequential 2-interval method, and
one based on the simultaneous 4-interval forced-consistency method. It was
necessary to determine a reasonable way to set up equivalent 4-interval distributions
so that comparisons could be made between the two methods (sequential 2-interval
versus simultaneous 4-interval forced-consistency). The points used to estimate the
range of values for the sequential 2-interval method were determined by choosing
estimates from questions that were deemed to be most direct. Please refer to the top
portion of Figure 10, which shows how the distribution for the sequential 2-interval
method was estimated. For interval (1), the probability that the dog would win the
game is given by (1 - P1), or 1 minus the probability that the favorite would win. In
order to estimate interval (2), this question was directly assessed by asking whether the
favorite would win by less than the point spread (P3). For interval (4), this question
was also directly assessed, for the favorite winning by the point spread plus seven
points (P5). Because intervals 1, 2, and 4 were directly assessed, interval (3) was
estimated by taking the total probability under the curve (1.00) minus the other three
interval assessments. Thus, interval (3) was equal to (1 –(interval 1 + interval 2 +
interval 4)). By estimating the distribution for the sequential 2-interval method in
this manner, it was possible to obtain probability estimates for the outcomes across
these four intervals. In order to ensure that the total probability for all four ranges
summed to 1, the four probabilities were first added together, and each was then divided
by the sum so that the total probability (for intervals 1 to 4) equaled 1. Thus, it was necessary to
normalize the distribution in this way since it was possible that the four ranges could
sum to a probability value greater than 1. Figure 10 displays how the sequential 2-
interval method was estimated (top portion) and how the simultaneous 4-interval
forced-consistency method was directly estimated (bottom portion of Figure). These
two configurations were used for the quadratic analysis of the distributions, and for
the rank-based analysis of the distributions.
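The construction just described can be sketched as follows (illustrative code, not from the dissertation; flooring the residual interval at 0 when assessments are inconsistent is an assumption made here, since without it the four intervals would always sum to exactly 1 and the normalization step would have nothing to do):

```python
def four_intervals(p1, p3, p5):
    """Build the 4-interval distribution for the sequential 2-interval
    method and renormalize it to sum to 1."""
    i1 = 1 - p1                           # (1) dog wins the game
    i2 = p3                               # (2) favorite wins by less than the spread
    i4 = p5                               # (4) favorite wins by spread + 7 or more
    i3 = max(0.0, 1 - (i1 + i2 + i4))     # (3) remainder, floored at 0 (assumption)
    total = i1 + i2 + i3 + i4
    return [i / total for i in (i1, i2, i3, i4)]
```

For a consistent assessor (e.g., p1 = .6, p3 = .2, p5 = .3) the remainder is simply .1 and no rescaling occurs; for an inconsistent assessor whose three direct assessments already exceed 1, the division by the total performs the normalization described above.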
Quadratic Analysis for Continuous Distributions
After the distribution was formulated for the sequential 2-interval method, the
QSR rule was applied to the distributions for the two different methods (sequential
versus simultaneous). For the sequential 2-interval method, the mean QSR obtained
was .167 (SD = .461) and for the simultaneous 4-interval forced-consistency method,
the mean QSR obtained was .116 (SD = .450). Neither of these QSR scores is
demonstrative of good accuracy, since QSR scores approaching 1 are indicative of
good performance (and approaching –1 is poor performance). Further, both of these
QSR scores are closer to the “ignorance” or chance level (which for the QSR is .25
just as it is with the BS). However, performance as indicated here was a bit better
for the sequential 2-interval method (M = .167) versus the simultaneous 4-interval
forced-consistency method (M = .116) so these two sets of scores based on different
methods were tested for a statistically significant difference. A robust dependent
method proposed by Yuen (Wilcox, 2003) was applied to this data using 20%
trimming. The Yuen test is a robust t-test method for dependent groups based on
difference scores. The results revealed that there was no significant difference
between QSR scores for the two methods, t(214) = -0.102, p = .918.
Rank-Based Analysis for Continuous Distributions
The RBSR rule was also applied to the distributions for the two different
methods. For the sequential 2-interval method, the mean RBSR obtained was .134
(SD = .438) and for the simultaneous 4-interval forced-consistency method, the
mean RBSR obtained was .074 (SD = .608). Neither of these RBSR scores is
demonstrative of good accuracy, since RBSR scores approaching 1 are indicative of
good performance (and approaching –2 is poor performance). However, both of
these RBSR scores are better than the chance level (which for the RBSR is -.29).
Also, performance as indicated here was a bit better for the sequential 2-interval
method (M = .134) versus the simultaneous 4-interval forced-consistency method (M
= .074) so these two sets of scores based on different methods were tested for a
statistically significant difference. Yuen’s t-test method for dependent groups
(Wilcox, 2003) was also applied to this data using 20% trimming. The results
revealed that there was no significant difference between RBSR scores for the two
methods, t(215) = 1.025, p = .306.
Probability Decomposition
To again summarize, the three following probability decompositions were
evaluated in terms of Brier Score improvement and accuracy:
1. Decomposition 1: P1 (favorite wins the game) = P4 (favorite wins by the
point spread) + P3 (favorite wins by less than the point spread),
2. Decomposition 2: P! (favorite wins the game) = P (favorite wins by 7
points or less) + P (favorite wins by 8 points or more). In this second decomposition,
the sub-component probabilities were taken from the simultaneous 4-interval forced-
consistency method of assessment, and,
3. Decomposition 3: P% (favorite wins by the point spread) = P! (favorite
wins the game) - P$ (favorite wins by less than the point spread).
Probability Decomposition 1
First, difference scores were created to look at the distributions of signed
differences in order to determine the degree to which assessed probability was over-
and under- estimated when considering decomposition sub-components. In the case
of the first decomposition, the two sub-component probabilities (P% or the probability
that the favorite wins by the point spread) and (P$ or the probability the favorite wins
by less than the spread) were added together, and the probability given for the
favorite to win the game (P!) was then subtracted from this sum. In other words, it
was of interest to assess how close the decomposition came to additivity. The signed
differences (M = -30.74, SD = 32.87) indicated an overestimation, or sub-additive
probability estimation (where the probability of the union is less than the sum of the
individually mutually exclusive outcomes). Additionally, for signed differences,
81.5% of scores were negatively signed and 18.5% of scores were positively signed.
The distribution of signed scores (before setting endpoints of 0 and 100) for this
decomposition is provided in Figure 11.
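The signed-difference computation for a two-part decomposition can be sketched as below. This is a minimal sketch assuming each respondent supplies both sub-component probabilities and the global judgment on the study's 0-100 scale; the function names are illustrative.

```python
def additivity_gaps(sub_a, sub_b, global_p):
    """Per-respondent signed difference for a two-part decomposition,
    on a 0-100 probability scale: (P_a + P_b) - P_global.
    Exact additivity gives a gap of 0 for every respondent."""
    return [a + b - g for a, b, g in zip(sub_a, sub_b, global_p)]

def summarize_gaps(gaps):
    """Mean gap plus the shares of negatively and positively signed gaps."""
    n = len(gaps)
    mean = sum(gaps) / n
    pct_neg = 100.0 * sum(1 for d in gaps if d < 0) / n
    pct_pos = 100.0 * sum(1 for d in gaps if d > 0) / n
    return mean, pct_neg, pct_pos
```

For example, two respondents whose sub-components sum to 10 points above and 10 points below their global judgment yield a mean gap of 0 with half the gaps negatively signed.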
Figure 11. The Distribution of Signed Differences for the Decomposition that the P
(favorite wins the game) Decomposed into the P (favorite wins by point spread) +
the P (favorite wins by less than the point spread)
Probability Decomposition 2
Next, the second decomposition was considered: P! (favorite wins the game)
= P (favorite wins by 7 points or less) + P (favorite wins by 8 points or more).
Again, difference scores were created to look at the distribution of signed differences
in order to determine the degree to which probability was over- and under- estimated
when considering sub-components. In the case of this decomposition, the two sub-
component probabilities taken from the simultaneous 4-interval forced-consistency
method (favorite wins by 7 points or less and favorite wins by 8 points or more) were
added together, and the probability given for the favorite to win the game (P!, from
the method of direct assessment) was then subtracted from this sum. The signed
differences (M = -1.63, SD = 20.86) indicated a slight overestimation of probability. Additionally, for
signed differences, 40.7% of scores were negatively signed and 59.3% of scores
were positively signed. The signed distribution (before setting endpoints of 0 and
100) for this decomposition is provided in Figure 12.
Figure 12. The Distribution of Signed Differences for the Decomposition that the P
(favorite wins the game) Decomposed into the P (favorite wins by 7 points or less) +
the P (favorite wins by 8 points or more)
Probability Decomposition 3
Third, the last decomposition was considered: P% (favorite wins by the point
spread) = P! (favorite wins the game) - P$ (favorite wins by less than the point
spread). As with the previous two decompositions, difference scores were created to
look at the distribution of signed differences in order to determine the degree to
which probability was over- and under- estimated when considering decomposition
sub-components. In the case of this decomposition, the two sub-component
probabilities (P!, or the probability that the favorite wins the game, and P$, or the
probability that the favorite wins by less than the point spread) were subtracted from
one another (P! - P$), and the probability given for the favorite to win the game by
the point spread (P%) was then subtracted from this difference. The signed
differences (M = -30.74, SD = 32.87) indicated inaccuracy in the estimation of sub-
component probability. Additionally, for signed differences, 14.7% of scores were
negatively signed and 85.3% of scores were positively signed. The signed
distribution (before setting endpoints of 0 and 100) for this decomposition is
provided in Figure 13.
Figure 13. The Distribution of Signed Differences for the Decomposition that the P
(favorite wins by the point spread) Decomposed into the P (favorite wins the game) -
the P (favorite wins by less than the point spread)
Brier Score Decomposition
The second part of decomposition analysis consisted of comparing Brier
scores for the global target event versus the Brier scores based on decomposition
(decomposed Brier scores based on each sub-component). For the decompositions,
there are three global target events. For decompositions 1 and 2, the global event is
the same: the probability assessed that the favorite would win the game (P!). The mean
Brier score for this event was .261 (SD = .261). The third global target event (for
decomposition 3) was the probability assessed that the favorite would win by the
point spread (P%). The mean Brier score for this event was .321 (SD = .260). These
Brier scores were compared against Brier scores obtained using decomposition,
which were calculated for each sub-component separately.
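The Brier scores compared here follow the standard mean-squared-error form for binary events. Below is a minimal sketch assuming probabilities rescaled to [0, 1] and outcomes coded 0/1; a constant forecast of 0.5 yields the .250 chance value referenced in the discussion.

```python
def brier_score(probs, outcomes):
    """Mean squared error of probability forecasts against binary outcomes
    (probs in [0, 1]; outcomes coded 0 or 1). Lower is better: a perfect
    forecaster scores 0, while always answering 0.5 scores 0.25 regardless
    of the outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)
```

For instance, forecasting 0.5 on two games (one won, one lost by the favorite) scores exactly 0.25, while perfectly confident correct forecasts score 0.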
For the first decomposition, breaking down the probability that the favorite
would win the game (P!) into the probability that the favorite would win by the point
spread or more (P%) and the probability that the favorite would win by less than point
spread (P$), the global Brier score was again .261 versus .381 (SD = .426) for the
decomposition sub-components. Yuen’s t-test method for dependent groups
(Wilcox, 2003) was applied to this data using 20% trimming. The results showed
that there was a significant difference between the two Brier scores (global vs.
decomposition), t (221) = -3.44, p < .001. For the second decomposition, breaking
down the probability that the favorite would win the game (P!) into the
probability that the favorite would win by 7 points or less and the probability that the
favorite would win by 8 points or more (taken from the simultaneous 4-interval
forced-consistency method), the global Brier score was again .261 versus .280 (SD =
.284) for the decomposition sub-components. Using Yuen’s robust method, the
results showed that there was not a significant difference between these two Brier
scores (global vs. decomposition), t (228) = -0.480, p = .631. For the last
decomposition, breaking down the probability that the favorite would win by point
spread (P%) into the probability that the favorite would win the game (P!) minus the
probability that the favorite would win by less than the point spread (P$), the global
Brier score was .321 versus .343 (SD = .371) for the decomposition sub-components.
Again using Yuen’s method, the results showed that there was not a significant
difference between these two Brier scores (global vs. decomposition), t (219) =
0.852, p = .395. For all three decompositions, decomposing the probability into sub-
components led to poorer accuracy, particularly for the first decomposition since this
comparison was statistically significant. These results are summarized in Table 8.
Table 8. Results for Accuracy for Probability Decompositions

Decomposition                           Global        Decomposed    Difference of
                                        Brier Score   Brier Score   Brier Scores
Decomposition 1 (N = 368; 57 Missing)   .261          .381*         -.120
Decomposition 2 (N = 381; 44 Missing)   .261          .280          -.019
Decomposition 3 (N = 366; 59 Missing)   .321          .343          -.022

Note: * = significant difference.
CHAPTER 5
DISCUSSION
To summarize again, the current research had three main goals: 1) to examine
accuracy for various predictions of NFL and NCAA football game outcomes, 2) to
compare direct assessment and interval methods of elicitation for SPD’s, and 3) to
examine probability decomposition using stimuli from a real-world domain. The
three main hypotheses of this research were as follows: 1) it was hypothesized that
participants would demonstrate the most accuracy in prediction for questions
regarding game winners and for questions about winners against the established
point spread, 2) it was hypothesized that participants would be more accurate for
SPD’s based on a simultaneous 4-interval forced-consistency method versus a
sequential 2-interval method, and 3) it was hypothesized that probability
decomposition would lead to more accurate estimates versus global assessments.
First, the hypothesis that respondents would demonstrate the most accuracy
in predictions for questions regarding game winners (assessment P!) and for
questions about the established point spread (assessment P%) was not supported. It
was hypothesized that respondents would be most accurate when predicting these
events because these are the two most common wagering or game outcome
assessments that are usually made for sporting events (Tassoni, 1996). Contrary to
prediction, Brier scores for favorites to win by less than the point spread (P$; M =
.235, SD = .229) and for favorites to win by the point spread plus seven points or
more (P&; M = .235, SD = .282) were the most accurate, partly because the
predictions for these two events were lower in terms of assessed probability than for
the other point spread assessments (these results are again shown in Table 5). The
least accurate predictions were for assessments of the favorite to win by the point
spread (P%; M = .321, SD = .260) and for assessments of the favorite to win by the
point spread divided by two (P#; M = .330, SD = .290). In general, none of the Brier
scores obtained for the six target events were particularly good since all were greater
than .250, or the value expected simply by chance. Further, the results of the robust
dependent groups comparisons did reveal significant differences in Brier scores for
the six events (8 of the 15 comparisons assessed), so in essence, the kinds of
questions that were asked led to different results in terms of the accuracy of
prediction. However, because performance was less than desirable overall, this does
raise the question of the efficacy of obtaining probability estimates of this type (for
varying point spreads based on direct assessment) using such stimuli in an online
environment. Thus, the method of elicitation is potentially of concern.
Next, it was hypothesized that participants would be more accurate for
estimating SPD’s based on a simultaneous 4-interval forced consistency method
versus a sequential 2-interval method. This was hypothesized since the simultaneous
4-interval method displayed the full range of choice to the respondent all at once.
The intervals that were compared are shown in Figure 10 and these methods were
compared both for QSR and RBSR analyses. Again, this hypothesis was not supported.
Contrary to what was hypothesized, accuracy scores for both the QSR and RBSR
were better for the sequential 2-interval method. The mean QSR obtained for the
sequential 2-interval method was .167 (SD = .461) and for the simultaneous 4-
interval forced-consistency method, the mean QSR obtained was .116 (SD = .450).
For the RBSR, the mean RBSR obtained for the sequential method was .134 (SD =
.438) and for the simultaneous method, the mean RBSR obtained was .074 (SD =
.608). For both of these measures, scores closer to 1 are indicative of better
accuracy. Although scores were higher for the sequential method versus the
simultaneous method, these differences were not significant for either the QSR or the
RBSR. Additionally, for both methods the results obtained were only slightly better
than chance. Both of the estimated continuous distribution analyses revealed that
participants were not very accurate in their assessments. It is believed that
participants had a difficult time understanding, interpreting, and responding to the
questions of this type through an online survey, and this belief was supported by
the analysis of consistency that was conducted.
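For reference, one standard form of the quadratic scoring rule (QSR) for a single four-interval forecast is sketched below; the dissertation's exact implementation, and any rescaling of responses, may differ from this illustration.

```python
def quadratic_score(probs, true_index):
    """Quadratic scoring rule for one categorical forecast over K intervals:
    Q = 2*p_k - sum_j p_j**2, where k is the interval that actually occurred.
    Q = 1 for full confidence in the realized interval and Q = -1 for full
    confidence in a wrong one, so scores near 1 indicate good accuracy."""
    return 2 * probs[true_index] - sum(p * p for p in probs)
```

Under this form, a uniform forecast of .25 on each of four intervals scores 0.25 no matter which interval occurs, which is why scores only "slightly better than chance" sit well below 1.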
Finally, it was hypothesized that probability decomposition would lead to
more accurate estimates versus global assessments. Again, contrary to what was
hypothesized, the three probability decompositions led to a reduction in accuracy, in
terms of Brier scores, as compared to global assessments. In other words,
participants had a tendency to over-inflate sub-component probabilities and thus
Brier scores for the sub-components were less accurate as compared to the global
estimates. These results are consistent with previous research, which has
demonstrated that respondents tend to be sub-additive when assessing unpacked
probability components. These results are also consistent with other prior research
that has shown that decomposition does not lead to improved accuracy. These
findings will be further evaluated later in the discussion section.
In addition to assessing the accuracy of predictions based on Brier scores,
calibration was also assessed for the six events based on sequential 2-interval
method. The calibration graphs (reliability diagrams) for the six assessments are
shown in Figures 7 to 9. These graphs revealed that participants were not well
calibrated in their assessments, since the plots of predictions against the proportion
correct did not fall near the diagonal for any of the six conditions. Hence, both the
accuracy and calibration results showed that assessments were very poor overall.
Further, participants had a tendency to be very inconsistent in their
probability assessments of varying point spreads for the sequential 2-interval
method. When considering the following five points of a distribution, the dog
winning the game by the point spread divided by two (P"), the favorite to win (P!),
the favorite to win by the point spread divided by two (P#), the favorite to win by the
point spread (P%), and the favorite to win by the point spread plus seven points (P&),
only 31.4% of respondents had 0-4% non-increasing responses. Thus, the
majority of respondents had a tendency to provide very inconsistent probability
estimates when considering the above points moving from left to right in the
distribution. Also, when considering only three points in the distribution (P!, P%, and
P&) 78.4% of respondents had 0-4% non-increasing responses. These results suggest
that when estimating continuous subjective probability distributions, the dispersion
of the interval is very important for obtaining consistent responses. It is likely that
predictions for the dog to win a particular game by the point spread divided by two
(P") and for the favorite to win the game by the point spread divided by two (P#)
were more difficult assessments for subjects to make. Additionally, participants may
have truly believed that the underdog would win the game and thus put higher
probabilities on this assessment. In fact, for the games used, the underdog did win
the game by at least the point spread divided by two 44.3% of the time (see
Table 5). To reiterate, the consistency results suggest the need to very carefully
consider the kinds of questions that are asked when using an online survey. It is very
possible (and likely) that some of the assessment questions were misunderstood or
were difficult for subjects to understand, hence the inconsistency. Making
assessments of varying point spreads does require a certain degree of knowledge and
sensitivity to the nature of the subtle differences of the questions being asked. The
results of the consistency analyses demonstrate the importance of ascertaining how
subjects respond to questions. Cooke (1991) provided a summary of calibration
studies and noted evidence that showed training in probability elicitation led to
improvement in calibration. For the current research, it is likely that some sort of
training in the nature of assessing probability would have led to better results.
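The consistency screen described above can be sketched as a simple monotonicity check. This assumes the five assessments are ordered from the weakest event to the strictest on a 0-100 scale; it is an illustration of the idea rather than the exact tally used.

```python
def count_monotonicity_violations(assessments):
    """Count adjacent pairs where a stricter event (e.g., favorite wins by
    the point spread plus seven) was judged MORE probable than a weaker
    event listed before it. Coherent responses are non-increasing from the
    weakest event to the strictest, so every increase is a violation."""
    return sum(1 for a, b in zip(assessments, assessments[1:]) if b > a)
```

A respondent giving 80, 70, 60, 50, 40 across the five points is fully consistent (zero violations), whereas 80, 85, 60, 65, 40 contains two violations.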
Additionally, the results of the probability decompositions that were explored
showed that decomposition did not aid probability assessment. For the three
decompositions explored, respondents had a tendency to over- or under- estimate
probability when considering sub-components. For the first decomposition, the
favorite to win the game decomposed into the favorite winning by less than the point
spread and the favorite to win by the point spread or more, the signed
distribution (Figure 11) showed that participants had a tendency to overestimate the
joint probability sub-components (M = - 30.74, SD = 32.87). When considering
signed differences for this decomposition, 81.5% of responses were negatively
signed, indicating an overestimation of probability or sub-additive probability
assessment (where the probability of the union is less than the sum of the individually
mutually exclusive events). Further, the difference in Brier scores for this
decomposition revealed that decomposition led to significantly poorer assessment
(.261 versus .381).
For the second decomposition, the favorite to win the game decomposed into
the favorite winning by 7 points or less and the favorite winning by 8 points or more,
the signed distribution (Figure 12) showed that participants had a slight tendency to
overestimate the joint probability sub-components (M = - 1.63, SD = 20.86).
However, when considering signed differences for this decomposition, only 40.7%
of responses were negatively signed. Further, the difference in Brier scores for this
decomposition revealed that decomposition led to almost no difference in assessment
(.261 versus .280). Overall, these results suggest that there is something to be gained
when comparing assessments based on the sequential 2-interval distribution versus
those based on simultaneous 4-interval forced-consistency, since participants more
accurately combined sub-component probabilities and their accuracy for the
decomposition was similar to accuracy of the global assessment.
For the third decomposition, the favorite to win the game by the point spread
decomposed into the favorite winning the game minus the favorite to win by less
than the point spread, the signed distribution (Figure 13) showed that participants also
had a tendency to overestimate the joint probability sub-components for this
decomposition (M = -30.74, SD = 32.87). These values are the same as the first
decomposition since the individual components of the decomposition are the same.
However, when considering the difference in Brier scores for this decomposition,
decomposition led to slightly poorer assessment (.321 versus .343), though this
difference was not statistically significant.
Overall, none of the decompositions explored improved accuracy in terms of
breaking probability down into sub-component judgments. In a manner consistent
with the previous results obtained, it is likely that participants had a difficult time
making judgments about varying point spreads. The decomposition results do
suggest though that there may be some utility to obtaining predictions based on
simultaneously presented intervals.
There is little previously reported research that has examined both discrete
and continuous random variables and little research directly comparing alternative
methods of elicitation. The results of this research provide insight into strategies for
wagering on football games, but more importantly these results provide
recommendations for the elicitation of subjective probabilities in general, which has
great implications for decision analysis. There are significant methodological
implications with the results of this research. In constructing SPD’s for different
kinds of continuous random variables (e.g. sequential 2-interval versus simultaneous
4-interval forced-consistency), the researcher has many different elicitation methods
at their disposal (Lichtenstein et al., 1982; Ravinder et al., 1988). As previously
mentioned, substantial overconfidence has been found using interval estimates, and it
has been suggested that the degree of inaccuracy depends on how intervals are
elicited and can vary substantially with the width of the interval being
assessed (Soll & Klayman, 2004). Further, differences in
accuracy have been found for studies using two-choice formats versus interval
formats and it has been offered that more research needs to be conducted using
interval assessments (Soll & Klayman, 2004). The current research extends this
work by examining differences between traditional approaches (e.g. two-choice
discrete outcomes) and continuous subjective probability distributions, as well as
some problems inherent in the different approaches used to obtain such
distributions.
Moreover, consideration for future research should also be given to training
individuals in the process of the elicitation of subjective probability so as to avoid
the pitfalls associated with many of the judgmental heuristics or biases that affect
decision-making. Hora (2007) points out that even experts may not be familiar with
or have much experience with the process of probability elicitation. Thus, training
individuals in the probability elicitation process may have substantial benefits. Some
ideas offered include providing practice with forming probability judgments,
discussing subjective probability and how it may be used for analysis, providing
information about the elicitation questions, and informing individuals about biases
that may affect the process (Hora, 2007). In terms of the current research results, the
idea is offered that probability assessment may be a task best done with
in-person interaction, where subjects are carefully instructed about how to respond to
stimuli, versus obtaining assessments in an online environment.
Further, it would also be interesting for future research to vary and assess the
degree of instruction and information provided to participants to gain some insight
into how participants use this information to make better judgments. Most
participants in probability assessment studies receive some explanation of how to
make judgments and assess probabilities. However, it would be beneficial to see
what differences are found when participants are informed about potential biases
affecting the elicitation process. Overconfidence has already been discussed and is
one potential bias in probability assessment, but other common biases include
availability and representativeness. The availability bias is demonstrated when
people estimate probability or frequency by the ease with which occurrences readily
come to mind (Tversky & Kahneman, 1973). Representativeness refers to making
erroneous judgments based on perceived similarity of something to a particular
category (Tversky & Kahneman, 1974; Clemen, 1991; Tassoni, 1996). Both of these
biases have significant implications for wagering on football games. For example,
availability can affect judgment when individuals who are wagering consider only a
dominant performance by one team in the previous week, and hence the natural
tendency of regression to the mean is ignored (Lee & Smith, 2002). Representativeness can
affect the prediction of football game outcomes when individuals form impressions
of teams based on preseason performance or performance against lesser opponents
(Tassoni, 1996). Thus, if individuals are regularly making week-to-week
assessments, they may be prone to thinking that extreme outcomes are representative
of a particular team’s ability. Further, in terms of the uncertainty of wagering on
football games, de-biasing strategies could also include providing participants with
historical data (e.g. bowl game performance, rivalry history, a particular team's
performance against the spread, etc.) or how various factors such as injuries have an
impact on game outcomes. It is also possible that a desirability bias
occurred in the current study since there was generally an overestimation for the
probability of the favorite winning. It is possible that participants were biased in this
way since the selection of games was not random, but rather were important games
that could have potential emotional biases. Another interesting finding was that for
the games assessed, the favorite only won by the point spread about 40% of the time
(not the expected 50% of the time since point spreads are designed to keep bets
50/50), and so this affected the results obtained.
Finally, creating a prediction model for the outcome of football games and
sporting events in general is an area for future research to consider. This would
provide a more formal model of assessing this specific type of uncertainty. There are
many factors that could be incorporated into a prediction model such as whether a
team is playing at home versus away, whether it is a day game or night game, if the
game is a rivalry matchup (and matchup history), the performance versus previous
opponent, and so on. Using such a prediction model, historical data could be used to
predict the outcome of future events, and then comparisons made to determine the
effectiveness of the model. In general, previous research has shown that certain
wagering strategies for football games are effective. For example, Lee and Smith
(2002) showed that team performance regresses to the mean but those wagering do
not consider this as a factor when betting on the same team in the next week’s event.
Teams usually do not perform the same week to week and this depends on many
factors. Historically, teams that have a dominant performance in one week tend not
to have a similar performance the following week. Additionally, previous research
has shown that participants wagering on football games are overly optimistic about their
own skill and chances of future success (Gilovich, 1983). Specifically, individuals
have a tendency to accept wins at face value, but rationalize or explain away losses
to fluke events (Gilovich, 1983). The goal of providing recommendations for
wagering on sporting events such as football is something for future studies dealing
with the assessment of uncertainty to consider exploring.
The current research was intended to provide insight into methods and
elicitation techniques that will enhance the process of decision analysis. This
research has provided suggestions for how to improve the process by which
subjective probability elicitation is done. As mentioned previously, most of the
research in the area of subjective probability has been conducted using discrete
variables, and it is beneficial to examine the utility of assessing subjective probability
using continuous random variables. Further, in the study of probability assessment,
the concept of probability decomposition has been minimally studied and still needs
further assessment. Most researchers who study subjective probability believe in the
utility of decomposition, but past research has led to mixed results. The results of
this research further disconfirm the efficacy of decomposition and this is potentially
due to the nature of the stimuli used in this research, as well as the sample of subjects
making assessments.
REFERENCES
Alpert, M., & Raiffa, H. (1982). A progress report on the training of probability
assessors. In D. Kahneman, P. Slovic, & A. Tversky (Eds.), Judgment under
uncertainty: Heuristics and biases (pp. 294-305). Cambridge: Cambridge
Univ. Press.
Armstrong, J.S., Denniston, W. B., & Gordon, M. M. (1975). The use of the
decomposition principle in making judgments. Organizational Behavior and
Human Performance, 14, 257-263.
Ayton, P., & McClelland, A. (1997). How real is overconfidence? Journal of
Behavioral Decision Making, 10, 279-285.
Boulier, B. L., & Stekler, H. O. (2003). Predicting the outcomes of National
Football League games. International Journal of Forecasting, 19, 257-270.
Brenner, L. (2000). Should observed overconfidence be dismissed as a statistical
artifact? Critique of Erev, Wallsten, and Budescu (1994). Psychological
Review, 107 (4), 943-946.
Brenner, L. A., Koehler, D. J., Liberman, V., & Tversky, A. (1996). Overconfidence
in probability and frequency judgments: A critical examination.
Organizational Behavior and Human Decision Processes, 65 (3), 212-219.
Clemen, R. T. (1991). Making hard decisions: An introduction to decision analysis.
Boston, MA: PWS-Kent Publishing Co.
Clemen, R. T., Fischer, G. W., & Winkler, R. L. (2000). Assessing dependence:
Some experimental results. Management Science, 46 (8), 1100-1115.
Cooke, R. M. (1991). Experts in uncertainty. Oxford, U.K.: Oxford University
Press.
Dana Jr., J. D., & Knetter, M. M. (1994). Learning and efficiency in a gambling
market. Management Science, 40 (10), 1317-1328.
Dawes, R. M., & Mulford, M. (1996). The false consensus effect and
overconfidence: Flaws in judgment or flaws in how we study judgment?
Organizational Behavior and Human Decision Processes, 65 (3), 201-211.
Fischhoff, B., Slovic, P., & Lichtenstein, S. (1977). Knowing with certainty: The
appropriateness of extreme confidence. Journal of Experimental Psychology:
Human Perception and Performance, 3 (4), 552-564.
Gigerenzer, G. (1991). How to make cognitive illusions disappear: Beyond
“heuristics and biases.” In W. Stroebe & M. Hewstone (Eds.), European
review of social psychology, (Vol. 2, pp. 83-115). New York: Wiley.
Gigerenzer, G. (1993). The bounded rationality of probabilistic mental models. In
K. I. Manktelow & D. E. Over (Eds.), Rationality: Psychological and
philosophical perspectives. London: Routledge.
Gilovich, T. (1983). Biased evaluation and persistence in gambling. Journal of
Personality and Social Psychology, 44 (6), 1110-1126.
Griffin, D. W., & Varey, C. A. (1996). Towards a consensus of overconfidence.
Organizational Behavior and Human Decision Processes, 65 (3), 227-231.
Henrion, M., Fischer, G. W., & Mullin, T. (1993). Divide and conquer? Effects of
decomposition on the accuracy and calibration of subjective probability
distributions. Organizational Behavior and Human Decision Processes, 55,
207-227.
Hora, S. C. (2004). Probability judgments for continuous quantities: Linear
combinations and calibration. Management Science, 50 (5), 597-604.
Hora, S. C. (2007). Eliciting probabilities from experts. In W. Edwards, R. F. Miles
Jr. & D. von Winterfeldt (Eds.) Advances in decision analysis. (pp. 129-153).
New York: Cambridge University Press.
Hora, S. C., Hora, J. A., & Dodd, N. G. (1992). Assessment of probability
distributions for continuous random variables: A comparison of the bisection
and fixed value methods. Organizational Behavior and Human Decision
Processes, 51, 133-155.
Jensen, F. A., & Peterson, C. R. (1973). Psychological effects of proper scoring
rules. Organizational Behavior and Human Performance, 9, 307-317.
Juslin, P. (1994). The overconfidence phenomenon as a consequence of informal
experimenter-guided selection of almanac items. Organizational Behavior
and Human Decision Processes, 57, 226-246.
Juslin, P., Winman, A., & Olsson, H. (2000). Naïve empiricism and dogmatism in
confidence research: A critical examination of the hard-easy effect.
Psychological Review, 107 (2), 384-396.
Klayman, J., Soll, J. B., Gonzalez-Vallejo, C., & Barlas, S. (1999). Overconfidence:
It depends on how, what, and who you ask. Organizational Behavior and
Human Decision Processes, 79 (3), 216-247.
Lee, M. & Smith, G. (2002). Regression to the mean and football wagers. Journal
of Behavioral Decision Making, 15, 329-342.
Lenthe, J. van (1994). Scoring-rule feedforward and the elicitation of subjective
probability distributions. Organizational Behavior and Human Decision
Processes, 59, 188-209.
Liberman, V., & Tversky, A. (1993). On the evaluation of probability judgments:
Calibration, resolution, and monotonicity. Psychological Bulletin, 114 (1),
162-173.
Lichtenstein, S., Fischhoff, B., & Phillips, L. D. (1982). Calibration of probabilities:
The state of the art to 1980. In D. Kahneman, P. Slovic, & A. Tversky
(Eds.), Judgment under uncertainty: Heuristics and biases (pp. 306-334).
New York: Cambridge University Press.
Ludke, R. L., Strauss, F. F., & Gustafson, D. H. (1977). Comparison of five methods
of estimating subjective probability distributions. Organizational Behavior
and Human Performance, 19, 162-179.
MacGregor, D., Lichtenstein, S., & Slovic, P. (1988). Structuring knowledge
retrieval: An analysis of decomposed quantitative judgments.
Organizational Behavior and Human Decision Processes, 42, 303-323.
Matheson, J. E., & Winkler, R. L. (1976). Scoring rules for continuous probability
distributions. Management Science, 22, 1087-1096.
McKenzie, C. R. M., Liersch, M. J., & Yaniv, I. (2008). Overconfidence in interval
estimates: What does expertise buy you? Organizational Behavior and
Human Decision Processes, 107, 179-191.
Nelson, T. O., & Narens, L. (1980). Norms of 300 general-information questions:
Accuracy of recall, latency of recall, and feeling of knowing ratings. Journal
of Verbal Learning and Verbal Behavior, 19, 338-368.
Oskamp, S. (1965). Overconfidence in case-study judgments. Journal of Consulting
Psychology, 29, 261-265.
Raiffa, H. (1968). Decision analysis: Introductory lectures on choices under
uncertainty. Reading, MA: Addison-Wesley.
Ravinder, H. V., Kleinmuntz, D. N., & Dyer, J. S. (1988). The reliability of
subjective probabilities obtained through decomposition. Management
Science, 34 (2), 186-199.
Ronis, D. L., & Yates, J. F. (1987). Components of probability judgment accuracy:
Individual consistency and effects of subject matter and assessment method.
Organizational Behavior and Human Decision Processes, 40, 193-218.
Seaver, D. A., von Winterfeldt, D., & Edwards, W. (1978). Eliciting subjective
probability distributions on continuous variables. Organizational Behavior
and Human Performance, 21, 379-391.
Soll, J. B., & Klayman, J. (2004). Overconfidence in interval estimates. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 30 (2), 299-
314.
Song, C., Boulier, B. L., & Stekler, H. O. (2007). The comparative accuracy of
judgmental and model forecasts of American football games. International
Journal of Forecasting, 23, 405-413.
Tassoni, C. J. (1996). Representativeness in the market for bets on National Football
League games. Journal of Behavioral Decision Making, 9, 115-124.
Tryfos, P., Casey, S., Cook, S., Leger, G., & Pylypiak, B. (1984). The profitability
of wagering on NFL games. Management Science, 30 (1), 123-132.
Tsai, C. I., Klayman, J., & Hastie, R. (2008). Effects of amount of information on
judgment accuracy and confidence. Organizational Behavior and Human
Decision Processes, 107, 97-105.
Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging
frequency and probability. Cognitive Psychology, 5, 207-232.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and
biases. Science, 185, 1124-1131.
Vergin, R. C., & Scriabin, M. (1978). Winning strategies for wagering on National
Football League games. Management Science, 24 (8), 809-818.
Von Winterfeldt, D., & Edwards, W. (1986). Decision analysis and behavioral
research. New York: Cambridge University Press.
Wallsten, T. S. (1996). An analysis of judgment research analyses. Organizational
Behavior and Human Decision Processes, 65 (3), 220-226.
Wallsten, T. S., Erev, I., & Budescu, D. V. (2000). The importance of theory:
Response to Brenner (2000). Psychological Review, 107 (4), 947-949.
Wilcox, R. (2003). Applying contemporary statistical techniques. San Diego, CA:
Academic Press.
Wilcox, R. (2005). Introduction to robust estimation and hypothesis testing
(2nd ed.). Burlington, MA: Academic Press.
Wilcox, R. (2010). Multiple comparisons. Unpublished manuscript.
Yates, J. F. (1982). External correspondence: Decompositions of the mean
probability score. Organizational Behavior and Human Performance, 30,
132-156.
APPENDIX A
INFORMATION/FACTS SHEET FOR NON-MEDICAL RESEARCH
Probability Assessment: Decomposition Using both Discrete and Continuous
Random Variables
PURPOSE OF THE STUDY
Every day, people make judgments about uncertain events (e.g., "It is likely that the
U.S. economy will improve within the next year"). Assessments of uncertainty can
be made directly when people express confidence in their predictions as numerical
probabilities for specific outcomes of these events (e.g., "I am 80% confident the
Yankees will beat the Twins in the first round of the baseball playoffs"). The
purpose of the current research is to examine various predictions for sporting
events. By participating, you will learn how to make better predictions of various
sporting event outcomes.
PARTICIPANT INVOLVEMENT
The online survey will ask you to make several predictions about the outcomes of
various sporting events (football games) and to provide confidence ratings (or
probability estimates) for various outcomes (e.g., the home team wins a particular
game). Assessments will vary in the amount of information provided; the goal is to
see how specific information leads to more accurate predictions.
PAYMENT/COMPENSATION FOR PARTICIPATION
For taking part in the study, you will have a chance to win a reward of $250 cash
through a lottery drawing. Once you complete the study materials, you will be given
an entry into the drawing and this participation will be recorded through online
survey software. There will be a 1% chance of winning this cash reward (1 out of
100). Additionally, the person who makes the most accurate predictions will be
given an additional award of $200. You also have the option to participate without
entering the drawing or providing any personal information (e.g., email address).
The drawing will take place, and payment will be made by the principal investigator
via check, upon completion of the study in December and no later than December 31,
2009. After completion of the study, you will also be provided with a summary of
the results.
CONFIDENTIALITY
The data will be collected automatically online using Qualtrics survey software.
Only the primary researcher will have access to the data that are collected and all
responses will be kept confidential. After the completion of the study and the
drawing for the cash reward, any information about your identity will be deleted
from the data set. You may choose whether to provide an email address for entry
into the drawing for the cash reward. You may also withdraw from the study at any
time.
INVESTIGATOR CONTACT INFORMATION
Patrick Doyle, Email: pjd@usc.edu
Richard John, Email: richardj@usc.edu
IRB CONTACT INFORMATION
University Park IRB, Office of the Vice Provost for Research Advancement, Stonier
Hall, Room 224a, Los Angeles, CA 90089-1146, (213) 821-5272 or upirb@usc.edu
Did you read the above description of the study and do you agree to participate?
Please answer below.
• Yes
• No
APPENDIX B
SAMPLE SURVEY QUESTIONS
ABSTRACT
There has been a great deal of research conducted in the areas of probability assessment and calibration. However, relatively little research has been conducted using continuous random variables, where participants construct subjective probability distributions (SPDs), and even less has explored probability decomposition. The current research had three main goals: 1) to examine accuracy for various predictions of NFL and NCAA football game outcomes, 2) to compare different methods of elicitation for SPDs, and 3) to examine probability decomposition using stimuli from a real-world domain. It was hypothesized that participants would demonstrate the most accuracy in predicting game winners and winners against the point spread, since such events are the most familiar to those who regularly follow sports. It was also hypothesized that participants would be more accurate constructing SPDs with a simultaneous 4-interval forced-consistency method than with a sequential 2-interval method, since the 4-interval method presents all possible outcomes at once. Participants were recruited online through various sources and responded to online surveys regarding games in the 2009 NFL and NCAA football season. Results demonstrated that participants were not very accurate in assessing game winners or in picking winners against the point spread. Also, participants were more accurate constructing SPDs using the sequential method. Additionally, probability decomposition did not lead to more accurate assessments. These results are most likely due to the difficulties of conducting probability assessment via online survey. Implications, potential limitations, and recommendations for future research are discussed.
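The probability decomposition examined in this research rests on the law of total probability: a direct probability is recomposed from conditional assessments over a partition of scenarios. The sketch below illustrates that recomposition step only; the function name, the quarterback scenario, and all probability values are hypothetical and are not taken from the study.

```python
# Recomposing a target probability from decomposed assessments via the
# law of total probability: P(A) = sum_i P(A | B_i) * P(B_i),
# where the scenarios B_i partition the space of possibilities.

def recompose(conditionals, partition_probs):
    """Combine conditional assessments P(A|B_i) with scenario
    probabilities P(B_i) into a single direct probability P(A)."""
    if abs(sum(partition_probs) - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    return sum(c * p for c, p in zip(conditionals, partition_probs))

# Hypothetical example (not from the study): probability that the home
# team wins, decomposed on whether its starting quarterback plays.
p_win = recompose(conditionals=[0.70, 0.40], partition_probs=[0.80, 0.20])
# 0.70 * 0.80 + 0.40 * 0.20 = 0.64
```

The consistency check mirrors the requirement that the scenario probabilities exhaust the partition; without it, a judge's decomposed assessments could recompose to an incoherent probability.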