HUMAN AND MACHINE PROBABILISTIC ESTIMATION FOR DECISION ANALYSIS
by
Lucas J. Haravitch
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
INDUSTRIAL AND SYSTEMS ENGINEERING
August 2021
Copyright 2021 Lucas J. Haravitch
Acknowledgments
I feel extremely blessed to be afforded the opportunity to make such an investment in my own
education. I owe many thanks to those who have educated me along the way and those who will
continue to do so. I am particularly grateful for my advisor, Ali Abbas, for taking me under his
wing and showing me the ropes of Decision Analysis. Your guidance, patience, support, and
wisdom contributed immensely to this work. And to my research mates: Danny, Max, Ahmed, and
Ehsan; thank you for your conversations, camaraderie, and support.
Although I didn’t call them nearly as much as I should, conversations with my family (mostly
imagined) contributed greatly to this dissertation. Thank you Dad for insisting that there is always
a right way to do things. Thank you Mom for reminding me to do only one thing at a time.
And thank you Ben for encouraging me to look for another side to a problem, that everything is
“different when you see the bigger picture.”
Michael and Jonny- we were in this together. Whether it was homework at the kitchen table,
classes on Zoom, or visits from the Number Devil, we three boys really did a lot of good school
work here in California. Thank you for helping me keep things in perspective.
I know now why people say things like “without my wife, this just wouldn’t have been pos-
sible.” I know now because it is absolutely true: without your patience, love, support, help, and
sacrifices, I couldn’t have done this. Corollary: with your patience, love, etc., I think I could do
anything. Thank you Kari; you are my favorite.
I also owe a special thanks to all those who made it through the Tetris survey- thank you!
This research is based upon work supported in part by the Office of the Director of National In-
telligence (ODNI) Intelligence Advanced Research Projects Activity (IARPA), via IARPA-BAA-
16-02 and IARPA-BAA-17-08. The views and conclusions contained herein are those of the author
and should not be interpreted as necessarily representing the official policies, either expressed or
implied, of ODNI, IARPA or the U.S. Government.
Table of Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
1 Introduction, Motivation, and Literature Review . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction and Scope
(why this is an important problem to solve) . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Probabilistic Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Policy Decision Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Counterfactual Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.4 The Wisdom of the Crowd . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.5 The Cold Start Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.6 Hybridization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.1.7 Trade-offs Between Accuracy and Informativeness . . . . . . . . . . . . . 11
1.2 Motivation and Contributions
(how this dissertation contributes to the solution) . . . . . . . . . . . . . . . . . . . . . 14
1.3 Concepts and Literature Review
(how others have solved similar problems in the past) . . . . . . . . . . . . . . . . . . . 15
1.3.1 Eliciting Probabilistic Estimates from Human Judges . . . . . . . . . . . . 15
1.3.2 Eliciting Probabilistic Estimates from Data . . . . . . . . . . . . . . . . . 17
1.3.3 Calibrating Probabilistic Estimates . . . . . . . . . . . . . . . . . . . . . . 20
1.3.4 Combining Probabilistic Estimates . . . . . . . . . . . . . . . . . . . . . . 24
1.3.5 Graphical Models for Joint Probabilistic Estimates . . . . . . . . . . . . . 25
1.4 Overview of the Dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2 Eliciting and Using Machine- and Human- Generated Probabilistic Forecasts . . . . 30
2.1 Introduction to Data Sets Used in the Dissertation . . . . . . . . . . . . . . . . . . 31
2.1.1 Hybrid Forecasting Competition (HFC) . . . . . . . . . . . . . . . . . . . 31
2.1.2 Forecasting Counterfactuals in Uncontrolled Settings (FOCUS) . . . . . . . 32
2.1.3 The Tetris Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.1.4 Other Publicly Available Data Sets . . . . . . . . . . . . . . . . . . . . . . 38
2.2 Using Simulation to Select Arithmetic and Geometric Brownian Motion Model
Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3 Case Study: Predicting Commodity Prices with Brownian Motion Models and Hu-
man Forecasts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3.1 Comparing Brownian Motion Models to Novice Human Forecasts . . . . . 43
2.3.2 Comparing Brownian Motion Models to Expert Human Forecasts . . . . . 44
2.4 Contributions and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3 Calibrating and Combining Probabilistic Estimates . . . . . . . . . . . . . . . . . . 49
3.1 Calibrate-Aggregate-Recalibrate (C-A-R) Methodology . . . . . . . . . . . . . . . 50
3.2 A New Calibration Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3 Using the Generalized Aggregation Function . . . . . . . . . . . . . . . . . . . . . 62
3.4 The role of Recalibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.5 Case Study: C-A-R for a Forecasting Competition . . . . . . . . . . . . . . . . . . 68
3.6 Case Study: C-A-R for the Tetris Survey . . . . . . . . . . . . . . . . . . . . . . . 81
3.7 Contributions and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4 Graphical Models for Joint Probabilistic Estimates . . . . . . . . . . . . . . . . . . . 89
4.1 Hybrid Adaptive Relevance Networks for Estimating System States (HARNESS)
Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2 Case Study: Errors Induced by Approximate Methods of Dependence Elicitation . 94
4.3 Case Study: A Comparison of Methods for Joint Probability Assessment . . . . . . 96
4.4 Demonstration: HARNESS Web Application . . . . . . . . . . . . . . . . . . . . 99
4.5 Contributions and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5 Considerations and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.1 Fundamental Contributions of the Dissertation . . . . . . . . . . . . . . . . . . . . 107
5.2 Considerations for Using the CAR Methodology . . . . . . . . . . . . . . . . . . . 108
5.3 Considerations for Using the HARNESS Methodology . . . . . . . . . . . . . . . 109
5.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Appendix A Tetris Estimation Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
List of Tables
1.1 Summary of Contributions of this Research . . . . . . . . . . . . . . . . . . . . . 16
1.2 Summary of Machine Models Used in this Research . . . . . . . . . . . . . . . . . 19
2.1 Hybrid Forecasting Tournament Questions . . . . . . . . . . . . . . . . . . . . . . 33
2.2 Brownian motion model parameters . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3 Forecasts and Number of Experts outperformed by Machine Models on LBMA
annual gold survey (by percent error, USD price of gold) . . . . . . . . . . . . . . 45
2.4 Brent Crude Oil Monthly Forecast Results . . . . . . . . . . . . . . . . . . . . . . 46
3.1 Summary of Calibration Function Desiderata. . . . . . . . . . . . . . . . . . . . . 54
3.2 Summary of Calibration-Aggregation-Recalibration Schemes. . . . . . . . . . . . 69
3.3 Effectiveness of Uniform Recalibration Baseline based on 10-fold Cross-Validation 74
3.4 Effectiveness of Uniform Calibration Baseline based on 10-fold Cross-Validation . 75
3.5 Effectiveness of Crowd-based Recalibration Baseline based on External Validation 76
3.6 Effectiveness of Crowd-based Calibration Baseline based on External Validation . . 77
3.7 Effectiveness of Geometric Aggregation based on 10-fold Cross-Validation . . . . 78
4.1 Sample of data from Infection Simulation . . . . . . . . . . . . . . . . . . . . . . 97
4.2 Joint Distribution Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 102
List of Figures
1.1 Overview of the Fundamental Problem to be Solved by this Research . . . . . . . . 2
1.2 Causal Diagram of Relevant Uncertainties in Afghanistan . . . . . . . . . . . . . . 5
1.3 Diagram of Relevant Factors for Estimating COVID-19 Deaths . . . . . . . . . . . 6
1.4 Graphical Overview of Hybridization Options . . . . . . . . . . . . . . . . . . . . 11
1.5 Sample Brownian Motion Forecasts . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.6 Example Probability Weighting Functions. . . . . . . . . . . . . . . . . . . . . . . 24
2.1 The FOCUS Competition and Civilization V computer game. . . . . . . . . . . . . 35
2.2 Screenshot of the Tetris Survey. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3 Improvement in mean absolute percent error over Naïve model . . . . . . . . . . 41
2.4 Brier score improvement from linear average of novice human forecasters by fore-
cast horizon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1 The decision maker’s toolkit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2 The C-A-R methodology follows similar steps as making a stew . . . . . . . . . . 52
3.3 Normalized Calibration Function Demonstration. . . . . . . . . . . . . . . . . . . 57
3.4 Calibration Function Curvature with Informed Baseline (Not Normalized). . . . . . 58
3.5 Calibration Function Extremization with Informed Baseline (Not Normalized). . . 59
3.6 Comparing Calibration Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7 Aggregation Function Demonstration. . . . . . . . . . . . . . . . . . . . . . . . . 64
3.8 Aggregation Parameter Impact on Brier Score. . . . . . . . . . . . . . . . . . . . . 66
3.9 Accuracy Improvement due to Calibration and Recalibration Baselines for price of
gold and oil questions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.10 Accuracy Improvement on a Gold question with an accurate machine baseline. . . . 73
3.11 Aggregation Parameter Affects Accuracy and Calibration/ Recalibration Parameters. 79
3.12 Accuracy Improvement due to Calibration and Recalibration Parameters. . . . . . . 80
3.13 Calibration and Recalibration parameters increase over time; Geometric Aggrega-
tion remains effective. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.14 Restoring Conviction via Recalibration. . . . . . . . . . . . . . . . . . . . . . . . 82
3.15 Accuracy Improvement on Oil questions. . . . . . . . . . . . . . . . . . . . . . . . 83
3.16 Tetris Survey Questions and Answers . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.17 Impact of Aggregation Parameter on Accuracy (Tetris Data) . . . . . . . . . . . . . 85
3.18 Impacts of Calibration and Recalibration on Brier Score (Tetris Data) . . . . . . . . 86
3.19 Calibration and Recalibration parameters that minimize Brier Score based on Ag-
gregation Parameter (Tetris Data) . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.20 Restoring the Conviction of the Crowd through Recalibration (Tetris Data) . . . . . 88
4.1 Example relevance diagram and correlation matrices . . . . . . . . . . . . . . . . . 94
4.2 Error Induced by Correlation Approximation . . . . . . . . . . . . . . . . . . . . . 95
4.3 Histograms of infection simulation results . . . . . . . . . . . . . . . . . . . . . . 98
4.4 Agent-based infection simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.5 Agent-based infection simulation relevance diagram . . . . . . . . . . . . . . . . . 100
4.6 Infection data sampling schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.7 Screenshot of HARNESS web application: Forecast Variable Selection . . . . . . . 103
4.8 Screenshot of HARNESS web application: System Variable Selection . . . . . . . 104
4.9 Screenshot of HARNESS web application: Posterior Distribution . . . . . . . . . . 105
A.1 Tetris Survey Screenshot: Consent page. . . . . . . . . . . . . . . . . . . . . . . . 113
A.2 Tetris Survey Screenshot: Demographic data collection. . . . . . . . . . . . . . . . 114
A.3 Tetris Survey Screenshot: The first question asks about Blue square blocks with no
debris. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
A.4 Tetris Survey Screenshot: Respondents see the correct answer following their re-
sponse on the first question. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
A.5 Tetris Survey Screenshot: Blue square blocks with 5 rows of debris on the game
board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
A.6 Tetris Survey Screenshot: Blue square blocks with 10 rows of debris on the game
board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
A.7 Tetris Survey Screenshot: The next set of questions asks about Purple I-shaped
blocks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
A.8 Tetris Survey Screenshot: Purple I-shaped blocks with 5 rows of debris on the
game board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
A.9 Tetris Survey Screenshot: Purple I-shaped blocks with 10 rows of debris on the
game board. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
A.10 Tetris Survey Screenshot: Respondents may accept a final challenging question. . . 122
A.11 Tetris Survey Screenshot: Both Blue square blocks and Purple I-shaped blocks
with between 1 and 10 rows of debris on the game board. . . . . . . . . . . . . . . 123
A.12 Tetris Survey Screenshot: Respondents are asked whether they would like to see
additional information. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
A.13 Tetris Survey Screenshot: Respondents are shown additional data from similar sim-
ulations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
A.14 Tetris Survey Screenshot: Respondents are shown results from a mathematical
model based on similar assumptions. . . . . . . . . . . . . . . . . . . . . . . . . . 126
A.15 Tetris Survey Screenshot: Respondents are shown the correct answers and crowd
average responses at the completion of the survey (1 of 2). . . . . . . . . . . . . . 127
A.16 Tetris Survey Screenshot: Respondents are shown the correct answers and crowd
average responses at the completion of the survey (2 of 2). . . . . . . . . . . . . . 128
A.17 Tetris Survey Screenshot: Exit page. . . . . . . . . . . . . . . . . . . . . . . . . . 129
Abstract
The purpose of this research is to advance understanding of the uses of human- and machine- gen-
erated probabilistic estimates. When used for decision analysis, probabilistic estimates provide a
decision maker with a sense of both possibility and risk of future prospects. To mitigate potential
biases in their own thinking and gaps in their own knowledge, wise decision makers elicit such
estimates from multiple sources, such as experts, crowds, and machine models. Different expe-
riences and points of view often help a group come up with better answers than its individuals
would alone. However, multiple judges are not likely to agree and each is subject to their own sys-
temic biases. In addition, many decisions (such as policy decisions) carry large consequences, may
change in scope over time, and are riddled with uncertainties that are mutually relevant. Defining a
joint probability distribution of the pertinent uncertainties then becomes a critical task in decision
analysis. The task is made more difficult by lack of available data, lack of human expert abilities
to handle the cognitive complexity of the estimation task, or often both. To mitigate these potential
problems, this research analyzes data from recent forecasting competitions to arrive at generaliz-
able findings. The findings are combined with novel methodologies that together help answer three
key questions of contemporary significance:
1. How should a decision maker mechanically alter probability estimates from human judges
to account for systemic bias?
2. How should a decision maker combine probability estimates from multiple human judges
and from mathematical models?
3. How should a decision maker elicit joint probability estimates from human judges and from
data?
The answers to the above questions are not cut and dried; they instead rely on the context of the
estimation tasks and the larger decision to be made. The analyses presented in this research help
to codify best practices for decision makers given particular contexts. I use a linear-in-log-odds
weighting function to alter probability estimates and a generalized aggregation function to combine
them. I rely on graphical models and measures of relevance to describe joint distributions. And
throughout the research, I pursue hybridization options to mitigate the weaknesses and capitalize
on the strengths of both human- and machine-centric methods. The major contributions of this
dissertation are:
1. A newly proposed calibration function that extends previous work to the multinomial case
and allows for incorporation of a baseline probability distribution.
2. A flexible methodology to mechanically adjust and combine univariate probabilistic esti-
mates from multiple sources.
3. A new data set of crowd-provided univariate probabilistic estimates that are more suitable
than traditional forecast competition data sets for examining calibration and combination
methodologies.
4. A graphical methodology to produce joint probabilistic estimates from multiple sources.
Chapter 1
Introduction, Motivation, and Literature
Review
Overview
This dissertation presents methods to generate probabilistic estimates from human experts and
from data to be used for the analysis of policy-type decisions. Probabilistic estimates are neces-
sary to give the decision-maker a more complete understanding of possible outcomes and potential
risks. Both human and machine sources are required to fill in gaps from the other source. Joint
distributions may be necessary when the uncertainty of interest is either too difficult or irrelevant to
estimate on its own. This chapter provides an introduction to the general estimation/ decision prob-
lem to be solved, provides motivation for a new type of solution, and reviews important previous
work that sets the stage for the novel contributions of this dissertation.
1.1 Introduction and Scope
It is important to understand the estimation task as a part of decision analysis, and thus as influenced
heavily by the context of the decision situation. Policy decisions carry large consequences, may
change in scope over time, and are riddled with uncertainties that may be mutually dependent.
They often involve counterfactual estimation tasks: those that ask what would have happened under
a set of conditions or following a hypothetical intervention for which we have no authoritative
data. Defining a joint probability distribution of the relevant uncertainties is a critical task in
decision analysis which is made more difficult by lack of available data, lack of human expert
abilities to handle the cognitive complexity of the estimation task, or often both. Forecasts from
different sources may be available and will then require combinations of some sort. The forecasts
or estimates may be subject to systemic bias and therefore require post-facto adjustment.
Figure 1.1: Overview of the Fundamental Problem to be Solved by this Research
Explanation of Figure 1.1 We have a decision to make now (1), and we choose the appropriate
alternative based on our expectations about the outcome of our decision at some point in the future
(2). An important part of the problem is the fact that we have no way of knowing the actual
complete consequences of our decision, and so we must make an estimate of the uncertainties
bearing on our decision (3). These uncertainties may be unknowable (4). And so we may instead
seek estimates of other relevant uncertainties that are related to our uncertainty of interest (5).[1] We
define all these uncertainties and their relationships as a system, and we then seek to know the future
state of the system. Time emerges as another barrier (6) that must be overcome. We rely on human
judges (9) in the form of experts or novice members of a crowd, or machine models to forecast
the values of relevant uncertainties into the future using only what we know about them now: any
available historical data (7), and perhaps a counterfactual antecedent (8) that describes potential
conditions impacting their future values. The alternatives we are considering in our decision are
the most obvious counterfactual antecedents if they are expected to have an impact on the relevant
uncertainties. If our decision will not have an impact on the uncertainties themselves, then we may
wish to consider the effects of other possible shocks to our system. It is worth keeping in mind that
human judges are subject to biases, data is subject to be incomplete, and antecedents may create
complex feedback loops. The aforementioned characteristics of the fundamental problem and
potential dilemmas are described in detail below and help frame the purpose of this dissertation.
1.1.1 Probabilistic Estimation
Decision analysis, risk analysis, and forecasting in general rely on expert opinions about uncertain
quantities. Among the most useful opinions are those that come from multiple experts and are
presented as probability distributions because they provide the decision maker or policy analyst
with a picture of what is possible as opposed to just what may be most likely (Clemen & Winkler,
1999; Genest & Zidek, 1986; O’Hagan et al., 2006; Page et al., 2020; Winkler et al., 2019). As-
signing probabilities to intervals of the value of a continuous variable reduces the cognitive burden
on the judge but still gives the decision maker necessary information (Abbas et al., 2008; Gaba
et al., 2017; Grushka-Cockayne et al., 2020; Wallsten et al., 2016) and can reduce overprecision
(Haran et al., 2010). Although combining interval probabilities from a crowd of judges can greatly
improve decision quality, problems may arise when the crowd disagrees, their forecasts are depen-
dent on one another, or they exhibit systemic biases in judgement. Fortunately, these problems
have been largely addressed by previous research on calibration and aggregation.
[1] For example, we may wish to know the number of deaths attributable to a certain disease in a particular country
for which we have no reliable measurements available. But we can measure attributable deaths in a nearby country
and relevant infection test results in the country of interest as proxy data.
This research is constrained to estimation tasks with discrete outcomes because they are most
conducive to selecting appropriate strategies or policies to enact. In such high-level decisions there
are often a relatively small number of candidate strategies or policies that are selected based on
threshold values of critical uncertainties. This research is focused on the estimation of these critical
uncertainties and will thus be constrained to probabilistic estimation of uncertainties with discrete
outcomes. Future decisions, at a more tactical level, may rely on continuous distributions of un-
certainties and iso-preference curves to make finer level trade-offs among candidate alternatives.
These finer-grained uncertainties are outside of the scope of this research.[2]
1.1.2 Policy Decision Analysis
Policy decisions are tough business. They often require a careful examination of the relevant
system(s) to understand the dynamic relationships among uncertainties. Just attempting to un-
derstand the system brings complexity to the decision, before we even consider alternatives. In
the summer of 2009, General Stanley McChrystal was the commander of American and NATO
forces in Afghanistan. He was shown a slide during a briefing that was meant to portray some of
the complexities of the system about which he was charged with making decisions. When he saw it
(Figure 1.2), he remarked, “When we understand that slide, we’ll have won the war.” (Bumiller,
2010) Although he was attempting dry humor to make light of the great task at hand, he had the
right idea about the deep understanding of the dynamics of a system required to make quality
strategy- or policy- level decisions. The slide is actually a causal-loop diagram that is intended to
help decision-makers ask better questions about the impacts of their decision alternatives (D. Liu,
2013). The diagram displays the conditional dependence relationships among uncertainties, and
may thus be used to aid in probabilistic estimation tasks like the other graphical models presented
in this research.
[2] See Howard and Abbas (2016) for a more complete explanation of this decision hierarchy concept.
Figure 1.2: Causal Diagram of Relevant Uncertainties in Afghanistan
The recent outbreak of novel coronavirus and its related disease, COVID-19, illustrates another
example of such important policy decisions that require accurate estimates of uncertainties. The
news and statistical analysis website FiveThirtyEight.com exposed some of the uncertainties required
to make an accurate estimate of the death toll of COVID-19 (Koerth et al., 2020). Figure 1.3 shows
the necessary inputs to make an accurate prediction. The website described some of the biggest
problems with attaining valid estimates on the relevant uncertainties: lack of data and lack of con-
sensus among experts. They even explained that there is a lack of consensus on the structure of
the model itself and which uncertainties are required to make accurate estimates. Unpredictable
feedback (such as popular response to death toll predictions) is another factor that adds complexity
to such an estimation task.[3] On April 17, 2020, Texas Governor Greg Abbott introduced his
plan for re-opening the businesses and services that were shut down in his state as a response to
slow the spread of COVID-19: “In opening Texas, we must be guided by data and by doctors.” He
underscored the importance of a hybrid approach to obtaining the most accurate estimates where
there is so much conflicting or lacking evidence (Garrett et al., 2020). Problems like these re-
quire not only the statistical analysis of available data, but also the unique human ability to bring
background knowledge, contextual understanding, and causal and counterfactual analysis. They
require an understanding of several uncertainties at a time and thus usually rely on assessments
of the relationships among uncertainties. Further, the addition of statistical analysis must be done
correctly, because constituents place extra confidence in policy decisions that are perceived to be
algorithmically enhanced with data (Waggoner et al., 2019).
[3] This is known as “decisions with influence” in the decision analysis literature (Howard & Abbas, 2016).
Figure 1.3: Diagram of Relevant Factors for Estimating COVID-19 Deaths
Complex Systems Analysis
Complex systems, such as the ones represented by the COVID-19 or Afghanistan examples, are
notoriously difficult to predict. It may be because humans are just not equipped for the cogni-
tive complexity brought on by feedback loops and exponential growth (Foltice & Langer, 2017;
Cordes et al., 2019). Systems thinking is often posed as a solution for dealing with such complex-
ity by directly addressing the impacts of feedback (Sterman, 1989; Sweeney & Sterman, 2007),
acknowledging the difference between the whole and the sum of its parts (Sweeney & Sterman,
2000), properly identifying the existing mental models (Cavaleri & Sterman, 1997), and by encour-
aging organizational learning (Senge & Sterman, 1992). The methodology presented in Chapter 4
incorporates many of these solutions.
1.1.3 Counterfactual Estimation
A counterfactual query asks “If X happened, what is the probability that Y would happen?” It is
called counterfactual because we know that X did not in fact happen.[4] Counterfactual forecasting
is unique among other types of forecasting for several reasons. First, because we are concerned
with counterfactual events, we often lack the ability to determine our own forecast error. Second,
we often lack enough historical data with which to build accurate statistical forecast models. Third,
counterfactual forecasting often relies on human understanding of causality of the underlying sys-
tem and so strictly statistical methods may not be applicable anyway (Balke & Pearl, 1995).
Counterfactual thinking is not just for parlor games about alternate histories though; it is a
crucial element in policy analysis. For example, consider the counterfactual queries: “What would
our greenhouse gas levels be today if we had enacted a legislation package curbing emissions 20 years
ago?” or “What would the death toll due to COVID-19 have been with stricter enforcement of so-
cial distancing measures in New York City in April 2020?” Analyzing scenarios like these can help
policy makers properly consider alternatives and potential interactions. Counterfactual reasoning
is also increasingly important in attributing acts to potential adversaries or uncovering adversarial
intent[5] in modern conflict scenarios. It is also used by intelligence analysts to properly identify
the lessons the intelligence community should learn from successful and failed intelligence
assessments.[6] Many researchers seek to automate such reasoning in an effort to eliminate the
cognitive biases from human analysts. Such approaches may be doomed from the start, though, as
counterfactual reasoning may be a uniquely human ability (Pearl & Mackenzie, 2018).
[4] A classic example asks about the outcome of World War II given that Hitler had been born as a girl.
[5] “Making Gray-Zone Activity more Black and White”, 2019.
[6] “Forecasting Counterfactuals in Uncontrolled Settings (FOCUS)”, 2018.
Thinking about what would have been is a helpful way to solve problems with significant
consequences (Roese, 1999). Such counterfactual thinking has been demonstrated as a benefi-
cial technique for geo-political forecasting (Tetlock et al., 2012; Tetlock & Gardner, 2015), moral
judgements (Migliore et al., 2014), risk attitude clarification (Dillenberger & Rozen, 2015), policy
evaluation (Balke & Pearl, 1995), and planning and decision analysis in the form of pre-mortem
analysis (Klein, 2007) or backcasting (Robinson, 2003). The more rigorous practice of counter-
factual estimation has been a popular lens for looking at impact evaluation and hypothesis testing
problems in the medical community for some time (Pearl, 2000, 2009), and is gaining popularity in
internet research as a surrogate to A/B testing using deep neural nets (Hartford et al., 2017)
or Bayesian structural time series (Brodersen et al., 2015). Counterfactual estimation becomes all
the more critical for problems (like strategy formulation and policy evaluation) that depend heav-
ily on human intuition and for which we have limited data. These conditions render some (more
popular) problem-solving strategies obsolete (human intuition often fails with complex systems
(Dörner, 1996) and many machine models fail with insufficient, inappropriate, or spurious data
(Mitchell, 1997)). Counterfactual estimation, then, becomes a combination of both forecasting
and hypothesis testing, and requires a tool-set that is fit for both aspects and can accept inputs from
both humans (for their intuitive understanding of underlying causes) and data (for its representa-
tion of what has been true under specific conditions). We thus need an apparatus that exploits both
of these aspects and can provide decision-makers with an assessment of its associated costs and
benefits. This research will identify the strengths and weaknesses of reasoning and analysis meth-
ods applicable to counterfactual problems to help decision makers make trade-offs about which
methods to pursue for a problem based on its characteristics.
Counterfactual problems have a complicated relationship with data because data represent
facts, things that actually happened. No amount of manipulation will get factual data to answer
a counterfactual query with certainty. Solving such a problem requires a combination of causal
understanding from human sources and statistical relationships from machine methods. This re-
quirement leads to my pursuit of hybrid (both human and machine) solutions throughout this dis-
sertation.
1.1.4 The Wisdom of the Crowd
Ensemble machine learning methods like boosting, bagging, and random forests rely on training
several models on subsets of available data and aggregating their individual predictions (Hastie
et al., 2009). The result is often more accurate and more robust to new data than building a
single comprehensive predictive model. Just as combinations of relatively weak learners are used
effectively in machine prediction tasks, large crowds of relatively novice human forecasters provide
the diversity required to produce forecasts that are more robust and often more accurate than those
produced by single experts. This phenomenon is referred to as the wisdom of crowds. Although
made popular by Surowiecki (2005), discovery of the phenomenon is often attributed to Galton
(1907), who noticed that the average of a crowd's point predictions could be very accurate despite
large variance among the estimates. It was this human phenomenon over a century ago that later
influenced the ensemble machine learning methods of the 1990s. In this research, I acknowledge
the
The wisdom of the crowd is not just for point prediction; it has also been shown beneficial for
combinatorial problems (Kung et al., 2012), hybrid forecasting (Miyoshi & Matsubara, 2018), and
has been demonstrated to work with crowds who are generally wrong when coupled with machine
learning techniques (Laan et al., 2017). It has been shown appropriate for national intelligence
assessments (Hershkovitz, 2020) with varied success (Mandel, 2019).
1.1.5 The Cold Start Problem
The “cold start” problem refers to the difficulty in determining the best way to aggregate a crowd’s
forecast before we know any resolved answers and thus lack any data about the potential abilities
(and hence actual accuracy) of the members of the crowd (Himmelstein et al., 2021). This problem
is perhaps most identifiable in the beginning of a forecasting tournament with anonymous judges.
We lack any information about their potential or actual performance, and so cannot adequately
inform a hierarchical-type strategy to weight their expected contribution to a more accurate (or
informative) answer.
1.1.6 Hybridization
Hybrid research focuses on how human and machine models can best interact with or support one
another. The importance of pursuing a hybrid solution to the problems outlined above cannot be
overstated. Human estimators are prone to cognitive biases (Kahneman, 2011) and machine esti-
mators may require data that are unavailable and may assert causation when it does not exist (Russell & Norvig,
2009). Most research that proposes hybridization of human and machine methods focuses on how
machine models can assist human decision making at the expense of how humans may help ma-
chines[7] (Jarrahi, 2018). Promising research in this area tends to follow a “System 3” type concept
(Hall, 2016) that treats machine reasoning as an augment to human dual cognitive processes: Sys-
tem 1 (heuristic and intuitive) and System 2 (analytical) (Kahneman, 2011). Graphical models may
provide the link to allow a System 3 construct to exploit the human systems’ potential impact on
machine reasoning and vice versa. This would allow for human-machine collaboration more along
the lines of Centaur Chess, where human intuition and computer analysis are both exploited to
form successful teams capable of sound, creative judgement (Kasparov & Greengard, 2017). In this
research, the term hybridization refers to the integration of both human- and machine-derived mod-
els, parameters, or estimates. When we make similar combinations from solely human sources, we
use the term aggregation (such as in aggregating the opinions of experts), and from solely machine
sources, we use the term ensemble.
[7] The research in this dissertation takes the view that hybridization should benefit both human and machine models.
Hybridization (and indeed ensembles and aggregations) may be done at different times
in the problem-solving process. Early hybridization may mean soliciting parameter values from
human experts and letting a single machine model use them to produce a single estimate. Late
hybridization may include each individual (or machine) creating their own model. Each model
produces an estimate and all estimates are aggregated. Middle aggregation can take place anywhere
in between, and in more than one place. Human and machine inputs may be used to make several
different models whose outputs are aggregated. Machine-produced estimates may be used by
human judges to refine their own mental models before providing final estimates. In a graphical
model example, both human and machine judges may determine possible graphical structures;
these can be combined to form one structure. Then all judges (human and machine) may make
assessments of univariate distributions depicted in the single structure. Several machine models
may then create possible joint distributions based on the preceding inputs. These options are
depicted pictorially in Figure 1.4. Bottom line: there are a lot of ways to capitalize on the unique
strengths of human judges and machine models. This concept will be revisited as it arises in this
dissertation.
Figure 1.4: Graphical Overview of Hybridization Options
1.1.7 Trade-offs Between Accuracy and Informativeness
Yaniv and Foster (1995) suggest that the precision of uncertain judgements involves a trade-off
between accuracy and informativeness.
In decision analysis we take a “long-run” mindset and we seek accuracy. We think of the
future unfolding as a number of simulation runs. Our responsibility then becomes to choose the
right parameters for the simulation and then forecast the distribution of outcomes. The “long-run”
mindset is an appropriate one for policies, investments, and professional gamblers (Duke, 2018).
In a forecasting competition, however, we may be better off taking a “this-bet-only” mindset and
suppose that this is our last bet at the table, and so we seek informativeness. We seek to know
the bet to make now, rather than how to bet for the rest of the game. Our responsibility then
becomes akin to predicting where a roulette ball will land on this next spin only. This is the
approach of an amateur gambler. Both mindsets are important to consider and imply the use of
different methods for mechanically adjusting probabilistic estimates. In this research, I adhere to a
subjective understanding of probability, where the probabilities assigned to outcome bins represent
the decision maker's degree of belief that the outcome will occur. I will continue to explain trade-
offs between accuracy and informativeness as we encounter them in this dissertation.
Scoring
We need a scoring methodology that works no matter whether we are more interested in immedi-
ate informativeness or long-term accuracy. It should penalize error symmetrically and be strictly
proper in the sense that the best score will be achieved only when the estimate reflects the true
degree of belief held by the estimator. This stipulation implies that the scoring metric cannot be
gamed. We will use mean absolute percent error (MAPE) when appropriate for point estimates.
This research largely deals with probabilities instead of point values, though, and requires another
suitable measure of error. As previously mentioned, our estimation intervals are selected based on
what will be most informative to our decision problem. And there exists no directional relationship
between categorical outcomes. We therefore need not consider the direction of our error, but only
the magnitude (and distance for ordinal intervals of quantitative uncertainties). Brier score fits
these needs well and is widely used in forecasting competitions.
Brier Score Brier score (Brier, 1950) is a strictly proper scoring rule (Gneiting & Raftery, 2007)
that has been widely used to assess accuracy of a series of forecasts using binary or multinomial
probability statements. Brier score for an individual probabilistic forecast of an event that either
occurs or does not occur is defined by
\[
\sum_{j=1}^{m} \left(f_j - d_j\right)^2 \qquad (1.1)
\]
where m represents the number of forecast bin categories (5 in all of our examples), and f_j
represents the forecasted probability for response bin j. d_j is the actual outcome probability and
often takes binary value 1 or 0 depending on whether the event occurred in bin j or not. Moreover,
the m classes of j are mutually exclusive and exhaustive so that \(\sum_{j=1}^{m} f_j = 1\) and
\(\sum_{j=1}^{m} d_j = 1\). The
worst possible score occurs when the forecaster assigns 100% probability to an event (or bin) that
does not materialize.
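As a concrete illustration, Equation 1.1 can be computed in a few lines. The sketch below is mine rather than
code from this research, and the forecast and outcome vectors are hypothetical:

```python
# Illustrative sketch of the Brier score in Equation 1.1 (hypothetical numbers).
def brier_score(forecast, outcome):
    """Sum of squared differences between forecast probabilities and realized outcomes."""
    assert abs(sum(forecast) - 1.0) < 1e-9 and abs(sum(outcome) - 1.0) < 1e-9
    return sum((f - d) ** 2 for f, d in zip(forecast, outcome))

# Five mutually exclusive bins; the event resolved in bin 3 (index 2).
forecast = [0.10, 0.20, 0.40, 0.20, 0.10]
outcome = [0, 0, 1, 0, 0]
print(brier_score(forecast, outcome))  # ≈ 0.46; the worst possible score here is 2
```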
Ordinal Brier Score For ordinal response questions, like the price of gold or oil, where there
may be some value in forecasting “close” to the correct answer, we use a slightly different scoring
method as advocated by Merkle and Hartman (2018):
\[
\frac{1}{m-1} \sum_{j=1}^{m-1} \left(F_j - D_j\right)^2 \qquad (1.2)
\]
where a question has m answer bins, and F_j and D_j are the cumulative forecasts and outcomes,
respectively (the probabilities assigned to, and actual realized outcomes of, answer bin j and
below). This ordinal method is used throughout the rest of the research. It has a highest possible
(worst) score of 1 (where, say a probability of 1 was allocated to bin 1 and the realized outcome
was bin 5).
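Because Equation 1.2 works on cumulative probabilities, a miss in a bin adjacent to the realized outcome is
penalized less than a miss in a distant bin. A minimal sketch, again with hypothetical numbers:

```python
from itertools import accumulate

# Illustrative sketch of the ordinal Brier score in Equation 1.2 (hypothetical numbers).
def ordinal_brier_score(forecast, outcome):
    """Mean squared difference between cumulative forecasts and cumulative outcomes
    over the first m-1 bins (the m-th cumulative value is always 1 for both)."""
    m = len(forecast)
    F = list(accumulate(forecast))  # cumulative forecast probabilities
    D = list(accumulate(outcome))   # cumulative realized outcome (0s, then 1s)
    return sum((F[j] - D[j]) ** 2 for j in range(m - 1)) / (m - 1)

forecast = [0.10, 0.20, 0.40, 0.20, 0.10]
print(ordinal_brier_score(forecast, [0, 0, 1, 0, 0]))  # ≈ 0.05: the outcome fell in the modal bin
print(ordinal_brier_score([1, 0, 0, 0, 0], [0, 0, 0, 0, 1]))  # 1.0: the worst possible ordinal score
```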
Brier score was originally intended to help ensure honest reporting, and may not always be
suitable for assessing overall accuracy, especially when dealing with binary outcomes. While the
more Bayesian approach advocated by Steyvers et al. (2014) may be generally suitable, it is shown
to correlate well with Brier score and lacks the ability to provide consistent error measures across
the different data sets used in this dissertation. I therefore use Brier score, and its ordinal version,
to assess probabilistic error in this research.
1.2 Motivation and Contributions
The current COVID-19 pandemic highlights several of the issues that are addressed by the general
approach depicted in Figure 1.1:
• We rely on probabilities instead of point estimates to inform decision making.
• We need to account for low-probability events and outcomes in order to make robust policies.
• We must use counterfactual thinking to address potential impacts of yet-to-be-imposed poli-
cies.
• We need input from a diverse set of experts, and their inputs may also be diverse (for ex-
ample, some experts are more comfortable providing point estimates, while others prefer
probabilities, some prefer stories about how uncertainties may be related, and still others
may provide numerical estimates about the relationships of uncertainties, their future trajec-
tories, or their potential reactions to hypothetical antecedents).
Our current COVID predicament also illuminates a potential roadblock to making the scheme
described in Figure 1.1 a useful reality: How do we come up with the best estimates and incorporate
different relevant viewpoints when we can’t meet in the same room? Fortunately, this research
offers a potential solution for distributed digital consensus in the web-based HARNESS application
and an example of remotely crowd-sourced estimates in the Tetris experiment data. Table 1.1 lists
the problems addressed by this dissertation and the solutions it provides. In addition to analytical
results, my research provides:
• Two new problem-solving methodologies
– Calibrate-Aggregate-Recalibrate (C-A-R) for improving crowd probability estimates
that are biased and disagree
– Hybrid Adaptive Relevance Networks for Estimating System States (HARNESS) for
describing joint probability distributions from multiple inputs
• A new multinomial probability weighting function for calibrating expert judgements with
respect to a probability distribution of choice.
• A new data set of probabilistic estimates about uncertainties which resolve to non-binary
outcomes.
1.3 Concepts and Literature Review
Several existing concepts address the challenges outlined above. They are presented in this sec-
tion to form a base of understanding for the reader and to help situate the work presented in this
dissertation among the relevant literature.
1.3.1 Eliciting Probabilistic Estimates from Human Judges
It is helpful to think of eliciting human judgements as “a way of obtaining (probabilistic) results
from experiments or measurements which are feasible in principle but not in practice" (Cooke et al., 2021).
When eliciting quantity or probability judgements from humans, decomposition has been shown
to simplify and improve the process (Hora, 2007). Decomposition breaks the elicitation problem
down to potentially easier pieces. For example, instead of asking a human judge for a projection
of the number of deaths that will be caused by a certain infectious disease, we may instead ask him
about the portion of the population that might be susceptible, the rate at which the infection might
spread, and the portion of the infected likely to die. These three potentially easier elicitations may
be combined to produce an estimate of the total number of deaths in the population. Results are
mixed, however, on how much judgement error can be reduced by using such techniques (Hen-
rion et al., 1993; Wright et al., 1988). Some studies suggest that it reduces only random error
Table 1.1: Summary of Contributions of this Research

Existing Problem: While forecasting methods abound, decision makers may lack simple models that can
easily integrate historical data into decision analysis frameworks.
Potential Solution: Brownian motion models offer a computationally inexpensive and intuitive way to
provide probability forecasts.
My Contribution: I demonstrate the effectiveness of Brownian motion models compared to human judges
on forecasts of commodity prices in Section 2.3.

Existing Problem: Probability forecasts from a crowd of judges disagree with one another.
Potential Solution: Opinion pools combine the contributions of multiple judges into a single estimate.
My Contribution: I demonstrate the effectiveness of a generalized aggregation function that can take the
form of the linear or geometric opinion pools along a spectrum of other options.

Existing Problem: Probability forecasts from judges in a crowd are often miscalibrated due to systemic
biases.
Potential Solution: A calibration function mechanically aligns probability judgements to a desired height
and shape.
My Contribution: I propose a new multinomial calibration function that extends previous binary methods
and can incorporate a baseline distribution.

Existing Problem: Crowd probability forecasts must be both calibrated and aggregated.
Potential Solution: Existing methods propose either calibration followed by aggregation or aggregation
followed by calibration. If we calibrate both before and after aggregation, we may tailor each calibration
phase to the crowd and the answer, respectively.
My Contribution: I propose the Calibrate-Aggregate-Recalibrate (C-A-R) methodology that allows for
calibration to address systemic biases and recalibration to adjust the estimate to its intended use. I
demonstrate the effectiveness of the three-part methodology on crowd forecasts from a recent competition.

Existing Problem: Most crowd forecast data sets have a binary response (the outcome either occurred in a
particular bin or it did not). Calibration and combination schemes do not get rigorously examined on such
responses.
Potential Solution: A new data set is required where judgements and responses follow a non-binary,
parameterizable distribution.
My Contribution: I conducted a web-based survey to elicit probabilistic estimates on the simulated
outcomes of Tetris-like games. The resulting data is now available to the interested community and
thoroughly analyzed in this dissertation.

Existing Problem: Human judgement and historical data are often both required to produce appropriate
estimates of the future states of complex systems.
Potential Solution: Graphical models provide an intuitive approach to combine human inputs with data for
estimation problems involving mutually relevant uncertainties.
My Contribution: I propose the Hybrid Adaptive Relevance Network for Estimating System States
(HARNESS) methodology to produce joint distributions from multiple inputs. I demonstrate the errors
induced by approximate methods of dependence elicitation. I demonstrate the errors induced by different
methods of joint distribution elicitation.
(Kleinmuntz et al., 1996; Ravinder et al., 1988), while others present more optimistic findings
(MacGregor et al., 1988). There has also been considerable work to determine the right decompo-
sition scheme for a problem. While some studies suggest full decomposition is best (Macgregor
& Lichtenstein, 1991), others tout the importance of choosing a scheme familiar to all respondents
(Keeney & von Winterfeldt, 1991).
1.3.2 Eliciting Probabilistic Estimates from Data
As data are increasingly available for even the most obscure uncertainties, data driven estimation
methods become more relevant. They may be used on their own or in conjunction with expert
human judges. Often, experts may use their own data sets and algorithmic models to arrive at their
judgements.
Time Series Forecasting
Most of the estimation problems in this research involve forecasting, estimating a quantity that
occurs in the future. Often, we have available the regularly-recorded historical values of the uncer-
tainty of interest, and so time series forecasting is an appropriate method. Time series are values
of a quantity or uncertainty taken at regular intervals. Forecasting a univariate time series may be
done in myriad ways. Zellner et al. (2021) provide a recent overview of both human and machine
forecasting methods. For our purposes, any forecasting method should be able to provide a predic-
tion interval to be useful for decision or planning purposes (Chatfield, 1993). While it is outside
the scope of this research to describe time series methods in detail, one very helpful method is
Brownian motion because it illustrates many of the key concepts used in this dissertation.
Brownian Motion Models
Brownian motion models have been shown in previous research to be appropriate for modeling
prices of stocks and commodities (Brennan & Schwartz, 1985; Paddock et al., 1988). They are a
natural choice for many because of their versatility (Reddy & Clinton, 2016) and simplicity (May-
hew, 2002). The key to using these models correctly is in selecting the appropriate parameters: the
values that represent the drift and diffusion of the underlying quantity of interest. These param-
eters may be estimated from historical values quite easily, but it is often a matter of experience,
rather than rule of thumb, to select an appropriate subset of historical data to use to estimate the
parameters. Previous shocks to the system that regulates a quantity (such as a large trade deal or
embargo in the case of predicting crude oil values) may change the parameters for future values. The
parameter selection scheme would ideally account for these shocks (perhaps by ignoring historical
data that was collected prior to the most recent shock as advocated by Cowpertwait and Metcalfe
(2009)). Identifying a shock, though, may require more domain experience than is available to a
decision maker or to his advisors (especially if they are crowd-sourced), so a benefit of this ap-
proach is that it may give a decision maker a simple way to make his own rules of thumb about
which models to use and how to select model parameters.
Table 1.2 displays the characteristics of the machine models used in this research. Arithmetic
Brownian Motion (BM) forecasts a price at a horizon of t time units (days) into the future (S_t) by
considering its current price (S_0) and the drift (trend, μ) of the price, and by acknowledging the
diffusion (volatility, σ) of the price over time (where B_t ~ N(0, t) represents standard Brownian
motion). This model assumes that the differences in price from day to day are distributed normally.
Geometric Brownian Motion (GBM) forecasts a price at a horizon of t time units (days) into the
future (S_t) by considering its current price (S_0), the drift (trend, μ) of the log of daily returns,
and the diffusion (volatility, σ) of the log of daily returns over time. This model assumes that the
log of daily returns from day to day are distributed normally, so the actual prices are distributed
log-normally. Therefore, this model does not allow a forecast to attain a negative value, making it
an appropriate choice for forecasting commodity prices. Both models' parameters depend on how
many previous data points (m) are used to estimate them. We refer to this quantity as the look-back
window (e.g., a look-back window of m = 30 uses the past thirty days of prices to estimate drift
and/or diffusion parameters).
Table 1.2: Summary of Machine Models Used in this Research
We also add a “Naïve” model that assumes prices in the future will be the historical average
of prices over the look-back period, m, plus some diffusion similar to the BM model. This model
assumes no drift; that the mean of future values will equal the mean of those in the past. The
addition of this relatively weak forecaster allows us to compare the performance of the BM and
GBM models to a baseline model. This weak model may also be of some value in a circumstance
when we require a forecast deeper into the future than we have available data (as indicated by
the white portions of Figure 2.3 where the naïve model had lower percent error than either the BM
or GBM models). Figure 1.5 depicts possible paths of a Brownian motion model simulation on
the left and the resulting forecast on the upper right compared to novice human forecasters (lower
right). This figure demonstrates the intuitive depictions possible with Brownian motion models
and the flexibility of their outputs: they can provide point forecasts or interval probabilities.
Figure 1.5: Sample Brownian Motion Forecasts
Shown are sample Brownian motion point forecasts (left) and probabilistic forecasts from the Brownian motion model
(top right) and aggregation of novice human forecasts (bottom right). Note that simulated Brownian motion models
provide an easily interpretable visual depiction of the expected occurrence of seemingly rare events (such as abnor-
mally high or low asset prices, as depicted by the blue (highest and lowest) predicted paths). For point forecasts, we
take the average of the simulated paths on the day of interest. For probabilistic forecasts we count the proportions of
paths that end in each of the forecast bins; the counts from the paths on the left are depicted in the upper right column
chart.
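For illustration, a minimal simulation in the spirit of Table 1.2 and Figure 1.5 might look like the sketch
below. The price series, look-back window, forecast horizon, and answer bins are all hypothetical; this is a
sketch of the general approach, not the code used in this research:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical daily price history; the last m = 30 observations (the look-back
# window) are used to estimate the model parameters.
prices = 60 + np.cumsum(rng.normal(0.05, 0.8, size=200))
window = prices[-30:]

# Geometric Brownian Motion: estimate drift and diffusion from log daily returns,
# then simulate each future daily log return as N(mu, sigma^2).
log_returns = np.diff(np.log(window))
mu, sigma = log_returns.mean(), log_returns.std(ddof=1)

horizon, n_paths = 30, 5000  # 30-day horizon, 5000 simulated paths
increments = mu + sigma * rng.standard_normal((n_paths, horizon))
paths = prices[-1] * np.exp(np.cumsum(increments, axis=1))
terminal = paths[:, -1]

# Point forecast: the average of the simulated paths on the day of interest.
print("point forecast:", terminal.mean())

# Probabilistic forecast: the proportion of paths ending in each answer bin.
bin_edges = [0, 55, 60, 65, 70, 10_000]  # hypothetical bins; the last edge is effectively open-ended
counts, _ = np.histogram(terminal, bins=bin_edges)
print("bin probabilities:", counts / n_paths)
```

In the same spirit, the arithmetic BM model works with normally distributed price differences rather than
log returns, and the “Naïve” baseline described above drops the drift term and centers future prices on the
look-back average.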
1.3.3 Calibrating Probabilistic Estimates
There has been an abundance of research effort on calibrating expert forecasts to overcome the systemic biases of individual judges.[8] A perfectly calibrated judge would assign a probability to an outcome that would closely match its actual occurrence rate.[9] Most human forecasters, however, tend to overestimate the probability of rare events and underestimate the probability of common ones (Kahneman & Tversky, 1979; Tversky & Kahneman, 1974, 1992). Miscalibration describes how both over- or under-confidence and systematic over- or under-estimation impact a judge's probability assessments. The miscalibration problem may be treated as any other affliction, by diagnosing and then by appropriate treatment either of the cause(s) or of the symptom(s). Treatment of the causes is best addressed by education: helping judges understand how to better calibrate their estimates, such as by taking the "outside view" (Chang et al., 2016; Chang & Tetlock, 2016; B. Mellers et al., 2014; Tetlock & Gardner, 2015), and is outside the scope of this paper. Diagnosing and treating the symptoms of miscalibration can be done with a probability weighting function. A probability weighting function is a nonlinear function born out of prospect theory that was originally devised to describe (not mitigate) the effects of loss-aversion and diminished sensitivity in decision makers (Kahneman & Tversky, 1979). Many such functions have been used and studied over the years; among the most popular are those proposed by Tversky and Kahneman (1992), Prelec (1998), and Gonzalez and Wu (1999). These three functions all produce an S-shaped curve over the interval [0,1] to describe the empirical discrepancy between elicited subjective probabilities and their objective (or nominal) counterparts; see Figure 1.6.
[8] In this research, I use the subjective understanding of probability as part of a general inclusion of not only elicited numbers between 0 and 1, but also frequencies, occurrences, and other measures that define uncertainty. This is congruent with the "loose" view of calibration and probability judgements as described by Keren (1991).
[9] This is easily verified for weather forecasters, for example, by comparing the frequency at which rainy days occur when the forecaster predicts a 70% chance of rain; it should be rainy on 70% of those days.
Tversky and Kahneman (1992) proposed a single parameter function,
w_{T\&K}(p) = \frac{p^{\gamma}}{\left(p^{\gamma} + (1-p)^{\gamma}\right)^{1/\gamma}} \qquad (1.3)

that provided good fits to empirical data, but is not appropriate for our application due to its non-monotonicity and its lack of definition near the points of certainty and impossibility. See, for example, the inversion of the original probabilities in the middle row of Figure 3.6 as γ approaches 0 (also see Ingersoll (2008)). It is important to note that Tversky and Kahneman (1992)'s original proposition of this function included a restriction of 0 < γ ≤ 1.
Prelec (1998) used an axiomatic approach to propose a two-parameter function,
w_{P}(p) = \exp\left[-\beta(-\ln p)^{\alpha}\right] \qquad (1.4)
This formulation solved some of the inadequacies of earlier proposed functions and was still a
good and simple fit for observed data; however, it lacked intuitive explanations for the roles of the
parameters.
Gonzalez and Wu (1999) summarized and expanded upon previous work to arrive at a two
parameter function with a psychological rationale,
w_{G\&W}(p) = \frac{\delta p^{\gamma}}{\delta p^{\gamma} + (1-p)^{\gamma}} \qquad (1.5)
They described the role of the two parameters in terms of their psychological meanings. The γ parameter controls the curvature of the function, which represents discriminability: the relation between a unit difference in subjective probability and that same difference in weighted probability along the interval [0,1]. This concept relates closely to the diminished sensitivity of Tversky and Kahneman (1992). For a binary event, a judge with very low discriminability can discern certain (p = 1) and impossible (p = 0) situations but may assign a noncommittal (p ≈ 0.5) probability to a very wide range of situations in between. He therefore requires a more step-like weighting function to calibrate his judgements. The more step-like forms in Figure 1.6.D are generated with higher values of γ. A judge with very high discriminability, on the other hand, may appear too committal and tend to overestimate probabilities slightly higher than 0.5 and underestimate those slightly less than 0.5. He therefore requires a flatter weighting function to calibrate his judgements. The flatter forms in Figure 1.6.D are generated with lower values of γ. The δ parameter controls the elevation of the function, which represents attractiveness: how an individual judge values a deal at a given probability. A judge who tends to overestimate probabilities would require a lower value of δ to calibrate his judgement (see the lower forms in Figure 1.6.E). Similarly, a judge who tends to underestimate probabilities would require a higher value of δ to calibrate his judgement (see the higher forms in Figure 1.6.E).
Given the logical independence of discriminability and attractiveness, Gonzalez and Wu derived a function (Equation 1.5) that is linear in log odds space. The log-odds form is

\log\left(\frac{w_{G\&W}(p)}{1 - w_{G\&W}(p)}\right) = \gamma \log\left(\frac{p}{1-p}\right) + \tau \qquad (1.6)

where γ is the same as in Equation 1.5 and the intercept τ = log(δ). Note that when δ = 1 in Equation
1.5, the function takes the form of Karmarkar’s log odds function (Karmarkar, 1978). Also note
that in Figure 1.6, all example probability weighting functions will avoid transformation at the
preservation points p = 0 and p = 1. In addition, the Karmarkar form will create another preservation point at p = 0.5, so probabilities of 0, 0.5, and 1 will all be preserved by the Karmarkar form.
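For reference, a minimal sketch of the three weighting functions in Equations 1.3 through 1.5 (function and argument names are mine; p is assumed to lie strictly between 0 and 1):

```python
import numpy as np

def w_tk(p, gamma):
    """Tversky & Kahneman (1992) single-parameter weighting function (Eq. 1.3)."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

def w_prelec(p, alpha, beta):
    """Prelec (1998) two-parameter weighting function (Eq. 1.4)."""
    return np.exp(-beta * (-np.log(p)) ** alpha)

def w_gw(p, gamma, delta):
    """Gonzalez & Wu (1999) linear-in-log-odds weighting function (Eq. 1.5).

    gamma controls curvature (discriminability) and delta controls elevation
    (attractiveness); setting delta = 1 recovers Karmarkar's (1978) form.
    """
    num = delta * p**gamma
    return num / (num + (1 - p) ** gamma)
```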
It is important to note that these functions are not strict probability functions because they do
not maintain symmetry (i.e., w(p) + w(1−p) ≠ 1). They were built instead to describe human
behavior when faced with decisions about uncertain variables. Fortunately, they may also be used
in our (current) application to adjust subjective forecasts mathematically after judges provide them
(Graziani et al., 2021; Hathout et al., 2019; Ranjan & Gneiting, 2010; Turner et al., 2014).
Just as the curvature of the function represented discriminability for descriptive purposes, it
represents extremization for prescriptive applications. We define extremization as the adjustment
of a judge’s forecast that increases its sharpness (Gneiting et al., 2007) or its concentration of
probability assigned to a specific bin. A maximally extremized forecast would place probability
p = 1 on one bin's outcome and p = 0 on the remaining bins. Similarly, we define de-extremization as an adjustment that flattens a judge's forecast by reallocating probability from bins with higher assigned probability to bins with lower assigned probability. A maximally de-extremized forecast will appear uniform (p = 1/m assigned to each of m bins).
Similarly, as the elevation of the function represented the descriptive phenomenon of attractive-
ness, for prescriptive applications it represents tendency. We use the term tendency to describe the
fixed point towards which judgements are de-extremized and away from which they are extrem-
ized. In the case of binary probability weighting functions like Equation 1.5, the default tendency
is 0.5 (when δ = 1).
The aforementioned probability weighting approaches have three main limitations for our use
case of a crowd of judges informing a decision. First, many approaches rely on sufficient per-
formance data to assess parameters for each judge. We may lack these data because we have
anonymous judges or because the forecasts we have from judges are not easily verified. Second,
the above approaches were all designed for binary probabilities while we seek a method that can accommodate m-nary probabilities (m ≥ 2). Lastly, the above approaches lack a way to incorporate an a priori baseline probability.

Figure 1.6: Example Probability Weighting Functions.
While they were originally used to describe a judge's bias, probability weighting functions may be used to transform a judge's subjective probability (x axis) to a weighted probability (y axis). Chart A demonstrates Tversky and Kahneman's function (Equation 1.3) with γ from 0.2 (solid line) to 1.8. Chart B demonstrates Prelec's function (Equation 1.4) as α varies from 0.1 to 3 and β is held at 0.5. Chart C is the same function with β from 0.1 to 1.9. Charts D and E demonstrate Gonzalez and Wu's function (Equation 1.5) with γ from 0.1 to 3 (D) and δ from 0.1 to 1.9 (E). When δ is set to 1, we get the Karmarkar form in Chart F (w_K(p) = p^γ / (p^γ + (1−p)^γ)) that shows γ from 0.1 to 3.
1.3.4 Combining Probabilistic Estimates
There has been a wealth of previous work on aggregating expert opinions (see McAndrew et al. (2021) for a very recent review). The wisdom of the
crowd (Surowiecki, 2005; Tetlock & Gardner, 2015) is known to be more valuable than individual
judgements, especially with a diverse crowd (Davis-Stober et al., 2015; B. A. Mellers & Tetlock,
2019) and sometimes even with a wrong crowd (Laan et al., 2017). Many methods exist for tak-
ing advantage of the wisdom of the crowd through aggregation (Clemen, 1989; Lichtendahl &
Winkler, 2020). There seems to be consensus from previous research on the efficacy of the lin-
ear mean (Ariely et al., 2000), but other methods such as the trimmed mean (Grushka-Cockayne
et al., 2017; Lichtendahl & Winkler, 2020), median (Cooke et al., 2021; Gaba et al., 2017; Han &
Budescu, 2019), harmonic mean (Colson & Cooke, 2017), and geometric mean (Satopää, Baron, et al., 2014) have also proven effective under some circumstances. Probability judgements may also be aggregated as odds (Satopää, Baron, et al., 2014). All methods have one thing in common:
their effectiveness is dependent upon characteristics of the crowd of forecasts including factors like
calibration of forecasts and dependence among forecasters (Lin & Huang, 2021; Wilson, 2017).
To address potential dependence among forecasters, several approaches have been proposed that
rely on information theory (Clemen & Winkler, 1985; Satopää et al., 2016), Bayesian ensembles
(Lichtendahl et al., 2018), copulas (Clemen et al., 2000; Clemen & Reilly, 1999), or the maxi-
mum entropy principle (Levy & Delic, 1994; Myung et al., 1996). Weighted aggregation can also
address dependence and miscalibration (Du et al., 2017; Winkler et al., 2019). Effective weight-
ing schemes range from very democratic (such as the mean or median of all forecasts) to elitist
(such as taking only the potentially best forecasts and down weighting others to 0 as by B. Mellers
et al. (2015)). Weights may be based on prior performance (Cooke, 1993; Cooke et al., 2014;
Himmelstein et al., 2021), coherence (D. V. Budescu & Du, 2007; Dawid et al., 1995), contribution (D. V. Budescu & Chen, 2015; B. Mellers et al., 2015), self-reported expertise (Satopää, Jensen, et al., 2014), question selection (Merkle et al., 2017), or even game theory (Bickel, 2012).
All performance-based weighting approaches, however, rely on measures of performance for each
judge. With a new or anonymous crowd, these measures may be impossible to obtain in time to
exploit the advantages of such weighting schemes.
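As an illustrative sketch of the equal-weight end of this spectrum (the function and argument names are mine; performance-based weights would require the per-judge data discussed above):

```python
import numpy as np
from scipy import stats

def aggregate_forecasts(forecasts, method="linear", trim=0.1):
    """Combine a crowd of multinomial forecasts into a single crowd forecast.

    forecasts : array of shape (n_judges, n_bins); each row sums to 1
    method    : 'linear', 'median', 'trimmed', or 'geometric' opinion pool
    """
    forecasts = np.asarray(forecasts, dtype=float)
    if method == "linear":
        pooled = forecasts.mean(axis=0)
    elif method == "median":
        pooled = np.median(forecasts, axis=0)
    elif method == "trimmed":
        pooled = stats.trim_mean(forecasts, proportiontocut=trim, axis=0)
    elif method == "geometric":
        # Geometric pool assumes no judge assigned an exact zero probability
        pooled = np.exp(np.log(forecasts).mean(axis=0))
    else:
        raise ValueError(f"unknown method: {method}")
    return pooled / pooled.sum()  # renormalize so the crowd forecast sums to 1
```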
1.3.5 Graphical Models for Joint Probabilistic Estimates
A key part of solving the counterfactual problems mentioned above is identifying how different
uncertainties (variables) are related; how they might change as other uncertainties change, either
by chance or by direct manipulation (say, by implementing a policy). Understanding such relation-
ships requires identifying a joint probability distribution either from asking experts how uncertain-
ties are related or by extracting relationships from available data or both. Graphical models have
been shown as useful for simplifying (Senge, 2006), solving (Jones, 1998), and communicating
(Krogerus & Tschappeler, 2008) the types of problems that rely on counterfactual thinking and
estimation.
Another key element in this research is that of the graphical models themselves. Influence
diagrams have long been a staple of decision analysis to map a person’s information and pos-
sible actions surrounding a decision (Howard & Abbas, 2016). Influence diagrams that contain
only nodes of uncertainties and their connecting arcs of relevance are called relevance diagrams
(Howard, 1989). In our application, such diagrams are used to capture diverse sets of information,
including marginal distributions of uncertainties and their dependencies on other uncertainties,
from a forecaster. They may also be expanded to incorporate information from multiple forecast-
ers to assist with the deliberation process mentioned above. Models may also be elicited indirectly
by eliciting lower-order dependence assessments (Clemen et al., 2000). Graphical models may
also be inferred from available data using algorithmic approaches (Scutari et al., 2019). Two such
approaches that infer dependence relationships and will be examined in this research are depen-
dence trees (Chow & Liu, 1968; Ku & Kullback, 1969) and directed acyclic graphs supported by neural networks (Lauría, 2005). In this research, such automated models will be referred to
as Algorithmic Graphical Models, while those provided by human experts will be referred to as
Expert Graphical Models. In the hybridization experiments of Chapter 4, this research will also
explore ways to combine graphical models from both sources, such as in Heckerman et al. (1995) and W.-Y. Liu et al. (2011).
Methods abound for eliciting joint probability distributions from data. This research is con-
cerned only with those methods that can use either data or human judgements, can handle new
information (i.e., is capable of updating), can use a graphical model to identify relevant pairwise
relationships among uncertainties, and that can be approximated to reduce the assessment bur-
den on human judges or reduce the data requirements. Three such methods are Bayesian belief
networks, maximum entropy methods, and Gaussian copulas.
Using Bayesian Networks for Joint Probabilistic Estimates
Bayesian networks are probabilistic graphical models that describe the joint probability distribu-
tion among uncertainties in terms of their conditional probabilities and conditional dependence
structure (Mitchell, 1997). They are convenient for machine algorithms because they assert con-
ditional independence among uncertainties by the absence of edges (Howard, 1989). The net-
works may also be drawn graphically for production, interpretation, or alteration by human experts
(Pearl, 2000). Original applications were numerous for Bayesian networks (Pearl, 1988; Russell & Norvig, 2009; Heckerman et al., 1995) and they have since only become more popular as data
availability has increased and advances in computing power have made them able to tackle larger
problem sets (Pearl & Mackenzie, 2018). Objections to their use generally revolve around the as-
sertions they may make about causality when only correlation may exist. Most of these concerns
may be addressed by carefully following the procedures outlined by Pearl (2000).
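As a brief illustration of what such a network encodes (standard notation, not tied to any particular model in this research), a Bayesian network over uncertainties X_1, ..., X_n factors their joint distribution through the graph:

\[ P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P\big(x_i \mid \mathrm{pa}(x_i)\big), \]

where pa(x_i) denotes the parents of X_i in the directed graph; an absent edge is precisely the assertion of conditional independence noted above.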
Using Maximum Entropy Methods for Joint Probabilistic Estimates
Maximum entropy methods present a way to assign probabilities based on less than complete
information (Jaynes, 1968), a situation we find ourselves in with counterfactual estimation. The
maximum entropy principle ensures that our assessment of a probability distribution assumes no
more than is present in the available data. It is scalable in that at a minimum, we may assume
independence among uncertainties when only their marginal distributions are available. At the
other end of the spectrum, maximum entropy methods may define a minimally-informative joint
distribution among several dependent uncertainties. They may also be approximated at any point
in between, for example, with pairwise assessments of conditional dependence. Further, maximum
entropy distributions may be determined with minimal information, such as means (Smith, 1993)
or upper and lower bounds (Abbas, 2005) and can be refined with additional information from
data or from human experts (Abbas, 2004). Such approximations and their induced error are the
specific topic addressed in (Abbas, 2006) which considers a graphical model input and application
to a decision problem.
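To state the principle concretely (a generic formulation, with the functions f_k and constants c_k standing in for whatever marginal, moment, or pairwise information is available), the maximum entropy distribution solves

\[ \max_{p} \; -\sum_{x} p(x) \log p(x) \quad \text{subject to} \quad \sum_{x} p(x) = 1 \;\; \text{and} \;\; \sum_{x} p(x) f_k(x) = c_k, \quad k = 1, \ldots, K, \]

so the assessed joint distribution honors the known constraints while assuming nothing more.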
Using Copulas for Joint Probabilistic Estimates
Some of the approaches addressed above are most applicable to a low number of discrete outcomes
and may become unwieldy when uncertainties are discretized beyond high/medium/low values or
when unique probability distributions exist for individual uncertainties or their pairwise joint dis-
tributions. Using the diagram and tree reversal techniques in Howard and Abbas (2016) helps to
simplify the process by codifying the structure of the relevance diagram in a tree whose order of
assessment makes the most sense to the forecaster. However, this approach also becomes compli-
cated when there are more than three uncertainty nodes. Further, the aforementioned conditional
distributions are difficult to derive analytically, especially when their marginal distributions are of
different families. To overcome these limitations, a copula structure may be useful to describe the
dependence among marginal uncertainty distributions as outlined by Clemen and Reilly (1999). A
copula function is independent of the marginal distributions it links. It combines them to describe
their multivariate joint distribution (Nelsen, 2006). A normal copula function is ideal for many
of our mentioned applications because the dependence structure is simple to define by eliciting
measures of correlation from forecasters and allows us to use the more efficient tree modeling ap-
proach outlined by Wang and Dyer (2012). A normal copula function is appropriate if we assume
linear relationships among marginal distributions and no tail dependencies. Many approximation
techniques are also applicable to copula functions which makes them an attractive tool for this
research (Durante & Sempi, 2016). Further, the pair-wise correlations assessments required for
normal copula function are a good approximation of the joint distribution, as demonstrated by
satisfactory results presented by Abbas (2006).
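A minimal sampling sketch of this construction (assuming SciPy-style marginal distributions and a valid correlation matrix; the helper name and the example marginals are mine):

```python
import numpy as np
from scipy import stats

def sample_gaussian_copula(corr, marginals, n_samples=10_000, rng=None):
    """Draw joint samples whose dependence follows a normal (Gaussian) copula.

    corr      : pairwise correlation matrix among the uncertainties
    marginals : list of frozen scipy.stats distributions, one per uncertainty
    """
    rng = np.random.default_rng() if rng is None else rng
    d = len(marginals)
    # Correlated standard normals carry the dependence structure
    z = rng.multivariate_normal(mean=np.zeros(d), cov=corr, size=n_samples)
    # Map to uniforms, then through each marginal's inverse CDF
    u = stats.norm.cdf(z)
    return np.column_stack([m.ppf(u[:, j]) for j, m in enumerate(marginals)])

# Example: a log-normal and a normal marginal linked with correlation 0.6
samples = sample_gaussian_copula(
    corr=np.array([[1.0, 0.6], [0.6, 1.0]]),
    marginals=[stats.lognorm(s=0.25, scale=60.0), stats.norm(loc=1300.0, scale=40.0)],
)
```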
The above three methods all have potential strengths and weaknesses based on the characteris-
tics of the estimation problem: availability of data, diversity of data, availability and accuracy of
human experts, non-linearities such as tail dependence and feedback loops, etc. It is a goal of this
research to uncover these strengths and weaknesses in order to provide a comprehensive method-
ology for using the methods, either individually or in combination (as in Guo et al. (2018), Kong
et al. (2018), and Scutari (2018)), to help solve counterfactual estimation problems.
1.4 Overview of the Dissertation
This chapter provided a map of the problems associated with the complexities of probabilistic es-
timation tasks for contemporary policy-type decisions; the next three will propose blueprints for
their solutions. Chapter 2 introduces the data sets used in the dissertation. It provides an overview
of using human judgements and data to make estimates about the future values of quantitative un-
certainties. Chapter 3 proposes a novel methodology to adjust estimates for bias, combine them
to form a single estimate, and adjust the final estimate based on its intended use. Chapter 4 ex-
pands the research to joint probabilities and provides a methodology to estimate them. Chapter 5
concludes and discusses considerations for putting the lessons uncovered in this dissertation into
practice.
Chapter 2
Eliciting and Using Machine- and Human-Generated Probabilistic Forecasts
Overview
Probabilistic forecasts use expert judgements and/or available data to assign probabilities to po-
tential future outcomes of uncertainties of interest. This chapter introduces the data sets used in
the dissertation, discusses how to use simulation to select machine-model parameters and gain
intuition about the uncertainty in question, and compares the accuracy of human- and machine-
generated forecasts of the value of gold and oil commodities.
2.1 Introduction to Data Sets Used in the Dissertation
2.1.1 Hybrid Forecasting Competition (HFC)
The Intelligence Advanced Research Projects Activity (IARPA) Hybrid Forecasting Competition
(HFC) develops geopolitical forecasting systems that integrate human and machine forecasting
components to increase accuracy, flexibility, and scalability (“Hybrid Forecasting Competition (HFC)”, 2016). The University of Southern Califor-
nia (USC) Information Sciences Institute (ISI) competed in the HFC by dynamically aggregating
novice human forecasts with machine models to produce probability forecasts of events occurring
or of realized quantities (such as the number of events to occur in a time period or the price of
an asset at a certain date) (https://viterbischool.usc.edu/news/2017/10/usc-isi-leads-iarpa-contract-developing-hybrid-forecasting-systems/); see Abeliuk et al. (2020), Huang et al. (2020), and Morstatter et al.
(2019). In the 2018 season of the HFC, there were 41 questions that involved forecasting the prob-
ability that a numeric quantity would land in each of five ordinal bins. An average of 49 volunteer
forecasters made an average of 95 forecasts per question (see Table 2.1). The ISI team also made
machine-generated forecasts for each of the 41 questions using automatically scraped historic data
and an autoregressive integrated moving average (ARIMA) model. The original intent of these
machine forecasts was to either replace or augment the human judges to improve accuracy. Fore-
casters could provide a new estimate each day; and on days without a new estimate, the judge’s
most recent forecast is repeated for scoring and aggregation purposes. One typical question, for
example, asked respondents on 14 March 2018 to assign probabilities that the price of gold would
be 1.) less than $1221 per ounce, 2.) between $1221 and $1303, inclusive, 3.) more than $1303
but less than $1374, 4.) between $1374 and $1456, inclusive, or 5.) more than $1456 in USD on
26 April 2018. This problem was depicted in Figure 1.5 where the bottom right chart displays the
linear average of all novice predictions during the 28 day forecasting period. The actual price on
26 April was $1321.65, making the middle bin the correct response. Respondents may have been
experts in the field of commodity price forecasting or have had very little experience in such an endeavor. They were randomly assigned to one of three conditions: no immediate access to relevant
data (Condition A), presented with a graph depicting historical values of the relevant commodity
price (Condition B), or presented with a graph of historical data and a machine-generated forecast
for the timeframe in question (Condition C).
(Machine-generated forecasts were the result of an autoregressive integrated moving average (ARIMA) model; Hyndman & Khandakar, 2008.) The machine forecasts are used in the current re-
search as a possible baseline for calibration and/or recalibration. Obviously some questions had
very accurate machine models and some did not. Each condition implies different dependencies
among forecasts due to shared information. Respondents may also have been either volunteers
or contracted through Amazon's Mechanical Turk program. “Turkers” earned $16 for two hours' worth of forecasting each week regardless of accuracy. In order to conduct consistent analysis across all possible calibration, aggregation, and recalibration strategies, we will avoid using zero probabilities altogether by altering them to 0.01 and re-normalizing the judgement as by Satopää, Jensen, et al. (2014). This research does not include an assessment of the potential strengths, bi-
ases, or motivations of the forecasters and so does not take any demographic or performance data
into account when aggregating human forecasts.
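As a concrete sketch of the zero-probability adjustment described above (the helper name is mine):

```python
import numpy as np

def floor_and_renormalize(forecast, floor=0.01):
    """Replace zero bin probabilities with a small floor and renormalize to sum to 1."""
    adjusted = np.maximum(np.asarray(forecast, dtype=float), floor)
    return adjusted / adjusted.sum()

# Example: a judge who put all probability on the middle of five bins
print(floor_and_renormalize([0.0, 0.0, 1.0, 0.0, 0.0]))
# -> approximately [0.0096, 0.0096, 0.9615, 0.0096, 0.0096]
```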
2.1.2 Forecasting Counterfactuals in Uncontrolled Settings (FOCUS)
The Intelligence Advanced Research Projects Activity (IARPA) sponsors a forecasting tournament
called Forecasting Counterfactuals in Uncontrolled Settings (FOCUS) to develop and evaluate
approaches to counterfactual forecasting that will help organizations conduct policy analysis and
lessons learned activities in complex domains like geopolitical analysis (“Forecasting Counterfactuals in Uncontrolled Settings (FOCUS)”, 2018). A team led by researchers
at the University of Pennsylvania (UPenn) competes by eliciting forecasts from domain experts,
“intuitive statisticians”, and “superforecasters”.[5]
Through digital collaboration, the team members
share their interpretations of the system about which they are forecasting, their initial forecasts, and
[5] “Intuitive statistician” is a term used by the UPenn team to describe a forecaster who is familiar with statistics principles and uses that familiarity to shape his or her forecasts. Professional poker players are known to be intuitive statisticians. A “superforecaster” is a person who has demonstrated his ability to make accurate predictions by performing exceptionally well in forecast competitions. See Tetlock and Gardner (2015) for more on what makes a forecaster super.
Table 2.1: Hybrid Forecasting Tournament Questions
Id  Description  Duration (days)  Number of Forecasters  Number of Forecasts
177 Battle deaths in Afghanistan 54 63 92
178 Battle deaths in Yemen 85 63 132
184 Number of worldwide earthquakes 24 184 426
185 Sugar price index 115 75 152
186 Dairy price index 78 42 81
188 Won to Dollar exchange rate 113 84 249
189 Closing price of gold 42 108 239
192 Conflict events in Palestine 24 38 46
194 Libya crude oil production 85 120 207
209 Bering Sea ice extent 27 57 149
220 Number of deaths by Boko Haram 132 27 76
222 Number of data breaches 40 40 73
225 Approval rate of Japanese cabinet 94 21 40
226 Baffin Bay Gulf sea ice extent 20 32 46
235 Benin consumer price index change 33 21 32
243 Egypt consumer price index change 50 21 36
244 Number of worldwide earthquakes 43 33 144
264 Nigerian crude oil production 111 24 66
279 Closing price of Brent crude oil 42 44 183
301 Number of data breaches 29 23 49
316 Venezuela crude oil production 83 25 53
325 Iraq crude oil production 15 23 31
337 Number of deaths by Boko Haram 38 11 23
344 Number of data breaches 62 18 27
345 Closing price of Swiss market index 28 24 75
351 Number of battle deaths in Central African Republic 86 14 29
352 Number of worldwide earthquakes 55 28 123
360 Peso to Dollar exchange rate 22 27 76
365 Closing price of UK’s FTSE 100 Index 62 25 131
366 Closing price of Brent crude oil 19 29 112
371 Number of battle deaths in South Sudan 65 13 31
372 Closing price of gold 29 32 104
373 Saudi Arabia crude oil production 34 23 42
378 Number of battle deaths in Ethiopia 51 15 35
380 Closing price of Brent crude oil 19 33 88
383 Closing price of France’s CAC 40 Index 43 25 81
384 FAO Cereal Price Index 37 17 41
385 Closing price of EURO STOXX 50 Index 34 24 74
388 Closing price of Brent crude oil 35 26 98
396 Closing price of gold 6 24 49
397 Closing price of Japan’s Nikkei 225 Index 6 15 40
their rationales for those forecasts. Then through several rounds of deliberation, they refine their
forecasts based on their teammates’ rationales. The team then submits a single aggregate forecast
to IARPA for grading.
IARPA uses computer-simulated scenarios to define the forecast problems and offers a limited
data set that lets forecasters observe outputs from simulated runs under baseline (not counterfac-
tual) conditions. The data set provides enough context for forecasters to get a feel for the workings
of the system of interest and the interrelatedness of variables, but not enough to build an accurate
forecasting model. IARPA wants to avoid teams building machine learning models that exploit
available data at the expense of human analytic methods and tradecraft. For these reasons, I pursue
a hybrid modeling method that allows basic parameterization from any available data but relies
heavily on human estimation of which variables are relevant, how they are related, and how they
will change as a result of a counterfactual condition.
The example set of questions analyzed in this dissertation pertains to the simulated results of a computer game called “Civilization V.” Civilization V is a multiplayer strategy game where
players are leaders of civilizations and each turn make decisions about how to use their resources
to advance their own civilizations. They interact with other civilizations by way of treaties, wars,
trade, etc. Instead of human players, however, IARPA uses artificially intelligent agents to play
the scenarios. This way, they can play 100 runs of a scenario and get as a result a distribution of
outcomes. The base scenario is presented as a “world report” to the FOCUS teams (top right image
in Figure 2.1). It outlines what happened in the “factual” play of the game and provides statistics
for each civilization at play in the game (for example, how many resources they had as the game
progressed, when wars occurred, etc.). This report was also accompanied by a small data set of the
results of similar games to give participants a better idea of the possible ranges of values (bottom
right image in Figure 2.1). For example, some historical values of wealth of civilizations at 100
turns into the game and 500 turns into the game can give participants a feel for how wealth gen-
erally accumulates during play. Then IARPA introduces a counterfactual antecedent that changes
the original scenario in some way. For example, one civilization may be given additional resources
at turn 100 of the game, or another civilization may start the game off in a desert instead of lush
woodlands. This scenario is played 100 times with the artificially intelligent agents leading the civ-
ilizations. Participants are then asked to estimate the games’ outcomes. For example, participants
may be asked to estimate the happiness levels of the civilization that now started in the desert. Par-
ticipants know the actual (factual) happiness level for the civilization from the world report (which
is based on the factual lush woodlands scenario) and must think through how the antecedent will
affect the game play. This particular scenario is depicted later in Chapter 4.
Figure 2.1: The FOCUS Competition and Civilization V computer game.
The picture is a screenshot of the Civilization V computer game that shows the playing field. In the middle is a new
civilization (a settlement) with nearby livestock and a small army. The top right image depicts the World Report that
contains factual data- data from the original scenario. The lower right image represents additional data provided to
participants about the unfolding of similar games.
A unique aspect of this data set is that the “answers” are not binary as in the HFC data set.
The resolutions are the result of simulations and thus form a distribution of results rather than one
correct bin among otherwise incorrect bins as with the HFC data.
2.1.3 The Tetris Survey
For this experiment I sought an uncertainty that would meet the following desiderata:
• Anyone could provide a sound judgement; no expertise required.
• No one could look up the answer on the internet.
• The correct response would have a non-binary distribution (unlike the forecasting competi-
tion data sets where the answer occurs in one bin only).
To meet these desiderata, I decided to build a survey about a simulation based on the Tetris
computer game. I built a game simulation in NetLogo (Wilensky, 1999a) based on an existing
Tetris game model (Wilensky, 2001). I altered aspects of the game, such as which pieces may
fall, where they fall from (as opposed to only the center in the actual Tetris game), and how much
debris is at the bottom of the playing field. Then I played the game (ran the simulation) 1000 times
to observe how many blocks fall before the game ends (when the stacked blocks reach the top of
the playing field). I built a survey using Qualtrics to ask respondents about the outcomes of the
simulations. I showed a short video that explained the simulation and elicited the probabilities that
outcomes would fall in certain intervals. Figure 2.2 shows a screenshot of the survey; additional
screenshots are shown in the Appendix. The survey was declared exempt by the USC Institutional
Review Board following examination of a packet that detailed how data would be collected, stored,
and used. No personally identifiable information was collected. Demographic data was limited to
multiple choice questions about the respondent’s familiarity with Tetris and probability.
Participants Participation was voluntary and without compensation. I solicited survey respon-
dents through three sources. First, I posted the survey on “Psychological Research on the Net”
hosted by Hanover College (https://psych.hanover.edu/research/exponnet.html). Second, I solicited volunteers through an email to the students in the Industrial and Systems Engineering department at USC. Third, I emailed the survey to family and friends who I thought would be interested in it. The survey went live on 17 February 2021. By 17 March, there were a total of 89 respondents who at least started the survey; 51 of them completed the survey.

Figure 2.2: Screenshot of the Tetris Survey.
Respondents watched a short video about the simulation then estimated the probability that the result of the simulation would fall into each of five intervals (bins).
2.1.4 Other Publicly Available Data Sets
Forecasting the prices of gold and oil presents applicable problems to study for several reasons: they
have abundantly available historical data making them suitable for studying the effectiveness of
time-series forecasting methods, they are interesting to decision-makers in several disciplines, they
are reactive to external conditions and therefore can present challenging forecasting problems, and
they are simple to explain in order to serve as illustrative examples of the methodologies assessed
in this research and their potential applications to other problem sets.
The London Bullion Market Association (LBMA, http://www.lbma.org.uk/) is an over-the-counter trading market for
gold and silver. It maintains the internationally regarded spot prices for precious metals and runs
an annual forecast survey of top analysts from the international precious metals market. Each year,
around 24 analysts provide their point forecasts for the annual average, high, and low prices of
precious metals. This research relies on LBMA’s daily closing spot price for an ounce of gold and
analyzes the accuracy of their collection of expert annual forecasts on the price of gold which are
available from their website.
The US Energy Information Administration (EIA, http://www.eia.gov/) collects, analyzes, and disseminates a
wide range of information and data products covering energy production, stocks, and prices. Of
particular interest to our analysis is the EIA historic collection of daily closing spot prices for Brent
crude oil. The EIA also produces regular point forecasts of the monthly averages of these prices
for up to two years in the future. These forecasts may be considered expert forecasts as the EIA
is recognized as a global leader of energy production, demand, sale, and price projections. Both
historic and projected prices are available on their website.
2.2 Using Simulation to Select Arithmetic and Geometric Brownian Motion Model Parameters
The aforementioned Brownian motion models depend only on two parameters: the drift, μ, and the diffusion, σ, as seen in rows four and five of Table 1.2. Both parameters may be estimated from previous data by stochastic volatility models (such as exponentially weighted moving average or generalized autoregressive conditional heteroscedasticity models), or more simply estimated as the mean and standard deviation of historic data. A decision maker's main choice here is how far back to look at the historic data to estimate each parameter (the lookback window).
Common rules of thumb are to use the previous 90-180 days of price data (when estimating stock
prices) or to use data from an equivalent amount of time before the current day as the forecast is
being made into the future (e.g. use 60 days of historic data when forecasting the price of gold 60
days into the future) (Hull, 2018).
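A minimal sketch of this estimation step (assuming the simple mean/standard-deviation estimate over the last m daily increments described above; the names are mine):

```python
import numpy as np

def estimate_parameters(prices, m=360, geometric=False):
    """Estimate daily drift (mu) and diffusion (sigma) from the last m closing prices.

    prices    : 1-D array of historical closing prices, oldest to newest
    m         : look-back window in trading days
    geometric : if True, use log daily returns (GBM); otherwise daily differences (BM)
    """
    window = np.asarray(prices, dtype=float)[-(m + 1):]
    increments = np.diff(np.log(window)) if geometric else np.diff(window)
    return increments.mean(), increments.std(ddof=1)
```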
We experimented with various schemes of look-back periods over various forecast horizons in
order to discover simple, easily repeatable heuristics for building Brownian motion models. We
varied the lookback period (m) from 30 to 360 days in 30 day increments in order to find patterns
of the validity of each value of m for each forecast problem. The results of these experiments
are summarized in Figure 2.3. We used data from EIA and LBMA from 2000 to 2017 which
were historic in relation to the forecasting competition windows explored in the next section. This
means that any heuristics we identified did not have the advantage of exposure to “test data” on
which the models will be compared to human forecasters.
We built 100 models for each possible set of lookback windows, each with 1000 paths of
simulated prices. We calculated the mean absolute percent error at each forecast horizon. Results
are depicted in Figure 2.3 where the darker blue pixels indicate that a BM or GBM model with
those parameters performed better than our naïve model with the same parameters. Specifically, the
upper right most pixel indicates that on average, the Brownian motion model that looked back over
days to estimate both drift and diffusion when forecasting the price of gold had an improvement of 12.1% (the BM model had a mean absolute percent error of 5.49% while the naïve model with similar parameters had 17.5% error). Each of the 100 start days was randomly chosen from January 1, 2002 through December 31, 2017.

Table 2.2: Brownian motion model parameters
Results indicate that the look-back period for estimating the diffusion of the price had very
little impact on the accuracy of a probabilistic forecast (most large squares in Figure 2.3 do not
vary much in response (color intensity) along their vertical axes). This is to be expected as we
measured error from a point forecast, not over a binned probabilistic distribution.
The best performing models at a forecast horizon of 30 days or less (the range of the HFC
tournament questions) were the arithmetic and geometric BM models that used the last 360 days’
worth of daily prices to estimate drift and diffusion parameters. For the remainder of this paper the
BM and GBM models will use these selection rules for estimating drift and diffusion parameters
(see Table 2.2). The Naïve model will no longer be considered in this analysis because there are
no conditions that warrant its use in the human forecasting problems of the next section.
2.3 Case Study: Predicting Commodity Prices with Brownian
Motion Models and Human Forecasts
The research presented in this section explores the efficacy of arithmetic and geometric Brownian
motion models to forecast prices of commodities: crude oil and gold. We compare the performance
of these models against each other and against expert and novice human forecasters in order to
provide decision-makers with insights about making, aggregating, and integrating forecasts into
their decision making. This research presents repeatable schemas to estimate model parameters, aggregate simulated forecasts, aggregate human forecasts, generate probabilistic forecasts, and make relevant comparisons of the accuracies of different forecasting strategies using simulation. Using Brier score to measure error of probabilistic forecasts and percent error to measure error of point forecasts, Brownian motion models are shown to be most effective on short-term forecasts (within 13 days) where they regularly outperform novice human forecasters. Experts, however, tend to outperform the models on forecasts up to a year in the future.

Figure 2.3: Improvement in mean absolute percent error over Naïve model
We varied the lookback period (m) from 30 to 360 days in 30 day increments and simulated forecasts in order to find patterns of the validity of different values of m for each forecast problem.
Decision makers must wrestle with the anticipated effects of uncertainties in their thinking
about a problem. They may seek input from human advisors or machine-generated forecasting
models in order to make better decisions based on more accurate forecasts. Assimilating forecasts
from different sources involves understanding the strengths of each and effective ways to aggre-
gate them. While methods abound for forecasting uncertain quantities in the future, this section
assesses only four types: arithmetic and geometric Brownian motion models, and the aggregation
of expert and novice human predictions. Each of these methods is simple to explain, making
them attractive to decision makers who may be uncomfortable with “black-box” type forecasting
schemes (Prahl & van Swol, 2017). And simulation makes their parameterization and results easy
to intuit. Furthermore, accepting any of these methods requires accepting only benign assumptions
about the way the underlying quantity may fluctuate. In this research, we limit our scope to fore-
casting the prices of crude oil and gold at different increments of time into the future, up to one
year.
Our intent in this research is to use the simplest methods possible to make, aggregate, and
compare forecasts in order to provide decision makers with insights into how they may want to
create schemes to integrate forecasts into decision-making, hybridize their forecasts, and allocate
resources among expert advice, crowd-sourced advice, and machine models.
In this research, we simulate Brownian Motion forecasts on historic data (the closing prices of
an ounce of gold and a barrel of Brent crude oil in US Dollars) using a range of possible parameters
(drift and diffusion) in order to generate heuristics for selecting parameters on future forecasting
problems. Next, we compare these models to novice and expert human forecasts on either point
values or distributions of values. We define expert forecasters as humans or organizations with
unique qualifications to predict quantities of interest (experienced gold traders for the price of
gold and the US Energy Information Administration for the price of crude oil). Novice forecast-
ers are those volunteers who participated in a forecasting tournament making predictions about
geopolitical events and the prices of commodities or economic indices. They may or may not have
domain knowledge or experience making forecasts in general. Their forecasts then may be viewed
as crowd-sourced predictions where the decision maker does not have a ready way to assess their
individual qualities. Crowd predictions have been shown to be extremely useful in other research
such as that by Surowiecki (2005) and O'Leary (2017). We aggregate these forecasts by linear aver-
aging and compare the forecasts by their error from the true values of the commodities. Error from
point estimates is measured in the mean absolute percent error (average percent deviation from
the actual value over all forecasts made) and error on probabilistic forecasts is measured using the
ordinal version of the Brier score, a proper scoring function as defined by Gneiting and Raftery
(2007).
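For concreteness, the two error measures might be computed as follows (a sketch with my own helper names; the ordinal Brier score shown uses the common cumulative-split formulation, which may differ in scaling constants from the exact tournament implementation):

```python
import numpy as np

def mean_absolute_percent_error(point_forecasts, actual):
    """Average percent deviation of point forecasts from the realized value."""
    point_forecasts = np.asarray(point_forecasts, dtype=float)
    return float(np.mean(np.abs(point_forecasts - actual)) / abs(actual) * 100.0)

def ordinal_brier_score(bin_probs, outcome_bin):
    """Ordinal Brier score for a forecast over K ordered bins.

    bin_probs   : forecast probabilities over the K ordered bins (sums to 1)
    outcome_bin : index of the bin in which the quantity actually resolved

    Averages binary Brier scores over the K-1 cumulative splits of the ordering,
    so probability placed far from the realized bin is penalized more heavily.
    """
    p = np.asarray(bin_probs, dtype=float)
    k = len(p)
    cum_forecast = np.cumsum(p)[:-1]                       # P(outcome at or below split)
    cum_outcome = (np.arange(k - 1) >= outcome_bin) * 1.0  # 1 if it resolved at/below split
    return float(np.mean(2.0 * (cum_forecast - cum_outcome) ** 2))
```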
2.3.1 Comparing Brownian Motion Models to Novice Human Forecasts
As confirmed by O’Hagan et al. (2006), an aggregation of human forecasts is usually difficult
to beat with machine models, and results can vary wildly across similar studies because there
are several ways to measure accuracy of forecasts. We chose mean absolute percent error and
ordinal Brier score as the simplest and most bias-free scoring methods for point and bin forecasts,
respectively. There are still, however, multiple ways to apply these scoring methods. Because we
seek to provide a decision maker with insights applicable to decisions made in real time and relying
on forecasts, we compare methods at each day (horizon,h) by scoring the aggregations of forecasts
made on that day. Humans could make a new forecast each day (if several were made on the same
day, we considered only the last one). Brownian motion models made updated predictions each
day over the life of a tournament question as one more day’s data became available. It is important
to note that because the Brownian motion models are stochastic in nature, another run of any
simulation will produce different results. The results presented here are aggregations over 10 such
runs.
Human forecasts may of course be aggregated in more intelligent ways, such as weighting
humans by their past performance, potential biases, claimed areas of expertise, distance from the
mean forecast, or changes in forecasts over time (Satopää, Jensen, et al., 2014; Turner et al., 2014).
It is an overarching goal of this research, however, to find ways to best aggregate human responses
when little is known about their past performance, making the methods here generalizable to deci-
sion makers using online anonymous opinion pools or more general solicitations of advice.
Figure 2.4 depicts the difference in Brier score from the Brownian motion models when com-
pared to aggregated novice human forecasts. Gold results are on the left and crude oil on the right.
The top half of the figure depicts the improvement in Brier score by the BM models over time and
on each of the four gold questions and five oil questions of the 2018 HFC. Points in time where the
BM models outperformed the aggregated human forecasts are marked by dots. A smoothed sum-
mary of the average improvements over time is depicted in the bottom half of the figure. Note that
for both groups of questions, although all sets of forecasts (BM, GBM, and Humans) improved as
they got closer to the close date (and more data became available), the BM models tend to eclipse
the aggregated humans in accuracy at 13 days out from the close (resolved) date of the question.
This point in time is indicated graphically where the blue (BM) and red (GBM) lines cross up over
the black line at zero which represents the performance of the aggregated human forecasts.
Key Finding 2.1. For short range gold and oil forecasts, Brownian Motion models gener-
ally outperform novice human forecasters.
Aggregated human probabilistic forecasts of the price of gold and oil in the 2018 HFC
were more accurate than the Brownian motion models early on in the life of the question.
Both arithmetic and geometric Brownian motion models built from historic price data had less
average error, however, as the resolution date approached. Brownian Motion models tend to
eclipse the aggregated humans in accuracy at 13 days out from the close (resolved) date of the
question.
Averaged over all four Gold IFPs (individual forecasting problems) and summed over all days (h), the BM model improved
(lowered) Brier score by 0.00131 and the GBM improved it by 0.00029 (Improvement of 0.00772
and 0.00535, respectively over the five oil forecasts). These are small improvements, to be sure, but
demonstrate that a very simple machine model may in some cases replace the aggregated forecasts
of hundreds of humans who could otherwise work on more difficult problems.
2.3.2 Comparing Brownian Motion Models to Expert Human Forecasts
The LBMA annually publishes a collection of expert forecasts on the price of gold for the coming
year. The forecasts include predicted minimum, mean, and maximum prices of gold. Table 2.3 compares the error of these expert forecasts and of BM/GBM models for the last three years.

Table 2.3: Forecasts and Number of Experts outperformed by Machine Models on LBMA annual gold survey (by percent error, USD price of gold)

In order to provide future estimates for the prices of gold, we created our own pool of ‘experts.’ For
each Brownian motion method, we built ten ‘experts’ comprised of 10 simulated paths each of
forecasts. We used so few paths to temper the minimum and maximum yearly forecasts (assuming
that additional paths may make the mean estimate more accurate but at the same time overestimate
the forecasted maximum value and underestimate the forecasted minimum value due to the increas-
ing volatility over time of BM models). The method did not temper as much as we expected and
the minimum or maximum model forecasts only outperformed the human experts in 2016 when
the price of gold went unexpectedly high. The machine model estimates for the mean price of gold
outperformed 3 or 4 human experts in both 2016 and 2017, and beat 18 of the 24 human experts in
2018.
Key Finding 2.2. For long range gold forecasts, Geometric Brownian Motion models
generally outperform only a small number of experts.
Professional forecasters predicted gold’s minimum, average, and maximum price for the
upcoming year. GBM models tended to overestimate the actual variance of the price (under-
estimating minimums and overestimating maximums), and human experts tended to be better
at predicting future trends in the average price.
Table 2.4: Brent Crude Oil Monthly Forecast Results

The EIA publishes a monthly forecast of the average monthly Brent closing price of a barrel of crude oil for the next year. We compared the mean absolute percent error of these expert monthly
mean forecasts to those of our BM and GBM models. Each model used the last 360 days of daily
returns to estimate 1000 paths of the price of a barrel of crude oil. It is important to note here
that some of the BM models produced negative price estimates, reinforcing that GBM models are
generally better suited for predicting asset prices. It is also worth noting that oil prices tend to
have a strong seasonal component, making them better suited for a time-series forecasting method
that can take this valuable pattern into account (Bennett & Hugen, 2016). Results are summarized
in Table 2.4 where we can see the Brownian motion model outperformed the EIA forecasts over
2018.
Key Finding 2.3. For long range oil forecasts, Brownian Motion models generally failed
to outperform expert forecasts.
Brownian motion models do not consider seasonal fluctuations in price while the EIA does.
This was a large contributor to the success of their monthly forecasts in Brent crude oil price.
2.4 Contributions and Conclusions
These results showed Brownian motion models to outperform aggregations of novice human fore-
casts on the prices of gold and oil when the forecasting horizon was less than 13 days into the
future. For forecasting problems up to a year in the future, experts generally outperformed the
machine models, but Brownian motion forecasts clearly beat both groups of experts in 2018. This
research demonstrated the applicability of using simple schemes to develop intuition about how to
use machine models, which models and parameters to select, and in general, how much stock to
put in forecasts from machines, crowds, and experts.
Brownian motion forecasts may be most helpful for more complex questions in forecasting
tournaments which ask questions along the lines of “when will the price of one commodity first
have a certain relationship with the price of another commodity?” (for example, when will the price
of a ton of cobalt exceed that of 25 ounces of gold?) Because the question is complex, meaning that
it involves the relationship of two commodities, and forecasting when that relationship will first
occur, simulation may be the most appropriate way to forecast it. Brownian motion models lend
themselves to such simulation and for helping develop intuition about the nature of the question
and the bounds of its answers.
When framed as a resource-allocation problem, the results presented in this research may help
make decisions about how much to invest in a forecast source. Imagine if an expert forecast costs
$100, a novice costs $1, and a machine forecast costs $0.01. By comparing the performance of
these forecasts on previous problems, we can assess their value to the overall decision and discern
when it is appropriate to invest in each for a particular problem.
Figure 2.4: Brier score improvement from linear average of novice human forecasters by forecast
horizon
Bottom charts are smoothed summaries of improvements averaged over each forecasting question depicted in the top
plots.
Chapter 3
Calibrating and Combining Probabilistic
Estimates
Overview
This research proposes a three phase scheme to improve accuracy of crowd probability forecasts.
First, individual expert judgments are calibrated to adjust for systemic biases, then aggregated
to arrive at a single crowd judgement, and finally recalibrated to produce an appropriate fore-
cast. We propose a new transformation function that is linear in log odds for both the calibration
and recalibration phases. We demonstrate that this new function allows for calibration of multi-
nomial probabilities with respect to a baseline probability distribution of choice. The proposed
aggregation function generalizes multiple forms, including linear and geometric opinion pools.
The Calibrate-Aggregate-Recalibrate scheme is demonstrated on data from a recent forecasting
competition. Importantly, this research extends the binary probability weighting function to the
multinomial case, allows for incorporation of a baseline distribution, generalizes multiple opinion
pools, and demonstrates the impact of the three phase scheme on forecast accuracy.
3.1 Calibrate-Aggregate-Recalibrate (C-A-R) Methodology
The fundamental contributions of the current research are to bring additional flexibility to existing
methods. Our three part scheme allows for different orders of operations and specific tailoring
to the intended use of the forecasts. Our proposed calibration function allows for multinomial
forecast adjustments and incorporates a baseline probability distribution while reducing the total
number of function parameters. Our proposed aggregation function provides a continuum of ag-
gregation options with a single parameter. We demonstrate the value of this enhanced flexibility
by analyzing performance on actual forecast competition data and showing that forecast error is
reduced using scheme settings uniquely possible with our approach (for example: calibration both
before and after aggregation, aggregation somewhere between linear and geometric opinion pools,
or calibration with respect to an informed baseline probability distribution).
Figure 3.1: The decision maker’s toolkit
The top of this figure depicts the old tools available (from existing literature) to decision makers for calibrating and
combining probabilistic estimates. They can either calibrate before or after aggregation. They can use the linear or
geometric opinion pools. And while calibrating, they can de-extremize towards the uniform distribution or extremize
towards a single interval bin. The C-A-R methodology increases the options available to a decision maker (depicted
in the bottom of the figure). They can calibrate with respect to a baseline distribution of choice both before and after
aggregation. Aggregation can take infinite forms including the most popular opinion pools. Importantly, this research
also provides a sort of “quick-start guide” to help the decision maker through using the improved toolkit.
Stew Analogy Consider the three stages a chef may take for cooking a stew: ingredient prepara-
tion, ingredient combination, and seasoning. During ingredient preparation, the chef removes un-
wanted flavors (say, by peeling carrots) and preserves desired ones (by browning meat or blanching
vegetables). When the chef combines the ingredients, they become a stew, no longer individual
flavors. He must imagine the desired flavor characteristics of the final product to make decisions
about the proportions and timings of the ingredients to avoid over- or under-representation of a
particular ingredient’s flavor. As the stew simmers, the chef may enhance flavor by adding salt or
a bay leaf which do not provide a distinct flavor on their own. He may also add a specific spice
(say, cinnamon) to move the stew’s flavor towards that profile. In our forecast adjustment scheme,
the calibration stage is akin to ingredient preparation, where we adjust each individual forecast.
These adjustments depend largely on the forecasts themselves (just as preparation tasks depend
on the ingredients). The aggregation stage is akin to combining the ingredients and is largely a
function of the underlying probability distribution of the variable of interest. Recalibration, done
on the aggregated crowd forecast, is akin to the seasoning and is largely a function of the use of the
forecast. For example, we may seek accuracy if using the forecast for decision analysis or we may
seek a forecast that will give us the best score in a competition (thus favoring informativeness).
And for either case, we may adjust the crowd forecast towards or away from a baseline.¹ In this
final stage, we tailor the forecast to our use, just as we would season a stew to our taste.
We propose a general scheme for assembling potentially uncalibrated probabilistic forecasts
from a crowd into a single forecast. First the individual forecasts are calibrated, then they are
aggregated to form a single estimate, then that single estimate is recalibrated. The first calibration
addresses systemic biases on the parts of individual forecasters. The second calibration addresses
the purpose of the forecast and parameters for this recalibration will depend on whether the fore-
casts are to be used in a competition setting (where the intent is to maximize some scoring metric)
or in a decision analysis setting (where we instead seek the most accurate forecast). The calibration
steps are separated by an aggregation step that turns multiple judgements into a single estimate.
¹ Here, de-extremizing towards an uninformed (bland) baseline is akin to watering down, while extremizing away
from bland is akin to adding salt. Similarly, adding cinnamon is akin to de-extremizing towards cinnamon.
Figure 3.2: The C-A-R methodology follows similar steps as making a stew
This three-part scheme allows a decision maker (or tournament team leader) to either calibrate
then aggregate or aggregate then calibrate by nullifying either the recalibration or calibration step,
respectively. Both calibration steps may also be used to incorporate a baseline probability distri-
bution. This important characteristic allows the decision maker to update his baseline with the
judgements of the crowd, hybridize the crowd judgements with a forecast provided by a machine
model, or artificially sharpen or flatten the crowd estimate by calibrating each individual judgement
with respect to the crowd average.
Importantly, this scheme adapts popular calibration and aggregation methods designed for bi-
nary estimates to multinomial ones. The scheme may be used for ordinal (probability that the price
of gold falls into each of five (mutually exclusive and collectively exhaustive) ordered bins) or
categorical (probability that each of five candidates will win an election) probabilistic estimation
problems. This flexible scheme allows for multiple hybridization options to combine human- and
machine- generated estimates. A probabilistic estimate from a machine model may be hybridized
early in the process by using it as a baseline during calibration, through weighted aggregation
in the middle step, or later by using it as a baseline for recalibration. The Calibrate-Aggregate-
Recalibrate scheme involves three separate functions that are described below.
The next section addresses the aforementioned limitations of existing approaches and pro-
poses the more general and flexible Calibrate-Aggregate-Recalibrate scheme comprised of three
functions to overcome them in practice. We then demonstrate the performance of the Calibrate-
Aggregate-Recalibrate scheme on crowd forecasts of the prices of gold and oil from a recent geo-
political forecasting competition. Finally, we discuss implications for application of the scheme.
3.2 A New Calibration Function
We start with the linear in log odds weighting function (Equation 1.5) because of its good fits to
empirical data (Cavagnaro et al., 2013; Zhang & Maloney, 2012) and psychological tractability
(Gonzalez & Wu, 1999). We add a normalizing constant, a_j, for j from 1 to m, where m is the number of
bins, to allow its use for multinomial (non-binary) probabilities:
c(p_j) = a_j \dfrac{\delta_{i,j}\, p_j^{\gamma_i}}{\delta_{i,j}\, p_j^{\gamma_i} + (1 - p_j)^{\gamma_i}} \qquad (3.1)
The γ (discriminability) parameter controls the shape and is indexed by judge. The δ (tendency)
parameter controls the height of the function and is indexed by both judge and question bin.
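As a small illustration, the following is a minimal Python sketch of the binary weighting at the core of Equation 3.1; the function name is illustrative, and the multinomial normalizing constant a_j is deliberately left out (it is treated separately below).

```python
def lin_log_odds(p, gamma, delta):
    """Binary linear-in-log-odds weighting (core of Equation 3.1, without the
    multinomial normalizing constant a_j).

    p     : a single bin probability in [0, 1]
    gamma : discriminability (shape), indexed by judge in the text
    delta : tendency (height), indexed by judge and bin in the text
    """
    num = delta * p**gamma
    return num / (num + (1.0 - p)**gamma)
```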
Calibration Function Desiderata
While it would be helpful to assess parameters for each judge that describe his discriminability
and tendency, this is not practical in a forecasting competition setting where little is known about
the judges before the competition begins, the pool of judges changes throughout the tournament,
judges respond to forecast questions across several topics with varying levels of expertise or cal-
ibration, or the judges may be anonymous and therefore impossible to track in order to assess
performance. In such cases, we seek a flexible scheme that allows us to select calibration parame-
ters based on the characteristics of the crowd’s forecasts or of the questions themselves, rather than
on individual judges.
The discriminability characteristics of the function (from the γ parameter) are suitable for our
applications in either forecasting tournaments or decision analysis. The tendency characteristics
(from the δ parameter), however, are not. The function must be altered in such a way as to allow for
different tendencies for each bin, and for tendencies to be defined as a function of the forecasted
quantity rather than the judge himself, in order to be useful when little is known about the cali-
bration or tendency of a particular judge. In essence, we would like to incorporate any baseline
probability (p̂) as a sort of prior distribution, not just a uniform one. This would mean that a fully
de-extremized forecast would arrive at the baseline probability distribution instead of a uniform
one. Similarly, a fully extremized forecast would assign maximum probability to the bin(s) in
which the judge assigned more than the baseline. In addition, we would like to preserve any judge-
ment of conviction (p of 0 or 1 assigned to a bin) or any bin's judged forecast that matches the
baseline (including p = 1/m if we use an uninformed (uniform) baseline). These (discriminability,
tendency, and preservation) desiderata are summarized in Table 3.1.
Table 3.1: Summary of Calibration Function Desiderata.
Condition: p = 0 | p = p̂ | p = 1 | otherwise
Desired c(p): 0 (conviction preserved) | p̂ (baseline preserved) | 1 (conviction preserved) | moved towards p̂ when γ < 1 and away from p̂ when γ > 1
The authors therefore recommend the log approach (Equation 3.5) and will use
it for the duration of the current analysis. The forms resulting from the log approach (exponential
approach) are undefined when p̂ = 0 (p̂ = 1), and so if our desired baseline includes a zero (one)
probability, we will adjust it to 0.01 (0.99) and renormalize.
In order to meet our desiderata following multinomial normalization, we define the normalizing
constants

a_j = \begin{cases} 1, & \text{if } p_j = \hat{p}_j \\ \dfrac{1}{\sum_{j=1}^{n} p_j}\ \ \forall\, p_j \neq \hat{p}_j, & \text{otherwise} \end{cases} \qquad (3.6)
We now have a function that meets our desiderata for calibration (summarized in Table 3.1).
Next we demonstrate that this approach meets our desiderata and has several useful properties, as
outlined by the following propositions.
Proposition 1 (De-extremization).
As the discriminability parameter tends to 0, the calibrated forecast is de-extremized.
\lim_{\gamma \to 0} c(p_j) = a_j \dfrac{1}{1 + \left(\frac{1}{\hat{p}_j} - 1\right)} = a_j\, \hat{p}_j \qquad \forall\, \hat{p}_j,\, p_j \qquad (3.7)
Implications of Proposition 1. As we decrease γ towards 0, we adjust a judge's forecast
towards our baseline (p̂) whether it is informed or not. With an uninformed baseline (using p̂ =
1/m), the calibrated forecast tends towards a uniform distribution (see the bottom left plots in
Figure 3.3).
Example 1.a (Misinformed Crowd). If we expect the crowd (or a subset of judges) to have
inflated confidence due to misinformation, we will wish to remove potential overconfidence before
aggregation by setting parameters γ < 1. This de-extremization will alter judgements towards our
baseline, or towards the uniform distribution when p̂ = 1/m.
Example 1.b (Hybridization through calibration). As mentioned previously, we may wish
to hybridize the crowd’s judgements with a machine-generated forecast. An example of “early
Figure 3.3: Normalized Calibration Function Demonstration.
In the middle of the figure is an unaltered forecast from a judge (probabilities 0.05, 0.2, 0.25, 0.3, 0.15 assigned to the
five outcome bins). The γ parameter in Equation 3.3 is adjusted from 0 to 10 (from left to right). On the left (at γ = 0)
are the baseline probabilities: an informed baseline on top and a uniform baseline on the bottom. As γ increases past
1, the judge's forecast is extremized away from the baseline.
hybridization” is to combine each judge’s forecast with that of a machine model. One way to do
this is to move each judge's forecast towards the machine forecast by setting p̂ from the machine
forecast and setting γ < 1. This makes γ a sort of weighting parameter; the closer it is to 0,
the more we favor the machine baseline, and the closer it is to 1, the more we favor the individual
judge's estimate. In addition, as we move γ beyond 1, we amplify the uniquely human portion
of the forecast, the part that differs from the machine baseline. This may be a helpful approach
for diversifying a crowd of forecasts from judges who all had access to the same machine model’s
output.
Proposition 2 (Un-altered Judgement).
When the discriminability parameter is set to 1, the calibrated forecast remains unchanged.
\lim_{\gamma \to 1} c(p_j) = a_j \dfrac{\left(\frac{1}{\hat{p}_j} - 1\right) p_j}{\left(\frac{1}{\hat{p}_j} - 1\right) p_j + \left(\frac{1}{\hat{p}_j} - 1\right)(1 - p_j)} = a_j\, p_j \qquad \forall\, \hat{p}_j,\, p_j. \qquad (3.8)
Implications of Proposition 2. The decoupling of the discriminability parameter and the
baseline probability allows us to index γ by judge and p̂ only by bin. This reduces the total number
of necessary parameters by a factor of the number of judges. The flexibility of the calibration
function allows us to easily visualize the impact of calibration on a crowd and retroactively see the
value(s) of γ that would have provided the most accurate crowd forecast. See the middle plots in
Figure 3.3.
Figure 3.4: Calibration Function Curvature with Informed Baseline (Not Normalized).
Each chart has a different baseline distribution (p̂). When γ = 0, any probability (x axis) is transformed to the baseline
probability (denoted by the horizontal dark orange line). When γ = 1, no transformation occurs (masked by the diagonal
black identity line). As γ increases beyond 1, the function becomes more step-like. Each line is monotonic, starts at (0, 0),
ends at (1, 1), and crosses the identity line once.
Example 2.a (Trusted Judge). If we have a judge who we feel is always well-calibrated,
we may decide to set his γ = 1. Then even if we choose to set δ according to Equation 3.2, his
judgement will remain unaltered. This characteristic is a departure from the traditional linear in
log-odds function, where both discriminability and tendency parameters must be set equal to 1 in
order to avoid altering a judgement.
Example 2.b (Aggregate then Recalibrate). Although Turner et al. (2014) found little rea-
son to aggregate then recalibrate using the linear in log-odds recalibration function, this option is
available with the current scheme. This approach may be applicable for a forecasting tournament
with an anonymous crowd of judges about which little is known (no demographic or performance
data). We would certainly not feel comfortable assigning each judge a unique calibration parameter
and hence may not be able to justify any calibration prior to aggregation. An option then is to set
γ = 1 so no transformation occurs, then aggregate, then recalibrate with a non-trivial recalibration
parameter to adjust the forecast for maximizing our score (usually extremizing, but maybe not so
aggressively if we are unsure of the performance characteristics of the crowd).
Figure 3.5: Calibration Function Extremization with Informed Baseline (Not Normalized).
Each chart has a different baseline distribution (p̂). When γ = 0, any probability (colored line) is transformed to the
baseline probability. When γ = 1 (denoted by the black vertical line), judges' probabilities remain untransformed (they
all cross the black line at the same height in each chart). Judges' probabilities equal to the baseline remain unchanged
(horizontal colored lines). As γ increases beyond 1, judges' probabilities are transformed towards conviction (0 or 1).
Proposition 3 (Extremization).
As the discriminability parameter tends to infinity, the calibrated forecast is extremized.
\lim_{\gamma \to \infty} c(p_j) = \begin{cases} 1, & \text{if } p_j > \hat{p}_j \\ \hat{p}_j, & \text{if } p_j = \hat{p}_j \\ 0, & \text{if } p_j < \hat{p}_j \end{cases} \qquad (3.9)
Implications of Proposition 3. As we increase γ above 1, we adjust a judge's forecast away
from our baseline. That is, we emphasize the differences between the judge's forecast and our
baseline. In the binary case (without normalization for multinomial forecasts), subjective judg-
ments > p̂ would tend to 1 and judgments < p̂ would tend to 0 monotonically. With normalization
(Equation 3.6), however, the calibrated forecasts move towards conviction (c(p) = 0 or 1) in relation
to the other interval probabilities and thus may be non-monotonic (see, for example, the probability
assigned to Bin 4 in the bottom plots of Figure 3.3 when γ > 1).
Example 3.a (Addressing underconfidence). We often observe that judges increase their
confidence throughout the duration of a forecasting problem as they learn more about a particular
question and tournament scoring in general. We also observe that judges rarely change the bin
to which they assign the highest probability. Therefore, it is helpful to extremize judgements in
order to amplify the portion of the probability assessment that differs from uniform (when p̂ =
1/m), the decision maker's prior probability, a machine-generated estimate, or the crowd average.
Such extremization is also particularly helpful in the recalibration stage when we have already
determined the crowd’s aggregated forecast.
Example 3.b (Artificially diverse crowd). It is well known that crowd diversity can con-
tribute to accuracy (Tetlock & Gardner, 2015). Diversity may be lacking when all judges have
access to the same information (say a machine-generated model). In order to amplify the differ-
ences among the judges, we would wish to adjust judgements away from the common distribution
(in this case a baseline probability generated by the machine model). We do this by choosing
a discriminability parameter (γ) greater than one and by using the machine-generated baseline
probabilities as p̂.
Proposition 4 (Preservation of conviction).
When a judge assigns a probability equal to zero or one, the calibration function will not alter it.
c(1) = a_j \dfrac{\left(\frac{1}{\hat{p}_j} - 1\right)^{\gamma} 1^{\gamma}}{\left(\frac{1}{\hat{p}_j} - 1\right)^{\gamma} 1^{\gamma} + \left(\frac{1}{\hat{p}_j} - 1\right)(1 - 1)^{\gamma}} = 1 \qquad \forall\, \hat{p}_j,\, \gamma. \qquad (3.10)

c(0) = a_j \dfrac{\left(\frac{1}{\hat{p}_j} - 1\right)^{\gamma} 0^{\gamma}}{\left(\frac{1}{\hat{p}_j} - 1\right)^{\gamma} 0^{\gamma} + \left(\frac{1}{\hat{p}_j} - 1\right)(1 - 0)^{\gamma}} = 0 \qquad \forall\, \hat{p}_j,\, \gamma. \qquad (3.11)
Implications of Proposition 4. It is a special circumstance when a forecaster assigns certainty
or impossibility (p = 1 or p = 0) to an outcome, and this circumstance requires special treatment during
calibration.
Example 4.a (Preservation of impossibility). When judges use the subjective (degree of be-
lief) approach to assigning probabilities to possible outcomes, they attach very special meaning to
p = 0. They interpret this as impossibility and its definition is the same for any approach of proba-
bility assignment (including classical and frequentist): the outcome simply cannot happen. When
a judge makes such a probability assignment, their judgement requires no further calibration. In
practice, the decision maker should pay close attention to such extreme probability assignments
to consider what information (or misinformation) the judge has access to. If such extreme assign-
ments do not make sense to the decision maker (as is often the case with continuous valued interval
probabilities), they may wish to down weight (including a weight of 0) the judge’s contribution or
replace it with a very low probability to allow for subsequent calibration and aggregation. See the
p = 0 column of Table 3.1.
Example 4.b (Preservation of Certainty). Similarly, a judge may assign a probability of
occurrence p = 1, indicating certainty that the event will occur. The decision maker would treat
this condition similarly to the impossibility example. See the p = 1 column of Table 3.1.
Proposition 5 (Preservation of baseline).
When a judge assigns a probability equal to our baseline probability, the calibration function will
not alter it.
c(\hat{p}_j) = a_j \dfrac{\left(\frac{1}{\hat{p}_j} - 1\right)^{\gamma} \hat{p}_j^{\,\gamma}}{\left(\frac{1}{\hat{p}_j} - 1\right)^{\gamma} \hat{p}_j^{\,\gamma} + \left(\frac{1}{\hat{p}_j} - 1\right)(1 - \hat{p}_j)^{\gamma}} = \hat{p}_j \qquad \forall\, \hat{p}_j,\, \gamma. \qquad (3.12)
Implications of Proposition 5. It is a special circumstance when a forecaster lacks any con-
viction and assigns a probability equal to our baseline probability (p = p̂) to an outcome, and this
circumstance requires special treatment during calibration. See the p = p̂ column of Table 3.1.
Example 5.a (Maximum uncertainty). When we lack a prior probability, we may assume
the uniform distribution (p̂ = 1/m) because it is the one with maximum entropy (Jaynes, 1968).
When a judge also declares a probability equal to the uniform distribution, we will wish to keep
his assignment of maximum uncertainty. Note how the p = 0.2 assigned to Bin 2 in the lower half
of Figure 3.3 is preserved for all values of γ.
Example 5.b (Forecast equal to baseline probability). When we have a prior probability
(from the decision maker, a machine-generated model, or even an aggregation of the crowd), and
a judge matches it, we will wish to keep his judgement unaltered. Note how the p = 0.15 assigned
to Bin 5 in the upper half of Figure 3.3 is preserved for all values of γ.
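To make the calibration step concrete, here is a minimal Python sketch that is consistent with Propositions 1 through 5. The exact unnormalized expression and the rescaling of the non-preserved bins are my reading of the log-approach function (Equation 3.5) and of Equation 3.6, so the details should be treated as assumptions rather than a verbatim implementation.

```python
import numpy as np

def calibrate(p, p_hat, gamma):
    """Sketch of the proposed multinomial calibration step (assumed form).

    p     : array of bin probabilities from one judge (sums to 1)
    p_hat : baseline distribution over the same bins
    gamma : discriminability; gamma < 1 de-extremizes towards p_hat,
            gamma > 1 extremizes away from it (Propositions 1 and 3)
    """
    p = np.asarray(p, dtype=float)
    p_hat = np.clip(np.asarray(p_hat, dtype=float), 0.01, 0.99)  # avoid p_hat of 0 or 1 (see text)
    odds = 1.0 / p_hat - 1.0
    num = odds**gamma * p**gamma
    c = num / (num + odds * (1.0 - p)**gamma)   # unnormalized, bin by bin
    keep = np.isclose(p, p_hat)                 # bins equal to the baseline stay put (Prop. 5)
    rest = c[~keep].sum()
    if rest > 0:                                # one reading of Eq. 3.6: rescale the other bins
        c[~keep] *= (1.0 - c[keep].sum()) / rest
    return c
```

For instance, calibrate([0.05, 0.2, 0.25, 0.3, 0.15], [0.2] * 5, 0.5) pulls the Figure 3.3 forecast towards the uniform baseline, while gamma = 3 pushes it away; probabilities of 0, 1, or exactly the baseline pass through unchanged.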
3.3 Using the Generalized Aggregation Function
The aggregation function combines calibrated forecasts to arrive at a crowd judgement. We start
with a generalized mean as described by Norris (1976) and add weighting and normalizing param-
eters:
a(p_j) = b_j \left( \sum_{i=1}^{n} w_i\, p_{ij}^{\,\delta} \right)^{1/\delta}. \qquad (3.13)
Figure 3.6: Comparing Calibration Functions
Three calibration functions are displayed and potential problems are pointed out. The normalized Karmarkar form
(top) does not preserve any probabilities and maximizes only the highest probability as the extremization parameter is
increased (as we move to the right on the chart). The Tversky and Kahneman form (middle) reverses the subjective
distribution as the extremization parameter approaches 0 (as we move to the left on the chart). My proposed function
(bottom) can preserve any point through normalization (Equation 3.6).
The normalizing constant, b_j, is required when the aggregation parameter δ ≠ 1. The weight-
ing parameter w_i allows weighting judges by their expected contribution to the final crowd forecast
as described in the introduction chapter. This weighted form has been used by Abbas (2018) as a
multiattribute extension to a preference function with constant elasticity of substitution.
Proposition 6 (Single Parameter Combination).
A single parameter may be used to perform several different popular aggregations.
As δ → −∞, we get the minimum of all judgements,

\lim_{\delta \to -\infty} a(p_j) = b_j \min(p_{1j}, \ldots, p_{nj}); \qquad (3.14)
as δ → −1, we get the harmonic opinion pool,

\lim_{\delta \to -1} a(p_j) = b_j \left( \frac{\sum_{i=1}^{n} w_i}{\sum_{i=1}^{n} \frac{w_i}{p_{ij}}} \right); \qquad (3.15)
Figure 3.7: Aggregation Function Demonstration.
The top section of this figure shows probability forecasts from four different judges (colored) separately on the left
and overlayed on the right. The bottom section demonstrates the effects of the Aggregation function (Equation 3.13)
as we aggregate the above forecasts using different values of the δ parameter. Note that when δ → −∞ (set to −100 in
the bottom left pane) the aggregated forecast distribution shares the same shape as the darkest gray (minimum) portion
of the overlayed individual forecasts. And when δ → ∞ (set to 100 in the bottom right pane) the aggregated forecast
distribution shares the same shape as the outline (maximum) of the overlayed individual forecasts.
as δ → 0, we get the geometric opinion pool,

\lim_{\delta \to 0} a(p_j) = b_j \prod_{i=1}^{n} p_{ij}^{\,w_i}; \qquad (3.16)
as δ → 1, we get the linear opinion pool,

\lim_{\delta \to 1} a(p_j) = b_j \sum_{i=1}^{n} w_i\, p_{ij}; \qquad (3.17)
and as δ → ∞, we get the maximum of all judgements,

\lim_{\delta \to \infty} a(p_j) = b_j \max(p_{1j}, \ldots, p_{nj}). \qquad (3.18)
Similarly, when δ = 1/3, 1/2, 2, 3, we get the cube root mean, square root mean, quadratic mean,
and cubic mean, respectively.
Implications of Proposition 6. While the most studied of the above forms for aggregating
judgements are the geometric and linear opinion pools, infinite other aggregation options exist as
we vary the δ parameter. Having one parameter allows an easy visualization of when linear opinion
pools (often the best performing (Ariely et al., 2000; Grushka-Cockayne et al., 2017)) may be out-
performed. See Figure 3.8, which is based on simulated judgements from artificial agents with less
than complete information. Note that the least forecast error is achieved when δ is slightly
less than 1. There is limited usefulness of δ values far from zero in forecasting competition
use because it is very common for every bin of every question to have at least one judgement of
certainty (p = 1) and one judgement of impossibility (p = 0) (in our data set only one bin on
one question did not contain a single zero; 8 questions had only a single one, 17 had 2, 11 had
2, and 5 had 4). This means that the resultant aggregation will be uniform in either case (note
how Brier Score increases as δ moves farther from 0 in Figures 3.8 and 3.11). The aggregation
parameter may be chosen in such a way as to account for dependencies among forecasts by using
the same logic as the Bayesian ensembles presented by Lichtendahl et al. (2018). Judgements may
also be individually weighted for their expected contribution to the aggregate accuracy by setting
w_i according to any of the studies mentioned earlier in the Introduction section.
Figure 3.8: Aggregation Parameter Impact on Brier Score.
As the aggregation parameter (δ in Equation 3.13) is adjusted from -5 to 5 (left to right) the Brier score (y axis,
lower is better) changes. Note that the best score was achieved with a δ slightly less than one (right black vertical
line), demonstrating the usefulness of a generalized function that allows for aggregations near (but not exactly) linear
(δ = 1).
Example 6 (Hybridization through aggregation). We may hybridize (combine both machine-
and human-generated forecasts) during the aggregation step by appropriately weighting forecasts
by their source. For example, consider a linear (δ = 1) aggregation of 50 human forecasts and one
machine model where we want to evenly hybridize (half machine and half human contribution). In
this case, we set each b = 0.01 for the human forecasts and b = 0.5 for the machine forecast.
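As an illustration of the aggregation step, the following is a minimal Python sketch of the generalized function in Equation 3.13; reading b_j as a simple sum-to-one renormalization and defaulting to equal weights are assumptions, and the commented weights reproduce the even human/machine split of Example 6.

```python
import numpy as np

def aggregate(P, w=None, delta=1.0, eps=1e-12):
    """Sketch of the generalized aggregation function (weighted power mean).

    P     : (n_judges, m_bins) array of calibrated forecasts, one row per judge
    w     : judge weights (defaults to equal weights)
    delta : aggregation parameter; delta -> 0 gives the geometric pool,
            delta = 1 the linear pool, and large |delta| approaches min/max
    """
    P = np.asarray(P, dtype=float)
    n = P.shape[0]
    w = np.full(n, 1.0 / n) if w is None else np.asarray(w, dtype=float) / np.sum(w)
    if abs(delta) < 1e-8:                               # geometric opinion pool (Equation 3.16)
        agg = np.exp(w @ np.log(np.clip(P, eps, 1.0)))
    else:                                               # weighted power mean (Equation 3.13)
        agg = (w @ np.clip(P, eps, 1.0) ** delta) ** (1.0 / delta)
    return agg / agg.sum()                              # b_j read as a sum-to-one renormalization

# Example 6: even human/machine hybrid with a linear pool over 50 humans and one machine model
# weights = np.append(np.full(50, 0.01), 0.5); crowd = aggregate(P, w=weights, delta=1.0)
```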
3.4 The Role of Recalibration
The recalibration function adjusts the crowd judgement with respect to a prior judgement or a
machine-provided judgement to produce an accurate forecast.
r(p_j) = c_j \dfrac{\left(\frac{1}{\tilde{p}_j} - 1\right)^{\gamma} p_j^{\,\gamma}}{\left(\frac{1}{\tilde{p}_j} - 1\right)^{\gamma} p_j^{\,\gamma} + \left(\frac{1}{\tilde{p}_j} - 1\right)(1 - p_j)^{\gamma}} \qquad (3.19)
where the normalizing constant, c_j, is defined as in the calibration function with Equation 3.6, and
p̃_j is the baseline probability for the jth bin.
One obvious use of the recalibration stage is to extremize a crowd forecast for tournament
scoring. Many forecasting questions have a binary resolution (either the outcome occurred by the
specified date or it did not). This is true for both categorical (election outcome) and ordinal (binned
price of gold) questions. It is therefore a potential strategy of the tournament team to use the crowd
to identify the correct bin into which the question will resolve and submit a certain forecast (p = 1)
for that bin in order to maximize their score (by minimizing Brier score). When it is unnecessary
to incorporate a baseline, we can default to the simpler Karmarkar form by setting p̃_j = 0.5 for
all j. This changes the effect of extremization from increasing those probabilities greater than p̃_j
and decreasing all others to increasing only the maximum probability assigned and decreasing all
others.
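Continuing the hypothetical calibrate() and aggregate() sketches above, defaulting to the Karmarkar form during recalibration amounts to handing the aggregated forecast back to the calibration function with a constant 0.5 baseline; the parameter values here are placeholders.

```python
import numpy as np

# calibrated_forecasts: (n_judges, m_bins) array from the calibration step (placeholder name)
crowd = aggregate(calibrated_forecasts, delta=0.0)               # geometric pool of calibrated judgements
final = calibrate(crowd, np.full(crowd.shape, 0.5), gamma=2.0)   # Karmarkar-style extremization
```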
Deciding whether to incorporate a baseline, or how much to extremize during recalibration,
relies on the trade-offs a decision maker wishes to make between accuracy and informativeness
(Yaniv & Foster, 1995). In a forecasting competition, informativeness is more important than
accuracy for reducing our error and earning a better score, and so we will generally wish to extremize
aggressively during recalibration. In a decision situation, however, accuracy will generally be more
important than informativeness and achieving the most accurate shape of the forecasted distribution
will often not require such aggressive extremization.
The propositions for the recalibration function are the same as for the calibration function, but the
implications are different, as indicated by the examples below.
Revisiting the Calibration Examples
Example 1.c. (Hybridization through recalibration) Use γ < 1 to move the aggregated
forecast towards a machine-generated forecast (p̃).
Example 2.c. (Calibrate then Aggregate) Turner et al. (2014) recommended calibrating
probabilities then aggregating with no further recalibration. This scheme only requires setting
γ = 1.
Example 3.c. (Restoring Conviction to the Crowd) Hora (2004) notes that (linear) ag-
gregation of calibrated judgements may result in a miscalibrated aggregate. To combat this in
practice, we may wish to restore the crowd’s original average deviation from a known distribution.
For example, if our crowd is rather convinced (each forecaster has a high deviation from a uniform
distribution), the crowd forecast may lack such conviction following aggregation. We can then
restore the original average conviction of the crowd through recalibration by extremizing (γ > 1)
with respect to a uniform distribution (p̃ = 1/m). See Figure 3.14.
3.5 Case Study: C-A-R for a Forecasting Competition
To demonstrate the performance of the Calibrate-Aggregate-Recalibrate scheme, we use data from
a recent geo-political crowd-forecasting competition (HFC). The competition included questions
about the future prices of commodities for which there are ample available historical data and for
which machine-generated forecasts may be easily constructed. We do not know the expertise or
calibration of the judges before the competition, and so we seek a scheme that does not rely on
such measures to optimally calibrate and aggregate their forecasts.
Table 3.2: Summary of Calibration-Aggregation-Recalibration Schemes.
Calibration (γ, p̂) | Aggregation (δ) | Recalibration (γ, p̃) | Outcome r(a(c(p)))
γ < 1, p̂ = 1/m | - | - | de-extremize towards uniform (Example 1.a)
γ < 1, informed baseline | - | - | de-extremize towards baseline (Example 1.a, 1.b)
γ = 1 | - | - | un-altered judgement (Example 2.a, 2.b)
γ > 1, p̂ = 1/m | - | - | extremize away from uniform (Example 3.a, 3.b)
γ > 1, informed baseline | - | - | extremize away from baseline (Example 3.a, 3.b)
γ = 1 | δ → −∞ | γ = 1 | approaches minimum uncalibrated forecast (Equation 3.14)
γ = 1 | δ = −1 | γ = 1 | uncalibrated inverse average (Equation 3.15)
γ = 1 | δ → 0 | γ = 1 | uncalibrated geometric average (Equation 3.16)
γ = 1 | δ = 1 | γ = 1 | uncalibrated linear average (Equation 3.17)
γ = 1 | δ → ∞ | γ = 1 | approaches maximum uncalibrated forecast (Equation 3.18)
- | - | γ < 1, informed baseline | hybridization through recalibration (Example 1.c)
- | - | γ = 1 | calibrate then aggregate (Example 2.c)
- | - | γ → ∞ | maximizing informativeness
- | - | γ > 1 | restoring conviction to the crowd (Example 3.c)
In order to thoroughly test the performance of the Calibrate-Aggregate-Recalibrate scheme,
we first built a data set of possible outcomes for each tournament question on each day. We
calibrated each set of daily forecasts using each of four baselines (Karmarkar, uniform, machine,
and crowd linear average), and the γ parameter set to each of 24 different values (between 0 and 1
by 0.1, between 1 and 2 by 0.2, and whole numbers between 2 and 10). We aggregated using 41
different δ parameter values (from -2 to 2 by 0.1). Lastly, we recalibrated with similar baselines
and γ values as for the calibration step. It is important to note that we used only
one calibration parameter for all forecasters on a particular question. We assume an anonymous
crowd and therefore pursue no weighting or tailored parameters based on demographics or prior
performance. Although tedious to build, this complete data set allows us to easily answer many
hypothetical questions and compare the performance of different schemes.
We use both internal and external validation to test the effectiveness of each scheme. Internally,
we use 10-fold cross validation on our own team’s forecasts. For external validation, we test our
selected schemes and parameters on forecasts from another tournament team (forecasting on the
same questions at the same time). The 10-fold cross validation procedure randomly assigns each
of the 41 tournament questions into 10 groups (or folds) of 4 or 5 questions each. For each fold,
we hold out the questions in the fold and determine the best set of parameters on the remaining
90% of the questions. We then test the chosen parameters on the held-out questions and record the
realized error (Brier score). Then, for each tested scheme, we have a set of 10 best parameters and
10 realized errors. This allows us to display a range for each parameter and Brier score, as in the
following tables.
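A minimal sketch of this 10-fold procedure follows; the question list, the candidate parameter grid, and the scoring helper brier() are all hypothetical placeholders.

```python
import numpy as np

def cross_validate_car(questions, param_grid, brier, n_folds=10, seed=0):
    """Sketch of the 10-fold cross-validation described above.

    questions  : list of question identifiers (41 in the tournament data set)
    param_grid : list of candidate (calibration, aggregation, recalibration) settings
    brier      : hypothetical helper brier(params, qs) -> mean Brier score on questions qs
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(questions)), n_folds)   # folds of 4 or 5 questions
    results = []
    for fold in folds:
        test = [questions[i] for i in fold]
        train = [q for q in questions if q not in test]
        best = min(param_grid, key=lambda prm: brier(prm, train))      # fit on the remaining ~90%
        results.append((best, brier(best, test)))                      # error on the held-out fold
    return results                                                     # ten parameter sets and errors
```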
The performance of the Calibrate-Aggregate-Recalibrate scheme is directly dependent upon the
inherent accuracy of the un-calibrated crowd and the machine-provided baseline estimate. The four
questions that asked about the future price of Brent crude oil had comparatively accurate crowds
and poor machine models. For these questions, the most effective scheme was to de-extremize
individual forecasts towards the crowd linear average (γ < 1, p̂ = crowd average), aggregate between geometric
and linear averages (0 < δ < 1), and then extremize without a baseline (γ > 1). This specific
scheme is depicted by the black dot in Figure 3.15. Conversely, for a question about the price of
gold where we had an accurate machine forecast, the most effective scheme was to de-extremize
individual forecasts towards the machine estimate (γ < 1, p̂ = p_m), aggregate between geometric
and linear averages (0 < δ < 1), and then extremize without a baseline (γ > 1). This specific
scheme is depicted by the black dot in Figure 3.10. Rationalizations for parameter selections are
explained below.
Choosing the Calibration and Recalibration Baselines.
To demonstrate the value of incorporating a baseline estimate into our calibration and recalibration
functions, we compare our functions to Equation 1.5 when δ = 1, known as Karmarkar's log-odds
function. As demonstrated by Figure 3.9, Karmarkar recalibration is often a good choice for the
commodity price questions in this forecasting tournament.
The machine baseline is a good choice if it is known to be particularly accurate or if it is known
that all forecasters had access to the machine model when making their forecast. Figure 3.10
demonstrates the usefulness of a machine-provided baseline on tournament question number 372.
Next, we’ll choose the baseline to use for the calibration and recalibration steps when consid-
ering all 41 questions and using 10-fold cross validation. Table 3.3 displays the errors resulting
from each baseline (p̃) when we use benign calibration and aggregation parameters (γ = 1 and δ =
0 or 1, respectively) and optimal calibration and aggregation parameters. This table demonstrates
the superiority (albeit slight) of the uniform baseline over the other options for recalibration. We
will therefore use the uniform baseline for recalibration for the duration of this paper.
Key Finding 3.1. A uniform baseline distribution with the proposed calibration function
outperforms the normalized Karmarkar form.
For both calibration and recalibration phases, the uniform distribution, which is uniquely
possible with my proposed calibration function (Equation 3.5), outperforms the normalized
Karmarkar form on ordinal tournament forecasts.
Next we examine the impacts of calibration baseline. Table 3.4 shows the errors when we use
Figure 3.9: Accuracy Improvement due to Calibration and Recalibration Baselines for price of
gold and oil questions.
Calibration function baselines are shown from top to bottom, recalibration performs best overall with the Karmarkar
form (far right column of charts). The best combination of baselines and parameters (indicated by the black dot),
for the whole set of gold and oil questions, occurs with a uniform baseline for calibration and the Karmarkar
form of recalibration. This demonstrates two interesting findings: 1. There is value to incorporating baselines to our
calibration function. 2. For the purposes of forecasting competitions, where the correct answer will be binary (whether
the price of gold fell in a particular bin or it did not), Karmarkar recalibration is appropriate because it does not hold
us to the set of desiderata depicted in Table 3.1. The simpler Karmarkar form allows for extremization of only single
bin (maximization of the one with the highest probability assigned) whereas our form (Equation 3.19) may maximize
more than one bin when they are both above their baseline values.
benign and best available parameters for aggregation and recalibration. Again, the uniform base-
line is demonstrated to reduce error compared to the other baselines. This result alone justifies the
proposed calibration function as an alternative to normalized Karmarkar calibration for multino-
mial probabilities. We will therefore also use the uniform baseline for calibration for the duration
of this paper.
Tables 3.3 and 3.4 demonstrated that the uniform distribution was a good choice for a baseline
distribution. It translated well across different questions during the cross-validation because the
interval bins were selected to appear uniform when the questions first became available. If a naive
forecaster simply looked at the previous year’s worth of data to come up with probabilities, he
would likely arrive at a distribution near uniform. This was intended to keep forecasters from
identifying the correct bin early in the lifespan of a question without significant expertise in the
particular quantity.
Figure 3.10: Accuracy Improvement on a Gold question with an accurate machine baseline.
Calibration function baselines are shown from top to bottom, aggregation parameter values (δ) are from left to right.
This specific question about the price of gold had an accurate machine model. Note how the best boosts in accuracy
come from de-extremizing the individual crowd judgements towards the machine baseline (γ < 1), and then extrem-
izing (γ > 1) with the Karmarkar form following aggregation. Also note that the best improvement in Brier score
(black dot) occurred with an aggregation parameter (δ) of 0.6 – somewhere between geometric (δ = 0) and linear
(δ = 1) aggregation.
When selecting parameters with one set of forecasters and applying them
to another set, as in the external validations presented in Tables 3.5 and 3.6, however, the uniform
baseline did not perform as well. Instead, the crowd average served as a better baseline distribution
to translate across different groups of forecasters.
Key Finding 3.2. When selecting calibration and recalibration parameters on one set of
data and using them on another, the baseline distribution should be common across data
sets.
The uniform baseline led to the lowest errors during cross validation across competition
questions (interval cut points were selected to be near uniform based on previous data at the
time the question was introduced). The crowd average baseline led to the lowest errors during
external validation across different forecaster populations.
Table 3.3: Effectiveness of Uniform Recalibration Baseline based on 10-fold Cross-Validation
Baseline         Calibration      Aggregation      Recalibration    Cross-validated
distribution     Parameter (γ)    Parameter (δ)    Parameter (γ)    Brier Score
(p̃):             med (min,max)    med (min,max)    med (min,max)    mean (min,max)
uniform 1 (1,1) 1 (1,1) 1.6 (1,2) 0.1320 (0.0702,0.3270)
machine 1 (1,1) 1 (1,1) 1.9 (1,5) 0.1328 (0.0702,0.3448)
crowd 1 (1,1) 1 (1,1) 1.1 (1,2) 0.1336 (0.0702,0.3648)
Karmarkar 1 (1,1) 1 (1,1) 1.2 (1,2) 0.1376 (0.0702,0.3691)
uniform 1 (1,1) 0 (0,0) 1.2 (1,1.8) 0.1229 (0.0564,0.2889)
machine 1 (1,1) 0 (0,0) 1.1 (1,5) 0.1250 (0.0564,0.3045)
crowd 1 (1,1) 0 (0,0) 1.1 (1,2) 0.1279 (0.0564,0.3336)
Karmarkar 1 (1,1) 0 (0,0) 1.1 (1,2) 0.1261 (0.0564,0.3158)
uniform 5.04 (0.4,6) 0.16 (0.1,0.7) 0.77 (0.3,5) 0.1285 (0.0548,0.3393)
machine 3.26 (0.6,10) 0.14 (0,0.8) 1.28 (0.3,8) 0.1320 (0.0589,0.3597)
crowd 1.4 (1,2) 0.18 (0,0.6) 1.07 (0.9,2) 0.1368 (0.0647,0.3648)
Karmarkar 1.65 (0.1,3) -0.28 (-1.2,0.8) 3.3 (0.6,10) 0.1318 (0.0589,0.3594)
We use 10-fold cross validation to determine the mean Brier Score resulting from different recalibration baseline distribu-
tions. We use benign (γ = 1) calibration and linear (δ = 1) aggregation (top rows), benign (γ = 1) calibration and
geometric (δ = 0) aggregation (middle rows), and optimal calibration and aggregation (bottom rows). The median
recalibration parameters (γ) that minimized the Brier score for each training set (the other 9 folds) are also shown.
The uniform baseline (p̃ = 1/m) delivered the best accuracy (lowest Brier score in bold).
Choosing the Aggregation parameter.
Now we turn to the role of aggregation in error reduction. Table 3.7 shows the resultant errors
from using linear, geometric, and general aggregations (δ = 1, 0, or the best δ̂, respectively). We use both
benign and the best values for both calibration and recalibration. The results indicate that we are
best off using geometric aggregation unless we calibrate first and will not recalibrate, in which
case linear aggregations performs best. Although the generalized form allows us to fine tune the
aggregation and reduce error on known problems, it does not generalize to new data as well as the
geometric form.
Key Finding 3.3. The geometric opinion pool outperforms the linear opinion pool in
reducing estimation error from a crowd of judges.
Using the generalized aggregation function (Equation 3.13) with a δ parameter near zero
(near the geometric opinion pool) is the best option for combining ordinal forecasts in the
2018 HFC. The mean Brier score is minimized with a δ value slightly less than 0 and
the median Brier score is minimized with a δ value slightly more than 0. When both
calibration and recalibration steps are used, geometric aggregation always outperforms linear
aggregation in this tournament.
Table 3.4: Effectiveness of Uniform Calibration Baseline based on 10-fold Cross-Validation
Baseline         Calibration      Aggregation      Recalibration    Cross-validated
distribution     Parameter (γ)    Parameter (δ)    Parameter (γ)    Brier Score
(p̂):             med (min,max)    med (min,max)    med (min,max)    mean (min,max)
uniform 5.3 (2,10) 1 (1,1) 1 (1,1) 0.1212 (0.0635,0.2410)
machine 2.3 (2,5) 1 (1,1) 1 (1,1) 0.1213 (0.0613,0.2579)
crowd 2.8 (1,10) 1 (1,1) 1 (1,1) 0.1281 (0.0702,0.2830)
Karmarkar 1.9 (1,2) 1 (1,1) 1 (1,1) 0.1218 (0.0622,0.2446)
uniform 1.1 (1,2) 0 (0,0) 1 (1,1) 0.1230 (0.0564,0.2843)
machine 1.1 (1,2) 0 (0,0) 1 (1,1) 0.1241 (0.0564,0.2960)
crowd 1.1 (1,2) 0 (0,0) 1 (1,1) 0.1274 (0.0564,0.3287)
Karmarkar 1.1 (1,2) 0 (0,0) 1 (1,1) 0.1248 (0.0564,0.3025)
uniform 5.04 (0.4,6) 0.16 (0.1,0.7) 0.77 (0.3,5) 0.1247 (0.0515,0.3138)
machine 1.92 (0.6,5) 0.12 (0,0.4) 0.95 (0.6,2) 0.1257 (0.0564,0.2925)
crowd 1.88 (0.9,10) -0.12 (-1,0.8) 1.6 (1,5) 0.1367 (0.0564,0.3648)
Karmarkar 1.5 (1,2) 0.14 (0,0.6) 0.9 (0.6,2) 0.1247 (0.0564,0.2937)
We use 10-fold cross validation to determine the mean Brier Score resulting from different calibration baseline distribu-
tions. We use benign (γ = 1) recalibration and linear (δ = 1) aggregation (top rows), benign (γ = 1) recalibration and
geometric (δ = 0) aggregation (middle rows), and optimal recalibration and aggregation (bottom rows). The median
calibration parameters (γ) that minimized the Brier score for each training set (the other 9 folds) are also shown. The
uniform baseline (p̂ = 1/m) delivered the best accuracy (lowest Brier score in bold).
Key Finding 3.4. Calibration is useful both before and after aggregation.
The C-A-R scheme outperforms both “Calibrate then aggregate” and “Aggregate then cal-
ibrate” schemes when using the best set of parameters for each scheme.
Choosing Calibration parameters
Figure 3.12 depicts the percent improvement in Brier score due to calibration and recalibration of
the crowd for a single “price of gold” question. This view makes it easy to see the combination
of ranges of parameters that result in more accurate scores.
Table 3.5: Effectiveness of Crowd-based Recalibration Baseline based on External Validation
Baseline              Calibration      Aggregation      Recalibration    External Validated
distribution (p̃)      Parameter (γ)    Parameter (δ)    Parameter (γ)    Brier Score
uniform 1 1 1.6 0.1101
machine 1 1 2 0.1277
crowd 1 1 1 0.1100
Karmarkar 1 1 1 0.1100
uniform 1 0 1.2 0.1104
machine 1 0 1 0.1095
crowd 1 0 1 0.1095
Karmarkar 1 0 1 0.1095
uniform 6 0.1 0.3 0.1107
machine 2 0 0.6 0.1161
crowd 1 0 1 0.1095
Karmarkar 3 0.2 0.6 0.1099
We use external validation on another forecast competition team's forecasts to determine the mean Brier Score resulting
from different recalibration baseline distributions. We use benign (γ = 1) calibration and linear (δ = 1) aggregation
(top rows), benign (γ = 1) calibration and geometric (δ = 0) aggregation (middle rows), and optimal calibration and
aggregation (bottom rows). The recalibration parameters (γ) that minimized the Brier score for ISI team forecasts are
also shown. The crowd-derived (average) baseline delivered the best accuracy across these two populations
of forecasts.
Note that the horizontal set of pixels at calibration parameter γ = 1 represents the results of aggregating first and
then recalibrating (with no initial calibration). Similarly, the vertical set of pixels at recalibration parameter γ = 1
represents the results of calibrating and then
aggregating (with no recalibration). In all cases with this data set, for any individual question or
set of questions, the best combination of calibration and recalibration parameters was not on either
of these sets. This means that the three part scheme adds value over the alternatives of “calibrate-
then-aggregate” or “aggregate-then-calibrate.” For most questions, the sets of “best” parameters
also fell in the upper right quadrant of this chart, indicating that extremization both before and
after aggregation was a good rule of thumb for this forecasting competition.
Key Finding 3.5. Aggressive calibration reduces forecast error for binary responses.
Calibration away from a uniform baseline (extremization) amplifies the convictions of
members of the crowd. This artificially increases crowd diversity and produces a more in-
formative aggregate estimate.
Table 3.6: Effectiveness of Crowd-based Calibration Baseline based on External Validation
Baseline              Calibration      Aggregation      Recalibration    External Validated
distribution (p̂)      Parameter (γ)    Parameter (δ)    Parameter (γ)    Brier Score
uniform 9 1 1 0.1097
machine 2 1 1 0.1150
crowd 2 1 1 0.1116
Karmarkar 2 1 1 0.1097
uniform 1.2 0 1 0.1095
machine 1 0 1 0.1095
crowd 1 0 1 0.1095
Karmarkar 1 0 1 0.1095
uniform 6 0.1 0.3 0.1095
machine 2 0 0.6 0.1159
crowd 1 0 1 0.1091
Karmarkar 2 0.2 0.6 0.1095
We use external validation on another forecast competition team's forecasts to determine the mean Brier Score resulting
from different calibration baseline distributions (using the crowd average as the recalibration baseline). We use benign
(γ = 1) calibration and linear (δ = 1) aggregation (top rows), benign (γ = 1) calibration and geometric (δ = 0)
aggregation (middle rows), and optimal calibration and aggregation (bottom rows). The calibration parameters (γ) that
minimized the Brier score for ISI team forecasts are also shown. The crowd-derived (average) baseline
delivered the best accuracy across these two populations of forecasts.
Key Finding 3.6. Aggressive recalibration reduces forecast error for binary responses.
Recalibration can either restore the original conviction of the crowd, minimize error for
binary responses, or be optimized to reduce residual bias in an aggregated estimate. For tour-
naments with a binary response, aggressive recalibration moves the combined crowd forecast
towards binary and thus reduces Brier score towards zero (as long as the crowd is correct).
How parameters change over time
Figure 3.13 demonstrates that as a tournament question progresses, forecasters become more ac-
curate. Therefore their forecasts can benefit more from calibration and recalibration, which is why
these parameters should be increased over time as the crowd's estimates become more certain of
the correct answer. Geometric aggregation remains a good option throughout the life of a question
as indicated by the near horizontal red dotted line.
Table 3.7: Effectiveness of Geometric Aggregation based on 10-fold Cross-Validation
Aggregation      Calibration      Aggregation      Recalibration    Cross-validated
method:          Parameter (γ)    Parameter (δ)    Parameter (γ)    Brier Score
                 med (min,max)    med (min,max)    med (min,max)    mean (min,max)
linear 1 (1,1) 1 (1,1) 1 (1,1) 0.1221 (0.0702,0.2297)
geometric 1 (1,1) 0 (0,0) 1 (1,1) 0.1182 (0.0564,0.2368)
best 1 (1,1) -0.08 (-0.3,0.1) 1 (1,1) 0.1190 (0.0564,0.2376)
linear 1 (1,1) 1 (1,1) 1.62 (1.4,3) 0.1267 (0.0581,0.3258)
geometric 1 (1,1) 0 (0,0) 1.22 (1,1.8) 0.1220 (0.0564,0.2842)
best 1 (1,1) -0.08 (-0.4,0.5) 1.24 (1,2) 0.1242 (0.0564,0.2953)
linear 5.3 (2,10) 1 (1,1) 1 (1,1) 0.1212 (0.0635,0.2410)
geometric 1.1 (1,2) 0 (0,0) 1 (1,1) 0.1230 (0.0564,0.2843)
best 1.28 (1.2,2) -0.02 (-0.3,0.3) 1 (1,1) 0.1217 (0.0535,0.2794)
linear 0.87 (0.7,1) 1 (1,1) 1.68 (1.4,3) 0.1262 (0.0595,0.3159)
geometric 1.62 (0.6,2) 0 (0,0) 0.94 (0.6,3) 0.1229 (0.0519,0.2888)
best 5.07 (0.7,6) 0.16 (0.1,0.7) 0.57 (0.3,3) 0.1244 (0.0515,0.3106)
We use 10-fold cross validation to determine the mean Brier Score resulting from different aggregation methods. We use
benign calibration (γ = 1) and recalibration (γ = 1) (top rows), benign calibration (γ = 1) (next rows), benign
recalibration (γ = 1) (next rows), and optimal calibration and recalibration (bottom rows). Geometric aggregation
outperforms even the optimal aggregation on test data. Linear aggregation is shown to be most effective with significant
calibration and no recalibration.
Key Finding 3.7. The calibration and recalibration parameter values increase over time
to reduce error of the crowd forecast.
As more data become available and time for uncertainty to act decreases, the judges in
a crowd provide estimates of increasing accuracy and informativeness. Higher parameters
for calibration and recalibration take advantage of this and result in a more informative final
estimate.
Restoring Conviction
As mentioned in Example 3.c, Figure 3.14 demonstrates the impacts of both calibration and aggre-
gation on the ”conviction” of the crowd. Because Brier score is a strictly proper scoring measure,
and its ordered variant (Equation 1.2) captures ”closeness” in our forecasting competition use
case, we use it to measure the distance of each forecast from a uniform (p = 1/m) forecast. After
calibration and aggregation, we seek the recalibration parameter (γ) that would restore the average
conviction of the crowd.
Figure 3.11: Aggregation Parameter Affects Accuracy and Calibration/Recalibration Parameters.
The left pane shows Brier score with respect to the aggregation parameter. It indicates that Brier score is reduced with
an aggregation parameter near 0. Further, it demonstrates that aggregation only (red) is always outperformed by
Calibration and Aggregation (blue), which is always outperformed by Aggregate then Recalibrate (green), which is
always outperformed by C-A-R (black). The right pane shows how the best calibration and recalibration parameters
change with the aggregation parameter. Note the inverse relationship between calibration (solid) and recalibration
(dotted) parameter values for C-A-R (black).
Purple in Figure 3.14 indicates we must extremize during recalibration to
restore original conviction, while Orange indicates we must de-extremize. This figure was made
specifically from our forecast competition data and may be re-created by a decision maker for each
recalibration situation they encounter as it does not depend on knowing the correct answer or the
inherent accuracy of the crowd.
Key Finding 3.8. To restore the conviction of the crowd, the calibration and recalibration
parameter values should have an inverse relationship.
When calibration extremizes the crowd forecasts, a de-extremizing recalibration is required
to restore the response to the same average distance from uniform as the average of initial
crowd estimates (and vice versa).
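A minimal sketch of this conviction-restoring search, assuming the hypothetical calibrate() helper sketched earlier; the cumulative squared distance below is only a stand-in for the ordered Brier score of Equation 1.2.

```python
import numpy as np

def conviction(p):
    """Stand-in conviction measure: distance of a forecast from the uniform forecast."""
    p = np.asarray(p, dtype=float)
    u = np.full_like(p, 1.0 / p.size)
    return np.mean((np.cumsum(p) - np.cumsum(u)) ** 2)

def restoring_gamma(raw_forecasts, crowd, candidates=np.linspace(0.1, 5.0, 50)):
    """Recalibration gamma whose output best matches the crowd's average original conviction."""
    target = np.mean([conviction(p) for p in raw_forecasts])      # average conviction before C-A
    uniform = np.full(len(crowd), 1.0 / len(crowd))
    gap = lambda g: abs(conviction(calibrate(crowd, uniform, g)) - target)
    return min(candidates, key=gap)
```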
Figure 3.12: Accuracy Improvement due to Calibration and Recalibration Parameters.
Brier scores for each parameter pair are compared to a linear average of uncalibrated forecasts (the center white square
at γ = 1, γ = 1). The best improvement is indicated by the black dot (at γ = 5, γ = 2). For this specific ”price of
gold” question, when using a linear aggregation and a uniform (uninformed) baseline, the ideal scheme
included extremization during the calibration phase and extremization during recalibration. When either parameter is
set to 0 (denoted by the L-shape of dark red squares), the resultant forecast is uniform, which is far from accurate for
this question. This chart is based on Brier score improvement over all time horizons and respondent conditions for a
single ”price of gold” question.
Putting it all together
When adjusting and combining crowd forecasts for tournament purposes, it will be helpful to
follow a few rules of thumb.
1. Use the proposed calibration function (Equation 3.5) with a uniform baseline.
2. Extremize during the calibration phase to address underconfidence in the crowd.
3. Aggregate using the geometric opinion pool (or with a δ parameter near 0).
4. Recalibrate after aggregation to either restore the shape of the original estimate (with γ < 1)
or to match the binary response of most tournament questions (with γ > 1), as illustrated in the sketch below.
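A minimal end-to-end sketch of these rules of thumb, again assuming the hypothetical calibrate() and aggregate() helpers from the earlier sketches; the parameter values are illustrative placeholders rather than the tuned values reported in the tables.

```python
import numpy as np

def car_rule_of_thumb(P, gamma_cal=2.0, delta=0.0, gamma_recal=3.0):
    """Hypothetical Calibrate-Aggregate-Recalibrate pass for one question.

    P : (n_judges, m_bins) array of raw forecasts.
    """
    n, m = P.shape
    uniform = np.full(m, 1.0 / m)
    calibrated = np.array([calibrate(p, uniform, gamma_cal) for p in P])  # rules 1-2: extremize vs. uniform
    crowd = aggregate(calibrated, delta=delta)                            # rule 3: near-geometric pool
    return calibrate(crowd, np.full(m, 0.5), gamma_recal)                 # rule 4: Karmarkar-style extremization
```

Using a gamma_recal below 1 in the last step instead softens the crowd forecast, matching the shape-restoring option in rule 4.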
Figure 3.13: Calibration and Recalibration parameters increase over time; Geometric Aggregation
remains effective.
The left pane shows how parameter values change over the duration of a tournament question to minimize Brier score.
The general changes are shown more clearly by lines of best fit on the right pane.
3.6 Case Study: C-A-R for the Tetris Survey
The Tetris survey consisted of seven questions that asked for probabilities that the number of blocks
to fall in a Tetris-like game would end up in fixed interval bins. 51 volunteers were recruited from
multiple sources to participate. Respondents received no compensation and were incentivized to
do their best only by the pride associated with seeing the accuracy of their own responses compared
to the crowd average.
The game simulation was built in NetLogo (Wilensky, 1999a) from an open-source Tetris-like
game (Wilensky, 2001). In the simulation, blocks are dropped randomly from any position along
the top of the game board (see Figure 2.2), and stack up where they land. Respondents estimated
how many blocks would fall before they reached the top of the board and the game ended. Estimates for
this survey are simple in that they require no special knowledge about the way the world works,
just an understanding of blocks fitting on a board. Accurate answers, however, are difficult because
they cannot be looked up on the internet, and as the questions progress, they require a good
understanding of how the variance of simulation results will change, not just the central tendency.
Figure 3.14: Restoring Conviction via Recalibration.
The color of each pixel in this plot represents the value of the recalibration parameter γ that restores the ”conviction of
the crowd” following aggregation of price of Oil and Gold questions. The y axis represents the calibration parameter
(γ) and the x axis represents the aggregation parameter (δ). Note that for a linear aggregation (δ = 1), as original
forecasts are de-extremized (γ < 1), they require extremization during recalibration (γ > 1, towards purple). We
measure ”conviction” as the ordered Brier Score of each forecast from a uniform distribution. Both calibration and
recalibration are performed with respect to a uniform distribution (p̂, p̃ = 1/m). These are average results over all
forecast horizons of all Gold and Oil questions.
Wallsten et al. (2016) used a similar set of questions that relied on spatial estimation rather than
subject-matter expertise in their study to determine effective probability elicitation methods.
Respondents saw a short video that explained the game before seeing the first question: “If
we played this simulated game many times, what is the percent probability our ‘Blocks Dropped’
score would end up in each of the following bins? A. Fewer than 25 blocks, B. Between 25 and
29 blocks dropped, C. Between 30 and 34 blocks dropped, D. Between 35 and 39 blocks dropped,
E. 40 or more blocks dropped” After entering their estimates, respondents saw the actual results of
the simulation after 1000 runs (Figure A.4).
Figure 3.15: Accuracy Improvement on Oil questions.
Calibration function baselines are shown from top to bottom, aggregation parameter values (δ) are from left to right.
This chart displays median Brier score improvements over all ”price of oil” questions. Note how the best boost in
accuracy (marked by the black dot) comes from de-extremizing the individual crowd judgements towards the crowd
average (γ < 1), and then extremizing (γ > 1) with the Karmarkar form following aggregation. Also note that the
best improvement in Brier score occurred with an aggregation parameter (δ) of 0.4 – somewhere between geometric
(δ = 0) and linear (δ = 1) aggregation. Note how when using the machine baseline (second row of charts from top),
we are best off by not altering individual judgements (γ ≈ 1).
This was to allow respondents to better calibrate their
future estimates. Subsequent questions asked about pieces of different shapes and game boards of
different size. The last question offered respondents a choice to help their estimates: they could
see additional relevant data about simulations under similar conditions or they could see the result
of a mathematical model built on similar assumptions.
This data set allows for interesting comparisons of the effectiveness of the Calibrate-Aggregate-
Recalibrate scheme under conditions about the stated abilities of respondents, their accuracy on
the first question, and their use of a machine model. This data set differs from other forecast-
ing tournaments in that the correct answers are distributions, not a simpler binary answer
like the “price of gold” questions where the answer occurs in one interval and not the others.
Experimenting with the Calibrate-Aggregate-Recalibrate scheme on this data set allows for more
nuanced parameter selection, whereas the former forecasting tournament always suggests a final
recalibration step with aggressive extremization.
Figure 3.16: Tetris Survey Questions and Answers
The top row of this figure shows graphics that represent the question scenarios (from left to right: blue blocks with
no debris, blue blocks with 5 rows of debris, blue blocks with 10 rows of debris, purple blocks with no debris, purple
blocks with 5 rows of debris, purple blocks with 10 rows of debris, and both blocks randomly with random rows of
debris). The bottom histograms represent the correct answers (based on 1000 simulated runs of the scenario)
Choosing the Aggregation parameter.
Similar to the HFC data, geometric aggregation was shown to be an effective aggregation choice.
The best aggregation parameter for overall average error reduction was δ = -0.2 (see Figure 3.17).
Choosing Calibration parameters.
Restoring conviction was much more important with this data set because the answers were non-
binary. There exists an even more pronounced inverse relationship between optimal calibration
and recalibration parameters. The best overall parameters were γ = 7 and γ = 0.4 for calibration
and recalibration, respectively.
Figure 3.17: Impact of Aggregation Parameter on Accuracy (Tetris Data)
The top row of this figure shows Brier score (y axis) with respect to aggregation parameter (x axis). It indicates that
Brier score is reduced with an aggregation parameter near 0. Further, it demonstrates that aggregation only (red) is
always outperformed by Calibration and Aggregation (blue) and Aggregate then Recalibrate (green) which are always
outperformed by C-A-R (black). The bottom panes show the correct answer and pictorial depiction of the question
scenario.
3.7 Contributions and Conclusions
Contributions of this research. First, this research proposed a three-part scheme that allows
for calibration to take place before and/or after aggregation. This flexibility is a departure from
previous methods and allows for tailoring the scheme to the intended use of the forecast, such
as input to a decision analysis or entry in a competition. Second, I proposed a new linear-in-log-odds
calibration function that is explicitly designed for multinomial interval probabilities, not just
binary probabilities. Third, the new function incorporates a baseline probability distribution of the
user’s choosing; we are no longer limited to uniform points of preservation. Lastly, the proposed
generalized aggregation function allows for a continuum of combinations that includes both linear
and geometric opinion pools. The effectiveness of these proposed improvements was demonstrated
on two different data sets.
In addition, the Tetris survey presents a new data set that better tests calibration and combi-
nation schemes for multinomial probabilistic estimates. It offers estimation problems that resolve
to non-binary distributions. Analysis of these data supported the rules of thumb developed for
forecasting competitions.
Figure 3.18: Impacts of Calibration and Recalibration on Brier Score (Tetris Data)
The top row of this figure shows Improvement in Brier score (blue is better) due to calibration (y axis) and recalibration
(x axis) parameters. The results shown are with the optimal aggregation parameter (-0.2). The bottom panes show the correct
answer and pictorial depiction of the question scenario.
Application Considerations. A common situation in crowd forecasting competitions is when we
do not know the calibration or skill of the crowd and we do not have another forecast known to
be accurate. Under these conditions, a good rule of thumb is to de-extremize individual forecasts
towards the crowd linear average, aggregate between the geometric and linear opinion pools, and
then extremize aggressively using the Karmarkar form. The first step removes potential individual
overconfidence, the second produces a crowd forecast that is most likely to correctly identify the
interval in which the question will resolve, and the third step seeks to minimize the squared error
of the forecast when measured against a binary outcome. The more a decision maker knows about
the crowd of forecasters, the more tailoring they can do with the Calibrate-Aggregate-Recalibrate
scheme. Hierarchical methods may assign different calibration parameters to different groups of
forecasters. Similarly, those different groups may be aggregated separately and then weighted
during aggregation to find the crowd forecast. The key point to remember is that this should not
become an exercise in optimization, but instead should rely on what the decision maker believes
to be true about the crowd and about available machine-generated models. The result will be an
explainable system that maintains the “degree of belief” definition of each probability assigned
to an outcome interval. The system also will work immediately, without demographic data on
the judges or waiting for sufficiently many questions to resolve in order to identify the optimal calibration parameter for each judge.

Figure 3.19: Calibration and Recalibration parameters that minimize Brier Score based on Aggregation Parameter (Tetris Data)
The top row of this figure shows Calibration and Recalibration parameters (y axis) that minimize Brier Score when using a particular aggregation parameter (x axis). The bottom panes show the correct answer and pictorial depiction of the question scenario.
Figure 3.20: Restoring the Conviction of the Crowd through Recalibration (Tetris Data)
The top row of this figure shows the value of the restorative recalibration parameter (color) given the calibration
parameter (y axis) and aggregation parameter (x axis). The bottom panes show the correct answer and pictorial
depiction of the question scenario.
Chapter 4
Graphical Models for Joint Probabilistic
Estimates
Overview
Any mathematical method to determine the value of one uncertainty based on counterfactual val-
ues of other, potentially relevant, uncertainties relies (explicitly or implicitly) on a joint probability
distribution. Graphical models have been shown as useful for simplifying, solving, and communi-
cating the types of problems that rely on counterfactual thinking and estimation of complex systems
in order to come up with the joint distribution. This chapter introduces a methodology to elicit joint
distributions from both human judges and data, and analyzes the impacts of the methodology on
probabilistic estimation tasks.
4.1 Hybrid Adaptive Relevance Networks for Estimating Sys-
tem States (HARNESS) Methodology
A hybrid solution/methodology for counterfactual estimation, one that exploits both data and hu-
man intuition in order to improve estimation accuracy, should meet certain design criteria; it should
be:
• Usable: simple to use, tractable/explainable answers, augments human intuition; does not
seek to replace it
• Effective: improves forecast accuracy, can hybridize and aggregate
• Scalable: robust to additional forecasters, additional variables, additional machine models
• Improving: learns over time with more expert-provided data, provides decision maker with
insights into the human/machine trade-offs for particular problem sets
For our counterfactual forecasting application we use a hybrid relevance diagram that inte-
grates parameters derived from both historic data and from human input about which variables are
relevant and the character of their dependency relationships. An example of a relevance diagram
is depicted in Figure 4.1.a. where variable A is relevant to variable B, and both A and B are rel-
evant to variable C. Note that in this diagram, relevance does not imply dependence or causality,
so the arrows may be reversed if both nodes are conditioned on the same information, that is, if they have arrows pointing from only the same nodes (such as the arrow between nodes B and C, since both are conditioned only on node A). The role of the arrows is to codify the direction of assessment of the
system that makes most sense to the forecaster. In this example, variable C is our forecast variable,
so it makes most sense to the forecaster to have arrows pointing to it and to assess its value last.
A natural approach to modeling the multivariate uncertainties of a relevance diagram may be
to specify each uncertainty’s conditional distribution given the range of possible discrete outcomes
(such as high/medium/low) of the preceding nodes. This approach becomes unwieldy when con-
sidering the possible orders of modeling the influence on each uncertainty from other relevant
uncertainties. Using the diagram and tree reversal techniques in Howard and Abbas (2016) helps
to simplify the process by codifying the structure of the relevance diagram in a tree whose order of
assessment makes the most sense to the forecaster. However, this approach also becomes compli-
cated when there are more than three uncertainty nodes. Further, the aforementioned conditional
distributions are difficult to derive analytically, especially when their marginal distributions are
of different families. To overcome these limitations, I propose a copula structure to describe the
dependence among marginal uncertainty distributions as outlined by Clemen and Reilly (1999). A
copula function is independent of the marginal distributions it links. It combines them to describe
their multivariate joint distribution (Nelsen, 2006). A normal copula function is ideal for our ap-
plication because the dependence structure is simple to define by eliciting measures of correlation
from forecasters and allows us to use the more efficient tree modeling approach outlined by Wang
and Dyer (2012). A normal copula function is also appropriate because we assume linear relation-
ships among marginal distributions and no tail dependencies. (Expert system dynamics models of the computer game on which the forecasters are asked to predict, Civilization V, indicate that variables are related linearly and do not have a strong probabilistic relationship between extreme values, largely because many of the variables do not take extreme values.) Further, the pairwise correlation assessments required for the normal copula function provide a good approximation of the joint distribution, as demonstrated by satisfactory results in Abbas (2006).
Because of the diversity of backgrounds of the forecasters in our competition setting, we seek
a methodology that does not greatly increase the burden on forecasters but helps them visually
explain their understanding of the system to be forecasted. In order to reduce the potential burden
on forecasters, we seek to simplify the parameterization of a relevance network by automating
some default values based on available data and eliciting only approximate measures of depen-
dence among system variables. Such an approximate approach was shown to be appropriate by
Clemen et al. (2000).
I propose the following methodology for producing counterfactual posterior forecast probability distributions from forecaster-provided information and from available data on variables of the system of interest:
• Step One: Identify forecast variable and relevant system variables. The forecaster selects
from available data the variables which are relevant to the forecasting problem and selects
the one variable about which he must forecast. For example, the forecaster selects C as the
forecast variable and determines variables A and B to be relevant to it.
• Step Two: Elicit relevance structure and draw relevance diagram. The forecaster draws the
arcs that link the variable nodes according to a few simple rules for relevance diagrams
(Howard & Abbas, 2016). An arrow represents that there may be a relevance relationship
between two nodes but makes no assertions about direction of causality. The absence of an
arrow between two nodes asserts that a relevancy relationship does not exist (we will not as-
sess their pairwise correlation in Step Three). Alternately, in future iterations of this method-
ology we may use network structure learning algorithms to suggest a relevance structure or
copy a defined structure from the experts mentioned in footnote 3. For example, through his
own intuition about the problem, the forecaster determines A is relevant to B, and both A
and B are relevant to C (Figure 4.1.a.)
• Step Three: Elicit joint distribution of system variables. The forecaster characterizes each
pairwise relevance relationship (each arrow drawn in Step Two) by its Pearson product mo-
ment correlation coefficient or as either positive or negative and as either weak or strong.
For example, if A and C have a correlation coefficient of -0.8 (Figure 4.1.b.), the forecaster
may record that as “strongly negative” as denoted by the double minus sign in Figure 4.1.c.
How to determine the definitions of “weak” and “strong” and how much error that induces
over coefficient assessment (a number between -1 and 1) is the topic of Section 4.2.
• Step Four: Determine marginal distributions of system variables. Marginal distributions may
be assessed from available data and modified by the forecaster. The forecaster must provide
any distributions for uncertainties without available data (in the event the forecaster adds
another variable in Step One that was not in the set of available data, for example: “public
will to support a war” may not be measured in our data set but the forecaster may feel it is
necessary in order to capture his knowledge in the relevance diagram).
• Step Five: Elicit counterfactual values of system variables. The forecaster provides initial
estimates of variable values conditioned on the counterfactual event. Estimates may be dis-
cretely approximated using the extended Pearson Tukey method as advocated by Wang and
Dyer (2012) or using the equal areas method as explained by Howard and Abbas (2016). If
the forecaster feels that as a result of the counterfactual antecedent, variable A will take a
value at the top end of its distribution and variable B will take a value near the middle, he
may select u_A = 0.95 and u_B = 0.5.
• Step Six: Calculate counterfactual posterior distribution of forecast variable. Because the
copula function is independent of the marginal distributions, we can keep our calculations
in the probability space to find u_C, a standard uniform variable over [0,1], that provides a
point forecast for variable C by inverting through its marginal distribution (F_C^{-1}(u_C)). We
may also find a distribution if we sample across an appropriate number of values for u_C (100
uniformly distributed values is adequate for our application). We find u_C with the equation
below.

u_C = \Phi\big( D_{CA}\,\Phi^{-1}(u_A) + D_{CB}\,\Phi^{-1}(u_{AB}) + D_{CC}\,\Phi^{-1}(u_C) \big)

Where \Phi represents the standard normal distribution and D_{ij} is an element in the decomposed correlation matrix D such that DD^T = \Sigma, our original pair-wise correlation matrix. We find the matrix D, shown in Figure 4.1.d., by Cholesky factorization of \Sigma (Figure 4.1.b).
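As a numerical check on the notation in Step Six, the sketch below Cholesky-factors a pairwise correlation matrix for (A, B, C) and maps the elicited percentiles through the standard normal quantile function. Only the A–C correlation of -0.8 comes from the example above; the remaining correlations, the marginal distribution of C, and the treatment of the second argument (written u_AB in the equation, taken here as the percentile elicited for B) are illustrative assumptions.

import numpy as np
from scipy.stats import norm

Sigma = np.array([[ 1.0,  0.6, -0.8],     # assumed pairwise correlations for (A, B, C);
                  [ 0.6,  1.0, -0.5],     # only the -0.8 between A and C comes from the text
                  [-0.8, -0.5,  1.0]])
D = np.linalg.cholesky(Sigma)             # lower triangular, so D @ D.T equals Sigma

u_A, u_B = 0.95, 0.5                      # counterfactual percentiles from Step Five
u_C_sweep = np.linspace(0.005, 0.995, 100)  # 100 uniformly spaced values for u_C

# u_C = Phi( D_CA * Phi^-1(u_A) + D_CB * Phi^-1(u_B) + D_CC * Phi^-1(u_C sample) )
z = (D[2, 0] * norm.ppf(u_A)
     + D[2, 1] * norm.ppf(u_B)
     + D[2, 2] * norm.ppf(u_C_sweep))
posterior_percentiles = norm.cdf(z)

# Invert through C's marginal distribution F_C^-1; a Normal(50, 10) marginal is
# assumed here purely for illustration.
C_values = norm.ppf(posterior_percentiles, loc=50, scale=10)
print(round(C_values.min(), 1), round(C_values.max(), 1))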
Figure 4.1: Example relevance diagram and correlation matrices
4.2 Case Study: Errors Induced by Approximate Methods of
Dependence Elicitation
I tested the Gaussian Copula method with a three-node graphical model to determine a suitable
approximation method for humans (or machines) to characterize pair-wise dependence relation-
ships between uncertainties. I conducted sensitivity analysis on a select number of approximation
methods to determine their ideal parameters and the order of the error they induce on our posterior
estimate of the random variable u_C compared to using actual correlation coefficients to describe
pairwise dependency relationships.
The simulation included 540 randomly assigned correlations among three variables, each eval-
uated at three discrete points of u_i = 0.05, 0.5, 0.95. The correlations were approximated by the
midpoints of the bins depicted in Figure 4.2.b. Error was measured as the mean difference in the
estimation of u_C using the actual calculated correlation and the approximation. There was no
discernible direction to the error, no consistent over- or under-estimation.
Results indicate that when we approximate the value of a bin at its midpoint, the ideal bins
are equally sized (as indicated by the number lines in the middle of Figure 4.2.b.). Although
the UPenn team leaders wanted a simple “weak”/“strong” scheme (4 bins; cut points at -0.5, 0,
0.5), the 5 bin method (cut points at -0.6, -0.2, 0.2, 0.6) allows a forecaster to include a relevance
arrow in his diagram for completeness but simplify calculations by assigning “no dependence” to
the relationship. This method reduces the induced error (from 0.073 to 0.063) without increasing
the cognitive complexity of the dependence assessment task. Because it allows the forecaster
to acknowledge relevance even though there may not be correlation between variables, it is our
recommended default method. Forecasters should have the option, though, to input a numerical
correlation coefficient if they feel able to discern one from the data and their understanding of the
system.
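A minimal sketch of the recommended 5-bin approximation: an elicited correlation is cut at -0.6, -0.2, 0.2, and 0.6 and replaced by the midpoint of its bin, with the middle bin mapped to zero for a relevance arrow judged to carry no dependence. The error it reports is the per-coefficient rounding error of the scheme, not the posterior error on u_C measured in the simulation above.

import numpy as np

CUTS = [-0.6, -0.2, 0.2, 0.6]             # 5-bin scheme with equal-sized bins
MIDPOINTS = [-0.8, -0.4, 0.0, 0.4, 0.8]   # middle bin = relevant but "no dependence"

def approximate_correlation(rho):
    # Replace an elicited correlation with the midpoint of its qualitative bin.
    return MIDPOINTS[np.searchsorted(CUTS, rho)]

rng = np.random.default_rng(0)
rho = rng.uniform(-1, 1, size=540)        # 540 random correlations, as in the study
approx = np.array([approximate_correlation(r) for r in rho])
print("mean absolute rounding error:", round(np.abs(rho - approx).mean(), 3))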
Figure 4.2: Error Induced by Correlation Approximation
Plot a.) shows the mean absolute error (higher indicates greater error) on the random variable u_C resulting from each
of the approximation methods for variable dependence as a function of the cut off values for “weak” and “strong”.
The ideal thresholds for each approximation method are indicated by the numbered dots (number indicates the number
of bins). Table b.) shows the errors induced by the ideal set of thresholds of each approximation method. The ideal
thresholds cut the coefficient space into equal-sized bins and approximate the bin by its midpoint value.
As we record how forecasters use this methodology in future experiments (and even in com-
petition iterations) we will be able to determine the induced error on the counterfactual posterior
distribution of the forecast variable. This will be a more informative measure of the trade-offs
between simplicity and accuracy and will help turn the methodology into a tool to increase human
estimator accuracy in and out of competitions.
Key Finding 4.1. Approximate methods of dependence elicitation with more fidelity in-
duce less systemic error in joint probability distribution assessment.
When a judge cannot provide a numerical measure of dependence, such as a correlation
coefficient, we may instead elicit direction and rough magnitude. The more options available
to the judge, the less error is induced by the approximation method. The ideal thresholds
for magnitude cut the coefficient space into equal-sized bins and approximate the bin by its
midpoint value.
4.3 Case Study: A Comparison of Methods for Joint Probabil-
ity Assessment
Because the normal copula method assumes linear relationships among relevant uncertainties, it
may not be effective when non-linear relationships exist. Other methods of joint probability distri-
bution elicitation then become necessary. The research presented in this section uses a simulated
scenario to demonstrate the potential strengths and weaknesses of the three methods (Bayesian
network, maximum entropy, and Gaussian copula) on a probabilistic estimation problem relevant
to policy-making or strategy development.
Our experiment uses data from an agent-based simulation of infection built in NetLogo. (I modified the code of an existing NetLogo simulation program called epiDEM Basic (Yang & Wilensky, 2011); NetLogo is a multi-agent programmable modelling environment that is free to download (Wilensky, 1999a, 1999b).) On
each time step of the simulation, agents move randomly and infect others with a variable chance
of infection. After a fixed recovery time period, agents recover with a variable chance of recovery.
Figure 4.4 shows a screenshot of the simulation. We run the simulation 100 times for each of the 16
combinations of infection and recovery chances (that’s four different infection chances: 20, 25, 30,
and 35%, and four different recovery chances: 50, 55, 60, and 65%) for a total of 1600 runs. From
each run we capture the number of days elapsed until the entire population has been infected and
the number of days until the entire population has recovered. A sample of the data set is presented
in Table 4.1, and Figure 4.3 summarizes the results of all 1600 runs.

Table 4.1: Sample of data from Infection Simulation
For each iteration in this experiment, we select one of the 16 combinations of infection and
recovery chances. All 100 runs of this combination are removed from the remaining data and be-
come our counterfactual answer set- a distribution of outcomes resulting from initial conditions
that do not occur in our training set. Our training set now contains results from the 15 remain-
ing combinations of infection and recovery chances. We further trim this set to include only a
randomly-selected sample of simulated results for each set of conditions. Possible data sampling
schemes are depicted in Figure 4.6; we will only use uniform sampling (a.) in this research. These
training data, along with an expert-drawn graphical model representing the possible relevance rela-
tionships among the uncertainties (see Figure 4.5), are used to build joint probability distributions
and estimates using each of the three methods under consideration: Bayesian networks, maximum
entropy, and Gaussian copulas.
Each run of this experiment has several available settings to adjust in order to attain our research
objectives. We can vary the number of instances per condition randomly selected for inclusion in
the models’ training set. We can also adjust the number of the 15 conditions that will contribute
to the training set. Comparing results across these settings tells us how each method responds
to additional available data. We may also vary the elicitation methods available to each method
during data query. For example, each method may query the training data set for information on
the marginal uncertainties, pairwise correlations (or other measures like conditional probabilities)
of uncertainties, higher-order measures, or combinations of the preceding. The varying granularity
of these queries are akin to approximation methods when eliciting information from a human
expert. Comparing results across these settings tells us how each method responds to different
approximation methods. I experimented with producing estimates with and without the simplifications
from graphical models. I used Brier score (lower is better) to measure the error of each estimate
when compared to the held-out counterfactual data set. Results are summarized in Table 4.2. The maximum entropy method with pairwise assessment and assuming a fully connected graph was the option that best reduced average error.

Figure 4.3: Histograms of infection simulation results
This grid of histograms is organized by the chance of infection (rows) and the chance of recovery (columns). Each individual plot summarizes the results of 100 runs of the simulation. The total time required for the entire population to recover is divided into five bins: less than 193 days, between 193 and 207, between 207 and 220, between 220 and 237, and greater than 237 days.
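The scoring used throughout this comparison can be made concrete with a short sketch: a method's estimated distribution over the five recovery-time bins is scored with a multinomial Brier score against the empirical distribution of the held-out counterfactual runs. The bin edges follow Figure 4.3; the held-out data and the estimated probabilities below are stand-ins for illustration only.

import numpy as np

EDGES = [193, 207, 220, 237]              # recovery-time bin edges (days), as in Figure 4.3

def bin_distribution(recovery_days):
    # Empirical probability of each bin from the held-out simulation runs.
    counts = np.bincount(np.searchsorted(EDGES, recovery_days), minlength=5)
    return counts / counts.sum()

def brier(estimate, truth):
    # Multinomial Brier score (lower is better) between two probability vectors.
    return float(np.sum((np.asarray(estimate) - np.asarray(truth)) ** 2))

held_out_days = np.random.default_rng(1).normal(215, 15, size=100)   # stand-in data
truth = bin_distribution(held_out_days)
estimate = np.array([0.05, 0.25, 0.40, 0.20, 0.10])                  # one method's output
print("Brier score:", round(brier(estimate, truth), 3))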
The Bayesian network method relies on prior conditional probabilities observed in data to pa-
rameterize the predictive model. When no such data exist to make an inferred non-zero probability,
the network model produces an estimate skewed towards zero. When unconditioned, the Bayesian
network approach performs similarly to the maximum entropy method. Maximum entropy worked
best in this case without the simplification provided by the graphical model. This is largely due
to the sparsity of data provided to the methods. Future work will control for both the type and
amount of data available to provide clear insights about the methods’ effectiveness. The normal
copula method performed similarly to the maximum entropy method. The graphical model made
no difference for the copula method because the correlations inferred from the data for the condi-
tionally independent pairs were at or near zero to begin with. The graphical model overall made
Figure 4.4: Agent-based infection simulation
This screenshot depicts an agent-based simulation of infection propagation coded in the NetLogo program. The
colored stick-figures represent susceptible (white), infected (red), and recovered (green) members of the population.
In this run of the simulation, the chance of infection was set to 35% and the chance of recovery was set to 65%. 5% of
the population is initially infected at the beginning of each run of the simulation.
very little impact on the accuracy of the three models.
Key Finding 4.2. The maximum entropy method with pairwise assessment and assum-
ing a fully connected graph best reduced average error in joint probability distribution
assessment.
I experimented with producing estimates with and without the simplifications from graphical
models. I used Brier score (lower is better) to measure the error of each estimate when com-
pared to the held-out counterfactual data set.
4.4 Demonstration: HARNESS Web Application
I built a web application in RShiny (RStudio, 2014) that follows the above methodology and was
originally intended to assist human estimators on counterfactual problems relating to a computer
Figure 4.5: Agent-based infection simulation relevance diagram
This diagram depicts potential relevance relationships among the uncertainties in the infection simulation.
game called Civilization V for the competition mentioned above. Figure 4.7 displays step one
of the methodology in the application. Figure 4.8 displays steps two, three, and five. Steps four
and six are completed behind the scenes and the results are displayed in Figure 4.9. Using this
app with practice problems (with known answers) allows users to develop their intuition about
building relevance diagrams, asserting relevance relationships and selecting counterfactual ranges
for uncertainties.
With the help of relevance diagrams and copula functions, I defined a simple methodology to
help forecasters assess counterfactual impacts on a system. I used a hybrid approach that is robust
to diverse populations of forecasters and to small data sets. It allows forecasters to codify what they
understand about a system and share that understanding with their fellow forecasters. Collabora-
tion on diagrams, dependencies, marginal distributions, and counterfactual impacts may take place
at any point in the process. This allows for tailored aggregation of forecaster responses: a team
(a) Sampling from each condition (b) Sampling from all conditions
Figure 4.6: Infection data sampling schemes
The data from one condition (30% chance of infection and 60% chance of recovery) are held out as our counterfactual
data set. Other data points (represented by red squares) are sampled from the remaining conditions: either uniformly
from each condition (left) or uniformly from the entire data set (right).
could aggregate only the final forecasts (late aggregation), or the parameters of the forecasters’
individual models to create a joint model (early), or anything in between (mid). This methodol-
ogy also allows for potentially more accurate aggregation methods as we can discern which data
sources each forecaster uses (for example, this makes it possible to use Bayesian ensemble methods
that aggregate with consideration of mutual information). With a web-based digital tool that facil-
itates the above methodology we also allow for experimentation to find the best ways to display
information to incite more accurate model design and parametrization by forecasters.
Perhaps the greatest advantage of employing this methodology will come as forecasters de-
velop their own intuitions about how relevance structures and counterfactual assessments on im-
portant variables impact posterior distributions of the forecast variable. A tool built with only
supervised data (based on questions for which we have the answer distributions from simulations)
would achieve this end. Future iterations of the tool will allow the forecaster to access other data
sources relevant to real world problems. For example, when forecasting the future price of gold,
the forecaster may want to build a relevance diagram that includes other uncertainties such as key
exchange rates and the prices of other commodities. The tool should then be able to automatically
search the web to find the relevant data and suggest marginal distributions and correlations to build
a complete joint distribution using the copula approach outlined above or a Bayesian network with
maximum entropy to address conditions with missing data.

Table 4.2: Joint Distribution Simulation Results
This table displays the results of the simulation experiment to ascertain the errors resulting from various types of joint distribution elicitation techniques.
One unique addition made by this application is the ability to select different relevant time
ranges of uncertainties (as depicted on the left side of Figure 4.8). This method allows forecasters
to select the time range of a variable that makes sense to them. In our “Incan Happiness” example,
a forecaster may feel that while the counterfactual antecedent (desertification) occurs at time 0,
and we are asked about happiness levels at time 100, the relevant time frame for the other system
variables (Incan Food and Production levels) is between 25 and 75. This creates four different
system variables: an initial value and a change value for both Food and Production (as shown in
the diagram in Figure 4.8). This allows the forecaster to assert different relevance relationships (in
terms of direction and magnitude) for the initial value of an uncertainty and the change in value
of the uncertainty over time. This construct allows the normal copula method to mitigate some of
the potential impacts of non-linear relationships among uncertainties and helps the forecaster think
more clearly about those relationships.
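A small sketch of the time-range construct described above: a variable's trajectory is reduced to an initial value at the start of the forecaster's chosen window and a change value over that window, each of which can then receive its own marginal distribution and relevance arrows. The window, the function name, and the stand-in trajectory are illustrative assumptions.

import numpy as np

def split_variable(series, window=(25, 75)):
    # Reduce a trajectory to (initial value, change over the relevant window).
    start, end = window
    return series[start], series[end] - series[start]

turns = np.arange(101)
food = 40 + 0.3 * turns + np.random.default_rng(2).normal(0, 2, size=101)  # stand-in data
food_initial, food_change = split_variable(food)
print(round(food_initial, 1), round(food_change, 1))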
Figure 4.7: Screenshot of HARNESS web application: Forecast Variable Selection
This screenshot displays a forecasting competition practice problem that asks about the happiness level of the Incan
civilization after 100 turns of the game. The counterfactual antecedent was that the Incan civilization started at turn 0
in a desert instead of the lush landscape of its factual run. When the human estimator selects practice problem 1.1 from
the top left dropdown menu, the application auto-populates with the pertinent information from the limited provided
data set. The histogram displays realized values of happiness of the Incan civilization at turn 100 from the data.
4.5 Contributions and Conclusions
The research in this chapter establishes a simple and flexible methodology to incorporate diverse
information sources to produce estimates that may be of maximal use, that is, accurate enough for a given level of effort (in terms of number of elicitations, cost of computer models, etc.). The methodology is presented with an appropriate intuitive interface as a tailorable web
application. (The HARNESS web application is available at https://lharavit.shinyapps.io/hybrelnet/ and an overview video of the application in action is available at this clickable link.)
The methodology is relevant for counterfactual estimation problems as well as those
more general problems that require the estimation of a joint probability distribution where both
humans and machine models are expected to contribute value. I also codified the errors resulting from approximate methods of dependence relationship elicitation and compared algorithmic methods of joint probability elicitation.

Figure 4.8: Screenshot of HARNESS web application: System Variable Selection
This screenshot displays the process for declaring relevant system variables (left side) and making assertions about their relevance or dependence relationships (right side). The blue and red arrows represent positive and negative relationships, respectively. The user selects the counterfactual values of appropriate system variables using the radio buttons on the bottom left of the figure.
Figure 4.9: Screenshot of HARNESS web application: Posterior Distribution
This screenshot displays the posterior distribution generated by the Gaussian copula function and the user’s inputs. It
is displayed alongside the actual results (provided by IARPA after 100 simulated runs of the counterfactual scenario).
The error of the posterior distribution is displayed in the upper left corner of the chart as its Brier score.
Chapter 5
Considerations and Conclusions
Overview
This dissertation presented diverse analyses and findings that together help advance the toolkit
available for using estimates from multiple sources to make difficult decisions. We started with
creating forecasts, then discussed how to mechanically alter them and combine them. We finished
with a discussion that extended univariate methods to joint distributions. Care should be taken
when applying these findings to new problem sets. There is work yet to be done in order to fully
realize the potential benefits of the advanced decision maker’s toolkit.
5.1 Fundamental Contributions of the Dissertation
The analyses presented in this research helped to codify best practices for decision makers given
particular contexts. I used a linear in log-odds weighting function to alter probability estimates and
a generalized aggregation function to combine them. I relied on graphical models and measures
of relevance to describe joint distributions. And throughout the research, I pursued hybridization
options to mitigate the weaknesses and capitalize on the strengths of both human- and machine-
centric methods. The fundamental contributions of this dissertation are:
1. A newly proposed calibration function that extends previous work to the multinomial case
and allows for incorporation of a baseline probability distribution.
2. A flexible methodology to mechanically adjust and combine univariate probabilistic esti-
mates from multiple sources.
3. A new data set of crowd-provided univariate probabilistic estimates that are more suitable
than traditional forecast competition data sets for examining calibration and combination
methodologies.
4. A graphical methodology to produce joint probabilistic estimates from multiple sources.
Novelty and Usefulness of this research. For research to be a real contribution to the field, it
should be both new and useful. Next, I’ll review the contributions of this dissertation for their
novelty and usefulness. The findings presented in Chapter 2 are not new. They are useful to give
a decision maker a more intuitive sense of the way to use simple machine forecasts and they are
included in the dissertation to give the reader an introduction to the more advanced topics in the
following chapters. Chapter 3 provides a new methodology (CAR) that is shown to be useful.
It is new simply because it includes calibration both before and after aggregation. It includes a
new calibration function and an old generalized aggregation function. The aggregation function is
shown to be useful for allowing aggregations in between, and outside of, the linear and geometric
opinion pool. The new calibration function is both new and useful. The ability to integrate a
baseline distribution of choice, while new, was not shown to be all that useful in practice. The
Tetris Data Set is brand new. It provides crowd-sourced probabilistic estimates that resolve to
answers other than binary. It was useful to confirm other findings and will also be useful for future
research. In Chapter 4, I introduced a new methodology (HARNESS) that made old methods
more user friendly. I also packaged the methodology into a useful web application. In all of these
contributions, I make no claims of superiority of results. Many of the experiments described in this
research may be taken to their logical conclusions by future researchers in order to better define
the “best” ways to answer the questions posed in the abstract.
5.2 Considerations for Using the CAR Methodology
A common situation in crowd forecasting competitions is when we do not know the calibration
or skill of the crowd and we do not have another forecast known to be accurate. Under these
conditions, a good rule of thumb is to de-extremize individual forecasts towards the crowd linear
average, aggregate between the geometric and linear opinion pools, and then extremize aggressively
using the Karmarkar form. The first step removes potential individual overconfidence, the second
produces a crowd forecast that is most likely to correctly identify the interval in which the question
will resolve, and the third step seeks to minimize the squared error of the forecast when measured
against a binary outcome. The more a decision maker knows about the crowd of forecasters,
the more tailoring they can do with the Calibrate-Aggregate-Recalibrate scheme. Hierarchical
methods may assign different calibration parameters to different groups of forecasters. Similarly,
those different groups may be aggregated separately and then weighted during aggregation to find
the crowd forecast. The key point to remember is that this should not become an exercise in
optimization, but instead should rely on what the decision maker believes to be true about the
crowd and about available machine-generated models. The result will be an explainable system
that maintains the “degree of belief” definition of each probability assigned to an outcome interval.
The system also will work immediately, without demographic data on the judges or waiting for
sufficiently many questions to resolve in order to identify the optimal calibration parameter for
each judge.
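As a concrete illustration of this rule of thumb, the sketch below de-extremizes each judge toward the crowd linear average, aggregates with a simple linear pool, and then extremizes the aggregate with the Karmarkar form. The log-odds shrinkage used for de-extremization, the parameter values, and the per-interval renormalization are illustrative assumptions rather than the exact Chapter 3 functions; the generalized power-mean pool sketched earlier could be substituted for the linear pool in step 2.

import numpy as np

def llo_adjust(p, baseline, gamma):
    # Shrink interval probabilities toward (gamma < 1) or away from (gamma > 1)
    # a baseline distribution, working linearly in log-odds, then renormalize.
    p, q = np.asarray(p, float), np.asarray(baseline, float)
    logit = lambda x: np.log(x / (1 - x))
    adj = 1.0 / (1.0 + np.exp(-(logit(q) + gamma * (logit(p) - logit(q)))))
    return adj / adj.sum()

def karmarkar(p, beta):
    # Karmarkar-style extremization applied interval by interval, then renormalized.
    p = np.asarray(p, float)
    w = p ** beta / (p ** beta + (1 - p) ** beta)
    return w / w.sum()

judges = np.array([[0.10, 0.20, 0.40, 0.20, 0.10],    # three judges, five intervals
                   [0.05, 0.15, 0.50, 0.20, 0.10],
                   [0.20, 0.20, 0.30, 0.20, 0.10]])
crowd_avg = judges.mean(axis=0)

calibrated = np.array([llo_adjust(p, crowd_avg, gamma=0.7) for p in judges])  # step 1
aggregate = calibrated.mean(axis=0)     # step 2: linear pool for brevity
final = karmarkar(aggregate, beta=3.0)  # step 3: aggressive extremization
print(final.round(3))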
Why Not Optimization. Recent research on calibrating and combining expert forecasts has been
focused on finding optimal parameters and demonstrating the improvement in accuracy resulting
from those parameters. I have not focused on optimization in my research for two reasons. 1.
In most forecasting competitions, the interval forecasts resolve such that one interval contains
the answer and all the others do not. When using a squared error metric like Brier score, it will
always make sense to extremize aggressively as long as there is a good chance that most forecasters
assigned their maximum probability to the correct interval. Finding the optimal extremization
parameter, therefore, is fairly futile as it is completely dependent on the accuracy of the crowd: if
they are accurate, use an arbitrarily high extremization parameter; if they are not, use a value less
than one. I am less concerned with finding a forecast that will get a good score and more concerned
with finding one that will be most accurate for decision analysis. 2. I am less concerned with
crowd forecasting settings where I have performance data on forecasters. I am most interested in
the “cold-barrel” forecast, the first one. I don’t know anything about the performance of the crowd,
and I don’t have any previous data to use to estimate a parameter. I would rather have ranges of
useful parameters with psychologically explainable effects on calibration. This way, the decision
maker can use their own expertise to select initial parameters and adjust them after sufficiently
many forecasts are resolved.
5.3 Considerations for Using the HARNESS Methodology
The methodology may be adapted for use in group settings. It may be used on a whiteboard, on
a standalone laptop, or over the internet across multiple estimators, stakeholders, and machines.
Some ethical implications arise as biases in thinking may propagate through the methodology and
among human estimators. Importantly, the methodology may also be used to capture the necessary
(user transaction) data to improve organizational processes in estimation and decision making. Just
as with any potentially useful methodology, an adopting organization should deliberately measure
the usefulness of a new way of doing things. Such measures for the methods proposed here may
include accuracy of estimates, time to decision, and individual required effort.
5.4 Future Work
This dissertation answered many interesting questions, but also left many unanswered. Some ques-
tions whose answers will advance the adoption of the methods proposed herein are listed below:
1. How should a decision maker choose the best hybridization scheme (early, late, mid inter-
twined) based on the characteristics of the estimation problem?
2. How should a decision maker choose aggregation weighting coefficients? And how must
those coefficients be adjusted with respect to the generalized aggregation parameter?
3. Do forecasters update their forecasts according to Bayesian updating when exposed to a
common forecast (such as the crowd average or the output of a machine model)? How can
they be nudged towards more accurate updating?
4. How should a decision maker choose calibration and recalibration baselines and parameters
when little is known about similar problems or forecasters?
5. What other scoring metrics besides Brier score may be better suited to assess accuracy of
probabilistic estimates while encouraging truthful judgements?
5.5 Conclusion
This dissertation presented methods to generate probabilistic estimates from human experts and
from data to be used for the analysis of policy-type decisions. It offered findings to guide future
decision makers in using, altering, and combining estimates from multiple sources. It explored
various ways to hybridize human intuition with algorithmic approaches. It presented a new cali-
bration function and new methodologies that can be adapted to future problems. And it resulted in
a new data set to test future adaptations.
Appendix A
Tetris Estimation Survey
Overview
The Tetris survey was conducted in early 2021 on the Qualtrics platform provided by USC. It pro-
vides a new data set of crowd-provided univariate probabilistic estimates that are more suitable
than traditional forecast competition data sets for examining calibration and combination method-
ologies. Screenshots of the survey are presented in this Appendix.
Figure A.1: Tetris Survey Screenshot: Consent page.
Figure A.2: Tetris Survey Screenshot: Demographic data collection.
Figure A.3: Tetris Survey Screenshot: The first question asks about Blue square blocks with no
debris.
Figure A.4: Tetris Survey Screenshot: Respondents see the correct answer following their response
on the first question.
Figure A.5: Tetris Survey Screenshot: Blue square blocks with 5 rows of debris on the game board.
Figure A.6: Tetris Survey Screenshot: Blue square blocks with 10 rows of debris on the game
board.
Figure A.7: Tetris Survey Screenshot: The next set of questions asks about Purple I-shaped blocks.
Figure A.8: Tetris Survey Screenshot: Purple I-shaped blocks with 5 rows of debris on the game
board.
Figure A.9: Tetris Survey Screenshot: Purple I-shaped blocks with 10 rows of debris on the game
board.
Figure A.10: Tetris Survey Screenshot: Respondents may accept a final challenging question.
Figure A.11: Tetris Survey Screenshot: Both Blue square blocks and Purple I-shaped blocks with
between 1 and 10 rows of debris on the game board.
Figure A.12: Tetris Survey Screenshot: Respondents are asked whether they would like to see
additional information.
Figure A.13: Tetris Survey Screenshot: Respondents are shown additional data from similar sim-
ulations.
Figure A.14: Tetris Survey Screenshot: Respondents are shown results from a mathematical model
based on similar assumptions.
Figure A.15: Tetris Survey Screenshot: Respondents are shown the correct answers and crowd
average responses at the completion of the survey (1 of 2).
Figure A.16: Tetris Survey Screenshot: Respondents are shown the correct answers and crowd
average responses at the completion of the survey (2 of 2).
Figure A.17: Tetris Survey Screenshot: Exit page.
References
Abbas, A. E. (2004). Entropy methods for adaptive utility elicitation. IEEE Transactions on Sys-
tems, Man, and Cybernetics Part A:Systems and Humans., 34(2), 169–178. https://doi.org/
10.1109/TSMCA.2003.822269
Abbas, A. E. (2005). Maximum Entropy Distributions between Upper and Lower Bounds. Amer-
ican Institute of Physics Conference Proceedings, 803, 53. https://doi.org/10.1063/1.
2149777
Abbas, A. E. (2006). Entropy methods for joint distributions in decision analysis. IEEE Transac-
tions on Engineering Management, 53(1), 146–159. https://doi.org/10.1109/TEM.2005.
861803
Abbas, A. E. (2018). Foundations of Multiattribute Utility. Cambridge University Press. https://doi.org/10.1017/9781316596739
Abbas, A. E., Budescu, D. V ., Yu, H.-T., & Haggerty, R. (2008). A Comparison of Two Probability
Encoding Methods: Fixed Probability vs. Fixed Variable Values. Decision Analysis, 5(4),
190–202. https://doi.org/10.1287/deca.1080.0126
Abeliuk, A., Benjamin, D. M., Morstatter, F., & Galstyan, A. (2020). Quantifying machine influ-
ence over human forecasters. Scientific Reports, 10(1), 1–14. https://doi.org/10.1038/
s41598-020-72690-4
Ariely, D., Bender, R. H., Dietz, C. B., Gu, H., Wallsten, T. S., Au, W. T., Budescu, D. V ., &
Zauberman, G. (2000). The effects of averaging subjective probability estimates between
and within judges (tech. rep. No. 2). https://doi.org/10.1037/1076-898X.6.2.130
Balke, A., & Pearl, J. (1995). Counterfactuals and Policy Analysis in Structural Models. Proceed-
ing of the 11th Conference on Uncertainty in Artificial intelligence, 11–18.
Bennett, M. J., & Hugen, D. L. (2016). Financial analytics with R: building a laptop laboratory
for data science. Cambridge University Press.
Bickel, D. R. (2012). Game-theoretic probability combination with applications to resolving con-
flicts between statistical methods. International Journal of Approximate Reasoning, 53,
880–891. https://doi.org/10.1016/j.ijar.2012.04.002
Brennan, M. J., & Schwartz, E. S. (1985). Evaluating Natural Resource Investments. The Journal
of Business, 58(2), 135–157.
Brier, G. W. (1950). Verification of Forecasts Expressed in Terms of Probability. Monthly Weather
Review, 78(1), 1–3. https://doi.org/10.1175/1520-0493(1950)078<0001:vofeit>2.0.co;2
Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., & Scott, S. L. (2015). INFERRING CAUSAL
IMPACT USING BAYESIAN STRUCTURAL TIME-SERIES MODELS. The Annals of
Applied Statistics, 9(1), 247–274. https://doi.org/10.1214/14-AOAS788
Budescu, D. V ., & Chen, E. (2015). Identifying Expertise to Extract the Wisdom of Crowds. Man-
agement Science, 61(2), 267–280. https://doi.org/10.1287/mnsc.2014.1909
Budescu, D. V ., & Du, N. (2007). Coherence and Consistency of Investors’ Probability Judgments.
Management Science, 53(11), 1731–1744. https://doi.org/10.1287/mnsc.1070.0727
Bumiller, E. (2010). Enemy Lurks in Briefings on Afghan War: PowerPoint - The New York Times.
Retrieved April 20, 2020, from https://www.nytimes.com/2010/04/27/world/27powerpoint.
html?_r=0
Cavagnaro, D. R., Pitt, M. A., Gonzalez, R., & Myung, J. I. (2013). Discriminating among proba-
bility weighting functions using adaptive design optimization. Journal of Risk and Uncer-
tainty, 47(3), 255–289. https://doi.org/10.1007/s11166-013-9179-3
Cavaleri, S., & Sterman, J. D. (1997). Towards evaluation of systems thinking interventions: a case
study (tech. rep.). John Wiley & Sons.
Chang, W., Chen, E., Mellers, B., & Tetlock, P. (2016). Developing expert political judgment:
The impact of training and practice on judgmental accuracy in geopolitical forecasting
tournaments. Judgement and Decision Making, 11(5), 509–526. http://journal.sjdm.org/16/
16511/jdm16511.html
Chang, W., & Tetlock, P. E. (2016). Rethinking the training of intelligence analysts. Intelligence
and National Security, 31(6), 903–920. https://doi.org/10.1080/02684527.2016.1147164
Chatfield, C. (1993). Calculating Interval Forecasts. Journal of Business & Economic Statistics,
11(2), 121–135.
Chow, C. K., & Liu, C. N. (1968). Approximating Discrete Probability Distributions with Depen-
dence Trees. IEEE Transactions on Information Theory, 14(3), 462–467.
Clemen, R. T. (1989). Combining forecasts: A review and annotated (tech. rep.).
Clemen, R. T., Fischer, G. W., & Winkler, R. L. (2000). Assessing Dependence: Some Experimen-
tal Results. Management Science, 46(8), 1100–1115. https://doi.org/10.1287/mnsc.46.8.
1100.12023
Clemen, R. T., & Reilly, T. (1999). Correlations and copulas for decision and risk analysis. Man-
agement Science, 45(2), 208–224. https://doi.org/10.1287/mnsc.45.2.208
Clemen, R. T., & Winkler, R. L. (1985). Limits for the Precision and Value of Information from
Dependent Sources. Operations Research, 33(2), 427–442. https://doi.org/10.1287/opre.
33.2.427
Clemen, R. T., & Winkler, R. L. (1999). Combining Probability Distributions From Experts in Risk
Analysis. Risk Analysis, 19(2).
Colson, A. R., & Cooke, R. M. (2017). Cross validation for the classical model of structured
expert judgment. Reliability Engineering and System Safety, 163, 109–120. https://doi.org/
10.1016/j.ress.2017.02.003
Cooke, R. M. (1993). Experts in uncertainty (K. Shrader-Frechette, Ed.; Vol. 44). Oxford Univer-
sity Press, Incorporated. https://doi.org/10.1016/0040-1625(93)90030-b
Cooke, R. M., Marion, Wittmann, E., Lodge, D. M., Rothlisberger, J. D., Rutherford, E. S., Zhang,
H., & Mason, D. M. (2014). Out-of-Sample Validation for Structured Expert Judgment of
Asian Carp Establishment in Lake Erie. Integrated Environmantal Assessment and Man-
agement, 10(4), 522–528. https://doi.org/10.1002/ieam.1559
Cooke, R. M., Marti, D., & Mazzuchi, T. (2021). Expert forecasting with and without uncertainty
quantification and weighting: What do the data say? International Journal of Forecasting,
37(1), 378–387. https://doi.org/10.1016/j.ijforecast.2020.06.007
Cordes, H., Foltice, B., & Langer, T. (2019). Misperception of Exponential Growth: Are People
Aware of Their Errors? DECISION ANALYSIS, 16(4), 261–280. https://doi.org/10.1287/
deca.2019.0395
Cowpertwait, P. S., & Metcalfe, A. V. (2009). Introductory Time Series with R. Springer Science &
Business Media. https://doi.org/10.1007/978-0-387-88698-5
Davis-Stober, C. P., Budescu, D. V ., Broomell, S. B., & Dana, J. (2015). The composition of opti-
mally wise crowds. Decision Analysis, 12(3), 130–143. https://doi.org/10.1287/deca.2015.
0315
Dawid, A. P., DeGroot, M. H., Mortera, J., Cooke, R., French, S., Genest, C., Schervish, M. J.,
Lindley, D. V ., McConway, K. J., & Winkler, R. L. (1995). Coherent combination of ex-
perts’ opinions. Test, 4(2), 263–313. https://doi.org/10.1007/BF02562628
Dillenberger, D., & Rozen, K. (2015). History-dependent risk attitude. Journal of Economic The-
ory, 157, 445–477. https://doi.org/10.1016/j.jet.2015.01.020
Dörner, D. (1996). The logic of failure: recognizing and avoiding error in complex situations.
Addison-Wesley Pub.
Du, Q., Hong, H., Wang, G. A., Wang, P., & Fan, W. (2017). CrowdIQ: A New Opinion Aggrega-
tion Model. Proceedings of the 50th Hawaii International Conference on System Sciences
(2017), 1737–1744. https://doi.org/10.24251/hicss.2017.211
Duke, A. (2018). Thinking in Bets: Making Smart Decisions When You Don’t Have All the Facts.
Portfolio/ Penguin.
Durante, F., & Sempi, C. (2016). Principles of Copula Theory. CRC Press.
Foltice, B., & Langer, T. (2017). Decision Analysis In Equations We Trust? Formula Knowledge
Effects on the Exponential Growth Bias in Household Finance Decisions. Decision Analy-
sis, 14(3), 170–186. https://doi.org/10.1287/deca.2017.0351
Forecasting Counterfactuals in Uncontrolled Settings (FOCUS). (2018). Retrieved March 30, 2020,
from https://www.iarpa.gov/index.php/research-programs/focus
Gaba, A., Tsetlin, I., & Winkler, R. L. (2017). Combining Interval Forecasts. Decision Analysis,
14(1), 1–20. https://doi.org/10.1287/deca.2016.0340
Galton, F. (1907). Vox Populi. Nature (London), 75(1949), 450–451. https://doi.org/10.1038/
075450a0
Garrett, R. T., Barragan, J., & Morris, A. (2020). Texas Gov. Greg Abbott says schools to remain
closed for rest of academic year but eases some coronavirus restrictions. Retrieved April
20, 2020, from https://www.dallasnews.com/news/public-health/2020/04/17/texas-gov-
greg-abbott-keeps-schools-closed-but-eases-some-coronavirus-restrictions/
Genest, C., & Zidek, J. V . (1986). Combining Probability Distributions: A Critique and Annotated
Bibliography. Statistical Science, 1(1), 114–135. https://projecteuclid.org/download/pdf_1/euclid.ss/1177013437
Gneiting, T., Balabdaoui, F., & Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharp-
ness. J. R. Statist. Soc. B, 69(2), 243–268.
Gneiting, T., & Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation.
Journal of the American Statistical Association, 102(477), 359–378. https://doi.org/10.
1198/016214506000001437
Gonzalez, R., & Wu, G. (1999). On the Shape of the Probability Weighting Function (tech. rep.).
http://www.idealibrary.comon
Graziani, C., Rosner, R., Adams, J. M., & Machete, R. L. (2021). Probabilistic recalibration of
forecasts. International Journal of Forecasting, 37, 1–27. https://doi.org/10.1016/j.ijforecast.2019.04.019
Grushka-Cockayne, Y ., Richmond, V ., & Jose, R. (2020). Combining prediction intervals in the
M4 competition. International Journal of Forecasting, 36, 178–185. https://doi.org/10.
1016/j.ijforecast.2019.04.015
Grushka-Cockayne, Y ., Richmond, V ., Jose, R. C., & Lichtendahljr, K. C. (2017). Ensembles of
Overfit and Overconfident Forecasts. Management Science, 63(4), 1110–1130. https://doi.
org/10.1287/mnsc.2015.2389
Guo, X., Grushka-Cockayne, Y ., & De Reyck, B. (2018). Forecasting Airport Transfer Passenger
Flow Using Real-Time Data and Machine Learning, Harvard Business School. https://doi.
org/10.2139/ssrn.3245609
Hall, M. L. (2016). System 3: Artificial Intelligence in Decision Making (tech. rep.). United States
Army War College. Carlisle, PA.
Han, Y ., & Budescu, D. (2019). A universal method for evaluating the quality of aggregators.
Judgment and Decision Making, 14(4), 395–411.
Haran, U., Moore, D. A., & Morewedge, C. K. (2010). A simple remedy for overprecision in
judgment. Judgment and Decision Making, 5(7), 467–476. www.freddiemac.com
Hartford, J., Lewis, G., Leyton-Brown, K., & Taddy, M. (2017). Deep IV: A Flexible Approach for
Counterfactual Prediction (tech. rep.). https://github.com/jhartford/DeepIV
Hastie, T., Tibshirani, R., & Friedman, J. (2009). Elements of Statistical Learning. Springer.
Hathout, M., Vuillet, M., Carvajal, C., Peyras, L., & Diab, Y . (2019). Expert judgments calibration
and combination for assessment of river levee failure probability. Reliability Engineering
and System Safety, 188, 377–392. https://doi.org/10.1016/j.ress.2019.03.019
Heckerman, D., Geiger, D., & Chickering, D. M. (1995). Learning Bayesian Networks: The Com-
bination of Knowledge and Statistical Data. Machine Learning, 20, 197–243.
Henrion, M., Fischer, G. W., & Mullin, T. (1993). Divide and Conquer? Effects of Decomposition
on the Accuracy and Calibration of Subjective Probability Distributions. Organizational
Behavior and Human Decision Making, 55, 207–227.
Herhkovitz, S. (2020). Crowdsourced Intelligence (Crosint): Using Crowds for National Security.
International Journal of Intelligence, Security, and Public Affairs, 22(1), 42–55. https://doi.org/10.1080/23800992.2020.1744824
Himmelstein, M., Atanasov, P., & Budescu, D. V . (2021). Forecasting Forecaster Accuracy: Psy-
chometric and Longitudinal Contributions of Past Performance and Individual Differences.
Judgement and Decision Making, 16(2), 323–362.
Hora, S. C. (2004). Probability Judgments for Continuous Quantities: Linear Combinations and
Calibration. Management Science, 50(5), 597–604. https://doi.org/10.1287/mnsc.1040.
0205
Hora, S. C. (2007). Eliciting probabilities from experts. In W. Edwards, R. F. J. Miles, & D.
von Winterfeldt (Eds.), Advances in decision analysis: From foundations to applications
(pp. 129–153). Cambridge University Press. https://doi.org/10.1017/CBO9780511611308.
009
Howard, R. A. (1989). Knowledge Maps. Management Science, 35(8), 903–922.
Howard, R. A., & Abbas, A. E. (2016). Foundations of Decision Analysis (Global). Pearson Edu-
cation Limited. https://doi.org/10.1002/9781118515853.ch3
Huang, Y ., Abeliuk, A., Morstatter, F., Atanasov, P., & Galstyan, A. (2020). Anchor Attention for
Hybrid Crowd Forecasts Aggregation (tech. rep.). https://www.iarpa.gov/index.php/research-programs/ace
Hull, J. (2018). Options, futures, and other derivatives / John C. Hull (Tenth). Pearson.
Hybrid Forecasting Competition (HFC). (2016). Retrieved March 30, 2020, from https://www.
iarpa.gov/index.php/research-programs/hfc
Hyndman, Rob J., & Khandakar, Yeasmin. (2008). Automatic Time Series Forecasting: The fore-
cast Package for R. Journal of Statistical Software, 27(3), 22. http://www.jstatsoft.org/v27/
i03/paper
Ingersoll, J. (2008). Non-Monotonicity of the Tversky-Kahneman Probability-Weighting Function:
A Cautionary Note. European Financial Management, 14(3), 385–390. https://doi.org/10.
1111/j.1468-036X.2007.00439.x
Jarrahi, M. H. (2018). Artificial intelligence and the future of work: Human-AI symbiosis in or-
ganizational decision making. Business Horizons, 61, 577–586. https://doi.org/10.1016/j.
bushor.2018.03.007
Jaynes, E. T. (1968). Prior Probabilities. IEEE Transactions on Systems Science and Cybernetics,
4(3), 227–241. https://doi.org/10.1109/TSSC.1968.300117
Jones, M. D. (1998). The Thinker’s Toolkit: Fourteen Powerful Techniques for Problem Solving
(Rev. and u). Times Business.
Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus; Giroux.
Kahneman, D., & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk.
Econometrica, 47(2), 263–292. https://about.jstor.org/terms
Karmarkar, U. S. (1978). Subjectively weighted utility: A descriptive extension of the expected
utility model. Organizational Behavior and Human Performance, 21, 61–72. https://doi.
org/10.1016/0030-5073(78)90039-9
Kasparov, G. K., & Greengard, M. (2017). Deep Thinking: Where Machine Intelligence Ends
and Human Creativity Begins. Public Affairs.
Keeney, L. R., & von Winterfeldt, D. (1991). Eliciting Probabilities from Experts in Complex
Technical Problems. IEEE Transaction on Engineering Management, 38(3), 191–201.
Keren, G. (1991). Calibration and probability judgments: Conceptual and methodological issues.
Acta Psychologica, 77, 217–273.
Klein, G. (2007). Performing a Project Premortem. Retrieved March 30, 2020, from https://hbr.
org/2007/09/performing-a-project-premortem
Kleinmuntz, D. N., Fennema, M. G., & Peecher, M. E. (1996). Conditioned Assessment of Subjec-
tive Probabilities: Identifying the Benefits of Decomposition (tech. rep. No. 1).
Koerth, M., Bronner, L., & Mithani, J. (2020). Why It’s So Freaking Hard To Make A Good
COVID-19 Model. Retrieved April 14, 2020, from https://fivethirtyeight.com/features/
why-its-so-freaking-hard-to-make-a-good-covid-19-model/
Kong, X., Zeng, X., Chen, C., Fan, Y ., Huang, G., Li, Y ., & Wang, C. (2018). Development of
a maximum entropy-Archimedean copula-based bayesian network method for streamflow
frequency analysis-A case study of the Kaidu River Basin, China. Water (Switzerland),
11(1), 1–16. https://doi.org/10.3390/w11010042
Krogerus, M., & Tschappeler, R. (2008). The decision book : fifty models for strategic thinking.
Profile.
136
Ku, H. H., & Kullback, S. (1969). Approximating Discrete Probability Distributions. IEEE Trans-
actions on Information Theory, 15(4), 444–447.
Kung, S., Yi, M., Steyvers, M., Lee, M. D., & Dry, M. J. (2012). The Wisdom of the Crowd in
Combinatorial Problems. Cognitive Science, 36, 452–470. https://doi.org/10.1111/j.1551-
6709.2011.01223.x
Laan, A., Madirolas, G., & de Polavieja, G. G. (2017). Rescuing collective wisdom when the
average group opinion is wrong. Frontiers Robotics AI, 4(NOV). https://doi.org/10.3389/
frobt.2017.00056
Laur´ ıa, E. J. M. (2005). Learning the Structure of a Bayesian Network: An Application of Infor-
mation Geometry and the Minimum Description Length Principle. AIP Conference Pro-
ceedings, 803, 293. https://doi.org/10.1063/1.2149807
Levy, W. B., & Delic, H. (1994). Maximum Entropy Aggregation of Individual Opinions. IEEE
Transactions on Systems Science and Cybernetics, 24(4), 606–613.
Lichtendahl, K. C., Grushka-Cockayne, Y ., Jose, V . R., & Winkler, R. L. (2018). Bayesian Ensem-
bles of Binary-Event Forecasts: When Is It Appropriate to Extremize or Anti-Extremize?,
Harvard Business School. https://doi.org/10.2139/ssrn.2940740
Lichtendahl, K. C., & Winkler, R. L. (2020). Why do some combinations perform better than
others? International Journal of Forecasting, 36(1), 142–149. https://doi.org/10.1016/j.
ijforecast.2019.03.027
Lin, S.-W., & Huang, S.-W. (2021). Effects of overconfidence and dependence on aggregated prob-
ability judgments. Journal of Modelling in Management, 7(1), 6–22. https://doi.org/10.
1108/17465661211208785
Liu, D. (2013). Hey New York Times: a causal loop diagram is not a PowerPoint fail. Retrieved
April 14, 2020, from http://sdwise.com/2013/07/hey-new-york-times-a-causal-loop-
diagram-is-not-a-powerpoint-fail/
Liu, W.-Y ., Yue, K., & Gao, M.-H. (2011). Constructing probabilistic graphical model from pred-
icate formulas for fusing logical and probabilistic knowledge. Information Sciences, 181,
3828–3845. https://doi.org/10.1016/j.ins.2011.05.006
MacGregor, D., Lichtenstein, S., & Slovic, P. (1988). Structuring Knowledge Retrieval: An Analy-
sis of Decomposed Quantitative Judgments (tech. rep.).
Macgregor, D. G., & Lichtenstein, S. (1991). Problem Structuring Aids for Quantitative Estimation
(tech. rep.).
137
Making Gray-Zone Activity more Black and White. (2019). Retrieved March 30, 2020, from https:
//www.darpa.mil/news-events/2018-03-14
Mandel, D. R. (2019). Too soon to tell if the US intelligence community prediction market is more
accurate than intelligence reports: Commentary on stastny and lehner (2018). Judgment
and Decision Making, 14(3), 288–292.
Mayhew, S. (2002). iEMSs International MeetingSecurity Price Dynamics and Simulation in Fi-
nancial Engineering. Proceedings of the 2002 Winter Simulation COnference, (May), 4059–
4064.
McAndrew, T., Wattanachit, N., Gibson, G. C., & Reich, N. G. (2021). Aggregating predictions
from experts: A review of statistical methods, experiments, and applications. Wiley Inter-
disciplinary Reviews: Computational Statistics, 13(2), 1–26. https://doi.org/10.1002/wics.
1514
Mellers, B., Stone, E., Murray, T., Minster, A., Rohrbaugh, N., Bishop, M., Chen, E., Baker, J.,
Hou, Y ., Horowitz, M., Ungar, L., & Tetlock, P. (2015). Identifying and Cultivating Super-
forecasters as a Method of Improving Probabilistic Predictions. Perspectives on Psycho-
logical Science, 10(3), 267–281. https://doi.org/10.1177/1745691615577794
Mellers, B., Ungar, L., Baron, J., Ramos, J., Gurcay, B., Fincher, K., Scott, S. E., Moore, D.,
Atanasov, P., Swift, S. A., Murray, T., Stone, E., & Tetlock, P. E. (2014). Psychologi-
cal Strategies for Winning a Geopolitical Forecasting Tournament. Psychological Science,
25(5), 1106–1115. https://doi.org/10.1177/0956797614524255
Mellers, B. A., & Tetlock, P. E. (2019). From discipline-centered rivalries to solution-centered
science: Producing better probability estimates for policy makers. American Psychologist,
74(3), 290–300. https://doi.org/10.1037/amp0000429
Merkle, E. C., & Hartman, R. (2018). Weighted Brier score decompositions for topically het-
erogenous forecasting tournaments. Judgment and Decision Making, 13(2), 185–201. https:
//doi.org/10.31219/osf.io/p6wk5
Merkle, E. C., Steyvers, M., Mellers, B., & Tetlock, P. E. (2017). A neglected dimension of good
forecasting judgment: The questions we choose also matter. International Journal of Fore-
casting, 33, 817–832. https://doi.org/10.1016/j.ijforecast.2017.04.002
Migliore, S., Curcio, G., Mancini, F., & Cappa, S. F. (2014). Counterfactual thinking in moral
judgment: An experimental study. Frontiers in Psychology, 5(MAY), 1–7. https://doi.org/
10.3389/fpsyg.2014.00451
Mitchell, T. M. (1997). Machine Learning. McGraw Hill. https://doi.org/10.1145/242224.242229
138
Miyoshi, T., & Matsubara, S. (2018). Dynamically Forming a Group of Human Forecasters and
Machine Forecaster for Forecasting Economic Indicators. Proceedings of the Twenty-Seventh
International Joint Conference on Artificial Intelligence (IJCAI-18).
Morstatter, F., Galstyan, A., Satyukov, G., Benjamin, D., Abeliuk, A., Mirtaheri, M., Hossain,
K. T., Szekely, P., Ferrara, E., Matsui, A., Steyvers, M., Bennet, S., Budescu, D., Himmel-
stein, M., Ward, M., Beger, A., Catasta, M., Sosic, R., Leskovec, J., . . . Abbas, A. (2019).
SAGE: A Hybrid Geopolitical Event Forecasting System. Proceedings of the Twenty-Eighth
International Joint Conference on Artificial Intelligence (IJCAI-19), 6557–6559. https :
//sage-platform.isi.
Myung, J., Ramamoorti, S., & Bailey, A. D. (1996). Maximum entropy aggregation of expert
predictions. Management Science, 42(10), 1420–1436. https://doi.org/10.1287/mnsc.42.
10.1420
Nelsen, R. B. (2006). An Introduction to Copulas second edition (Second). Springer Science+Business
Media, Inc.
Norris, N. (1976). General means and statistical theory. American Statistician, 30(1), 8–12. https:
//doi.org/10.1080/00031305.1976.10479125
O’Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P. H., Jenkinson, D. J., Oakley,
J. E., & Rakow, T. (2006). Uncertain Judgements: Eliciting Experts’ Probabilities. John
Wiley & Sons, Ltd. https://doi.org/10.1002/0470033312
O’leary, D. E. (2017). Crowd performance in prediction of the World Cup 2014. European Journal
of Operational Research, 260, 715–724. https://doi.org/10.1016/j.ejor.2016.12.043
Paddock, J. L., Siegel, D. R., & Smith, J. L. (1988). Option Valuation of Claims on Real Assets:
The Case of Offshore Petroleum Leases. The Quarterly Journal of Economics, 103(3),
479–508.
Page, M., Aiken, C., & Murdick, D. (2020). Future Indices: How Crowd Forecasting can Inform
the Big Picture (tech. rep. October). Center for Security and Emerging Technology. Wash-
ington, DC. https://cset.georgetown.edu/research/future-indices/
Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.
Morgan Kaufmann Publishers, Inc. https://doi.org/10.1016/b978-0-08-051489-5.50002-3
Pearl, J. (2000). Causality: Models , Reasoning , and Inference (First). Cambridge University Press.
Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146. https:
//doi.org/10.1214/09-SS057
139
Pearl, J., & Mackenzie, D. (2018). The book of why : the new science of cause and effect (First).
Basic Books.
Prahl, A., & van Swol, L. (2017). Understanding algorithm aversion: When is advice from automa-
tion discounted? Journal of Forecasting, 36(6), 691–702. https://doi.org/10.1002/for.2464
Prelec, D. (1998). The Probability Weighting Function. Econometrica, 66(3), 497–527.
Ranjan, R., & Gneiting, T. (2010). Combining probability forecasts. J. R. Statist. Soc. B, 72(1),
71–91.
Ravinder, H. V ., Kleinmuntz, D. N., & Dyer, J. S. (1988). The Reliability of Subjective Probabilities
Obtained Through Decomposition. Management Science, 34(2), 186–199. https://doi.org/
10.1287/mnsc.34.2.186
Reddy, K., & Clinton, V . (2016). Simulating stock prices using geometric Brownian motion: Evi-
dence from Australian companies. Australasian Accounting, Business and Finance Journal,
10(3), 23–47. https://doi.org/10.14453/aabfj.v10i3.3
Robinson, J. (2003). Future subjunctive: backcasting as social learning. Futures, 35, 839–856.
https://doi.org/10.1016/S0016-3287(03)00039-9
Roese, N. (1999). Counterfactual thinking and decision making. Psychonomic Bulletin & Review,
6(4), 570–578. http://www.nwu.edu/
RStudio, I. (2014). shiny: Easy web applications in R. http://shiny.rstudio.com
Russell, S. J., & Norvig, P. (2009). Artificial Intelligence: A Modern Approach (Third). Pearson.
Satop¨ a¨ a, V . A., Baron, J., Foster, D. P., Mellers, B. A., Tetlock, P. E., & Ungar, L. H. (2014). Com-
bining multiple probability predictions using a simple logit model. International Journal
of Forecasting, 30, 344–356. https://doi.org/10.1016/j.ijforecast.2013.09.009
Satop¨ a¨ a, V . A., Jensen, S. T., Mellers, B. A., Tetlock, P. E., & Ungar, L. H. (2014). Probability
aggregation in time-series: Dynamic hierarchical modeling of sparse expert beliefs. Annals
of Applied Statistics, 8(2), 1256–1280. https://doi.org/10.1214/14-AOAS739
Satop¨ a¨ a, V . A., Pemantle, R., & Ungar, L. H. (2016). Modeling Probability Forecasts via Infor-
mation Diversity. Journal of the American Statistical Association, 111(516), 1623–1633.
https://doi.org/10.1080/01621459.2015.1100621
Scutari, M. (2018). Dirichlet Bayesian Network Scores and the Maximum Relative Entropy Prin-
ciple, University of Oxford.
140
Scutari, M., Graafland, C. E., & Manuel Guti´ errez, J. (2019). Who Learns Better Bayesian Network
Structures: Accuracy and Speed of Structure Learning Algorithms.
Senge, P. M. (2006). The fifth discipline : the art and practice of the learning organization (Revised
an). Doubleday/Currency.
Senge, P. M., & Sterman, J. D. (1992). Systems thinking and organizational learning: Acting lo-
cally and thinking globally in the organization of the future (tech. rep.).
Smith, J. E. (1993). Moment Methods for Decision Analysis. Management Science, 39(3). https:
//doi.org/10.1287/mnsc.39.3.340
Sterman, J. D. (1989). Modeling Managerial Behavior: Misperceptions of Feedback in a Dynamic
Decision Making Experiment. Management Science, 35(3), 321. https://doi.org/10.1287/
mnsc.35.3.321
Steyvers, M., Wallsten, T. S., Merkle, E. C., & Turner, B. M. (2014). Evaluating Probabilistic
Forecasts with Bayesian Signal Detection Models. Risk Analysis, 34(3). https://doi.org/10.
1111/risa.12127
Surowiecki, J. (2005). The wisdom of crowds. Anchor Books.
Sweeney, L. B., & Sterman, J. D. (2007). Thinking about systems: student and teacher conceptions
of natural and social systems. Dyn. Rev, 23, 285–312. https://doi.org/10.1002/sdr
Sweeney, L. B., & Sterman, J. D. (2000). Bathtub dynamics: initial results of a systems thinking
inventory. System Dynamics Review, 16(4), 249–286.
Tetlock, P. E., Horowitz, M. C., & Herrmann, R. (2012). Critical Review A Journal of Politics and
Society SHOULD ”SYSTEMS THINKERS” ACCEPT THE LIMITS ON POLITICAL
FORECASTING OR PUSH THE LIMITS? Critical Review, 24(3), 375–391. https://doi.
org/10.1080/08913811.2012.767047
Tetlock, P. E., & Gardner, D. (2015). Superforecasting : the art and science of prediction (First).
Crown.
Turner, B. M., Steyvers, M., Merkle, E. C., Budescu, D. V ., & Wallsten, T. S. (2014). Forecast
aggregation via recalibration. Machine Learning, 95(3), 261–289. https://doi.org/10.1007/
s10994-013-5401-4
Tversky, A., & Kahneman, D. (1974). Judgment under Uncertainty: Heuristics and Biases. Science,
185(4157), 1124–1131.
Tversky, A., & Kahneman, D. (1992). Advances in Prospect Theory: Cumulative Representation
of Uncertainty. Journal of Risk and Uncertainty, 5, 297–323.
141
Waggoner, P. D., Kennedy, R., Le, H., & Shiran, M. (2019). Big Data and Trust in Public Policy
Automation. Statistics, Politics and Policy, 10(2), 115–136. https://doi.org/10.1515/spp-
2019-0005
Wallsten, T. S., Shlomi, Y ., Nataf, C., & Tomlinson, T. (2016). Efficiently encoding and model-
ing subjective probability distributions for quantitative variables. Decision, 3(3), 169–189.
https://doi.org/10.1037/dec0000047
Wang, T., & Dyer, J. S. (2012). A Copulas-Based Approach to Modeling Dependence in Decision
Trees. Operations Research, 60(1), 225–242. https://doi.org/10.1287/opre.1110.1004
Wilensky, U. (1999a). NetLogo. http://ccl.northwestern.edu/netlogo/
Wilensky, U. (1999b). Thinking in Levels: A Dynamic Systems Approach to Making Sense of the
World. Journal of Science Education and Technology, 8(1).
Wilensky, U. (2001). NetLogo Tetris model. http://ccl.northwestern.edu/netlogo/models/Tetris
Wilson, K. J. (2017). An investigation of dependence in expert judgement studies with multiple
experts. International Journal of Forecasting, 33, 325–336. https://doi.org/10.1016/j.
ijforecast.2015.11.014
Winkler, R. L., Grushka-Cockayne, Y ., Lichtendahl, K. C., & Jose, R. R. (2019). Probability Fore-
casts and Their Combination: A Research Perspective. https : / / doi . org / 10 . 2139 / ssrn .
3258627
Wright, G., Saunders, C., & Ayton, P. (1988). The Consistency, Coherence and Calibration of
Holistic, Decomposed, and Recomposed Judgemental Probability Forecasts. Journal of
Forecasting, 7, 185–199. https://search.proquest.com/docview/220297257?accountid=
12834
Yang, C., & Wilensky, U. (2011). NetLogo epiDEM Basic model. http://ccl.northwestern.edu/
netlogo/models/epiDEMBasic
Yaniv, I., & Foster, D. P. (1995). Graininess of Judgment Under Uncertainty: An Accuracy-Informativeness
Trade-Off. Journal of Experimental Psychology: General, 124(4), 424–432. https://doi.org/
10.1037/0096-3445.124.4.424
Zellner, M., Abbas, A. E., Budescu, D. V ., & Galstyan, A. (2021). A survey of human judgement
and quantitative forecasting methods. Royal Society Open Science, 8(2), 201187. https :
//doi.org/10.1098/rsos.201187
Zhang, H., & Maloney, L. T. (2012). Ubiquitous log odds: A common representation of probability
and frequency distortion in perception, action, and cognition. Frontiers in Neuroscience,
6(JAN), 1–14. https://doi.org/10.3389/fnins.2012.00001
142
Abstract
The purpose of this research is to advance understanding of the uses of human- and machine-generated probabilistic estimates. When used for decision analysis, probabilistic estimates give a decision maker a sense of both the possibilities and the risks of future prospects. To mitigate potential biases in their own thinking and gaps in their own knowledge, wise decision makers elicit such estimates from multiple sources, such as experts, crowds, and machine models. Different experiences and points of view often help a group reach better answers than its individual members would alone. However, multiple judges are unlikely to agree, and each is subject to their own systematic biases. In addition, many decisions (such as policy decisions) carry large consequences, may change in scope over time, and are riddled with uncertainties that are mutually relevant. Defining a joint probability distribution over the pertinent uncertainties then becomes a critical task in decision analysis. That task is made more difficult by a lack of available data, by the limits of human experts' ability to handle the cognitive complexity of the estimation, or often both.

To mitigate these problems, this research analyzes data from recent forecasting competitions to arrive at generalizable findings. The findings are combined with novel methodologies that together help answer three key questions of contemporary significance:

1. How should a decision maker mechanically alter probability estimates from human judges to account for systematic bias?
2. How should a decision maker combine probability estimates from multiple human judges and from mathematical models?
3. How should a decision maker elicit joint probability estimates from human judges and from data?

The answers to these questions are not cut and dried; they depend on the context of the estimation tasks and the larger decision to be made. The analyses presented in this research help codify best practices for decision makers in particular contexts. I use a linear-in-log-odds weighting function to alter probability estimates and a generalized aggregation function to combine them. I rely on graphical models and measures of relevance to describe joint distributions. Throughout the research, I pursue hybridization options that mitigate the weaknesses and capitalize on the strengths of both human- and machine-centric methods.

The major contributions of this dissertation are:

1. A newly proposed calibration function that extends previous work to the multinomial case and allows for the incorporation of a baseline probability distribution.
2. A flexible methodology to mechanically adjust and combine univariate probabilistic estimates from multiple sources.
3. A new data set of crowd-provided univariate probabilistic estimates that is more suitable than traditional forecast-competition data sets for examining calibration and combination methodologies.
4. A graphical methodology to produce joint probabilistic estimates from multiple sources.
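To make the adjust-then-combine workflow concrete, the sketch below applies the standard binary linear-in-log-odds (LLO) recalibration to several judges' estimates and then combines the results with a weighted power (generalized) mean. It is a minimal illustration only: the function names, parameter values, and choice of Python are my own assumptions, and the dissertation's multinomial calibration function with a baseline distribution is not reproduced here.

```python
import numpy as np

def llo_recalibrate(p, gamma=0.9, delta=1.0):
    """Standard binary linear-in-log-odds (LLO) recalibration:
    the log-odds of the output are linear in the log-odds of the input.
    gamma < 1 pulls estimates toward the middle; gamma > 1 extremizes."""
    num = delta * p**gamma
    return num / (num + (1.0 - p)**gamma)

def generalized_mean(probs, weights=None, r=1.0):
    """Weighted power (generalized) mean of several probability estimates.
    r = 1 gives the linear opinion pool (arithmetic mean);
    r -> 0 approaches the geometric pool."""
    probs = np.asarray(probs, dtype=float)
    if weights is None:
        weights = np.full(probs.shape, 1.0 / len(probs))
    weights = np.asarray(weights, dtype=float)
    if np.isclose(r, 0.0):
        return float(np.exp(np.sum(weights * np.log(probs))))
    return float(np.sum(weights * probs**r) ** (1.0 / r))

# Hypothetical example: recalibrate three judges' estimates, then combine.
judges = [0.55, 0.70, 0.85]
adjusted = [llo_recalibrate(p, gamma=0.8, delta=1.1) for p in judges]
combined = generalized_mean(adjusted, r=0.5)
print(adjusted, combined)
```

In this sketch, gamma below 1 anti-extremizes the individual estimates, while the exponent r of the power mean interpolates the combination between a geometric pool (r near 0) and a linear pool (r = 1).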
Asset Metadata
Creator
Haravitch, Lucas J.
(author)
Core Title
Human and machine probabilistic estimation for decision analysis
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Industrial and Systems Engineering
Degree Conferral Date
2021-08
Publication Date
07/26/2021
Defense Date
05/03/2021
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
forecast calibration,forecast combination,forecast competition,graphical models,hybridization,OAI-PMH Harvest
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Abbas, Ali (committee chair), John, Richard (committee member), Moore, James (committee member)
Creator Email
lharavit@usc.edu,luke.haravitch@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC15622960
Unique identifier
UC15622960
Legacy Identifier
etd-HaravitchL-9884
Document Type
Dissertation
Rights
Haravitch, Lucas J.
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu