Upload
pkj009
View
213
Download
0
Embed Size (px)
Citation preview
7/29/2019 2011 Graefe
1/13
International Journal o Forecasting 27 (2011) 183195
www.elsevier.com/locate/ijorecast
Comparing ace-to-ace meetings, nominal groups, Delphi andprediction markets on an estimation task
Andreas Graeea,, J. Scott Armstrong b
aInstitute or Technology Assessment and Systems Analysis, Karlsruhe Institute o Technology, Germanyb The Wharton School, University o Pennsylvania, Philadelphia, PA, United States
Abstract
We conducted laboratory experiments or analyzing the accuracy o three structured approaches (nominal groups, Delphi,
and prediction markets) relative to traditional ace-to-ace meetings (FTF). We recruited 227 participants (11 groups per method)
who were required to solve a quantitative judgment task that did not involve distributed knowledge. This task consisted o ten
actual questions, which required percentage estimates. While we did not fnd statistically signifcant dierences in accuracy
between the our methods overall, the results diered somewhat at the individual question level. Delphi was as accurate as FTF
or eight questions and outperormed FTF or two questions. By comparison, prediction markets did not outperorm FTF orany o the questions and were inerior or three questions. The relative perormances o nominal groups and FTF were mixed
and the dierences were small. We also compared the results rom the three structured approaches to prior individual estimates
and staticized groups. The three structured approaches were more accurate than participants prior individual estimates. Delphi
was also more accurate than staticized groups. Nominal groups and prediction markets provided little additional value relative
to a simple average o the orecasts. In addition, we examined participants perceptions o the group and the group process. The
participants rated personal communications more avorably than computer-mediated interactions. The group interactions in FTF
and nominal groups were perceived as being highly cooperative and eective. Prediction markets were rated least avourably:
prediction market participants were least satisfed with the group process and perceived their method as the most difcult.
c 2010 International Institute o Forecasters. Published by Elsevier B.V. All rights reserved.
Keywords: Forecast accuracy; Group decision-making; Method comparison; Satisaction; Combining
1. Introduction
In situations where a lack o appropriate or avail-able inormation precludes one rom using quanti-
Corresponding author.E-mail addresses: [email protected] (A. Graee),
[email protected] (J.S. Armstrong).
tative methods, it can be helpul to incorporate hu-
man judgment in order to improve the orecast. But
how does one get the best orecast when aggregating
inormation rom a group o people? While organi-
zations most commonly rely on unstructured ace-
to-ace meetings, it is difcult to fnd evidence to
support the use o this strategy (Armstrong, 2006).
0169-2070/$ - see ront matter c 2010 International Institute o Forecasters. Published by Elsevier B.V. All rights reserved.doi:10.1016/j.ijorecast.2010.05.004
http://dx.doi.org/10.1016/j.ijforecast.2010.05.004http://www.elsevier.com/locate/ijforecastmailto:[email protected]:[email protected]://dx.doi.org/10.1016/j.ijforecast.2010.05.004http://dx.doi.org/10.1016/j.ijforecast.2010.05.004mailto:[email protected]:[email protected]://www.elsevier.com/locate/ijforecasthttp://dx.doi.org/10.1016/j.ijforecast.2010.05.0047/29/2019 2011 Graefe
2/13
184 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195
The literature suggests that structured approaches such
as nominal groups or Delphi provide more accurateorecasts than traditional meetings.
In recent years, there has been increasing inter-est in prediction markets as an approach or eliciting
inormation rom people, and a number o organiza-
tions have started to experiment with them. However,to date, we still do not know much about the peror-
mance o prediction markets. The available studies are
limited and are oten o a small scale. In particular,we do not know o any study that has analyzed their
perormance relative to either meetings or other struc-
tured group techniques or aggregating the knowledgein groups. Since the emergence o the feld, there has
been no meta-analysis that has analyzed the accuracy
o prediction markets.To address this defciency, we conducted laboratory
experiments which compared unstructured meetings,nominal groups, Delphi, and prediction markets. We
analyzed the relative accuracies o the our group
techniques on a quantitative judgment task andexamined the participants perceptions o their group
and their group process.
2. Group judgment tasks
We selected problems that did not involve highlydistributed inormation. This is not intended to suggest
that problems involving inormation which is highly
dispersed among the participants are unimportant, butmerely that we addressed a limited but typical
orecasting situation: general problems or which
people are unlikely to possess specifc knowledge. In
our study, the groups had to come up with estimates ora quantitative judgment task. This task consisted o ten
actual questions that required percentage estimates.
These questions had correct solutions, but all group
members could be expected to have some degree ouncertainty about the answers.
The fndings rom group perormance studies are
generally task-specifc. The critical question arisesas to whether or not the tasks used in experimental
settings are representative o real-world problems,
and thus allow or generalizations to dierent types o
tasks (such as orecasting problems) or participants.For example, Wright and Ayton (1986) provided
some evidence that the fndings rom calibration
studies which used general knowledge questions are
not applicable to judgmental orecasting problems,since the two task types involve dierent levelso uncertainty. Furthermore, it is oten argued that
student participants in laboratory experiments arelaymen, whereas the participants in real-worldorecasting scenarios are experts who are expected topossess superior knowledge.
In general, or a judgmental orecasting method
to perorm well, it should be able to aggregate therelevant inormation rom its members efciently.Thus, a basic precondition or the generalizabilityo problems used in experimental settings is the
involvement o participants in trying to solve them.Factual questions are commonly analyzed by groupdecision-making, as they have been shown to spark the
interest o student participants.
3. Group techniques
We analyzed our group techniques that dier inthe amount and structure o the interactions permittedbetween group members: ace-to-ace meetings,
nominal groups, the Delphi method, and predictionmarkets.
3.1. Unstructured ace-to-ace meetings
Unstructured ace-to-ace meetings (FTF) allowany orm o direct interaction between groupmembers. Although meetings are most common or
group decision-making in organizations, they havebeen shown to be subject to many biases anddrawbacks. For example, (1) it requires time and eortor a group to maintain itsel; (2) groups tend to
aim to reach speedy decisions and do not considerall problem dimensions, and thus tend to pursue alimited train o thought, which leads to a centraltendency eect or groupthink; (3) less confdent
group members, or people rom lower hierarchy levels,may keep silent about their reservations because oeither group pressures or conormity or impliedthreats o sanctions; (4) dominant personalities tend to
exert an excessive amount o inuence on the group.For a summary o these issues, see Van de Ven andDelbecq (1971).
In summary, Armstrong (2006) ound little evi-dence to support the use o meetings or orecastingor decision-making. In addition, meetings are expen-
sive both to schedule and to run. In some situations,
7/29/2019 2011 Graefe
3/13
A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 185
however, one must be concerned with not only the
quality o a decision but also its acceptability. Peo-ple generally enjoy human interactions and the sense
o working together, and meetings have oten beenshown to achieve high levels o satisaction (e.g. Boje& Murnighan, 1982; Van de Ven & Delbecq, 1974).
3.2. Nominal group technique
The nominal group technique (NGT), developedby Van de Ven and Delbecq (1971, 1974), tries to
avoid some o the drawbacks o traditional meetings
by adding a structured ormat to the direct interactions.This process is conducted in three steps. First, group
members work independently to generate individual
estimates or a problem. Then, the group enters anunstructured discussion to deliberate on the problem.
Finally, the group members again work independently
to provide their fnal individual estimates. The groupresult is the aggregated outcome o these fnal
individual estimates. The literature also reers to this
process as estimate-talk-estimate (Gustason, Shukla,Delbecq, & Walster, 1973).
The idea o NGT is that direct interactions during
the assessment or evaluation phase can have a positive
impact on estimation or problem solving. In particular,they can help group members to clariy and justiy
their points o view, which may help the group to make
more inormed decisions. Nonetheless, in the phaseso generating answers and making the fnal decisions,
NGT prevents direct interactions between group
members, in order to reduce the known drawbackso traditional meetings. See Van de Ven and Delbecq
(1971) or a summary o the advantages o nominal
groups.
3.3. The Delphi method
Delphi is a multiple-round survey in whichthe participants anonymously reveal their individualestimates and provide comments on a problem. Ater
each round, the individual estimates and comments
are summarized and reported to the participants aseedback. Taking this inormation into account, the
participants then provide their new estimate in the
ollowing round. The group result is the aggregatedoutcome o the individual estimates o the fnal round.
Unlike FTF and NGT, which require the physical
proximity o the group members, the participants
in Delphi are physically dispersed and do not meetin person. In general, Delphi unctions somewhat
similarly to NGT. The main dierence is that written
interactions are utilized throughout the process, inorder to avoid direct interactions between the group
members.The strengths o the method are seen in its
structured communication process, which enablesdiscussion and helps groups to achieve a consensus,
but limits the drawbacks associated with directinteractions. For reviews o the Delphi method, seeRowe and Wright (1999) and Woudenberg (1991).
3.4. Prediction markets
Prediction markets were popular in the late 1800s(Rhode & Strump, 2004), and have an impressive
track record in orecasting elections. In analyzing964 polls or the fve presidential elections rom1988 to 2004, Berg, Nelson, and Rietz (2008) ound
that the respective orecasts o the Iowa ElectronicMarkets were closer to the actual election results thanthe individual polls 74% o the time. However, this
advantage disappeared when the polls were averagedand damped (Erikson & Wlezien, 2008).
Prediction markets are gaining attention in variousfelds o orecasting. The idea is to set up a contract
whose payo depends on the outcome o an uncertainuture event. This contract, which can be interpretedas a bet on the outcome o the underlying uture event,
can then be traded by participants. As soon as theoutcome is known, the participants are paid o inexchange or the contracts they hold. Based on their
individual perormances, participants can win money.I one thinks that the current group estimate is toolow (high), one will buy (sell) stocks. Thus, through
the prospect o gaining money, the participants havean incentive to become active in the group process
whenever they expect the group estimate to beinaccurate. As in Delphi, the participants are mutuallyanonymous, and thus are not subject to the socialpressures connected with direct interactions. The
main dierence is that the participants exchangeinormation continuously through the price signal othe market, but do not share comments or reasons as
to why they buy or sell a contract. For an explanationo the concept o prediction markets and a useulsummary o the method, see Wolers and Zitzewitz
(2004).
7/29/2019 2011 Graefe
4/13
186 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195
Prediction markets using play-money have alsoproven to be accurate. In comparing the results oplay-money and real-money markets or sport events,
Servan-Schreiber, Wolers, Pennock, and Galebach(2004) did not fnd any dierences in accuracy. Onthe other hand, Rosenbloom and Notz (2006) oundreal-money markets to be more accurate or non-sportsevents.
4. Related work
Structured group techniques are expected to limitthe biases associated with meetings, and prior researchseems to support this assumption. In their review,Rowe and Wright (1999) reported a superior accuracy
o Delphi over unstructured interaction by a scoreo fve studies to one. In excluding two papers romtheir analysis that did not analyze the accuracy orquantitative judgment tasks, we adjusted the score tothree studies to one, still with two ties. Similarly,in summarizing the literature, Woudenberg (1991)ound an albeit slight superiority o Delphiover unstructured interaction. Furthermore, again orquantitative judgment tasks, Gustason et al. (1973)ound NGT to be more accurate than FTF, althoughFischer (1981) ound no dierences. Evidence on therelative perormances o NGT, Delphi, and predictionmarkets or quantitative judgment tasks is scarce.While Gustason et al. (1973) showed that NGT wasmore accurate than Delphi, two studies ound nodierences (Boje & Murnighan, 1982; Fischer, 1981).In the case o prediction markets, we do not know oany work that has compared the method to FTF, NGTor Delphi.
Little work has been done on analyzing partici-pants perceptions o group processes, although thiscan be important or the acceptance o its results. I thegroup members or decision-makers eel dissatisfed
with the process, its outcome may not be adopted, eveni it is highly accurate. On the other hand, a processthat is highly satisying or participants may not nec-essarily lead to accurate results. In general, personalinteractions in groups can lead to either coherence, andthus a high level o perceived satisaction, or disagree-ment, resulting in rustrated group members. For quan-titative judgmental tasks, Boje and Murnighan (1982)analyzed the participants perceptions o Delphi, NGT,and people working alone. They ound that NGTwas rated most avorably in terms o eectiveness,
satisaction, and the reedom to participate, whereas
Delphi was rated only slightly superior to working
alone. Van de Ven and Delbecq (1974) compared par-
ticipants perceptions o nominal groups, Delphi, andunstructured meetings or an idea generation problem.
They ound that NGT participants expressed the high-
est satisaction with the process, whereas the dier-ences between the Delphi method and meetings were
small.
For prediction markets, we do not know o anystudy that has analyzed the way in which people
perceive participation and how this might aect the
acceptability o market results. However, practical
experience indicates that people have problems
in understanding how prediction markets work.
In particular, they have problems in translatinginormation into market prices (Green, Armstrong,
& Graee, 2007). In addition, people appear to have
difculties in understanding what the prices mean. Forexample, during the 2008 fnancial crisis, intrade.com
launched a contract or predicting whether or not
the US government would pass a bail out. At one
time, the contract was traded at a price o $80,
which meant that the market orecasted an 80%probability that the bail out would go through.
A sta writer at CNNMoney.com who would
be expected to possess a certain understanding ofnancial instruments interpreted the market prices
wrongly, stating that 80% o the participants thought
that the event would come true (CNNMoney.com:
Oddsmakers see bailout OK soon, September 24,
2008). A misunderstanding o how prediction markets
work might be critical to their acceptance by both
participants and decision-makers. I people do notunderstand how to reveal inormation, the way in
which inormation is aggregated into market prices, or
how to interpret market prices, it is doubtul whether
they will have a avorable perception o the method,and thus have confdence in its results.
5. Hypotheses
Given the lack o evidence, we had no directionalhypotheses on the relative accuracy o the three
structured approaches (NGT, Delphi, and prediction
markets), since arguments could be ound in either
direction. For example, prediction markets might be
advantageous to Delphi because traders can update
7/29/2019 2011 Graefe
5/13
A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 187
their estimates continuously and are not tied to alimited number o rounds. On the other hand, themarket participants do not reveal the reasons or their
estimates. Thus, other group members might not beable to relate to why the group estimate has changed.
Accordingly, the possibility o argumentation andreasoning might avor NGT and Delphi. For accuracy,we expected: NGT Delphi prediction markets >
FTF.We sent the study design to experts in the feld
and asked them to provide their prior estimates o
the relative accuracies o the methods. To ensurethat the experts were amiliar with all our methods,we primarily reached out to people with experiencein prediction markets. Eight experts1 responded and
we derived the ollowing ranking rom their priors:Delphi > prediction markets > NGT > FTF. Inexpecting that all three structured approaches wouldoutperorm FTF, the expert priors were consistentwith our hypotheses. In particular, none o the experts
expected FTF to do better than any o the structuredapproaches.
To obtain a measure o the acceptability o themethods, we asked or the participants perceptionso the group and the group process along seven at-
tributes. In particular, participants rated (1) coopera-
tion and (2) disagreement among group members, andrevealed their (3) confdence in the group results. Theyalso provided ratings o (4) how ree they elt to par-ticipate, (5) whether their time was well spent, (6)how difcult it was to participate, (7) how satisfedthey were with the group process, and (8) how eec-tive they thought the group process was or solvingthe estimation problem. Since our group interactionprocesses simulated non-hierarchical meetings, we ex-
pected the group pressures to be lessened. Assumingthat people generally enjoy human interaction and thesense o working together, we hypothesized that the
personal interactions in FTF and NGT will lead tomore avorable ratings compared to Delphi and pre-diction markets. Furthermore, due to the lower pres-sures or conormity in NGT, we expected NGT to berated more avorably than FTF, as has been shown pre-viously by Van de Ven and Delbecq (1974). In addi-
tion, since we expected the majority o traders to be
1 Thanks to Yiling Chen, Kesten Green, Robin Hanson, SteanLuckner, Gene Rowe, Martin Spann, Gerrit Van Bruggen and EricZitzewitz.
unamiliar with the process o revealing ones opin-
ions by trading stocks, we expected prediction mar-
kets to be rated least avorably, in particular in terms o
difculty. In summary, or acceptability, we expected:NGT > FTF > Delphi > prediction markets.
6. Methods
6.1. Participants
The participants were students (n = 227) at the
University o Pennsylvania who were recruited by
the Wharton Behavioral Lab. They were randomly
assigned to 44 heterogeneous groups (11 per method),
and the 11 prediction market groups were urtherdivided into two treatments. In 5 groups, the traders
started trading immediately, while in 6 groups,
the traders generated their individual estimates
independently beore participating in the market.
The group size was determined by the number
o students showing up in each session. Most o
the groups (42 o 44) consisted o 4 to 6 subjects,
though one NGT group contained 3 subjects and
one prediction markets group consisted o 7 subjects.
Appendix 1 provides an overview o the group size
and (average) number o participants per method.
All appendices can be ound in a supporting onlinedocument at http://www.orecasters.org/ij.
All sessions lasted or one hour. Each participant
was paid a $10 show-up ee. In addition, the
participants were remunerated depending on their
group or individual perormance. On average, $15
was distributed among each group. In FTF, NGT,
and Delphi, $50 was distributed equally among the
participants in the two most accurate groups, while the
next two most accurate groups received $25, and the
next $15. To match the traditional pay-o mechanism
o prediction markets, the traders were paid basedon their individual perormances, i.e. $6 or the best
perorming trader in each group, $5 or the second
best, and $4 or the third best.
6.2. Materials and procedures
The data were collected in the Wharton Behavioral
Lab. FTF and NGT were conducted in a small
meeting room, while the Delphi and prediction
market groups worked in a computer room on
http://www.forecasters.org/ijfhttp://www.forecasters.org/ijfhttp://www.forecasters.org/ijf7/29/2019 2011 Graefe
6/13
188 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195
individual workstations, divided by partition walls
to prevent personal communication. Each participant
received general instructions explaining the relevant
group technique, as well as the respective pay-omechanism. As the process started, orms were handed
out to each participant or making personal notes
and analyses. During the whole process, whenever
individuals or groups were asked to give estimates,
they had to state their confdence in these estimates on
a seven point numerical scale (1: not at all confdent;
7: extremely confdent). Ater the group interaction,
all participants rom all group techniques received a
post-task orm on which to reveal their fnal individual
estimates. In addition, the post-task orm was used to
obtain the participants perceptions o the group and o
the group process as a whole, again on a seven pointnumerical scale. Examples o the materials used can
be ound in Appendices 3 to 6.
6.2.1. Unstructured ace-to-ace meetings
The group members were seated around a table to
discuss the problem, with the goal o reaching a group
estimate. Each group received one group questionnaire
or their fnal group estimates. In addition, each group
member received an individual questionnaire on which
to note the group estimates achieved.
6.2.2. Nominal groups
First, the group members were spread over three
tables to work independently to provide individual
estimates or each o the ten questions on a paper
questionnaire. Ater each participant had completed
the questionnaire, the group members were seated
around one table to discuss their individual estimates
with the group. Each group received one group
questionnaire or their fnal group estimates. In
addition, each group member received an individual
questionnaire on which to note the group estimatesachieved. Finally, the group members were again
spread out over three tables to provide their fnal
individual estimates on the post-task questionnaire.
The group results were calculated as the median o the
fnal individual estimates.
6.2.3. Delphi
Beore logging on to the system, the participants
watched a video tutorial (http://tinyurl.com/method-
comparison-Delphi) o how to use the Delphi sot-
ware. In order to conduct our research in a realis-
tic environment that can be adopted by practitioners,
we used the ree Delphi sotware available at For-
Prin.com. This sotware was developed in order to ed-ucate the users about Delphi and to aid them in the
use o the standard Delphi procedure by including its
our key eatures: anonymity o participants, iteration,
controlled eedback, and statistical aggregation o in-
dividual estimates. In each round, the sotware also
required the participants to provide their individual nu-
merical estimates, together with 10% lower and upper
confdence bounds. Furthermore, the participants had
the option to provide comments. Ater everyone had
responded, the results o the frst round were summa-
rized and reported as eedback to the participants; this
included each individual estimate (along with the con-fdence interval), a group summary o these estimates
(mean, median, and standard deviation), and each indi-
viduals comments per question. The participants then
provided their fnal individual estimates in a second
round. For our analysis, we used the median o the in-
dividual estimates rom the second round to calculate
the group results.
6.2.4. Prediction markets
We used the (partly) ree sotware solution o
inklingmarkets.com, which was specifcally designedto make participation as easy as possible or
inexperienced traders. In addition, since our markets
had only a small number o participants, it was
necessary to rely on a system using a market maker,
in order to ensure a sufcient liquidity o the market.
With Inklings market maker, the participants can
buy and sell contracts, and thus reveal inormation,
at any time, as they interact with the market maker
directly, and do not require a human trading partner.
Thus, the market can elicit inormation even rom
a single participant. For a urther discussion o thesotware, see Christiansen (2007), who conducted
feld experiments using Inkling.
Each question was mapped as one contract, which
could be traded by participants. At market closure,
each contract was liquidated at the value o the
correct answer or each question. Thus, i one
thought that the correct answer was higher than
the contracts current value, one would have bought
shares, otherwise one would have sold. The procedure
is explained in detail in a video tutorial, which
http://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphi7/29/2019 2011 Graefe
7/13
A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 189
was watched by each participant beore logging on
to the system (http://tinyurl.com/method-comparison-PM). Each participant had an initial deposit value o
$30,000 o play-money. At the end o the session, thelast traded contract prices were interpreted as the fnalgroup results.
6.3. Quantitative judgment task
The questions were inuenced by two limitationso the prediction market sotware. First, although
prediction markets can theoretically be used to predict
large numbers, it would have been necessary to scalethe index markets, which might have made it harder
or the participants to understand. Second, or market
maker mechanisms it is necessary to provide a startingprice, which should reasonably be chosen to be
greater than zero. However, announcing starting prices
provides traders with an anchor.To circumvent these problems, the quantitative
judgment task consisted o ten almanac questions that
required percentage estimates. Thus, no scaling was
necessary, since the market prices would always bebetween 0 and 100. Furthermore, all questions were
designed to be similar by providing an anchor in
the ormulation o the question, which was used asthe starting price in the markets. Examples include:
In 1900, the percentage o the total US population
that was aged 65 and over was 4.1%. What was thepercentage in 2000? or In 1980, 17.0% o the US
population (25 years and over) had completed 4 years
o college or more. What was the percentage in 2006?Appendix 2 provides a complete list o the questions
used in this study.It was emphasized in the written instructions and
by the incentive mechanism that the purpose o thetask was to estimate the answers to the questions. The
answers were not revealed until ater the completion
o the study.
7. Results
We used the absolute error as an index o
accuracy. Tests o statistical signifcance are reportedin ootnotes.2 Signifcance levels are indicated in the
2 Since the data were not normally distributed in most cases, weconducted non-parametric signifcance tests.
tables using asterisks ( : p 0.05; : p 0.01).The data that were analyzed in this study can be oundonline at http://tinyurlcom/method-comparison-data.
For NGT, Delphi, and six o the prediction marketgroups, we had access to the participants ex anteestimates prior to them entering the group interaction.We used both these individual priors and the mean othese estimates rom the participants in each groupas additional benchmarks or our analysis. In theollowing, the mean prior estimates or each group arereerred to as staticized groups.
7.1. Group accuracy beore group interaction
We analyzed the accuracy o the staticized groupsbeore group interaction took place. The goal othis exercise was to rule out the possibility that anydierences between the methods might have occurredbecause some groups simply happened to be betterthan others. For each question, as well as over all tenquestions, we did not fnd any dierences in accuracybetween the participants priors in NGT, Delphi, andprediction markets.3
7.2. Accuracy o structured group interaction vs.
individual priors
Table 1 shows the results, reported as the errorreduction or each question as well as over allten questions. The error reduction is the dierencebetween the absolute errors o the prior individualestimates and the respective method result. A positive(negative) value or the error reduction means thatthe method results were more (less) accurate than theindividual priors. As expected, overall, the results oeach method were more accurate than the individualpriors.4 At the individual question level, predictionmarkets were more accurate than the individualpriors or our questions and equally accurate or
six questions. NGT was more accurate or eightquestions and equally accurate or two questions.Delphi was most accurate, yielding a lower MAE thanthe respective individual priors or each o the tenquestions.
3 The Kruskal-Wallis test was used to test or dierences betweenthe absolute errors o the prior individual estimates o the threestructured approaches.
4 The Wilcoxon rank-sum test was used to test or dierencesbetween the absolute errors o the individual priors and the methodresults. Table 1 reports one-tailed p-values.
http://tinyurl.com/method-comparison-PMhttp://tinyurl.com/method-comparison-PMhttp://tinyurl.com/method-comparison-PMhttp://tinyurlcom/method-comparison-datahttp://tinyurlcom/method-comparison-datahttp://tinyurlcom/method-comparison-datahttp://tinyurl.com/method-comparison-PMhttp://tinyurl.com/method-comparison-PMhttp://tinyurl.com/method-comparison-PM7/29/2019 2011 Graefe
8/13
190 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195
Table 1
Mean error reduction o each method compared to prior individual estimates beore group interaction.
Question NGT Delphi PM
(N = 54) (N = 59) (N = 34)Population aged 65+ 1.33* 3.15** 0.88
Internet users 2.69* 3.50** 1.84
Health expenditures 3.60** 2.99** 0.85
College graduates 4.12* 4.95** 6.51**
Population o Australia 2.42** 1.48** 2.67**
Overweight children 1.77 0.79** 0.79
Children with two parents 2.57** 4.51** 4.06**
Households with PC 0.78 5.08** 3.09
Motor vehicle production 9.82** 5.29** 5.43
Population aged 5 1.86** 1.24** 4.42**
Overall 3.10** 3.60** 3.05**
N: number o prior individual estimates. p 0.05.
p 0.01.
7.3. Accuracy o group interaction vs. staticized
groups
Group interactions should yield more accurateresults than can be obtained by simply averaging
the prior individual estimates within each group.
We compared the methods results to those o thecorresponding staticized groups. The results, again
reported as the error reduction, are shown in Table 2.5On average, the NGT results were signifcantly
more accurate than the staticized groups or three out
o ten questions (i.e., population o Australia, motorvehicle production, and population aged 5). For
the question households with PC, the NGT results
were signifcantly less accurate. For the remainingsix questions, there were no statistically signifcant
dierences in accuracy.The Delphi method yielded signifcantly more
accurate results than the staticized groups or three
out o the ten questions. For the remaining sevenquestions, the dierences were small.Prediction markets improved on the staticized
groups or two questions. For the remaining questions,
there were no statistically signifcant dierences,which might be due to the small number o
observations. However, although the dierences were
5 The Wilcoxon rank-sum test was used to test or dierencesbetween the absolute errors o the staticized groups and the methodresults. Table 2 reports one-tailed p-values.
small, prediction markets were less accurate than thestaticized groups or six o the ten questions.
Over all ten questions, group interaction improvedon the staticized groups. However, that being said, thegains in accuracy were modest and the sample sizeswere not large, and thus there is some uncertainty
about these fndings. The dierences were statisticallysignifcant only with respect to Delphi.
7.4. Relative accuracy o group methods
For each method, we calculated the MAEs oreach question as well as over all ten questions.With the lowest overall MAE o 5.62, Delphi wasthe most accurate, ollowed by NGT (5.92) and
prediction markets (7.07). As expected, FTF wasleast accurate; its overall MAE was 7.21. Althoughthe results conormed to our expectations that thestructured approaches would outperorm FTF, theoverall dierences between the methods were not
statistically signifcant.6 At the question level, theevidence on the relative perormances o the methodswas mixed. Table 3 shows the error reduction o NGT,Delphi, and prediction markets compared to FTF.
The dierences between FTF and NGT weresmall.7 For eight o the ten questions, there were no
6 The Kruskal-Wallis test showed no statistically signifcantdierences between the our methods over all 10 questions.
7 We conducted Mann-Whitney U-tests to compare the results oFTF and the respective structured approaches or each question.
7/29/2019 2011 Graefe
9/13
A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 191
Table 2
Mean error reduction o each method compared to staticized groups.
Question NGT Delphi PM
(N = 11) (N = 11) (N = 6)Population aged 65+ 0.87 0.93 1.7
Internet users 0.89 1.59 1.13
Health expenditures 0.71 0.09 3.22
College graduates 0.3 1.52 3.25
Population o Australia 2.13** 1.45** 2.47*
Overweight children 2.77 1.02 3.52
Children with two parents 1.69 1.39 1.37
Households with PC 4.96** 2.09 1.37
Motor vehicle production 8.79** 3.58* 3.95
Population aged 5 1.58** 0.73* 3.77*
Overall 0.41 0.54* 0.17
N: number o groups. p 0.05. p 0.01.
Table 3
Error reduction o NGT, Delphi, and prediction markets compared to FTF.
Question MAE Error reduction to FTF
FTF NGT Delphi PM
Population aged 65+ 7.76 2.69 4.29** 1.22
Internet users 4.53 0.53 1.33 3.36
Health expenditures 3.86 2.19 1.38 0.16
College graduates 6.36 0.27 0.26 1.87
Population o Australia 2.41 0.09 1.2 2.35*
Overweight children 13.98 7.93* 9.74** 5.52
Children with two parents 7.19 0.8 2.77 0.98
Households with PC 15 0.82 7.21 6.5
Motor vehicle production 10.12 0.22 5.49 7.92*
Population aged 5 0.86 1.00* 1.21 1.26**
Overall 7.21 1.29 1.59 0.14
p 0.05. p 0.01.
signifcant dierences between the two methods. For
the two remaining questions, NGT was more accurate
than FTF once and less accurate once.
Delphi was signifcantly more accurate than FTF
or two questions. For the remaining eight questions,
there were no dierences in accuracy. In no case was
Delphi signifcantly less accurate than FTF.
Contrary to our expectations, prediction markets
were unable to improve on the accuracy o FTF.
Instead, or three questions, prediction markets
yielded results that were signifcantly less accurate
than the respective FTF results. For the remaining
questions, there were no statistically signifcantdierences.
7.5. Participants perceptions
Ater the group interaction, we asked the partic-ipants about their perceptions o the group, as wellas o the group process. We had hypothesized thatthe personal interactions between participants wouldlead to more avorable ratings or FTF and NGTrelative to Delphi and prediction markets. Further-
more, we had assumed that, due to lower group pres-sures, NGT would be rated more avorably than FTF.
7/29/2019 2011 Graefe
10/13
192 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195
Fig. 1. Participants mean ratings o each method (on a 7-point scale).
Finally, because o the participants lack o amiliaritywith the approach, we expected prediction markets to
be rated most unavorably on most attributes. The re-
sults were consistent with these hypotheses. For eachattribute, Fig. 1 reports the statistical mean (and 95%
confdence intervals) o the participants ratings, re-vealed on a 7-point numerical scale rom 1 (very
low) to 7 (very high). Note that or the attributesdisagreementand difculty, higher ratings were inter-
preted as being negative.
Ratings o the group
NGT obtained the most avorable ratings or coop-eration and disagreement and ranked second or par-
ticipants confdence in the group answers. Predictionmarkets ranked second worst or disagreement andconfdence and worst or cooperation. For both coop-
eration and disagreement, FTF ranked second.
Although Delphi obtained the worst score ordisagreement, the participants were most confdentin the outcomes. We see the ollowing reasons or
the high level o perceived disagreement in Delphi:(1) Delphi participants cannot communicate with each
other directly, but reveal their estimates independently;and (2) agreement can only increase i incorporating
eedback inormation ater the frst round leads to
more cohesive results at the end o the process.However, disclosing disagreement can be desirable, as
it alerts decision-makers to uncertainty.
FTF obtained the lowest score in terms oconfdence in the group outcome. This is surprising,
since we would have expected that high levels o
perceived cooperation and agreement would result inhigh levels o confdence in the group results. We
assume that the FTF participants were aware that, dueto group tendency eects, probably not all available
inormation rom group members was aggregated.
Ratings o the group process
The rankings o the group process as a whole were
identical over 4 out o 5 attributes, with NGT placedfrst, FTF second, Delphi third and prediction markets
ourth. For difculty to participate, FTF achieved
an equally avorable score as NGT. For reedom toparticipate, FTF and Delphi swapped places. Given
the avorable ratings or FTF on most other attributes,it is interesting that Delphi achieved a higher score
than FTF or reedom to participate. Again, the reasonmight be that the FTF participants experienced group
pressures that may have hindered them rom ully
revealing their inormation, which fts with the lowconfdence scores or FTF. Overall, the participants
perceptions o the prediction markets process werepoor or all attributes.
In summary, the results conormed to our expecta-
tions. NGT obtained the most avorable ratings or 7 othe 8 categories. As hypothesized, FTF was also rated
quite avorably, being ranked second or six attributes.
7/29/2019 2011 Graefe
11/13
A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 193
Finally, the results corroborated our hypothesis that
prediction markets would be rated most unavorably:they had the worst ratings or six attributes and were
second worst or the two remaining attributes.
8. Discussion
Although the three structured approaches yielded
lower MAEs than FTF over all ten questions, the di-erences in accuracy between the our methods were
not statistically signifcant. The results conormed to
our hypotheses only with respect to the direction o
the results.However, some dierences between the methods
were identifed at the question level. Delphi was theonly approach that was never less accurate than FTF,
and outperormed FTF or two o the ten questions.By comparison, prediction markets were less accurate
than FTF or three o the ten questions and were
unable to outperorm FTF or any o the questions.
The relative perormances o NGT and FTF weremixed and the dierences were small. We can only
speculate on the reasons or these results.The participants in this study had to solve a
quantitative judgment task that required percentageestimates o the answers to actual questions. Such
tasks have clear solutions, lack highly distributedknowledge and uncertainty, and require groups only
to aggregate actual knowledge. We would expect thatstructured approaches would be substantially more
eective when participants reveal inormation that is
new and useul to others.For example, or site location problems, dierent
experts might have distinct knowledge about labor
and real estate costs, trafc patterns, and the
types o consumers that live near a potential retail
outlet. Other tasks, such as orecasting, decision-
making, negotiation, or conict solving, are morechallenging, as they involve not only acts, but also
values, emotions, and expectations, as well as highlevels o uncertainty. We would expect the relative
perormances o methods to be largely inuenced by
such problems (Wright & Ayton, 1986). In particular,we would expect group pressures to be much higher,
and thereore FTF to generally perorm worse.In contrast, when all o the experts draw upon
the same inormation, which might have been the
case or the type o problem used in this study,
inormation exchange is expected to be o little value.
Over the ten questions, only Delphi yielded more
accurate results (in terms o statistical signifcance)
than staticized groups. By comparison, the dierencesbetween prediction markets or NGT and staticized
groups were not statistically signifcant.
Under most circumstances, structured combined
orecasts are substantially more accurate than in-
dividual orecasts (Armstrong, 2001). Our results
provide urther support or this well-established prin-
ciple. Each o the three structured approaches yielded
lower mean absolute errors than the prior individual
estimates.
Another actor that might have inuenced the
results is the experiment design. The experiment
setting avored meetings, since it did not involve
hierarchies. In a more realistic environment with
hierarchies, we would expect group pressures to be
more o a concern, and as a result, the relative
perormance o FTF would decrease.
The comparatively poor perormance o prediction
markets may have been due to the limited means
o communication between participants. Although the
participants could exchange inormation continuously
through trading, they were unable to provide reasoning
or their estimates. They could only buy and sell
contracts, without indicating why they did so. Thus,the trading mechanism might not have allowed the
participants to convey inormation to the group.
However, prediction markets could be designed to
overcome such barriers. For example, additional
means o communication, such as orums or chat
rooms, could be implemented to allow traders
to communicate with each other and to reveal
the reasoning or their actions. Although several
commercial prediction markets have implemented
such solutions, we are not aware o any research that
has examined whether this has an eect on accuracy.Examining the participants perceptions o the
group and o the group process as a whole revealed
a preerence or methods involving personal commu-
nication. These results supported our hypothesis that
personal interactions in non-hierarchical meetings can
lead to coherence, and thereore increase the perceived
satisaction o the participants. In contrast, Delphi and
prediction markets were rated less avorably. In par-
ticular, prediction markets were rated worst on six
attributes and second worst on the remaining two.
7/29/2019 2011 Graefe
12/13
194 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195
We see at least two reasons or this. First, the lack oexchanging reasons or their judgments may alienate
traders. We would expect comort with the method to
rise with increasing insight into the opinions o el-low traders. Second, prediction markets are still a rel-atively new approach, and the idea o revealing ones
opinion through trading stocks may be difcult or par-ticipants to understand. Even though we relied on sot-
ware that was specifcally designed to make participa-tion as easy as possible or novices, participation in
prediction markets was rated as the most difcult byar. We assume that the high perceived difculty was
crucial to the poor ratings on other categories, whichin turn may hinder the urther adoption o prediction
markets by practitioners. Researchers and developers
should urther increase their eorts to make marketsotware solutions more accessible to inexperiencedtraders.
Participants perceptions o the group process canbe crucial to the acceptance o its results. When the
agreement among and satisaction o groupmembers is high, decision-makers can share the
responsibility or their decisions. Thus, our resultsmay entice decision-makers to rely on methods
involving personal communication. However, such astrategy has its drawbacks.
1. There is no evidence that high perceived satisac-tion is correlated to good perormance.
2. Taking into account administrative and time eorts,
meetings are expensive and can be difcult toconduct, as they require the presence o the
participants.3. Meetings are limited in the number o participants
involved. In contrast, Delphi and predictionmarkets do not require the presence o the
participants and can be conducted asymmetricallywith a virtually unlimited number o people.
4. Meetings can lead to poor decisions. Due to highergroup pressures in real world situations, we would
expect meetings to perorm worse or more com-plex problems and more realistic environments,
especially when the relevant inormation is dis-tributed among the experts.
9. Conclusion
We compared the relative accuracy o traditional
FTF to that o three structured approaches (NGT,
Delphi, and prediction markets) on a quantitative
judgment task that required percentage estimates orten actual questions.
Over the ten questions, we did not fnd any sta-tistically signifcant dierences in accuracy between
the our methods. However, the method results di-
ered somewhat at the individual question level. Del-phi was as accurate as FTF or eight questions and
outperormed FTF or two questions. By comparison,
prediction markets were unable to outperorm FTFor any o the questions and were inerior or three
questions. The relative perormances o NGT and
FTF were mixed and the dierences were generallysmall.
The three structured approaches were more accu-
rate than the participants prior individual estimates.Delphi was also more accurate than staticized groups,
in contrast to NGT and prediction markets. The re-sults suggest that NGT and prediction markets pro-
vide little additional value over simple averages o
orecasts in situations where the participants are alldrawing upon similar inormation. Future research
should evaluate the relative perormances o the meth-
ods or more complex problems in more realisticenvironments.
We also analyzed the participants perceptions o
the group and o the group process as a whole.The participants rated methods involving personal
communication (FTF and NGT) more avorablythan the computer-mediated Delphi and prediction
markets. In particular, the FTF and NGT participants
experienced higher levels o cooperation amongtheir groups and perceived the group interaction as
more eective. Prediction markets were rated least
avorably: the prediction market participants wereleast satisfed with the group process and rated their
method highest in terms o difculty o participation.
Acknowledgements
We would like to thank Kesten Green, RobinHanson, and David Pennock or helpul comments.
Thanks to the sta o the Wharton Behavioral
Lab, and in particular Daniela Lejtneker, Tatiana
Silva, and Patricia Zapater-roig. Ellen Rosenbergand Christian Menn helped with preparing materials.
Thanks to inklingmarkets.com or providing the
prediction markets sotware.
7/29/2019 2011 Graefe
13/13
A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 195
References
Armstrong, J. S. (2001). Combining orecasts. In J. S. Armstrong
(Ed.), Principles o orecastinga handbook or researchers
and practitioners (pp. 417439). Boston, MA: Kluwer
Academic Publishers.
Armstrong, J. S. (2006). Howto make betterorecasts and decisions:
avoid ace-to-ace meetings. ForesightThe International
Journal o Applied Forecasting, 5, 38.
Berg, J. E., Nelson, F. D., & Rietz, T. A. (2008). Prediction market
accuracy in the long run. International Journal o Forecasting,
24, 285300.
Boje, D. M., & Murnighan, J. K. (1982). Group confdence
pressures in iterative decisions. Management Science, 28,
11871196.
Christiansen, J. D. (2007). Prediction markets: practical experiments
in small markets and behaviours observed. Journal o Prediction
Markets, 1, 1741.Erikson, R. S., & Wlezien, C. (2008). Are political markets
really superior to polls as election predictors? Public Opinion
Quarterly, 72, 190215.
Fischer, G. W. (1981). When oracles aila comparison o
our procedures or aggregating subjective probability ore-
casts. Organizational Behavior and Human Perormance, 28,
96110.
Green, K. C., Armstrong, J. S., & Graee, A. (2007). Methods
to elicit orecasts rom groups: Delphi and prediction markets
compared. ForesightThe International Journal o Applied
Forecasting, 8, 1720.
Gustason, D. H., Shukla, R. K., Delbecq, A., & Walster, G. W.
(1973). A comparative study o dierences in subjective
likelihood estimates made by individuals, interacting groups,
Delphi groups, and nominal groups. Organizational Behavior
and Human Perormance, 9, 280291.Rhode, P. W., & Strump, K. S. (2004). Historical presidential
betting markets. Journal o Economic Perspectives, 18,
127141.
Rosenbloom, E. S., & Notz, W. (2006). Statistical tests o
real-money versus play-money prediction markets. Electronic
Markets, 16, 6369.
Rowe, G., & Wright, G. (1999). The Delphi technique as a
orecasting tool: issues and analysis. International Journal o
Forecasting, 15, 353375.
Servan-Schreiber, E., Wolers, J., Pennock, D. M., & Galebach, B.
(2004). Prediction markets: does money matter? Electronic
Markets, 14, 243251.
Van de Ven, A. H., & Delbecq, A. L. (1971). Nominal versus
interacting group processes or committee decision making
eectiveness. Academy o Management Journal, 14, 203212.
Van de Ven, A. H., & Delbecq, A. L. (1974). The eectiveness
o nominal, Delphi, and interacting group decision making
processes. Academy o Management Journal, 17, 605621.
Wolers, J., & Zitzewitz, E. (2004). Prediction markets. Journal o
Economic Perspectives, 18, 107126.
Woudenberg, F. (1991). An evaluation o Delphi. Technological
Forecasting and Social Change, 40, 131150.
Wright, G., & Ayton, P. (1986). Subjective confdence in orecasts:
a response to Fischho and MacGregor. Journal o Forecasting,
5, 117123.