2011 Graefe

7/29/2019 2011 Graefe

1/13

International Journal o Forecasting 27 (2011) 183195

www.elsevier.com/locate/ijorecast

Comparing ace-to-ace meetings, nominal groups, Delphi andprediction markets on an estimation task

Andreas Graeea,, J. Scott Armstrong b

aInstitute or Technology Assessment and Systems Analysis, Karlsruhe Institute o Technology, Germanyb The Wharton School, University o Pennsylvania, Philadelphia, PA, United States

Abstract

We conducted laboratory experiments or analyzing the accuracy o three structured approaches (nominal groups, Delphi,

and prediction markets) relative to traditional ace-to-ace meetings (FTF). We recruited 227 participants (11 groups per method)

who were required to solve a quantitative judgment task that did not involve distributed knowledge. This task consisted o ten

actual questions, which required percentage estimates. While we did not fnd statistically signifcant dierences in accuracy

between the our methods overall, the results diered somewhat at the individual question level. Delphi was as accurate as FTF

or eight questions and outperormed FTF or two questions. By comparison, prediction markets did not outperorm FTF orany o the questions and were inerior or three questions. The relative perormances o nominal groups and FTF were mixed

and the dierences were small. We also compared the results rom the three structured approaches to prior individual estimates

and staticized groups. The three structured approaches were more accurate than participants prior individual estimates. Delphi

was also more accurate than staticized groups. Nominal groups and prediction markets provided little additional value relative

to a simple average o the orecasts. In addition, we examined participants perceptions o the group and the group process. The

participants rated personal communications more avorably than computer-mediated interactions. The group interactions in FTF

and nominal groups were perceived as being highly cooperative and eective. Prediction markets were rated least avourably:

prediction market participants were least satisfed with the group process and perceived their method as the most difcult.

c 2010 International Institute o Forecasters. Published by Elsevier B.V. All rights reserved.

Keywords: Forecast accuracy; Group decision-making; Method comparison; Satisaction; Combining

1. Introduction

In situations where a lack o appropriate or avail-able inormation precludes one rom using quanti-

Corresponding author.E-mail addresses: [email protected] (A. Graee),

[email protected] (J.S. Armstrong).

tative methods, it can be helpul to incorporate hu-

man judgment in order to improve the orecast. But

how does one get the best orecast when aggregating

inormation rom a group o people? While organi-

zations most commonly rely on unstructured ace-

to-ace meetings, it is difcult to fnd evidence to

support the use o this strategy (Armstrong, 2006).

0169-2070/$ - see ront matter c 2010 International Institute o Forecasters. Published by Elsevier B.V. All rights reserved.doi:10.1016/j.ijorecast.2010.05.004
http://dx.doi.org/10.1016/j.ijforecast.2010.05.004http://www.elsevier.com/locate/ijforecastmailto:[email protected]:[email protected]://dx.doi.org/10.1016/j.ijforecast.2010.05.004http://dx.doi.org/10.1016/j.ijforecast.2010.05.004mailto:[email protected]:[email protected]://www.elsevier.com/locate/ijforecasthttp://dx.doi.org/10.1016/j.ijforecast.2010.05.004

7/29/2019 2011 Graefe

2/13

184 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195

The literature suggests that structured approaches such

as nominal groups or Delphi provide more accurateorecasts than traditional meetings.

In recent years, there has been increasing inter-est in prediction markets as an approach or eliciting

inormation rom people, and a number o organiza-

tions have started to experiment with them. However,to date, we still do not know much about the peror-

mance o prediction markets. The available studies are

limited and are oten o a small scale. In particular,we do not know o any study that has analyzed their

perormance relative to either meetings or other struc-

tured group techniques or aggregating the knowledgein groups. Since the emergence o the feld, there has

been no meta-analysis that has analyzed the accuracy

o prediction markets.To address this defciency, we conducted laboratory

experiments which compared unstructured meetings,nominal groups, Delphi, and prediction markets. We

analyzed the relative accuracies o the our group

techniques on a quantitative judgment task andexamined the participants perceptions o their group

and their group process.

2. Group judgment tasks

We selected problems that did not involve highlydistributed inormation. This is not intended to suggest

that problems involving inormation which is highly

dispersed among the participants are unimportant, butmerely that we addressed a limited but typical

orecasting situation: general problems or which

people are unlikely to possess specifc knowledge. In

our study, the groups had to come up with estimates ora quantitative judgment task. This task consisted o ten

actual questions that required percentage estimates.

These questions had correct solutions, but all group

members could be expected to have some degree ouncertainty about the answers.

The fndings rom group perormance studies are

generally task-specifc. The critical question arisesas to whether or not the tasks used in experimental

settings are representative o real-world problems,

and thus allow or generalizations to dierent types o

tasks (such as orecasting problems) or participants.For example, Wright and Ayton (1986) provided

some evidence that the fndings rom calibration

studies which used general knowledge questions are

not applicable to judgmental orecasting problems,since the two task types involve dierent levelso uncertainty. Furthermore, it is oten argued that

student participants in laboratory experiments arelaymen, whereas the participants in real-worldorecasting scenarios are experts who are expected topossess superior knowledge.

In general, or a judgmental orecasting method

to perorm well, it should be able to aggregate therelevant inormation rom its members efciently.Thus, a basic precondition or the generalizabilityo problems used in experimental settings is the

involvement o participants in trying to solve them.Factual questions are commonly analyzed by groupdecision-making, as they have been shown to spark the

interest o student participants.

3. Group techniques

We analyzed our group techniques that dier inthe amount and structure o the interactions permittedbetween group members: ace-to-ace meetings,

nominal groups, the Delphi method, and predictionmarkets.

3.1. Unstructured ace-to-ace meetings

Unstructured ace-to-ace meetings (FTF) allowany orm o direct interaction between groupmembers. Although meetings are most common or

group decision-making in organizations, they havebeen shown to be subject to many biases anddrawbacks. For example, (1) it requires time and eortor a group to maintain itsel; (2) groups tend to

aim to reach speedy decisions and do not considerall problem dimensions, and thus tend to pursue alimited train o thought, which leads to a centraltendency eect or groupthink; (3) less confdent

group members, or people rom lower hierarchy levels,may keep silent about their reservations because oeither group pressures or conormity or impliedthreats o sanctions; (4) dominant personalities tend to

exert an excessive amount o inuence on the group.For a summary o these issues, see Van de Ven andDelbecq (1971).

In summary, Armstrong (2006) ound little evi-dence to support the use o meetings or orecastingor decision-making. In addition, meetings are expen-

sive both to schedule and to run. In some situations,

7/29/2019 2011 Graefe

3/13

A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 185

however, one must be concerned with not only the

quality o a decision but also its acceptability. Peo-ple generally enjoy human interactions and the sense

o working together, and meetings have oten beenshown to achieve high levels o satisaction (e.g. Boje& Murnighan, 1982; Van de Ven & Delbecq, 1974).

3.2. Nominal group technique

The nominal group technique (NGT), developedby Van de Ven and Delbecq (1971, 1974), tries to

avoid some o the drawbacks o traditional meetings

by adding a structured ormat to the direct interactions.This process is conducted in three steps. First, group

members work independently to generate individual

estimates or a problem. Then, the group enters anunstructured discussion to deliberate on the problem.

Finally, the group members again work independently

to provide their fnal individual estimates. The groupresult is the aggregated outcome o these fnal

individual estimates. The literature also reers to this

process as estimate-talk-estimate (Gustason, Shukla,Delbecq, & Walster, 1973).

The idea o NGT is that direct interactions during

the assessment or evaluation phase can have a positive

impact on estimation or problem solving. In particular,they can help group members to clariy and justiy

their points o view, which may help the group to make

more inormed decisions. Nonetheless, in the phaseso generating answers and making the fnal decisions,

NGT prevents direct interactions between group

members, in order to reduce the known drawbackso traditional meetings. See Van de Ven and Delbecq

(1971) or a summary o the advantages o nominal

groups.

3.3. The Delphi method

Delphi is a multiple-round survey in whichthe participants anonymously reveal their individualestimates and provide comments on a problem. Ater

each round, the individual estimates and comments

are summarized and reported to the participants aseedback. Taking this inormation into account, the

participants then provide their new estimate in the

ollowing round. The group result is the aggregatedoutcome o the individual estimates o the fnal round.

Unlike FTF and NGT, which require the physical

proximity o the group members, the participants

in Delphi are physically dispersed and do not meetin person. In general, Delphi unctions somewhat

similarly to NGT. The main dierence is that written

interactions are utilized throughout the process, inorder to avoid direct interactions between the group

members.The strengths o the method are seen in its

structured communication process, which enablesdiscussion and helps groups to achieve a consensus,

but limits the drawbacks associated with directinteractions. For reviews o the Delphi method, seeRowe and Wright (1999) and Woudenberg (1991).

3.4. Prediction markets

Prediction markets were popular in the late 1800s(Rhode & Strump, 2004), and have an impressive

track record in orecasting elections. In analyzing964 polls or the fve presidential elections rom1988 to 2004, Berg, Nelson, and Rietz (2008) ound

that the respective orecasts o the Iowa ElectronicMarkets were closer to the actual election results thanthe individual polls 74% o the time. However, this

advantage disappeared when the polls were averagedand damped (Erikson & Wlezien, 2008).

Prediction markets are gaining attention in variousfelds o orecasting. The idea is to set up a contract

whose payo depends on the outcome o an uncertainuture event. This contract, which can be interpretedas a bet on the outcome o the underlying uture event,

can then be traded by participants. As soon as theoutcome is known, the participants are paid o inexchange or the contracts they hold. Based on their

individual perormances, participants can win money.I one thinks that the current group estimate is toolow (high), one will buy (sell) stocks. Thus, through

the prospect o gaining money, the participants havean incentive to become active in the group process

whenever they expect the group estimate to beinaccurate. As in Delphi, the participants are mutuallyanonymous, and thus are not subject to the socialpressures connected with direct interactions. The

main dierence is that the participants exchangeinormation continuously through the price signal othe market, but do not share comments or reasons as

to why they buy or sell a contract. For an explanationo the concept o prediction markets and a useulsummary o the method, see Wolers and Zitzewitz

(2004).

7/29/2019 2011 Graefe

4/13


Prediction markets using play-money have alsoproven to be accurate. In comparing the results oplay-money and real-money markets or sport events,

Servan-Schreiber, Wolers, Pennock, and Galebach(2004) did not fnd any dierences in accuracy. Onthe other hand, Rosenbloom and Notz (2006) oundreal-money markets to be more accurate or non-sportsevents.

4. Related work

Structured group techniques are expected to limitthe biases associated with meetings, and prior researchseems to support this assumption. In their review,Rowe and Wright (1999) reported a superior accuracy

o Delphi over unstructured interaction by a scoreo fve studies to one. In excluding two papers romtheir analysis that did not analyze the accuracy orquantitative judgment tasks, we adjusted the score tothree studies to one, still with two ties. Similarly,in summarizing the literature, Woudenberg (1991)ound an albeit slight superiority o Delphiover unstructured interaction. Furthermore, again orquantitative judgment tasks, Gustason et al. (1973)ound NGT to be more accurate than FTF, althoughFischer (1981) ound no dierences. Evidence on therelative perormances o NGT, Delphi, and predictionmarkets or quantitative judgment tasks is scarce.While Gustason et al. (1973) showed that NGT wasmore accurate than Delphi, two studies ound nodierences (Boje & Murnighan, 1982; Fischer, 1981).In the case o prediction markets, we do not know oany work that has compared the method to FTF, NGTor Delphi.

Little work has been done on analyzing partici-pants perceptions o group processes, although thiscan be important or the acceptance o its results. I thegroup members or decision-makers eel dissatisfed

with the process, its outcome may not be adopted, eveni it is highly accurate. On the other hand, a processthat is highly satisying or participants may not nec-essarily lead to accurate results. In general, personalinteractions in groups can lead to either coherence, andthus a high level o perceived satisaction, or disagree-ment, resulting in rustrated group members. For quan-titative judgmental tasks, Boje and Murnighan (1982)analyzed the participants perceptions o Delphi, NGT,and people working alone. They ound that NGTwas rated most avorably in terms o eectiveness,

satisaction, and the reedom to participate, whereas

Delphi was rated only slightly superior to working

alone. Van de Ven and Delbecq (1974) compared par-

ticipants perceptions o nominal groups, Delphi, andunstructured meetings or an idea generation problem.

They ound that NGT participants expressed the high-

est satisaction with the process, whereas the dier-ences between the Delphi method and meetings were

small.

For prediction markets, we do not know o anystudy that has analyzed the way in which people

perceive participation and how this might aect the

acceptability o market results. However, practical

experience indicates that people have problems

in understanding how prediction markets work.

In particular, they have problems in translatinginormation into market prices (Green, Armstrong,

& Graee, 2007). In addition, people appear to have

difculties in understanding what the prices mean. Forexample, during the 2008 fnancial crisis, intrade.com

launched a contract or predicting whether or not

the US government would pass a bail out. At one

time, the contract was traded at a price o $80,

which meant that the market orecasted an 80%probability that the bail out would go through.

A sta writer at CNNMoney.com who would

be expected to possess a certain understanding ofnancial instruments interpreted the market prices

wrongly, stating that 80% o the participants thought

that the event would come true (CNNMoney.com:

Oddsmakers see bailout OK soon, September 24,

2008). A misunderstanding o how prediction markets

work might be critical to their acceptance by both

participants and decision-makers. I people do notunderstand how to reveal inormation, the way in

which inormation is aggregated into market prices, or

how to interpret market prices, it is doubtul whether

they will have a avorable perception o the method,and thus have confdence in its results.

5. Hypotheses

Given the lack o evidence, we had no directionalhypotheses on the relative accuracy o the three

structured approaches (NGT, Delphi, and prediction

markets), since arguments could be ound in either

direction. For example, prediction markets might be

advantageous to Delphi because traders can update

7/29/2019 2011 Graefe

5/13


their estimates continuously and are not tied to alimited number o rounds. On the other hand, themarket participants do not reveal the reasons or their

estimates. Thus, other group members might not beable to relate to why the group estimate has changed.

Accordingly, the possibility o argumentation andreasoning might avor NGT and Delphi. For accuracy,we expected: NGT Delphi prediction markets >

FTF.We sent the study design to experts in the feld

and asked them to provide their prior estimates o

the relative accuracies o the methods. To ensurethat the experts were amiliar with all our methods,we primarily reached out to people with experiencein prediction markets. Eight experts1 responded and

we derived the ollowing ranking rom their priors:Delphi > prediction markets > NGT > FTF. Inexpecting that all three structured approaches wouldoutperorm FTF, the expert priors were consistentwith our hypotheses. In particular, none o the experts

expected FTF to do better than any o the structuredapproaches.

To obtain a measure o the acceptability o themethods, we asked or the participants perceptionso the group and the group process along seven at-

tributes. In particular, participants rated (1) coopera-

tion and (2) disagreement among group members, andrevealed their (3) confdence in the group results. Theyalso provided ratings o (4) how ree they elt to par-ticipate, (5) whether their time was well spent, (6)how difcult it was to participate, (7) how satisfedthey were with the group process, and (8) how eec-tive they thought the group process was or solvingthe estimation problem. Since our group interactionprocesses simulated non-hierarchical meetings, we ex-

pected the group pressures to be lessened. Assumingthat people generally enjoy human interaction and thesense o working together, we hypothesized that the

personal interactions in FTF and NGT will lead tomore avorable ratings compared to Delphi and pre-diction markets. Furthermore, due to the lower pres-sures or conormity in NGT, we expected NGT to berated more avorably than FTF, as has been shown pre-viously by Van de Ven and Delbecq (1974). In addi-

tion, since we expected the majority o traders to be

1 Thanks to Yiling Chen, Kesten Green, Robin Hanson, SteanLuckner, Gene Rowe, Martin Spann, Gerrit Van Bruggen and EricZitzewitz.

unamiliar with the process o revealing ones opin-

ions by trading stocks, we expected prediction mar-

kets to be rated least avorably, in particular in terms o

difculty. In summary, or acceptability, we expected:NGT > FTF > Delphi > prediction markets.

6. Methods

6.1. Participants

The participants were students (n = 227) at the

University o Pennsylvania who were recruited by

the Wharton Behavioral Lab. They were randomly

assigned to 44 heterogeneous groups (11 per method),

and the 11 prediction market groups were urtherdivided into two treatments. In 5 groups, the traders

started trading immediately, while in 6 groups,

the traders generated their individual estimates

independently beore participating in the market.

The group size was determined by the number

o students showing up in each session. Most o

the groups (42 o 44) consisted o 4 to 6 subjects,

though one NGT group contained 3 subjects and

one prediction markets group consisted o 7 subjects.

Appendix 1 provides an overview o the group size

and (average) number o participants per method.

All appendices can be ound in a supporting onlinedocument at http://www.orecasters.org/ij.

All sessions lasted or one hour. Each participant

was paid a $10 show-up ee. In addition, the

participants were remunerated depending on their

group or individual perormance. On average, $15

was distributed among each group. In FTF, NGT,

and Delphi, $50 was distributed equally among the

participants in the two most accurate groups, while the

next two most accurate groups received $25, and the

next $15. To match the traditional pay-o mechanism

o prediction markets, the traders were paid basedon their individual perormances, i.e. $6 or the best

perorming trader in each group, $5 or the second

best, and $4 or the third best.

6.2. Materials and procedures

The data were collected in the Wharton Behavioral

Lab. FTF and NGT were conducted in a small

meeting room, while the Delphi and prediction

market groups worked in a computer room on
http://www.forecasters.org/ijfhttp://www.forecasters.org/ijfhttp://www.forecasters.org/ijf

7/29/2019 2011 Graefe

6/13


individual workstations, divided by partition walls

to prevent personal communication. Each participant

received general instructions explaining the relevant

group technique, as well as the respective pay-omechanism. As the process started, orms were handed

out to each participant or making personal notes

and analyses. During the whole process, whenever

individuals or groups were asked to give estimates,

they had to state their confdence in these estimates on

a seven point numerical scale (1: not at all confdent;

7: extremely confdent). Ater the group interaction,

all participants rom all group techniques received a

post-task orm on which to reveal their fnal individual

estimates. In addition, the post-task orm was used to

obtain the participants perceptions o the group and o

the group process as a whole, again on a seven pointnumerical scale. Examples o the materials used can

be ound in Appendices 3 to 6.

6.2.1. Unstructured ace-to-ace meetings

The group members were seated around a table to

discuss the problem, with the goal o reaching a group

estimate. Each group received one group questionnaire

or their fnal group estimates. In addition, each group

member received an individual questionnaire on which

to note the group estimates achieved.

6.2.2. Nominal groups

First, the group members were spread over three

tables to work independently to provide individual

estimates or each o the ten questions on a paper

questionnaire. Ater each participant had completed

the questionnaire, the group members were seated

around one table to discuss their individual estimates

with the group. Each group received one group

questionnaire or their fnal group estimates. In

addition, each group member received an individual

questionnaire on which to note the group estimatesachieved. Finally, the group members were again

spread out over three tables to provide their fnal

individual estimates on the post-task questionnaire.

The group results were calculated as the median o the

fnal individual estimates.

6.2.3. Delphi

Beore logging on to the system, the participants

watched a video tutorial (http://tinyurl.com/method-

comparison-Delphi) o how to use the Delphi sot-

ware. In order to conduct our research in a realis-

tic environment that can be adopted by practitioners,

we used the ree Delphi sotware available at For-

Prin.com. This sotware was developed in order to ed-ucate the users about Delphi and to aid them in the

use o the standard Delphi procedure by including its

our key eatures: anonymity o participants, iteration,

controlled eedback, and statistical aggregation o in-

dividual estimates. In each round, the sotware also

required the participants to provide their individual nu-

merical estimates, together with 10% lower and upper

confdence bounds. Furthermore, the participants had

the option to provide comments. Ater everyone had

responded, the results o the frst round were summa-

rized and reported as eedback to the participants; this

included each individual estimate (along with the con-fdence interval), a group summary o these estimates

(mean, median, and standard deviation), and each indi-

viduals comments per question. The participants then

provided their fnal individual estimates in a second

round. For our analysis, we used the median o the in-

dividual estimates rom the second round to calculate

the group results.

6.2.4. Prediction markets

We used the (partly) ree sotware solution o

inklingmarkets.com, which was specifcally designedto make participation as easy as possible or

inexperienced traders. In addition, since our markets

had only a small number o participants, it was

necessary to rely on a system using a market maker,

in order to ensure a sufcient liquidity o the market.

With Inklings market maker, the participants can

buy and sell contracts, and thus reveal inormation,

at any time, as they interact with the market maker

directly, and do not require a human trading partner.

Thus, the market can elicit inormation even rom

a single participant. For a urther discussion o thesotware, see Christiansen (2007), who conducted

feld experiments using Inkling.

Each question was mapped as one contract, which

could be traded by participants. At market closure,

each contract was liquidated at the value o the

correct answer or each question. Thus, i one

thought that the correct answer was higher than

the contracts current value, one would have bought

shares, otherwise one would have sold. The procedure

is explained in detail in a video tutorial, which
http://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphi

7/29/2019 2011 Graefe

7/13


was watched by each participant beore logging on

to the system (http://tinyurl.com/method-comparison-PM). Each participant had an initial deposit value o

$30,000 o play-money. At the end o the session, thelast traded contract prices were interpreted as the fnalgroup results.

6.3. Quantitative judgment task

The questions were inuenced by two limitationso the prediction market sotware. First, although

prediction markets can theoretically be used to predict

large numbers, it would have been necessary to scalethe index markets, which might have made it harder

or the participants to understand. Second, or market

maker mechanisms it is necessary to provide a startingprice, which should reasonably be chosen to be

greater than zero. However, announcing starting prices

provides traders with an anchor.To circumvent these problems, the quantitative

judgment task consisted o ten almanac questions that

required percentage estimates. Thus, no scaling was

necessary, since the market prices would always bebetween 0 and 100. Furthermore, all questions were

designed to be similar by providing an anchor in

the ormulation o the question, which was used asthe starting price in the markets. Examples include:

In 1900, the percentage o the total US population

that was aged 65 and over was 4.1%. What was thepercentage in 2000? or In 1980, 17.0% o the US

population (25 years and over) had completed 4 years

o college or more. What was the percentage in 2006?Appendix 2 provides a complete list o the questions

used in this study.It was emphasized in the written instructions and

by the incentive mechanism that the purpose o thetask was to estimate the answers to the questions. The

answers were not revealed until ater the completion

o the study.

7. Results

We used the absolute error as an index o

accuracy. Tests o statistical signifcance are reportedin ootnotes.2 Signifcance levels are indicated in the

2 Since the data were not normally distributed in most cases, weconducted non-parametric signifcance tests.

tables using asterisks ( : p 0.05; : p 0.01).The data that were analyzed in this study can be oundonline at http://tinyurlcom/method-comparison-data.

For NGT, Delphi, and six o the prediction marketgroups, we had access to the participants ex anteestimates prior to them entering the group interaction.We used both these individual priors and the mean othese estimates rom the participants in each groupas additional benchmarks or our analysis. In theollowing, the mean prior estimates or each group arereerred to as staticized groups.

7.1. Group accuracy beore group interaction

We analyzed the accuracy o the staticized groupsbeore group interaction took place. The goal othis exercise was to rule out the possibility that anydierences between the methods might have occurredbecause some groups simply happened to be betterthan others. For each question, as well as over all tenquestions, we did not fnd any dierences in accuracybetween the participants priors in NGT, Delphi, andprediction markets.3

7.2. Accuracy o structured group interaction vs.

individual priors

Table 1 shows the results, reported as the errorreduction or each question as well as over allten questions. The error reduction is the dierencebetween the absolute errors o the prior individualestimates and the respective method result. A positive(negative) value or the error reduction means thatthe method results were more (less) accurate than theindividual priors. As expected, overall, the results oeach method were more accurate than the individualpriors.4 At the individual question level, predictionmarkets were more accurate than the individualpriors or our questions and equally accurate or

six questions. NGT was more accurate or eightquestions and equally accurate or two questions.Delphi was most accurate, yielding a lower MAE thanthe respective individual priors or each o the tenquestions.

3 The Kruskal-Wallis test was used to test or dierences betweenthe absolute errors o the prior individual estimates o the threestructured approaches.

4 The Wilcoxon rank-sum test was used to test or dierencesbetween the absolute errors o the individual priors and the methodresults. Table 1 reports one-tailed p-values.
http://tinyurl.com/method-comparison-PMhttp://tinyurl.com/method-comparison-PMhttp://tinyurl.com/method-comparison-PMhttp://tinyurlcom/method-comparison-datahttp://tinyurlcom/method-comparison-datahttp://tinyurlcom/method-comparison-datahttp://tinyurl.com/method-comparison-PMhttp://tinyurl.com/method-comparison-PMhttp://tinyurl.com/method-comparison-PM

7/29/2019 2011 Graefe

8/13


Table 1

Mean error reduction o each method compared to prior individual estimates beore group interaction.

Question NGT Delphi PM

(N = 54) (N = 59) (N = 34)Population aged 65+ 1.33* 3.15** 0.88

Internet users 2.69* 3.50** 1.84

Health expenditures 3.60** 2.99** 0.85

College graduates 4.12* 4.95** 6.51**

Population o Australia 2.42** 1.48** 2.67**

Overweight children 1.77 0.79** 0.79

Children with two parents 2.57** 4.51** 4.06**

Households with PC 0.78 5.08** 3.09

Motor vehicle production 9.82** 5.29** 5.43

Population aged 5 1.86** 1.24** 4.42**

Overall 3.10** 3.60** 3.05**

N: number o prior individual estimates. p 0.05.

p 0.01.

7.3. Accuracy o group interaction vs. staticized

groups

Group interactions should yield more accurateresults than can be obtained by simply averaging

the prior individual estimates within each group.

We compared the methods results to those o thecorresponding staticized groups. The results, again

reported as the error reduction, are shown in Table 2.5On average, the NGT results were signifcantly

more accurate than the staticized groups or three out

o ten questions (i.e., population o Australia, motorvehicle production, and population aged 5). For

the question households with PC, the NGT results

were signifcantly less accurate. For the remainingsix questions, there were no statistically signifcant

dierences in accuracy.The Delphi method yielded signifcantly more

accurate results than the staticized groups or three

out o the ten questions. For the remaining sevenquestions, the dierences were small.Prediction markets improved on the staticized

groups or two questions. For the remaining questions,

there were no statistically signifcant dierences,which might be due to the small number o

observations. However, although the dierences were

5 The Wilcoxon rank-sum test was used to test or dierencesbetween the absolute errors o the staticized groups and the methodresults. Table 2 reports one-tailed p-values.

small, prediction markets were less accurate than thestaticized groups or six o the ten questions.

Over all ten questions, group interaction improvedon the staticized groups. However, that being said, thegains in accuracy were modest and the sample sizeswere not large, and thus there is some uncertainty

about these fndings. The dierences were statisticallysignifcant only with respect to Delphi.

7.4. Relative accuracy o group methods

For each method, we calculated the MAEs oreach question as well as over all ten questions.With the lowest overall MAE o 5.62, Delphi wasthe most accurate, ollowed by NGT (5.92) and

prediction markets (7.07). As expected, FTF wasleast accurate; its overall MAE was 7.21. Althoughthe results conormed to our expectations that thestructured approaches would outperorm FTF, theoverall dierences between the methods were not

statistically signifcant.6 At the question level, theevidence on the relative perormances o the methodswas mixed. Table 3 shows the error reduction o NGT,Delphi, and prediction markets compared to FTF.

The dierences between FTF and NGT weresmall.7 For eight o the ten questions, there were no

6 The Kruskal-Wallis test showed no statistically signifcantdierences between the our methods over all 10 questions.

7 We conducted Mann-Whitney U-tests to compare the results oFTF and the respective structured approaches or each question.

7/29/2019 2011 Graefe

9/13


Table 2

Mean error reduction o each method compared to staticized groups.

Question NGT Delphi PM

(N = 11) (N = 11) (N = 6)Population aged 65+ 0.87 0.93 1.7

Internet users 0.89 1.59 1.13

Health expenditures 0.71 0.09 3.22

College graduates 0.3 1.52 3.25

Population o Australia 2.13** 1.45** 2.47*

Overweight children 2.77 1.02 3.52

Children with two parents 1.69 1.39 1.37

Households with PC 4.96** 2.09 1.37

Motor vehicle production 8.79** 3.58* 3.95

Population aged 5 1.58** 0.73* 3.77*

Overall 0.41 0.54* 0.17

N: number o groups. p 0.05. p 0.01.

Table 3

Error reduction o NGT, Delphi, and prediction markets compared to FTF.

Question MAE Error reduction to FTF

FTF NGT Delphi PM

Population aged 65+ 7.76 2.69 4.29** 1.22

Internet users 4.53 0.53 1.33 3.36

Health expenditures 3.86 2.19 1.38 0.16

College graduates 6.36 0.27 0.26 1.87

Population o Australia 2.41 0.09 1.2 2.35*

Overweight children 13.98 7.93* 9.74** 5.52

Children with two parents 7.19 0.8 2.77 0.98

Households with PC 15 0.82 7.21 6.5

Motor vehicle production 10.12 0.22 5.49 7.92*

Population aged 5 0.86 1.00* 1.21 1.26**

Overall 7.21 1.29 1.59 0.14

p 0.05. p 0.01.

signifcant dierences between the two methods. For

the two remaining questions, NGT was more accurate

than FTF once and less accurate once.

Delphi was signifcantly more accurate than FTF

or two questions. For the remaining eight questions,

there were no dierences in accuracy. In no case was

Delphi signifcantly less accurate than FTF.

Contrary to our expectations, prediction markets

were unable to improve on the accuracy o FTF.

Instead, or three questions, prediction markets

yielded results that were signifcantly less accurate

than the respective FTF results. For the remaining

questions, there were no statistically signifcantdierences.

7.5. Participants perceptions

Ater the group interaction, we asked the partic-ipants about their perceptions o the group, as wellas o the group process. We had hypothesized thatthe personal interactions between participants wouldlead to more avorable ratings or FTF and NGTrelative to Delphi and prediction markets. Further-

more, we had assumed that, due to lower group pres-sures, NGT would be rated more avorably than FTF.

7/29/2019 2011 Graefe

10/13


Fig. 1. Participants mean ratings o each method (on a 7-point scale).

Finally, because o the participants lack o amiliaritywith the approach, we expected prediction markets to

be rated most unavorably on most attributes. The re-

sults were consistent with these hypotheses. For eachattribute, Fig. 1 reports the statistical mean (and 95%

confdence intervals) o the participants ratings, re-vealed on a 7-point numerical scale rom 1 (very

low) to 7 (very high). Note that or the attributesdisagreementand difculty, higher ratings were inter-

preted as being negative.

Ratings o the group

NGT obtained the most avorable ratings or coop-eration and disagreement and ranked second or par-

ticipants confdence in the group answers. Predictionmarkets ranked second worst or disagreement andconfdence and worst or cooperation. For both coop-

eration and disagreement, FTF ranked second.

Although Delphi obtained the worst score ordisagreement, the participants were most confdentin the outcomes. We see the ollowing reasons or

the high level o perceived disagreement in Delphi:(1) Delphi participants cannot communicate with each

other directly, but reveal their estimates independently;and (2) agreement can only increase i incorporating

eedback inormation ater the frst round leads to

more cohesive results at the end o the process.However, disclosing disagreement can be desirable, as

it alerts decision-makers to uncertainty.

FTF obtained the lowest score in terms oconfdence in the group outcome. This is surprising,

since we would have expected that high levels o

perceived cooperation and agreement would result inhigh levels o confdence in the group results. We

assume that the FTF participants were aware that, dueto group tendency eects, probably not all available

inormation rom group members was aggregated.

Ratings o the group process

The rankings o the group process as a whole were

identical over 4 out o 5 attributes, with NGT placedfrst, FTF second, Delphi third and prediction markets

ourth. For difculty to participate, FTF achieved

an equally avorable score as NGT. For reedom toparticipate, FTF and Delphi swapped places. Given

the avorable ratings or FTF on most other attributes,it is interesting that Delphi achieved a higher score

than FTF or reedom to participate. Again, the reasonmight be that the FTF participants experienced group

pressures that may have hindered them rom ully

revealing their inormation, which fts with the lowconfdence scores or FTF. Overall, the participants

perceptions o the prediction markets process werepoor or all attributes.

In summary, the results conormed to our expecta-

tions. NGT obtained the most avorable ratings or 7 othe 8 categories. As hypothesized, FTF was also rated

quite avorably, being ranked second or six attributes.

7/29/2019 2011 Graefe

11/13


Finally, the results corroborated our hypothesis that

prediction markets would be rated most unavorably:they had the worst ratings or six attributes and were

second worst or the two remaining attributes.

8. Discussion

Although the three structured approaches yielded

lower MAEs than FTF over all ten questions, the di-erences in accuracy between the our methods were

not statistically signifcant. The results conormed to

our hypotheses only with respect to the direction o

the results.However, some dierences between the methods

were identifed at the question level. Delphi was theonly approach that was never less accurate than FTF,

and outperormed FTF or two o the ten questions.By comparison, prediction markets were less accurate

than FTF or three o the ten questions and were

unable to outperorm FTF or any o the questions.

The relative perormances o NGT and FTF weremixed and the dierences were small. We can only

speculate on the reasons or these results.The participants in this study had to solve a

quantitative judgment task that required percentageestimates o the answers to actual questions. Such

tasks have clear solutions, lack highly distributedknowledge and uncertainty, and require groups only

to aggregate actual knowledge. We would expect thatstructured approaches would be substantially more

eective when participants reveal inormation that is

new and useul to others.For example, or site location problems, dierent

experts might have distinct knowledge about labor

and real estate costs, trafc patterns, and the

types o consumers that live near a potential retail

outlet. Other tasks, such as orecasting, decision-

making, negotiation, or conict solving, are morechallenging, as they involve not only acts, but also

values, emotions, and expectations, as well as highlevels o uncertainty. We would expect the relative

perormances o methods to be largely inuenced by

such problems (Wright & Ayton, 1986). In particular,we would expect group pressures to be much higher,

and thereore FTF to generally perorm worse.In contrast, when all o the experts draw upon

the same inormation, which might have been the

case or the type o problem used in this study,

inormation exchange is expected to be o little value.

Over the ten questions, only Delphi yielded more

accurate results (in terms o statistical signifcance)

than staticized groups. By comparison, the dierencesbetween prediction markets or NGT and staticized

groups were not statistically signifcant.

Under most circumstances, structured combined

orecasts are substantially more accurate than in-

dividual orecasts (Armstrong, 2001). Our results

provide urther support or this well-established prin-

ciple. Each o the three structured approaches yielded

lower mean absolute errors than the prior individual

estimates.

Another actor that might have inuenced the

results is the experiment design. The experiment

setting avored meetings, since it did not involve

hierarchies. In a more realistic environment with

hierarchies, we would expect group pressures to be

more o a concern, and as a result, the relative

perormance o FTF would decrease.

The comparatively poor perormance o prediction

markets may have been due to the limited means

o communication between participants. Although the

participants could exchange inormation continuously

through trading, they were unable to provide reasoning

or their estimates. They could only buy and sell

contracts, without indicating why they did so. Thus,the trading mechanism might not have allowed the

participants to convey inormation to the group.

However, prediction markets could be designed to

overcome such barriers. For example, additional

means o communication, such as orums or chat

rooms, could be implemented to allow traders

to communicate with each other and to reveal

the reasoning or their actions. Although several

commercial prediction markets have implemented

such solutions, we are not aware o any research that

has examined whether this has an eect on accuracy.Examining the participants perceptions o the

group and o the group process as a whole revealed

a preerence or methods involving personal commu-

nication. These results supported our hypothesis that

personal interactions in non-hierarchical meetings can

lead to coherence, and thereore increase the perceived

satisaction o the participants. In contrast, Delphi and

prediction markets were rated less avorably. In par-

ticular, prediction markets were rated worst on six

attributes and second worst on the remaining two.

7/29/2019 2011 Graefe

12/13


We see at least two reasons or this. First, the lack oexchanging reasons or their judgments may alienate

traders. We would expect comort with the method to

rise with increasing insight into the opinions o el-low traders. Second, prediction markets are still a rel-atively new approach, and the idea o revealing ones

opinion through trading stocks may be difcult or par-ticipants to understand. Even though we relied on sot-

ware that was specifcally designed to make participa-tion as easy as possible or novices, participation in

prediction markets was rated as the most difcult byar. We assume that the high perceived difculty was

crucial to the poor ratings on other categories, whichin turn may hinder the urther adoption o prediction

markets by practitioners. Researchers and developers

should urther increase their eorts to make marketsotware solutions more accessible to inexperiencedtraders.

Participants perceptions o the group process canbe crucial to the acceptance o its results. When the

agreement among and satisaction o groupmembers is high, decision-makers can share the

responsibility or their decisions. Thus, our resultsmay entice decision-makers to rely on methods

involving personal communication. However, such astrategy has its drawbacks.

1. There is no evidence that high perceived satisac-tion is correlated to good perormance.

2. Taking into account administrative and time eorts,

meetings are expensive and can be difcult toconduct, as they require the presence o the

participants.3. Meetings are limited in the number o participants

involved. In contrast, Delphi and predictionmarkets do not require the presence o the

participants and can be conducted asymmetricallywith a virtually unlimited number o people.

4. Meetings can lead to poor decisions. Due to highergroup pressures in real world situations, we would

expect meetings to perorm worse or more com-plex problems and more realistic environments,

especially when the relevant inormation is dis-tributed among the experts.

9. Conclusion

We compared the relative accuracy o traditional

FTF to that o three structured approaches (NGT,

Delphi, and prediction markets) on a quantitative

judgment task that required percentage estimates orten actual questions.

Over the ten questions, we did not fnd any sta-tistically signifcant dierences in accuracy between

the our methods. However, the method results di-

ered somewhat at the individual question level. Del-phi was as accurate as FTF or eight questions and

outperormed FTF or two questions. By comparison,

prediction markets were unable to outperorm FTFor any o the questions and were inerior or three

questions. The relative perormances o NGT and

FTF were mixed and the dierences were generallysmall.

The three structured approaches were more accu-

rate than the participants prior individual estimates.Delphi was also more accurate than staticized groups,

in contrast to NGT and prediction markets. The re-sults suggest that NGT and prediction markets pro-

vide little additional value over simple averages o

orecasts in situations where the participants are alldrawing upon similar inormation. Future research

should evaluate the relative perormances o the meth-

ods or more complex problems in more realisticenvironments.

We also analyzed the participants perceptions o

the group and o the group process as a whole.The participants rated methods involving personal

communication (FTF and NGT) more avorablythan the computer-mediated Delphi and prediction

markets. In particular, the FTF and NGT participants

experienced higher levels o cooperation amongtheir groups and perceived the group interaction as

more eective. Prediction markets were rated least

avorably: the prediction market participants wereleast satisfed with the group process and rated their

method highest in terms o difculty o participation.

Acknowledgements

We would like to thank Kesten Green, RobinHanson, and David Pennock or helpul comments.

Thanks to the sta o the Wharton Behavioral

Lab, and in particular Daniela Lejtneker, Tatiana

Silva, and Patricia Zapater-roig. Ellen Rosenbergand Christian Menn helped with preparing materials.

Thanks to inklingmarkets.com or providing the

prediction markets sotware.

7/29/2019 2011 Graefe

13/13


References

Armstrong, J. S. (2001). Combining orecasts. In J. S. Armstrong

(Ed.), Principles o orecastinga handbook or researchers

and practitioners (pp. 417439). Boston, MA: Kluwer

Academic Publishers.

Armstrong, J. S. (2006). Howto make betterorecasts and decisions:

avoid ace-to-ace meetings. ForesightThe International

Journal o Applied Forecasting, 5, 38.

Berg, J. E., Nelson, F. D., & Rietz, T. A. (2008). Prediction market

accuracy in the long run. International Journal o Forecasting,

24, 285300.

Boje, D. M., & Murnighan, J. K. (1982). Group confdence

pressures in iterative decisions. Management Science, 28,

11871196.

Christiansen, J. D. (2007). Prediction markets: practical experiments

in small markets and behaviours observed. Journal o Prediction

Markets, 1, 1741.Erikson, R. S., & Wlezien, C. (2008). Are political markets

really superior to polls as election predictors? Public Opinion

Quarterly, 72, 190215.

Fischer, G. W. (1981). When oracles aila comparison o

our procedures or aggregating subjective probability ore-

casts. Organizational Behavior and Human Perormance, 28,

96110.

Green, K. C., Armstrong, J. S., & Graee, A. (2007). Methods

to elicit orecasts rom groups: Delphi and prediction markets

compared. ForesightThe International Journal o Applied

Forecasting, 8, 1720.

Gustason, D. H., Shukla, R. K., Delbecq, A., & Walster, G. W.

(1973). A comparative study o dierences in subjective

likelihood estimates made by individuals, interacting groups,

Delphi groups, and nominal groups. Organizational Behavior

and Human Perormance, 9, 280291.Rhode, P. W., & Strump, K. S. (2004). Historical presidential

betting markets. Journal o Economic Perspectives, 18,

127141.

Rosenbloom, E. S., & Notz, W. (2006). Statistical tests o

real-money versus play-money prediction markets. Electronic

Markets, 16, 6369.

Rowe, G., & Wright, G. (1999). The Delphi technique as a

orecasting tool: issues and analysis. International Journal o

Forecasting, 15, 353375.

Servan-Schreiber, E., Wolers, J., Pennock, D. M., & Galebach, B.

(2004). Prediction markets: does money matter? Electronic

Markets, 14, 243251.

Van de Ven, A. H., & Delbecq, A. L. (1971). Nominal versus

interacting group processes or committee decision making

eectiveness. Academy o Management Journal, 14, 203212.

Van de Ven, A. H., & Delbecq, A. L. (1974). The eectiveness

o nominal, Delphi, and interacting group decision making

processes. Academy o Management Journal, 17, 605621.

Wolers, J., & Zitzewitz, E. (2004). Prediction markets. Journal o

Economic Perspectives, 18, 107126.

Woudenberg, F. (1991). An evaluation o Delphi. Technological

Forecasting and Social Change, 40, 131150.

Wright, G., & Ayton, P. (1986). Subjective confdence in orecasts:

a response to Fischho and MacGregor. Journal o Forecasting,

5, 117123.

Documents

2011 Graefe