2011 Graefe

  • Upload
    pkj009

  • View
    213

  • Download
    0

Embed Size (px)

Citation preview

  • 7/29/2019 2011 Graefe

    1/13

    International Journal o Forecasting 27 (2011) 183195

    www.elsevier.com/locate/ijorecast

    Comparing ace-to-ace meetings, nominal groups, Delphi andprediction markets on an estimation task

    Andreas Graeea,, J. Scott Armstrong b

    aInstitute or Technology Assessment and Systems Analysis, Karlsruhe Institute o Technology, Germanyb The Wharton School, University o Pennsylvania, Philadelphia, PA, United States

    Abstract

    We conducted laboratory experiments or analyzing the accuracy o three structured approaches (nominal groups, Delphi,

    and prediction markets) relative to traditional ace-to-ace meetings (FTF). We recruited 227 participants (11 groups per method)

    who were required to solve a quantitative judgment task that did not involve distributed knowledge. This task consisted o ten

    actual questions, which required percentage estimates. While we did not fnd statistically signifcant dierences in accuracy

    between the our methods overall, the results diered somewhat at the individual question level. Delphi was as accurate as FTF

    or eight questions and outperormed FTF or two questions. By comparison, prediction markets did not outperorm FTF orany o the questions and were inerior or three questions. The relative perormances o nominal groups and FTF were mixed

    and the dierences were small. We also compared the results rom the three structured approaches to prior individual estimates

    and staticized groups. The three structured approaches were more accurate than participants prior individual estimates. Delphi

    was also more accurate than staticized groups. Nominal groups and prediction markets provided little additional value relative

    to a simple average o the orecasts. In addition, we examined participants perceptions o the group and the group process. The

    participants rated personal communications more avorably than computer-mediated interactions. The group interactions in FTF

    and nominal groups were perceived as being highly cooperative and eective. Prediction markets were rated least avourably:

    prediction market participants were least satisfed with the group process and perceived their method as the most difcult.

    c 2010 International Institute o Forecasters. Published by Elsevier B.V. All rights reserved.

    Keywords: Forecast accuracy; Group decision-making; Method comparison; Satisaction; Combining

    1. Introduction

    In situations where a lack o appropriate or avail-able inormation precludes one rom using quanti-

    Corresponding author.E-mail addresses: [email protected] (A. Graee),

    [email protected] (J.S. Armstrong).

    tative methods, it can be helpul to incorporate hu-

    man judgment in order to improve the orecast. But

    how does one get the best orecast when aggregating

    inormation rom a group o people? While organi-

    zations most commonly rely on unstructured ace-

    to-ace meetings, it is difcult to fnd evidence to

    support the use o this strategy (Armstrong, 2006).

    0169-2070/$ - see ront matter c 2010 International Institute o Forecasters. Published by Elsevier B.V. All rights reserved.doi:10.1016/j.ijorecast.2010.05.004

    http://dx.doi.org/10.1016/j.ijforecast.2010.05.004http://www.elsevier.com/locate/ijforecastmailto:[email protected]:[email protected]://dx.doi.org/10.1016/j.ijforecast.2010.05.004http://dx.doi.org/10.1016/j.ijforecast.2010.05.004mailto:[email protected]:[email protected]://www.elsevier.com/locate/ijforecasthttp://dx.doi.org/10.1016/j.ijforecast.2010.05.004
  • 7/29/2019 2011 Graefe

    2/13

    184 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195

    The literature suggests that structured approaches such

    as nominal groups or Delphi provide more accurateorecasts than traditional meetings.

    In recent years, there has been increasing inter-est in prediction markets as an approach or eliciting

    inormation rom people, and a number o organiza-

    tions have started to experiment with them. However,to date, we still do not know much about the peror-

    mance o prediction markets. The available studies are

    limited and are oten o a small scale. In particular,we do not know o any study that has analyzed their

    perormance relative to either meetings or other struc-

    tured group techniques or aggregating the knowledgein groups. Since the emergence o the feld, there has

    been no meta-analysis that has analyzed the accuracy

    o prediction markets.To address this defciency, we conducted laboratory

    experiments which compared unstructured meetings,nominal groups, Delphi, and prediction markets. We

    analyzed the relative accuracies o the our group

    techniques on a quantitative judgment task andexamined the participants perceptions o their group

    and their group process.

    2. Group judgment tasks

    We selected problems that did not involve highlydistributed inormation. This is not intended to suggest

    that problems involving inormation which is highly

    dispersed among the participants are unimportant, butmerely that we addressed a limited but typical

    orecasting situation: general problems or which

    people are unlikely to possess specifc knowledge. In

    our study, the groups had to come up with estimates ora quantitative judgment task. This task consisted o ten

    actual questions that required percentage estimates.

    These questions had correct solutions, but all group

    members could be expected to have some degree ouncertainty about the answers.

    The fndings rom group perormance studies are

    generally task-specifc. The critical question arisesas to whether or not the tasks used in experimental

    settings are representative o real-world problems,

    and thus allow or generalizations to dierent types o

    tasks (such as orecasting problems) or participants.For example, Wright and Ayton (1986) provided

    some evidence that the fndings rom calibration

    studies which used general knowledge questions are

    not applicable to judgmental orecasting problems,since the two task types involve dierent levelso uncertainty. Furthermore, it is oten argued that

    student participants in laboratory experiments arelaymen, whereas the participants in real-worldorecasting scenarios are experts who are expected topossess superior knowledge.

    In general, or a judgmental orecasting method

    to perorm well, it should be able to aggregate therelevant inormation rom its members efciently.Thus, a basic precondition or the generalizabilityo problems used in experimental settings is the

    involvement o participants in trying to solve them.Factual questions are commonly analyzed by groupdecision-making, as they have been shown to spark the

    interest o student participants.

    3. Group techniques

    We analyzed our group techniques that dier inthe amount and structure o the interactions permittedbetween group members: ace-to-ace meetings,

    nominal groups, the Delphi method, and predictionmarkets.

    3.1. Unstructured ace-to-ace meetings

    Unstructured ace-to-ace meetings (FTF) allowany orm o direct interaction between groupmembers. Although meetings are most common or

    group decision-making in organizations, they havebeen shown to be subject to many biases anddrawbacks. For example, (1) it requires time and eortor a group to maintain itsel; (2) groups tend to

    aim to reach speedy decisions and do not considerall problem dimensions, and thus tend to pursue alimited train o thought, which leads to a centraltendency eect or groupthink; (3) less confdent

    group members, or people rom lower hierarchy levels,may keep silent about their reservations because oeither group pressures or conormity or impliedthreats o sanctions; (4) dominant personalities tend to

    exert an excessive amount o inuence on the group.For a summary o these issues, see Van de Ven andDelbecq (1971).

    In summary, Armstrong (2006) ound little evi-dence to support the use o meetings or orecastingor decision-making. In addition, meetings are expen-

    sive both to schedule and to run. In some situations,

  • 7/29/2019 2011 Graefe

    3/13

    A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 185

    however, one must be concerned with not only the

    quality o a decision but also its acceptability. Peo-ple generally enjoy human interactions and the sense

    o working together, and meetings have oten beenshown to achieve high levels o satisaction (e.g. Boje& Murnighan, 1982; Van de Ven & Delbecq, 1974).

    3.2. Nominal group technique

    The nominal group technique (NGT), developedby Van de Ven and Delbecq (1971, 1974), tries to

    avoid some o the drawbacks o traditional meetings

    by adding a structured ormat to the direct interactions.This process is conducted in three steps. First, group

    members work independently to generate individual

    estimates or a problem. Then, the group enters anunstructured discussion to deliberate on the problem.

    Finally, the group members again work independently

    to provide their fnal individual estimates. The groupresult is the aggregated outcome o these fnal

    individual estimates. The literature also reers to this

    process as estimate-talk-estimate (Gustason, Shukla,Delbecq, & Walster, 1973).

    The idea o NGT is that direct interactions during

    the assessment or evaluation phase can have a positive

    impact on estimation or problem solving. In particular,they can help group members to clariy and justiy

    their points o view, which may help the group to make

    more inormed decisions. Nonetheless, in the phaseso generating answers and making the fnal decisions,

    NGT prevents direct interactions between group

    members, in order to reduce the known drawbackso traditional meetings. See Van de Ven and Delbecq

    (1971) or a summary o the advantages o nominal

    groups.

    3.3. The Delphi method

    Delphi is a multiple-round survey in whichthe participants anonymously reveal their individualestimates and provide comments on a problem. Ater

    each round, the individual estimates and comments

    are summarized and reported to the participants aseedback. Taking this inormation into account, the

    participants then provide their new estimate in the

    ollowing round. The group result is the aggregatedoutcome o the individual estimates o the fnal round.

    Unlike FTF and NGT, which require the physical

    proximity o the group members, the participants

    in Delphi are physically dispersed and do not meetin person. In general, Delphi unctions somewhat

    similarly to NGT. The main dierence is that written

    interactions are utilized throughout the process, inorder to avoid direct interactions between the group

    members.The strengths o the method are seen in its

    structured communication process, which enablesdiscussion and helps groups to achieve a consensus,

    but limits the drawbacks associated with directinteractions. For reviews o the Delphi method, seeRowe and Wright (1999) and Woudenberg (1991).

    3.4. Prediction markets

    Prediction markets were popular in the late 1800s(Rhode & Strump, 2004), and have an impressive

    track record in orecasting elections. In analyzing964 polls or the fve presidential elections rom1988 to 2004, Berg, Nelson, and Rietz (2008) ound

    that the respective orecasts o the Iowa ElectronicMarkets were closer to the actual election results thanthe individual polls 74% o the time. However, this

    advantage disappeared when the polls were averagedand damped (Erikson & Wlezien, 2008).

    Prediction markets are gaining attention in variousfelds o orecasting. The idea is to set up a contract

    whose payo depends on the outcome o an uncertainuture event. This contract, which can be interpretedas a bet on the outcome o the underlying uture event,

    can then be traded by participants. As soon as theoutcome is known, the participants are paid o inexchange or the contracts they hold. Based on their

    individual perormances, participants can win money.I one thinks that the current group estimate is toolow (high), one will buy (sell) stocks. Thus, through

    the prospect o gaining money, the participants havean incentive to become active in the group process

    whenever they expect the group estimate to beinaccurate. As in Delphi, the participants are mutuallyanonymous, and thus are not subject to the socialpressures connected with direct interactions. The

    main dierence is that the participants exchangeinormation continuously through the price signal othe market, but do not share comments or reasons as

    to why they buy or sell a contract. For an explanationo the concept o prediction markets and a useulsummary o the method, see Wolers and Zitzewitz

    (2004).

  • 7/29/2019 2011 Graefe

    4/13

    186 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195

    Prediction markets using play-money have alsoproven to be accurate. In comparing the results oplay-money and real-money markets or sport events,

    Servan-Schreiber, Wolers, Pennock, and Galebach(2004) did not fnd any dierences in accuracy. Onthe other hand, Rosenbloom and Notz (2006) oundreal-money markets to be more accurate or non-sportsevents.

    4. Related work

    Structured group techniques are expected to limitthe biases associated with meetings, and prior researchseems to support this assumption. In their review,Rowe and Wright (1999) reported a superior accuracy

    o Delphi over unstructured interaction by a scoreo fve studies to one. In excluding two papers romtheir analysis that did not analyze the accuracy orquantitative judgment tasks, we adjusted the score tothree studies to one, still with two ties. Similarly,in summarizing the literature, Woudenberg (1991)ound an albeit slight superiority o Delphiover unstructured interaction. Furthermore, again orquantitative judgment tasks, Gustason et al. (1973)ound NGT to be more accurate than FTF, althoughFischer (1981) ound no dierences. Evidence on therelative perormances o NGT, Delphi, and predictionmarkets or quantitative judgment tasks is scarce.While Gustason et al. (1973) showed that NGT wasmore accurate than Delphi, two studies ound nodierences (Boje & Murnighan, 1982; Fischer, 1981).In the case o prediction markets, we do not know oany work that has compared the method to FTF, NGTor Delphi.

    Little work has been done on analyzing partici-pants perceptions o group processes, although thiscan be important or the acceptance o its results. I thegroup members or decision-makers eel dissatisfed

    with the process, its outcome may not be adopted, eveni it is highly accurate. On the other hand, a processthat is highly satisying or participants may not nec-essarily lead to accurate results. In general, personalinteractions in groups can lead to either coherence, andthus a high level o perceived satisaction, or disagree-ment, resulting in rustrated group members. For quan-titative judgmental tasks, Boje and Murnighan (1982)analyzed the participants perceptions o Delphi, NGT,and people working alone. They ound that NGTwas rated most avorably in terms o eectiveness,

    satisaction, and the reedom to participate, whereas

    Delphi was rated only slightly superior to working

    alone. Van de Ven and Delbecq (1974) compared par-

    ticipants perceptions o nominal groups, Delphi, andunstructured meetings or an idea generation problem.

    They ound that NGT participants expressed the high-

    est satisaction with the process, whereas the dier-ences between the Delphi method and meetings were

    small.

    For prediction markets, we do not know o anystudy that has analyzed the way in which people

    perceive participation and how this might aect the

    acceptability o market results. However, practical

    experience indicates that people have problems

    in understanding how prediction markets work.

    In particular, they have problems in translatinginormation into market prices (Green, Armstrong,

    & Graee, 2007). In addition, people appear to have

    difculties in understanding what the prices mean. Forexample, during the 2008 fnancial crisis, intrade.com

    launched a contract or predicting whether or not

    the US government would pass a bail out. At one

    time, the contract was traded at a price o $80,

    which meant that the market orecasted an 80%probability that the bail out would go through.

    A sta writer at CNNMoney.com who would

    be expected to possess a certain understanding ofnancial instruments interpreted the market prices

    wrongly, stating that 80% o the participants thought

    that the event would come true (CNNMoney.com:

    Oddsmakers see bailout OK soon, September 24,

    2008). A misunderstanding o how prediction markets

    work might be critical to their acceptance by both

    participants and decision-makers. I people do notunderstand how to reveal inormation, the way in

    which inormation is aggregated into market prices, or

    how to interpret market prices, it is doubtul whether

    they will have a avorable perception o the method,and thus have confdence in its results.

    5. Hypotheses

    Given the lack o evidence, we had no directionalhypotheses on the relative accuracy o the three

    structured approaches (NGT, Delphi, and prediction

    markets), since arguments could be ound in either

    direction. For example, prediction markets might be

    advantageous to Delphi because traders can update

  • 7/29/2019 2011 Graefe

    5/13

    A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 187

    their estimates continuously and are not tied to alimited number o rounds. On the other hand, themarket participants do not reveal the reasons or their

    estimates. Thus, other group members might not beable to relate to why the group estimate has changed.

    Accordingly, the possibility o argumentation andreasoning might avor NGT and Delphi. For accuracy,we expected: NGT Delphi prediction markets >

    FTF.We sent the study design to experts in the feld

    and asked them to provide their prior estimates o

    the relative accuracies o the methods. To ensurethat the experts were amiliar with all our methods,we primarily reached out to people with experiencein prediction markets. Eight experts1 responded and

    we derived the ollowing ranking rom their priors:Delphi > prediction markets > NGT > FTF. Inexpecting that all three structured approaches wouldoutperorm FTF, the expert priors were consistentwith our hypotheses. In particular, none o the experts

    expected FTF to do better than any o the structuredapproaches.

    To obtain a measure o the acceptability o themethods, we asked or the participants perceptionso the group and the group process along seven at-

    tributes. In particular, participants rated (1) coopera-

    tion and (2) disagreement among group members, andrevealed their (3) confdence in the group results. Theyalso provided ratings o (4) how ree they elt to par-ticipate, (5) whether their time was well spent, (6)how difcult it was to participate, (7) how satisfedthey were with the group process, and (8) how eec-tive they thought the group process was or solvingthe estimation problem. Since our group interactionprocesses simulated non-hierarchical meetings, we ex-

    pected the group pressures to be lessened. Assumingthat people generally enjoy human interaction and thesense o working together, we hypothesized that the

    personal interactions in FTF and NGT will lead tomore avorable ratings compared to Delphi and pre-diction markets. Furthermore, due to the lower pres-sures or conormity in NGT, we expected NGT to berated more avorably than FTF, as has been shown pre-viously by Van de Ven and Delbecq (1974). In addi-

    tion, since we expected the majority o traders to be

    1 Thanks to Yiling Chen, Kesten Green, Robin Hanson, SteanLuckner, Gene Rowe, Martin Spann, Gerrit Van Bruggen and EricZitzewitz.

    unamiliar with the process o revealing ones opin-

    ions by trading stocks, we expected prediction mar-

    kets to be rated least avorably, in particular in terms o

    difculty. In summary, or acceptability, we expected:NGT > FTF > Delphi > prediction markets.

    6. Methods

    6.1. Participants

    The participants were students (n = 227) at the

    University o Pennsylvania who were recruited by

    the Wharton Behavioral Lab. They were randomly

    assigned to 44 heterogeneous groups (11 per method),

    and the 11 prediction market groups were urtherdivided into two treatments. In 5 groups, the traders

    started trading immediately, while in 6 groups,

    the traders generated their individual estimates

    independently beore participating in the market.

    The group size was determined by the number

    o students showing up in each session. Most o

    the groups (42 o 44) consisted o 4 to 6 subjects,

    though one NGT group contained 3 subjects and

    one prediction markets group consisted o 7 subjects.

    Appendix 1 provides an overview o the group size

    and (average) number o participants per method.

    All appendices can be ound in a supporting onlinedocument at http://www.orecasters.org/ij.

    All sessions lasted or one hour. Each participant

    was paid a $10 show-up ee. In addition, the

    participants were remunerated depending on their

    group or individual perormance. On average, $15

    was distributed among each group. In FTF, NGT,

    and Delphi, $50 was distributed equally among the

    participants in the two most accurate groups, while the

    next two most accurate groups received $25, and the

    next $15. To match the traditional pay-o mechanism

    o prediction markets, the traders were paid basedon their individual perormances, i.e. $6 or the best

    perorming trader in each group, $5 or the second

    best, and $4 or the third best.

    6.2. Materials and procedures

    The data were collected in the Wharton Behavioral

    Lab. FTF and NGT were conducted in a small

    meeting room, while the Delphi and prediction

    market groups worked in a computer room on

    http://www.forecasters.org/ijfhttp://www.forecasters.org/ijfhttp://www.forecasters.org/ijf
  • 7/29/2019 2011 Graefe

    6/13

    188 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195

    individual workstations, divided by partition walls

    to prevent personal communication. Each participant

    received general instructions explaining the relevant

    group technique, as well as the respective pay-omechanism. As the process started, orms were handed

    out to each participant or making personal notes

    and analyses. During the whole process, whenever

    individuals or groups were asked to give estimates,

    they had to state their confdence in these estimates on

    a seven point numerical scale (1: not at all confdent;

    7: extremely confdent). Ater the group interaction,

    all participants rom all group techniques received a

    post-task orm on which to reveal their fnal individual

    estimates. In addition, the post-task orm was used to

    obtain the participants perceptions o the group and o

    the group process as a whole, again on a seven pointnumerical scale. Examples o the materials used can

    be ound in Appendices 3 to 6.

    6.2.1. Unstructured ace-to-ace meetings

    The group members were seated around a table to

    discuss the problem, with the goal o reaching a group

    estimate. Each group received one group questionnaire

    or their fnal group estimates. In addition, each group

    member received an individual questionnaire on which

    to note the group estimates achieved.

    6.2.2. Nominal groups

    First, the group members were spread over three

    tables to work independently to provide individual

    estimates or each o the ten questions on a paper

    questionnaire. Ater each participant had completed

    the questionnaire, the group members were seated

    around one table to discuss their individual estimates

    with the group. Each group received one group

    questionnaire or their fnal group estimates. In

    addition, each group member received an individual

    questionnaire on which to note the group estimatesachieved. Finally, the group members were again

    spread out over three tables to provide their fnal

    individual estimates on the post-task questionnaire.

    The group results were calculated as the median o the

    fnal individual estimates.

    6.2.3. Delphi

    Beore logging on to the system, the participants

    watched a video tutorial (http://tinyurl.com/method-

    comparison-Delphi) o how to use the Delphi sot-

    ware. In order to conduct our research in a realis-

    tic environment that can be adopted by practitioners,

    we used the ree Delphi sotware available at For-

    Prin.com. This sotware was developed in order to ed-ucate the users about Delphi and to aid them in the

    use o the standard Delphi procedure by including its

    our key eatures: anonymity o participants, iteration,

    controlled eedback, and statistical aggregation o in-

    dividual estimates. In each round, the sotware also

    required the participants to provide their individual nu-

    merical estimates, together with 10% lower and upper

    confdence bounds. Furthermore, the participants had

    the option to provide comments. Ater everyone had

    responded, the results o the frst round were summa-

    rized and reported as eedback to the participants; this

    included each individual estimate (along with the con-fdence interval), a group summary o these estimates

    (mean, median, and standard deviation), and each indi-

    viduals comments per question. The participants then

    provided their fnal individual estimates in a second

    round. For our analysis, we used the median o the in-

    dividual estimates rom the second round to calculate

    the group results.

    6.2.4. Prediction markets

    We used the (partly) ree sotware solution o

    inklingmarkets.com, which was specifcally designedto make participation as easy as possible or

    inexperienced traders. In addition, since our markets

    had only a small number o participants, it was

    necessary to rely on a system using a market maker,

    in order to ensure a sufcient liquidity o the market.

    With Inklings market maker, the participants can

    buy and sell contracts, and thus reveal inormation,

    at any time, as they interact with the market maker

    directly, and do not require a human trading partner.

    Thus, the market can elicit inormation even rom

    a single participant. For a urther discussion o thesotware, see Christiansen (2007), who conducted

    feld experiments using Inkling.

    Each question was mapped as one contract, which

    could be traded by participants. At market closure,

    each contract was liquidated at the value o the

    correct answer or each question. Thus, i one

    thought that the correct answer was higher than

    the contracts current value, one would have bought

    shares, otherwise one would have sold. The procedure

    is explained in detail in a video tutorial, which

    http://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphihttp://tinyurl.com/method-comparison-Delphi
  • 7/29/2019 2011 Graefe

    7/13

    A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 189

    was watched by each participant beore logging on

    to the system (http://tinyurl.com/method-comparison-PM). Each participant had an initial deposit value o

    $30,000 o play-money. At the end o the session, thelast traded contract prices were interpreted as the fnalgroup results.

    6.3. Quantitative judgment task

    The questions were inuenced by two limitationso the prediction market sotware. First, although

    prediction markets can theoretically be used to predict

    large numbers, it would have been necessary to scalethe index markets, which might have made it harder

    or the participants to understand. Second, or market

    maker mechanisms it is necessary to provide a startingprice, which should reasonably be chosen to be

    greater than zero. However, announcing starting prices

    provides traders with an anchor.To circumvent these problems, the quantitative

    judgment task consisted o ten almanac questions that

    required percentage estimates. Thus, no scaling was

    necessary, since the market prices would always bebetween 0 and 100. Furthermore, all questions were

    designed to be similar by providing an anchor in

    the ormulation o the question, which was used asthe starting price in the markets. Examples include:

    In 1900, the percentage o the total US population

    that was aged 65 and over was 4.1%. What was thepercentage in 2000? or In 1980, 17.0% o the US

    population (25 years and over) had completed 4 years

    o college or more. What was the percentage in 2006?Appendix 2 provides a complete list o the questions

    used in this study.It was emphasized in the written instructions and

    by the incentive mechanism that the purpose o thetask was to estimate the answers to the questions. The

    answers were not revealed until ater the completion

    o the study.

    7. Results

    We used the absolute error as an index o

    accuracy. Tests o statistical signifcance are reportedin ootnotes.2 Signifcance levels are indicated in the

    2 Since the data were not normally distributed in most cases, weconducted non-parametric signifcance tests.

    tables using asterisks ( : p 0.05; : p 0.01).The data that were analyzed in this study can be oundonline at http://tinyurlcom/method-comparison-data.

    For NGT, Delphi, and six o the prediction marketgroups, we had access to the participants ex anteestimates prior to them entering the group interaction.We used both these individual priors and the mean othese estimates rom the participants in each groupas additional benchmarks or our analysis. In theollowing, the mean prior estimates or each group arereerred to as staticized groups.

    7.1. Group accuracy beore group interaction

    We analyzed the accuracy o the staticized groupsbeore group interaction took place. The goal othis exercise was to rule out the possibility that anydierences between the methods might have occurredbecause some groups simply happened to be betterthan others. For each question, as well as over all tenquestions, we did not fnd any dierences in accuracybetween the participants priors in NGT, Delphi, andprediction markets.3

    7.2. Accuracy o structured group interaction vs.

    individual priors

    Table 1 shows the results, reported as the errorreduction or each question as well as over allten questions. The error reduction is the dierencebetween the absolute errors o the prior individualestimates and the respective method result. A positive(negative) value or the error reduction means thatthe method results were more (less) accurate than theindividual priors. As expected, overall, the results oeach method were more accurate than the individualpriors.4 At the individual question level, predictionmarkets were more accurate than the individualpriors or our questions and equally accurate or

    six questions. NGT was more accurate or eightquestions and equally accurate or two questions.Delphi was most accurate, yielding a lower MAE thanthe respective individual priors or each o the tenquestions.

    3 The Kruskal-Wallis test was used to test or dierences betweenthe absolute errors o the prior individual estimates o the threestructured approaches.

    4 The Wilcoxon rank-sum test was used to test or dierencesbetween the absolute errors o the individual priors and the methodresults. Table 1 reports one-tailed p-values.

    http://tinyurl.com/method-comparison-PMhttp://tinyurl.com/method-comparison-PMhttp://tinyurl.com/method-comparison-PMhttp://tinyurlcom/method-comparison-datahttp://tinyurlcom/method-comparison-datahttp://tinyurlcom/method-comparison-datahttp://tinyurl.com/method-comparison-PMhttp://tinyurl.com/method-comparison-PMhttp://tinyurl.com/method-comparison-PM
  • 7/29/2019 2011 Graefe

    8/13

    190 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195

    Table 1

    Mean error reduction o each method compared to prior individual estimates beore group interaction.

    Question NGT Delphi PM

    (N = 54) (N = 59) (N = 34)Population aged 65+ 1.33* 3.15** 0.88

    Internet users 2.69* 3.50** 1.84

    Health expenditures 3.60** 2.99** 0.85

    College graduates 4.12* 4.95** 6.51**

    Population o Australia 2.42** 1.48** 2.67**

    Overweight children 1.77 0.79** 0.79

    Children with two parents 2.57** 4.51** 4.06**

    Households with PC 0.78 5.08** 3.09

    Motor vehicle production 9.82** 5.29** 5.43

    Population aged 5 1.86** 1.24** 4.42**

    Overall 3.10** 3.60** 3.05**

    N: number o prior individual estimates. p 0.05.

    p 0.01.

    7.3. Accuracy o group interaction vs. staticized

    groups

    Group interactions should yield more accurateresults than can be obtained by simply averaging

    the prior individual estimates within each group.

    We compared the methods results to those o thecorresponding staticized groups. The results, again

    reported as the error reduction, are shown in Table 2.5On average, the NGT results were signifcantly

    more accurate than the staticized groups or three out

    o ten questions (i.e., population o Australia, motorvehicle production, and population aged 5). For

    the question households with PC, the NGT results

    were signifcantly less accurate. For the remainingsix questions, there were no statistically signifcant

    dierences in accuracy.The Delphi method yielded signifcantly more

    accurate results than the staticized groups or three

    out o the ten questions. For the remaining sevenquestions, the dierences were small.Prediction markets improved on the staticized

    groups or two questions. For the remaining questions,

    there were no statistically signifcant dierences,which might be due to the small number o

    observations. However, although the dierences were

    5 The Wilcoxon rank-sum test was used to test or dierencesbetween the absolute errors o the staticized groups and the methodresults. Table 2 reports one-tailed p-values.

    small, prediction markets were less accurate than thestaticized groups or six o the ten questions.

    Over all ten questions, group interaction improvedon the staticized groups. However, that being said, thegains in accuracy were modest and the sample sizeswere not large, and thus there is some uncertainty

    about these fndings. The dierences were statisticallysignifcant only with respect to Delphi.

    7.4. Relative accuracy o group methods

    For each method, we calculated the MAEs oreach question as well as over all ten questions.With the lowest overall MAE o 5.62, Delphi wasthe most accurate, ollowed by NGT (5.92) and

    prediction markets (7.07). As expected, FTF wasleast accurate; its overall MAE was 7.21. Althoughthe results conormed to our expectations that thestructured approaches would outperorm FTF, theoverall dierences between the methods were not

    statistically signifcant.6 At the question level, theevidence on the relative perormances o the methodswas mixed. Table 3 shows the error reduction o NGT,Delphi, and prediction markets compared to FTF.

    The dierences between FTF and NGT weresmall.7 For eight o the ten questions, there were no

    6 The Kruskal-Wallis test showed no statistically signifcantdierences between the our methods over all 10 questions.

    7 We conducted Mann-Whitney U-tests to compare the results oFTF and the respective structured approaches or each question.

  • 7/29/2019 2011 Graefe

    9/13

    A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 191

    Table 2

    Mean error reduction o each method compared to staticized groups.

    Question NGT Delphi PM

    (N = 11) (N = 11) (N = 6)Population aged 65+ 0.87 0.93 1.7

    Internet users 0.89 1.59 1.13

    Health expenditures 0.71 0.09 3.22

    College graduates 0.3 1.52 3.25

    Population o Australia 2.13** 1.45** 2.47*

    Overweight children 2.77 1.02 3.52

    Children with two parents 1.69 1.39 1.37

    Households with PC 4.96** 2.09 1.37

    Motor vehicle production 8.79** 3.58* 3.95

    Population aged 5 1.58** 0.73* 3.77*

    Overall 0.41 0.54* 0.17

    N: number o groups. p 0.05. p 0.01.

    Table 3

    Error reduction o NGT, Delphi, and prediction markets compared to FTF.

    Question MAE Error reduction to FTF

    FTF NGT Delphi PM

    Population aged 65+ 7.76 2.69 4.29** 1.22

    Internet users 4.53 0.53 1.33 3.36

    Health expenditures 3.86 2.19 1.38 0.16

    College graduates 6.36 0.27 0.26 1.87

    Population o Australia 2.41 0.09 1.2 2.35*

    Overweight children 13.98 7.93* 9.74** 5.52

    Children with two parents 7.19 0.8 2.77 0.98

    Households with PC 15 0.82 7.21 6.5

    Motor vehicle production 10.12 0.22 5.49 7.92*

    Population aged 5 0.86 1.00* 1.21 1.26**

    Overall 7.21 1.29 1.59 0.14

    p 0.05. p 0.01.

    signifcant dierences between the two methods. For

    the two remaining questions, NGT was more accurate

    than FTF once and less accurate once.

    Delphi was signifcantly more accurate than FTF

    or two questions. For the remaining eight questions,

    there were no dierences in accuracy. In no case was

    Delphi signifcantly less accurate than FTF.

    Contrary to our expectations, prediction markets

    were unable to improve on the accuracy o FTF.

    Instead, or three questions, prediction markets

    yielded results that were signifcantly less accurate

    than the respective FTF results. For the remaining

    questions, there were no statistically signifcantdierences.

    7.5. Participants perceptions

    Ater the group interaction, we asked the partic-ipants about their perceptions o the group, as wellas o the group process. We had hypothesized thatthe personal interactions between participants wouldlead to more avorable ratings or FTF and NGTrelative to Delphi and prediction markets. Further-

    more, we had assumed that, due to lower group pres-sures, NGT would be rated more avorably than FTF.

  • 7/29/2019 2011 Graefe

    10/13

    192 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195

    Fig. 1. Participants mean ratings o each method (on a 7-point scale).

    Finally, because o the participants lack o amiliaritywith the approach, we expected prediction markets to

    be rated most unavorably on most attributes. The re-

    sults were consistent with these hypotheses. For eachattribute, Fig. 1 reports the statistical mean (and 95%

    confdence intervals) o the participants ratings, re-vealed on a 7-point numerical scale rom 1 (very

    low) to 7 (very high). Note that or the attributesdisagreementand difculty, higher ratings were inter-

    preted as being negative.

    Ratings o the group

    NGT obtained the most avorable ratings or coop-eration and disagreement and ranked second or par-

    ticipants confdence in the group answers. Predictionmarkets ranked second worst or disagreement andconfdence and worst or cooperation. For both coop-

    eration and disagreement, FTF ranked second.

    Although Delphi obtained the worst score ordisagreement, the participants were most confdentin the outcomes. We see the ollowing reasons or

    the high level o perceived disagreement in Delphi:(1) Delphi participants cannot communicate with each

    other directly, but reveal their estimates independently;and (2) agreement can only increase i incorporating

    eedback inormation ater the frst round leads to

    more cohesive results at the end o the process.However, disclosing disagreement can be desirable, as

    it alerts decision-makers to uncertainty.

    FTF obtained the lowest score in terms oconfdence in the group outcome. This is surprising,

    since we would have expected that high levels o

    perceived cooperation and agreement would result inhigh levels o confdence in the group results. We

    assume that the FTF participants were aware that, dueto group tendency eects, probably not all available

    inormation rom group members was aggregated.

    Ratings o the group process

    The rankings o the group process as a whole were

    identical over 4 out o 5 attributes, with NGT placedfrst, FTF second, Delphi third and prediction markets

    ourth. For difculty to participate, FTF achieved

    an equally avorable score as NGT. For reedom toparticipate, FTF and Delphi swapped places. Given

    the avorable ratings or FTF on most other attributes,it is interesting that Delphi achieved a higher score

    than FTF or reedom to participate. Again, the reasonmight be that the FTF participants experienced group

    pressures that may have hindered them rom ully

    revealing their inormation, which fts with the lowconfdence scores or FTF. Overall, the participants

    perceptions o the prediction markets process werepoor or all attributes.

    In summary, the results conormed to our expecta-

    tions. NGT obtained the most avorable ratings or 7 othe 8 categories. As hypothesized, FTF was also rated

    quite avorably, being ranked second or six attributes.

  • 7/29/2019 2011 Graefe

    11/13

    A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 193

    Finally, the results corroborated our hypothesis that

    prediction markets would be rated most unavorably:they had the worst ratings or six attributes and were

    second worst or the two remaining attributes.

    8. Discussion

    Although the three structured approaches yielded

    lower MAEs than FTF over all ten questions, the di-erences in accuracy between the our methods were

    not statistically signifcant. The results conormed to

    our hypotheses only with respect to the direction o

    the results.However, some dierences between the methods

    were identifed at the question level. Delphi was theonly approach that was never less accurate than FTF,

    and outperormed FTF or two o the ten questions.By comparison, prediction markets were less accurate

    than FTF or three o the ten questions and were

    unable to outperorm FTF or any o the questions.

    The relative perormances o NGT and FTF weremixed and the dierences were small. We can only

    speculate on the reasons or these results.The participants in this study had to solve a

    quantitative judgment task that required percentageestimates o the answers to actual questions. Such

    tasks have clear solutions, lack highly distributedknowledge and uncertainty, and require groups only

    to aggregate actual knowledge. We would expect thatstructured approaches would be substantially more

    eective when participants reveal inormation that is

    new and useul to others.For example, or site location problems, dierent

    experts might have distinct knowledge about labor

    and real estate costs, trafc patterns, and the

    types o consumers that live near a potential retail

    outlet. Other tasks, such as orecasting, decision-

    making, negotiation, or conict solving, are morechallenging, as they involve not only acts, but also

    values, emotions, and expectations, as well as highlevels o uncertainty. We would expect the relative

    perormances o methods to be largely inuenced by

    such problems (Wright & Ayton, 1986). In particular,we would expect group pressures to be much higher,

    and thereore FTF to generally perorm worse.In contrast, when all o the experts draw upon

    the same inormation, which might have been the

    case or the type o problem used in this study,

    inormation exchange is expected to be o little value.

    Over the ten questions, only Delphi yielded more

    accurate results (in terms o statistical signifcance)

    than staticized groups. By comparison, the dierencesbetween prediction markets or NGT and staticized

    groups were not statistically signifcant.

    Under most circumstances, structured combined

    orecasts are substantially more accurate than in-

    dividual orecasts (Armstrong, 2001). Our results

    provide urther support or this well-established prin-

    ciple. Each o the three structured approaches yielded

    lower mean absolute errors than the prior individual

    estimates.

    Another actor that might have inuenced the

    results is the experiment design. The experiment

    setting avored meetings, since it did not involve

    hierarchies. In a more realistic environment with

    hierarchies, we would expect group pressures to be

    more o a concern, and as a result, the relative

    perormance o FTF would decrease.

    The comparatively poor perormance o prediction

    markets may have been due to the limited means

    o communication between participants. Although the

    participants could exchange inormation continuously

    through trading, they were unable to provide reasoning

    or their estimates. They could only buy and sell

    contracts, without indicating why they did so. Thus,the trading mechanism might not have allowed the

    participants to convey inormation to the group.

    However, prediction markets could be designed to

    overcome such barriers. For example, additional

    means o communication, such as orums or chat

    rooms, could be implemented to allow traders

    to communicate with each other and to reveal

    the reasoning or their actions. Although several

    commercial prediction markets have implemented

    such solutions, we are not aware o any research that

    has examined whether this has an eect on accuracy.Examining the participants perceptions o the

    group and o the group process as a whole revealed

    a preerence or methods involving personal commu-

    nication. These results supported our hypothesis that

    personal interactions in non-hierarchical meetings can

    lead to coherence, and thereore increase the perceived

    satisaction o the participants. In contrast, Delphi and

    prediction markets were rated less avorably. In par-

    ticular, prediction markets were rated worst on six

    attributes and second worst on the remaining two.

  • 7/29/2019 2011 Graefe

    12/13

    194 A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195

    We see at least two reasons or this. First, the lack oexchanging reasons or their judgments may alienate

    traders. We would expect comort with the method to

    rise with increasing insight into the opinions o el-low traders. Second, prediction markets are still a rel-atively new approach, and the idea o revealing ones

    opinion through trading stocks may be difcult or par-ticipants to understand. Even though we relied on sot-

    ware that was specifcally designed to make participa-tion as easy as possible or novices, participation in

    prediction markets was rated as the most difcult byar. We assume that the high perceived difculty was

    crucial to the poor ratings on other categories, whichin turn may hinder the urther adoption o prediction

    markets by practitioners. Researchers and developers

    should urther increase their eorts to make marketsotware solutions more accessible to inexperiencedtraders.

    Participants perceptions o the group process canbe crucial to the acceptance o its results. When the

    agreement among and satisaction o groupmembers is high, decision-makers can share the

    responsibility or their decisions. Thus, our resultsmay entice decision-makers to rely on methods

    involving personal communication. However, such astrategy has its drawbacks.

    1. There is no evidence that high perceived satisac-tion is correlated to good perormance.

    2. Taking into account administrative and time eorts,

    meetings are expensive and can be difcult toconduct, as they require the presence o the

    participants.3. Meetings are limited in the number o participants

    involved. In contrast, Delphi and predictionmarkets do not require the presence o the

    participants and can be conducted asymmetricallywith a virtually unlimited number o people.

    4. Meetings can lead to poor decisions. Due to highergroup pressures in real world situations, we would

    expect meetings to perorm worse or more com-plex problems and more realistic environments,

    especially when the relevant inormation is dis-tributed among the experts.

    9. Conclusion

    We compared the relative accuracy o traditional

    FTF to that o three structured approaches (NGT,

    Delphi, and prediction markets) on a quantitative

    judgment task that required percentage estimates orten actual questions.

    Over the ten questions, we did not fnd any sta-tistically signifcant dierences in accuracy between

    the our methods. However, the method results di-

    ered somewhat at the individual question level. Del-phi was as accurate as FTF or eight questions and

    outperormed FTF or two questions. By comparison,

    prediction markets were unable to outperorm FTFor any o the questions and were inerior or three

    questions. The relative perormances o NGT and

    FTF were mixed and the dierences were generallysmall.

    The three structured approaches were more accu-

    rate than the participants prior individual estimates.Delphi was also more accurate than staticized groups,

    in contrast to NGT and prediction markets. The re-sults suggest that NGT and prediction markets pro-

    vide little additional value over simple averages o

    orecasts in situations where the participants are alldrawing upon similar inormation. Future research

    should evaluate the relative perormances o the meth-

    ods or more complex problems in more realisticenvironments.

    We also analyzed the participants perceptions o

    the group and o the group process as a whole.The participants rated methods involving personal

    communication (FTF and NGT) more avorablythan the computer-mediated Delphi and prediction

    markets. In particular, the FTF and NGT participants

    experienced higher levels o cooperation amongtheir groups and perceived the group interaction as

    more eective. Prediction markets were rated least

    avorably: the prediction market participants wereleast satisfed with the group process and rated their

    method highest in terms o difculty o participation.

    Acknowledgements

    We would like to thank Kesten Green, RobinHanson, and David Pennock or helpul comments.

    Thanks to the sta o the Wharton Behavioral

    Lab, and in particular Daniela Lejtneker, Tatiana

    Silva, and Patricia Zapater-roig. Ellen Rosenbergand Christian Menn helped with preparing materials.

    Thanks to inklingmarkets.com or providing the

    prediction markets sotware.

  • 7/29/2019 2011 Graefe

    13/13

    A. Graee, J.S. Armstrong / International Journal o Forecasting 27 (2011) 183195 195

    References

    Armstrong, J. S. (2001). Combining orecasts. In J. S. Armstrong

    (Ed.), Principles o orecastinga handbook or researchers

    and practitioners (pp. 417439). Boston, MA: Kluwer

    Academic Publishers.

    Armstrong, J. S. (2006). Howto make betterorecasts and decisions:

    avoid ace-to-ace meetings. ForesightThe International

    Journal o Applied Forecasting, 5, 38.

    Berg, J. E., Nelson, F. D., & Rietz, T. A. (2008). Prediction market

    accuracy in the long run. International Journal o Forecasting,

    24, 285300.

    Boje, D. M., & Murnighan, J. K. (1982). Group confdence

    pressures in iterative decisions. Management Science, 28,

    11871196.

    Christiansen, J. D. (2007). Prediction markets: practical experiments

    in small markets and behaviours observed. Journal o Prediction

    Markets, 1, 1741.Erikson, R. S., & Wlezien, C. (2008). Are political markets

    really superior to polls as election predictors? Public Opinion

    Quarterly, 72, 190215.

    Fischer, G. W. (1981). When oracles aila comparison o

    our procedures or aggregating subjective probability ore-

    casts. Organizational Behavior and Human Perormance, 28,

    96110.

    Green, K. C., Armstrong, J. S., & Graee, A. (2007). Methods

    to elicit orecasts rom groups: Delphi and prediction markets

    compared. ForesightThe International Journal o Applied

    Forecasting, 8, 1720.

    Gustason, D. H., Shukla, R. K., Delbecq, A., & Walster, G. W.

    (1973). A comparative study o dierences in subjective

    likelihood estimates made by individuals, interacting groups,

    Delphi groups, and nominal groups. Organizational Behavior

    and Human Perormance, 9, 280291.Rhode, P. W., & Strump, K. S. (2004). Historical presidential

    betting markets. Journal o Economic Perspectives, 18,

    127141.

    Rosenbloom, E. S., & Notz, W. (2006). Statistical tests o

    real-money versus play-money prediction markets. Electronic

    Markets, 16, 6369.

    Rowe, G., & Wright, G. (1999). The Delphi technique as a

    orecasting tool: issues and analysis. International Journal o

    Forecasting, 15, 353375.

    Servan-Schreiber, E., Wolers, J., Pennock, D. M., & Galebach, B.

    (2004). Prediction markets: does money matter? Electronic

    Markets, 14, 243251.

    Van de Ven, A. H., & Delbecq, A. L. (1971). Nominal versus

    interacting group processes or committee decision making

    eectiveness. Academy o Management Journal, 14, 203212.

    Van de Ven, A. H., & Delbecq, A. L. (1974). The eectiveness

    o nominal, Delphi, and interacting group decision making

    processes. Academy o Management Journal, 17, 605621.

    Wolers, J., & Zitzewitz, E. (2004). Prediction markets. Journal o

    Economic Perspectives, 18, 107126.

    Woudenberg, F. (1991). An evaluation o Delphi. Technological

    Forecasting and Social Change, 40, 131150.

    Wright, G., & Ayton, P. (1986). Subjective confdence in orecasts:

    a response to Fischho and MacGregor. Journal o Forecasting,

    5, 117123.