
Statistical principles in psychological research 2S · 2019-07-03


Statistical principles in psychological research

LEARNING OBJECTIVES

After studying this supplement you should be able to:

1 distinguish between the different measures of central tendency

2 describe the different measures of variability

3 define statistical significance

4 define effect size and explain why it is important in the reporting of data

5 outline some common tests of statistical significance.


CONCEPT MAP

Statistical principles in psychological research

Summarising the data: descriptive statistics

• Descriptive statistics allow researchers to summarise data in a readily understandable form. The first step in describing the data is often to provide a frequency distribution, which shows how frequently participants received each of the many possible scores.

• Measures of central tendency provide an index of the way a typical participant responded on a measure. The three most common measures of central tendency are the mean (average of the scores of all participants), mode (most common score) and median (the score that falls in the middle of the distribution).

• Variability is the extent to which participants tend to differ from one another. The standard deviation describes how much the average participant deviates from the mean.

• In a normal distribution, the scores of most participants fall in the middle of the bell‐shaped distribution, and progressively fewer participants have scores at either extreme. Participants’ scores on a normally distributed variable can be described in terms of the number of standard deviations from the mean or as percentile scores, which indicate the percentage of scores that fall below them.

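[Figure: normal distribution of IQ scores. The horizontal axis is marked in IQ scores (55 to 145), standard deviations (–3 to +3) and percentiles (0.1 to 99.9); the vertical axis shows the number of participants. 68% of scores fall within one standard deviation of the mean, 95% within two and 99.7% within three.]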

Testing the hypothesis: inferential statistics

• To assess whether the findings of a study are likely to reflect anything other than chance, psychologists use inferential statistics, notably tests of statistical significance. They usually report a probability value, or p value, which represents the probability that any positive findings obtained (such as a difference between groups, or a correlation coefficient that differs from zero) were accidental or just a matter of chance. By convention, psychologists accept p values that fall below .05 (that have a probability of being accidental of less than 5 percent). The best ways to protect against spurious findings are to use large samples and to try to replicate findings in other samples.

• The effect size indicates the magnitude of the experimental effect or the strength of a relationship.

• Choosing which inferential statistics to use depends on the design of the study and whether the variables assessed are continuous or categorical.

• A Chi‐square test (or χ²) is used if both the independent and dependent variables are categorical; it compares the observed data with the results that would be expected by chance and tests the likelihood that the differences between observed and expected are accidental.

• A t test compares the mean scores of two groups and is a special case of a statistical procedure called an analysis of variance (ANOVA), which can be used to compare the means of two or more groups. ANOVA assesses the likelihood that differences in means among groups occurred by chance (or examines the extent to which variation in scores is attributable to the independent variable); if the variation between groups is substantially larger than the variation within groups, then the independent variable probably accounts for the difference.

Central questions: what does it mean to say that a finding is ‘significant’?

◆ Inferential statistics are extremely important in trying to draw inferences from psychological research. But they are not foolproof, and are heavily dependent on sample size.

Psychology | 4th Australian and New Zealand Edition

Statistics are far more intuitive than most people believe, even to people who do not consider mathematics their strong suit. As described in chapter 2, psychologists use descriptive statistics to summarise quantitative data in an understandable form. They employ inferential statistics to tell whether the results reflect anything other than chance. We discuss each in turn, and then return to one central question: what does it mean to describe a study’s findings as ‘significant’?

■ Summarising the data: descriptive statistics

The first step in describing participants’ responses on a variable is usually to chart a frequency distribution. A frequency distribution is exactly what it sounds like: a method of organising the data to show how frequently participants received each of the many possible scores. In other words, a frequency distribution represents the way scores were distributed across the sample.

The kind of frequency distribution that a lecturer might observe on a mid‐semester examination (in a very small class, for illustration) is shown in figure 2S.1 and again graphically in figure 2S.2. The graph, called a histogram, plots ranges of scores along the x axis and the frequency of scores in each range on the y axis. The rounded‐out version of the histogram drawn with a line is the familiar ‘curve’.

Score
91      (mode)
91
87
85      (median)
84
81
20
Total   539

Mean = 539 (total) / 7 (number of students) = 77

FIGURE 2S.1 Distribution of test scores on a mid-semester examination
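The grouping step behind a histogram can be sketched in a few lines of Python. This is only an illustration: it takes the seven scores listed in figure 2S.1 and counts how many fall in each ten-point range; the function name `score_range` is our own, not part of any statistics package.

```python
from collections import Counter

# The seven mid-semester scores listed in figure 2S.1
scores = [91, 91, 87, 85, 84, 81, 20]


def score_range(score, width=10):
    """Return the label of the range a score falls in: 0-10, 11-20, ..., 91-100."""
    if score <= width:
        return f"0-{width}"
    low = ((score - 1) // width) * width + 1
    return f"{low}-{low + width - 1}"


# Frequency distribution: how many students scored in each range
frequency = Counter(score_range(s) for s in scores)

# Print a rough text histogram, lowest range first
for label, count in sorted(frequency.items(), key=lambda item: int(item[0].split("-")[0])):
    print(f"{label:>6}: {'#' * count} ({count})")
```

With so few students most ranges are empty, which is why a real histogram is usually drawn from a larger class.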

Measures of central tendency

Perhaps the most important descriptive statistics are measures of central tendency, which provide an index of the way a typical participant responded on a measure. The three most common measures of central tendency are the mean, the mode and the median.

The mean is simply the statistical average of the scores of all participants, calculated by adding up all the participants’ scores and dividing by the number of participants. The mean is the most commonly reported measure of central tendency and is the most intuitively descriptive of the average participant.

Sometimes, however, the mean may be misleading. For example, consider the table of mid‐semester exam scores presented in figure 2S.1. The mean grade is 77. Yet the mean falls below 6 of the 7 scores on the table. In fact, most students’ scores fall somewhere between 81 and 91.

Central question ◆ What does it mean to say that a psychological finding is ‘significant’?

FIGURE 2S.2 Histogram showing a frequency distribution of test scores. A frequency distribution shows graphically the frequency of each score (how many times it occurs) distributed across the sample. [Axes: scores in ranges from 0–10 to 91–100 (x axis); frequency, that is, the number of students receiving scores in this range, from 1 to 5 (y axis).]


Why is the mean so low? It is pulled down by the score of a single student, an outlier, who probably did not study. In this case, the median would be a more useful measure of central tendency, because a mean can be strongly influenced by extreme and unusual scores in a sample. The median refers to the score that falls in the middle of the distribution of scores, with half scoring below and half above it. Reporting the median essentially allows us to ignore extreme scores on each end of the distribution that would bias a portrait of the typical participant. In fact, the median in this case, 85 (which has three scores above and three below it), makes more intuitive sense in that it seems to capture the middle of the distribution, which is precisely what a measure of central tendency is supposed to do.

In other instances, a useful measure of central tendency is the mode (or modal score), which is the most common (i.e. most frequent) score observed in the sample. In this case, the mode is 91, because two students received a score of 91, whereas all other scores had a frequency of only one. The problem with the mode in this case is that it is also the highest score, which is not a good estimate of central tendency.
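The three measures can be checked against the scores in figure 2S.1 with Python's standard `statistics` module. The sketch below is illustrative only; the variable `without_outlier` is our own addition, included to show how far a single extreme score moves the mean while barely moving the median.

```python
import statistics

# The seven mid-semester scores listed in figure 2S.1
scores = [91, 91, 87, 85, 84, 81, 20]

mean = statistics.mean(scores)      # 539 / 7
median = statistics.median(scores)  # middle score once the scores are sorted
mode = statistics.mode(scores)      # most frequent score

print(f"mean = {mean}, median = {median}, mode = {mode}")

# Drop the outlier (20) to see how much each measure shifts
without_outlier = [s for s in scores if s != 20]
print(f"mean without outlier = {statistics.mean(without_outlier)}")
print(f"median without outlier = {statistics.median(without_outlier)}")
```

Removing the single score of 20 lifts the mean from 77 to 86.5, whereas the median moves only from 85 to 86.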

Variability

As the previous examples suggest, another important descriptive statistic is a measure of the variability of scores, that is, how much participants’ scores differ from one another. Variability influences the choice of measure of central tendency. The simplest measure of variability is the range of scores, which shows the difference between the highest and the lowest value observed on the variable. In figure 2S.1, the range of scores is quite large, from 20 to 91.

The range can be a biased estimate of variability, however, in much the same way as the mean can be a biased estimate of central tendency. Scores do range considerably in this sample, but for the vast majority of students, variability is minimal (ranging from 81 to 91). Hence, a more useful measure is the standard deviation (SD), which again is just what it sounds like: the amount the average participant deviates from the mean of the sample. Figure 2S.3 shows how to compute a standard deviation, using the first five students’ scores as an illustration.

INTERIM SUMMARY

Descriptive statistics allow researchers to summarise data in a readily understandable form. The first step in describing the data is often to provide a frequency distribution, which shows how frequently participants received each of the many possible scores. The most important descriptive statistics are measures of central tendency, which provide an index of the way a typical participant responded on a measure. The mean is the average of the scores of all participants; the mode is the most common score; the median is the score that falls in the middle of the distribution.

Score      Deviation from the mean (D)    D²
91         91 – 87.6 = 3.4                11.56
91         91 – 87.6 = 3.4                11.56
87         87 – 87.6 = –0.6               0.36
85         85 – 87.6 = –2.6               6.76
84         84 – 87.6 = –3.6               12.96
Σ = 438    0                              43.20

$\text{Mean} = \dfrac{\Sigma}{N} = \dfrac{438}{5} = 87.6$

$SD = \sqrt{\dfrac{\Sigma D^2}{N}} = \sqrt{\dfrac{43.2}{5}} = 2.94$

Note: Computing a standard deviation (SD) is more intuitive than it might seem. The first step is to calculate the mean score, which in this case is 87.6. The next step is to calculate the difference, or deviation, between each participant’s score and the mean score, as shown in column 2. The standard deviation is meant to capture the average deviation of participants from the mean. The only complication is that taking the average of the deviations would always produce a mean deviation of zero because the sum of deviations is by definition zero (see the total in column 2). Thus, the next step is to square the deviations (column 3). The standard deviation is then calculated by taking the square root of the sum (Σ) of all the squared differences divided by the number of participants (N).

FIGURE 2S.3 The standard deviation
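The steps in figure 2S.3 translate almost line for line into code. The sketch below follows the figure in dividing by N (the number of participants); Python's `statistics.pstdev` uses the same formula, whereas `statistics.stdev` divides by N – 1 and would give a slightly larger value. The range of all seven scores from figure 2S.1 is included for comparison.

```python
import math
import statistics

all_scores = [91, 91, 87, 85, 84, 81, 20]  # figure 2S.1
first_five = all_scores[:5]                # the scores used in figure 2S.3

# Range: highest score minus lowest score
score_range = max(all_scores) - min(all_scores)

# Step 1: the mean of the five scores
mean = sum(first_five) / len(first_five)

# Step 2: each score's deviation from the mean (these always sum to zero)
deviations = [score - mean for score in first_five]

# Step 3: square the deviations so they no longer cancel out
squared = [d ** 2 for d in deviations]

# Step 4: square root of the average squared deviation
sd = math.sqrt(sum(squared) / len(first_five))

print(f"range = {score_range}, mean = {mean}, SD = {sd:.2f}")
print(f"statistics.pstdev gives {statistics.pstdev(first_five):.2f}")
```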

APPLY ✚ DISCUSS

In Indonesia, many people are very poor, whereas a small number are extremely wealthy.

• How might the mean, median and mode for family income provide realistic or misleading measures of central tendency in describing how poor or wealthy the average Indonesian is?


The normal distribution

When researchers collect data on continuous variables (such as weight or IQ) and plot them on a histogram, the data usually approximate a normal distribution, like the distribution of IQ scores shown in figure 2S.4. In a normal distribution, the scores of most participants fall in the middle of the bell‐shaped distribution, and progressively fewer participants have scores at either extreme. In other words, most individuals are about average on most dimensions, and very few are extremely above or below average. Thus, most people have an IQ around average (100), whereas very few have an IQ of 70 or 130. In a distribution of scores that is completely normal, the mean, mode and median are all the same.

IQ score              55     70     85     100    115    130    145
Standard deviation    –3     –2     –1     0      +1     +2     +3
Percentile            0.1    2      16     50     84     98     99.9

Percentage of scores in each band: 2% (–3 to –2), 13.5% (–2 to –1), 34% (–1 to 0), 34% (0 to +1), 13.5% (+1 to +2), 2% (+2 to +3). 68% of scores fall within one standard deviation of the mean, 95% within two and 99.7% within three. [Vertical axis: number of subjects.]

FIGURE 2S.4 A normal distribution. IQ scores approximate a normal distribution, which looks like a bell-shaped curve; 68 percent of scores fall within one standard deviation of the mean (represented by the area under the curve in blue). The curve is a smoothed-out version of a histogram. An individual’s score can be represented alternatively by the number of standard deviations it diverges from the mean in either direction or by a percentile score, which shows the percentage of scores that fall below it (to the left on the graph).

Participants’ scores on a variable that is normally distributed can be described in terms of how far they are from average, that is, their deviation from the mean. Thus, a person’s IQ could be described either as 85 or as one standard deviation below the mean, because the standard deviation in IQ is about 15. A participant two standard deviations below the mean would have an IQ of 70, which is bordering on severe intellectual impairment. For normal data, 68 percent of participants fall within one standard deviation of the mean (34 percent on either side of it), 95 percent fall within two standard deviations and more than 99.7 percent fall within three standard deviations. Thus, an IQ above 145 is a very rare occurrence.

Knowing the relationship between standard deviations and percentages of participants whose scores lie within different parts of a distribution allows researchers to report percentile scores, which indicate the percentage of scores that fall below a score. Thus, a participant whose score is three standard deviations above the mean is in the 99.9th percentile (see figure 2S.4), whereas an average participant (whose score does not deviate from the mean) is in the 50th percentile.
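The link between standard deviations and percentiles can be reproduced with `statistics.NormalDist` from Python's standard library. The sketch assumes, as the text does, that IQ is normally distributed with a mean of 100 and a standard deviation of 15; the helper name `percentile` is our own.

```python
from statistics import NormalDist

# IQ modelled as a normal distribution: mean 100, standard deviation 15
iq = NormalDist(mu=100, sigma=15)


def percentile(score):
    """Percentage of scores expected to fall below the given score."""
    return 100 * iq.cdf(score)


for score in (55, 70, 85, 100, 115, 130, 145):
    z = (score - iq.mean) / iq.stdev  # distance from the mean in standard deviations
    print(f"IQ {score:>3}: {z:+.0f} SD, percentile = {percentile(score):.1f}")

# Proportion of scores within one standard deviation of the mean
within_one_sd = iq.cdf(115) - iq.cdf(85)
print(f"within one SD: {100 * within_one_sd:.0f}%")
```

The percentiles printed for each score agree, after rounding, with the percentile row of figure 2S.4.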

INTERIM SUMMARY

Variability is the extent to which participants tend to differ from one another. The standard deviation describes how much the average participant deviates from the mean. When psychologists collect data on continuous variables, they often find that the data approximate a normal distribution, with most scores towards the middle. Participants’ scores on a normally distributed variable can be described in terms of the number of standard deviations from the mean or as percentile scores, which indicate the percentage of scores that fall below them.

■ Testing the hypothesis: inferential statistics

When researchers find a difference between the responses of participants in one condition and another, they must infer whether these differences occurred by chance or reflect a true causal relationship. Similarly, if they discover a correlation between two variables, they need to know the likelihood that the two variables simply correlated by chance.

As the philosopher David Hume (1711–1776) explained two centuries ago, we can never be entirely sure about the answers to questions like these. If someone believes that all swans are white and observes 99 swans that are white and none that are not, can the person conclude with certainty that the hundredth swan will also be white? The issue is one of probability. If the person has observed a representative sample of swans, what is the likelihood that, given 99 white swans, a black one will appear next?

Statistical significance

Psychologists typically deal with this issue in their research by using tests of statistical significance, which help determine whether the results of a study are likely to have occurred simply by chance (and thus cannot be meaningfully generalised to a population) or whether they reflect true properties of the population. Statistical significance should not be confused with practical or theoretical significance. A researcher may demonstrate with a high degree of certainty that, on the average, females spend less time watching football than males, but who cares? Statistical significance means only that a finding is unlikely to be an accident of chance.

Beyond describing the data, then, the researcher’s second task is to draw inferences from the sample to the population as a whole. Inferential statistics help sort out whether or not the findings of a study really show anything. Researchers usually report the likelihood that their results mean something in terms of a probability value (or p value). A p value represents the probability that any positive findings obtained with the sample (such as differences between two experimental conditions) were just a matter of chance. In other words, a p value is an index of the probability that positive findings obtained would not apply to the population and instead reflect only the peculiar characteristics of the particular sample.

To illustrate, one study tested the hypothesis that children increasingly show signs of morality and empathy during their second year (Zahn‐Waxler, Radke‐Yarrow, Wagner, & Chapman, 1992). The investigators trained 27 mothers to tape‐record reports of any episode in which their one‐year‐olds either witnessed distress (e.g. seeing the mother burn herself on the stove) or caused distress (e.g. pulling the cat’s tail or biting the mother’s breast while nursing). The mothers dictated descriptions of these events over the course of the next year; each report included the child’s response to the other person’s distress. Coders then rated the child’s behaviour using categories such as prosocial behaviour, defined as efforts to help the person in distress.

Table 2S.1 shows the average percentage of times the children behaved prosocially during these episodes at each of three periods: time 1 (13 to 15 months of age), time 2 (18 to 20 months) and time 3 (23 to 25 months). As the table shows, the percentage of times children behaved prosocially increased dramatically over the course of the year, regardless of whether they witnessed or caused the distress. When the investigators analysed the changes in rates of prosocial responses over time to both types of distress (witnessed and caused), they found the differences statistically significant. A jump from 9 to 49 percent of episodes eliciting prosocial behaviour in 12 months was thus probably not a chance occurrence.

By convention, psychologists accept the results of a study whenever the probability of positive findings attributable to chance is less than 5 percent. This is typically expressed as p < .05. Thus, the smaller the p value, the more certain you can feel about the results. A researcher would rather be able to say that the chances that her findings are spurious (i.e. just accidental) are 1 in 1000 (p < .001) than 1 in 100 (p < .01).
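One way to see what a p value measures is to compute one directly. Strictly speaking, a p value is the probability of obtaining a difference at least as large as the one observed if chance alone were at work. The sketch below uses a permutation test, which is only one of several ways to obtain a p value and is not the method used in the study described above; the scores are invented for illustration. It reshuffles ten scores into every possible pair of groups of five and counts how often chance alone produces a difference as large as the real one.

```python
from itertools import combinations
from statistics import mean

# Hypothetical scores for two experimental conditions (invented for illustration)
condition_a = [12, 15, 14, 16, 13]
condition_b = [18, 17, 20, 19, 21]

observed = abs(mean(condition_a) - mean(condition_b))
pooled = condition_a + condition_b
n_a = len(condition_a)

at_least_as_large = 0
total = 0
# Try every way of splitting the ten scores into two groups of five
for indices in combinations(range(len(pooled)), n_a):
    group_a = [pooled[i] for i in indices]
    group_b = [pooled[i] for i in range(len(pooled)) if i not in indices]
    total += 1
    # Small tolerance guards against floating-point rounding
    if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
        at_least_as_large += 1

p_value = at_least_as_large / total
print(f"{at_least_as_large} of {total} splits match the observed difference: p = {p_value:.4f}")
print("significant at p < .05" if p_value < 0.05 else "not significant at p < .05")
```

Because the two invented groups do not overlap at all, only 2 of the 252 possible splits produce a difference this large, so p is well below .05.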

TABLE 2S.1 Children’s prosocial response to another person’s distress during the second year of life

                      Percentage of episodes in which the child behaved prosocially
Type of incident      Time 1    Time 2    Time 3
Witnessed distress    9         21        49
Caused distress       7         10        52

Source: Adapted from Zahn-Waxler et al. (1992).

Nevertheless, researchers can never be certain that their results are true of the population as a whole; a black swan could always be swimming in the next lake. Nor can they be sure that if they performed the study with 100 different participants they would not obtain different findings. This is why replication (repeating a study to see whether the same results occur again) is extremely important in science. For example, in his studies of mood and memory described in chapter 2, Bower hit an unexpected black swan. His initial series of studies yielded compelling results, but some of these findings failed to replicate in later experiments. He ultimately had to alter parts of his theory that the initial data had supported (Bower, 1989).

MAKING CONNECTIONS

Research using large samples of older people has dispelled many myths about ‘senility’. Among other things, this research finds tremendous variation in cognitive abilities in old age, with some people showing substantial impairment and others very little (chapter 12).


The best way to ensure that a study’s results are not accidental is to use a large sample. The larger the sample, the more likely it reflects the actual properties of the population. Suppose 30 people in the world are over 115 years old, and researchers want to know about memory in this population. If the researchers test 25 of them, they can be much more certain that their findings are generalisable to this population than if they study a sample of only two of them. These two could have had poor memories in the first place or have had illnesses that affected their memory.

Most people intuitively understand the importance of large numbers in sampling, even if they do not realise it. For example, tennis fans recognise the logic behind matches comprised of multiple sets and would object if decisions about who moves on to the next round were made on the basis of a single game. Intuitively, they know that a variety of factors could influence the outcome of any single game other than the ability of the players, such as fluctuations in concentration, momentary physical condition (such as a dull pain in the foot), lighting, wind or which player served first. Because a single game is not a large enough set of observations to make a reliable assessment of who is the better player, many sports rely on a best‐of‐three, best‐of‐five or best‐of‐seven series.

Information about the effect size is required to understand the magnitude of the experimental effect or the strength of a relationship (Burton, 2007). Effect size indices are of two general types: (1) indices that compare differences between treatment means (e.g. Cohen’s d, Glass’s Δ and the point‐biserial correlation); and (2) indices that are based on measures of association, such as correlation (e.g. Pearson’s r) and explained variance. The effect size indicator should be reported immediately after the test of statistical significance (e.g. p value), followed by a short descriptive sentence about the nature of the effect. For example, ‘with an alpha level of .01, the relationship between the personality trait “extroversion” and academic success was weak (Pearson’s r = .10)’.
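Two of the indices named above can be computed by hand. The sketch below shows one common form of Cohen's d (the difference between two means divided by the pooled standard deviation) and Pearson's r; other variants of d exist, and the scores are invented purely for illustration.

```python
import math
from statistics import mean, stdev


def cohens_d(group_1, group_2):
    """Difference between two means in units of the pooled standard deviation."""
    n1, n2 = len(group_1), len(group_2)
    s1, s2 = stdev(group_1), stdev(group_2)  # sample SDs (divide by n - 1)
    pooled_sd = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(group_1) - mean(group_2)) / pooled_sd


def pearson_r(x, y):
    """Strength of the linear relationship between two variables (-1 to +1)."""
    mean_x, mean_y = mean(x), mean(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / math.sqrt(ss_x * ss_y)


# Hypothetical scores (invented for illustration)
treatment = [24, 27, 22, 30, 26, 25]
control = [21, 23, 20, 25, 22, 24]
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")

extroversion = [3, 5, 2, 6, 4, 7]
grades = [62, 70, 66, 64, 71, 68]
r = pearson_r(extroversion, grades)
print(f"Pearson's r = {r:.2f}, explained variance = {r ** 2:.2f}")
```

Squaring r gives the proportion of explained variance, the second kind of association-based index mentioned above.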

ONE STEP FURTHER

An example of clinical significance
By Associate Professor John Reece, RMIT University

There are many different ways of assessing clinical significance; the approach a researcher adopts will depend on the nature of the research design and the condition under investigation. An example of one approach to analysing clinical significance is presented in the context of a hypothetical intervention study that examines the effect of a psychotherapeutic treatment on anxiety.

Jacobson and Truax (1991) have proposed a method for assessing clinical significance that is popular among psychological researchers. Their process involves first establishing a clinical goal for all participants that is based on the available normative data for the outcome measure; this is often a scale or questionnaire that directly assesses the targeted condition, such as depression or anxiety. Second, a figure known as the reliable change index is calculated. This is the level of change that individuals must demonstrate in order to indicate a clinically significant effect. Once this is done, the results are often presented graphically, such as in figure 2S.5, which presents the result of a clinical significance analysis for an anxiety intervention.

FIGURE 2S.5 Clinical significance analysis for an anxiety intervention. [Scatterplot of pre-intervention anxiety (x axis, 20 to 80) against post-intervention anxiety (y axis, 20 to 80), with separate markers for the control and treatment groups.]

In this study, 81 participants were allocated to one of two conditions: a new experimental psychotherapy for treating anxiety, or a control group who received no treatment. The numbers in the two groups are not even because of participant dropouts during the trial. A questionnaire was used to assess anxiety symptoms in all participants both before and after the administration of the intervention. Inferential statistical testing found a statistically significant result in favour of the treatment at p < .05, with a moderate effect size. In order to analyse the results fully, the researcher also ran an analysis of clinical significance using a slightly modified version of the Jacobson and Truax (1991) approach.

The lines at a score of 40 on both the horizontal and vertical axes represent the clinical goal; in other words, scoring less than 40 on this measure means that the participant is no longer scoring at a level that is clinically important. The dotted lines on either side of these represent error of measurement. If the participant’s score falls inside the area bound by these dotted lines, we can’t rule out the possibility that any change we see is just a reflection of random error that is associated with the measurement scale. The diagonal line is important: participants whose score falls below it represent those whose score after the intervention is less than what it was before. In other words, these are the people who have improved. The blue dots represent control group participants; the green dots represent treatment group participants.

How do we use this graph to assess clinical significance? We look for participants who meet the following criteria:
• their level of anxiety was above the clinical cut‐off before the intervention
• they improved; that is, their scores are below the diagonal line
• their level of anxiety is below the clinical cut‐off after the intervention
• their scores do not fall into the area that could simply represent error of measurement.

There are 11 participants in the treatment group who meet all of the criteria in figure 2S.5, but only three from the control group. This might seem impressive, but it needs to be taken into account that there were 38 participants in the treatment group, which means that less than a third demonstrated clinically significant change, even though the result was statistically significant and demonstrated a moderate effect.

INTERIM SUMMARY

To assess whether the findings of a study are likely to reflect anything other than chance, psychologists use inferential statistics, notably tests of statistical significance. They usually report a probability value, or p value, which represents the probability that any positive findings obtained (such as a difference between groups, or a correlation coefficient that differs from zero) were accidental or just a matter of chance. By convention, psychologists accept p values that fall below .05 (that have a probability of being accidental of less than 5 percent). The best ways to protect against spurious findings are to use large samples and to try to replicate findings in other samples. The effect size indicates the magnitude of the experimental effect or the strength of a relationship.

Common tests of statistical significance

Choosing which inferential statistics to use depends on the design of the study and particularly on whether the variables to be assessed are continuous or categorical. If both sets of variables are continuous, the researcher simply correlates them to see whether they are related and tests the probability that a correlation of that magnitude could occur by chance.

For many kinds of research, however, the investigator wants to compare two or more groups, such as males and females, or participants exposed to several different experimental conditions. In this case, the independent variables are categorical (male/female, condition 1/condition 2). If both the independent and dependent variables are categorical, the appropriate statistic is a Chi‐square test (or χ²). A Chi‐square test compares the observed data with the results that would be expected by chance and tests the likelihood that the differences between observed and expected are accidental.

For example, suppose a researcher wants to know whether patients with antisocial personality disorder are more likely than the general population to have had academic difficulties in primary school. In other words, he wants to know whether one categorical variable (a diagnosis of antisocial versus normal personality) predicts another (presence or absence of academic difficulties, defined as having failed a grade in primary school). The researcher collects a sample of 50 male patients with the disorder (since the incidence is much higher in males and gender could be a confounding variable) and compares them with 50 males of similar socioeconomic status (since difficulties in school are correlated with social class) without the disorder. He finds that of his antisocial sample, 20 individuals failed a grade in primary school, whereas only two of the others did (figure 2S.6). The likelihood is extremely small that this difference could have occurred by chance, and the Chi‐square test would therefore show that the difference between groups is statistically significant.
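The Chi-square calculation for these figures is short enough to work through in code. The sketch below uses the counts given above (20 and 30 in the antisocial group, 2 and 48 in the comparison group), computes the counts expected by chance from the row and column totals, and sums the squared differences. It applies no continuity correction, which some researchers would add for a two-by-two table, and the p value formula used here holds only for a table with one degree of freedom.

```python
import math

# Observed counts: rows = antisocial, normal; columns = failure, no failure
observed = [[20, 30],
            [2, 48]]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
n = sum(row_totals)

# Counts expected by chance if diagnosis and school failure were unrelated
expected = [[r * c / n for c in column_totals] for r in row_totals]

# Chi-square: sum of (observed - expected)^2 / expected over all four cells
chi_square = sum(
    (o - e) ** 2 / e
    for observed_row, expected_row in zip(observed, expected)
    for o, e in zip(observed_row, expected_row)
)

# For one degree of freedom only, the p value can be obtained from math.erfc
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"expected counts = {expected}")
print(f"chi-square = {chi_square:.2f}, p = {p_value:.6f}")
```

By chance alone about 11 students in each group would be expected to have failed a grade; the observed 20 versus 2 gives a Chi-square of about 18.88, far beyond the 3.84 needed for p < .05 with one degree of freedom.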

In many cases, the independent variables are categorical, but the dependent variables are continuous. This was the case in Pennebaker, Colder, and Sharp’s (1990) study of emotional expression. In this study, the investigators placed participants in one of two conditions (writing about emotional events or about neutral events, a categorical variable) and compared the number of visits they subsequently made to the health service (a continuous variable). The question to be answered statistically was the likelihood that the mean number of visits to the doctor in the two conditions differed by chance. If participants who wrote about the transition to university made 0.73 visits to the health service on average whereas those who wrote about a neutral event made 1.56, is this discrepancy likely to be accidental or does it truly depend on the condition to which they were exposed?

When comparing the mean scores of two groups, researchers use a t test. A t test is actually a special case of a statistical procedure called an analysis of variance (ANOVA), which can be used to compare the means of two or more groups. ANOVA assesses the likelihood that mean differences among groups occurred by chance. To put it another way, ANOVA assesses the extent to which variation in scores is attributable to the independent variable. The logic behind ANOVA is quite simple. If the variation between groups (the difference between the average member of one group versus another) is substantially larger than the variation within groups, then the independent variable probably accounts for the difference.
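The logic of comparing variation between groups with variation within groups can be sketched as follows. The visit counts are invented for illustration and are not the data from the study described above; the function names are our own. The example also shows in what sense a t test is a special case of ANOVA: with two groups, the F ratio equals the square of t.

```python
import math
from statistics import mean


def one_way_anova_f(*groups):
    """F ratio: variation between groups divided by variation within groups."""
    all_scores = [score for group in groups for score in group]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    # Between groups: how far each group mean lies from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within groups: how far each score lies from its own group mean
    ss_within = sum((score - mean(g)) ** 2 for g in groups for score in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def t_statistic(group_1, group_2):
    """Independent-samples t, assuming both groups share a common variance."""
    n1, n2 = len(group_1), len(group_2)
    ss1 = sum((x - mean(group_1)) ** 2 for x in group_1)
    ss2 = sum((x - mean(group_2)) ** 2 for x in group_2)
    pooled_variance = (ss1 + ss2) / (n1 + n2 - 2)
    return (mean(group_1) - mean(group_2)) / math.sqrt(pooled_variance * (1 / n1 + 1 / n2))


# Hypothetical numbers of health service visits (invented for illustration)
emotional_writing = [0, 1, 0, 2, 1, 0]
neutral_writing = [2, 1, 3, 1, 2, 2]

f_ratio = one_way_anova_f(emotional_writing, neutral_writing)
t = t_statistic(emotional_writing, neutral_writing)
print(f"F = {f_ratio:.3f}, t = {t:.3f}, t squared = {t ** 2:.3f}")
```

To turn F or t into a p value, the statistic is compared with the F or t distribution for the relevant degrees of freedom, a step that statistical software normally handles.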

Once again, a larger sample is helpful in determining whether mean differences between groups are real or random. If Pennebaker and his colleagues tested only two participants in each condition and found mean differences, they could not be confident of the findings because the results could simply reflect the idiosyncrasies of these four participants. If they tested 30, however, and the differences between the two conditions were large and relatively consistent across participants, the ANOVA would be statistically significant.

Chi‐square, t tests and analysis of variance are not the only statistics psychologists employ. They also use correlation coefficients and many others. In all cases, however, their aim is the same: to try to draw generalisations about a population without having to study every one of its members.

A team of researchers wants to test the hypothesis that children who are physically abused by their parents will be more aggressive. They are considering analysing the data in three ways.

• First, they classify participants as abused or non‐abused (the independent variable). They measure aggressiveness (the dependent variable) using a scale of 1 to 7 completed by their teacher (1 = not aggressive, 7 = very aggressive). What statistic are they likely to use to test the hypothesis?

• Next, they again classify participants as abused versus non‐abused, but this time they measure whether the child has hit another child in school, coded yes or no (a categorical variable). What statistic are they likely to use?

• Now, the researchers decide to measure abuse on a scale of 1 to 7 (1 = no abuse, 7 = severe abuse). They also measure the child’s aggressiveness on a scale of 1 to 7. What statistic are they likely to use?

INTERIM SUMMARY

Whether you use inferential statistics such as Chi‐square or ANOVA depends on the design of the study, particularly on whether the variables assessed are continuous or categorical.

FIGURE 2S.6 Typical data appropriate for a Chi-square analysis

                 Scholastic performance
Diagnosis        Failure      No failure
Antisocial       20           30
Normal           2            48

Note: A Chi-square is the appropriate statistic when testing the relationship between two categorical variables. In this case, the variables are diagnosis (presence or absence of antisocial personality disorder) and school failure (presence or absence of a failed grade). The Chi-square statistic tests the likelihood that the relative abundance of school failure in the antisocial group occurred by chance.
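As a rough sketch, the Chi-square statistic for the table in Figure 2S.6 can be computed by comparing each observed count with the count expected if diagnosis and school failure were unrelated. The function name `chi_square` is an assumption of this sketch, no continuity correction is applied, and the p value uses the fact that a 2 × 2 table has one degree of freedom.

```python
import math

# Observed counts from Figure 2S.6.
observed = [
    [20, 30],  # antisocial: failure, no failure
    [2, 48],   # normal: failure, no failure
]

def chi_square(table):
    """Pearson Chi-square statistic for a contingency table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (obs - expected) ** 2 / expected
    return statistic

chi2 = chi_square(observed)

# With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"Chi-square = {chi2:.2f}, p = {p_value:.6f}")
```

For these counts the statistic is about 18.9, far above the conventional cut-off for one degree of freedom, so the excess of school failure in the antisocial group is very unlikely to be a chance result.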


Central question revisited ◆ What does it mean to say that a finding is ‘significant’?

Inferential statistics are extremely important in trying to draw inferences from psychological research, but they are not foolproof. In fact, some psychologists are now debating just how useful significance testing really is (see Abelson, 1995; Hubbard et al., 2000; Hunter, 1997; Shrout, 1997).

One problem with significance testing is that p values are heavily dependent on sample size. With a large enough sample, a tiny correlation will become significant, even though the relationship between the two variables may be minuscule. Conversely, researchers often mistakenly infer from non‐significant findings that no real difference exists between groups, when they simply have not used a large enough sample to know. If group differences do emerge, a p value is meaningful because it specifies the likelihood that the findings could have occurred by chance. If group differences do not emerge, however, a p value can be misleading because true differences between groups or experimental conditions may not show up simply because the sample was too small or the study was not conducted well enough.

For these reasons, some psychologists have called for other methods of reporting data that allow consumers of research to draw their own conclusions. One approach is to report the size of the effect of being in one group or another, such as how much difference an experimental manipulation made on the dependent variable. One way of reporting effect size puts the effect in standard deviation units: that is, by how many standard deviations does the average participant in one condition differ from the average participant in another on the dependent variable?

For example, if participants who write about an emotional event for four days in a row go to the health service 0.73 times and those who write about a neutral event go 1.56 times, the meaning of that discrepancy is unclear unless we know the standard deviations (SD) of the means. If the SD is around 0.75, then students who write about emotional events are a full standard deviation better off than control participants, which is definitely a finding worth writing home about. If the SD is 0.25, the effect size is three standard deviations, which means that writing about emotional events would put the average participant in the experimental condition in the 99th percentile of health in comparison with participants in the control condition, which would be an extraordinary effect. This example illustrates why researchers always report SDs along with means: 0.73 vs. 1.56 is a meaningless difference if we do not know how much the average person normally fluctuates from these means.
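The arithmetic behind this example can be sketched as follows. The two means and the two candidate SDs come from the text; the function names are assumptions of this sketch, and the percentile step additionally assumes that scores in the control condition are normally distributed.

```python
from statistics import NormalDist

# Mean visits to the health service, as quoted in the text.
MEAN_EMOTIONAL = 0.73
MEAN_NEUTRAL = 1.56

def effect_size_in_sd_units(mean_a, mean_b, sd):
    """Difference between two group means expressed in standard deviation units."""
    return (mean_b - mean_a) / sd

def percentile_under_normal_curve(z):
    """Percentage of a normal distribution falling below z (assumes normality)."""
    return NormalDist().cdf(z) * 100

results = {}
for sd in (0.75, 0.25):
    d = effect_size_in_sd_units(MEAN_EMOTIONAL, MEAN_NEUTRAL, sd)
    results[sd] = (d, percentile_under_normal_curve(d))

for sd, (d, pct) in results.items():
    print(f"SD = {sd}: effect size = {d:.2f} SDs, percentile = {pct:.2f}")
```

With an SD of 0.75 the same difference of 0.83 visits amounts to roughly 1.1 standard deviations; with an SD of 0.25 it amounts to roughly 3.3 standard deviations, which places the average participant above the 99th percentile of the comparison group under the normality assumption.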

A related problem is that people often confuse statistical significance with practical significance. For example, in the study of aspirin and heart disease described in chapter 2, the researchers discovered an effect of taking aspirin that translated into a correlation of .03. With a large enough sample (about 20 000, which is hundreds of times larger than the sample in most studies), they realised that this seemingly small effect was not only statistically significant but clinically significant, translating into large numbers of lives. But suppose they had tested the hypothesis on a sample of 300, which would have been a very small sample for this kind of research. In this case, a .03 correlation would not have been statistically significant, and the scientific community would never have known about the beneficial effects of aspirin on the heart.
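The dependence of significance on sample size can be illustrated with the same correlation of .03 at the two sample sizes mentioned above. This is a minimal sketch: the function name `approx_p_for_r` is an assumption, and the p value uses a normal approximation to the t distribution, which is reasonable only because both samples are fairly large.

```python
import math
from statistics import NormalDist

def approx_p_for_r(r, n):
    """Approximate two-tailed p value for a correlation r observed in a sample of size n."""
    # Convert the correlation to a t statistic with n - 2 degrees of freedom.
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    # Normal approximation to the t distribution (an assumption; fine for large n).
    return 2 * (1 - NormalDist().cdf(abs(t)))

p_large_sample = approx_p_for_r(0.03, 20000)
p_small_sample = approx_p_for_r(0.03, 300)

print(f"n = 20000: p = {p_large_sample:.6f}")
print(f"n = 300:   p = {p_small_sample:.2f}")
```

The correlation is identical in both cases; only the sample size differs. At about 20 000 participants the p value falls far below .05, whereas at 300 participants it is roughly .6, nowhere near the conventional threshold.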

In the final analysis, perhaps what statistics really do is to help a researcher tell a compelling story (Abelson, 1995). A good series of studies aims to solve a mystery, leading the reader step by step through all the possible scenarios, ruling out one suspect after another. Statistics can lend confidence to the conclusion, but they can never entirely rule out the possibility of a surprise ending.


REVIEW QUESTIONS
1. Describe the different types of descriptive statistics used to summarise data.
2. Describe the characteristics of a normal distribution.
3. Explain what is meant by the 75th percentile.
4. Explain what is meant by the expression ‘p < .05’.
5. Distinguish between the Chi‐square test and the analysis of variance.

DISCUSSION QUESTIONS
1. When can we be confident that a psychological finding is ‘significant’?
2. What is the best way to ensure that the results of a study are not due to chance?
3. Why should psychologists report the effect size when reporting their results?

APPLICATION QUESTIONS
1. Calculate the mean, mode and median for the following set of scores: 10, 4, 8, 5, 3, 6, 8, 4, 8 and 9.
2. Calculate the standard deviation for the following data set: 31, 39, 46, 33, 42, 44 and 38.

The solutions to the application questions can be found at the end of this chapter.

SUMMARY

1 Central tendency

• The most important descriptive statistics are measures of central tendency, which provide an index of the way a typical participant responded on a measure. The mean is the statistical average of the scores of all participants. The mode is the most common or frequent score or value of the variable observed in the sample. The median is the score that falls right in the middle of the distribution of scores; half the participants score below it and half above it.

2 Variability

• Variability refers to the extent to which participants tend to differ from one another in their scores. The standard deviation refers to the amount that the average participant deviates from the mean of the sample.

3 Statistical significance

• Psychologists apply tests of statistical significance to determine whether positive results are likely to have occurred simply by chance. A probability value, or p value, represents the probability that positive findings (such as group differences) were accidental or just a matter of chance. By convention, psychologists accept p values that fall below .05 (that have a probability of being accidental of less than 5 per cent). The best way to ensure that a study’s results are not accidental is to use a large enough sample that random fluctuations will cancel each other out.

4 Effect size

• Tests of statistical significance are not without their limitations. With a large enough sample size, significant differences are likely to appear whether or not they are meaningful, and p values do not adequately reflect the possibility that negative findings occurred by chance. Thus, some psychologists advocate other methods for making inferences from psychological data, such as effect size, to indicate the magnitude of the experimental effect or the strength of a relationship. Statistical techniques are useful ways of making an argument, not foolproof methods for establishing psychological truths.

5 Common tests of statistical significance

• The choice of which inferential statistics to use depends on the design of the study, particularly on whether the variables assessed are continuous or categorical. Common statistical tests are Chi‐square and analysis of variance.

KEY TERMS
analysis of variance (ANOVA) p. 9
Chi‐square test (or χ²) p. 9
descriptive statistics p. 3
effect size p. 7
frequency distribution p. 3
histogram p. 3
inferential statistics p. 3
mean p. 3
measures of central tendency p. 3
median p. 4
mode p. 4
normal distribution p. 5
percentile scores p. 5
probability value (or p value) p. 6
range p. 4
standard deviation (SD) p. 4
statistical significance p. 6
t test p. 9
variability p. 4


GLOSSARY

analysis of variance (ANOVA): A statistical procedure used to compare the means of two or more groups
Chi-square test (or χ²): A test of statistical significance used when both the independent and dependent variables are categorical
descriptive statistics: Numbers that describe the data from a study in a way that summarises their essential features
effect size: A measure of the magnitude of an experimental effect or the strength of a relationship
frequency distribution: A method of organising the data to show how frequently participants received each of the many possible scores
histogram: Plots ranges of scores along the x axis and the frequency of scores in each range on the y axis
inferential statistics: Procedures for assessing whether the results obtained with a sample are likely to reflect characteristics of the population as a whole
mean: The statistical average of the scores of all participants on a measure
measures of central tendency: Provide an index of the way a typical participant responded on a measure
median: The score that falls in the middle of the distribution of scores, with half of participants scoring below it and half above it
mode: The most common or most frequent score or value of a variable observed in a sample
normal distribution: A bell-shaped frequency distribution where the scores of most participants fall in the middle and progressively fewer scores fall at either extreme
percentile scores: Indicates the percentage of scores that fall below a score
probability value: The probability that obtained findings were accidental or just a matter of chance; also called p value
range: A measure of variability that represents the difference between the highest and the lowest value on a variable obtained in a sample
standard deviation (SD): The amount that the average participant deviates from the mean of the sample on a measure
statistical significance: The likelihood that results of a study have occurred simply by chance
t test: A test of statistical significance used when comparing the mean scores of two groups
variability: The extent to which participants tend to vary from each other in their scores on a measure


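SOLUTIONS TO APPLICATION QUESTIONS

Chapter 2 Supplement
1. Mean = 6.5
   Mode = 8
   Median = 7 (with ten scores, the median is the average of the two middle scores, 6 and 8)
2. Standard deviation ≈ 5 (5.13 to two decimal places, dividing the sum of squared deviations by N = 7)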
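For readers who want to check their working, the application questions can be verified with Python's standard `statistics` module. This is a sketch rather than part of the original solutions; it reports the standard deviation both ways, because the result depends on whether the sum of squared deviations is divided by N or by N − 1.

```python
from statistics import mean, median, mode, pstdev, stdev

# Application question 1: measures of central tendency.
scores = [10, 4, 8, 5, 3, 6, 8, 4, 8, 9]
central_tendency = {
    "mean": mean(scores),      # 6.5
    "mode": mode(scores),      # 8
    "median": median(scores),  # 7.0 (average of the two middle scores, 6 and 8)
}

# Application question 2: standard deviation.
data = [31, 39, 46, 33, 42, 44, 38]
sd_dividing_by_n = pstdev(data)          # treats the data as the whole population
sd_dividing_by_n_minus_1 = stdev(data)   # treats the data as a sample

print(central_tendency)
print(f"SD (divide by N)     = {sd_dividing_by_n:.2f}")
print(f"SD (divide by N - 1) = {sd_dividing_by_n_minus_1:.2f}")
```

Dividing by N gives about 5.13, which rounds to 5; dividing by N − 1 gives about 5.54.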
