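1.3.1 Measuring Center: The Mean
Mean - The arithmetic average. To find the mean $\bar{x}$ (pronounced “x-bar”) of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, \ldots, x_n$, their mean is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
Or, in more compact notation:
$$\bar{x} = \frac{\sum x_i}{n}$$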
Actually, the notation $\bar{x}$ refers to the mean of a sample. Most of the time, the data we’ll encounter can be thought of as a sample from some larger population. When we need to refer to a population mean, we’ll use the symbol μ (Greek letter mu, pronounced “mew”). If you have the entire population of data available, then you calculate μ in just the way you’d expect: add the values of all the observations, and divide by the number of observations.
Example - Travel Times to Work in North Carolina
Calculating the mean
Below are data on the travel times of 15 North Carolina residents.
1) Find the mean travel time for all 15 workers.
2) Calculate the mean again, this time excluding the person who reported a 60-minute travel time to work. What do you notice?
The previous example illustrates an important weakness of the mean as a measure of center: the mean is sensitive to the influence of extreme observations. These may be outliers, but a skewed distribution that has no outliers will also pull the mean toward its long tail. Because the mean cannot resist the influence of extreme observations, we say that it is not a resistant measure of center.
Resistant Measure - A statistic that is not affected very much by extreme observations.
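The effect of one extreme observation on the mean can be checked with a few lines of Python. This is only a sketch: it uses the 15 North Carolina travel times that appear later in this section, and the variable names are illustrative.

```python
from statistics import mean

# Travel times (minutes) of the 15 North Carolina workers, as listed in this section.
travel_times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]

mean_all = mean(travel_times)

# Drop the single 60-minute travel time and recompute the mean.
without_longest = [t for t in travel_times if t != 60]
mean_without = mean(without_longest)

print(f"mean of all 15 times:      {mean_all:.1f} minutes")
print(f"mean without the 60:       {mean_without:.1f} minutes")
```

Removing one observation out of 15 lowers the mean by more than two and a half minutes, which is what "not resistant" looks like in practice.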
1.3.2 Measuring Center: The Median
Median - The median M is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger. To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list.
3. If the number of observations n is even, the median M is the average of the two center observations in the ordered list.
Example - Travel Times to Work in North Carolina
Finding the median when n is odd
What is the median travel time for our 15 North Carolina workers? Here are the data arranged in order:
5 10 10 10 10 12 15 20 20 25 30 30 40 40 60
The count of observations n = 15 is odd. The first 20 in the list (the 8th value) is the center observation in the ordered list, with 7 observations to its left and 7 to its right. This is the median, M = 20 minutes.
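The three-step recipe for the median translates almost directly into code. The function below is a minimal sketch; the name `median_by_hand` is made up for illustration, and Python's built-in `statistics.median` does the same job.

```python
def median_by_hand(values):
    """Find the median using the three steps listed above."""
    # Step 1: arrange the observations from smallest to largest.
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    # Step 2: if n is odd, the median is the center observation.
    if n % 2 == 1:
        return ordered[mid]
    # Step 3: if n is even, average the two center observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


nc_times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]
print(median_by_hand(nc_times))  # 20
```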
Example - Stuck in Traffic
Finding the median when n is even
People say that it takes a long time to get to work in New York State due to the heavy traffic near big cities. What do the data say? Here are the travel times in minutes of 20 randomly chosen New York workers:
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
1. Make a stemplot of the data. Be sure to include a key.
2. Find and interpret the median.
1.3.3 Comparing the Mean and the Median
Our discussion of travel times to work in North Carolina illustrates an important difference between the mean and the median. The median travel time (the midpoint of the distribution) is 20 minutes. The mean travel time is higher, 22.5 minutes. The mean is pulled toward the right tail of this right-skewed distribution. The median, unlike the mean, is resistant. If the longest travel time were 600 minutes rather than 60 minutes, the mean would increase to more than 58 minutes but the median would not change at all. The outlier just counts as one observation above the center, no matter how far above the center it lies. The mean uses the actual value of each observation and so will chase a single large observation upward.
The mean and median of a roughly symmetric distribution are close together. If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed distribution, the mean is usually farther out in the long tail than is the median.
[Figures: Left-Skewed Distribution; Right-Skewed Distribution]
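The "600 minutes instead of 60" thought experiment is easy to reproduce. The sketch below uses Python's standard `statistics` module; the altered list is the hypothetical data set described in the text, not real data.

```python
from statistics import mean, median

# The 15 North Carolina travel times (minutes), in increasing order.
original = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]

# Hypothetical version from the text: the longest time is 600 instead of 60.
altered = original[:-1] + [600]

print(f"original: mean = {mean(original):.1f}, median = {median(original)}")
print(f"altered:  mean = {mean(altered):.1f}, median = {median(altered)}")
```

The mean jumps from about 22.5 to about 58.5 minutes, while the median stays at 20 minutes in both cases.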
Check Your Understanding
Questions 1 through 4 refer to the following setting. Here, once again, is the stemplot of travel times to work for 20 randomly selected New Yorkers. Earlier, we found that the median was 22.5 minutes.
1. Based only on the stemplot, would you expect the mean travel time to be less than, about the same as, or larger than the median? Why?
2. Use your calculator to find the mean travel time. Was your answer to Question 1 correct?
3. Interpret your result from Question 2 in context without using the words “mean” or “average.”
4. Would the mean or the median be a more appropriate summary of the center of this distribution of drive times? Justify your answer.
1.3.4 Measuring Spread: The Interquartile Range (IQR)
A useful numerical description of a distribution requires both a measure of center and a measure of spread.
How to Calculate Quartiles: Q1 | M | Q3
1. Arrange the observations in increasing order and locate the median M in the ordered list of observations.
2. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the median.
3. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the median.
Interquartile Range - IQR = Q3 − Q1
Example - Travel Times to Work in North Carolina
Calculating quartiles
Our North Carolina sample of 15 workers’ travel times, arranged in increasing order, is
5 10 10 10 10 12 15 20 20 25 30 30 40 40 60
There is an odd number of observations, so the median is the middle one, the first 20 in the list. The first quartile is the median of the 7 observations to the left of the median. This is the 4th of these 7 observations, so Q1 = 10 minutes. The third quartile is the median of the 7 observations to the right of the median, Q3 = 30 minutes. So the spread of the middle 50% of the travel times is IQR = Q3 − Q1 = 30 − 10 = 20 minutes. Be sure to leave out the overall median M when you locate the quartiles.
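The quartile procedure above can be sketched in Python as follows. The helper names are illustrative, and the code follows this section's rule of leaving the overall median out of both halves. Library routines and calculators sometimes use slightly different quartile conventions, so their results may not always match this method.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3), leaving the overall median out of both halves.

    Assumes at least two observations, so that each half is non-empty.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, skip the middle observation; when n is even, there is none to skip.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


nc_times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]
q1, m, q3 = quartiles(nc_times)
iqr = q3 - q1
print(f"Q1 = {q1}, M = {m}, Q3 = {q3}, IQR = {iqr}")  # Q1 = 10, M = 20, Q3 = 30, IQR = 20
```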
The quartiles and the interquartile range are resistant because they are not affected by a few extreme observations.
Example - Stuck in Traffic Again
Finding and interpreting the IQR
Find and interpret the interquartile range (IQR).
1.3.5 Identifying Outliers
In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.
1.5 × IQR Rule - Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.
Example - Travel Times to Work in New York
Identifying outliers using the 1.5 × IQR rule
Identify any outliers in the data from the stemplot.
Q1 = 15 minutes, Q3 = 42.5 minutes, IQR = 27.5 minutes
Example - Travel Times to Work in North Carolina
Identifying outliers
Determine if the travel time of 60 minutes in the sample of 15 North Carolina workers is an outlier.
Q1 = 10 minutes, Q3 = 30 minutes, IQR = 20 minutes
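One way to apply the 1.5 × IQR rule is to compute the two cutoffs first and then compare each observation against them. The sketch below takes Q1 and Q3 as given in the examples above; the function names are illustrative, and you may want to work the examples by hand before running it.

```python
def outlier_cutoffs(q1, q3):
    """Return the (low, high) cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """List the observations that fall strictly beyond either cutoff."""
    low, high = outlier_cutoffs(q1, q3)
    return [v for v in values if v < low or v > high]


ny_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20, 15, 20, 85, 15, 65, 15, 60, 60, 40, 45]
nc_times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]

print(outlier_cutoffs(15, 42.5), find_outliers(ny_times, 15, 42.5))
print(outlier_cutoffs(10, 30), find_outliers(nc_times, 10, 30))
```

Note that the rule says "more than" 1.5 × IQR, so an observation that lands exactly on a cutoff is not flagged.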
1.3.6 The Five-Number Summary and Boxplots
Five-Number Summary - Consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest. In symbols, the five-number summary is
Minimum   Q1   M   Q3   Maximum
These five numbers divide each distribution roughly into quarters. About 25% of the data values fall between the minimum and Q1, about 25% are between Q1 and the median, about 25% are between the median and Q3, and about 25% are between Q3 and the maximum. The five-number summary of a distribution leads to a new graph, the boxplot (aka box-and-whisker plot).
How to Make a Boxplot
1. A central box is drawn from the first quartile (Q1) to the third quartile (Q3).
2. A line in the box marks the median.
3. Lines (called whiskers) extend from the box out to the smallest and largest observations that are not outliers.
Example - Home Run King
Making a boxplot
Barry Bonds set the major league record by hitting 73 home runs in a single season in 2001. On August 7, 2007, Bonds hit his 756th career home run, which broke Hank Aaron’s longstanding record of 755. By the end of the 2007 season when Bonds retired, he had increased the total to 762. Here are data on the number of home runs that Bonds hit in each of his 21 complete seasons:
16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45 26 28
Make a boxplot for the above data. The initial steps have been done to save you time.
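The numbers needed to draw the boxplot can be computed as in the sketch below. It assumes the quartile method of this section (overall median left out of both halves) and the 1.5 × IQR rule for deciding where the whiskers end; the function names are illustrative, and the code does not draw the plot itself.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, M, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median_of_sorted(lower), median_of_sorted(ordered),
            median_of_sorted(upper), ordered[-1])


def whisker_ends(values):
    """Smallest and largest observations that are not outliers by the 1.5 x IQR rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    inside = [v for v in values if low <= v <= high]
    return min(inside), max(inside)


home_runs = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
             40, 37, 34, 49, 73, 46, 45, 45, 26, 28]

print(five_number_summary(home_runs))  # minimum 16, Q1 25.5, M 34, Q3 45, maximum 73
print(whisker_ends(home_runs))         # whiskers reach 16 and 73
```

The box runs from Q1 to Q3 with a line at M. Because no season falls beyond the outlier cutoffs, the whiskers extend all the way to the minimum and maximum.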
Check Your Understanding
The 2009 roster of the Dallas Cowboys professional football team included 10 offensive linemen. Their weights (in pounds) were
338 318 353 313 318 326 307 317 311 311
1. Find the five-number summary for these data by hand. Show your work.
2. Calculate the IQR. Interpret this value in context.
3. Determine whether there are any outliers using the 1.5 × IQR rule.
4. Draw a boxplot of the data.
1.3.7 Measuring Spread: The Standard Deviation
The five-number summary is not the most common numerical description of a distribution. That distinction belongs to the combination of the mean to measure center and the standard deviation to measure spread. The standard deviation and its close relative, the variance, measure spread by looking at how far the observations are from their mean. Let’s explore this idea using a simple set of data.
Example - How Many Pets?
Investigating spread around the mean
The list below gives the number of pets owned by each of 9 children.
1 3 4 4 4 5 7 8 9
The mean number of pets is 5. Let’s look at where the observations in the data set are relative to the mean. The figure above displays the data in a dotplot, with the mean clearly marked. The data value 1 is 4 units below the mean. We say that its deviation from the mean is −4. What about the data value 7? Its deviation is 7 − 5 = 2 (it is 2 units above the mean). The arrows in the figure mark these two deviations from the mean. The deviations show how much the data vary about their mean. They are the starting point for calculating the variance and standard deviation.
The table to the left shows the deviation from the mean $(x_i - \bar{x})$ for each value in the data set. Sum the deviations from the mean. You should get 0, because the mean is the balance point of the distribution. Since the sum of the deviations from the mean will be 0 for any set of data, we need another way to calculate spread around the mean.
How can we fix the problem of the positive and negative deviations canceling out? We could take the absolute value of each deviation. Or we could square the deviations. For mathematical reasons beyond the scope of this book, statisticians choose to square rather than to use absolute values.
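The claim that the deviations from the mean always sum to 0 follows in one line from the definition of the mean. The short derivation below is a sketch that writes $\bar{x}$ for the sample mean and uses the fact that $\sum x_i = n\bar{x}$; the second line checks it on the pets data.

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0

% Pets data, mean 5:
(-4) + (-2) + (-1) + (-1) + (-1) + 0 + 2 + 3 + 4 = 0
```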
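We have added a column to the table that shows the square of each deviation $(x_i - \bar{x})^2$. Add up the squared deviations. Did you get 52? Now we compute the average squared deviation, sort of. Instead of dividing by the number of observations n, we divide by n − 1:
$$\frac{16 + 4 + 1 + 1 + 1 + 0 + 4 + 9 + 16}{9 - 1} = \frac{52}{8} = 6.5$$
The value 6.5 is called the variance.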
Variance - The average squared distance of the observations in a data set from their mean. In symbols,
$$s_x^2 = \frac{1}{n-1}\sum (x_i - \bar{x})^2$$
Because we squared all the deviations, our units are in “squared pets.” That’s no good. We’ll take the square root to get back to the correct units, pets. The resulting value is the standard deviation:
$$s_x = \sqrt{6.5} \approx 2.55$$
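This 2.55 is roughly the average distance of the values in the data set from the mean.
Standard Deviation - The standard deviation $s_x$ measures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root. This average squared distance is called the variance. In symbols, the variance $s_x^2$ is given by
$$s_x^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1} = \frac{1}{n-1}\sum (x_i - \bar{x})^2$$
How to Find the Standard Deviation
1. Find the distance of each observation from the mean and square each of these distances.
2. Average the squared distances by dividing their sum by n − 1.
3. The standard deviation $s_x$ is the square root of this average squared distance:
$$s_x = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}$$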
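The three steps can be traced on the pets data with a short script. This is a hand-rolled sketch meant to mirror the steps, with illustrative variable names. In practice a calculator or a library function would be used.

```python
from math import sqrt

# Number of pets owned by each of the 9 children.
pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]

n = len(pets)
x_bar = sum(pets) / n  # the mean, 5.0

# Step 1: distance of each observation from the mean, squared.
squared_deviations = [(x - x_bar) ** 2 for x in pets]

# Step 2: "average" the squared distances, dividing by n - 1 rather than n.
variance = sum(squared_deviations) / (n - 1)

# Step 3: take the square root to return to the original units (pets).
s_x = sqrt(variance)

print(f"sum of squared deviations = {sum(squared_deviations)}")  # 52.0
print(f"variance = {variance}")                                  # 6.5
print(f"standard deviation = {s_x:.2f}")                         # 2.55
```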
Many calculators report two standard deviations, giving you a choice of dividing by n or by n − 1. The former is usually labeled $\sigma_x$, the symbol for the standard deviation of a population. If your data set consists of the entire population, then it’s appropriate to use $\sigma_x$. More often, the data we’re examining come from a sample. In that case, we should use $s_x$.
More important than the details of calculating $s_x$ are the properties that determine the usefulness of the standard deviation:
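Software offers the same two choices as a calculator. As a sketch, Python's standard `statistics` module provides `stdev` (divides by n − 1, matching $s_x$) and `pstdev` (divides by n, matching $\sigma_x$). The pets data from the earlier example are reused here.

```python
from statistics import pstdev, stdev

pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]

s_x = stdev(pets)       # sample standard deviation: divides by n - 1
sigma_x = pstdev(pets)  # population standard deviation: divides by n

print(f"s_x     = {s_x:.2f}")      # 2.55
print(f"sigma_x = {sigma_x:.2f}")  # 2.40
```

Dividing by n always gives the smaller of the two values. The gap shrinks as the number of observations grows.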
• $s_x$ measures spread about the mean and should be used only when the mean is chosen as the measure of center.
• $s_x$ is always greater than or equal to 0. $s_x = 0$ only when there is no variability. This happens only when all observations have the same value. Otherwise, $s_x > 0$. As the observations become more spread out about their mean, $s_x$ gets larger.
• $s_x$ has the same units of measurement as the original observations. For example, if you measure metabolic rates in calories, both the mean $\bar{x}$ and the standard deviation $s_x$ are also in calories. This is one reason to prefer $s_x$ to the variance $s_x^2$, which is in squared calories.
• Like the mean $\bar{x}$, $s_x$ is not resistant. A few outliers can make $s_x$ very large. The use of squared deviations makes $s_x$ even more sensitive than $\bar{x}$ to a few extreme observations.
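The last property can be illustrated by reusing the earlier thought experiment in which the longest North Carolina travel time is 600 minutes instead of 60. The sketch below compares how the mean, the standard deviation, and the IQR respond. The altered data set is hypothetical, and the `iqr` helper is an illustrative implementation of this section's quartile method.

```python
from statistics import mean, stdev


def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def iqr(values):
    """Q3 - Q1, with the overall median left out of both halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_sorted(upper) - median_of_sorted(lower)


original = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]
altered = original[:-1] + [600]  # hypothetical: 600 minutes instead of 60

for label, data in [("original", original), ("altered", altered)]:
    print(f"{label}: mean = {mean(data):.1f}, s_x = {stdev(data):.1f}, IQR = {iqr(data)}")
```

The IQR is the same for both lists, the mean grows noticeably, and the standard deviation grows by an even larger factor than the mean.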
Check Your Understanding
The heights (in inches) of the five starters on a basketball team are 67, 72, 76, 76, and 84.
1. Find and interpret the mean.
2. Make a table that shows, for each value, its deviation from the mean and its squared deviation from the mean.
3. Show how to calculate the variance and standard deviation from the values in your table.
4. Interpret the meaning of the standard deviation in this setting.
1.3.9 Choosing Measures of Center and Spread
We now have a choice between two descriptions of the center and spread of a distribution: the median and IQR, or $\bar{x}$ and $s_x$. Because $\bar{x}$ and $s_x$ are sensitive to extreme observations, they can be misleading when a distribution is strongly skewed or has outliers. In these cases, the median and IQR, which are both resistant to extreme values, provide a better summary. We’ll see in the next chapter that the mean and standard deviation are the natural measures of center and spread for a very important class of symmetric distributions, the Normal distributions.
Choosing Measures of Center and Spread
The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers. Use $\bar{x}$ and $s_x$ only for reasonably symmetric distributions that don’t have outliers.
Remember that a graph gives the best overall picture of a distribution. Numerical measures of center and spread report specific facts about a distribution, but they do not describe its entire shape. Numerical summaries do not highlight the presence of multiple peaks or clusters, for example. Always plot your data.
Example - Who Texts More, Males or Females?
Pulling it all together
For their final project, a group of AP Statistics students investigated their belief that females text more than males. They asked a random sample of students from their school to record the number of text messages sent and received over a two-day period. Here are their data: