Stat 13, Intro. to Statistical Methods for the Life and Health Sciences.



The final is Fri Dec 9, 8am-11am, right here, and will be on ch. 1-10. Bring a PENCIL and CALCULATOR and any books or notes you want. No computers.

1. Midterms. The scores are listed in midtermscores.txt. They are out of 20.



Our dept's official statement: In the aftermath of the presidential election, in which passions ran high surrounding issues of tolerance for diversity, it is important to remember that harassment and discrimination based on such things as:
• race, ethnicity, ancestry, color
• sex, gender, gender identity, gender expression, sexual orientation
• national origin, citizenship status
• religion
are not acceptable at UCLA, and may have serious consequences. Information for how to obtain redress or counseling if you are subjected to such harassment or discrimination can be found at:


In total this makes 17,104 likely voters in those Wisconsin polls put together. They averaged 40.3% for Trump and 46.8% for Clinton. The difference is 6.5%. Combined, the margin of error for a 95% confidence interval around Trump's percentage would be 0.735%. The standard error is 0.375% on the estimate of Trump's percentage of 40.3%, and he got 47.9%. So they were off by 7.6%, which is more than 20 standard errors. The probability is 1 in 10^90 that the polls would be off by that much or more just by chance, if the answers to the polls were just a random sample of how people were actually going to vote.

Technically, there are undecided voters in the polls also. Just taking the difference in percentages between Trump and Hillary Clinton rather than the percentage for Trump into account, the results were off by about 10 SEs, not 20, and this makes the probability of something this extreme or more extreme still astronomical, about 1.5*10^-23. The chance of a monkey typing 15 letters completely at random and happening to choose "hillaryrclinton" in order would be 6*10^-22, so it's about 40 times more likely. What do we conclude?
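The arithmetic above can be checked with a short sketch. It assumes, as the text does, that the pooled polls behave like one simple random sample, and it uses a normal approximation for the tail probability; the variable and function names are chosen here for illustration.

```python
import math

n = 17104        # likely voters across the Wisconsin polls (from the text)
p_trump = 0.403  # average poll share for Trump (from the text)
actual = 0.479   # Trump's share as quoted in the text

# Standard error of a sample proportion under simple random sampling.
se = math.sqrt(p_trump * (1 - p_trump) / n)
margin = 1.96 * se           # 95% margin of error
z = (actual - p_trump) / se  # number of standard errors the polls were off


def two_sided_normal_tail(z_value):
    """P(|Z| >= z_value) for a standard normal Z."""
    return math.erfc(abs(z_value) / math.sqrt(2))


# Chance of typing one specific 15-letter string from a 26-letter alphabet.
monkey = (1 / 26) ** 15

print(f"SE = {se:.3%}, margin of error = {margin:.3%}, z = {z:.1f}")
print(f"two-sided tail at 10 SEs: {two_sided_normal_tail(10):.2e}")
print(f"monkey probability: {monkey:.2e}")
print(f"ratio: {monkey / two_sided_normal_tail(10):.0f}")
```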


2. Polls.

b. If Clinton had outperformed the polls, what would have been an explanation?
* Fundraising advantage.
* More experienced campaign staff.
* More organized ground game.
* Latino populations had surged.
* A higher percentage of Latinos voted.
* Early voting had enabled more poor people and minorities to vote. Thus people who might be considered unlikely voters due to voting trends in previous elections might actually be voting, and the majority of these would be expected to be Democrats.
* Mostly good news for her right before the election. The FBI said they went through the emails and cleared her of charges. Lots of big stars were performing and getting people out to the polls and rallies. Obama, Michelle, Bill Clinton, and many others were campaigning hard for her in the final days.
* Meanwhile, many top Republicans were not even supporting Trump. His ground game and field offices were disorganized or nonexistent. Even some right-wing radio hosts were criticizing Trump. He had fired his campaign manager midway through the campaign and had an inexperienced hodgepodge of supporters and staff.


c. Other possible pro-Clinton explanations.
* Hillary prepared very early for her run for office. Trump was a latecomer.
* The Democrats mostly coalesced around Hillary, allowing her to pile up numerous endorsements very early. Only a couple of people even ran against her, and none were really promising candidates. Even Sanders was not seen as a very viable candidate when the race began.
* On the Republican side, Trump had to fend off 16 other candidates while Hillary was raising money and holding on to it.
* After Trump got the nomination, barely, he had little convention bounce, and Hillary had a huge one and got a big lead in the polls.
* Obama's popularity has been high.
* The economy, while not too strong, is much, much stronger than when Obama began in office, and he was able to campaign strongly for Hillary.
* She also had an incredibly charismatic and great speaker in her husband, and Michelle made great speeches as well.
* The father of a fallen Muslim soldier spoke eloquently against Trump.
* Melania got caught blatantly plagiarizing Michelle's speech from 2008.
* Trump got sued for fraud for Trump University, and criticized the judge as unfit because he was of Mexican descent.
* Trump made fun of a reporter with a disability.
* Clinton won all 3 debates according to most polls and surveys.







Nate Silver and 538.com use a Bayesian model to forecast the election. How does Bayesian modeling work? The model starts with prior distributions, which are supposed to reflect the researcher's beliefs about the probability of something before collecting data. For instance, you might have a prior distribution that the percentage µ of votes the Democratic candidate would get is spread uniformly between 40% and 60%. Then you collect data, from polls, economic data, etc., and gradually update your distribution, forming a posterior distribution for µ. Nate Silver's main idea was to weight the different polls by how accurate they were historically.
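As a rough illustration of prior-to-posterior updating, here is a minimal grid sketch. It is not 538's model: it uses the uniform 40%-60% prior from the example above, a single hypothetical poll (520 of 1000 respondents, an invented number), and a binomial likelihood chosen for simplicity.

```python
import math

# Grid of candidate values for mu, the Democratic vote share, from 40% to 60%.
grid = [0.40 + 0.001 * i for i in range(201)]

# Uniform prior over the grid, matching the example prior in the text.
prior = [1 / len(grid)] * len(grid)

# Hypothetical poll (an assumption, not data from the lecture):
# 520 of 1000 respondents favor the Democrat.
n, k = 1000, 520

# Binomial log-likelihood at each mu (the constant term does not matter).
log_lik = [k * math.log(mu) + (n - k) * math.log(1 - mu) for mu in grid]
shift = max(log_lik)  # subtract the max before exponentiating, for stability
unnormalized = [p * math.exp(ll - shift) for p, ll in zip(prior, log_lik)]

# Normalize so the posterior sums to 1.
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

posterior_mean = sum(mu * w for mu, w in zip(grid, posterior))
print(f"posterior mean of mu: {posterior_mean:.3f}")
```

Feeding in further polls would repeat the same step with the current posterior used as the new prior.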



I could just say every Democrat and Republican has a 50% chance every election. This model might be well calibrated, but it is not very informative.

Validation. Is the model accurate? Does it outperform competing models? e.g. Log-likelihood: $\sum_{i \text{ won}} \log(p_i) + \sum_{i \text{ lost}} \log(1 - p_i)$.
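The log-likelihood score can be sketched as follows. The forecast probabilities and outcomes are hypothetical, made up for illustration; a higher (less negative) log-likelihood means the forecasts put more probability on what actually happened.

```python
import math


def log_likelihood(forecasts):
    """Sum of log(p) over candidates who won and log(1 - p) over those who lost.

    forecasts: list of (p, won) pairs, where p is the forecast win probability.
    """
    return sum(math.log(p) if won else math.log(1 - p) for p, won in forecasts)


# Hypothetical forecasts for five races (assumed values, not real forecasts).
model_a = [(0.9, True), (0.8, True), (0.3, False), (0.6, False), (0.7, True)]

# The uninformative model: 50% for every candidate, same outcomes.
model_b = [(0.5, won) for _, won in model_a]

print(f"model A: {log_likelihood(model_a):.3f}")
print(f"model B: {log_likelihood(model_b):.3f}")
```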



Here are the presidential elections that have occurred in my lifetime.
1972. Nixon vs. McGovern.
1976. Carter vs. Ford.
1980. Reagan vs. Carter.
1984. Reagan vs. Mondale.
1988. Bush vs. Dukakis.
1992. Clinton vs. Bush.
1996. Clinton vs. Dole.
2000. Bush vs. Gore.
2004. Bush vs. Kerry.
2008. Obama vs. McCain.
2012. Obama vs. Romney.
2016. Trump vs. Clinton.

Many would say the candidate with more humor, style, and charisma won 11/12 of these elections. The two-sided p-value = 0.63%. Experience? Hard to say, but perhaps the more experienced candidate won 3/12. The two-sided p-value = 14.6%.
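These p-values can be reproduced with an exact binomial calculation. The sketch assumes a null hypothesis that each election is a fair coin flip with respect to the trait, and takes the two-sided p-value as double the smaller tail (the function name is chosen here for illustration).

```python
from math import comb


def two_sided_binomial_p(k, n):
    """Two-sided p-value for k successes in n fair coin flips.

    Computed as twice the smaller of the two tail probabilities, capped at 1.
    """
    upper = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    lower = sum(comb(n, j) for j in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * min(upper, lower))


print(f"charisma, 11 of 12:  {two_sided_binomial_p(11, 12):.4f}")
print(f"experience, 3 of 12: {two_sided_binomial_p(3, 12):.4f}")
```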


Time    30  41  41  43  47  48  51  54  54  56  56  56  57  58
Score  100  84  94  90  88  99  85  84  94 100  65  64  65  89

Time    58  60  61  61  62  63  64  66  66  69  72  78  79
Score   83  85  86  92  74  73  75  53  91  85  62  68  72


Describing Scatterplots
• When we describe data in a scatterplot, we describe the
  • Direction (positive or negative)
  • Form (linear or not)
  • Strength (strong-moderate-weak, we will let correlation help us decide)
  • Unusual Observations
• How would you describe the time and test scatterplot?

Correlation
• Correlation measures the strength and direction of a linear association between two quantitative variables.
• Correlation is a number between -1 and 1.
• With positive correlation one variable increases, on average, as the other increases.
• With negative correlation one variable decreases, on average, as the other increases.
• The closer it is to either -1 or 1, the closer the points fit to a line.
• The correlation for the test data is -0.56.
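The correlation for the time and score data can be computed directly from its definition. This is a minimal sketch; the function name `correlation` is chosen here for illustration.

```python
import math

# Time and score data from the table above.
times = [30, 41, 41, 43, 47, 48, 51, 54, 54, 56, 56, 56, 57, 58,
         58, 60, 61, 61, 62, 63, 64, 66, 66, 69, 72, 78, 79]
scores = [100, 84, 94, 90, 88, 99, 85, 84, 94, 100, 65, 64, 65, 89,
          83, 85, 86, 92, 74, 73, 75, 53, 91, 85, 62, 68, 72]


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = correlation(times, scores)
print(f"r = {r:.2f}")
```

Swapping the two arguments gives the same value, since correlation makes no distinction between explanatory and response variables.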

Correlation Guidelines

Correlation Value   Strength of association   Description
0.7 to 1.0          Strong                    The points will appear to be nearly a straight line.
0.3 to 0.7          Moderate                  When looking at the graph the increasing/decreasing pattern will be clear, but there is considerable scatter.
0.1 to 0.3          Weak                      With some effort you will be able to see a slightly increasing/decreasing pattern.
0 to 0.1            None                      No discernible increasing/decreasing pattern.

The same strength guidelines apply to negative correlations.

Influential Observations
• The correlation changed from -0.56 (a fairly moderate negative correlation) to -0.12 (a weak negative correlation).
• Points that are far to the left or right and not in the overall direction of the scatterplot can greatly change the correlation. (influential observations)

Correlation
• Correlation measures the strength and direction of a linear association between two quantitative variables.
• -1 ≤ r ≤ 1
• Correlation makes no distinction between explanatory and response variables.
• Correlation has no units.
• Correlation is not resistant to outliers. It is sensitive.

Learning Objectives for Section 10.1
• Summarize the characteristics of a scatterplot by describing its direction, form, strength and whether there are any unusual observations.
• Recognize that the correlation coefficient is appropriate only for summarizing the strength and direction of a scatterplot that has linear form.
• Recognize that a scatterplot is the appropriate graph for displaying the relationship between two quantitative variables and create a scatterplot from raw data.
• Recognize that a correlation coefficient of 0 means there is no linear association between the two variables and that a correlation coefficient of -1 or 1 means that the scatterplot is exactly a straight line.
• Understand that the correlation coefficient is influenced by extreme observations.

• Null: There is no association between heart rate and body temperature. (ρ = 0)
• Alternative: There is a positive linear association between heart rate and body temperature. (ρ > 0)


Tmp  98.3  98.2  98.7  98.5  97.0  98.8  98.5  98.7  99.3  97.8
HR     72    69    72    71    80    81    68    82    68    65

Tmp  98.2  99.9  98.6  98.6  97.8  98.4  98.7  97.4  96.7  98.0
HR     71    79    86    82    58    84    73    57    62    89






Temperature and Heart Rate
• If there was no association between heart rate and body temperature, what is the probability we would get a correlation as high as 0.378 just by chance?
• If there is no association, we can break apart the temperatures and their corresponding heart rates. We will do this by shuffling one of the variables.

Shuffling Cards
• Let's remind ourselves what we did with cards to find our simulated statistics.
• With two proportions, we wrote the response on the cards, shuffled the cards and placed them into two piles corresponding to the two categories of the explanatory variable.
• With two means we did the same thing except this time the responses were numbers instead of words.

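[Figure: card shuffling for two proportions. The two groups have 20.0% and 66.7% improvers; after one shuffle they have 40.0% and 46.7% improvers, giving a simulated difference of 0.400 - 0.467 = -0.067, which is added to the plot of differences in simulated proportions.]

[Figure: card shuffling for two means (music vs. no music). The group means are 3.90 and 19.82; after one shuffle they are 6.38 and 16.12, giving a simulated difference of 6.38 - 16.12 = -9.74, which is added to the plot of differences in simulated means.]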

Shuffling Cards
• Now how will this shuffling be different when both the response and the explanatory variable are quantitative?
• We can't put things in two piles anymore.
• We still shuffle values of the response variable, but this time place them next to the values of the explanatory variable.

[Figure: Body Temperature and Heart Rate. The 20 temperatures are laid out with their heart rates; the observed pairing gives r = 0.378. Shuffling the heart rates and re-pairing them with the temperatures gives, for example, r = 0.073. Repeating this builds up a dotplot of simulated correlations, including -0.253, -0.345, 0.062, 0.259, -0.029, 0.059, -0.006, -0.327, 0.100, and 0.067.]

Only one simulated statistic out of 30 was as large or larger than our observed correlation of 0.378, hence our p-value for this null distribution is 1/30 ≈ 0.03.

[Figure: dotplot of the simulated correlations, with the observed value 0.378 marked.]

Temperature and Heart Rate
• We can look at the output of 1000 shuffles with a distribution of 1000 simulated correlations.

Temperature and Heart Rate
• Notice our null distribution is centered at 0 and somewhat symmetric.
• We found that 530/10000 times we had a simulated correlation greater than or equal to 0.378.

Temperature and Heart Rate
• With a p-value of 0.053 = 5.3%, we almost but do not quite have statistical significance. This is moderate evidence of a positive linear association between body temperature and heart rate. Perhaps a larger sample would give a smaller p-value.
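The shuffling procedure can be sketched in a few lines. This is a minimal simulation, not the applet used in class: the seed and the number of shuffles are arbitrary choices, so the simulated p-value will differ slightly from the one quoted above.

```python
import math
import random

# Body temperature and heart rate data from the table above.
temp = [98.3, 98.2, 98.7, 98.5, 97.0, 98.8, 98.5, 98.7, 99.3, 97.8,
        98.2, 99.9, 98.6, 98.6, 97.8, 98.4, 98.7, 97.4, 96.7, 98.0]
hr = [72, 69, 72, 71, 80, 81, 68, 82, 68, 65,
      71, 79, 86, 82, 58, 84, 73, 57, 62, 89]


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


observed = correlation(temp, hr)

random.seed(13)      # arbitrary seed so the run is repeatable
n_shuffles = 10000   # arbitrary number of shuffles
shuffled = hr[:]     # copy, so the original pairing is left intact
count = 0
for _ in range(n_shuffles):
    random.shuffle(shuffled)  # break any link between temperature and heart rate
    if correlation(temp, shuffled) >= observed:
        count += 1

p_value = count / n_shuffles
print(f"observed r = {observed:.3f}, simulated p-value = {p_value:.3f}")
```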

Introduction
• If we decide an association is linear, it is helpful to develop a mathematical model of that association.
• Helps make predictions about the response variable.
• The least-squares regression line is the most common way of doing this.

Introduction
• Unless the points are perfectly linearly aligned, there will not be a single line that goes through every point.
• We want a line that gets as close as possible to all the points.

Introduction
• We want a line that minimizes the vertical distances between the line and the points.
• These distances are called residuals.
• The line we will find actually minimizes the sum of the squares of the residuals.
• This is called a least-squares regression line.
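To make the idea of minimizing squared residuals concrete, here is a minimal sketch that fits a least-squares line to the time and score data from Section 10.1 using the standard closed-form formulas, then checks that nudging the line makes the sum of squared residuals larger. The function names are chosen here for illustration.

```python
# Time and score data from the Section 10.1 example.
times = [30, 41, 41, 43, 47, 48, 51, 54, 54, 56, 56, 56, 57, 58,
         58, 60, 61, 61, 62, 63, 64, 66, 66, 69, 72, 78, 79]
scores = [100, 84, 94, 90, 88, 99, 85, 84, 94, 100, 65, 64, 65, 89,
          83, 85, 86, 92, 74, 73, 75, 53, 91, 85, 62, 68, 72]


def least_squares(x, y):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_squared_residuals(intercept, slope, x, y):
    """Sum of (actual - predicted)^2 over all points."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))


a, b = least_squares(times, scores)
best = sum_squared_residuals(a, b, times, scores)

# Any other line, e.g. one with a slightly different intercept or slope, does worse.
shifted = sum_squared_residuals(a + 1, b, times, scores)
tilted = sum_squared_residuals(a, b + 0.1, times, scores)

print(f"predicted score = {a:.2f} + ({b:.3f}) * time")
print(f"sum of squared residuals: {best:.1f} (shifted: {shifted:.1f}, tilted: {tilted:.1f})")
```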

Growing Plates?
• There are many recent articles and TV reports about the obesity problem.
• One reason some have given is that the size of dinner plates is increasing.
• Are these black circles the same size, or is one larger than the other?

Growing Plates?
• They appear to be the same size for many, but the one on the right is about 20% larger than the one on the left.
• This suggests that people will put more food on larger dinner plates without knowing it.
• There is a name for this phenomenon: the Delboeuf illusion.

Growing Plates?
• Researchers gathered data to investigate the claim that dinner plates are growing.
• American dinner plates sold on eBay on March 30, 2010 (Van Ittersum and Wansink, 2011).
• Year manufactured and diameter are given.

Growing Plates?
• Both year (explanatory variable) and diameter in inches (response variable) are quantitative.
• Each dot represents one plate in this scatterplot.
• Describe the association here.

Growing Plates?
• The association appears to be roughly linear.
• The least squares regression line is added.
• How can we describe this line?

Regression Line
The regression equation is $\hat{y} = a + bx$:
• a is the y-intercept
• b is the slope
• x is a value of the explanatory variable
• $\hat{y}$ is the predicted value for the response variable
• For a specific value of x, the corresponding distance $y - \hat{y}$ (or actual - predicted) is a residual.

Regression Line
• The least squares line for the dinner plate data is $\hat{y} = -14.8 + 0.0128x$
• Or $\widehat{\text{diameter}} = -14.8 + 0.0128(\text{year})$
• This allows us to predict plate diameter for a particular year.

Slope
$\hat{y} = -14.8 + 0.0128x$
• What is the predicted diameter for a plate manufactured in 2000?
  • -14.8 + 0.0128(2000) = 10.8 in.
• What is the predicted diameter for a plate manufactured in 2001?
  • -14.8 + 0.0128(2001) = 10.8128 in.
• How does this compare to our prediction for the year 2000?
  • 0.0128 larger
• Slope b = 0.0128 means that diameters are predicted to increase by 0.0128 inches per year on average.
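The two predictions and their difference can be reproduced with a small sketch. The function name `predicted_diameter` is chosen here for illustration; the coefficients are the ones given for the dinner plate line.

```python
def predicted_diameter(year, intercept=-14.8, slope=0.0128):
    """Predicted plate diameter in inches from the fitted dinner plate line."""
    return intercept + slope * year


d2000 = predicted_diameter(2000)
d2001 = predicted_diameter(2001)

print(f"2000: {d2000:.4f} in.")
print(f"2001: {d2001:.4f} in.")
# A one-year increase changes the prediction by exactly the slope.
print(f"difference: {d2001 - d2000:.4f} in.")
```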

Slope
• Slope is the predicted change in the response variable for a one-unit change in the explanatory variable.
• Both the slope and the correlation coefficient for this study were positive.
  • The slope is 0.0128
  • The correlation is 0.604
• The slope and correlation coefficient will always have the same sign.

y-intercept
• The y-intercept is where the regression line crosses the y-axis, or the predicted response when the explanatory variable equals 0.
• We had a y-intercept of -14.8 in the dinner plate equation. What does this tell us about our dinner plate example?
  • Dinner plates in year 0 were -14.8 inches.
• How can it be negative?
  • The equation works well within the range of values given for the explanatory variable, but fails outside that range.
• Our equation should only be used to predict the size of dinner plates from about 1950 to 2010.

Extrapolation
• Predicting values for the response variable for values of the explanatory variable that are outside of the range of the original data is called extrapolation.
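One way to guard against extrapolation in practice is to flag any prediction outside the range of the data. The sketch below assumes the approximate 1950 to 2010 range stated for the dinner plate data; the function name and the flag it returns are illustrative choices.

```python
def predict_diameter(year, first_year=1950, last_year=2010):
    """Return (predicted diameter in inches, True if the prediction is an extrapolation)."""
    prediction = -14.8 + 0.0128 * year
    is_extrapolation = not (first_year <= year <= last_year)
    return prediction, is_extrapolation


for year in (0, 1980, 2500):
    prediction, flagged = predict_diameter(year)
    note = " (extrapolation, do not trust)" if flagged else ""
    print(f"year {year}: {prediction:.2f} in.{note}")
```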

• While the intercept and slope have meaning in the context of year and diameter, remember that the correlation does not. It is just 0.604.
• However, the square of the correlation (coefficient of determination or r^2) does have meaning.
• r^2 = 0.604^2 = 0.365 or 36.5%
• 36.5% of the variation in plate size (the response variable) can be explained by its linear association with the year (the explanatory variable).
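The "variation explained" reading of r^2 can be checked numerically. Since the plate data are not listed here, this sketch uses the body temperature and heart rate data from Section 10.2 instead: it fits the least-squares line and compares r^2 with 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total variation of the response about its mean.

```python
import math

# Body temperature (explanatory) and heart rate (response) data from Section 10.2.
temp = [98.3, 98.2, 98.7, 98.5, 97.0, 98.8, 98.5, 98.7, 99.3, 97.8,
        98.2, 99.9, 98.6, 98.6, 97.8, 98.4, 98.7, 97.4, 96.7, 98.0]
hr = [72, 69, 72, 71, 80, 81, 68, 82, 68, 65,
      71, 79, 86, 82, 58, 84, 73, 57, 62, 89]

n = len(temp)
mean_x, mean_y = sum(temp) / n, sum(hr) / n
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(temp, hr))
sxx = sum((x - mean_x) ** 2 for x in temp)
sst = sum((y - mean_y) ** 2 for y in hr)  # total variation in the response

slope = sxy / sxx
intercept = mean_y - slope * mean_x
r = sxy / math.sqrt(sxx * sst)

# Variation left over after using the line to predict heart rate.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(temp, hr))
explained = 1 - sse / sst

print(f"r^2 = {r ** 2:.3f}, 1 - SSE/SST = {explained:.3f}")
```

The two numbers agree, and the slope and r share the same sign because both take their sign from the same quantity (`sxy` in the code).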

Learning Objectives for Section 10.3
• Understand that one way a scatterplot can be summarized is by fitting the best-fit (least squares regression) line.
• Be able to interpret both the slope and intercept of a best-fit line in the context of the two variables on the scatterplot.
• Find the predicted value of the response variable for a given value of the explanatory variable.
• Understand the concept of residual and find and interpret the residual for an observational unit given the raw data and the equation of the best-fit (regression) line.
• Understand the relationship between residuals and strength of association and that the best-fit (regression) line minimizes the sum of the squared residuals.

Learning Objectives for Section 10.3
• Find and interpret the coefficient of determination (r^2) as the squared correlation and as the percent of total variation in the response variable that is accounted for by the linear association with the explanatory variable.
• Understand that extrapolation is when a regression line is used to predict values outside of the range of observed values for the explanatory variable.
• Understand that slope = 0 means no association, slope < 0 means negative association, slope > 0 means positive association, and that the sign of the slope will be the same as the sign of the correlation coefficient.
• Understand that influential points can substantially change the equation of the best-fit line.