Lecture 1: Essential statistics - Shane Elipot · A quote from Data Analysis Methods in Physical Oceanography, Thompson and Emery, Third Edition, 2014: "Statistical methods are essential

Lecture1:

Essentialstatistics

ShaneElipotTheRosenstielSchoolofMarineandAtmosphericScience,

UniversityofMiami

Createdwith{Remark.js}using{Markdown}+{MathJax}

Loading[MathJax]/jax/output/HTML-CSS/jax.js

http://remarkjs.com/

https://daringfireball.net/projects/markdown/

https://www.mathjax.org/

Foreword

Statistics(noun,pluralinformbutsingularorpluralinconstruction)

1. Merriam-Webster:abranchofmathematicsdealingwiththecollection,analysis,interpretation,andpresentationofmassesofnumericaldata

Foreword



2. OxforddictionnaryofEnglish:

a.thepracticeorscienceofcollectingandanalysingnumericaldatainlargequantities,especiallyforthepurposeofinferringproportionsinawholefromthoseinarepresentativesample.

b.acollectionofquantitativedata(e.g.Thestatisticsofthedataareunknown.)

Foreword



2. OxforddictionnaryofEnglish:

a.thepracticeorscienceofcollectingandanalysingnumericaldatainlargequantities,especiallyforthepurposeofinferringproportionsinawholefromthoseinarepresentativesample.

b.acollectionofquantitativedata(e.g.Thestatisticsofthedataareunknown.)

Statistic(noun)asingletermordatuminacollectionofstatistics(e.g.Themeanofthedataiszero.)

References

[1]Bendat,J.S.,&Piersol,A.G.(2011).Randomdata:analysisandmeasurementprocedures(Vol.729).JohnWiley&Sons.

[2]Thomson,R.E.,&Emery,W.J.(2014).Dataanalysismethodsinphysicaloceanography.Newnes.dx.doi.org/10.1016/B978-0-12-387782-6.00003-X

[3]Taylor,J.(1997).Introductiontoerroranalysis,thestudyofuncertaintiesinphysicalmeasurements.

[4]Press,W.H.etal.(2007).Numericalrecipes3rdedition:Theartofscientificcomputing.Cambridgeuniversitypress.

[5]Kanji,G.K.(2006).100statisticaltests.Sage.

[6]vonStorch,H.andZwiers,F.W.(1999).StatisticalAnalysisinClimateResearch,CambridgeUniversityPress

http://dx.doi.org/10.1016/B978-0-12-387782-6.00003-X

Foreword

AquotefromDataAnalysisMethodsinPhysicalOceanography,ThompsonandEmery,ThirdEdition,2014:

"Statisticalmethodsareessentialtodeterminingthevalueofthedataandtodecidehowmuchofitcanbeconsideredusefulfortheintendedanalysis.Thisstatisticalapproacharisesfromthefundamentalcomplexityoftheocean,amultivariatesystemwithmanydegreesoffreedominwhichnonlineardynamicsandsamplinglimitationsmakeitdifficulttoseparatescalesofvariability."

Whatdomultivariate,degreesoffreedom,sampling,andvariabilitymeaninthiscontext?

Lecture1:Outline1. Introduction2. EstimationvsTruth3. FundamentalStatistics4. CommonDistributions5. Errors,Uncertainties,andhypothesistesting6. Extraslides

1.Introduction

Introduction

Inoceanographyandatmosphericscience,wehaveatourdisposalanumberofphysicaltheories,thatisequations,thatdescribethespatialandtemporalvariabilityoftheclimatesystem.

Introduction


Specifically,ingeophysicalfluiddynamics,theNavier-Stokesequationsdescribethemotionofafluidin2Dor3D.Yet,wedonotknowiftheseequationshavereasonablephysicalsolutions(Ifyoufigurethisout,there'sa$1MprizefromtheClayMathematicsInstitute).Assumingthattheydo,thentheocean,asanexample,canbeseenasbeingadeterministicsystem,whichmeansthatmathematicalexpressionscanbeusedtodescribecompletelythevelocityandpressurefield(e.g. ).p(x, t) = cos(ωt− k ⋅ x)p0

http://www.claymath.org/millennium-problems/navier%E2%80%93stokes-equation

Introduction


Specifically,ingeophysicalfluiddynamics,theNavier-Stokesequationsdescribethemotionofafluidin2Dor3D.Yet,wedonotknowiftheseequationshavereasonablephysicalsolutions(Ifyoufigurethisout,there'sa$1MprizefromtheClayMathematicsInstitute).Assumingthattheydo,thentheocean,asanexample,canbeseenasbeingadeterministicsystem,whichmeansthatmathematicalexpressionscanbeusedtodescribecompletelythevelocityandpressurefield(e.g. ).

Yet,evenifwedonotknowtheanalyticalsolutions,wecandiscretizetheequationswithinacomputermodelstoobtainapproximatesolutionsdescribingtheflowdeterministically...(intheory).

p(x, t) = cos(ωt− k ⋅ x)p0

http://www.claymath.org/millennium-problems/navier%E2%80%93stokes-equation

Introduction

Adeterministicapproachisnotrealisticbecauseofthreecommonlyacknowledgedreasons.Theoceanandtheatmospherearecomplex,nonlinear,andunpredictable.

Introduction


Asanexample,thereisanestimated watermoleculesintheocean(that'scomplexity)sothattherearetoomanyvariables,andtoomanyinitialandboundaryconditionstobespecified,(jointlyformingthenumberofdegreesoffreedomofthesystem)inordertosolveallequationsnumericallyinacomputermodel.Becausemanyvariablescannotbeobserved,orareunspecifiedatthestartofasimulation,outcomeswillappearrandomtotheobserver(that'sunpredictability).

4.7 × 1046

Introduction


Asanexample,thereisanestimated watermoleculesintheocean(that'scomplexity)sothattherearetoomanyvariables,andtoomanyinitialandboundaryconditionstobespecified,(jointlyformingthenumberofdegreesoffreedomofthesystem)inordertosolveallequationsnumericallyinacomputermodel.Becausemanyvariablescannotbeobserved,orareunspecifiedatthestartofasimulation,outcomeswillappearrandomtotheobserver(that'sunpredictability).

Finally,oceanandatmospherearenonlinearwhichmeansthatyoucannotreallyfindaportionofthesystem(e.g.surfacegravitywaves)withafinitenumberofdegreesoffreedomwhoseevolutionisisolatedandcanbemadedeterministic.Unknownperturbationswillrendertheobservationstocontainrandomness,ornoise.

4.7 × 1046

Introduction

Onceweacceptthattheclimateisnotapurelydeterministicsystem,butcontainrandomness,wecanrelyonasuiteoftoolsespeciallyapplicabletostochasticsystemsorprocesses.

stochastic(adjective,technical)

havingarandomprobabilitydistributionorpatternthatmaybeanalysedstatisticallybutmaynotbepredictedprecisely.

Asaconsequence,intherestofthislecture,wewilloftendiscussrandomvariables(hereafterr.v.),thatisvariablestowhichstatisticaltheorycanbeapplied.Inaddition,wewilltaketheapproachthatoursystem,orthatourobservationaldata,canbeseparatedintosignalplusnoise.Asanexample,estimationoftheseasonalcycleofoceantemperature(thesoughtaftersignal)isdisturbedbychangesduetooceancurrents,butalsochangesduetotheimperfectionsofyourtemperaturesensors,etc.

2.Estimationvstruth

Estimationvstruth

Weareinthebusinessofestimation:wetrytodescribeand/orunderstandtheclimatesystembyestimatingthevalueofarandomvariable.Thisr.v.maybeaphysicalvariablesuchasairtemperature,watervapor,seaicearea,seasurfacetemperature,sealevel,snowcover,glaciervolume,CO concentration,etc.2

Estimationvstruth

Weareinthebusinessofestimation:wetrytodescribeand/orunderstandtheclimatesystembyestimatingthevalueofarandomvariable.Thisr.v.maybeaphysicalvariablesuchasairtemperature,watervapor,seaicearea,seasurfacetemperature,sealevel,snowcover,glaciervolume,CO concentration,etc.

Oritmaybeavariablederivedfromoneorseveralotherr.v.,inessenceaparameter.Thisparameterisitselfar.v.Asanexampleoceanheatcontent,thedailymeantemperature,thedecadaltemperaturetrend,theacoustictraveltimeinwater,theamplitudeoftheseasonalcyleoftemperature,theadiabaticlapserate,thepowerspectraldensityfunctionofvelocity,theprecipitationrate,etc.

2

Estimationvstruth

Weareinthebusinessofestimation:wetrytodescribeand/orunderstandtheclimatesystembyestimatingthevalueofarandomvariable.Thisr.v.maybeaphysicalvariablesuchasairtemperature,watervapor,seaicearea,seasurfacetemperature,sealevel,snowcover,glaciervolume,CO concentration,etc.

Oritmaybeavariablederivedfromoneorseveralotherr.v.,inessenceaparameter.Thisparameterisitselfar.v.Asanexampleoceanheatcontent,thedailymeantemperature,thedecadaltemperaturetrend,theacoustictraveltimeinwater,theamplitudeoftheseasonalcyleoftemperature,theadiabaticlapserate,thepowerspectraldensityfunctionofvelocity,theprecipitationrate,etc.

Thedistinctionbetweenvariableandparameterismaybesemantic.Inanycase,let'scall ar.v.ofinterest.

2

ϕ

Estimationvstruth

Unfortunatelly,itislikelythatwewillneverknow exactly,butonlyaccessanestimatethatwewillnote ( "hat").Thisestimatemeansweuseagivenmethodoragiveninstrumenttomeasureorcalculate .

ϕ

ϕ̂ ϕ

ϕ

Estimationvstruth

Unfortunatelly,itislikelythatwewillneverknow exactly,butonlyaccessanestimatethatwewillnote ( "hat").Thisestimatemeansweuseagivenmethodoragiveninstrumenttomeasureorcalculate .

Asanexample,wewanttoknowthetemperatureoftheroom.Wecan:

1. useonetemperaturesensoratonefixedlocation(inthemiddleoftheroom),repeatedlythroughtime, times.

2. use temperaturesensors,once.3. useonetemperaturesensor,usedrepeatedly times,eachtime

inadifferentcorneroftheroom4. etc.

Eachoneofthesemethodsleadstooneestimateof"thetemperatureoftheroom".

ϕ

ϕ̂ ϕ

ϕ

NN

N

Expectationofanestimate

Let'sassumethatwedesignanexperimentandobtain ,repeatedlytimes.Theexpectationvalueof ,denoted ,is

where istheestimatefromthe -thexperiment.Unfortunatelly,sincewemusthave ,theexpectationcannotbeknown.

ϕ̂

N ϕ̂ E[ ]ϕ̂

E[ ] =ϕ̂ limN→∞

1N

∑n=0

N

ϕ̂n

ϕ̂n nN →∞




However,ifwecanreasonablymakesomeassumptionsaboutthestatisticsofther.v.itself,and/orifweknowthespecificationsofourinstruments,thenwecansometimesderivetheexpectationofourestimator.

ϕ̂

N ϕ̂ E[ ]ϕ̂


1N

∑n=0

N

ϕ̂n

ϕ̂n nN →∞





Whyisthisimportant?

ϕ̂

N ϕ̂ E[ ]ϕ̂


1N

∑n=0

N

ϕ̂n

ϕ̂n nN →∞





Whyisthisimportant?Toassesshowgoodourestimateis.

ϕ̂

N ϕ̂ E[ ]ϕ̂


1N

∑n=0

N

ϕ̂n

ϕ̂n nN →∞


Let'sassumethatwehaveaccesstotheexpectation ofanestimator ofther.v. .

E[ ]ϕ̂ϕ̂ ϕ



Ifwearelucky, ,andtheestimator issaidtobeunbiased.

E[ ]ϕ̂ϕ̂ ϕ

E[ ] = ϕϕ̂ ϕ̂



Ifwearelucky, ,andtheestimator issaidtobeunbiased.Otherwise,itissaidtobebiasedandwedefinethebiasoftheestimator:

whichconstitutesasystematicerroroftheestimate.

E[ ]ϕ̂ϕ̂ ϕ

E[ ] = ϕϕ̂ ϕ̂

b[ ] = E[ ] − ϕ,ϕ̂ ϕ̂



Ifwearelucky, ,andtheestimator issaidtobeunbiased.Otherwise,itissaidtobebiasedandwedefinethebiasoftheestimator:

whichconstitutesasystematicerroroftheestimate.

Inaddition,thevalueoftheestimatorwillchangefromoneexperimenttothenext,sowedefinethevarianceoftheestimator,denoted ,as

whichconstitutestherandomerroroftheestimate.

E[ ]ϕ̂ϕ̂ ϕ

E[ ] = ϕϕ̂ ϕ̂

b[ ] = E[ ] − ϕ,ϕ̂ ϕ̂

Var[ ]ϕ̂

Var[ ] = E[( − E[ ] ],ϕ̂ ϕ̂ ϕ̂ )2


Thebiasandthevarianceoftheestimatorcontributesbothtoitstotalerror,whichcanbeassessedbythemeansquareerror(MSE):

usuallyreportedastherootmeansquareerror:

Anothersometimesusefulquantityisthenormalizedrmserror:

whichisunitlessandcanbereportedasapercentage.

MSE[ ] = E[( − ϕ ] = Var[ ] + (b[ ] ,ϕ̂ ϕ̂ )2 ϕ̂ ϕ̂ )2

RMS = =MSE[ ]ϕ̂− −−−−−−√ Var[ ] + (b[ ]ϕ̂ ϕ̂ )2

− −−−−−−−−−−−−√

ε[ϕ] = for ϕ ≠ 0,E[( − ϕ ]ϕ̂ )2− −−−−−−−−√

ϕ

Estimationvstruth

FromtheInternationalOrganizationforStandardization(ISO)publication5725-1:1994Accuracy(truenessandprecision)ofmeasurementmethodsandresults—Part1:Generalprinciplesanddefinitions

Introduction0.1ISO5725usestwoterms"trueness"and"precision"todescribetheaccuracyofameasurementmethod."Trueness"referstotheclosenessofagreementbetweenthearithmeticmeanofalargenumberoftestresultsandthetrueoracceptedreferencevalue."Precision"referstotheclosenessofagreementbetweentestresults.

https://www.iso.org/obp/ui/#iso:std:iso:5725:-1:ed-1:v1:en

Example

SpecificationsheetsofSeabird911CTDPlussensors:

Example

SpecificationsheetsofSeabird911CTDPlussensors:

Aninterpretationisthattheaccuracyisthetotalerror,orerrorwhichisoriginallyonlytherandomerror.Thestabilityimpliesthatthebias,originallyzero,increaseswithtime.

RMS

2.Fundamentalstatistics

Fundamentalstatistics

Let'sconsiderarandomvariable ,forwhichweobtain(thatismeasure,calculate)values ,dependentonanindex alongagivendimension,orscale(e.g.temperatureasafunctionoftime,sealevelalongasatellitetrack,oxygenconcentrationasafunctionofdepthetc).

xXn n

Fundamentalstatistics

Let'sconsiderarandomvariable ,forwhichweobtain(thatismeasure,calculate)values ,dependentonanindex alongagivendimension,orscale(e.g.temperatureasafunctionoftime,sealevelalongasatellitetrack,oxygenconcentrationasafunctionofdepthetc).

Statisticaltheoryoftenconsidersr.v.thatarecontinuous,asin.Sincewearetypicallydealingwithdigitalornumericaldata

thatarediscrete,wewilltakeadiscreteapproachinthiscourse,asin where isthestepoftherecord.

Sometimes,wewillneedtoreverttocontinuousnotationswhenneeded;asanexample:

xXn n

X(t)

= X( ) = X(nΔt)Xn tn Δt

Δt⟺ X(t)dt, T = NΔt∑n=0

N

Xn ∫ T

0

Mean

Thefirststatisticalquantitytoconsideristhetruemean,populationmean,orexpectationof :x

≡ E[X] =μx limN→∞

1N

∑n=1

N

Xn

Mean

Thefirststatisticalquantitytoconsideristhetruemean,populationmean,orexpectationof :

Unfortunatelly,weonlyhaveafinitenumbersofestimatessowecomputethesamplemean

wherethelastequalitymeansthatthesamplemeanisanestimatorofthetruemeanof .

x

≡ E[X] =μx limN→∞

1N

∑n=1

N

Xn

, n = 1,…,NXn

≡ =X¯ ¯¯̄ 1

N∑n=1

N

Xn μ̂x

x

Variance

Thenextstatisticalquantityofinterestisthevarianceof :

iscalledthestandarddeviation.

x

≡ E[(X − ].σ2x μx)2

=σx σ2x−−√

Variance

Thenextstatisticalquantityofinterestisthevarianceof :

iscalledthestandarddeviation.

Anestimatorof isthesamplevariance :

Notethefactor insteadof inthepreviousexpression;seesection4.1ofreference[1]foranexplanation.

x

≡ E[(X − ].σ2x μx)2

=σx σ2x−−√

σ2x s2x

= = ( − =s2x σ̂2x

1N − 1

∑n=1

N

Xn X¯ ¯¯̄ )2

1N − 1

∑n=1

N ( − )Xn

1N

∑n=1

N

Xn

2

1N−1

1N

Mean

Let'sgobacktotheestimator of .Itisanestimator,soisitbiased?Howmuchdoesitvary?

X¯ ¯¯̄

μx

Mean


Theexpectationisanmathematicaloperatorwhichislinear:

X¯ ¯¯̄

μx

E[aX + bY ] = aE[X] + bE[Y ]

Mean



Let'susethisrulefor :

Notethat isadefinition,validforall !

X¯ ¯¯̄

μx

E[aX + bY ] = aE[X] + bE[Y ]

X¯ ¯¯̄

E[ ] = E[ ] = E[ ] = (N ) =X¯ ¯¯̄ 1

N∑n=1

N

Xn

1N

∑n=1

N

Xn

1N

μx μx

E[ ] =Xn μx n

Mean





Since thenitissaidthat isanunbiasedestimatorof(seeslideonbias).Thismeansthatthemoreobservationsof

weobtain,themoreaccuratetheestimationofthemeanwillbe.

X¯ ¯¯̄

μx

E[aX + bY ] = aE[X] + bE[Y ]

X¯ ¯¯̄

E[ ] = E[ ] = E[ ] = (N ) =X¯ ¯¯̄ 1

N∑n=1

N

Xn

1N

∑n=1

N

Xn

1N

μx μx

E[ ] =Xn μx n

E[ ] =X¯ ¯¯̄

μx X¯ ¯¯̄

μx x

Mean





Since thenitissaidthat isanunbiasedestimatorof(seeslideonbias).Thismeansthatthemoreobservationsof

weobtain,themoreaccuratetheestimationofthemeanwillbe.Thatdoesnotmeanthat isfreeoferrors...

X¯ ¯¯̄

μx

E[aX + bY ] = aE[X] + bE[Y ]

X¯ ¯¯̄

E[ ] = E[ ] = E[ ] = (N ) =X¯ ¯¯̄ 1

N∑n=1

N

Xn

1N

∑n=1

N

Xn

1N

μx μx

E[ ] =Xn μx n

E[ ] =X¯ ¯¯̄

μx X¯ ¯¯̄

μx x

X¯ ¯¯̄

Mean

Howmuchdoesthesamplemeanestimatorvary?Recallthedefinitionofthe ofanestimator;for itisMSE X

¯ ¯¯̄

E[( − ] = Var[ ] + (b[ ] = Var[ ] + 0X¯ ¯¯̄

μx)2 X¯ ¯¯̄

X¯ ¯¯̄ )2 X

¯ ¯¯̄

Mean

Howmuchdoesthesamplemeanestimatorvary?Recallthedefinitionofthe ofanestimator;for itis

Undersomeassumption(thatthe areindependent),itcanbeshown(yourhomework,orsection4.1ofreference[1])that

Thevarianceofthemeanestimatoristhetruevarianceofthedata( )dividedbythenumberofobservation( ).

MSE X¯ ¯¯̄

E[( − ] = Var[ ] + (b[ ] = Var[ ] + 0X¯ ¯¯̄

μx)2 X¯ ¯¯̄

X¯ ¯¯̄ )2 X

¯ ¯¯̄

Xn

Var[ ] = =X¯ ¯¯̄ σ2x

N

Var[X]N

σ2x N

Mean

Howmuchdoesthesamplemeanestimatorvary?Recallthedefinitionofthe ofanestimator;for itis

Undersomeassumption(thatthe areindependent),itcanbeshown(yourhomework,orsection4.1ofreference[1])that

Thevarianceofthemeanestimatoristhetruevarianceofthedata( )dividedbythenumberofobservation( ).Sincewetypicallydonotknowthetruevariance,wesubstituteforthesamplevariancetoobtainthestandarderrorofthemean,orrandomerrorforthemean:

MSE X¯ ¯¯̄

E[( − ] = Var[ ] + (b[ ] = Var[ ] + 0X¯ ¯¯̄

μx)2 X¯ ¯¯̄

X¯ ¯¯̄ )2 X

¯ ¯¯̄

Xn

Var[ ] = =X¯ ¯¯̄ σ2x

N

Var[X]N

σ2x N

s.e.[ ] ≡ = =X¯ ¯¯̄ Var[ ]X¯ ¯¯̄− −−−−−√ σ̂

2x

N

−−−

√ sx

N−−√

Mean

isameasureoftheuncertainty,orofourcapabilityofestimatingthemeanvalueof .s.e.[ ]X¯ ¯¯̄

x

ExamplefromBeal,L.M.etal.(2015),CapturingtheTransportVariabilityofaWesternBoundaryJet:ResultsfromtheAgulhasCurrentTime-seriesexperiment(ACT),J.Phys.Oceanogr.,45,1302-1324,doi:10.1175/JPO-D-14-0119.1

Mean

isameasureoftheuncertainty,orofourcapabilityofestimatingthemeanvalueof .s.e.[ ]X¯ ¯¯̄

x

http://dx.doi.org/10.1175/JPO-D-14-0119.1

Example

AgulhascurrentboundarytransportfromBeal,L.M.andS.Elipot,BroadeningnotstrengtheningoftheAgulhasCurrentsincetheearly1990s,Nature,540,570573,doi:10.1038/nature19853

http://dx.doi.org/10.1038/nature19853

Dependingonyourdata,yourpointofview,andyourinterests,themeanandthevariancemaytellyoualot,orlittle,aboutyourdata.

Dependingonyourdata,yourpointofview,andyourinterests,themeanandthevariancemaytellyoualot,orlittle,aboutyourdata.Thus,youmaywanttolookatthefrequencydistributionplot,orhistogram,whichisacountofyourdatavaluesinanumberof

discreteintervals.

Histogram

Thereisnogeneralrule(onlyrecommendations)onhowtochoosethesizeofthebinsorthenumberofbinstobeused.

Histogram

Thereisnogeneralrule(onlyrecommendations)onhowtochoosethesizeofthebins,orthenumberofbinstobeused.

Histogram

Thefrequencyofoccurencesofagivenvalue isquantityderivedfrom andcanbeitselfestimated.Theredlineinthisplotshowsakernelestimateofthehistogram(seepracticalsessionthisafternoon).

x = ax

Buttheoverallvaluesof stilldependonthewidth ofthebins.Seethisexamplewith and

.

Probabilityfunction

Let'sconsidertheprobability,orrelativecountofoccurences,toobtainavalue intheinterval .Thisdefinesaprobabilityfunction:

X [ , ]xk−1 xk

( ≤ X ≤ ) = = , countin[ , ]P ∗ xk−1 xk P ∗k

ck

Nck xk−1 xk

= 1, k:intervalindex∑k

P ∗k

P ∗kΔXΔX = 10

ΔX = 5

PDFandCFD

Let'sconsiderinsteadthediscreteprobabilitydensityfunction,orPDF:

P ( ≤ X ≤ ) = = , ΔX = −xk−1 xk Pkck

NΔXxk xk−1

PDFandCFD


andthediscretecumulative(probability)distributionfunction,orCDF:

P ( ≤ X ≤ ) = = , ΔX = −xk−1 xk Pkck

NΔXxk xk−1

F ( ) = ( ≤ X ≤ ) = (X ≤ ) =xk P ∗ x0 xk P ∗ xk ∑i≤k

P ∗i

PDFandCFD


andthediscretecumulative(probability)distributionfunction,orCDF:

Sinceallthevalues arecontainedbetween and :

P ( ≤ X ≤ ) = = , ΔX = −xk−1 xk Pkck

NΔXxk xk−1

F ( ) = ( ≤ X ≤ ) = (X ≤ ) =xk P ∗ x0 xk P ∗ xk ∑i≤k

P ∗i

X min[X] max[X]

P (min[X] ≤ X ≤ max[X])

F (max[X])

=

=

1 or ΔX = ( )ΔX = 1∑k

PkN

NΔX

1

PDFandCFD

TheoverallvaluesofthePDFandCDFdonotdependonthebinwidth.However,theirresolution(ordetailedshapes)do.

PDFandCDF

Asonereducesthesizeofthebins, ,thecontinuousPDFandCDFareapproximated:

withtheproperty

ΔX→ 0

P (x ≤ X ≤ x+ΔX)⟶ p(x) =df(x)dx

F (x) = P (X ≤ x)ΔX⟶ f(x) = p( ) d∫ x

−∞x′ x′

ΔX = 1⟶ p(x) dx = f(+∞) − f(−∞) = 1 − 0 = 1∑k

Pk ∫ +∞−∞

PDFandstatistics

Wecannowgivesomeformaldefinitions:

Thelastexpressiondefines the -thcentralmomentof .

μx

μn

≡

≡

xp(x) dx, ≡ (x− p(x) dx∫ +∞−∞

σ2x ∫ +∞−∞

μx)2

(x− p(x) dx∫ +∞−∞

μx)n

(x)μn n x

PDFandstatistics



Thediscreteequivalentis

μx

μn

≡

≡

xp(x) dx, ≡ (x− p(x) dx∫ +∞−∞

σ2x ∫ +∞−∞

μx)2

(x− p(x) dx∫ +∞−∞

μx)n

(x)μn n x

≡ E[(X − ]μn μx)n

PDFandstatistics



Thediscreteequivalentis

Thesecondcentralmoment isthevariance,bydefinition.Themeanisthefirstmomentabouttheorigin( ),thatis

μx

μn

≡

≡

xp(x) dx, ≡ (x− p(x) dx∫ +∞−∞

σ2x ∫ +∞−∞

μx)2

(x− p(x) dx∫ +∞−∞

μx)n

(x)μn n x

≡ E[(X − ]μn μx)n

μ20

= (x− 0)p(x) dxμx ∫ +∞−∞

3rdmoment:skewness

Thethirdnormalizedcentralmomentiscalledtheskewness.ItdescribesthetendencyforanasymmetrybetweenpositiveexcursionsandnegativeexcursionsofthePDF:

One(biased)estimatoris

Positive SkewNegative Skew

≡ =γxμ3

(μ2)3/2μ3

(σ2x )3/2

≡γ̂x( −1

N∑N

n=1 Xn X¯ ¯¯̄ )3

[ ( − ]1N

∑Nn=1 Xn X

¯ ¯¯̄ )23/2

4thmoment:kurtosis

Thefourthnormalizedcentralmomentiscalledthekurtosis.Itdescribesthepeakedness(concentrationnear ),oratendencyforlongtails(concentrationfarfrom ):

One(biased)estimatoris

Becausethekurtosisofanormal,orGaussian,distributionisequalto ,oftentheexcesskurtosis isconsidered.SeeMoors(1986),“TheMeaningofKurtosis:DarlingtonReexamined”.

μxμx

≡ =κxμ4

(μ2)2μ4

(σ2x)2

≡κ̂x( −1

N∑N

n=1 Xn X¯ ¯¯̄ )4

[ ( − ]1N

∑Nn=1 Xn X

¯ ¯¯̄ )24/2

3 − 3κx

IllustrationofKurtosis

Distributionscorrespondingtodifferentvaluesofexcesskurtosis.

Positiveexcesskurtosiscorrespondstolongtailsandpeakedness.

Kurtosis:example

Hughesetal.(2010)Identificationofjetsandmixingbarriersfromsealevelandvorticitymeasurementsusingsimplestatistics

TheyusedthestatisticsandPDFofsealevelanomaliesandderivedrelativevorticitytoshowthatstrongoceanicjetstendtobeidentifiedbyazerocontourinskewnesscoincidingwithalowvalueofkurtosis.

Whyisthisuseful?

Ithinkthatplottingthehistogramofyourdata,andfurtherestimatingindetailitsPDF,givesyouaholistic,orglobalview,ofthedatapopulationfromwhichyoursampleisdrawn.MaybeyouwillfindthattheestimatedPDFofasampleonagivendayisdifferentfromtheestimatedPDFonanotherday...

Whyisthisuseful?


Italsoallowsyoutoanswerquestionssuchas:whatisthemostprobablevalueofthedata?Howoftendoweobserveextremevalues?etc.

Whyisthisuseful?



Inaddition,theknowledgeofyourdatadistribution,and/orachoiceofamodelforyourdatadistributionwillallowyoutodefineconfidenceintervalsforyourestimatedparameters,andtoproceedtoconducthypothesistestinginyourresearch.

Whyisthisuseful?



Inaddition,theknowledgeofyourdatadistribution,and/orachoiceofamodelforyourdatadistributionwillallowyoutodefineconfidenceintervalsforyourestimatedparameters,andtoproceedtoconducthypothesistestinginyourresearch.

Beforelookingatthis,weneedtoreviewthetheoryofvariousprobabilitydistributionfunctions.

Butfirstanexample

Animation(#joyplot)byGavinSchmidtshowingglobaltemperaturedistributionin10-yrwindows,seehisblogpostonrealclimate.org.DatafromtheNASAGISSSurfaceTemperatureAnalysis(GISTEMP)dataset.

https://twitter.com/ClimateOfGavin

http://www.realclimate.org/index.php/archives/2017/07/joy-plots-for-climate-change/#more-20524

https://data.giss.nasa.gov/gistemp/

3.CommonDistributions

z = rand(1000,1);x1 = -1; x2 = 2;x = (x2-x1)*z + x1;

Theuniformdistribution

Arandomvariable thatisuniformlydistributedbetween andhasforPDF:

Themeanofthisdistributionis anditsstandarddeviationis

x x1x2

p(x) =

=

,1−x2 x10,

≤ x ≤x1 x2

otherwise

( + )/2x1 x2( − )/(2 )x2 x1 3–√

= 0.5, = = 0.8660μx σx2 − −1

2 3)(√

Thenormaldistribution(orGaussian)

Arandomvariable thatisnormallydistributedwithmean andstandarddeviation hasforPDF:

x μxσx

p(x) = exp[− ] ≡ N ( , )1

σx 2π−−√

(x− μx)2

2σ2xμx σx



If thenthevariable

Hereafter, willmean"distributedlike".

x μxσx

p(x) = exp[− ] ≡ N ( , )1

σx 2π−−√

(x− μx)2

2σ2xμx σx

x ∼ N ( , )μx σx z = ∼ N (0, 1)x− μxσx

∼



If thenthevariable

Hereafter, willmean"distributedlike".

InMatlab,thefollowinggeneratesadatavector containing1000samplesfromar.v. ,andadatavector fromar.v.

z = randn(1000,1);x = 1.35*z + 2.1;

x μxσx

p(x) = exp[− ] ≡ N ( , )1

σx 2π−−√

(x− μx)2

2σ2xμx σx

x ∼ N ( , )μx σx z = ∼ N (0, 1)x− μxσx

∼

z∼ N (0, 1) x

∼ N (2.1, 1.35)

Thenormaldistribution

TheredcurveisthetheoreticalnormalPDF .Thehistogramiscomputedfromasampleofsize .

N (2.1, 1.35)N = 1000


Thenormaldistributionisofparticularimportancebecauseofthecentrallimittheoremwhichassertsroughlythatthenormaldistributionistheresultofthesumofalargenumberofindependentrandomvariableactingtogether.



Tobemorespecific,let be independentr.v.withindividualmeans andvariances .Nowconsiderthenewr.v.

Thecentrallimittheoremstatesthat,as , willbenormallydistributedwithmean andvariance .Inpractice,theCLTisusedfor "large".

, ,… , ,…,x1 x2 xi xN Nμi σ2i

x = + +…+ .a1x1 a2x2 aN xN

N → +∞ x∑k akμk ∑k a

2kσ2k

N



Tobemorespecific,let be independentr.v.withindividualmeans andvariances .Nowconsiderthenewr.v.

Thecentrallimittheoremstatesthat,as , willbenormallydistributedwithmean andvariance .Inpractice,theCLTisusedfor "large".

Infact,wehavealreadyusedthecentrallimittheorem...

, ,… , ,…,x1 x2 xi xN Nμi σ2i

x = + +…+ .a1x1 a2x2 aN xN

N → +∞ x∑k akμk ∑k a

2kσ2k

N

ThenormaldistributionandtheCLT

Recallthatthesamplemeanoftherecord isdefinedas

Here, canbeseenasanewr.v.whichisthesumofindividualr.v.(forwhichwehaveonlyonevalue)withthesamepopulationmean

,andsamepopulationvariance .

, ,… ,X1 X2 XN

= = ( ) + ( ) +…+ ( )X¯ ¯¯̄ 1

N∑n=1

N

Xn

1N

X11N

X21N

XN

X¯ ¯¯̄

μx σ2x

ThenormaldistributionandtheCLT

Recallthatthesamplemeanoftherecord isdefinedas

Here, canbeseenasanewr.v.whichisthesumofindividualr.v.(forwhichwehaveonlyonevalue)withthesamepopulationmean

,andsamepopulationvariance .

Hence,theCLTstatesthat,for largeenough, isnormallydistributedwithmean andvariance ,wherethislastresultwasgivenpreviouslywithoutexplanation.

, ,… ,X1 X2 XN

= = ( ) + ( ) +…+ ( )X¯ ¯¯̄ 1

N∑n=1

N

Xn

1N

X11N

X21N

XN

X¯ ¯¯̄

μx σ2x

N X¯ ¯¯̄

(1/N ) = (N/N ) =∑Nn=1 μx μx μx

(1/N = (N/ ) = (1/N )∑Nn=1 )2σ2x N 2 σ2x σ2x

The distribution

Let be independentr.v. .Letanewr.v.definedas

χ2n

, ,… ,z1 z2 zn n ∼ N (0, 1)

= + +…+χ2 z21 z22 z2n

The distribution


Itissaidthatthisr.v.isa"chi-squared"variablewith degreesoffreedom(DOF).

χ2n

, ,… ,z1 z2 zn n ∼ N (0, 1)

= + +…+χ2 z21 z22 z2n

n

The distribution


Itissaidthatthisr.v.isa"chi-squared"variablewith degreesoffreedom(DOF).Suchr.v.hasforPDF:

Themeanofthisdistributionis anditsvarianceis .

χ2n

, ,… ,z1 z2 zn n ∼ N (0, 1)

= + +…+χ2 z21 z22 z2n

n

p(x) = ≡ (n)orexp(− )x −1

n

2x2

Γ(n/2)2n

2

χ2 χ2n

n 2n

The distribution


Itissaidthatthisr.v.isa"chi-squared"variablewith degreesoffreedom(DOF).Suchr.v.hasforPDF:

Themeanofthisdistributionis anditsvarianceis .

Examplesof variablesarepowerspectraldensityfunctionestimates(seeLecture4ontimeseriesanalysis)orvarianceestimates.Thesamplevarianceof is

χ2n

, ,… ,z1 z2 zn n ∼ N (0, 1)

= + +…+χ2 z21 z22 z2n

n

p(x) = ≡ (n)orexp(− )x −1

n

2x2

Γ(n/2)2n

2

χ2 χ2n

n 2n

χ2

x ∼ N ( , )μx σx

∼ (N − 1)s2xσ2xN−1χ2

The distribution

Inthisexample,theredcurveisthetheoretical PDFfor .Thehistogramiscomputedfromasampleofsize .

χ2n

χ2n n = 14N = 1000

TheFdistribution

If and thenthefollowingvariable

is distributedwith and degreesoffreedom.Themathematicalexpressionforthisdistributionisverycomplicatedandnotveryusefulhere,seereference[1].

Themeanvalueofthe distributionis for anditsvarianceis

The distributionarisesasanexamplewhentestingfortheequalityoftwopopulationvariances(seepracticalthisafternoon).

x ∼ χ2n y ∼ χ2m

∼ F (n,m)x/ny/m

F n m

F m/(m− 2) m > 2

form > 42 (n+m− 2)m2

n(m− 2 (m− 4))2

F

TheStudent's distribution

Let and betwoindependentr.v.with and .Letbeanewr.v.definedas

Itissaidthatthisr.v.isaStudent's variablewith degreesoffreedom(DOF).

t

y z y ∼ χ2n z ∼ N (0, 1)

t =z

y

n

−−√t n


Let and betwoindependentr.v.with and .Letbeanewr.v.definedas

Itissaidthatthisr.v.isaStudent's variablewith degreesoffreedom(DOF).Suchr.v.hasforPDF:

Themeanis for andthevarianceis for .

Anexampleof distributedr.v.istheestimateofthemeanofapopulationwithunknownvariance,aswewillseelater.

t

y z y ∼ χ2n z ∼ N (0, 1)

t =z

y

n

−−√t n

p(x) = ≡ t(n)orΓ[(n+ 1)/2]Γ(n/2)πn−−−√

[1 + ]x2

n

n+12

tn

0 n > 0 n

n−2 n > 2

t


Inthisexample,theredcurveisthetheoretical PDFfor .Thehistogramiscomputedfromasampleofsize .TheblackcurveisthetheoreticalPDF .

t

tn n = 10N = 1000

N (0, 1)

TheGamma( )distributionfamily

Infact,the distributionandtheexponentialdistributionaretwoparticularcasesoftheGamma( )distributionfamily.Arandomvariable thatisGammadistributedwithparameters and hasforPDF:

withthe functiondefinedas

exponentialdistribution

distribution

Γ

χ2nΓ

x α β

p(x) = , β > 0; 0 ≤ x ≤ +∞exp[−x/β]xα−1

Γ(α)βα

Γ

Γ(α)

Γ(n)Γ(α)

=

==

exp(− ) d∫ +∞0

x′α−1 x′ x′

(n− 1)! forninteger(α− 1)Γ(α− 1)forαcontinuouswithΓ(1) = 1

α = 1⟶

α = , β = 2⟶n2 χ2n

4.Uncertainties,errorsandhypothesistesting

Probabilitystatements

Weusedistributionstomakeprobabilitystatementsaboutourr.v.estimates.

Probabilitystatements

Weusedistributionstomakeprobabilitystatementsaboutourr.v.estimates.Itisusefultoconsiderthefollowing.ForanygivenPDF

,andassociatedCDF ,ofthevariable ,let'sdenote thevaluethatcorrespondsto ,thatis

BEWAREthatthistheconventionusedhere.ItisnotablydifferentinMatlabwhereicdf('normal',1- ,0,1)returnsthevalue

p(z) f(z) z zαf(z) = 1 − α

f( ) = p(z) dz = Prob[z ≤ ] = 1 − αzα ∫ zα

−∞zα

α zα

Confidenceintervals

PDFsareusedtoderiveconfidenceintervals(CI),theinterpretationofwhichissubtle.WhatisaCIforyou?

Confidenceintervals


Givenanestimate ofaquantity ,andachosensignificancelevel,weconstructanintervalwithlowerbound andupperboundsothatthisintervalisexpectedtocoverthetrue,unknown,but

fixedvalueof ,withprobability .

ϕ̂ ϕα ϕLϕU

ϕ 1 − α

Confidenceintervals




Inotherwords,ifwecouldrepeattheestimationandcalculationoftheCImanytimes,wecanexpectthatthetrueunknownparameteriscoveredbythecalculatedCI,95outof100times.

ϕ̂ ϕα ϕLϕU

ϕ 1 − α

ϕ

Confidenceintervals




Inotherwords,ifwecouldrepeattheestimationandcalculationoftheCImanytimes,wecanexpectthatthetrueunknownparameteriscoveredbythecalculatedCI,95outof100times.

Thereisnoprobabilitystatementabout ,onlyabout and.

ϕ̂ ϕα ϕLϕU

ϕ 1 − α

ϕ

ϕ ϕ̂[ , ]ϕL ϕU

Confidenceintervals

Asaconcreteexample,considerthesamplemean of.

Westatedearlierthat .Assuch,wecanstatethatthenew"transformed"variable

andthatwecanfindtwo valuessuchthat

X¯ ¯¯̄

x ∼ N ( , )μx σx

∼ N ( , / )X¯ ¯¯̄

μx σx N−−√

z = ∼ N (0, 1)−X¯ ¯¯̄ μx

/σx N−−√

z

Prob[ < ≤ ] = 1 − αz1−α/2−X¯ ¯¯̄ μx

/σx N−−√

zα/2

Confidenceintervals:normalcase

Since,thenormaldistributionissymmetricaroundzero,

Prob[ < ≤ ] = 1 − αz1−α/2−X¯ ¯¯̄ μx

/σx N−−√

zα/2

= −z1−α/2 zα/2


Asaresult,thenormalizedcalculatedvariable issuchthat

with probability.Afterrearranging,wecanstatethatthetruemean ofther.v. issuchthat

withaconfidenceof .

Incommonparlance,a95%CIfor is

z

− < z = ≤zα/2−X¯ ¯¯̄ μx

/σx N−−√

zα/2

1 − αμx x

− ≤ < +X¯ ¯¯̄ σxzα/2

N−−√

μx X¯ ¯¯̄ σxzα/2

N−−√

100(1 − α)%

μx

[ − , + ]X¯ ¯¯̄ σxzα/2

N−−√

X¯ ¯¯̄ σxzα/2

N−−√


Typicalintervalsusedare:

Beforetheadventofadvancedsoftwares,peoplereliedonstatisticaltablesforthevaluesof ,suchastheonesfoundintheAppendicesofreferences[1],[2],[3],[6].

90%CI : α = 0.1⟶ = 1.6449zα/2

95%CI : α = 0.05⟶ = 1.9600zα/2

99%CI : α = 0.01⟶ = 2.5758zα/2

z


ExamplefromBendatandPiersol(2011):95%CI : α = 0.05⟶ α/2 = 0.0250

Confidenceintervals:normalcase;example:

UsingaCTDrecordoftemperatureat24Hz,weestimatethemeantemperaturenear70dbpressurelevelbyaveragingdatapointswithin.5dbof70db(fallingrateis1-2m/s), .Wefind

.FromthespecificationsheetoftheCTD911Plus,theaccuracyofthetemperaturesensoris ,whichweinterpretasbeingtherandomerrororstdof ,i.e. ,ignoringapossiblebias.

N = 11= 19.21359T¯ ¯¯̄

0.001T σT

Confidenceintervals:normalcase;example:

UsingaCTDrecordoftemperatureat24Hz,weestimatethemeantemperaturenear70dbpressurelevelbyaveragingdatapointswithin.5dbof70db(fallingrateis1-2m/s), .Wefind

.FromthespecificationsheetoftheCTD911Plus,theaccuracyofthetemperaturesensoris ,whichweinterpretasbeingtherandomerrororstdof ,i.e. ,ignoringapossiblebias.Weuse ,andbasedonthepreviousformula,the95%CIforthetruemean is:

Alternatively,oncecanstatethattheestimateofthemeanwith95%uncertaintyis

N = 11= 19.21359T¯ ¯¯̄

0.001T σT

( − )/( / ) ∼ N (0, 1)T¯ ¯¯̄

μT σT N−−√

μT

19.21359 − ≤ < 19.21359 +0.001 × 1.96

11−−√μT

0.001 × 1.96

11−−√

⇒ 19.21300 ≤ < 19.21418μT

= 19.21359 ± 0.00059μT

Confidenceintervals: case

Nowimaginethatyouobtainthedatafromthepreviousexamplebutdonotknowthespecificationofthesensorused,thatis .Instead,youcanconsiderthesamplestandarddeviation asanestimateoftheunknown .

t

σTsT

σT


Nowimaginethatyouobtainthedatafromthepreviousexamplebutdonotknowthespecificationofthesensorused,thatis .Instead,youcanconsiderthesamplestandarddeviation asanestimateoftheunknown .Itcanbeshown(notobvious),that

t

σTsT

σT

∼ t(N − 1)−T¯ ¯¯̄ μT

/sT N−−√


Nowimaginethatyouobtainthedatafromthepreviousexamplebutdonotknowthespecificationofthesensorused,thatis .Instead,youcanconsiderthesamplestandarddeviation asanestimateoftheunknown .Itcanbeshown(notobvious),that

Thus,a %CIforthetruemean is:

where isthevalueofthe variablesuchthat

The distributionissymmetriclikethenormaldistributionsothat

t

σTsT

σT

∼ t(N − 1)−T¯ ¯¯̄ μT

/sT N−−√

100(1 − α) μT

[ − ≤ < + ]T¯ ¯¯̄ sT tN−1;α/2

N−−√

μT T¯ ¯¯̄ sT tN−1;α/2

N−−√

tN−1;α/2 tN−1

Prob [t ≤ ] = 1 − or Prob [t > ] =tN−1;α/2α

2tN−1;α/2

α

2

t= −tN;1−β tN;β


GoingbacktoourCTDexample,westillhave butnowcalculate .Since ,the95%CIfor becomes:

Alternativelyoncecanstatethattheestimateofthemeanwith95%uncertaintyis

t

= 19.21359T¯ ¯¯̄

= 0.02277sT = 2.0227t40−1;0.05/2μT

19.21359 − ≤ < 19.21359 +0.02277 × 2.0227

1–√μT

0.02277 × 2.0227

11−−√

⇒ 19.19829 ≤ < 19.22889μT

= 19.21359 ± 0.01530μT




Previously,westatedthat

Sowhatistherightanswer?

t

= 19.21359T¯ ¯¯̄

= 0.02277sT = 2.0227t40−1;0.05/2μT

19.21359 − ≤ < 19.21359 +0.02277 × 2.0227

1–√μT

0.02277 × 2.0227

11−−√

⇒ 19.19829 ≤ < 19.22889μT

= 19.21359 ± 0.01530μT

= 19.21359 ± 0.00059μT




Previously,westatedthat

Sowhatistherightanswer?Itisarguable...

t

= 19.21359T¯ ¯¯̄

= 0.02277sT = 2.0227t40−1;0.05/2μT

19.21359 − ≤ < 19.21359 +0.02277 × 2.0227

1–√μT

0.02277 × 2.0227

11−−√

⇒ 19.19829 ≤ < 19.22889μT

= 19.21359 ± 0.01530μT

= 19.21359 ± 0.00059μT

Toattempttoanswerthisquestion,let'slookatthedata.Thebluecurveisthe24Hztemperaturedata,andtheredcurveandshadingshowthe1dbbinaveragesand95%CIusingthe distribution.t

Instrumentalvssampling/modelerrors

Whenweuse asanestimateofthetemperatureat70db,weassumedthatthe recordsoftemperatureat24Hzhadthesameexpectationandvariance.

T¯ ¯¯̄

N = 11



However,weknowbasedonouroceanographicknowledgethatthetemperatureisnotnecessarilyaconstantwithdepth.Here,theobvioustemperaturegradientwithdepthmakesusthinkthattheexpectedtemperatureat69.5dbisnotthesameasat70.5db.

T¯ ¯¯̄

N = 11




Thus,whenwecalculateanarithmeticaverage( )oftemperaturethatisafunctiondepth,itisanestimateofatemperaturequantitythatisvariablebecauseof1)theaccuracyofthesensor,and2)thevaryingexpectationvalue.

T¯ ¯¯̄

N = 11

T¯ ¯¯̄





Traditionally,thissecondtypeoferroriscalledasamplingerrororamodelerror,whichaddstothefirsttypeoferrorcalledinstrumentalerror.

T¯ ¯¯̄

N = 11

T¯ ¯¯̄





Traditionally,thissecondtypeoferroriscalledasamplingerrororamodelerror,whichaddstothefirsttypeoferrorcalledinstrumentalerror.

Partoftheanalysisofyourdataistounderstand,ormodel,thesourcesorvarianceandhenceoferrorswhencalculatingderivedquantitiessuchasmean,varianceetc.

T¯ ¯¯̄

N = 11

T¯ ¯¯̄

Interlude:modelingsignalandnoise

Partoftheproblemofchoosingtheappropriatevariancetoestimateerrorsischoosingamodelfortheobservations.Asanexample,themeasured"process" maybethesumofagivensignalplusinstrumentalnoise .

xy ε

x = y+ ε



Ifthesignalandnoiseindependent,thenthetotalvarianceoftheprocessis

xy ε

x = y+ ε

Var[x]σ2x

==Var[y]σ2y

++Var[ε]σ2ε



Ifthesignalandnoiseindependent,thenthetotalvarianceoftheprocessis

InourpreviousexampleoftheCTDprofile,thesamplevarianceofthemeasurementswaslikelythesumoftheinstrumentalerrorandofthe"error"fromthebackgroundshearoftemperature.Iwouldtendtochoosethesecondcase( distributionwithunknownvariance)toderiveCIs.

xy ε

x = y+ ε

Var[x]σ2x

==Var[y]σ2y

++Var[ε]σ2ε

t


The distributionisdefinedforpositivevaluesonly,andisnotsymmetric:

χ2n

Prob [ < ≤ ] = 1 − αχ2n;1−α/2 χ2n χ2

n;α/2

χ2

≠ −χ2n;1−β χ2

n;β

Confidenceintervals: caseexample

The distributioncanbeusedtoderiveCIsforvarianceestimates.Itcanbeshownthatfor samplesdrawnfromanormallydistributedr.v. withvariance ,wehave

whichcanbeusedtoderive %CIforvarianceestimatesas

χ2n

χ2

Nx σ2x

∼(N − 1)s2x

σ2xχ2N−1

100(1 − α)s2x

≤ <(N − 1)s2xχ2N−1;α/2

σ2(N − 1)s2xχ2N−1;1−α/2

Hypothesistesting

Confidenceintervalsareparticularcasesofhypothesistesting,acaseofdataanalysisthatoccursfrequently.Seetheintroductionof100statisticaltestsbyG.K.Kanji(seereference[5]).

Hypothesistesting


Hypothesistestingdoesnotconsistinprovingordisprovinghypotheses.Justlikewewillneverknowthetruevalueofar.v.,wewillneverproveinanindeniablefashionthatanhypothesisistrue.

Hypothesistesting



Hypothesistestingconsistsinshowingthatanhypothesiscannotbesupportedgivenitssmallprobability.Howsmallisyourownchoice,orthedifferencebetweengettingpublishedornotpublished,orastakeholdertakingadecisionoractionorinaction,etc.

Hypothesistesting



Hypothesistestingconsistsinshowingthatanhypothesiscannotbesupportedgivenitssmallprobability.Howsmallisyourownchoice,orthedifferencebetweengettingpublishedornotpublished,orastakeholdertakingadecisionoractionorinaction,etc.

Ingeneral,thehypothesiswearetryingtodenouce,decry,etc,isonewithnochange(i.e. ,"themeantemperaturetodayisthesameasyesterday"),sothatitistypicallycalledthenullhypothesis,

.When isrejectedbecauseofinsufficientprobability,weacceptthealternaticehypothesis (i.e. ,"themeantemperaturetodayisdifferentfromyesterday").

a = b

H0 H0H1 a ≠ b

Hypothesistesting

Step1

Defineyourpracticalproblemintermsofsimplehypotheses,anullhypothesisandanalternatehypothesisthattypicallyleadstoaction.Decideifyouarelikelytoconductaone-tailedortwo-tailedtest.

Hypothesistesting

Step1

Defineyourpracticalproblemintermsofsimplehypotheses,anullhypothesisandanalternatehypothesisthattypicallyleadstoaction.Decideifyouarelikelytoconductaone-tailedortwo-tailedtest.

Asanexample,anullhypothesisisthatthepopulationmean ofar.v. isequaltoagivenvalue (maybe ).Alternativehypothesesmaybethat isnotequalto (case ,two-tailedtest),orthatisgreaterorsmallerthan (cases ,one-tailedtests).

μxx μ0 0

μx μ0 1 μxμ0 2, 3

1.

2.

3.

: =H0 μx μ0

: ≠H1 μx μ0: =H0 μx μ0: >H1 μx μ0

: =H0 μx μ0: <H1 μx μ0

Hypothesistesting

Step2

Deriveastatistic,thatisanumber,thatcanbecalculatedfromyourdataandyourassumptions,typicallyunderyournullhypothesis

.Makesurethatthisnumberisgoingtobedifferentwhen istrueorwhen istrue.H0 H0

H1

Hypothesistesting

Step2

Deriveastatistic,thatisanumber,thatcanbecalculatedfromyourdataandyourassumptions,typicallyunderyournullhypothesis

.Makesurethatthisnumberisgoingtobedifferentwhen istrueorwhen istrue.

Followingthepreviousexample,wesawthatif isnormallydistributedwithknownvariance ,thenthestatistic

H0 H0H1

xσx

z = ∼ N (0, 1)−X¯ ¯¯̄¯ μx

/σx N√

Hypothesistesting

Steps3&4

Chooseacriticalregionforyourteststatisticandasignificancelevel thatdeterminethesizeofyourcriticalregion.Criticalregionscanbeofthreetypes;right-sidedmeansthatyourejectifyourteststatisticisgreaterthanorequaltosomerightcriticalvalue;left-sidedyougetit;orboth-sidedsothatyoureject ifyourteststatisticiseithergreaterthanorequaltotherightcriticalvalueorlessthanorequaltotheleftcriticalvalue.

αH0

H0

Hypothesistesting

Steps3&4:example

Case1:

isoutsideofthecriticalregion!Noreasontoreject (i.e.weacceptthatthemeanisnotdifferentfrom )

Case2:

isinthecriticalregionforaright-sidedtest!Wecanreject (inthesensethatthemeanappearslargerthan )

Case3:

isoutsidethecriticalregionforaleft-sidedtest!Noreasontoreject (inthesensethatitmeandoesnotappeartobelessthan

).

α = 0.05, : = 4,N = 9, = 4.6, = 1.0 → z = = 1.8H0 μ0 X¯ ¯¯̄

σx4.6 − 4

1/ 9–√

= −1.96 < z = 1.8 < = 1.96z1−0.05/2 z0.05/2

z H0μ0

z = 1.8 > = 1.64z1−0.05

z H0μ0

= −1.64 ≤ z = 1.8z0.05

zH0

μ0

Hypothesistesting

PleaseseethebookbyG.K.Kanji,100StatisticalTests,(2006)!Itisveryhandy...

Onelastthing:Errorpropagation

Wesawcommoncaseswherethestatisticswere , ,or r.v.Whatifwearetryingtoassesstheerrororuncertaintyforar.v.thatisarbitrarilyfunctionof variables withindependentrandomerrors (maybetheRMSerror)?

x ∼ z t χ2

yN xn

εxn

y = y( , ,… , )x1 x2 xN

Onelastthing:Errorpropagation

Wesawcommoncaseswherethestatisticswere , ,or r.v.Whatifwearetryingtoassesstheerrororuncertaintyforar.v.thatisarbitrarilyfunctionof variables withindependentrandomerrors (maybetheRMSerror)?

Anapproximateformulafor"small"errorsis

Seereference[3].

x ∼ z t χ2

yN xn

εxn

y = y( , ,… , )x1 x2 xN

≈ + +…+ε2y ( )∂y∂x1

2

ε2x1 ( )∂y∂x2

2

ε2x2 ( )∂y∂xN

2

ε2xN

Practicalsession

Pleasedownloaddataatthefollowinglink:

PleasedownloadtheMatlabcodeatthefollowinglink:

MakesureyouhaveinstalledandtestedthefreejLabMatlabtoolboxfromJonathanLillyatwww.jmlilly.net/jmlsoft.html

https://www.jmlilly.net/jmlsoft.html

Extraslides

-testfortwopopulationmeans(variancesunknownandunequal)

Followingtest#9ofKanji(2006),reference[5]

Theteststatisticis

whichisusedtotest ,sothat with

t

t =( − ) − ( − )X¯ ¯¯̄1 X¯ ¯¯̄2 μ1 μ2

( + )s21n1

s22n2

12

=μ1 μ2 t ∼ t(0, ν)

ν =

( + )s21n1

s22n2

2

+s41

( − 1)n21 n1

s42

( − 1)n22 n2

Kolmogorov-Smirnovtestfordistribution

TheKolmogorov-Smirnovtestcomparesanempiricaldistributionfunction toaprescribednormaldistributionfunction withmean andstandarddeviation .Itconsidersthestatistic

whichmeasuresthemaximumdistancebetweenthetwodistribution(asseenonaQ-Qplot).

Theissueisthatthistestistooconservativewhenthemeanandstdof arecalculatedfromthedata.AnalternativetestiscalledtheLillieforstest,whichismorestringent.Seealsotest20ofKanji(2006),reference[5]foranothertest.

InMatlab:

h = kstest(x); h = lillietest(x);

F̂ Fμ σ

D = | ( ) − F (μ,σ)|maxXi

F̂ Xi

F

Documents

Lecture 1: Essential statistics - Shane Elipot · A quote from Data Analysis Methods in Physical Oceanography, Thompson and Emery, Third Edition, 2014: "Statistical methods are essential