Admixture Correction in the Outgroup f3 Statistic Presented by Nita Tunga In partial fulfilment of the requirements for graduation with the Dean’s Scholars Honors Degree in the Department of Mathematics University of Texas at Austin May, 2018 Prof. Jennifer Mann Austin, Ph.D Supervising Professor Prof. Kathryn Dabbs, Ph.D Second Reader

Admixture Correction in the Outgroup f3 Statistic



Table of Contents

Introduction
Chapter 1: Background
Chapter 2: Project
Chapter 3: Dataset
Chapter 4: Methods and Results
    Section 4.1: Correction Attempt 1
    Section 4.2: Correction Attempt 2
    Section 4.3: Correction Attempt 3
    Section 4.4: Correction Attempt 4
Conclusion
Appendix A: Glossary
References


Introduction

Genetic inheritance can be studied within a purely genetic scope. However, this eliminates part of the picture. The field of genetics is often thought of as a natural science with little in common with fields of social science. However, in human genetics and the genetics of the organisms which humans impact, the role of cultural and societal forces cannot be ignored. For instance, lactase is an enzyme used to digest lactose in milk. As such, it is an enzyme whose activity reduces significantly after weaning. Nonetheless, as humans have begun to ingest more dairy products into adulthood, lactase persistence has evolved to enable humans to digest these dairy products.

My research involves mathematically representing the genetic similarity of two populations accurately via the f3 statistic. The outgroup-f3 statistic is useful in understanding a population’s genetic history and how genetically related two populations are. It shows how close two populations are compared to a third population that is equally distant genetically from the first two. However, if two populations share a recent genetic interaction with another population, the outgroup-f3 statistic could show those two populations as being closer together than they truly are. This genetic interaction of two or more previously isolated populations interbreeding is referred to as admixture. Admixture skews, or even inhibits, an understanding of those populations’ genetic histories.

To avoid this problem, I have attempted to devise a modified version of the outgroup-f3 statistic to ensure an accurate representation of genetic relatedness. For my project, artificial admixture was introduced in six unadmixed human populations. Depending on the relationship between increased contamination and the f3 statistic, we proposed and adjusted solutions for a corrected f3 accordingly.


I tested my proposed corrections by applying them to populations that contain individuals with and without recent histories of genetic admixture. After correcting for the proportion of admixture in the population, I compared this corrected outgroup-f3 statistic to the outgroup-f3 value calculated for the original unadmixed population. The goal of this work is to have a corrected statistic that one can apply to two populations, independent of admixture proportions. Ultimately, this will help us to better understand the evolutionary histories of populations. Moreover, a corrected statistic will aid other researchers as they analyse demographic histories further in the past.


Background

F statistics were first proposed in the paper, “Reconstructing Indian population history”, published in Nature in 2009. In this paper, Reich and colleagues outline the way f2, f3, and f4 statistics can be used to measure genetic drift between two, three, and four taxa respectively. The f3 statistic proposed in this paper is useful for detecting admixture between groups. To summarise, the f3 statistic assumes a null hypothesis of no admixture, which implies a nonnegative f3 statistic. F3 is best used to detect admixture when the time between original split and secondary contact is large, coalescence before admixture is unlikely, and the admixture proportion is close to 50%.

In regard to f3 statistics in particular, Reich et al. propose an equation to be used to measure the genetic drift between three populations, Populations X, A, and B. This equation is, in a simplified form, $f_3 = (x - a)(x - b)$, where x, a, and b represent allele frequencies in their respective populations. By simplifying the equation, we see that there is a proportional relationship between the f3 statistic and the genetic drift between Populations A and X and Populations B and X. Genetic drift is defined to be the change in allele frequency along a graph edge on a phylogenetic tree. Phylogenetic trees are graphical representations of the genetic relationship between a group of individuals or populations based on physical or genetic characteristics. The length of the branches on the tree often represents the genetic distance, or number of genetic differences, between individuals or populations.

More specifically, the calculated f3 statistic is the product of the frequency differences between those populations. This test is useful to see if certain groups have inherited genes from different ancestries. When there has been no admixture, the f3 statistic is expected to be positive. When there has been admixture, the f3 statistic could be negative. Furthermore, lower f3 values are indicative of less closely related populations, whereas higher f3 values are indicative of more closely related populations. The farther apart two populations are, the smaller the two terms (x-a and x-b) in the equation, and therefore, the lower the f3 statistic. Similarly, when two populations are closer together, the two terms in the equation are larger, resulting in a larger f3 statistic.

To better understand what the f3 statistic can be used for, we refer to the figure below.

[Figure: tree with leaves PA, PB, and PX]

Here we see that there are two populations that are closer together (Populations A and B), than they are to the third population (Population X). In the context of the equation, $f_3 = (x - a)(x - b)$, we see that we are comparing the allele frequencies in Populations A and B, in relation to the allele frequencies in Population X. If we see how far Population A’s allele frequencies are from Population X’s allele frequencies and compare this to the distance between Population B’s allele frequencies and Population X’s, we can evaluate the genetic distance between Population A and Population B. To think about this in a different way, by subtracting out Population A’s allele frequencies from those of Population X, we are seeing how much longer or shorter one branch length is compared to the other. Doing so enables us to analyse the distance of each of the three populations in relation to the vertex that connects all three of them. However, if we have an unknown Population Y that integrates its DNA into both Population A and Population B, it would appear that these two populations are closer genetically than one would expect. In terms of the equation, this would make both terms (x-a) and (x-b) increase or decrease together. As such, the resulting f3 value will be inordinately higher or lower. This is an interesting result if one is concerned with the relationship of Population Y to Populations A and B. However, if you are interested in the genetic relationship of Populations A and B before their admixture with Population Y, this can be a confounding factor.
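The simplified form of the statistic can be illustrated with a few lines of Python. This is a minimal sketch, not the popstats implementation: it averages the per-SNP product (x - a)(x - b) over a handful of made-up allele frequencies, and it models admixture from Population Y as a simple linear mixing of allele frequencies. The function names `f3` and `admix`, the mixing model, and all numbers are assumptions introduced here for illustration.

```python
def f3(x, a, b):
    """Average of (x - a) * (x - b) over SNPs.

    x, a, b are equal-length lists of allele frequencies for
    Populations X, A, and B (one entry per SNP).
    """
    if not x or not (len(x) == len(a) == len(b)):
        raise ValueError("x, a, and b must be non-empty and the same length")
    return sum((xi - ai) * (xi - bi) for xi, ai, bi in zip(x, a, b)) / len(x)


def admix(recipient, donor, p):
    """Mix donor allele frequencies into the recipient at proportion p.

    Assumption for illustration only: admixture is modelled as a
    weighted average of allele frequencies.
    """
    return [(1 - p) * r + p * d for r, d in zip(recipient, donor)]


# Hypothetical allele frequencies at four SNPs.
x = [0.10, 0.50, 0.90, 0.30]  # Population X
a = [0.40, 0.70, 0.60, 0.20]  # Population A
b = [0.50, 0.80, 0.50, 0.10]  # Population B
y = [0.90, 0.10, 0.20, 0.80]  # unknown Population Y

before = f3(x, a, b)
# Both A and B receive 30% of their ancestry from Y.
after = f3(x, admix(a, y, 0.3), admix(b, y, 0.3))
print(before, after)
```

In this toy example the statistic changes once both populations receive ancestry from Y, even though the underlying relationship between A and B is untouched, which is the confounding effect described above.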

Nick Patterson was able to work through more of the math behind the F statistics tests, which he documented in his paper “Ancient Admixture in Human History,” published in Genetics in 2012. He also discusses the outgroup case, which is further discussed in Maanasa Raghavan’s paper, “Upper Palaeolithic Siberian genome reveals dual ancestry of Native Americans,” published in Nature in 2014. In this paper, the concept of outgroup-f3 statistics is introduced. Outgroup-f3 statistics involve comparing two populations to a third, “outgroup,” population, which is equally genetically removed from the other two populations. By doing so, the outgroup population serves as a reference group for measuring genetic relatedness of the populations in question. So instead of looking for admixture between Population X and the other populations, the outgroup-f3 statistic is a measure of the genetic similarity between Populations A and B.

In Benjamin Peter’s paper, “Admixture, Population Structure, and F-Statistics,” he provides a clear overview of F and D statistics (Genetics, 2017). He also makes the point that f3 statistics can be used as a test for admixture, not just for how closely related two populations are. He also points out that in the history of humans, many of the calculated f3 values are negative, which could show that population phylogenies are not always the best way to discuss human evolution.


F3 statistics have been useful in answering a variety of genetic relatedness questions and are widely used in the field of human population genetics and evolutionary biology more broadly. For instance, outgroup-f3 statistics were used to test the relatedness of Levantine and southern Arabian populations to African populations along the Northern and Southern Dispersal Routes out of Africa. Humans evolved in Africa over the past 2 million years. A major dispersal of humans out of Africa occurred around 50 thousand years ago and led to the majority of human genetic variation we see across the world today. Anthropologists and geneticists have long debated whether the primary route out of Africa was the Northern Route or the Southern Route. In “Testing support for the northern and southern dispersal routes out of Africa: an analysis of Levantine and southern Arabian populations,” Vyas and colleagues attempted to answer that question using f3 statistics (American Journal of Physical Anthropology, 2017). The Northern Dispersal Route led into the Levant, whereas the Southern Dispersal Route led into southern Arabia. By using f3 statistics to see how linked the populations were pairwise, it was found that neither dispersal route was favoured over the other. The Mbuti, a group of people currently living in central Africa, was used as the outgroup population for this test. The results showed that both the Levantine and Arabian populations were equally related to the African population.

The f3 test was taken further and used to show that both the Levantine and Arabian populations shared relatively similar relatedness to non-African populations. Within each region, some groups had more sub-Saharan ancestry, which led to lower f3 values. Another reason for a lower f3 statistic could be an earlier divergence from non-African populations, which would be useful in determining which route was used by earlier populations. The statistic was used to show that both populations were generally equally related to all the African populations as well. Therefore, the researchers were not able to distinguish which dispersal route was used more.

The f3 statistics have also been used in exploring the relatedness of various subspecies of grapes. In contrast to the previous example of outgroup-f3 statistics, this test used normal f3 statistics to see what sort of admixture has occurred in the history of the grape. While this involves understanding how related two species of grapes are, the primary purpose of this study was to see how to maximally utilise the genetic diversity of grapes. The grape’s history of domestication began around 6000-8000 years ago, when the domesticated grape, Vitis vinifera vinifera, was cultivated from the wild grape, Vitis vinifera sylvestris. The f3 statistic was used to test for mixture between vinifera west, vinifera east, and sylvestris west (f3 = -0.00481); f3 statistics were also used to test for mixture between sylvestris west, vinifera west, and sylvestris east (f3 = 0.0268).

The researchers found that western vinifera is most likely a combination of eastern vinifera and western sylvestris. Nonetheless, there does not seem to be a genetic transfer between western vinifera and western sylvestris. This supports the view that vinifera originated in the Near East and underwent introgression from wild sylvestris in Europe. This analysis found that little of the potential genetic diversity of the grape has been explored. The researchers use this finding to suggest that to overcome the grape’s significant pathogen pressures, its genetic diversity must be utilised to its advantage. The domesticated grape contains genetic variation much larger than that of humans, thus making it ideal to manipulate for its polymorphisms and genetic diversity.


Project

The goal of this project is to correct for admixture when calculating the outgroup-f3 statistic so it is an accurate measure of genetic relatedness. I first proposed a correction similar to that used by Lindo et al. for the D statistic.

The D statistic can be used to test for admixture across four populations. In his paper, “Ancient individuals from the North American Northwest Coast reveal 10,000 years of regional genetic continuity,” John Lindo proposed a contamination correction to account for similar admixture histories for this statistic (Proceedings of the National Academy of Sciences of the United States of America, 2017). The contamination correction factor Lindo proposes is based on contamination of an ancient genome with modern DNA from a distantly related population, though the one we propose for f3 statistics will be based on the level of artificially induced admixture. Nonetheless, Lindo used a corrected formula to calculate a new D statistic, with admixture corrected for using the contamination correction. $D_{\text{Shuká Káa}}$ is the contaminated sample’s D statistic; $D_{\text{GBR}}$ is the D statistic, substituting an individual representative of the population that contaminated Shuká Káa; c is the contamination rate. For the f3 statistic, this equation would look like $f_3^* = \frac{f_3 - a \cdot f_{3a}}{1 - a}$, where f3 is the contaminated sample’s f3 statistic, f3a is the f3 statistic with an outgroup as the population that contaminated the original group, and a is the admixture proportion.
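The f3 analogue of the contamination correction is a one-line computation. The sketch below is only an illustration of that formula, reading it as the difference f3 - a * f3a divided by 1 - a; the function name `contamination_corrected_f3` and the example numbers are hypothetical and are not taken from the dataset.

```python
def contamination_corrected_f3(f3, f3a, a):
    """Return (f3 - a * f3a) / (1 - a).

    f3  : f3 statistic of the admixed sample (Yoruba as outgroup)
    f3a : f3 statistic with the contaminating population as outgroup
    a   : admixture proportion, 0 <= a < 1
    """
    if not 0 <= a < 1:
        raise ValueError("admixture proportion must satisfy 0 <= a < 1")
    return (f3 - a * f3a) / (1 - a)


# Hypothetical values, chosen only to show the arithmetic.
print(contamination_corrected_f3(0.2, 0.1, 0.5))
```

With no admixture (a = 0) the function returns the input f3 unchanged, and the correction grows as a approaches 1, where the formula is undefined.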


Dataset

Our research group utilised population data from North and South American indigenous populations. The first step of this project was to gather usable SNPs by removing SNPs that were missing in more than 90% of the population and pruning SNPs based on linkage disequilibrium. I next used the ADMIXTURE program to identify individuals with evidence of European admixture. Populations were then split into three groups: those that had no evidence of European admixture (Cabecar, Mixe, Surui, GuaraniKW, Xaltocan, and Xavante), those where a number of individuals were admixed and a number were not (JaltocanHidalgo, Pima, Xaltocan), and those where the entire population had European admixture (AleutRaff, Algonquin, Cree, Chipewyan, Inupiat, Ojibwa, and Southern US Native American).

| Populations into which Admixture was Artificially Introduced | Population with Admixed and Unadmixed Individuals | Admixed Populations on which to Test Correction |
|---|---|---|
| Cabecar | JaltocanHidalgo | AleutRaff |
| Mixe | Pima | Algonquin |
| Surui | Xaltocan | Cree |
| GuaraniKW | | Chipewyan |
| Xaltocan | | Inupiat |
| Xavante | | Ojibwa |
| | | Southern US Native American |


Methods

For my project, I used six completely unadmixed human populations from North and South America: Cabecar, Mixe, Surui, GuaraniKW, Xaltocan, and Xavante. I introduced artificial admixture from a European population in constant 5% intervals from 5% to 95%. This was done via a program in R that arbitrarily replaced 5 to 95% of the population’s genome with the corresponding segment of a European genome. Below is an example of the code used to induce admixture in the population Cabecar using a for-loop.

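    # Admixture proportions from 5% to 95% in 5% steps
    ADM=(0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95)
    # Run admixer.R once per proportion, writing one VCF per admixture level
    for j in "${ADM[@]}"; do
        Rscript admixer.R --file ./final_dataset_cleanest2.vcf --donor Spanish --recip Cabecar --p "$j" --subs 5 --out "final_admix_Cabecar_$j.vcf"
    done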

After simulating admixture in these populations, I obtained outgroup-f3 values for each of these populations and each of the admixture levels within them using the program popstats, with the Yoruba population, a west African group assumed to be equally distantly related to all these populations, as the outgroup. This outgroup serves as a reference group to compare the desired population and the ingroup to. I also obtained a second f3 statistic with the English population in place of the Yoruba population as the outgroup. Karitiana was used as the ingroup for both tests. Then, we can see how increased admixture affects the statistic. This was done using the commands below, where j spanned the admixture proportions mentioned previously:

    # f3 with Yoruba as the outgroup; ${j} is braced so the shell does not read "j_f3" as the variable name
    python ~/Desktop/project/bin/popstats/popstats.py --file final_admix_Cabecar_${j} --f3 --pops C,Karitiana,Yoruba --informative > final_admix_Cabecar_${j}_f3.txt
    # f3 with English as the outgroup
    python ~/Desktop/project/bin/popstats/popstats.py --file final_admix_Cabecar_${j} --f3 --pops C,Karitiana,English --informative > final_admix_Cabecar_${j}_f3a.txt

Comparing these values to the admixture levels, I was able to re-evaluate the suggested solution as needed. Then by getting an f3 statistic for these populations and setting the outgroup as the population assumed to have contaminated them (English population), I calculated a new f3 statistic, which was hopefully corrected for admixture.

To further test if this correction worked, I took populations that contained individuals with and without admixed genomes. By correcting for the portion of the population that was admixed, I saw if this corrected f3 statistic matched the unadmixed portion’s f3 statistic. I did this in individuals from the JaltocanHidalgo, Pima and Xaltocan populations. I then computed a baseline f3 statistic comparing the whole populations, with Karitiana as the ingroup, and Yoruba as the outgroup. After doing so, I got an f3 statistic from the admixed individuals in these populations in relation to Yoruba, and then got an f3 statistic from the admixed individuals in these populations in relation to an English population.

If the f3 statistic was successfully corrected, we could make inferences about the genetic histories of other contaminated populations. I then applied the f3 statistic to the populations AleutRaff, Algonquin, Cree, Chipewyan, Inupiat, Ojibwa, and Southern US Native American. I obtained the admixture proportion from the amount of European DNA in these individuals. Then I corrected the f3 statistic by getting an f3 using Yoruba first, and then using English ancestry to compare their genomes to.

Based on preliminary results, the solution could take the form of a corrected equation for outgroup-f3 statistics. On the other hand, it could start with an equation to get a corrected f3 value, which is then manipulated further. This is where I could come up with a table of values that correspond to different levels of admixture. These “differences” between the semi-corrected f3 and the baseline f3 are then to be subtracted from the semi-corrected f3. Since others attempting to use this correction will not have a baseline f3 for comparison, our goal is to come up with a universal set of differences that can be used depending solely on the admixture levels.


Correction Attempt 1

Using the two f3 statistics, I posited a correction equation to get the corrected f3 values to look similar to the baseline f3 values when graphed. A correction similar to that proposed for the D statistic by Lindo was attempted first. However, this was unsuccessful. A new equation was then suggested and tested. This equation took the form of $(f_3 - f_{3a}) \cdot a + f_3$, where f3 was the statistic calculated with Yoruba as the outgroup, f3a was the statistic calculated with English as the outgroup, and a was the admixture proportion that we introduced into the population. Using these values, I graphed the relation between admixture proportion and the corrected f3 statistic. All the populations’ graphs exhibited similar trends. Below is a graph using Cabecar’s f3 values to be used as a reference.

[Figure: Cabecar. Y-axis: f3 values; x-axis: admixture proportion; series: corrected f3 values, baseline f3 values, linear trend line (baseline f3 values)]

Correction Attempt 2

Clearly, the two sets of points are not that similar. As such, I attempted to again correct the equation. Looking at the trend of f3 values dipping around 20-40% admixture levels, it seemed that perhaps I was overcorrecting the f3 values by using f3 values that change with the admixture proportion. As such, I proposed the following equation instead: $(f_3 - f_{3a}) \cdot a + f_{3,\text{baseline}}$, where f3baseline was the value calculated for each of the populations when there was no artificial admixture introduced. This appeared to at least present a better correlation between admixture and corrected f3 values when graphed. Below is a graph of the newly corrected f3 values plotted against admixture proportions again.
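The two candidate equations from Correction Attempts 1 and 2 can be written as small functions. This is a sketch under one assumption about grouping: both are read as the difference (f3 - f3a) multiplied by the admixture proportion a, plus a final term. The function names and the example numbers are hypothetical.

```python
def correction_attempt_1(f3, f3a, a):
    """(f3 - f3a) * a + f3, with f3 measured against Yoruba and f3a against English."""
    return (f3 - f3a) * a + f3


def correction_attempt_2(f3, f3a, a, f3_baseline):
    """(f3 - f3a) * a + f3_baseline.

    f3_baseline is the f3 value of the same population with no
    artificial admixture, so this form needs the baseline as an input.
    """
    return (f3 - f3a) * a + f3_baseline


# Hypothetical values, chosen only to show the arithmetic.
print(correction_attempt_1(0.2, 0.1, 0.5))
print(correction_attempt_2(0.2, 0.1, 0.5, 0.23))
```

At a = 0 the first function returns f3 itself and the second returns the baseline, so the second form can only be evaluated when an unadmixed baseline is already known.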

[Figure: Cabecar. Y-axis: f3 values; x-axis: admixture proportion; series: corrected f3 values, baseline f3 values, linear trend line (baseline f3 values)]

These new f3 values look relatively linear, and as such, I seemed to be on the right track. To further correct the f3 values, I attempted to find the difference between the newly corrected f3 values and the baseline f3 values. I did this for each population, and then found the averages of the differences for each admixture proportion. The table below lists the results.

| Admixture Proportion | Average Differences |
|---|---|
| 0.05 | 0.004970747 |
| 0.1 | 0.010104352 |
| 0.15 | 0.015366869 |
| 0.2 | 0.020830249 |
| 0.25 | 0.026399174 |
| 0.3 | 0.032173864 |
| 0.35 | 0.038141322 |
| 0.4 | 0.044222014 |
| 0.45 | 0.050421594 |
| 0.5 | 0.056698452 |
| 0.55 | 0.063145399 |
| 0.6 | 0.06952562 |
| 0.65 | 0.076407052 |
| 0.7 | 0.083150127 |
| 0.75 | 0.089963764 |
| 0.8 | 0.097164007 |
| 0.85 | 0.104699337 |
| 0.9 | 0.111876887 |
| 0.95 | 0.119257046 |

I then plotted the admixture proportions and the average differences, as they looked quite similar. I hoped to see if there was a correlation using a linear relationship. The R² value was 0.9973, indicating that there is a significant relationship between these two values. Thus, I attempted to use the equation for the linear regression line as a correction for the f3 values. I used the values that I had corrected using the equation $(f_3 - f_{3a}) \cdot a + f_{3,\text{baseline}}$ and then subtracted the difference, calculated using the following equation: $y = 0.1272x - 0.005$. Given a certain admixture proportion, I would plug that value in for x in the equation to get the difference to be subtracted from the corrected f3 value. This resulted in a parabolic looking graph of the f3 values plotted against the admixture proportion, shown below (again with the baseline f3 values plotted as a reference for the desired values).
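The linear regression on the average differences can be reproduced from the tabulated values with an ordinary least-squares fit. The sketch below uses plain Python rather than the software used for the thesis, which the text does not name; the helper names `linear_fit` and `predicted_difference` are introduced here for illustration.

```python
# Average differences from the table in the text, for admixture 0.05 to 0.95.
AVERAGE_DIFFERENCES = [
    0.004970747, 0.010104352, 0.015366869, 0.020830249, 0.026399174,
    0.032173864, 0.038141322, 0.044222014, 0.050421594, 0.056698452,
    0.063145399, 0.06952562, 0.076407052, 0.083150127, 0.089963764,
    0.097164007, 0.104699337, 0.111876887, 0.119257046,
]
ADMIXTURE = [0.05 * i for i in range(1, 20)]


def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


slope, intercept, r_squared = linear_fit(ADMIXTURE, AVERAGE_DIFFERENCES)


def predicted_difference(admixture):
    """Difference to subtract from the corrected f3 at a given admixture proportion."""
    return slope * admixture + intercept


print(slope, intercept, r_squared)
```

The fitted slope, intercept, and R² should come out close to the reported 0.1272, -0.005, and 0.9973. The residuals are positive at both ends of the range and negative in the middle, which is consistent with the parabolic shape described for the resulting graph.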

[Figure: F3 Versus Admixture. Y-axis: f3 values; x-axis: admixture proportion; series: corrected f3 values, baseline f3 values, upper bound, lower bound, linear trend line (baseline f3 values)]

Clearly, this was not an ideal correction of the f3 values again. I attempted to put this in the perspective of the confidence intervals of the baseline f3 values. As such, the upper bound and lower bound representative of one standard deviation above and below the baseline f3 are shown on the graph (the standard deviation was calculated by the popstats program used to get the baseline f3 value). Therefore, I attempted to fix the regression equation we had gotten from the average differences. As such, I used the following power equation instead: y = 0.122x^[...] [...], indicating that this equation might work as a correction. Nonetheless, once I used this equation with the different admixture proportions to subtract from the corrected f3 values, I still had a graph that did not look ideal (below).

[Figure: F3 Versus Admixture. Y-axis: f3 values; x-axis: admixture proportion; series: corrected f3 values, baseline f3 values, upper bound, lower bound, linear trend line (baseline f3 values)]

Correction Attempt 3

Then, I attempted to just use the average differences to subtract from the corrected f3. I hoped to get these differences for more admixture values where the linear regression line/power line did not match the data well, if this attempt worked. I calculated new f3 values with this correction and got the following graph.

[Figure: F3 Versus Admixture. Y-axis: f3 values; x-axis: admixture proportion; series: corrected f3 values, baseline f3 values, upper bound, lower bound, linear trend line (baseline f3 values)]


This graph clearly looked a lot better than previous attempts. Furthermore, it was the only solution thus far that yielded corrected f3 values within the bounds of one standard deviation above and below the baseline f3 statistic. Nonetheless, it was not a perfect fit.

To make this graph even better, I got intervals that were closer together (intervals of 1% admixture) between 75 and 85% of admixture. This was an area that looked to have a large degree of variance between the baselines and the corrected f3 values. As such, if these new differences that were calculated were better indicators of the difference to subtract from the corrected f3, then I could use these values for the correction.

After finding intervals that were closer together, I noticed that this did not significantly impact the correction factor. As such, I tried to use a second order polynomial equation, and got the highest R² value yet (R² = 0.99999). Below is the graph when using the quadratic equation to correct the f3 values to baseline.

[Figure: F3 versus admixture. Y-axis: f3 values; x-axis: admixture proportion; series: corrected f3 values, baseline f3, upper bound, lower bound, linear trend line (baseline f3)]

When I continued with my results, I quickly ran into a snag. I had used the baseline f3 to find a correction equation to get to the baseline f3. In other words, I used the result to force the desired result. However, I was unable to factor out the baseline f3 values to get a correction independent of them.


Correction Attempt 4

As such, I was back to square one and attempted to work with the initial correction equation for f3, $(f_3 - f_{3a}) \cdot a + f_3$. I then went back and got the differences between the baseline f3 values and these f3 values. After doing this, I plotted the baseline f3 values against the “corrected” f3 values. There appeared to be a fairly linear trend amongst the f3 values using the equation above, across all six populations. I also noticed that all the f3 values were less than the baseline f3, which reinforced the trend of decreasing f3 values with increased admixture levels. Below is a sample graph from the population Cabecar (with the baseline f3 values in orange, and the preliminarily corrected f3 values in blue). The equation given is for the linear regression line for the preliminarily corrected f3 values.

[Figure: Baseline Versus F3 Values. Y-axis: f3 values; x-axis: admixture proportion; series: f3 values with linear regression line y = -0.1036x + 0.221 (R² = 0.99645), baseline f3 values]

This led us to believe that we could use the differences between the baseline and the preliminarily corrected f3. After doing this for the six populations, I got the average of the six differences for each admixture level. For instance, I got the average difference for an admixture proportion of 5% across all six populations. After doing so, I used the average differences to get a newly corrected f3 by adding them to the preliminarily corrected f3. I noticed that these new f3 were relatively similar to the baseline f3, though they were not ideal. As such, I decided it would be beneficial to get a 95% confidence interval for the differences, to see if these confidence intervals of differences would give us something in an appropriate range around the baseline when added to the preliminarily corrected f3.

To do so, I wanted to use a t-test, but the data was not approximately normally distributed. Therefore, I used a Wilcoxon signed-rank test, which is a non-parametric statistical hypothesis test that allows us to perform a version of the t-test without normally distributed data. It is often referred to as the Wilcoxon T Test. Upon doing so in R, I noticed that the 95% confidence intervals for the differences for each admixture level across the six populations would give us a range of differences. Below is a table of these confidence intervals.

| Admixture Proportion | Wilcox Confidence Intervals |
|---|---|
| 0.05 | (0.00750608, 0.02297737) |
| 0.1 | (0.01441939, 0.03037686) |
| 0.15 | (0.02137041, 0.03316990) |
| 0.2 | (0.02569456, 0.04308678) |
| 0.25 | (0.03300473, 0.04942423) |
| 0.3 | (0.03840717, 0.05224841) |
| 0.35 | (0.04300735, 0.05772577) |
| 0.4 | (0.04815174, 0.06442764) |
| 0.45 | (0.05378068, 0.06854280) |
| 0.5 | (0.06017016, 0.07539880) |
| 0.55 | (0.06489406, 0.08105843) |
| 0.6 | (0.07016284, 0.08537045) |
| 0.65 | (0.07382676, 0.09035722) |
| 0.7 | (0.07968564, 0.09463120) |
| 0.75 | (0.08441375, 0.10056375) |
| 0.8 | (0.08823818, 0.10527647) |
| 0.85 | (0.09257367, 0.11016625) |
| 0.9 | (0.09765934, 0.11534444) |
| 0.95 | (0.1012127, 0.1194585) |
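A confidence interval tied to the Wilcoxon signed-rank test is commonly built from Walsh averages, the pairwise means of the observations. The sketch below shows that construction in plain Python. It is not the R call used for the thesis, whose exact arguments are not given in the text, and the six input values are hypothetical stand-ins for the per-population differences at one admixture level.

```python
from itertools import combinations_with_replacement


def signed_rank_cdf(n):
    """Exact null CDF of the signed-rank statistic: cdf[s] = P(W <= s)."""
    total = n * (n + 1) // 2
    counts = [0] * (total + 1)  # counts[s] = subsets of {1..n} summing to s
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(total, rank - 1, -1):
            counts[s] += counts[s - rank]
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / 2 ** n)
    return cdf


def wilcoxon_ci(values, alpha=0.05):
    """Interval from ordered Walsh averages at exact level at least 1 - alpha."""
    walsh = sorted((u + v) / 2
                   for u, v in combinations_with_replacement(values, 2))
    cdf = signed_rank_cdf(len(values))
    k = 0
    while k < len(cdf) and cdf[k] <= alpha / 2:
        k += 1
    k = max(k, 1)  # with very small samples, fall back to the extremes
    return walsh[k - 1], walsh[len(walsh) - k]


# Hypothetical differences for six populations at one admixture level.
differences = [0.0075, 0.0121, 0.0150, 0.0183, 0.0204, 0.0230]
print(wilcoxon_ci(differences))
```

Under this construction, six observations give an interval that runs from the smallest to the largest value, so with only six populations the interval is as wide as the data allow. This fits the later remark that more data would likely shorten the intervals.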

When the uncorrected f3 values were added to the lower and upper bounds of the confidence intervals, I got intervals for newly corrected f3 values. Once I did this, I noticed that this interval of f3 values included the baseline f3 values. At first, I hoped to get the baseline f3 values to align with the newly corrected f3 values when using one standard deviation above and below the baseline in conjunction with the confidence interval of newly corrected f3 values. However, the correction using these differences worked well enough that we did not need to consider one standard deviation above and below the baseline f3. Simply using the confidence intervals for the differences to get confidence intervals for corrected f3 values was sufficient as a correction.

I then applied this correction to the natural populations, JaltocanHidalgo, Pima, and Xaltocan. I did so by rounding the admixture proportion for these populations to the nearest five hundredths, such that I would be able to use the differences (since we only had these for admixtures that were multiples of 0.05). Upon doing so, I used the confidence intervals for the differences and added the lower and upper bounds to the initial, uncorrected f3 value. Once I did this, I noticed that the baseline f3 statistic fell in this range of new f3 values in the Pima population and in the Xaltocan population. However, this correction did not work for JaltocanHidalgo. The range of new f3 values ended up being (0.247343959, 0.262815249), whereas the baseline f3 value was 0.227966338.
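The lookup procedure described here (round the admixture proportion to the nearest 0.05, then add the interval bounds to the uncorrected f3) can be sketched as follows. The interval table is copied from the text. The function names are hypothetical, and the uncorrected f3 in the usage lines is not reported in the text: it is back-calculated here as 0.247343959 - 0.00750608, on the assumption that the 0.05 row was the one applied to JaltocanHidalgo (the width of the reported range matches that row).

```python
import math

# 95% intervals for the differences, keyed by the number of 5% steps
# (1 -> 0.05, 2 -> 0.10, ..., 19 -> 0.95), copied from the table in the text.
DIFFERENCE_CI = {
    1: (0.00750608, 0.02297737), 2: (0.01441939, 0.03037686),
    3: (0.02137041, 0.03316990), 4: (0.02569456, 0.04308678),
    5: (0.03300473, 0.04942423), 6: (0.03840717, 0.05224841),
    7: (0.04300735, 0.05772577), 8: (0.04815174, 0.06442764),
    9: (0.05378068, 0.06854280), 10: (0.06017016, 0.07539880),
    11: (0.06489406, 0.08105843), 12: (0.07016284, 0.08537045),
    13: (0.07382676, 0.09035722), 14: (0.07968564, 0.09463120),
    15: (0.08441375, 0.10056375), 16: (0.08823818, 0.10527647),
    17: (0.09257367, 0.11016625), 18: (0.09765934, 0.11534444),
    19: (0.1012127, 0.1194585),
}


def corrected_f3_interval(f3_uncorrected, admixture):
    """Add the interval bounds for the rounded admixture level to the f3 value."""
    steps = math.floor(admixture * 20 + 0.5)  # nearest multiple of 0.05
    if steps not in DIFFERENCE_CI:
        raise ValueError("admixture proportion rounds outside 0.05 to 0.95")
    low, high = DIFFERENCE_CI[steps]
    return f3_uncorrected + low, f3_uncorrected + high


def covers(interval, baseline):
    """True if the baseline f3 lies inside the corrected interval."""
    return interval[0] <= baseline <= interval[1]


# JaltocanHidalgo: the uncorrected f3 is inferred, not reported (see above).
f3_inferred = 0.247343959 - 0.00750608
interval = corrected_f3_interval(f3_inferred, 0.05)
print(interval, covers(interval, 0.227966338))
```

Run on these numbers, the interval matches the reported range and does not contain the reported baseline, which is the failure described for JaltocanHidalgo.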

Regardless, I then applied this correction to the populations that had admixture, AleutRaff, Algonquin, Cree, Chipewyan, Inupiat, Ojibwa, and Southern US Native American. I used the confidence intervals for the differences again and rounded the admixture proportion for each population to the nearest five hundredths. Upon doing so, I calculated an interval of f3 values that the baseline f3 is presumed to fall in.

To see if I was able to get a better correction, I plotted the average differences. I was able to use a polynomial regression line since the R² values were all above 0.99993. I then got the equation for this curve, which I then used to get a value (using the admixture proportion as the x value) to add to the preliminarily corrected f3. This results in f3 values that are similar to the f3 values I got from merely adding back in the average difference for the admixture proportions in 5% increments. However, they do not fall within one standard deviation of the baseline f3 values, just as adding the differences at 5% admixture increments did not yield f3 values that fell within that range either.

As such, I plotted the lower bounds and upper bounds of the 95% Wilcox confidence intervals separately and found regression lines for each. I found that second-order polynomial equations fit the data best (highest R² value) and was able to use these equations to add back in the difference to the baseline f3 value. This allowed a continuous correction of the f3 statistic, rather than just at discrete admixture intervals of 5%.


Conclusion

Through the course of this research project, I have developed a crude admixture correction for the outgroup-f3 statistic. By first finding the f3 value of the contaminated population, a “correction factor” can be added back in to bring that value within a ballpark around the baseline f3 statistic. This correction factor comes in the form of a lower bound quadratic equation and an upper bound quadratic equation. When both of these are added to the f3 statistic, the result is a range of f3 values. Comparing these results to the baseline f3 statistics, I conclude that this correction works within a margin of error. Since the correction only worked in two out of the three populations with admixed and unadmixed individuals, we cannot conclude irrefutably that this correction works.

Nonetheless, the correction worked for all admixture levels in all six of the artificially admixed populations (6 × 19 = 114 cases). Therefore, I applied the correction to the seven populations that were completely admixed with European DNA. This resulted in a range of f3 values that resembled appropriate f3 values. However, there is no way to check for which of these seven populations the correction actually worked.
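
The case count and the notion of the correction "working" can be made concrete with a short sketch. It assumes that the 19 admixture levels are 5% through 95% in 5% steps and that a case counts as a success when the baseline f3 lies inside the corrected range; both are assumptions for illustration, and the example cases are made-up numbers.

```python
def admixture_levels(step_pct=5):
    """Admixture levels in percent, from step_pct up to (but excluding) 100."""
    return list(range(step_pct, 100, step_pct))


def coverage(cases):
    """Count the cases whose baseline f3 lies inside the corrected range.
    `cases` is a list of (baseline_f3, low, high) tuples."""
    hits = sum(1 for baseline, low, high in cases if low <= baseline <= high)
    return hits, len(cases)


levels = admixture_levels()
n_populations = 6
print(len(levels), n_populations * len(levels))

# Hypothetical cases (baseline f3, corrected low, corrected high).
example_cases = [
    (0.2015, 0.2013, 0.2016),
    (0.1950, 0.1940, 0.1960),
    (0.2100, 0.2050, 0.2090),  # baseline outside the range
]
print(coverage(example_cases))
```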

In the future, researchers might be able to fine-tune our correction using data from more populations. For instance, our confidence intervals for the Wilcoxon signed-rank test would likely span a shorter range if there were more data to pull from. Furthermore, it is possible that researchers might be able to further manipulate the postulated equations mentioned previously. Given that Lindo and colleagues were able to find a neat correction equation for the D statistic, it is possible that one exists for the f3 statistic as well. It was also observed during this project that certain of the suggested corrections worked better at lower admixture proportions. Just as the normal f3 statistic is most accurate under certain conditions, one of which is that the admixture proportion be close to 50%, it is possible that the outgroup-f3 statistic works best at lower admixture proportions.

Regardless, this correction is useful for researchers hoping to study the genetic relatedness of different populations. In particular, this potential solution is most useful for those hoping to perform outgroup-f3 statistics in populations that have individuals with genetic admixture.


Glossary

• Admixture: genetic interaction of two or more previously isolated populations interbreeding
• D statistic: a four-population test for admixture
• F statistic: measures shared genetic drift between sets of populations
  o Normal f3 statistic: tests for admixture between three populations
  o Outgroup-f3 statistic: proportional to amount of shared genetic history between two populations
• For-loop: a control flow statement that specifies iteration to execute code repeatedly
• Genetic drift: the change in allele frequencies in a population over generations as a mechanism of evolution
• Genetic relatedness: probability that two individuals share an allele from common ancestry
• Linkage disequilibrium: non-random association of alleles at various loci
• Outgroup: reference group of organisms not in the populations being studied
• Phylogenetic trees: branching diagrams representing evolutionary relationships amongst organisms
• SNPs: single nucleotide polymorphisms; change in a single nucleotide at a specific genome position


Bibliography

Alexander, David H., et al. “Fast Model-Based Estimation of Ancestry in Unrelated Individuals.” Genome Research, Cold Spring Harbor Lab, 22 June 2009, genome.cshlp.org/content/early/2009/07/31/gr.094052.109.

Lindo, John, et al. “Ancient Individuals from the North American Northwest Coast Reveal 10,000 Years of Regional Genetic Continuity.” PNAS, National Academy of Sciences, 18 Apr. 2017, www.pnas.org/content/114/16/4093.

Myles, Sean, et al. “Genetic Structure and Domestication History of the Grape.” PNAS, National Academy of Sciences, 1 Mar. 2011, www.pnas.org/content/108/9/3530.abstract.

Patterson, Nick, et al. “Ancient Admixture in Human History.” Genetics, Genetics, 1 Nov. 2012, www.genetics.org/content/192/3/1065.

Peter, Benjamin M. “Admixture, Population Structure, and F-Statistics.” Genetics, Genetics, 1 Apr. 2016, www.genetics.org/content/202/4/1485.

Pontussk. “Pontussk/Popstats.” GitHub, GitHub, 30 July 2015, github.com/pontussk/popstats.

Raghavan, Maanasa, et al. “Upper Palaeolithic Siberian Genome Reveals Dual Ancestry of Native Americans.” Nature, Macmillan Publishers Limited, 2 Jan. 2014, www.academia.edu/7110954/Upper_Palaeolithic_Siberian_genome_reveals_dual_ancestry_of_Native_Americans.

Reich, David, et al. “Reconstructing Indian Population History.” Nature, U.S. National Library of Medicine, 24 Sept. 2009, www.ncbi.nlm.nih.gov/pmc/articles/PMC2842210/.

Vyas, Deven N., et al. “Testing Support for the Northern and Southern Dispersal Routes out of Africa: an Analysis of Levantine and Southern Arabian Populations.” American Journal of Physical Anthropology, Wiley-Blackwell, 15 Sept. 2017, onlinelibrary.wiley.com/doi/10.1002/ajpa.23312/full.