*Correspondence: jeff.wall@ucsf.edu

Evaluating the quality of the 1000 Genomes Project data

Saurabh Belsare¹, Michal Sakin-Levy², Yulia Mostovoy², Steffen Durinck³, Subhra Chaudhry³, Ming Xiao⁴, Andrew S. Peterson³, Pui-Yan Kwok¹,²,⁵, Somasekar Seshagiri³ and Jeffrey D. Wall¹,⁶,*

¹ Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, 94143, USA, ² Department of Dermatology, University of California, San Francisco, San Francisco, CA, 94143, USA, ³ Department of Molecular Biology, Genentech Inc., 1 DNA Way, South San Francisco, CA, 94080, USA, ⁴ School of Biomedical Science, Engineering, and Health Systems, Drexel University, Philadelphia, PA, 19104, USA, ⁵ Cardiovascular Research Institute, San Francisco, San Francisco, CA, 94143, USA, ⁶ Department of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA, 94143, USA

ABSTRACT

Data from the 1000 Genomes project is quite often used as a reference for human genomic analysis. However, its accuracy needs to be assessed to understand the quality of predictions made using this reference. We present here an assessment of the genotype, phasing, and imputation accuracy data in the 1000 Genomes project. We compare the phased haplotype calls from the 1000 Genomes project to experimentally phased haplotypes for 28 of the same individuals sequenced using the 10X Genomics platform. We observe that phasing and imputation for rare variants are unreliable, which likely reflects the limited sample size of the 1000 Genomes project data. Further, it appears that using a population specific reference panel does not improve the accuracy of imputation over using the entire 1000 Genomes data set as a reference panel. We also note that the error rates and trends depend on the choice of definition of error, and hence any error reporting needs to take these definitions into account.

INTRODUCTION

The 1000 Genomes Project (1KGP) was designed to provide a comprehensive description of human genetic variation through sequencing multiple individuals [1-3]. Specifically, the 1KGP provides a list of variants and haplotypes that can be used for evolutionary, functional and biomedical studies of human genetics. Over the three phases of the 1KGP, a total of 2504 individuals across 26 populations were sequenced. These populations were classified into 5 major continental groups: Africa (AFR), America (AMR), Europe (EUR), East Asia (EAS), and South Asia (SAS). The 1KGP data was generated using a combination of multiple sequencing approaches, including low coverage whole genome sequencing with mean depth of 7.4X, deep exome sequencing with a mean depth of 65.7X, and dense microarray genotyping. In addition, a subset of individuals (427) including mother-father-child trios and parent-child duos were deep sequenced using the Complete Genomics platform at a high coverage mean depth of 47X. The project involved characterization of biallelic and multiallelic SNPs, indels, and structural variants.

bioRxiv preprint doi: https://doi.org/10.1101/383950; this version posted August 3, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. It is made available under a CC-BY-ND 4.0 International license.

Given the low depth of (sequencing) coverage for most 1KGP samples, it is unclear how accurate the imputed haplotypes are, especially for rare variants. We quantify this accuracy directly by comparing imputed genotypes and haplotypes based on low-coverage whole-genome sequence data from the 1KGP with highly accurate, experimentally determined haplotypes from 28 of the same samples. Additional motivation for our study is given below.

Phasing: It is important to understand phase information in analyzing human genomic data. Phasing involves resolving haplotypes for sites across individual whole genome sequences. The term 'diplomics' [4] has been coined to describe "scientific investigations that leverage phase information in order to understand how molecular and clinical phenotypes are influenced by unique diplotypes". The diplotype shows effects in function and disease related phenotypes. Multiple analyses, such as studying allele-specific expression and compound heterozygosity, inferring human demographic history, and resolving structural variants, require an understanding of the phase of available genomic data. Phased haplotypes are also required as an intermediate step for genotype imputation.

Phasing methods can be categorized into methods which use information from multiple individuals and those which rely on information from a single individual [5]. The former are primarily computational methods, while the latter are mostly experimental approaches. Some computational approaches use information from existing population genomic databases and can be used for phasing multiple individuals. These, however, may be unable to correctly phase rare and private variants, which are not represented in the reference database used. On the other hand, some methods use information from parents or closely related individuals. These have the advantage of being able to use Identical-By-Descent (IBD) information, and allow long range phasing, but require sequencing of more individuals, which adds to the cost. A few methods which use these approaches are: PHASE [6], fastPHASE [7], BEAGLE [8-9], SHAPEIT [10-11], EAGLE [12-13] and IMPUTEv2 [14].

Experimental phasing methods, on the other hand, often involve separation of entire chromosomes followed by sequencing of short segments, which can then be computationally reconstructed to generate entire haplotypes. These methods do not need information from individuals other than the one being sequenced, and genotyping is performed separately from phasing. They fall into two broad categories, namely dense and sparse methods [14]. Dense methods resolve haplotypes in small blocks in great detail, where all variants in a specific region are phased. However, they do not inform the phase relationship between the haplotype blocks. These involve diluting high molecular weight DNA fragments such that fragments from at most one haplotype are present in each unit. Sparse methods can resolve phase relationships across large distances, but may not inform on the phase of each variant in a chromosome. In these methods, a low number of whole chromosomes is compartmentalized such that only one of each pair of haplotypes is present in each compartment. These compartmentalizations are followed by sequencing to generate the haplotypes.

In this work, we use phased haplotypes generated using the 10X Genomics method which uses linked-read sequencing [15]. One nanogram of high molecular weight genomic DNA is distributed across 100,000 droplets. This DNA is barcoded and amplified using polymerase. This tagged DNA is released from the droplets and undergoes library preparation. These libraries are processed via Illumina short-read sequencing. A computational algorithm is then used to construct phased haplotypes based on the barcodes.

Imputation: Imputation involves the prediction of genotypes not directly assayed in a sample of individuals. Experimentally sequencing genomes to a high coverage is an expensive process. Low coverage sequencing or arrays can be used as low-cost alternatives. However, these methods may lead to uncertainty in estimated genotypes (low coverage sequencing) or missing genotype values for untyped sites (arrays). Imputation can be used to obtain genotype data for missing positions using reference data and known data at a subset of positions in individuals which need to be imputed. Imputation is used to boost the power of GWAS studies [16], to fine-map a particular region of a chromosome [17], or to perform meta-analysis [18], which involves combining reference data from multiple reference panels.

Imputation uses a reference panel of known haplotypes with alleles known at a high density of haplotyped positions. A study/inference panel genotyped at a sparse set of positions is used for sequences which need to be imputed. Performing imputation involves two basic steps:

• Phasing genotypes at genotyped positions in the study/inference panel

• Assuming that haplotypes from the inference panel which match those in the reference panel at the positions in the study panel also match at all other positions

Various imputation algorithms perform these steps sequentially and iteratively or simultaneously.
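The second step above can be illustrated with a toy example. The sketch below is not the algorithm used by IMPUTEv2 or any other published tool (those rely on probabilistic haplotype-copying models); it simply copies the untyped alleles from the single reference haplotype that best matches the study haplotype at the typed sites. The function name and the data layout (site indices, 0/1 alleles) are hypothetical choices made for illustration.

```python
from typing import Dict, List


def impute_haplotype(study: Dict[int, int],
                     reference: List[List[int]],
                     n_sites: int) -> List[int]:
    """Fill untyped sites of one study haplotype from its closest reference haplotype.

    study     -- typed sites only: site index -> allele (0 = REF, 1 = ALT)
    reference -- reference haplotypes, each a list of n_sites alleles
    """
    def mismatches(ref_hap: List[int]) -> int:
        # Count disagreements at the typed sites only.
        return sum(ref_hap[site] != allele for site, allele in study.items())

    best = min(reference, key=mismatches)
    # Keep the observed allele at typed sites, copy the reference allele elsewhere.
    return [study.get(site, best[site]) for site in range(n_sites)]


reference_panel = [
    [0, 0, 1, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0],
]
typed = {0: 1, 3: 1}  # the study haplotype is typed at sites 0 and 3 only
print(impute_haplotype(typed, reference_panel, n_sites=5))
```

Choosing a single best match makes the matching assumption explicit, but it ignores recombination between reference haplotypes, which practical methods model.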


Factors affecting the quality of the phasing and imputation are (1) size of reference panel, (2) density of SNPs in reference panel, (3) accuracy of called genotypes in the reference panel, (4) degree of relatedness between sequences in reference panel and study sequences, (5) ethnicity of the study individuals in comparison with the available reference data, and (6) allele frequency of the site being phased or imputed [5].

Multiple methods have been developed for genotype imputation [19]. fastPHASE [7], MACH [20-21], BEAGLE [8, 22-23], and IMPUTEv2 [14] are some widely used methods for imputation.

An analysis of the imputation accuracy for the HapMap project was performed about a decade ago [24], but no similar detailed analysis exists for assessing the phasing and imputation of the 1000 Genomes project, particularly comparing the database against experimentally phased sequences. We present here a detailed assessment of the quality of phasing and imputation for the 1000 Genomes database, particularly as a function of minor allele frequency and inter-SNP distances for biallelic SNPs.

MATERIAL AND METHODS

Input Data

Processed VCFs were downloaded from the 1000 Genomes website. This data is available for each chromosome separately. To obtain agreement with the experimental data, 1000 Genomes VCFs corresponding to the GRCh38 assembly were downloaded. Experimental data was sequenced using the 10X Genomics platform for 28 individuals: 5 GM, 18 HG, and 5 NA. The GM and NA individuals were originally part of the HapMap project while the HG are from the 1000 Genomes project. Thirteen of these individuals were processed at UCSF and sequenced at Novogene, while the remaining individuals were processed and sequenced at Genentech. The populations from which each of the individuals come (as listed in the Coriell Catalog) are:

• South Asia (SAS):
    o Gujarati Indians in Houston, Texas, USA (HapMap) [GIH] - GM21125*, NA20900, NA20902
    o Punjabi in Lahore, Pakistan [PJL] - HG03491, HG03619
    o Sri Lankan Tamil in the UK [STU] - HG03679, HG03752, HG03838*
    o Indian Telugu in the UK [ITU] - HG03968
    o Bengali in Bangladesh [BEB] - HG04153, HG04155

• East Asia (EAS):
    o Han Chinese in Beijing, China (HapMap) [CHB] - GM18552*, NA18570, NA18571
    o Chinese Dai in Xishuangbanna, China [CDX] - HG00851*, HG01802, HG01804
    o Kinh in Ho Chi Minh City, Vietnam [KHV] - HG02064, HG02067
    o Japanese in Tokyo, Japan (HapMap) [JPT] - NA19068*

• Africa (AFR):
    o Luhya in Webuye, Kenya (HapMap) [LWK] - GM19440*
    o Gambian in Western Division, The Gambia [GWD] - HG02623*
    o Esan from Nigeria [ESN] - HG03115*

• Europe (EUR):
    o Toscani in Italia (Tuscans in Italy) (HapMap) [TSI] - GM20587*
    o British from England and Scotland, UK [GBR] - HG00250*
    o Finnish in Finland [FIN] - HG00353*

• America (AMR):
    o Mexican Ancestry in Los Angeles, California, USA (HapMap) [MXL] - GM19789*
    o Peruvian in Lima, Peru [PEL] - HG01971*

Asterisks next to sample IDs refer to samples processed at UCSF.

Preprocessing 1000 Genomes Data

The 1000 Genomes data was separated into individual and chromosome specific VCFs using vcftools [25]. Further, the variants were filtered to retain only phased, biallelic SNPs with a PASS filter status; indels were removed. The experimentally phased data also had a very small fraction of unphased SNPs, which were removed by filtering with vcftools. The analysis was performed only for autosomes.
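The filtering criteria can be made concrete with a short sketch. The study performed this step with vcftools; the function below is only an illustrative plain-Python equivalent under stated assumptions: records are tab-separated VCF data lines, the genotype (GT) is the first colon-separated subfield of each sample column, and a record counts as phased only if every sample genotype uses the "|" separator. The function name is hypothetical.

```python
def keep_record(vcf_line: str) -> bool:
    """Return True for a phased, biallelic SNP record with FILTER == PASS."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt, filter_status = fields[3], fields[4], fields[6]

    # Biallelic SNP: a single-base REF and a single-base ALT.
    # Multiallelic ALT fields ("G,T") and indels ("AT" -> "A") fail this check.
    is_biallelic_snp = (len(ref) == 1 and ref in "ACGT"
                        and len(alt) == 1 and alt in "ACGT")

    # Assumption: GT is the first subfield of each sample column.
    genotypes = [sample.split(":")[0] for sample in fields[9:]]
    is_phased = all("|" in genotype for genotype in genotypes)

    return is_biallelic_snp and filter_status == "PASS" and is_phased


records = [
    "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1",    # kept
    "chr1\t200\t.\tA\tG,T\t.\tPASS\t.\tGT\t1|2",  # multiallelic
    "chr1\t300\t.\tAT\tA\t.\tPASS\t.\tGT\t0|1",   # indel
    "chr1\t400\t.\tC\tT\t.\tPASS\t.\tGT\t0/1",    # unphased
]
print([keep_record(record) for record in records])
```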

Phasing Analysis

The alternate (ALT) allele frequencies of all the SNPs of interest were obtained from the 1000 Genomes data and converted to minor allele frequencies to be able to analyze switch error as a function of minor allele frequencies. The filtered SNPs from the experimental data were split into phase sets, based on phase set information available in the experimental VCF files. Switch error was calculated between the experimental and 1000 Genomes data for each phase set in each chromosome of each individual from the experimental dataset. Switch error is defined as the percentage of possible switches in haplotype orientation used to recover the correct phase in an individual [26] or the proportion of heterozygous positions whose phase is wrongly inferred relative to the previous heterozygous position [27]. vcftools returns the switch error as well as all positions of switches occurring along the chromosome.
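As an illustration of the second definition, the sketch below counts switches between a reference phasing and a test phasing within one phase set. It is not the vcftools implementation; it assumes both inputs list the same heterozygous sites in the same order, each as a phased genotype string such as "0|1" or "1|0", and it uses the number of consecutive site pairs as the denominator (one possible reading of "possible switches"). The function name is hypothetical.

```python
from typing import List, Tuple


def switch_error(truth: List[str], test: List[str]) -> Tuple[List[int], float]:
    """Return (switch indices, switch error rate) for one phase set.

    A switch at index i means the test phasing changes orientation,
    relative to the truth, between heterozygous sites i-1 and i.
    """
    if len(truth) != len(test):
        raise ValueError("both phasings must cover the same heterozygous sites")

    # True where the test haplotypes are flipped relative to the truth.
    flipped = [t != q for t, q in zip(truth, test)]
    switches = [i for i in range(1, len(flipped)) if flipped[i] != flipped[i - 1]]

    possible = len(truth) - 1  # one possible switch per consecutive pair of sites
    rate = len(switches) / possible if possible > 0 else 0.0
    return switches, rate


experimental = ["0|1", "0|1", "1|0", "0|1", "1|0"]
statistical = ["0|1", "1|0", "0|1", "0|1", "1|0"]
print(switch_error(experimental, statistical))
```

In this example the test phasing flips orientation after the first site and flips back after the third, giving two switches out of four possible.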

Switch Error as a Function of Minor Allele Frequency: ALT allele frequencies were accessed for each of the switch positions from the data and were converted to minor allele frequencies. Distribution of switch positions as a function of minor allele frequency was plotted for each chromosome in each individual.

Switch Error as a Function of Inter-SNP Distance: Positions of each SNP were accessed from the data. The number of intermediate switches was counted for all pairs of SNPs, not only consecutive SNPs. If the number of switches between two SNPs was odd, a switch error was counted. This was used to calculate the distribution of switch errors as a function of inter-SNP distance.
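The parity rule for pairs of SNPs can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the switch locations are given as indices into the ordered list of SNP positions (index i meaning a switch between SNPs i-1 and i), and it enumerates all pairs directly, which is quadratic in the number of SNPs and only practical for small examples. The function name is hypothetical.

```python
from itertools import combinations
from typing import List, Tuple


def pairwise_phase_errors(positions: List[int],
                          switch_indices: List[int]) -> List[Tuple[int, bool]]:
    """For every SNP pair, return (inter-SNP distance, out-of-phase flag).

    A pair is out of phase when an odd number of switches lies between
    the two SNPs, i.e. when their cumulative switch parities differ.
    """
    switches = set(switch_indices)
    parity = []
    current = 0
    for index in range(len(positions)):
        if index in switches:
            current ^= 1  # each switch toggles the orientation
        parity.append(current)

    return [(positions[j] - positions[i], parity[i] != parity[j])
            for i, j in combinations(range(len(positions)), 2)]


snp_positions = [100, 250, 400, 1000]
print(pairwise_phase_errors(snp_positions, switch_indices=[1, 3]))
```

Binning the returned distances and averaging the flags within each bin would give an error rate as a function of inter-SNP distance.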

Imputation Analysis

The entire imputation analysis is performed for each chromosome for each individual.

Generate Recombination Map: IMPUTEv2 [13] makes available recombination maps for each chromosome using the 1000 Genomes data for the GRCh37 assembly. A recombination map was obtained for each chromosome for GRCh38 by lifting over the GRCh37 maps using the liftover software. ~8k positions (0.2%) were removed from the lifted over recombination map because liftover resulted in them being in the incorrect order.
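The text does not specify how out-of-order positions were identified, so the following is only one plausible rule, shown as a sketch: walk through the lifted-over positions in their original (GRCh37) order and drop any position that does not exceed the last position kept. The function name is hypothetical.

```python
from typing import List, Tuple


def drop_out_of_order(positions: List[int]) -> Tuple[List[int], List[int]]:
    """Split lifted-over map positions into (kept, dropped).

    Assumed rule: a position is kept only if it is strictly greater than
    the previously kept position, so the kept list is strictly increasing.
    """
    kept: List[int] = []
    dropped: List[int] = []
    for position in positions:
        if not kept or position > kept[-1]:
            kept.append(position)
        else:
            dropped.append(position)
    return kept, dropped


lifted = [100, 200, 150, 300, 250, 400]
print(drop_out_of_order(lifted))
```

A known weakness of this greedy rule is that a single position lifted far downstream would cause many correct positions after it to be dropped; a more careful approach would look for the longest increasing subsequence instead.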

Generate Reference Panel: A reference haplotype panel was generated for all individuals from the 1000 Genomes data by subsetting it to the specific population of interest. 1000 Genomes data for the individuals which were experimentally sequenced was not included in the reference panel. vcftools was used to filter out the individuals of interest from the 1000 Genomes data. bcftools was used to convert the VCF data to haps-sample-legend format. An alternate approach was also used, where the entire 1000 Genomes data was used to generate a reference haplotype panel.

Generate Study Panel: A study panel was generated for the experimentally sequenced individuals selected. The study panel is assumed to be genotyped at positions corresponding to the Illumina Infinium Omni2.5-8 array. Array positions were lifted over from GRCh37 to GRCh38 using liftover. 1000 Genomes haplotypes (since 1000 Genomes data is prephased, the study panel is also in the form of haplotypes rather than genotypes) for those positions for those individuals were selected to create the study panel using vcftools. Filtered VCF files were converted to the haps-sample format using bcftools.

Run Imputation: Missing positions are imputed using IMPUTEv2. Imputation was performed in 5 Mb windows. The genotype output by imputation was converted to VCF format using bcftools. VCFs produced over all windows were combined using vcf-concat. IMPUTEv2 generally phases the typed (genotyped) sites in the study panel. This is followed by imputation, which is performed by assuming that haplotypes in the study panel that match the haplotypes in the reference panel at the typed sites also match at the untyped sites. IMPUTEv2 then performs an iterative process of multiple Monte-Carlo steps alternating phasing and imputation. For this analysis, however, as haplotypes from the 1000 Genomes project were directly used to generate the study panel, the phasing step was not performed.

Filter Positions: For one part of the analysis, i.e. estimating errors in the positions represented in the experimentally phased VCFs (henceforth called experimental SNPs), the positions from those VCFs were selected from the imputed data using vcftools. Experimental genotypes from the experimental VCFs were obtained for each individual of interest using vcftools. SNPs with duplicate entries in either the imputed or experimental data were removed. Continent-specific allele frequencies were obtained for the experimental SNPs from the 1000 Genomes data using vcftools, to be able to analyze switch error as a function of Minor Allele Frequencies. For the other part of the analysis, i.e. estimating errors for all positions in the 1000 Genomes data, the allele frequencies were similarly obtained for all of the SNPs.

Imputation Error: Imputation error was computed as the fraction of genotypes incorrectly identified. Imputation error was computed for both the SNPs in the experimental data and all the SNPs in 1000 Genomes data. Error is computed as a function of minor allele frequency. The continent-specific minor allele frequencies were used for analyzing the imputation error.
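The two ways of computing imputation error differ only in the set of positions considered, which the sketch below makes explicit. It is an illustration under assumptions, not the authors' code: genotypes are encoded as ALT allele counts (0, 1 or 2) keyed by position, and positions absent from the experimental calls are treated as homozygous reference (0), following the convention stated in the Results. The function name and data layout are hypothetical.

```python
from typing import Dict, Iterable, Optional


def imputation_error(imputed: Dict[int, int],
                     experimental: Dict[int, int],
                     positions: Optional[Iterable[int]] = None) -> float:
    """Fraction of positions whose imputed genotype differs from the experimental one.

    imputed / experimental -- position -> ALT allele count (0, 1 or 2)
    positions              -- positions to score; defaults to all imputed positions
    """
    scored = list(imputed.keys() if positions is None else positions)
    if not scored:
        return 0.0
    # Assumption: positions missing from the experimental calls are homozygous reference.
    wrong = sum(imputed[pos] != experimental.get(pos, 0) for pos in scored)
    return wrong / len(scored)


imputed_calls = {10: 1, 20: 0, 30: 2, 40: 0}
experimental_calls = {10: 1, 30: 1}

# Error over all imputed (1000 Genomes) positions.
print(imputation_error(imputed_calls, experimental_calls))
# Error over the experimental SNPs only.
print(imputation_error(imputed_calls, experimental_calls, experimental_calls.keys()))
```

The same single mismatch yields a lower error rate over all positions than over the experimental SNPs alone, because the larger set is dominated by correctly imputed homozygous-reference sites.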


For all analysis where error rate is computed as a function of the continent-specific minor allele frequency (genotyping error and imputation error; Figs. 1, 2, 7, 8), the minor allele frequencies are binned as MAF = 0.0%, 0.0% < MAF < 0.2%, 0.2% <= MAF < 0.5%, 0.5% <= MAF < 1%, 1% <= MAF < 5%, MAF >= 5%. For the analysis where all 1000 Genomes minor allele frequencies are used (phasing error and imputation error comparing use of multiple reference panels; Figs. 3, 4, 9), the minor allele frequencies are binned into only five bins, i.e. there is no MAF = 0.0% bin. The rest of the bins are the same as for the continent-specific MAF bins.
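The conversion from ALT allele frequency to minor allele frequency and the binning scheme above can be written compactly. The sketch below is illustrative only; the function names are hypothetical, and frequencies are assumed to be given as fractions (0.002 for 0.2%).

```python
from bisect import bisect_right

# Lower edges of the bins above the first one, as fractions.
EDGES = [0.002, 0.005, 0.01, 0.05]
LABELS = [
    "0.0% < MAF < 0.2%",
    "0.2% <= MAF < 0.5%",
    "0.5% <= MAF < 1%",
    "1% <= MAF < 5%",
    "MAF >= 5%",
]


def minor_allele_frequency(alt_af: float) -> float:
    """The minor allele is whichever of REF and ALT is rarer."""
    return min(alt_af, 1.0 - alt_af)


def maf_bin(alt_af: float, include_zero_bin: bool = True) -> str:
    """Return the MAF bin label for a SNP given its ALT allele frequency.

    include_zero_bin=True gives the six continent-specific bins;
    False gives the five-bin scheme, which has no MAF = 0.0% bin.
    """
    maf = minor_allele_frequency(alt_af)
    if include_zero_bin and maf == 0.0:
        return "MAF = 0.0%"
    # bisect_right places a value equal to an edge into the higher bin,
    # matching the "<=" on the lower bound of each bin.
    return LABELS[bisect_right(EDGES, maf)]


for frequency in (0.0, 0.001, 0.003, 0.007, 0.97, 0.4):
    print(frequency, maf_bin(frequency))
```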

Experimental Methods

Sample processing: HMW genomic DNA was extracted and converted into 10x sequencing libraries according to the 10X Genomics (Pleasanton, CA, USA) Chromium Genome User Guide and as published previously [28]. Briefly, GEMs were made with 1.25 ng HMW template gDNA, Master-mix, Genome Gel Beads and partitioning oil on the microfluidic Genome Chip. Isothermal incubation of the GEMs (for 3 h at 30°C; for 10 min at 65°C; stored at 4°C) produced barcoded fragments ranging from a few to several hundred base pairs. After dissolution of the Genome Gel Bead in the GEM, the Illumina Read 1 sequencing primer, 16 bp 10x barcode and 6 bp random primer are released. The GEMs were then broken and the pooled fractions were recovered. Silane and Solid Phase Reversible Immobilization (SPRI) beads were used to purify and size select the fragments for library preparation. Library prep was performed according to the manufacturer's instructions described in the Chromium Genome User Guide Rev C. Libraries were made using 10x Genomics adapters. The final libraries contain the P5 and P7 primers used in Illumina bridge amplification. The barcoded libraries were then quantified by qPCR (KAPA Biosystems Library Quantification Kit for Illumina platforms). Sequencing was done using Illumina HiSeq 4000 with 2×150 paired-end reads. Raw reads were processed, aligned to the reference genome, and had SNPs called and phased using 10X Genomics' Long Ranger software (version 2.1.1 or 2.1.6) with the "wgs" pipeline with default settings.

RESULTS

The 1000 Genomes project chromosome-specific VCFs for the GRCh38 assembly contain between 6.4M (chr1) and 1.1M (chr22) variants over all the 2504 individuals. After retaining only phased, biallelic PASS SNPs and removing indels, we are left with between 6.15M (chr1) and 1.05M (chr22) variants. The experimentally phased data from the 10X Genomics platform has different numbers of called variants for each sequenced individual. For chromosome 1, the number of called variants varies from 414K to 494K across the 28 individuals, while, for chromosome 22, the number of called SNPs varies from 104K to 120K. After performing a similar filtering for the experimental data, the number of biallelic PASS phased SNPs ranges between 298K and 357K for chromosome 1 and between 64K and 75K for chromosome 22.

Figure 1. Distribution of SNPs as a function of continent-specific minor allele frequencies: (a) only experimental SNPs; (b) all 1000 Genomes SNPs.

The SNPs from the experimentally phased VCFs (Fig. 1a), averaged over continent groups, show that the vast majority of SNPs in this selection have high continent-specific MAF values (>5%). Comparing across continents for the continent invariant SNPs, the African and American individuals have an order of magnitude fewer continent invariant SNPs than the European, East Asian and South Asian individuals. However, if we look at all the SNPs in the 1000 Genomes data (filtered for biallelic PASS phased SNPs) as a function of continent-specific MAF, the distribution we observe has a very different trend. There is a significant over-representation of the very low continent-specific MAF SNPs (<0.1%), ~5 × 10^7, as compared to all the subsequent higher MAF SNPs, which all range <1 × 10^7.

These discrepancies between the numbers in the 1000 Genomes data and in the experimentally phased data, as well as the differing trends as a function of MAF, occur because the 1000 Genomes data includes a SNP if even one individual in the 2504 individuals has a variant (heterozygous or homozygous-alternate) at that position, while the experimental data includes a SNP only if that particular individual has a variant (heterozygous or homozygous-alternate) at that position. This results in a much larger number of overall SNPs being present in the 1000 Genomes data as compared to the experimental data, and also in the majority of the 1000 Genomes SNPs having extremely low MAF, as those would occur only in one or a few individuals.

Genotyping Error

Figure 2. Genotyping error: (a) in the experimental VCF positions as a function of continent-specific minor allele frequency, averaged over all chromosomes over all individuals in each continent; (b) false positive vs false negative rates (defined in text) for all 1000 Genomes SNPs.

Genotyping error is computed by comparing the 1000 Genomes genotypes with the experimental genotypes. The experimental genotypes for all SNPs not present in the experimental VCF for each individual are assumed to be homozygous reference. Mismatched genotypes are counted as errors. Figure 2a looks at the errors (fraction of genotypes which are incorrect) for the experimental VCF positions as a function of the continent-specific minor allele frequencies. There is higher error at the population invariant sites (MAF = 0.0%) in the African and American populations than the European, East Asian and South Asian populations. This correlates with a lower total number of population invariant SNPs in those continents (Fig. 1a). For non-invariant SNPs, we observe, as expected, a decreasing error rate with increasing minor allele frequency, to a genotyping error rate of <1.5% for the SNPs with minor allele frequencies >1%. Comparing false positive (sites non-homozygous reference in 1000 Genomes data and homozygous reference in the experimental data) vs false negative (sites homozygous reference in 1000 Genomes data and non-homozygous reference in the experimental data) error rates for all 1000 Genomes sites (Fig. 2b), we see that the genotyping for the European and American individuals is very accurate, with both low false positive and false negative rates. The East Asian and South Asian populations both have mostly low false positive rates, but show a wide range (factor of 2) of false negative rates, while showing only a ~15% variation in the false positive rates for most individuals. In contrast, the African individuals mostly have relatively low false negative rates, but have among the highest false positive rates. This indicates that the sequencing in the 1000 Genomes project has overcalled non-homozygous reference variants in African individuals compared to the rest, and overcalled SNPs as homozygous reference in some of the East and South Asian individuals.
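The false positive and false negative definitions used here can be expressed as a short counting routine. This is a sketch under assumptions rather than the authors' code: genotypes are encoded as ALT allele counts (0 = homozygous reference) keyed by position, positions missing from the experimental calls are taken to be homozygous reference, and only raw counts are returned because the text does not state which denominator was used for the rates. The function name is hypothetical.

```python
from typing import Dict, Tuple


def false_positive_negative_counts(kgp: Dict[int, int],
                                   experimental: Dict[int, int]) -> Tuple[int, int]:
    """Count false positives and false negatives over all 1000 Genomes sites.

    false positive -- non-homozygous-reference in 1000 Genomes,
                      homozygous reference in the experimental data
    false negative -- homozygous reference in 1000 Genomes,
                      non-homozygous-reference in the experimental data
    """
    false_positives = 0
    false_negatives = 0
    for position, kgp_genotype in kgp.items():
        # Assumption: sites absent from the experimental VCF are homozygous reference.
        exp_genotype = experimental.get(position, 0)
        if kgp_genotype != 0 and exp_genotype == 0:
            false_positives += 1
        elif kgp_genotype == 0 and exp_genotype != 0:
            false_negatives += 1
    return false_positives, false_negatives


kgp_calls = {1: 1, 2: 0, 3: 2, 4: 0, 5: 1}
experimental_calls = {1: 1, 2: 1, 3: 2}
print(false_positive_negative_counts(kgp_calls, experimental_calls))
```

Note that a site called heterozygous in one data set and homozygous-alternate in the other is a genotype mismatch but counts as neither a false positive nor a false negative under these definitions.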

Phasing

Figure 3. Switch error as a function of Minor Allele Frequencies for different individual chromosomes. Chromosome 21 shows higher switch error for large MAF values.

Phasing errors are all analyzed for overall 1000 Genomes minor allele frequencies, not continent specific MAFs. Comparing the switch error across individual chromosomes (Fig. 3), we observe that the switch error ranges between 25-30% for the rare MAF (<0.1%) SNPs, falling to <5% for SNPs with MAFs 1-5%. The majority of SNPs, which fall in the MAF >5% category, have an error <2.5%. However, a comparatively higher switch error at larger MAF values (>5%) is observed for chromosome 21. This plot (Fig. 3) shows only a subset of chromosomes for a single individual (GM18552), but this trend is observed for all other chromosomes and individuals studied.

Figure 4. Switch error: (a) Total switch error (number of switches in experimental SNPs / total number of experimental SNPs) for each individual. (b) Switch error as a function of Minor Allele Frequencies averaged over all individuals in each continent. (c) Switch error as a function of Minor Allele Frequencies for all individuals colored by continent.

Figure 4a shows the total switch error for each of the individuals. The total switch errors for all the individuals studied go up to ~2.5%. The switch errors for the East Asian individuals are grouped together, while those for the South Asian individuals show greater variability. This is in line with the general observation that South Asian populations have an overall greater heterogeneity than do East Asian populations [J. Wall, Unpublished data].


Analyzing the switch error as a function of minor allele frequency averaged over all chromosomes of all individuals of a population (Fig. 4b), we observe low switch error, <5%, for low minor allele frequencies (MAF) (1-5%). For rare SNPs with MAF (0.2-1%), the switch error is ~5-10%. For extremely rare minor allele SNPs, i.e. MAF <0.2%, the error is much higher, i.e. 15-35%. For all higher MAF values (>5%), the error is <2.5%. The average error rate for the individuals from the African populations is almost the same over the range of MAF values >0.1%.

As observed in Figure 4c, the differences in the error rates between individuals decrease with increasing minor allele frequency. Individuals from South Asia show a larger variation in error as a function of MAF as compared to individuals from East Asia. The individuals from the African populations have the lowest switch error over the range of MAF values. Individual NA20900, an individual from the Gujarati Indians in Houston (GIH) population, has the lowest switch error as a function of minor allele frequency for the low MAF SNPs.

Figure 5. Switch error as a function of inter-SNP distance: (a) Switch error as a function of inter-SNP distances averaged over individuals in each continent. (b) Switch error as a function of inter-SNP distances for all individuals colored by continent.

We also analyzed phasing error as a function of the distances between SNPs (Fig. 5). The phasing error increases as a function of the inter-SNP distance, i.e. SNPs which are further apart are more likely to be out of phase with each other. The within population trends are the same as for switch error vs MAF, where the individuals from South Asia show a larger spread as compared to the individuals from East Asia. Individual NA20900 shows the lowest error rate, same as for the comparison of error vs MAF (Fig. 4c).

Comparing the switch error as a function of MAF vs. the switch error as a function of inter-SNP distance, we see that the individuals from the African populations show distinctly opposite trends. For low MAF SNPs, the error averaged over the African individuals is the lowest, while across the range of inter-SNP distances, the error averaged over the African individuals is the highest. The reason this occurs can be understood from the fact that there is a higher number of low MAF SNPs in the African individuals in the experimental data (Fig. 1a), as well as an overall higher number of SNPs in those individuals, leading to a higher SNP density for these individuals. In addition, there is less linkage disequilibrium (LD) in the individuals from the African populations, which would make it harder to phase them accurately [29-30]. Hence, pairs of SNPs are more likely to be out of phase with each other, leading to higher switch error as a function of inter-SNP distance.

Imputation

Figure 6. Total imputation error: (a) Total imputation error in experimental SNPs (number of incorrect genotypes in all experimental SNPs / total number of experimental SNPs) for each individual. (b) Total imputation error in all 1KG SNPs (number of incorrect genotypes in all 1KG SNPs / total number of 1KG SNPs) for each individual.

Imputation error is computed as the fraction of SNPs with incorrectly imputed genotypes. However, depending on the subset of SNPs under consideration, the error can be computed in two different ways: (1) fraction of experimental SNPs incorrectly imputed and (2) fraction of all 1KG SNPs incorrectly imputed. In the case of the second definition of error, the experimental calls for all the positions not in the experimental VCFs are set to homozygous-reference.

Figure 6a shows the total imputation error in the experimental SNPs while Figure 6b shows the total imputation error in the 1KG SNPs for each of the individuals. The total imputation errors in the experimental SNPs for all the individuals studied go up to ~4%. For this subset of SNPs, the two American individuals have among the highest imputation errors. The imputation errors for the East Asian individuals are grouped together, while those for the South Asian individuals show greater variability. This agrees with our observations for the switch error (Fig. 4a). In the 1KG SNPs, on the other hand, since we are looking at a much larger set of SNPs, most of which are homozygous-reference in any given individual, we see a much smaller error, <~1%.

Figure 7. Imputation accuracy, experimental VCF positions: (a) Imputation error in the experimental SNPs as a function of Minor Allele Frequencies averaged over individuals in each continent. (b) Imputation error in the experimental SNPs as a function of Minor Allele Frequencies for all individuals colored by continent.

Imputation error in experimental SNPs: Figure 7a shows a wide range of error rates as a function of the continent-specific minor allele frequency. The continent invariant positions (MAF = 0.0%) are imputed almost as accurately as the high MAF (>5% in 3 populations, and >1% in two populations) SNPs. In these positions, we make the same observation as we did for the original genotyping in the 1000 Genomes reference data (Fig. 2a), i.e. the errors in the European, East Asian and South Asian individuals for these continent invariant positions are lower than those for the American and African individuals. For the very rare SNPs, i.e. MAF <0.2%, the error is as high as ~60%. These extremely high error rates are only observed in the American individuals and a few of the South Asian individuals. For the rest of the individuals, the error rates are <50%. In the mid-range of MAF values, i.e. 0.2% to 1%, the errors range between 10-20%. The SNPs with higher MAF values are fairly accurate, with errors <2% for common SNPs (MAF >5%). This can also be seen looking at all the individuals separately (Fig. 7b). The South Asian (Gujarati in Houston, Texas) individual NA20900 still shows the lowest error rate as a function of MAF for imputation, just as it does for the switch error (Fig. 4c).

Figure 8. Imputation accuracy, all 1KG SNPs: (a) Imputation error in all the 1000 Genomes positions as a function of Minor Allele Frequencies averaged over individuals in each continent. (b) Imputation error in all the 1000 Genomes positions as a function of Minor Allele Frequencies for all individuals colored by continent.

Imputation error in all 1KG SNPs. Computing the error using all the 1KG SNPs, we see a different trend for the errors as a function of minor allele frequency (Figs. 8a, 8b). The invariant sites have very low errors, ~10⁻⁴. For the variant sites, the errors increase as a function of minor allele frequency, as opposed to decreasing as they do in the experimental-only SNPs. The reason becomes clear on contrasting the number of experimental SNPs (Fig. 1a) with the number of all 1KG SNPs (Fig. 1b): while the number of low-MAF SNPs is 1-2 orders of magnitude less than the number of SNPs with MAF > 5% in the experimental data, the number of very-low-MAF SNPs is 2-10 times greater than the number of SNPs with MAF > 5% in the whole 1000 Genomes data. The vast majority of the very-low-MAF SNPs in the whole 1000 Genomes data are homozygous-reference in any given individual, since those SNPs show variation in only one or very few 1000 Genomes individuals. Hence, imputation predictions get most of those positions correct in most of the individuals. As a result, the fraction of those very rare SNPs which are predicted incorrectly is much lower when considering all the 1000 Genomes SNPs as compared to only considering the experimental SNPs, where most of the SNPs are high-MAF SNPs.
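
The denominator effect described above can be made concrete with a small worked example in Python. This is a sketch under stated assumptions rather than the study's computation: it assumes that an individual's experimental SNPs in a MAF bin are the sites where that individual carries a non-reference genotype, and that the remaining panel sites in the bin are homozygous-reference for that individual. The function name `rare_bin_error` and all counts are hypothetical.

```python
def rare_bin_error(n_sites_in_bin, n_nonref, n_nonref_wrong, n_homref_wrong=0):
    """Return (experimental-only error, all-sites error) for one MAF bin.

    n_sites_in_bin : panel sites in the bin (assumed count)
    n_nonref       : sites where the individual is non-reference
    n_nonref_wrong : non-reference sites imputed incorrectly
    n_homref_wrong : homozygous-reference sites imputed incorrectly
    """
    n_homref = n_sites_in_bin - n_nonref
    if n_nonref <= 0 or n_homref < 0:
        raise ValueError("need 0 < n_nonref <= n_sites_in_bin")
    # Denominator: only the sites the individual actually carries.
    experimental_only = n_nonref_wrong / n_nonref
    # Denominator: every panel site in the bin, mostly hom-ref and easy to impute.
    all_sites = (n_nonref_wrong + n_homref_wrong) / n_sites_in_bin
    return experimental_only, all_sites


# Hypothetical counts: 100,000 very rare panel sites, 200 carried, 120 of those missed.
exp_err, all_err = rare_bin_error(100_000, 200, 120)
```

With these made-up counts the same 120 mistakes give an error of 0.6 over the carried sites but 0.0012 over all panel sites in the bin, which is the direction of the difference between Figs. 7 and 8 at low MAF.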

Consistent with the observations for the experimental-only SNPs, at very rare SNPs (MAF < 0.2%), the American individuals still have the highest error rate. The individuals from the South Asian populations still show a greater spread than those from the East Asian populations. Individual NA20900 still shows the lowest error rate, as with previous observations.

Figure 9 Imputation error as a function of minor allele frequency for European individuals, comparing the European reference panel vs. the entire 1KG reference panel. (a) Experimental SNPs. (b) All 1000 Genomes SNPs.


Comparison of reference panels. Here, we compare the imputation errors resulting from using different reference panels for imputation. Three panels are chosen: a continent-specific reference panel for the individual of interest, a reference panel which includes all of the 1000 Genomes individuals, and a continent-specific reference panel for a different continent from the one to which the individuals belong. The minor allele frequencies used here are the overall 1000 Genomes minor allele frequencies, instead of continent-specific minor allele frequencies, since we want to understand the impact of the choice of reference panel, and continent-specific MAFs would not align with the whole reference or the reference from another continent. In this case, we look at the imputation error in the 3 European individuals when imputation is carried out with the European reference, the South Asian reference, and the whole 1000 Genomes reference.
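
The three panel choices described above amount to three sample subsets. A minimal Python sketch of that selection follows; the function name `build_panels`, the sample identifiers, and the decision to exclude the target individuals from every panel are assumptions made for illustration and are not taken from the study's pipeline. `EUR`, `SAS` and `AFR` are the 1000 Genomes continental group codes.

```python
def build_panels(sample_to_continent, target_continent, other_continent, exclude=()):
    """Return sample lists for matched, mismatched and whole reference panels.

    sample_to_continent: dict mapping sample ID -> continental group code.
    exclude: samples to drop from every panel (assumed here to be the
             individuals whose genotypes are being imputed).
    """
    excluded = set(exclude)
    kept = {s: c for s, c in sample_to_continent.items() if s not in excluded}
    return {
        "matched": sorted(s for s, c in kept.items() if c == target_continent),
        "mismatched": sorted(s for s, c in kept.items() if c == other_continent),
        "whole": sorted(kept),
    }


# Hypothetical sample IDs, for illustration only.
toy_samples = {"S1": "EUR", "S2": "EUR", "S3": "SAS", "S4": "AFR"}
panels = build_panels(toy_samples, "EUR", "SAS", exclude=["S1"])
```

Each resulting sample list could then be used to subset the reference VCF before running the imputation software.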

The observed result for experimental-only SNPs (Fig. 9a) when comparing reference panels for the European individuals is very similar to that for all 1000 Genomes SNPs (Fig. 9b). Using the entire 1000 Genomes data as a reference panel gives slightly better accuracy than using just a European-specific reference panel. The error when using an incorrect reference panel, however, is up to a factor of 2 greater than the error when using the appropriate reference or the whole 1000 Genomes reference panel. For all 1000 Genomes SNPs, the trend of error as a function of MAF is, again, the opposite of what was observed when looking at only the experimental SNPs.

DISCUSSION

The 1000 Genomes Project data have been widely used as a reference for estimating continent-specific allele frequencies, and as a reference panel for phasing and imputation studies. Since the project's design involved low-coverage (~7X) sequencing for most of the samples, it was unknown a priori how accurate the 1KGP's genotype and haplotype calls were, especially for rare variants. This accuracy obviously directly impacts the usefulness of the 1KGP data. With the advent of inexpensive, commercial platforms for experimentally phasing whole genomes, it is possible to directly quantify the genotype and haplotype error rates of the 1KGP data.

Our comparison of 28 experimentally phased genomes with the 1KGP data found that the latter is highly accurate for common and low-frequency variants (i.e., MAF ≥ 0.01). As expected, accuracy declined with decreasing MAF, with rare variants (MAF < 0.01) not reliably imputed onto haplotypes. Surprisingly though, the genotype calls were reasonably accurate even for rare variants. This observation may not generalize to other low-coverage sequencing studies due to the complicated and labor-intensive protocol used for variant calling in the 1KGP. We conclude that the 1KGP data is best used as a reference panel for imputing variants with MAF ≥ 0.01 into populations closely related to the 1KGP groups, and is probably of limited utility for imputation in rare variant association studies. Larger subsequent imputation panels, such as the one generated by the Haplotype Reference Consortium (HRC)31, are likely much more useful for imputing rare variants, at least in well-studied European populations. However, even this large reference panel may be of limited usefulness for imputation into other human groups. While our results suggest that using a region-specific reference panel (for the correct region) for imputation is only slightly worse than using a worldwide panel, the choice of an incorrect regional panel makes the imputation considerably worse. So, large European-based haplotype reference panels will be of limited utility for imputing variants into East Asian, South Asian, or African-American genomes, while imputation studies involving understudied groups such as Middle Easterners, Melanesians or Khoisan are likely to have error rates substantially higher than what was observed in our study. This is a consequence of the fact that most rare variants are region-specific; imputation only works when the variant being imputed shows up often enough in the reference panel. In summary, while the 1KGP and HRC provide valuable genomic resources that can augment the power of GWAS in groups with European ancestry, additional large-scale genome sequencing of diverse human populations will be necessary to obtain comparable benefits of imputation in genetic association studies of non-European groups.

Finally, we note that the absolute error rate varied by an order of magnitude, depending on the specific definitions of error that were used. This highlights the importance of definitional clarity in studies that evaluate the accuracy of genomic resources.

Conflicts of Interest Declaration

Genentech authors hold shares in Roche. The other authors declare no conflicts of interest.

Acknowledgments

JDW was supported in part by NIH grant R01GM115433. SB was supported by Genentech research grant CA0095684.

LITERATURE CITED

1. The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073.


2. The 1000 Genomes Project Consortium (2012) An integrated map of human genetic variation from 1092 human genomes. Nature 491, 56–65.

3. The 1000 Genomes Project Consortium (2015) A global reference for human genetic variation. Nature 526, 68–74.

4. Tewhey, R., Bansal, V., Torkamani, A., Topol, E.J., and Schork, N.J. (2011) The importance of phase information for human genomics. Nature Reviews Genetics 12, 215–223.

5. Browning, S. and Browning, B. (2011) Haplotype phasing: existing methods and new developments. Nature Reviews Genetics 12, 703–714.

6. Stephens, M. and Scheet, P. (2005) Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet. 76, 449–462.

7. Scheet, P. and Stephens, M. (2006) A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am. J. Hum. Genet. 78, 629–644.

8. Browning, S. and Browning, B. (2007) Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. Am. J. Hum. Genet. 81, 1084–1097.

9. Browning, B. and Browning, S. (2009) A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am. J. Hum. Genet. 84, 210–223.

10. Delaneau, O., Coulonges, C., and Zagury, J.-F. (2008) Shape-IT: new rapid and accurate algorithm for haplotype inference. BMC Bioinformatics 9.

11. Delaneau, O., Marchini, J., and Zagury, J.-F. (2012) A linear complexity phasing method for thousands of genomes. Nature Methods 9, 179–181.

12. Loh, P.-R., Danecek, P., Palamara, P.F., Fuchsberger, C., Reshef, Y.A., Finucane, H.K., Schoenherr, S., Forer, L., McCarthy, S., Abecasis, G.R., et al. (2016) Reference-based phasing using the Haplotype Reference Consortium panel. Nature Genetics 48, 1443–1448.

13. Loh, P.-R., Palamara, P.F., and Price, A.L. (2016) Fast and accurate long-range phasing in a UK Biobank cohort. Nature Genetics 48, 811–816.

14. Howie, B.N., Donnelly, P., and Marchini, J. (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5.

15. Snyder, M.W., Adey, A., Kitzman, J.O., and Shendure, J. (2015) Haplotype-resolved genome sequencing: experimental methods and applications. Nature Reviews Genetics 16, 344–358.


16. Zheng, G.X.Y., Lau, B.T., Schnall-Levin, M., Jarosz, M., Bell, J.M., Hindson, C.M., Kyriazopoulou-Panagiotopoulou, S., Masquelier, D.A., Merrill, L., Terry, J.M., et al. (2016) Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nature Biotechnology 34, 303–311.

17. Spencer, C.C.A., Su, Z., Donnelly, P., and Marchini, J. (2009) Designing genome-wide association studies: sample size, power, imputation, and the choice of genotyping chip. PLoS Genetics 5.

18. Marchini, J., Howie, B., Myers, S., McVean, G., and Donnelly, P. (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics 39, 906–913.

19. Zeggini, E., Scott, L.J., Saxena, R., Voight, B.F., Marchini, J.L., Hu, T., de Bakker, P.I., Abecasis, G.R., Almgren, P., Andersen, G., et al. (2008) Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nature Genetics 40, 638–645.

20. Marchini, J. and Howie, B. (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11, 499–511.

21. Li, Y., Willer, C.J., Sanna, S., and Abecasis, G.R. (2009) Genotype imputation. Annu. Rev. Genomics Hum. Genet. 10, 387–406.

22. Li, Y., Willer, C.J., Ding, J., Scheet, P., and Abecasis, G.R. (2010) MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes. Genet. Epidemiol. 34, 816–834.

23. Browning, S. (2006) Multilocus association mapping using variable-length Markov chains. Am. J. Hum. Genet. 78, 903–913.

24. Huang, L., Li, Y., Singleton, A.B., Hardy, J.A., Abecasis, G., Rosenberg, N.A., and Scheet, P. (2009) Genotype-imputation accuracy across worldwide human populations. Am. J. Hum. Genet. 84, 235–250.

25. Danecek, P., Auton, A., Abecasis, G., Albers, C.A., Banks, E., DePristo, M.A., Handsaker, R.E., Lunter, G., Marth, G.T., Sherry, S.T., et al. (2011) The variant call format and VCFtools. Bioinformatics 27, 2156–2158.

26. Marchini, J., Cutler, D., Patterson, N., Stephens, M., Eskin, E., Halperin, E., Lin, S., Qin, Z.S., Munro, H.M., Abecasis, G.R., et al. (2006) A comparison of phasing algorithms for trios and unrelated individuals. Am. J. Hum. Genet. 78, 437–450.

27. Stephens, M. and Donnelly, P. (2003) A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet. 73, 1162–1169.

28. Weisenfeld, N.I., Kumar, V., Shah, P., Church, D.M., and Jaffe, D.B. (2017) Direct determination of diploid genome sequences. Genome Research 27(5), 757–767.


29. Frisse, L., Hudson, R.R., Bartoszewicz, A., Wall, J.D., Donfack, J., and Di Rienzo, A. (2001) Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet. 69, 831–843.

30. Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., et al. (2002) The structure of haplotype blocks in the human genome. Science 296, 2225–2229.

31. McCarthy, S., Das, S., Kretzschmar, W., Delaneau, O., Wood, A.R., Teumer, A., Kang, H.M., Fuchsberger, C., Danecek, P., Sharp, K., et al. (2016) A reference panel of 64,976 haplotypes for genotype imputation. Nature Genetics 48(10), 1279–1283.
