Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Ecology and Evolution. 2018;1–14. | 1www.ecolevol.org
Received:26January2018 | Revised:14March2018 | Accepted:29March2018DOI: 10.1002/ece3.4147
O R I G I N A L R E S E A R C H
Double- digest RADseq loci using standard Illumina indexes improve deep and shallow phylogenetic resolution of Lophodermium, a widespread fungal endophyte of pine needles
Rodolfo Salas-Lizana | Ryoko Oono
ThisisanopenaccessarticleunderthetermsoftheCreativeCommonsAttributionLicense,whichpermitsuse,distributionandreproductioninanymedium,providedtheoriginalworkisproperlycited.©2018TheAuthors.Ecology and EvolutionpublishedbyJohnWiley&SonsLtd.
DepartmentofEcology,Evolution,andMarineBiology,UniversityofCalifornia,SantaBarbara,California
CorrespondenceRyokoOono,DepartmentofEcology,Evolution,andMarineBiology,UniversityofCalifornia-SantaBarbara,Santa Barbara,CA.Email:[email protected]
Present addressRodolfoSalas-Lizana,DepartamentodeBiologíaComparada,FacultaddeCiencias,UniversidadNacionalAutónomadeMéxico,MexicoCity,Mexico
Funding informationUniversityofCaliforniaInstituteforMexicoandtheUnitedStates,Grant/AwardNumber:20150989
AbstractThephylogeneticandpopulationgeneticstructureofsymbioticmicroorganismsmaycorrelatewith importantecological traits thatcanbedifficult todirectlymeasure,suchashostpreferencesordispersalrates.Thisstudydevelopsandtestsalow-costdouble-digest restriction site-associatedDNAsequencing (ddRADseq)protocol toreveal among- andwithin-species genetic structure forLophodermium, a genusoffungalendophyteswhoseevolutionaryanalyseshavebeenlimitedbythescarcityofinformativemarkers.Theprotocolavoidsexpensivebarcodedadaptersandincorpo-ratesuniversalindexesformultiplexing.Wetestedforreproducibilityandfunction-alitybycomparingshared loci fromsample replicatesandassessed theeffectsofnumbersofambiguoussitesandclusteringthresholdsoncoveragedepths,numberof shared loci among samples, and phylogenetic reconstruction. Errors betweentechnicalreplicateswereminimal.Relaxingthequality-filteringcriteriaincreasedthemeancoveragedepthperlocusandthenumberoflocirecoveredwithinasample,buthadlittleeffectonthenumberofsharedlociacrosssamples.Increasingcluster-ingthresholddecreasedthemeancoveragedepthperclusterandincreasedthenum-beroflocirecoveredwithinasamplebutalsodecreasedthenumberofsharedlociacrosssamples,especiallyamongdistantlyrelatedspecies.Thecombinationoflowsimilarityclustering(70%)andrelaxedquality-filtering(allowingupto30ambiguoussitesperread)performedthebestinphylogeneticanalysesatbothrecentanddeepgeneticdivergences.Hence,thismethodgeneratedsufficientnumberofsharedho-mologouslocitoinvestigatetheevolutionaryrelationshipsamongdivergentfungallineageswith smallhaploidgenomes.Thegreatergenetic resolutionalso revealednewstructurewithinspeciesthatcorrelatedwithecologicaltraits,providingvaluableinsightsintotheircrypticlifehistories.
K E Y W O R D S
ddRADseq,genotypingbysequencing,sequencingdepth
2 | SALAS- LIZANA ANd OONO
1 | INTRODUC TION
The genetic diversity and structurewithin species in combinationwithgeographicalandecologicalmetadatacanuncover importantbiogeographicalstructuring,demographichistory,andadaptivetraitslinked to ecological functions (Loveless, 1984; Manel, Schwartz,Luikart,&Taberlet,2003;Neale&Savolainen,2004;Slatkin,1987).Thegeneticrelationshipamongcloselyrelatedspecies,orphyloge-neticassociations,canalsouncover important lifehistory, suchasecologicaleventsortraitsthatinfluencediversificationandspecia-tion(Huyse,Poulin,&Théron,2005).Microbialspecies,whoselifehistories and species delimitations are particularly challenging toobserveormeasure,wouldespeciallybenefitfromgeneticsequenc-ingmethods thatcanuncoverbothmicro-andmacroevolutionarydiversity.Inthisstudy,wetestalow-costdouble-digestrestrictionsite-associatedDNAsequencing(ddRADseq)protocolthat iscom-monlyappliedforpopulation-levelSNPdiscoveriesbutalsoincreas-ingly applied for phylogenetic analyses, on a fungal genuswhoselife cycle is understudied. Lophodermium (Chevall.) is a paraphy-leticgenuswithover100namedspecies thatassociatewithdeador livingplantsworldwide (Lantz,Johnston,Park,&Minter,2011).Thosespeciesassociatedwithpineneedlesconsistofasmaller(ca.30 putative species) and closely related group (Ortiz-García etal.,2003) thatoftendominateoverotherendophyticspecies (Ganley,Brunsfeld, &Newcombe, 2004;Oono, Lefèvre, Simha, & Lutzoni,2015).Theirecologicalsignificancewithpines isunknownbeyondtheir“endophyte”status,althoughafewareconsideredpathogenic(Hanso&Drenkhan,2011)andsomeareexplicitlystudiedassapro-phytes(Osono&Hirose,2011).Lophodermiumspeciesalsoappeartohavelargewithin-speciesgeneticvariation(Deckert,Hsiang,&LarryPeterson,2002;Salas-Lizana,Santini,Adán,&Piñero,2012)withtheoccurrence ofmany cryptic species (Oono etal., 2014; Reignoux,Green,&Ennos,2014).Whileahandfulofgeneticmarkersmaysuc-cessfullydelimitpopulationsorspeciestorevealdispersalratesordemographichistories,thepopulationstructureofsomewidespreadspecieshasyettobeidentifiedwithtraditionalmultilocussequence-typingapproaches(Oonoetal.,2014;Salas-Lizanaetal.,2012).
High-throughputsequencingallowstherapidgenerationofge-nomicdata forhundredsof individuals to addressdiverseecolog-ical and evolutionary questions. As a genotyping technique thatcansequencemoreindividualsforfewerloci,RADseqhasbecome
themostwidelyused,cost-effectivemethod,particularly fornon-model species without reference genomes (reviewed in Andrews,Good,Miller,Luikart,&Hohenlohe,2016;Daveyetal.,2011).Thedouble-digest RADseq (ddRADseq; Peterson, Weber, Kay, Fisher,&Hoekstra,2012), avariationof theoriginalmethod (Bairdetal.,2008), improvesondepthofcoverageper locusbyoptimizingse-quencing effort and reducing missing genotypes. The protocol isflexible,allowingforeasyoptimizationfordifferentorganisms,ge-nome sizes, genetic diversity, and scientific questions (Mastretta-Yanes etal., 2015; Nieto-Montes de Oca etal., 2017; Recknagel,Elmer,&Meyer,2013;Zhouetal.,2014).Consequently,numerousmodificationsorimprovementsofddRADseqarebeingcontinuallyproposed (Franchini, Parera, Kautt, & Meyer, 2017; Heffelfingeretal.,2014;Recknagel,Jacobs,Herzyk,&Elmer,2015).Inthisstudy,weevaluateanunderutilizedversionofthepopularddRADseqpro-tocolthataccommodatestheless-expensiveandstandardindexedprimersformultiplexing(Kess,Gross,Harper,&Boulding,2016).
Double-digest RADseq consists of digestion of the genomicDNA with two different restriction enzymes (typically one withhigh-frequency site and one with low-frequency site), ligation ofdigested fragments toadapters,aPCRstep for theenrichmentofligated fragmentsandattachmentof indexes, size selection,pool-ingofsamples,andfinally,sequencingofpooledfragmentsusingahigh-throughputsequencingplatform.MultiplexingforddRADseqistypicallyachievedbyligatingadapterswithuniquein-linebarcodestodigested genomicDNAand then addingunique indexesduringPCR enrichment (Figure1 & 2; Parchman etal., 2012; Petersonetal.,2012;Mastretta-Yanesetal.,2015).Barcodedligationadapt-ersneedtobesynthesizedinpairsandcanbepricey(e.g.,$140perpair),arespecifictoarestrictionenzyme,andcanthereforeonlybeappliedwithddRADseq.Forexample,preparinga96-sampleddRADlibrarywitheightbarcodedadaptersandtwelveindexesrequiresaninvestmentofover$1,000 inadaptersalone ([eight forward+onereverse]×2 complementary annealing oligos×$70=$1,260) thatcannot be used for non-ddRAD sequencing projects (unlike stan-dardindexes).Consequently,thecostofbarcodedadaptersmayim-pedethestartofnewprojectsforsmallerlaboratories.Inthisstudy,wemodifiedprotocol thatwaspreviouslyused forSNPdiscoveryin a single species by Kess etal. (2016) that lowers the upfrontcostsfrombarcodedadaptersandallowstheuseofcommonlyap-pliedcombinatorialdual-indexedbarcodesusingstandardadapters
F IGURE 1 Designcomparisonbetweenbarcodedadapters(a)andtheadaptersinthisstudycompatiblewithstandarddualindexes(b).Arrowsindicatesequencingprimerregions.Sequencingprimersforthetargetregionarecustomdesignedin(b)tooverlapwithenzymecuttingsites.P1andP2areadaptersligatedtotargetfragmentsthatmay(a)ormaynot(b)havebarcodes.P5andP7areflowcell-bindingregionsforIlluminaplatformsandarealwaysincorporatedusingPCRtotheligationadapters.SeeAppendixS2forsequencesofadaptersandprimersandcostcomparisonanalyses
| 3SALAS- LIZANA ANd OONO
compatiblewithcohesiverestrictionenzymesites,andsequencingusinglow-costcustomprimersthatincludetherestrictionsite.SeeAppendixS2fordetailedcostcomparisonstooriginalddRADproto-col.Weexploredthereproducibilityandgeneticdiversityrevealedbythelow-costddRADseqprotocolonmultiplespeciesofthegenusLophodermium.
At present, Illumina HiSeq is the most commonly used plat-form for genome resequencing albeit the short read lengths perloci (e.g.,<100bps;butMiSeq isusedbyDaviketal. (2015),Kessetal.(2016)andVivianandRn(2017)).Thisplatformhastheabilitytoproducesufficientcoveragedepthper locus,givenanappropri-atemultiplexing density, to overcome potential sequencing errorsandfalseSNPcallsaswellasproducehighnumbersofsharedlociamong samples.Thedeepcoverage canbeparticularly crucial foridentifyingheterozygosityinnonhaploidorganisms.WetestedtheprotocolontheMiSeqplatforminsteadbecausewebelievedfewersequencereadswouldstillproducesufficientnumbersofsharedlociamongsamplesforpopulationgenomicandphylogeneticanalyses,given the relatively small haploid genomes of these species (est.40–80Mbps).Wewere also interested in understanding how se-quencingdepthaffectedtheabilitytoidentifysharedlociorrecover
locirepeatedlyacrosssampleswiththelongerreads(e.g.,>100bps).Weassessedtherecoverablenumberofsharedlociamongtechnicalreplicatesamplesbydeterminingtherelationshipbetweenthenum-berofsharedlocibetweenreplicatesandtheirsequencingdepths.Furthermore,becausethecoverageperlocuswouldbecomparablylower than inHiSeq, potentially introducingmisidentified variantsintothefinaldatasetfromerrorsduringsequencing,PCR,orothertechnical modifications in the preparation of the library, we alsocomparedtheerrorratesbetweensharedlociintechnicalreplicatesamplesruneitherwithinorbetweenIlluminalanes.
The purpose of this study was twofold: (i) constructing a ro-bust phylogeny of a genetically diverse group of understudiedfungiusingarapidandlow-costddRADseqprotocolwithstandardindexesand(ii)assessingtherelationshipamongclusteringandfil-teringparameters,errorrates,andnumbersofhomologouslociformicro-andmacroevolutionaryanalyses.Wealsoevaluatedthere-coverablenumberofsharedlociwithincreasinggeneticdivergencewithin Lophodermium and how these loci might improve speciescircumscription compared to thewidely used internal transcribedspacer (ITS) rDNA locus.We demonstrate its utility on 50 fungalisolates representingpopulationsof six putative species basedon
F IGURE 2 AcomparisonoffourddRADseqprotocols.Optionalstepsareexcluded.DetailedprotocolforthisstudycanbefoundinAppendixS2
4 | SALAS- LIZANA ANd OONO
ITSphylogenetics.Weshowthatthislow-costddRADseqprotocolapplied on theMiSeq can be effectively implemented for circum-scribing more putative fungal species and finer population struc-turesthan ITSalone.Wedemonstrate itspotential tosignificantlyimprovephylogeneticresolutionfornonmodelorganismsandrevealpopulationstructurewithinwidespreadspecies,butclusteringandfilteringarestillkeybioinformaticstagesthatneedtobetestedfordifferentgroupsoforganisms.
2 | MATERIAL S AND METHODS
2.1 | Sampling, culturing, and DNA extraction
MultipleLophodermiumspeciesandindividualswerecollectedfromboth ascocarps on senescent needles and mycelial cultures fromhealthygreenneedlesofPinus spp. (seeTableS1A for geographicorigins, host species, and isolationmethods).Monosporic cultureswere obtained from ascocarps as described in Salas-Lizana etal.(2012)using2%maltextract(ME)agar.TheLophodermiumisolatesfromgreenneedlesused in this studycome fromanongoing sur-veyofendophytesofpinetrees.Greenneedleswerewashedandsurface-sterilizedasdescribedinOonoetal.(2015).Greenneedleswerecutwithasterilerazorbladeinto2-mm-longsections.Sectionswereplaced in2%MEagar slants in1.5mlmicrocentrifuge tubesinsterileconditions.Asubsetofemergingcultureswasgenotypedbysequencingtheinternaltranscribedspacer(ITS)andpartiallargesubunit(LSU)rDNAregion(hereafterITS-LSU)forpositiveidentifi-cationofaLophodermiumspp.Severalsamplescomefromthefungalcollection atNewZealandCrownResearch Institute (Scion).Withtheexceptionofthreespecies(L. australe,L. conigenum,andL. pinas-tri), the remaining specimenswere identified using a combinationofascocarpmorphology(datanotshown)andITS-LSUsequences.TheITS-LSUsequencesofL. australe and L. conigenumweresimilarenough(>97%sequencesimilarity)tobeconsideredasingleputativespeciesinthisstudy,whichisalsosuggestedbyothers(Minter,1981;Ortiz-García etal., 2003). Cultures were grown for 2–3weeks in50mlof2%MEmediaanddriedonfilterpaperundersterilecondi-tions.Between40and100mgofdriedmyceliumwasusedforDNAextractionfollowingamodifiedCTABmethod,seeAppendixS1fordetails.DNAqualitywas assessed by visualizing on 1.5% agarosegels, confirming the absence ofRNA/DNA smears.DNAquantifi-cationwasperformedonaQubit2.0 fluorometerusingtheQubitdsDNAHSAssayKit(Invitrogen,Carlsbad,CA,US).
2.2 | ITS- LSU sequencing and phylogenetic analyses
WeamplifiedtheITS-LSUrDNAregion(ITS1,5.8S,ITS2,andpartial28S)usingprimers ITS1F (Gardes&Bruns,1993) andLR3 (Hopple&Vilgalys,1994).PCRprotocolconsistedofan initialdenaturationstepat95°Cfor4min, followedby35cyclesof30sat95°C,30sat50°C,and90sat72°C,andafinalextensionat72°Cfor10min.PCRproductswerecleanedwithExoSAP-IT(USB-Affymetrix,SantaClara,CA)andsequencedusingSangertechnologyatUCBerkeley
DNA Sequencing Facility. The ITS-LSU sequences were alignedusingMAFFTv7 (Katoh&Standley,2013)withdefaultparametersfor the50sampledLophodermium individualsaloneaswellaswith55 additional reference ITS-LSU sequences from public databases(seeTableS1B for ITSaccessions).Wecompared the topologiesofunrooted trees constructed by ITS-LSU alone and all ddRAD loci.Inallcases,treeswereobtainedusingRAxMLv8.0.26(Stamatakis,2014)with1000bootstrapreplicatesundertheGTRGAMMAmodel(Tavaré,1986).
2.3 | ddRADseq library preparations
ThreetoninehundrednanogramsofDNApersample(6μltotal;50–150ng/μl)wasdouble-digestedusingtherare-cuttingEcoRI-HFandthefrequent-cuttingMseIenzymes(NewEnglandBioLabs,Ipswich,MA).IncaseswherestartingDNAconcentrationswerelow,dupli-catesamplesunderwentdigestionandligationstepsandthenwerepooled (see detailed workbench protocol for additional tips andsimilaritieswithKessetal.(2016)inAppendicesS1–S2).AllsampleswereplacedrandomlyinPCRplatesduringlibrarypreparation.TheDNAwasdigested for fourhoursat37°C,and theenzymeswereheat-killedat65°Cfor10min.DigestedDNAfragmentswereligatedtotheEcoRI-specificP1adapterandtheMseI-specificP2adapter(Figure1b,AppendixS2),whicharenotbarcodedasintheoriginalddRADseqprotocol(Petersonetal.,2012),withtheT4ligaseenzyme(NewEnglandBioLabs,Ipswich,MA).Theligationreactionconsistedofincubatingovernight(>12hs)atroomtemperature(approx.21°C)andheat killing the enzymeat 65°C for 10min. In order to elimi-nateunincorporatedadaptersandsmall(<300bps)DNAfragments,ligation reactionswerepurifiedusingAgencourtAMPureXPSPRImagneticbeads(BeckmanCoulter,Brea,CA)at0.8:1(beads:sample)volume ratio and resuspended in 35μl of 10mmol/L Tris buffer.Aunique combinationof the IlluminaNextera v2.0® dual-indexedbarcodes(P5andP7;Illumina,Inc.SanDiego,CA)wasattachedtopurifiedfragmentswith14cyclesofPCRfortworeplicatesofeachsample.IndexedPCRproductswerenormalizedandpooledinequi-molarproportions.Fragmentsizesbetween300and700bpswereselectedusingPippinPrep(SageScienceInc.,Beverly,MA)2%aga-rosecassettes.ThefinallibrarieswereanalyzedusingaTapeStationInstrument (AgilentTechnologies, SantaClara,CA) to confirm therecoveryoffragmentsontheselectedrange(seeAppendixS2forfragmentsizeprofiles).ThelibraryconcentrationswerequantifiedaspreviouslydescribedandsequencedusingIlluminaMiSeqplatformunder a251-cyclepaired-end readprotocol at the IGMGenomicsCenter,UCSanDiego.Intotal,fourlibraries,whichincludedsamplesfromotherstudies,weresequenced.SeeTableS1Cfor libraryde-scriptionandbarcodedsamples.
We tested reproducibility of the ddRAD data by sequencingtechnical replicatesof sixLophodermium strainswithinandamonglibraries.ReplicatesaresamplesfromthesamegenomicDNAthatwere processed independently (i.e., separate enzyme digestion)andtaggedwithdifferentcombinationsofP5andP7 indexes (seeTableS1Cfordetails).
| 5SALAS- LIZANA ANd OONO
With traditional Illumina sequencing primers, the sequencerdetectszerosequencediversityatthebeginningofthese librariesbecausetheyallhave identicalrestrictionsitepatterns.This initiallow-diversity impairs the MiSeq system to “maintain focus, reg-ister imagestotheclustermap,andmakeproperbasecallstode-liver high-quality data (Pub.No. 770-2013-013, Illumina technicalsupportnote).”Hence,customsequencingprimerswereusedthatincludetherestrictionsitebases(Figure1b;AppendixS2),whichal-lowsthesequencertodetecthighsequencediversityfromthebe-ginning.Theseprimersalsoallowtheusertomaximizesequencingthevariablelociandtoavoidsequencingthebarcodesorrestrictionsites(Figure1;Kessetal.,2016).
2.4 | Bioinformatics—filtering and clustering ddRADseq
Paired-endreadsweredemultiplexedbythesequencingfacilityandmergedusingPEAR(Zhang,Kobert,Flouri,&Stamatakis,2014)withdefaultparameters:aminimumoverlapof10bps,aminimumassem-bledsequencelengthof50bps,andamaximump-valueof.01fortheobserved-expectedalignmentscore,whichtestsstatisticalsig-nificanceofmergedreads.Weexcludedunassembledpairssince,onaverage,<1.58%ofreadswereunmerged(TableS1F,G),suggestingthatanalyzingmergedreadswithlowersequencingerrorswasmorevaluable tous thanretaining longer readswith lower likelihoodofidentifiablesharedhomologyacrosssamples.
Merged readswereprocessedusingpyRADv3.0.66pipeline(Eaton,2014)tofilterforqualityandpotentialparalogs(morede-tailbelow).NucleotidebaseswithPhredQ-scores<20(i.e.,accu-racyofthebasecall is<99%)werechangedtoan“N”characterand reads were excluded (i.e., quality-filtered) based on the al-lowablenumberofNsper read.Wetesteda rangeofallowablenumberofNsperreadfrom5to30(TableS1E),whichrepresentsapproximately1.8%–10.7%ofnucleotidesper readgivenanav-erage assembled sequence length of 281 bps (TableS1F). Thedefault of 4Ns (appropriate for shortHiSeq reads)would havebeentoostringentforthelongerMiSeqreads.Thequality-filteredreadswerethenclusteredwithinsamplesusing70%or85%se-quencing similarity thresholds. We analyzed the effect of thenumberofallowedNsduringthefilteringsteponthenumberofclusterspersample.
Filtered reads were clustered within samples using VSEARCHv1.11.1(Rognes,Flouri,Nichols,Quince,&Mahé,2016).Wetestedarangeofclusteringthresholdsbetween70and95%sequencesim-ilarityat5%incrementsforsamplesthatwerefilteredforreadswithamaximumof15Ns(hereafterdescribedas“15N”).Wethenana-lyzed the effect of clustering thresholds on the average coveragedepth (i.e.,numberof readsper cluster)per sample.Wealsoana-lyzedreadsfilteredwithamaximumcriteriaof30Ns (i.e., “30N”),rather than15N, and clustered these reads just at 70% sequencesimilarity(hereafterdescribedas“70/30N”)to includeinourcom-parativeanalysis.Putativelociorclustersrepresentedbyonlyoneread(i.e.,singletons)werediscarded.
Aconsensussequenceiscalledforeachputativelocusorclus-terusingeithertheestimatederrorrateorthemajorityrule.Whenacluster isrepresentedbymorethanthreereads,consensusbasecallsaremadeusingonlyA/T/G/CorNbasedonasequencingerrorratethatisestimatedbymaximumlikelihoodacrossallclusters(i.e.,putativeloci)withinasample,assumingzeroheterozygositybecausefungalDNAwashaploid.Anyambiguousbasesitethatoccursmoreoften than theexpected ratebasedon theestimatederror rate iscalled“N.”Ontheotherhand,whenaclusterisrepresentedbyonlytwo reads, consensusbase calls aremadeusing themajority rule,whichchoosestheloweralphabeticalbasetocallambiguousbases(i.e.,A/T=A;A/G=A;A/C=A;andG/C=C).Themajorityrulebasecall introducesbias inthedata (e.g.,more likelytoshareAsatthesame site thanTs), but allows retentionof lowcoverage loci. Lociwith consensus sequenceswithmore than fiveNs are discarded,which helps to remove potential paralogs. Decreasing clusteringsimilarity(i.e.,from85%to70%)increasescoveragepercluster,butcanalso increasepotentialparalogswithinclustersaswellasSNPerrorrates.
TheselociwerethenclusteredamongsamplesusingVSEARCHagainandalignedusingMUSCLEv3.5(Edgar,2004).Alignedlociwerekeptinthefinaldatasetiftherewerefewerthan100SNPsand 99 insertions/deletions (indels) across samples (default val-ues).Thesecriteriadiscardlongerreads(>330bps)whosecluster-ingthreshold(e.g.,70%)producehighlyvariableloci.Becauseourdatawerealreadydemultiplexedandpaired readsweremerged,the PyRAD pipelinewas started in step two, wherewe set thedatatype to ddrad, instead of pairddrad. Summary informationfor 85/15N and 70/30N pyRAD pipeline runs can be found inTablesS1F,G.
2.5 | Phylogenetic analyses with ddRAD loci
ToexploretheusefulnessofthelongerddRADreadsforinferencesatadeeptimescale (i.e.,phylogenetics),wecomparedthegeneticdistance,measuredusing ITS-LSU rDNA region, to thenumberofsharedlocibetweenpairsof50individualsfromsixputativespeciesof Lophodermium (L. australe-L. conigenum complex, L. baculiferum, L. molitoris, L. nitens, L. pinastri, and L. sp. nov.).Clusteringthresholdsbetween55and70%sequencesimilarityareconsideredtorecon-structthemostaccuratephylogenetictopologiesforrecentspeciesdivergences (e.g.,<60millionyears)but ishighly recommended tobetestedatvariousclusteringvalues(Rubin,Ree,&Moreau,2012).AsourreadsarelongerthantypicalddRADreadssequencedontheHiSeq,wetestedclusteringat85%threshold(DaCosta&Sorenson,2014;Heffelfingeretal.,2014)allowing15Nsaswellasatalower70% threshold allowing30Ns.Hereafter, the final datasets usingthesetwocriteriaarereferredtoas85/15Nand70/30Nmatrices,respectively.Locifoundinmorethan10sampleswerekept inthefinaldataset forphylogenetic reconstruction.Wechose10as theminimum sample number per locus because the maximum sam-ple size within a putative species was eight (with the exceptionofL. nitens and L. sp. nov.,whichare likely speciescomplexeswith
6 | SALAS- LIZANA ANd OONO
multiplecrypticspecies).Therefore,aminimumsampleof10wouldmostlikelyresultintheinclusionofmorethanonespeciesforeachlocus in the final alignment.Wecompared thephylogenetic reso-lutionbetweenthe70/30Nand85/15Nmatrices.AllphylogeneticanalyseswereperformedusingRaxMLv8.0.26 (Stamatakis,2014)as described above.We also evaluated the number of shared lociwithinspeciesusingthetwocriteriaforspeciesrepresentedbyatleasteightindividuals.
Pairwisegeneticdistancesbetweensamplesandputativespe-cieswereestimatedusingKimura2-parametermodel(Kimura,1980)consideringvariableratesamongsites(α=0.5)intheITS-LSUlocusafter testingseveralevolutionmodelsusingamaximum-likelihoodcriterion (lnL) inMEGA v7 (Kumar, Stecher, & Tamura, 2016). Theeffectofgeneticdistancesonthenumberofsharedlociamongpairswasanalyzedbybest-fitmodelswithminimumnumberofparame-tersinTableCurve2Dv3(SystatSoftwareInc.).
2.6 | Reproducibility analyses
Wetestedthereproducibilityofourmolecularbenchprotocolbyestimatingvariationinbasecallsbetweenninetechnicalreplicatepairs(usingsix individualsamples)withinorbetweensequencingruns(TableS1D).Wefilteredthereplicatepairsusingboth70/30Nand85/15Ncriteriaandcalculatedtheproportionofvariablesitesbetween shared loci (i.e., number of SNPs divided by the totallength of concatenated shared loci). The variable sites betweenshared lociof replicatepairsshould represent theerror rates in-troducedduringPCR,sequencing,filtering,andclustering.Wealsotestedhowsequencingdepths(i.e.,totalnumberofreadspersam-ple) related to the totalnumberofshared locibetweenreplicatepairsafterfiltration.Wewereinterestedinunderstandingwhetherincreasingsequencingdepthhassignificanteffectsonthenumberofsharedlocibetweenreplicatepairsorontheerrorrates.
3 | RESULTS
3.1 | ddRADseq data summary
Atotalof11,901,179sequencereadswereanalyzedforthisstudy(including replicate samples) which were produced in four sepa-rate librariesconsistingof44,48,64,and68samples (TableS1C).Demultiplexed, rawsequencereadsareavailable intheNCBISRAunderBioProjectaccessionPRJNA35677.Onaverage,98.42%(SD 0.89) of all readsweremergedwith aminimum of 94.41% and amaximumof99.41%persample(TableS1F).Theaveragenumberofmerged reads per samplewas 201,715with an average length of281bps.Themeanquality (Q)scoreofmergedreadsrangedfrom35.70 to39.60 for the first251nucleotidepositions, 33.60 to31uptoposition485,andthenfrom30.90to28.70at finalposition492.ThedropinQ-scoreistypicalforreversereads(Schirmeretal.,2015). Themeanpercentageof reads that passedquality-filteringwas 90.22% and 84.47% for 30N and 15N criteria, respectively.Within-sample clustering recovered an average of 48,659 and
52,658clusterspersample(nloci)forthe70/30Nand85/15Ncrite-ria,respectively.Themeandepthspercluster,excludingsingletons,were8.16 (SD11.30) for70/30Nand7.71 (SD11.20) for85/15N,andcorrelatedhighlywithsequencingdepthpersample(FigureS1).Theaveragepercentagesofnloci (i.e., representedby theconsen-sussequenceofagivencluster)remainingafterexcludingsingletonsandthosewithmorethanfiveNs in theconsensussequence (i.e.,f1loci)were28.26%and30.03%for70/30Nand85/15Ncriteria,re-spectively.Theaveragepercentagesoflociremainingafterexclud-ingpotentialparalogs(f2loci/nloci)tocompareacrosssampleswere28.00% for 70/30N criteria and 26.71% for 85/15N. The averagenumbersoftheseremainingloci(i.e.,f2loci)sharedbyaminimumoftensamplesforphylogeneticanalyseswere566(3.87%off2loci)for70/30criteriaand569(4.04%off2loci)for85/15N.SeeTablesS1F,Gforfurtherdetails.
3.2 | Allowable Ns at filtering stage and clustering similarity thresholds
IncreasingtheallowednumberofNsperreadfrom0to30Nsstead-ilyincreasedtheproportionoffilter-passedreads(Figure3).Onav-erage,43.2%,15.5%,and9.8%ofreadshadatleastone,15,and30Nsper sample, respectively.Themarginal increase in filter-passedreads decreasedwith each additional allowed number of Ns. SeeTableS1Eforadditionalsummary.
F IGURE 3 Percentfilter-passedreadsvs.No.ofallowedNsperread.EachpointrepresentsadifferentsamplefilteredwithdifferentnumbersofNs.NucleotidebaseswithPhredQ-scores<20werechangedtoan“N”character,andreadswereexcludedbasedontheallowablenumberofNsperread.Forexample,excludingreadswithmorethanfiveNsisequivalenttoexcludingreadswithmorethan1.8%nucleotidesperreadonaveragewithaPhredQ-scorelessthan20,assuminganaveragereadlengthof281bps.Excludingreadswithmorethan30Nsisequivalenttoexcludingreadswithmorethan10.7%nucleotidesperreadonaveragewithaPhredQ-scorelessthan20,assuminganaveragereadlengthof281bps
| 7SALAS- LIZANA ANd OONO
The number of clusters per sample increased on averagewithnumberofallowedNs,regardlessofclusteringthresholds(TableS1E).Whenreadswereclusteredat70%similarity,allowingzero,15,and30 Ns per read , these parameter combinations produced, on theaverage, 32,954, 45,999, and 48,659 clusters per sample, respec-tively.When readswere clustered at85% similarity, allowing zero,15,and30Nsperread,theseparametercombinationsproduced,ontheaverage,37,124,52,658,and56124clusterspersample,respec-tively(seeTableS1E,F,andGforsummary).IncreasingthenumberofallowedNsalso increasedtheaveragecoveragedepthper locus(Figure4a),althoughtheincreasewasminimal(i.e.,<1)between15and30Ns.Thefinalnumberoflocipersampletobecomparedacrosssamples(i.e.,f2loci)increasedwithincreasingnumberofallowedNs(Figure4c).However,thisincreasewasminimalfrom15to30Ns(e.g.,1.4%increasefor70%clusteringand1.6%increasefor85%cluster-ing;Figure4c).Thequality-filteringparameterhadlittleeffectontheaveragenumberoflocipersampleusedinthephylogeneticanalyseswhenlocisharedbyfewerthan10sampleswereexcluded(Figure4d).
Clustering at lower similarity thresholds decreases the totalnumberoflocipersample(Figure4c,Figure5,TableS1E,FandG)andincreasescoveragedepthpercluster(Figure5),buttheincreaseincoveragedepthperclusterwasminimalfrom85%to70%(i.e.,<1readpercluster).Clusteringthresholdvaluescan,however,signifi-cantlyaffectthenumberofsharedlociamonglineages(TableS1H).
Therelationshipbetweensequencingdepthandthenumberofloci recovered from each sample depended on the Lophodermium lineage (Figure6). For example, whereas the loci accumulationcurveforLophodermium sp. nov.appearedtobereachingaplateauwith greater sequencing depths, other species appeared to havea higher diversity of reads per sample. To test how sequencingdepthdifferentlyaffectedtheaccumulationofnew loci fordiffer-entLophodermium lineages,wemergedsharedlociperlineageandanalyzedthelociaccumulationcurveusingrandomsamplingoflociwithreplacement.L. nitensandtheL. australe–L. conigenumcomplexhadsignificantlygreaterdiversityoflocithanL. sp. nov. or L. molitoris (Figure6c,d).SeeTableS1F,Gfordetails.
Theaveragesequencelengthperlocusdecreased(i.e.,from262to249bps)assingletons,potentialparalogs,andlociwithhighratesofambiguousnucleotideswereexcludedthroughthebioinformaticpipeline. The average sequence length of loci incorporated in thephylogeneticanalyses(i.e.,beingsharedwithatleasttensamples),
however,didnotsignificantlydecrease(i.e.,from249to236bpsfor70/30Nandfrom246to242for85/15N).
3.3 | Phylogenetic analyses
The supported phylogenetic relationships, as revealed by ITS-LSUalone,of thesixputativespeciesused in thisstudy (Figure7)wereidenticaltothoseobtainedwith55additionalreferenceLophodermium sequences frompublicdatabases (FigureS2). Thedeeperbrancheshad lower support (bootstrap values <75%) based on the 85/15Nalignment compared to either the ITS-LSU or 70/30N alignments(Figure7). The 70/30N phylogeny had the highest bootstrap sup-port inalmostallbranches (i.e.,100%bootstrapsupport), includingbranchessupportingputativespecies.The tree topology recoveredbythe70/30NmatrixwasalsohighlysimilartotheoriginalITS-LSUtree.Themaindifferencebetweenthetwotopologieswasthatthe70/30Ntreewasbetterresolvedintheterminalbranches.Thebetter-supportedbrancheswithinputativespeciesoftenrevealedinterest-ingcorrelationsbetweengenetic structureandendophyteecology.ThenewLophodermiumspeciesandL. nitensbothclearlyconsistedoftwolineages,whichmostlycorrelatedwithgeography(i.e.,northvs. southforL. sp. nov.andeastvs.westforL. nitens;seeTableS1Afordetailsongeographicoriginsofsamples).L. molitorisalsoconsistedoftwolineages,whichcorrelatedtodifferentPinushostsubgenera(i.e.,Strobusvs.Pinus).ThetwolineagesofL. australe–conigenumcomplexcouldalsobeclearlydelineatedwiththe70/30Ndataset,butdidnotcorrelatewithanyknownecologicaldifferences.
Thesummaryoffinalalignmentmatricesfor85/15Nand70/30NddRADlociisfoundinTable1.The85/15Nmatrixwasnotaninclusivesubsetofthe70/30Nmatrix,butcontained36locithatwerenotfoundinthe70/30Nmatrix.Wefoundthatthese36lociwerediscardedfromthe 70/30Nmatrix during the alignment stage because they eithercontainedmore than100SNPsacrosssamples (11/36 loci)ormorethan99indelsacrosssamples(15/36loci).Theremainingloci(10/36)wereexcludedaspotentialparalogsbecausemultiplehaplotypeswerefoundwithin a cluster.However, evenwhen these36 lociwereex-cludedfromthe85/15Nmatrixsuchthatthe70/30Nmatrixwasin-clusiveofalllociinthe85/15Nmatrix,thebootstrapsupportvaluesdidnotsignificantlyimproveforthe85/15Nmatrix(datanotshown).
The number of loci shared within-species was greater in the70/30Ncriteria than in the85/15Ncriteria (15.71%more loci, on
TABLE 1 Summaryofthethreealignmentsusedforphylogeneticanalyses
Alignment No. of LociSum length of all loci Polymorphic positionsa
Ambiguous/Gap sitesb (%)
Missing locic (%)
Mean polymorphic positions per locus
ITS-LSU 1 1,143 303 24.68 0.00 303
85/15N 2,068 529,469 54,335 77.78 76.99 26
70/30N 2,388 607,736 84,747 77.59 76.35 35
aPolymorphicpositionsonlyincludenucleotidevariants.bAmbiguous/gapsitesareeither“N”sor“-”sinthealignmentmatrix.cMissinglociwascalculatedastheproportionofabsentlocioverthetotalnumberofpotentialloci(i.e.,thenumberofsamplesmultipliedbythetotalnumberoflociinadataset).
8 | SALAS- LIZANA ANd OONO
average;TableS1F,G).ThenumberofsharedlocibetweensamplepairsandspeciespairsdecreasesasgeneticdistancesbasedonITS-LSUsequencesincrease(Figure8).Thedecayrateofthenumberofshared lociwithgeneticdistance isgreater for the85/15Nmatrixthanforthe70/30Nmatrix.Thispatterndoesnotchangewhenthenumberofsharedlociisdividedbytheaveragenumberoftotallociforpairwisecomparisons(FigureS3).
3.4 | Reproducibility
Thenumberofsharedlocibetweentechnicalreplicatepairsvariedbetween1,254and12,340,mostlikelycorrelatedtotheirrangesinsequencingdepths(Figure9a;FigureS4).Thevariationinbasecalls between shared loci of technical replicate pairs (observederrorrates)rangedfrom0.006%to0.43%forthe70/30Ndatasetand0.004%to0.33%forthe85/15Ndataset (Figure9b,FigureS4,TableS1D).Theobservederrorratesbetweenreplicatesweresignificantlygreaterfor70/30Nthanfor85/15Nanalyses(pairedt test,p < .0001), albeit aminordifference (0.0012vs.0.0007).The average estimated error rates were 0.22% and 0.24% for70/30Nand85/15Ndatasets (TableS1F,G),respectively,basedonthemaximum-likelihoodequationofLynch (2008).Theaver-ageobservederrorratewaslowerthantheestimatederrorrateforbothdatasets(p < .05for70/30Nandp < .01for85/15N).Theestimatederrorratesdidnotsignificantlycorrelatewithsequenc-ing depths for the 59 samples (data not shown), but may havedifferedamongLophodermiumlineages(Figure6).ReplicatepairsoftheL. sp. nov.hadlowererrorratesoverallwithanaverageof0.01%(n=6pairs,2replicates)whereaspairsofL. nitenshadanaverageof0.24%error rate (n=5pairs,3 replicates),but there
werenotenoughreplicatepairsforeachlineagetotestthissta-tistically.Theerrorratesbetweensameanddifferentsequencinglibrariesforthreereplicatesampleswerecomparedwithapairedone-way t test andwerenot statistically significant (p = .11) al-thoughtherewasatendencyforerrorratestobehigherbetweenlibrariesthanwithin(TableS1D).
4 | DISCUSSION
The low-cost ddRADseq, which had been previously applied forSNPvariantdiscoveryinasinglespecies(Kessetal.,2016),gener-atedasufficientnumberofhomologouslocitoconstructastronglyresolvedphylogenyformultipleputativespeciesofthewidespreadLophodermium genus.Wealso found thatgenetic structurewithinputativespeciescanoftenbecorrelatedwithgeographyordiffer-enthostspecies,butitcanalsobeobservedwithinthesamehostsandlocations,suggestingthatecologicaltraitsotherthandispersallimitationorhostspecificitycanactasbarrierstogeneticintrogres-sion.ThephylogeneticresolutionwasimprovedoverITSalone,butdependedonfilteringandclusteringparameters.Theclusteringpa-rameterwasmarkedlymoreimportantthanthefilteringparameter.
4.1 | Filtering and clustering criteria
IncreasingtheallowednumberofNsfrom0to30increasedthepro-portionof filter-passed reads,with thegreatest increasebetween0and5Ns(Figure3).ThenumberofNswasalsopositivelycorre-latedwithnumberofclusters(includingsingletons),coveragedepth(Figure4a),andnumberofloci(excludingsingletons;Figure4c),but
F IGURE 4 EffectsofthenumberofallowedNsinafilter-passedreadonthe(a)coveragedepthofclusteredlociwithmorethanoneread,(b)percentloci(f2loci)outoftotalclusteredloci(nloci)afterfilteringsingletonsandpotentialparalogs,(c)numberoftotalloci(f2loci)persamplefortwoclusteringthresholdvalues(70%and85%),and(d)numberoflocishared(f2loci)byatleasttensamplesfor70%clusteringthresholdvaluesforthe59samples(lightgraylines).Dottedandsolidlinesindicateaverageforlociclusteredwith70%and85%similarity,respectively.Grayareasrepresentstandarderror(n=59).SeealsoTableS1Efordetails
| 9SALAS- LIZANA ANd OONO
the increases in the latter twoweremodest after 5Ns. Applyingaless-stringentfilteringcriteriaonlyledtoremovalofgreaterpro-portionsof clusters later in thepipeline (Figure4b) becausemoreclusters either were singletons, had consensus sequences withgreaterthanfiveNs,orwerepotentialparalogs(i.e.,hadmoream-biguousnucleotidepositionswithinclusters).Overall,thenumberof
allowedNshadlittleeffectonthenumberofsharedlociamongourLophodermiumspeciesthatwereusedinthephylogeneticanalyses(Figure4d).ThissuggeststhatallowingareasonablenumberofNsdoes not bias the final result of this study butmaymarginally in-crease the average coverage depths formore accurate consensussequencesandfewerpotentialparalogstobeexcludedduetoam-biguoussitesinreads.
Increasing clustering thresholds from 70% to 85% or 95% in-creasesnumberoflocipersample(Figure6)butdecreasesthenum-berofsharedlociamongdifferentLophodermiumlineages(Figure8,TableS1H).Interestingly,althoughwetypicallyconsiderdeepercov-eragetobeassociatedwithdecreasederrorrates,lowerclusteringthresholds had greater error rates (TableS1D) despite, albeitmar-ginally,deepercoverage (Figure5).Theerrorrateswere,however,bothminor.Therefore,wesuggestvariousclusteringthresholdstobe tested thatminimizes error rates (i.e., compare replicates) andmaximizessharedlociatthegeneticscaleandbreadthofeachstudy.
4.2 | Phylogenetic topologies with ddRAD loci
Thephylogeneticreconstructionbasedonthe70/30Nmatrixwassignificantly better resolved in the deeper branches than by the85/15Nmatrixandalso in the terminalbranches thanby the ITS-LSUmatrix (Figure7). Thewell-resolved terminal branches of the70/30Ntreecorrespondedtopreviously identified(e.g.,L. australe major vs. cryptic; Oono etal., 2014) or potential cryptic species
F IGURE 5 Effectofclusteringthresholdsonthemeandepthperclusterandnumberofloci(f2loci),excludingsingletons,persample.Solidlineindicatesmeancoveragedepth(leftaxis)anddottedlineindicatesnumberofloci(rightaxis).Inallcases,amaximumof15Nswasallowedinthequality-filteringstep.Shadedareasrepresentstandarderror(n=59).Meandepthandnumberoflocifor70/30Nareprovidedasasinglepointwithstandarderrorbarsforcomparison.SeeTableS1Efordata
F IGURE 6 Readsvs.numberoflociat70%clustering(a&c)or85%clustering(b&d).(a&b)Effectofsequencingdepth(numberofmergedreads)persampleonthenumberofloci(f2loci)persample.Fittedlogarithmiccurvesaredisplayedforeachputativespeciestorepresentaccumulationcurves.Grayregionsindicate95%confidenceintervalsoffittedcurves.(c&d)Theaccumulationofnewlociperspecies(uniquef2loci withinputativespecies)withincreasingsequences.TherewerenoenoughsamplesorreadsforL. baculiferum or L. pinastriforcomparison
10 | SALAS- LIZANA ANd OONO
(withinL. nitens and L. sp. nov.).Forinstance,thetwowell-supportedclades within L. sp. nov. (Figure7, Figure S2) correspond to sam-ples recovered from two distinct geographical regions, NorthernCaliforniaandOregon/Washington,whichmaycorrespondtotwo
species or a single highly structured species. The 70% thresholdallowed the inclusion of highly variable loci, which increased thenumberofphylogeneticallyinformativepositionsandsharedlocibe-tweenmoredistantlyrelatedspecies(Table1,Figure8,TableS1H).
F IGURE 7 Unrootedmaximum-likelihoodtreesfornrDNAITS-LSUregionandddRADlociobtainedwithtwodifferentcombinationsofsimilaritythresholdsandnumberofallowedNsinaread(seetextfordetails).Barsnexttothetreesmapisolatesthatbelongtothesameputativespeciesorspeciescomplex.L. australe and L. conigenumarelabeledaftertheiroriginalidentifiersalthoughtheylikelyrepresentasinglespeciesandarereferredastheL. australe–L. conigenumcomplexthroughoutthismanuscript.L. australeisolatesarelabeledwithM(major), C(cryptic),orH(hybrid)basedonapopulationstructureanalysisbyOonoetal.(2014).PutativeL. molitorisisolatesarelabeledwithP(Pinus)orshadedbarslabeledStrobusbasedonhostspecies.PutativeL. nitensisolatesarelabeledE(East)orW(West)basedongeographicorigins.PutativeL. sp. nov.isolatesarelabeledN(North)orS(South)basedongeographicorigins.Numbersabovebranchesrepresentbootstrapvalues,notallbootstrapvaluesareshownforaclearerview.SeeFigureS2forITS-LSUphylogenywithadditionalreferencesequences
F IGURE 8 Numberoflociclusteredat70%or85%similaritywithatleasttensamplesperlocusthatweresharedbetweenpairsofsamplesasafunctionofgeneticdistance,basedonnrDNAITS-LSUsequences.Theshadedarearepresentswithin-speciesvariation(>97%similarity).Inset:samedatabutmergedtocomparebetweenpairsofputativespeciesusingaminimumof2speciesperlocus.Thetworegressionsrepresentfittedcurvesforeachdataset,70/30N(black)and85/15N(gray).Theequationforthefittedcurvesisy=a+b/x2,wherea=37.842,b=0.404,andr2=.404for70/30Nmatrixanda=3.949,b=0.234,andr2=.675for85/15Nmatrix
| 11SALAS- LIZANA ANd OONO
Althoughclusteringreadsorlocibysimilarityisanimperfectsolu-tionto identifyinghomologous loci,ourresultsagreewithconclu-sionsfromRubinetal.(2012),showingthatclustersthatmayincludeamixtureoforthologousandparalogouslocistillcontainsubstantialphylogeneticsignalthatcanproducecorrecttopologieswithmoder-atetohighaccuracy.
Thephylogeneticreconstructionbasedonthe85/15Nmatrixstillimprovedover thatof the ITS-LSUmatrix forwithin-speciesstruc-ture(i.e.,recentdivergence)despitelowerbootstrapvaluesatdeeperbranches.Ahigher clustering thresholdof85%significantlydimin-ished the number of shared loci across species compared to 70%.Forinstance,whileL. nitensandtheL. australe–L. conigenumcomplex(thetwomostdistantlyrelatedspecies inoursample)hadonlytensharedlociinthe85/15Nmatrix,therewere73inthe70/30Nmatrix(TableS1H).Thissevenfolddifferencecontrastswiththecloselyre-latedspeciesL. baculiferum and L. molitoris,whosenumberofsharedlociin85/15Nand70/30Nwasdifferentbytwofold(30and63loci,respectively).Thisresultisconsistentwiththeobservationthatthenumberofshared locidecreasesabruptlywithgeneticdistanceforboth thresholds but ismorepronounced for85/15N (Figure8).Asa result of using a different number of allowed Ns and clusteringthresholds,the85/15Nmatrixincluded36locithatwerenotfoundinthe70/30Nmatrix.Theseloci,however,werenotsolelyresponsibleforthelowersupportofdeeperbranchesbythe85/15Nmatrix.Asexplainedbefore, it ismorelikelythatthismatrixperformedworsebecauseoffewersharedlociacrossdistantlyrelatedspecies.
Thenumberofhomologouslociindistantlyrelatedlineagescanlikelybeimprovedwithgreatersequencingdepthsornarrowerse-lectionof sequence lengths. In this study, as the genetic distancebetween species increased, the number of shared loci decreased,althoughnotata linearrate.Hence,thisddRADseqmethod ises-peciallyeffectiveinrevealingwithin-speciesfungal(<97%ITSsim-ilarity) diversity, but is also cost-effective for among-species (e.g.,97–70%ITSsimilarity)diversity.However,restrictionsitepolymor-phism will likely limit the efficacy of ddRADseq for phylogeneticreconstruction beyond the genus and family levels (DaCosta &Sorenson, 2016). Furthermore, the mutation risk hypothesis sug-gests that fewer homologous or shared lociwill be recovered forloci with greater risk of mutations. Hence, longer MiSeq ddRADreads,whichhavegreaterlikelihoodofmutationsthatcreatenovel
restrictionsiteswithinreads,mayhavebeenlesslikelytobesharedamongspeciesthanwithinspecies(Arnold,Corbett-Detig,Hartl,&Bomblies,2013;Gautieretal.,2013).Simplemutationsinrestrictionsitesmayalso increase the read lengthvariation for longerMiSeqreadscomparedtoshorterHiSeqreads,butwesawsomesequenceswith restriction sites incorporated in themiddle of the sequence,suggesting that length variation can also be caused by inefficientenzymedigestions. Lengthvariationcausedby inefficientenzymedigestionsormutationsatrestrictionsitesmaycauselowerratesofclustering,butshouldnotbiasphylogeneticanalysesbecausegapsaretreatedasmissingdata.Trimmingreadstoshorterlengthsortar-getingshorterreadswill likelyminimizelossofclustersorlocidueto lengthvariationamongandwithinsamplesofdifferentspecies.However,using longer readsproducesmore informativesites thatmay compensate for having a reduced number of shared loci andcanpotentiallyresolvedeeperdivergences(e.g.,>300millionyears)betterthanshorterreads(Rubinetal.,2012).Longerreadscanalsobemorehelpfulfordownstreampopulationgeneticanalyses,suchasgenomemapping,haplotypereconstruction,andlinkagetoadap-tive loci.Ourcasestudyshowsthatsmallportionsof thegenomesequencedwith theMiSeq platform can be sufficient to producecomparativehomologous lociwithinhighlydiversetaxawithsmallhaploid genomes (Figures6 and 7, TableS1H). Other ddRADseqdatasets that are not based on strict nucleotide polymorphisms,such as indel or presence–absence polymorphisms, may furtherhelp the resolution ofmore distantly related lineages (DaCosta&Sorenson,2016).
4.3 | Reproducibility analysis
Theobservederrorratesbetweenreplicatesamplesthatwererunwithin or between libraries were low, averaging about one errorevery1000basepairs(e.g.,0.0012for70/30Ncriteria,TableS1D)andlowerthanexpectederrorrates.Theobservederrorrateswerealsonotsignificantlydifferentwhensampleswereruninthesameordifferent libraries, but theerror rates tended tobehigherbe-tweenlibrariesthanwithin(TableS1D).Replicatesamplesindiffer-ent librariesrepresentdifferentperiodsof laboratorypreparation(i.e., PCR group, enzyme efficiency) whichmay present variationand increasing error rates than between replicate samples in the
F IGURE 9 Effectofsequencingdepth(totalnumberofmergedreadspersample)on(a)recoveryofsame(shared)loci(f2loci)and(b)nucleotidevariation(observederrorrate)betweenreplicatepairsamplesfilteredwiththe70/30Ncriteria.Replicatesamplesrunonsamesequencinglibraryarecirclesanddifferentlibrariesaretriangles.Markersaretransparent,showingsomeapparentoverlapindatavalueswheretheyappeardarker
0
5.0 × 103
1.0 × 104
1.5 × 104
2.0 × 104
0 5.0 × 103 1.0 × 104 1.5 × 104 2.0 × 104
No. of loci in replicate 1
No.
of l
oci i
n re
plic
ate
2
Shared # of loci20004000800012000
0
5.0 × 103
1.0 × 104
1.5 × 104
2.0 × 104
0 5.0 × 103 1.0 × 104 1.5 × 104 2.0 × 104
No. of loci in replicate 1
No.
of l
oci i
n re
plic
ate
2
% nucleotidevariation betweenreplicates
0.010.100.200.40
(a) (b)
12 | SALAS- LIZANA ANd OONO
samelibrary.Furthermore,themarginalincreaseincoveragedepthanddecrease inerror ratewith increasing sequencingdepthmayhavedependedontheLophodermiumspeciesiftheyhaddifferentgenomesizes.
Increasing sequencing depth per sample is likely the bestmethod to improve reproducibility. Sequencing depths correlatewith coverage depth per cluster (Figure S1), which affects theestimation of nucleotide identity when there are discrepanciesamongreadswithinacluster.Hence,increasingsequencingdepthincreases reproducibility among samples and between librar-ies bymore accurately distinguishing PCR or sequencing errorsfrompolymorphismswithgreatercoverage(Figure9).Sequencingdepth alsohas a strongpositive correlationwithnumberof locipersample(Figure6)andthereforeincreasestheprobabilityofse-quencingsharedlociamongsamples.Narrowingtherangeofreadlengthstobesequencedduringpreparationofthepooledlibrarywould also help improve coverage.However, aswe saw, deepersequencing or greater coverage per locus (i.e., lower error rateswithin locus)wasnotnecessarilyneededfor improvingphyloge-neticassessmentswithinthisgenus.Clusteringatalowerthresh-oldhadthegreatesteffectonimprovingphylogeneticresolutionbyidentifyingmoresharedlociacrossdistantlyrelatedspecies.Afine-scaleanalysisforpopulationstructureanddiversitymayre-quirebetterresolutionofnucleotidevariationwithinclusterswithgreatercoveragedepths.Fortheidentificationofheterozygosityin nonhaploid organisms or homologous loci of larger genomes(e.g., >100Mbps), deeper sequencing depths will be necessarywithadditionalmodifications, suchasadapters that include ran-dom degenerate sites for identifying PCR duplicates (Hoffbergetal., 2016), a narrower selection of read lengths, multiplexingfewersamplesperlibrary,orapplyingthisprotocolontheHiSeq.Thisstudysuggeststhatforhaploidfungithathaverelativelysmallgenome sizes (30-50Mbps; Tavares etal., 2014; Gregory etal.,2007),increasingsequencingdepthsbeyond200kpersamplewillsufficientlydecreasetheerrorratesper locusandwillberobustforfine-scalegeneticanalyses.
5 | CONCLUSIONS
Thelow-costddRADseqprotocolusingstandardindexes(Kessetal.,2016) produced sufficient numbers of loci to resolve the phyloge-neticrelationshipsofadiversegenusoffungalendophytesatlowercosts(seeAppendixS2forcostcomparisonanalysis)thantheoriginalddRADprotocols.Specialattention isneeded,however, to identifyappropriatefilteringandclusteringparameters.Althoughclusteringthresholdssignificantlyaffectedthephylogeneticresolution,quality-filteringhadlittleimpact.Reproducibilityandcoveragedepthswerelinked to sequencing depths, but high coverage depths (e.g., >10typicallyfoundinHiSeqdataanalyses)werenotessentialforstrongphylogeneticsupportinthistaxonomicgroupwitharelativelysmallhaploid genome. The use of longer reads may have reduced the
numberofsharedloci,butthemarginalincreaseinphylogeneticallyinformativesitesperreadmaycompensateforthisdisadvantage.
ACKNOWLEDG MENTS
The IlluminaMiSeq runswereconductedat the IGMGenomicsCenter, University of California, San Diego La Jolla, CA. Wewould especially like to thank Kristen Jepsen,Director of IGMGenomics Center, for consultations on designing primers andadapters. Some analyses were performed using the Pinaceaeserver at InstitutodeBiología,UNAM (CONACyTNo.221694),andCONABIO’scomputingcluster,whichwaspartiallyfundedbySecretaríadeMedioAmbienteyRecursosNaturales(SEMARNAT)through the grant “Contribución de la Biodiversidad para elCambioClimático”toComisiónNacionalparaelConocimientoyUsodelaBiodiversidad(CONABIO).Wealsothankthereviewersfortheircarefulassessmentofpreviousdrafts.Finally,wethanktheUCMEXUS-CONACyTPostdoctoralFellowshipsProgramforfundingtoRS-LandCollaborativeGrantNo.20150989toRO.
CONFLIC T OF INTERE S T
NoneDeclared.
AUTHORS’ CONTRIBUTIONS
RS-LandROconceivedtheideasanddesignedmethodology;RS-LandROcollectedthedata;RS-LandROanalyzedthedata;RS-LandROwrote themanuscript.All authors contributed critically to thedraftsandgavefinalapprovalforpublication.
DATA ACCE SSIBILIT Y
DataandRcodesusedforFigure6areavailableatDataOnehttps://doi.org/10.15146/r31m36. Phylogenetic trees and alignment filesareavailableatTreeBASES22555.
ORCID
Ryoko Oono http://orcid.org/0000-0003-4859-0030
Rodolfo Salas-Lizana http://orcid.org/0000-0002-0761-2910
R E FE R E N C E S
Andrews,K.R.,Good,J.M.,Miller,M.R.,Luikart,G.,&Hohenlohe,P.A.(2016).HarnessingthepowerofRADseqforecologicalandevolu-tionarygenomics.Nature Reviews Genetics,17(2),81–92.https://doi.org/10.1038/nrg.2015.28
Arnold,B.,Corbett-Detig,R.B.,Hartl,D.,&Bomblies,K.(2013).RADsequnderestimatesdiversityandintroducesgenealogicalbiasesduetononrandom haplotype sampling.Molecular Ecology, 22(11), 3179–3190.https://doi.org/10.1111/mec.12276
Baird,N.A.,Etter,P.D.,Atwood,T.S.,Currey,M.C.,Shiver,A.L.,Lewis,Z. A., … Johnson, E. A. (2008). Rapid SNP discovery and genetic
| 13SALAS- LIZANA ANd OONO
mapping using sequenced RAD markers. PLoS One, 3(10), e3376.https://doi.org/10.1371/journal.pone.0003376
DaCosta,J.M.,&Sorenson,M.D.(2014).Amplificationbiasesandcon-sistentrecoveryoflociinadouble-digestRAD-seqprotocol.PLoS One,9(9),e106713.https://doi.org/10.1371/journal.pone.0106713
DaCosta, J. M., & Sorenson, M. D. (2016). ddRAD-Seq phylogenet-ics based on nucleotide, indel, and presence-absence polymor-phisms: Analyses of two avian genera with contrasting histories.Molecular Phylogenetics and Evolution,94(Pt4),122–135.https://doi.org/10.1016/j.ympev.2015.07.026
Davey,J.W.,Hohenlohe,P.A.,Etter,P.D.,Boone,J.Q.,Catchen,J.M.,&Blaxter,M.L.(2011).Genome-widegeneticmarkerdiscoveryandge-notypingusingnext-generationsequencing.Nature Reviews Genetics,12(7),499–510.https://doi.org/10.1038/nrg3012
Davik,J.,Sargent,D.J.,Brurberg,M.B.,Lien,S.,Kent,M.,&Alsheikh,M. (2015). A ddRAD based linkage map of the cultivated straw-berry, Fragaria xananassa. PLoS One, 10(9), e0137746. https://doi.org/10.1371/journal.pone.0137746
Deckert,R.J.,Hsiang,T.,&LarryPeterson,R.(2002).Geneticrelation-ships of endophytic Lophodermium nitens isolates from needles ofPinus strobus. Mycological Research, 106(3), 305–313. https://doi.org/10.1017/S0953756201005494
Eaton,D.A. R. (2014). PyRAD:Assembly of de novoRADseq loci forphylogenetic analyses. Bioinformatics, 30(13), 1844–1849. https://doi.org/10.1093/bioinformatics/btu121
Edgar, R. C. (2004).MUSCLE:Multiple sequence alignment with highaccuracyandhighthroughput.Nucleic Acids Research,32(5),1792–1797.https://doi.org/10.1093/nar/gkh340
Franchini,P.,Parera,D.M.,Kautt,A.F.,&Meyer,A.(2017).quaddRAD:Anewhigh-multiplexingandPCRduplicateremovalDDRADproto-colproducesnovelevolutionaryinsightsinanonradiatingcichlidlin-eage. Molecular Ecology,26(10),2783–2795.https://doi.org/10.1111/mec.14077
Ganley,R.J.,Brunsfeld,S.J.,&Newcombe,G.(2004).Acommunityofunknown,endophyticfungiinwesternwhitepine.Proceedings of the National Academy of Sciences of the United States of America,101(27),10107–10112.https://doi.org/10.1073/pnas.0401513101
Gardes,M.,&Bruns,T.D.(1993).ITSprimerswithenhancedspecificityforbasidiomycetes-application to the identificationofmycorrhizaeandrusts.Molecular Ecology,2(2),113–118.https://doi.org/10.1111/j.1365-294X.1993.tb00005.x
Gautier,M., Gharbi, K., Cezard, T., Foucaud, J., Kerdelhué, C., Pudlo,P., … Estoup, A. (2013). The effect of RAD allele dropout on theestimation of genetic variation within and between populations.Molecular Ecology,22(11),3165–3178.https://doi.org/10.1111/mec. 12089
Gregory,T.R.,Nicol,J.A.,Tamm,H.,Kullman,B.,Kullman,K.,Leitch,I.J.,…Bennett,M.D.(2007).Eukaryoticgenomesizedatabases.Nucleic Acids Research,35,D332–D338.https://doi.org/10.1093/nar/gkl828
Hanso, M., & Drenkhan, R. (2011). Lophodermium needle cast, insectdefoliation and growth responsesof young scots pines in Estonia.Forest Pathology,42(2),124–135.
Heffelfinger, C., Fragoso, C. A., Moreno, M. A., Overton, J. D.,Mottinger, J. P., Zhao, H., … Dellaporta, S. L. (2014). Flexibleand scalable genotyping-by-sequencing strategies for popula-tion studies. BMC Genomics, 15 (November), 979. https://doi.org/10.1186/1471-2164-15-979
Hoffberg, S. L., Kieran, T. J., Catchen, J.M., Devault, A., Faircloth, B.C.,Mauricio,R.,&Glenn,T.C. (2016).RADcap:Sequencecaptureofdual-digestRADseqlibrarieswith identifiableduplicatesandre-ducedmissingdata.Molecular Ecology Resources,16(5),1264–1278.https://doi.org/10.1111/1755-0998.12566
Hopple, J. S., & Vilgalys, R. (1994). Phylogenetic relationships amongCoprinoid taxa and allies based on data from restriction site
mapping of nuclear rDNA. Mycologia, 86(1), 96. https://doi.org/10.2307/3760723
Huyse,T.,Poulin,R.,&Théron,A.(2005).Speciationinparasites:Apop-ulation genetics approach. Trends in Parasitology, 21(10), 469–475.https://doi.org/10.1016/j.pt.2005.08.009
Katoh, K., & Standley, D.M. (2013).MAFFTmultiple sequence align-ment software version 7: Improvements in performance and us-ability.Molecular Biology and Evolution,30(4), 772–780.https://doi.org/10.1093/molbev/mst010
Kess,T.,Gross,J.,Harper,F.,&Boulding,E.G.(2016).Low-costddRADmethodofSNPdiscoveryandgenotypingappliedtotheperiwinkleLittorina Saxatilis. The Journal of Molluscan Studies,82(1),104–109.
Kimura,M. (1980).A simplemethod for estimating evolutionary ratesofbasesubstitutionsthroughcomparativestudiesofnucleotidese-quences.Journal of Molecular Evolution,16(2),111–120.https://doi.org/10.1007/BF01731581
Kumar, S., Stecher,G.,&Tamura,K. (2016).MEGA7:Molecular evolu-tionarygenetics analysis version7.0 forbiggerdatasets.Molecular Biology and Evolution, 33(7), 1870–1874. https://doi.org/10.1093/molbev/msw054
Lantz,H., Johnston, P. R., Park,D.,&Minter,D.W. (2011).MolecularphylogenyrevealsacorecladeofRhytismatales.Mycologia,103(1),57–74.https://doi.org/10.3852/10-060
Loveless,M.(1984).Ecologicaldeterminantsofgeneticstructureinplantpopulations.Annual Review of Ecology and Systematics,15(1),65–95.https://doi.org/10.1146/annurev.es.15.110184.000433
Lynch,M.(2008).Estimationofnucleotidediversity,disequilibriumcoef-ficients,andmutationratesfromhigh-coveragegenome-sequencingprojects.Molecular Biology and Evolution,25(11),2409–2419.https://doi.org/10.1093/molbev/msn185
Manel,S.,Schwartz,M.K.,Luikart,G.,&Taberlet,P.(2003).Landscapegenetics: Combining landscape ecology and population genetics.Trends in Ecology & Evolution,18(4),189–197.https://doi.org/10.1016/S0169-5347(03)00008-9
Mastretta-Yanes,A.,Arrigo,N.,Alvarez,N.,Jorgensen,T.H.,Piñero,D.,&Emerson,B.C.(2015).Restrictionsite-associateddnasequencing,genotypingerrorestimationanddenovoassemblyoptimizationforpopulationgeneticinference.Molecular Ecology Resources,15(1),28–41.https://doi.org/10.1111/1755-0998.12291
Minter, D.W. (1981). Lophodermium on pines.Mycological Papers,147,1–54.
Neale,D.B.,&Savolainen,O. (2004).Associationgeneticsofcomplextraitsinconifers.Trends in Plant Science,9(7),325–330.https://doi.org/10.1016/j.tplants.2004.05.006
Nieto-Montes de Oca, A., Barley, A. J., Meza-Lázaro, R. N., García-Vázquez, U. O., Zamora-Abrego, J. G., Thomson, R. C., & Leaché,A.D. (2017).Phylogenomics and speciesdelimitation in theknob-scaled lizards of the genus xenosaurus (Squamata: Xenosauridae)usingddRADseqdatarevealasubstantialunderestimationofdiver-sity.Molecular Phylogenetics and Evolution,106,241–253.https://doi.org/10.1016/j.ympev.2016.09.001
Oono,R., François Lutzoni,A.,Arnold, E.,Kaye, L.,U’Ren, J.M.,May,G.,&Carbone, I. (2014).Geneticvariation inhorizontally transmit-tedfungalendophytesofpineneedlesrevealspopulationstructurein cryptic species. American Journal of Botany, 101(8), 1362–1374.https://doi.org/10.3732/ajb.1400141
Oono,R.,Lefèvre,E.,Simha,A.,&Lutzoni,F.(2015).Acomparisonofthecommunitydiversityof foliar fungalendophytesbetweenseedlingand adult loblolly pines (Pinus taeda).Fungal Biology,119(10), 917–928.https://doi.org/10.1016/j.funbio.2015.07.003
Ortiz-García,S.,Gernandt,D.S.,Stone,J.K.,Johnston,P.R.,Chapela,I.H.,Salas-Lizana,R.,&Alvarez-Buylla,E.R.(2003).PhylogeneticsofLophodermiumfromPine.Mycologia,95(5),846–859.https://doi.org/10.1080/15572536.2004.11833044
14 | SALAS- LIZANA ANd OONO
Osono,T.,&Hirose,D.(2011).Colonizationandlignindecompositionofpine needle litter by lophodermium Pinastri. Forest Pathology,41(2),156–162.https://doi.org/10.1111/j.1439-0329.2010.00648.x
Parchman,T.L.,Gompert,Z.,Mudge,J.,Schilkey,F.D.,Benkman,C.W.,&AlexBuerkle,C.(2012).Genome-wideassociationgeneticsofanadaptive trait in lodgepole pine.Molecular Ecology, 21(12), 2991–3005.https://doi.org/10.1111/j.1365-294X.2012.05513.x
Peterson,B.K.,Weber,J.N.,Kay,E.H.,Fisher,H.S.,&Hoekstra,H.E.(2012).DoubledigestRADseq:AninexpensivemethodfordenovoSNPdiscoveryandgenotypinginmodelandnon-modelspecies.PLoS One,7(5),e37135.https://doi.org/10.1371/journal.pone.0037135
Recknagel,H., Elmer,K.R.,&Meyer,A. (2013).A hybrid genetic link-agemap of two ecologically andmorphologically divergent midascichlidfishes(AmphilophusSpp.)obtainedbymassivelyparallelDNAsequencing (ddRADSeq). G3: Genes, Genomes, Genetics, 3, 65–74.https://doi.org/10.1534/g3.112.003897
Recknagel,H.,Jacobs,A.,Herzyk,P.,&Elmer,K.R.(2015).Double-digestRADsequencingusingionprotonsemiconductorplatform(ddRADseq-Ion) with nonmodel organisms.Molecular Ecology Resources, 15(6),1316–1329.https://doi.org/10.1111/1755-0998.12406
Reignoux, S.N.A.,Green, S.,&Ennos,R.A. (2014).Molecular identi-ficationandrelativeabundanceofcryptic lophodermium species innatural populationsof scotspine,Pinus Sylvestris L.Fungal Biology,118(9–10),835–845.https://doi.org/10.1016/j.funbio.2014.07.002
Rognes,T.,Flouri,T.,Nichols,B.,Quince,C.,&Mahé,F.(2016).VSEARCH:A versatile open source tool formetagenomics.PeerJ,4(October),e2584.https://doi.org/10.7717/peerj.2584
Rubin, B. E. R., Ree, R.H., &Moreau, C. S. (2012). Inferring phyloge-nies fromRADsequencedata.PLoS One,7(4),e33394.https://doi.org/10.1371/journal.pone.0033394
Salas-Lizana,R., Santini,N. S.,Adán,M.-P.,&Piñero,D. I. (2012). ThePleistoceneglacialcyclesshapedthehistoricaldemographyandphy-logeographyofapinefungalendophyte.Mycological Progress,11(2),569–581.https://doi.org/10.1007/s11557-011-0774-x
Schirmer,M., Ijaz,U.Z.,D’Amore,R.,Hall,N., Sloan,W.T.,&Quince,C.(2015).Insightintobiasesandsequencingerrorsforampliconse-quencingwith the illuminaMiSeqplatform.Nucleic Acids Research,43(6),e37.https://doi.org/10.1093/nar/gku1341
Slatkin,M. (1987).Gene flow and the geographic structure of naturalpopulations.Science,236(4803),787–792.https://doi.org/10.1126/science.3576198
Stamatakis,A.(2014).RAxMLversion8:Atoolforphylogeneticanalysisandpost-analysisof largephylogenies.Bioinformatics,30(9),1312–1313.https://doi.org/10.1093/bioinformatics/btu033
Tavaré,S.(1986).Someprobabilisticandstatisticalproblemsintheanalysisofdnasequences.Lectures on Mathematics in the Life Sciences.books.google.com. https://books.google.com/books?hl=en&lr=&id=8aI1ph-hOKhgC&oi=fnd&pg=PA57&dq=Some+probabilistic+statisti-cal+problems+analysis+DNA+sequences+Tavar%C3%A9&ots=rm-SE8LC8Ti&sig=YAsGLrZt9LcQQIa2_dyW8njcPGw.
Tavares,S.,Ramos,A.P.,Pires,A.S.,Azinheira,H.G.,Caldeirinha,P.,Link,T.,…Talhinhas,P.(2014).GenomesizeanalysesofPuccinialesrevealthelargestfungalgenomes.Frontiers in Plant Science,5 (August),422.
Vivian,A.,&Rn,J.(2017).AstreamlinedddRADtagprotocolforusewiththe iontorrentsequencer,asaversatileprobeforpopulations,ge-neticsandgenomicsv1(protocols.io.khuct6w).Protocols.io.https://doi.org/10.17504/protocols.io.khuct6w
Zhang, J., Kobert, K., Flouri, T.,& Stamatakis, A. (2014). PEAR:A fastandaccurateIlluminaPaired-EndreAdmergeR.Bioinformatics,30(5),614–620.https://doi.org/10.1093/bioinformatics/btt593
Zhou, X., Xia, Y., Ren, X., Chen, Y., Huang, L., Huang, S., … Jiang, H.(2014). Construction of a SNP-based genetic linkage map in cul-tivated peanut based on large scale marker development usingnext-generation double-digest restriction-site-associated DNA se-quencing (ddRADseq). BMC Genomics, 15(May), 351. https://doi.org/10.1186/1471-2164-15-351
SUPPORTING INFORMATION
Additional Supporting Information may be found online in thesupportinginformationtabforthisarticle.
How to cite this article:Salas-LizanaR,OonoR.Double-digestRADseqlociusingstandardIlluminaindexesimprovedeepandshallowphylogeneticresolutionofLophodermium,awidespreadfungalendophyteofpineneedles.Ecol Evol. 2018;00:1–14. https://doi.org/10.1002/ece3.4147