Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
TableofContents
TableofContents...................................................................................................1
S1:GenomicDNAIsolation.....................................................................................3S1.1GenomicDNApreparation....................................................................................................................3S1.2Mucusremovalfromplanariansurface..........................................................................................4S1.3HMWDNAisolationforPacBiosequencingandChicagolibraries.....................................4S1.4PostpurificationofDNAusingCTAB...............................................................................................5
S2:PacBioLongReadSequencing...........................................................................5S2.1Largeinsertlibraries...............................................................................................................................6
S3:AssemblyPipeline.............................................................................................73.1MARVELassembler....................................................................................................................................73.2MARVEL-Determiningreadquality..................................................................................................83.3MARVEL-Annotationofrepeatregions...........................................................................................83.4MARVEL-Readcorrection......................................................................................................................93.5MARVEL-Computationalrequirements...........................................................................................9
S4:Chicago/HiRiseScaffolding.............................................................................10S4.1Chicagolibrarypreparationandsequencing............................................................................10S4.2HiRisescaffoldingofPacBio-MARVELassembly.....................................................................10S4.3AnalysisofHiRisechangestoPacBio-MARVELassembly...................................................11S4.4ScaffoldgapclosureusingPBJelly..................................................................................................11
S5:AssemblyPolishing.........................................................................................11S5.1FinalErrorCorrectionwithQuiver................................................................................................11S5.2Redundancyfiltering............................................................................................................................12S5.3ConsensuspolishingusingPilon.....................................................................................................12
S6:RNAseqandTranscriptomes...........................................................................13S6.1RNAextractionandsequencing......................................................................................................13S6.2Smed_v6transcriptomeassembly:Rawdata,assemblyparameters,filtering,
annotations.........................................................................................................................................................13
WWW.NATURE.COM/NATURE | 1
SUPPLEMENTARY INFORMATIONdoi:10.1038/nature25473
S6.3Smes_v1transcriptomeassembly:Rawdata,assemblyparameters,filtering,
annotations.........................................................................................................................................................13
S7:AssemblyQualityControl...............................................................................14S7.1Transcriptomeback-mapping..........................................................................................................14S7.2Genomecompletenessanalysis.......................................................................................................14S7.3Transcript-basedgenomeassemblyqualitycontrol..............................................................15S7.4Contaminationcheck............................................................................................................................17
S8:UCSCGenomeBrowser...................................................................................18S8.1GEM-Mappabilitytrack.....................................................................................................................18S8.2RepeatMasker–Repbasetrack........................................................................................................18S8.3RepeatMasker–RepeatModelertrack.........................................................................................19S8.4AlternativeScaffoldstrack.................................................................................................................19S8.5Quivertrack..............................................................................................................................................19S8.6PacBiocoveragetrack..........................................................................................................................19S8.7Transcriptstrack....................................................................................................................................19S8.8LinkstoPlanMine..................................................................................................................................20
S9:RepetitiveElementPrediction........................................................................20S9.1Denovopredictionofrepetitiveelements..................................................................................20
S10:LTRAnalysis..................................................................................................21S10.1LTRconsensusconstruction..........................................................................................................21S10.2GypsyLTRelementannotation.....................................................................................................22S10.3TheBurrofamily.................................................................................................................................22S10.4LTRelementexpressionanalysis.................................................................................................23S10.5LTRelementdivergenceanalysis.................................................................................................23S10.6Solo/pairedLTRelementabundanceanalysis.......................................................................23S10.7Scaffoldendanalysis..........................................................................................................................24
S11:AAT-RepeatAnalysis.....................................................................................24S11.1AAT-microsatellitepredictionandreadalignmentbreakanalysis..............................24S11.2MicrosatelliterepeatlengthestimationbyCircularConsensusSequencing(CCS)24S11.3OverlaplengthvariationinCCSandP6/C4reads................................................................25S11.4GenomiccontextofAT-richregions............................................................................................26
S12:AssemblyHeterozygosity..............................................................................26S12.1Drosophilaassembly.........................................................................................................................26S12.2Dotplotanalysis..................................................................................................................................27S12.3Coverageanalysisaltvs.maincontigs.......................................................................................27
WWW.NATURE.COM/NATURE | 2
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
S13:GeneAnnotation..........................................................................................27S13.1Geneannotationandgenenumberestimate.............................................................................27S13.2Genestructure......................................................................................................................................28S13.3Orthologyanalysis..............................................................................................................................28
S14:DivergenceAnalysis......................................................................................28S14.1Divergenceanalysis............................................................................................................................28
S15:SyntenyAnalysis...........................................................................................29S15.1Syntenycomparisons........................................................................................................................29
S16:PlanarianSpecificGeneSequences...............................................................29S16.1Identificationofplanarian-specificgenes................................................................................29S16.2Domainpredictions............................................................................................................................31S16.3Expressionanalysis............................................................................................................................31
S17:GeneLossinPlanarians.................................................................................31S17.1Lostgenesidentificationandverification................................................................................31S17.2Essentialityassessmentoflostgenes.........................................................................................33S17.3Mad1/Mad2multiplesequencealignments............................................................................33
S18:PlanarianSACFunction.................................................................................33S18.1Quantificationofmitoticindexindissociatedcellpreparations....................................33
S19:MiscellaneousExperimentalMethods..........................................................36S19.1Animalhusbandry...............................................................................................................................36S19.2Transcriptcloning...............................................................................................................................36S19.3Wholemountinsituhybridization.............................................................................................37S19.4Karyotyping...........................................................................................................................................37
SupplementaryTables..........................................................................................37
References...........................................................................................................38
S1:GenomicDNAIsolation
S1.1GenomicDNApreparation
OurgenomicDNAisolationprotocolimprovesuponapreviouslydevelopedmethod
for DNA isolation procedure for planarians1, making it compatible with single
molecule real-time (SMRTsequencing)2onaPacBioRSIIand theChicagomethod3.
WWW.NATURE.COM/NATURE | 3
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
Importantmodificationsincludeplanarianexternalmucusremoval,thechoiceofsalt
andalcoholforDNAprecipitationenablingpigmentremovalandapost-purification
step that removes likely remainingmucopolysaccharides, which tend to co-isolate
withgDNA.
S1.2Mucusremovalfromplanariansurface
This step makes use the mucolytic agent N-acetyl-L-cysteine (NAC) to strip the
animals of surface mucus, which can be excreted in copious amounts by large
animals.SinceNACisacidicinaqueoussolutionandthehighestmucolyticactivityis
observed between pH 7 to 9 the solution was neutralized, while the pH was
monitoredusingthecommonpHindicatordyephenolred4.
Theanimalswerewashedin10mlfreshlypreparedNACstrippingsolution(0.5w/v
N-acetyl-L-cysteine, 20mMHEPES-NaOH,pH7.25, 5µl phenol-red solution (0.5%
w/v), pH to ~7 using 1MNaOH) for 10-15min at room temperaturewith strong
agitation (e.g. rotator). The worms were rinsed briefly in distilled water prior to
furtherprocessing.
S1.3HMWDNAisolationforPacBiosequencingandChicagolibraries
All procedures were carried out at room temperature (RT) unless otherwise
specified. For handlingHMWDNA, onlywide-bore tipswereused and the sample
wasmixedonlybycarefulinversion.ForDNAisolation,20mucusstrippedanimals(~
1cm;starvedfor1-3weeks)weretransferredtoa50mltubeandresidualliquidwas
removed. The animals were lysed in 15 ml of cold GTC buffer (4 M guanidinium
thiocyanate, 25 mM sodium citrate, 0.5 % (w/v) N-Lauroylsarcosine, 7 % v/v β-
mercaptoethanol)for30minonice.Thetubewascarefullyinvertedevery10minto
help dissociating the tissue. Then, 1volume of phenol/chloroform
(Phenol/chloroform/isoamylalcohol(25:24:1),10mMTris(pH8.0),1mMEDTA)was
addedtothelysateandmixedbycarefullyinvertingthetube10to15times.Phases
wereseparatedbycentrifugationfor20minat4,000xgat4°Candtheupperphase
carefullytransferredtoafreshtube.Thephenol/chloroformextractionwasrepeated
1-2 times, until the interphase disappeared. Residual phenol was removed by
extracting the aqueous phase once with 1 volume of chloroform and mixing by
inversion 10 to 15 times and another roundof centrifugation. The upper aqueous
WWW.NATURE.COM/NATURE | 4
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
phasewastransferredtoafreshtubeand1volumeofice-cold5MNaClwasadded
andmixedwellby invertingthetubeseveraltimes.Thetubewasplacedonicefor
15mintosaltoutadditionalcontaminants,whichwerepelletedbycentrifugationfor
10minatmaximumspeedat4°C.Thesupernatantcontainingthenucleicacidswas
carefullydecantedorpipettedofftoafreshtubewithoutdisturbingthepellet.The
nucleic acids were precipitated using 0.7-1 volumes of isopropanol and
centrifugationat2,000xgat30-45minatRT.Thepelletwascarefullywashedwith
70 % ethanol and spun at 2,000 x g at 5 min at RT and briefly air-dried and re-
suspendedin50µlTEbufferovernightat4°C.
S1.4PostpurificationofDNAusingCTAB
Allprocedureswerecarriedoutat roomtemperature,ascetyltrimethylammonium
bromide(CTAB)precipitatesattemperatureslowerthan15°C.Thisstepallowedthe
removalofcontaminants (e.g.polysaccharides) fromDNApreparationsbycomplex
formationthroughtuningtheNaClconcentration5.
Theisolatedandre-suspendednucleicacidsweretreatedwith1µlof4mg/mlRNase
Aandincubatedfor1h/37°CtoremoveRNA.TheNaClconcentrationwasadjusted
byadding0.3volumesof5MNaClandthevolumewasadjustedto500µlwith2%
CTAB(w/v)/1.4MNaClsolution.Aftercarefulandthoroughmixingbyinversion,1
vol of chloroform was added and carefully mixed by inversion until the sample
turnedmilky. Thephaseswere separatedby centrifugation for 15min at 12,000–
16,000xgatRT.Theupperclearphasewasrecoveredwithoutdisturbingthewhite
interphaseusingawideborepipettetipandtransferredtoanewtube.TheDNAwas
precipitatedusing0.7-1volumesofisopropanolandcentrifugationat2,000xgat30-
45minatRT.Thepelletwascarefullywashedwith70%ethanolandspunat2,000x
gat5minatRTandbrieflyair-driedandre-suspendedin50µlTEbufferovernight
at 4 °C. Quality of the DNA was checked by agarose gel electrophoresis and
quantifiedusingtheQubitTMfluorometer(dsDNABRassay).
S2:PacBioLongReadSequencingQuality control of the high molecular weight genomic DNA was performed by
spectrophotometricdeterminationof260/280and260/230ratiosmakinguseofthe
WWW.NATURE.COM/NATURE | 5
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
Nanodropdevice(ThermoScientific).The260/280ratioswereintherangeof1.66to
2.08and260/230ratiosbetween1.84and2.01.Theconcentrationwasdetermined
intriplicatesviaaQubitTMfluorometerwiththehighsensitivitystandard.Integrityof
the HMW gDNA was determined by pulse field gel electrophoresis run using the
PippinPulseTMdevice(SAGEScience).BeforestartinganyPacBiolibrarypreparation,
planarian gDNA samples were purified with 1X AMPure XP beads. Note that the
AMPuremethodcannotrecoverplanarianDNAoutofcrudelysates,possiblydueto
resorptivecompetitionoftheabundantcontaminants.
S2.1Largeinsertlibraries
Toprepare large insert libraries,AMPurepurifiedgenomicDNAwasfragmentedto
sizesof12-30kbpwiththeMegaruptordevice(Diagenode).Forshearing,either20
kbp shearing settings for medium insert SMRT bell libraries or 35 kbp shearing
settings for large insert SMRT bell libraries were used. PacBio SMRT bell libraries
werepreparedasdescribedinthecompany-supplied“ProcedureandChecklist–20
kbp Template Preparation Using BluePippinTM Size-Selection System”. SMRT bell
libraries were size selected with the BluePippinTM system (Sage Science) with a
minimumfragmentlengthcutoffof7.5and10kbpformediuminsertlibrariesandat
15and17.5kbpforlargeinsertlibraries.
PacBioRSIISMRTcellswere loadedwithSMRTbell librariesafterprimerannealing
and polymerase binding usingMagBead loading. A total of 197 SMRT cells loaded
with the medium insert libraries were sequenced on the PacBio RSII instrument
making use of the P4 polymerase and the C2 sequencing chemistry (P4/C2). In
addition,atotalof38SMRTcells loadedwithlargeinsert librariesweresequenced
ontheRSIIinstrumentwiththeP6polymeraseandC4sequencingchemistry(P6/C4).
MovietimesforallP4/C2SMRTcellswere240minutes.MovietimesforP6/C4SMRT
cellswere240min(20SMRTcells)and360minutes(23SMRTcells).
WWW.NATURE.COM/NATURE | 6
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
S3:AssemblyPipeline
3.1MARVELassembler
AtthetimeoftheinitialsequencingofSmedwiththeP4/C2chemistry,thestandard
workflowforassemblingnoisylong-readsfromPacBiowastocleanthereads,either
withIlluminasequencingdataorbyreducingthePacBionativeerrorrateof12-15%
tounder1%byhighcoveragesequencing(PacBioToCA).Afterthiscorrectionstep,
theoriginalCeleraassembler6wasusedtoassemblethelong-reads.However,initial
cleaning of the reads can have one of two disadvantages, especially in a genome
withahighrepeatcontentsuchasSmed.Eithertherepeatswithinthereadsarenot
cleanedandtheassemblyofsuchregionsbytheCeleraassemblerishindereddueto
aanerrorratehigherthanexpected(<1%)ortherepeatsare incorrectlycleaned
duetotheambiguouscharacterofoverlapsbetweenrepeats.Bothapproacheslead
to incorrectandfragmentedassemblieswithpossiblemisjoins.Therefore,MARVEL
was developed to deal with notoriously complex genomes that contain a high
fractionof repetitivegenomic informationandaresequencedwithnoisy long-read
technologies (Novojilov et al., in preparation). For comparisons with the Canu
assembler7 (see Table 1,main text),weusedCanu v1.3with standardparameters
andthesameuncorrectedPacBioreadsasfortheMARVELassembly.
TheMARVELassembler currently consistsof threemajorphasesnamely the setup
phase,patchphaseandtheassemblyphase. Inthesetupphase,dependingonthe
sequencing technology, thereadsareextracted fromtherawsequencingdataand
saved inan internaldatabase.Thepatchphasedetectsandcorrects readartefacts
includinguntrimmedadapters,polymerasestrand jumps,and ligationchimers that
aretheprimaryimpedimentstolongcontiguousassemblies,whicharethenusedfor
the final assembly phase. The assembly phase stitches short alignment artefacts
resultingfrombadsequencingsegmentswithinoverlappingreadpairs.Thedefault
parameterisgenerally50bpinlength,butforSmed300bpwereused.Thisstepis
followed by repeat annotation and the generation of the overlap graph, which is
subsequentlytouredinordertogeneratethefinalcontigs.
WWW.NATURE.COM/NATURE | 7
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
AgenomewithaskewedGC-content,suchastheSmedgenome,hasadifferentk-
merdistributionthanthosewithauniformdistributionoftheGCcontent.Thiscan
beusedtoweighcertaink-mersandtheirexpecteddistributionwithinthegenome.
Therefore,theATbiasoption(-b)wasusedinDALIGNER8(thenoisyreadalignerused
inMARVELtofindtheoverlapsbetweenreads)aswellasinDBdust,toavoidmasking
AT-rich regions as low complexity and thereby excluding them from the k-mer
seeding.
In addition, alignmentswere forced through variableATmicrosatellites (< 300bp)
despiteoftensignificant lengthdiscrepanciesandnon-uniformityof their instances
in the reads. This allowed for amore contiguous assembly. The criteria for repeat
module identification in overlaps was relaxed, allowing for better recognition of
repeat modules and therefore resulting in improved resolution of repetitive
elements.
Since the Smed dataset combined sequencing data generated by two different
chemistries(P4/C2andP6/C4),eachchemistrybatchwaspatchedindividuallyprior
tocombiningthedatasetsasinputfortheassemblyphase.Intheassemblyphase,
theoverlapof thecombineddatasetwascalculatedandassembled.Afterall tours
were calculated, the contigs were corrected and a consensus sequence was
generatedbyremappingthepatchedreads.
3.2MARVEL-Determiningreadquality
Inordertodeterminethequalityofaread,theconsensusofthealignedsupporting
readswereused.MARVELdoes not determine thequality of individual bases, but
rather thequalityof the subreadwithin the tracepoints (usually consistingof100
bp). The end result of the process is the determination of a consensus base
supportedbythealignedreadoverlapsbetweenthetracepoints.
3.3MARVEL-Annotationofrepeatregions
Therepeat regionswere initiallyannotatedby themaskingserverand laterduring
thescrubbingphasebyDBrepeat.Whilebothusethealignmentstodeterminerepeat
regions, themasking serverannotates the repeat regiondynamicallyandonce the
WWW.NATURE.COM/NATURE | 8
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
repeat coverage threshold is exceeded (in most cases twice the sequencing
coverage)theregionismaskedandnotusedforsubsequentk-merseeding.
DBrepeat runs sequentially over the reads to annotate the repeat regions. Once a
basewithacoveragethatexceedstherepeatcoverageisidentified,arepeatregion
isopened.Arepeatregionisclosedonceabaseatorbelowthenon-repeatcoverage
is found. Theminimumoverlap length of 1 kbp results in complete lack of repeat
annotationforthe1kbpregionsatthestartandendofthereads.DBhomogenizecan
beusedtotransitivelytransferanexistingrepeatannotationbasedonthecomputed
alignments,therebyproperlyannotatingthetipsofthereads.
3.4MARVEL-Readcorrection
Despite thenoisynatureof PacBio reads,MARVELdoesnot clean the reads to an
error-rateoflessthan1%asiscommonpracticeofotherassemblerswhendealing
with noisy long-reads. The reads are first patched after the initial overlap phase,
removing themajor sequencing related insertions/deletions (indels), chimeras and
additionalartefacts,whichisnecessaryforqualityoverlapsthatwillbeusedforthe
assembly.Afterasecondroundofoverlaps,MARVEL improvesthealignmentsthat
are corrupt due to short sequencing errors. Finally,Quiver can be used after the
assembly phase in order to generate a high-quality consensus sequence, but this
oftenresultsinlongruntimes,especiallywithlargegenomes.Thiscanbeimproved
bycircumventingtheuseofBLASR9bytransformingthepreviouslycalculatedread
overlapsintothepositionofthereadswithinthecontigs.
3.5MARVEL-Computationalrequirements
Foranythingbeyondbacterialsamples,whichcanbeassembledratherquicklyona
standard laptop, a cluster environment with sufficient CPU and storage is
recommended.We recommendhavingat least8GBofRAMperCPUcore for the
clusternodesandafastinterconnecttothecluster-attachedstorage.RAMpercore
demand can be lessened by further increasing the partitioning of the data into
blocks, at the price of increased IOdemand. Example CPU times canbe found in
SupplementaryTable1andhavebeenmeasuredonourSandy-BridgeandHaswell
architecturebasedcluster.Storagerequirementsarehighlydependentoncoverage
andtherepetitivenessofthegenome.Also,storagerequirementsforasinglephase
WWW.NATURE.COM/NATURE | 9
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
ofMARVELcanbefoundinSupplementaryTable1.Forpeakstoragerequirements,
thisnumberneedstobemultipliedbytwo.
S4:Chicago/HiRiseScaffolding
S4.1Chicagolibrarypreparationandsequencing
AChicago librarywasprepared as describedpreviously3. Briefly, ~500ngofHMW
gDNA(>150kbpmeanfragmentsize)wasreconstitutedintochromatin invitroand
fixed with formaldehyde. Fixed chromatin was then digested with DpnII, the 5’
overhangswerefilledinwithbiotinylatednucleotides,andthenfreebluntendswere
ligated.After ligation,crosslinkswerereversedandtheDNApurified fromprotein.
Purified DNA was treated to remove biotin that was not internal to ligated
fragments. The DNA was sheared to a mean fragment size of ~350 bp, and
sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-
compatible adapters. Biotin-containing fragments were then isolated using
streptavidin beads before PCR enrichment of the library. The libraries were
sequenced on an Illumina HiSeq 2500 in rapid run mode to produce 286 million
2x100bpreadpairs,amountingto93xphysicalcoverage(1-50kbpspanningpairs)of
theSmedgenome.
S4.2HiRisescaffoldingofPacBio-MARVELassembly
AdraftgenomeassemblyofSmed(PacBio-MARVEL,782.1MbpwithacontigN50of
708.7 kbp), Illumina shotgun sequence data (IlluminaHiSeq 2500 rapid runmode,
2x150bp readpairs, SRR5408395),andChicago library readpairs inFASTQ format
wereusedasinputdataforHiRise,asoftwarepipelinedesignedspecificallyforusing
Chicagodataforerror-correctingandscaffoldinggenomeassemblies3.Shotgunand
Chicagolibrarysequenceswerealignedtothedraftinputassemblyusingamodified
SNAPreadmapper(http://snap.cs.berkeley.edu/).TheseparationsofChicagoread
pairsmappedwithindraftscaffoldswereanalysedbyHiRisetoproducealikelihood
modelforgenomicdistancebetweenreadpairs,andthemodelwasusedtoidentify
putativemisjoinsandscoreprospective joins.After scaffolding, shotgunsequences
wereusedtoclosegapsbetweencontigs.
WWW.NATURE.COM/NATURE | 10
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
S4.3AnalysisofHiRisechangestoPacBio-MARVELassembly
MARVELrawcontigsweremappedontoHiRisescaffoldswithDALIGNERtofindthe
exact break positions within the raw contigs. Then, each position was visually
inspected using the layout of the toured overlap graph (similar to Figure 2e, left
graph,maintext),aidedbythedirectvisualizationofreadoverlapswiththeMARVEL
toolLAexplorer.
S4.4ScaffoldgapclosureusingPBJelly
FollowingHiRisescaffolding,weusedPBjelly10 to fillor reduceasmanyscaffolding
gapsaspossible.All632HiRisescaffoldswereusedasinputforPBjelly,aswellasall
patched P4/C2 and P6/C4 chemistry reads longer than 8 kbp. The 632 scaffolds
contained1,256gapsof100bp (HiRisedefaultgapsize).PBjellycompletelyclosed
482gaps,176gapswereclosedpartiallyandoverfilling176gaps,meaningthegap
wasactuallylargerthan100bp.Inadditiontothis,itwaspossibleforPBjellytojoin
32 scaffolds into bigger scaffolds, reducing the overall scaffold number to 600.
OverlapanalysisbetweenthescaffoldsusingDALIGNER(seebelow)showedthat119
PBjelly scaffolds were almost completely contained within other scaffolds and
consistedof>95%repeatbases.Thesewere thereforemoved into thealternative
contig set (_alt, seebelow), leaving481scaffoldsas the finalhaploiddd_Smed_g4
assembly.
S5:AssemblyPolishing
S5.1FinalErrorCorrectionwithQuiver
The final consensus sequence was generated first by a round of the PacBio tool
Quiver.DuetotheruntimeofBLASR8,whichisusedbyQuiver,onlyreadsthatwere
partoftheoverlapgraphofeachcontig(i.e.thetruenon-repeatinducedoverlaps)
wereusedtopolishthecontigsequences.Thisstephadmostly todowiththeAT-
biasandhighrepeatcontentoftheSmedgenome. Inordertoenrichthecoverage
and improve the finalpolishedsequence,additional subreadswereextracted from
the ZeroModeWaveguides and used for polishing (if available). This brought the
coveragetoapproximately40x.
WWW.NATURE.COM/NATURE | 11
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
S5.2Redundancyfiltering
DALIGNERwasusedto identifythehaplotypecontentoftheSmedPacBio-MARVEL
assembly,bymappingofthecontigsagainsteachother.Ifatleast70%ofthebases
inthetwocontigsaswellas70%ofthereadsusedtobuildthecontigswerepartof
both contigs they were considered alternate haplotypes and separated from the
mainassemblyintoasecondsetofalternativecontigs,designatedbya_altsuffixto
thecontigname.Forcontigsshorterthan100kbpthatterminatedinlongrepetitive
sequences,thesethresholdswerereducedto50%.Thisremoved1,509alternative
contigswithatotallengthof168.1Mbp(notethatthisstepwasperformedpriorto
scaffolding).
S5.3ConsensuspolishingusingPilon
Forfine-polishingofresidualPacBiobasepairinaccuracies,weusedPilon11.Thistool
uses theevidence fromshort-read sequencingdata inorder to correct singlebase
differences, small indels, block substitution events, and gaps. Illumina 2x150 bp
shotgun HiSeq 2500 paired-end sequencing data (SRR5408395; see S4.2 “HiRise
scaffolding of PacBio-MARVEL assembly”) was aligned with bowtie212 against the
scaffoldsofdd_Smed_g4andthecorrespondingalignmentswereprovidedasinput
toPilon.SincePilonwasdesignedmainlyforsmallerprokaryoticgenomes,weranit
over each of the assembled scaffolds separately. To assess the assembly
improvements, variants were called from the scaffolding PacBio data provided by
Dovetail Genomics with samtools/bcftools before and after Pilon correction.
Correction was assessed by counting InDels and SNPs per kbp, and independent
rateswere calculated for exons, genic regions andpolished (dd_Smed_g4) vs. raw
assembly(Pacbio-MARVEL).AfterPilonpolishingtherewasa2.24-foldreductionof
indelsinintergenic,a3.25-foldreductioninexonicanda1.94-foldreductioningenic
regions. The changes in SNP counts were minor, indicating that the detected
differencescorrespondmostlytotruevariants.
WWW.NATURE.COM/NATURE | 12
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
S6:RNAseqandTranscriptomes
S6.1RNAextractionandsequencing
For total RNA isolation, 30 sexual animals (~1 cm, 2weeks starved) of the inbred
genome sequencing Smed strain were homogenized in 20 ml Trizol reagent
(Invitrogen) on ice using aMiccra D-8 homogenizer (Miccra) and processed using
Trizol reagent protocol. Following phase separation by chloroform addition and
isopropanol precipitation, the RNA pellet was dissolved in 350 µl RLT buffer and
purified according to the RNeasyMini Kit, including on-columnDNase I digestion.
TheRNAwaselutedinnuclease-freewaterandstoredat-80°C.ThetotalRNAwas
quantified using a NanoDrop 1000 spectrophotometer and RNA integrity was
checkedbycapillaryelectrophoresisonaBioanalyzerusingtheRNA6000NanoKit.
The sample was poly-A selected and sequenced on an Illumina NextSeq 500
sequencer.
S6.2Smed_v6transcriptomeassembly:Rawdata,assemblyparameters,
filtering,annotations
The transcriptome assembly of the asexual strain has been published before13.
Briefly,weusedTrinity14 toassembleRNA-Seqdata fromamixed setofpublically
available data sources (Dresden, Berlin, Muenster). Transcripts were filtered for
contaminants, directionality-corrected and subsequently annotated and filtered by
ORF-content, domains, and RefSeq-protein BLAST hits to define a likely Protein
Coding Fraction (PCF). For many applications, we further sub-filter the longest
isoformofaTrinitygraphcomponent.Suchusesofthetranscriptomeareindicated
by a “PCFL” suffix, e.g. “Smed_v6.PCFL”. Details about the dataflow and the
companion website to access the data are documented in the PlanMine13
publication.
S6.3Smes_v1transcriptomeassembly:Rawdata,assemblyparameters,
filtering,annotations
The transcriptome assembly Smes_v1 of the sexual strain ofSmedwas assembled
from SRR955511 using the same pipeline as for Smed_v6, except that the strand
WWW.NATURE.COM/NATURE | 13
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
correction step and ordering of transcript numbers according to expression level
were omitted. ORF directionality is therefore randomized in the Smes_v1 version.
The PCFL-suffix, e.g., “Smes_v1.PCFL” designates the longest graph component
subsetoftheproteincodingfraction,asdefinedabove.
S7:AssemblyQualityControl
S7.1Transcriptomeback-mapping
De novo assembled transcriptomes of the asexual (Smed_v6) and sexual strains
(Smes_v1) available at the PlanMine13 resource were used to assess the
completenessof thedd_Smed_g4assembly. For thispurpose, the longest isoform
perTrinitygraphcomponentoftheprotein-codingfraction(PCFL)wasselectedand
mappedtothedd_Smed_g4gassemblyusingtheGMAPaligner15,withaminimum
query coverage cut-off of 0.6 and aminimum identity cut-off of 0.6. The resulting
mappingswerethenusedtoinvestigateuniquelymapping,multi-mapping,andnon-
mappingtranscripts,alongwithputativechimerictranscripts.
S7.2Genomecompletenessanalysis
Thetwoback-mappedtranscriptomeswereusedtoestimatedd_Smed_g4assembly
completeness.Thefactthat98.7%oftranscriptsofthesexual(Smes_v1)and97.3%
oftheasexual(Smed_v6)trancriptomemappedtothedd_Smed_g4assemblyunder
our mapping criteria demonstrates that the genome is largely complete. The
remaining non-mapping transcripts (538 or 667 in the Smes_v1 and Smed_v6
transcriptomes, respectively) could either represent true positives, i.e. genes
genuinely missing from the dd_Smed_g4 assembly. Alternatively, they could
represent transcriptome contaminants (e.g., from commensal organisms or other
sourcesofsequencingnoise)andthusfalsepositives.Todistinguishbetweenthese
twopossibilities,weBLASTsearched(blastp)allnon-mappingSmes_v1andSmed_v6
transcriptORFsagainstNCBIRefSeqandrecordedthespecies fromwhichthebest
blastp hit of each transcript originated. We then checked if any species was
specificallyover-representedfornon-mappingtranscriptsusingtheenricherfunction
fromtheClusterProfilerRpackage16. Inthisregardwefoundoverrepresentationof
WWW.NATURE.COM/NATURE | 14
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
the protist species Tetrahymena thermophila, Ichthyophthirius multifiliis, and
Paramecium tetraurelia in Smes_v1,whilstCaenorhabditis elegans andBos taurus
were over-represented in Smed_v6. After the annotation of these transcripts as
likelycontaminants,wefurtherclassifiedtheremaininggenesaslikelymissinggenes
if they uniquely mapped to the SmedSxl v4.017 assembly and were annotated in
PlanMine13ashavingorthologuesinatleast5otherplanarianspecies.Theremaining
transcripts that did not fit the contaminant or likely missing gene criteria were
markedas“unknown”andarelikelyenrichedforassemblynoise.
Altogether, our analysis identified 46 out of a total of 31,966 Smed transcripts as
genuinelymissingfromthedd_Smed_g4assembly,amountingtoacompletenessof
99.9985%.Further,allmissinggenescouldbe identifiedonsmall contigs thathad
beenremovedbyourinitialassemblyfiltering.Wethereforeconcludethatbasedon
themappingofprevioustranscriptomeassemblies,thegenomeisofhighquality.In
contrast, if we consider Smes_v1 transcripts that map to our assembly
(dd_Smed_g4),butdonotmaptothepublishedSangergenomeassembly(SmedSxl
v4.0)17thenweidentify2,193suchtranscriptsofwhich1,229alsowereannotatedin
PlanMine as having homologues in at least 5 other planarian species.Overall, this
analysis thereforedemonstratesconclusively that thedd_Smed_g4assembly is the
mostcompleteSmedgenomeassemblytodate.
S7.3Transcript-basedgenomeassemblyqualitycontrol
We also used transcriptome backmapping to further probe the quality of the
genomeassembly.Sinceanentiredenovoassembledtranscriptomecannotprovide
anerror-free“groundtruth”dataset,wefirstsub-selectedahighconfidencesetof
SmedtranscriptsfromourSmes_v1denovotranscriptomeassemblyonPlanMine13.
Specifically,wealignedprimaryORFs(i.e.thelongestnon-overlappingopenreading
frame of each transcript) against the 7 other in-house assembled planarian
proteomes currently available in PlanMine. The searchwas performedwith blastp
usingtheBLOSUM45similaritymatrixandane-valuecut-offof1E-3.Blastpresults
were filtered for subject and query coverage of at least 90 % in at least 5 other
transcriptomes.Altogether, thisanalysis identified1,509SmedORFs thathadclear
WWW.NATURE.COM/NATURE | 15
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
full-length homologues in multiple other planarian species and therefore
representedhighconfidencecDNAs.
To stringently probe the quality of the dd_Smed_g4 assembly using the high
confidence cDNA (HC-cDNA) subset, we further sub-filtered the previous GMAP
transcriptomemappingresults (seesectionS7.1) foraminimumquerycoverageof
0.9andaminimumidentitycut-offof0.9andcategorizedtheHC-cDNAasuniquely
mapping, non-mapping,multi-mapping. As shown in Extended Data Fig. 2a, 1,432
out of the 1,509 (94.90 %) transcripts mapped uniquely to the dd_Smed_g4
assembly, despite the high stringency and coverage filter criteria. Therefore, this
provides a strong indication that the underlying gene sequences were correctly
assembled. Of the 10 non-mapping transcripts (0.66 %), only 1 represented a
genuineassemblymistakelocatedatacontigend,whereby4ofthe5’-exonswere
invertedandappendedtothe3’-endofthegene(ExtendedDataFig.2b,c).Also,2
ofthenon-mappingtranscriptsrepresentedGMAPfalsepositives(bothwereBLAT-
mappablewith>0.9querycoverageand identity)and theother7were transcripts
already identified as “missing gene (4)”, “unknown (2)” or “putative contaminant
(1)”.Together, thehighconfidencegeneanalysis thereforeconfirmsboth thehigh
assemblyqualityandcompletenessofthedd_Smed_g4assembly.
Furthermore, 67 out of the 1,509 HC-cDNA (4.44 %) were classified as multi-
mapping. Manual inspection demonstrated that all of these genes indeed had
multiple(usually2)“perfect”maplocationsinthedd_Smed_g4assembly.Some,e.g.
dd_Smes_v1_25458 (Extended Data Fig. 2d) mapped on opposite strands within
non-repetitive and high-quality regions of the genome, thus representing likely
biological gene duplications. However, other multi-mapping locations like that of
dd_Smes_v1_28816,weresituatedinlow-confidenceregionsclosetoassemblygaps
(identified by bedtools closest18), indicated by lack of a respectiveQuiver track in
that region (Extended Data Fig. 2e). Indeed, we found that the 67multi-mapping
transcriptsmappedclosetocontigends incomparisonwithuniquelymappinghigh
confidencegenes(ExtendedDataFig.2f,g)andthesizeoftheduplicatedgenomic
region was below 5 kbp in all cases. The size of the duplicated regions was
determined by mapping transcripts containing intronic sequences to the
dd_Smed_g4 assembly usingGMAPwith parameters --chimera-margin=200, --min-
WWW.NATURE.COM/NATURE | 16
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
trimmed-coverage=0.9 and --min-identity=0.9. If the transcript with intronic
sequencesstillmappedtomorethanonelocusinthegenome,aflankingwindowof
2kbp(1kbpup-and1kbpdownstream)wasaddedtothetranscriptsequenceand
GMAPwasrunagain.Thisprocesswasrepeateduntilonlyone locationwas found
foragiventranscript.
Altogether, this indicates that themajority ofmulti-mapping transcripts represent
over-assembly artefacts in the dd_Smed_g4 assembly that are too small to be
detectedbytheChicago/HiRisescaffoldingmethod.Likely,thesecontigendmistakes
ariseasaconsequenceofduplicatedassemblyofhaplotypesandrepresentaknown
probleminassemblinghighlyheterozygousgenomes19.
However, their low overall proportion minimally impact the quality of the
dd_Smed_g4 assembly and the various quality control tracks provided in the
PlanMine genome browser (e.g., gap locations, mappability, quivered intervals)
providetoolsfortheinterpretationofdubiouscases.
Inter and intra-contig similarity in Extended Data Figure 2c-e was visualized using
miropeats20withthedescribedthresholds.
S7.4Contaminationcheck
Although all animals used for genome sequencing were exposed to a stringent
antibiotic treatment prior to DNA extraction, the sequencing of whole animal
extracts always comes at the risk of co-sequencing parasitic or commensal
organisms. To assay for the putative presence of non-Smed contigs in our
dd_Smed_g4 assembly,we screened all contigs for GC-content and coverage (this
method was recently used to identify contamination in the original Hypsibius
dujardini Tardigrade genome assembly)21. Coverage of dd_Smed_g4 contigs was
computed by first aligning reads from SRR954965 and SRR849810 to dd_Smed_g4
usingbowtie212, andestimating coverageper scaffoldwithgenomeCoverageBed18.
GCcontentwasdeterminedusinginfoseqfromtheEMBOSStoolssuite22.Wecould
not detect major contig clusters that deviated from the Smed GC-content or
haploid/diploid read coverage, thus indicating lack of bacterial or other
contaminationintheassembly(notshown).
WWW.NATURE.COM/NATURE | 17
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
S8:UCSCGenomeBrowserTofacilitateaccess to thedd_Smed_g4assembly,webuiltacustomversionof the
“UCSC Genome Browser”. Further, we integrated the browser into PlanMine, the
planarian transcriptome platform maintained by our group13. The genome
enhancement thus further improves the utility of PlanMine towards its mission
objectiveascommunityplatformfortheplanarianmodelsystem.
The UCSC Genome Browser23 source code was downloaded from GitHub
(https://github.com/ucscGenomeBrowser/kent) and several tracks were made
available,whicharebrieflydescribedinthefollowing.
S8.1GEM-Mappabilitytrack
Overall genome mappability was determined by the GEM mappability program24
(GEM-binaries-Linux-x86_64-core_i3-20130406-045632). Briefly, using the fast
mapping-based algorithm, it produces mappability scores for different lengths of
NGSreadsonthegenome.AhigherGEM-scoreatagivenpositionindicatesahigher
chanceofuniquelymappingareadof therespective lengthtothatpositionof the
genome.Regionswithlowerscorearelessuniqueandwilllikelyproduceambiguous
mappings for the given read length. Generally, duplicated and repetitive regions
generate lower mappability scores. One exception are repetitive regions in the
assemblythatwerenoterror-correctedbythePacBioQuivertool(seeS8.5).These
regions appearmore unique, due to the higher error rate and tend to have high
mappabilityscores,butcaneasilybespottedbytheabsenceofaQuivertrack.
Togeneratethemappabilitytrack,standardparametersalongwithareadlengthof
200 were applied. After that, the output files were converted to bigwig using
standardparameters.The“mappability_200”trackisavailableunder“Mappingand
SequencingTracks”atthecustomUCSCGenomeBrowser.
S8.2RepeatMasker–Repbasetrack
Inordertoprovideinformationaboutknownrepeats(i.e.RepBase)aRepeatMasker
runusingparameters:-species“schmidteamediterranea”and-s(forsensitive)was
performed on the dd_Smed_g4 assembly. Also, RepeatMasker25 was configured
using rmblastn (v2.2.28+) andTRF (v4.09)26. Theoutput information is available at
WWW.NATURE.COM/NATURE | 18
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
thecustomUCSCGenomeBrowserunder“VariationandRepeats”sectionandtrack
named“RepeatMaskerRepbase”.
S8.3RepeatMasker–RepeatModelertrack
Inordertoalsoprovidedenovorepeatprediction,RepeatModeler(v1.0.8)27wasrun
on the dd_Smed_g4 assembly. Standard parameters were used and the tool was
configuratedwithRepeatMasker patched versionof RECON (v1.08)28, RepeatScout
(v1.0.5)29, TRF (v4.09)26, rmblastn (v2.2.28+). The resulting track is available at
“VariationandRepeats”sectionunder“RepeatMasker-denovo”.
S8.4AlternativeScaffoldstrack
The alternative scaffolds track highlights genomic regions forwhich an alternative
haplotypeexists.Eachalternativehaplotype(indicatedwith“_alt”suffix)isavailable
inthebrowser(describedindetailinsectionS12.3).
S8.5Quivertrack
Quiver status, described indetails in section S5.1, is representedgenomewideon
this track. LackofQuiver coverage indicates thata regionwasnoterror corrected
andthattheunderlyingsequenceistobeinterpretedwithcaution.
S8.6PacBiocoveragetrack
PacBiorawdata(P4/C2andP6/C4)weremappedtodd_Smed_g4usingbwa-mem30
with-xpacbiooption.Usingsamtools31andBEDTools–genomeCoverage18,coverage
informationwasgeneratedanddisplayedonthistrack.
S8.7Transcriptstrack
PlanMine13 Smed assembled transcripts (Smes_v1 and Smes_v6) were mapped to
dd_Smed_g4assemblyusingGMAP(describedinsectionS7.1).
Coverage informationwasfetchedfrommappingrawdatausedfortranscriptomes
assemblies from PlanMine to dd_Smed_g4 using STAR32 with default parameters,
samtools31andBEDToolsGenomeCoverage18wereusedtobuildthetrack.
WWW.NATURE.COM/NATURE | 19
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
S8.8LinkstoPlanMine
The customUCSCGenome Browser version is embedded in PlanMine available at
http://planmine.mpi-cbg.de.Also,eachSmedtranscriptislinkedbetweenPlanMine
andtheUCSCGenomeBrowser.
S9:RepetitiveElementPrediction
S9.1Denovopredictionofrepetitiveelements
An initial prediction of repetitive element classes was performed using
RepeatModeler(v1.0.8)27andtheresultinglibrarywasclassifiedusingRepeatMasker
(v4.0.6)25 using libraries fromRepBase (20160829 snapshot)33. TheRepeatModeler
librarycontainedlargenumbersofmodelsthatcouldnotbeclassified(classification
‘Unknown’) and which, together, masked 25.29 % of the assembly. Uponmanual
inspection,highproportionsofthemodelswerechimericorcontainedelementson
varying strands. Removal of these models from the library increased masking
ascribed to other repeat classes, also suggesting conflict between the models
producedbyRepeatModeler.
A library was manually created to ameliorate these issues by compiling repeat
elementclassesfromdifferentsources,orfromRepBasewherenodifferenceswere
notedbetweenmaskingwithRepBaseentries in comparison togeneratedmodels.
Simple repeats and DNA elements were taken directly from RepBase using the
RepeatMasker“queryRepeatsDatabase.pl”script.MaskingofLINEelementswiththe
RepBase entries was poor in comparison to using models produced by
RepeatModeler(4.39%versus10.98%genomemasking,respectively)andthelatter
were included in the library. LTR retroelement models were produced through a
separatepipeline(seeS10.1“LTRelementconsensusconstruction”).
A hard-masked assembly was generated using this library and this assembly was
again used as input into RepeatModeler. Masking from models classified as
‘Unknown’ was reduced to 14.39 %. Again, manual inspection revealed these
contained high proportions of fragmentary and chimeric models (1,502 of 1,768
models)andthesewereexcludedfromdownstreamanalysis.
WWW.NATURE.COM/NATURE | 20
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
For all masking, RepeatMasker was configured with rmblastn (v2.2.27+) and TRF
(v4.09)26andwasruninsensitive(-s)mode.
S10:LTRAnalysis
S10.1LTRconsensusconstruction
LTR elements were identified through an approach combining outputs from 4
predictionprograms:LTR_FINDER(v1.0.6)34(runwithandwithoutinbuiltmaskingof
highly repetitive sequences), LTRharvest (GenomeTools v1.5.9)35, MGEScan-LTR36,
andRed37.Whereprogramsexposetheoptions,maximumelementsizeconstraints
were set to 50 kbp and minimum LTR sizes to 150 bp. As this was program-
dependent,however,allidentifiedfeatureswerefurtherfilteredonthepresenceof
>150 bp LTRs and to be between 2 and 50 kbp in length. Passing features
overlapping>=50%weremergedusingBEDOPS(v2.4.20)38 togivethe locationsof
finalpredictions.
Predicted elements were translated in all frames and scanned using hmmsearch
(HMMER v3.1b2)39 for the 'rve' (PF00665.21), 'RVP' (PF00077.15), and 'RVT_1'
(PF00078.22)PfamHMMs40,representingintegrase(IN),protease(PR),andreverse
transcriptase (RT), respectively. Matching regions were extracted for all elements
and separate alignments built for IN, PR, and RT using MUSCLE (v3.8.31)41. For
elementscontainingall threeHMMfeatures,alignmentswereconcatenatedanda
neighbour-joining tree built with MUSCLE. Clusters were generated using
tree2clusters (USEARCH v9.0.2132)42 using aminimum identity threshold of 80%.
Not all elements contained all three HMM features and were represented in the
tree,thusclustersweremanuallyrefinedbasedontheindividualtreestomaximise
inclusionandtoexcludeelementswithsizes>=2SDsfromthemean.
Nucleotide sequences for all elements of each cluster were extracted from the
assembly and aligned with MAFFT (v7.271)43. Neighbour-joining trees were built
using MUSCLE and any sub-clusters identified using USEARCH were separated.
Levitsky consensus sequenceswere calculated and annotated using an automated
pipelinewithinUGENE(v1.24.2)44.ConcatenatedIN,PR,andRTsequencesfromthe
consensuseswereaddedtothepreviousalignmentandthetreeinspectedtoensure
WWW.NATURE.COM/NATURE | 21
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
representation of all elements. Finally, thiswas confirmed by using the consensus
sequences as a custom library tomask the assembly using RepeatMasker and by
comparisontotheinitialpredictions.
To streamline further analysis, the LTRs were replaced with consensus sequences
formed for each of the (sub-) clusters individually. To match the methods and
nomenclatureusedinRepBase,theLTRandinternalsequenceswereseparatedfor
theRepeatMaskerlibrary.
S10.2GypsyLTRelementannotation
ConsensuselementswereannotatedwiththelocationsassociatedwithPfammodels
used in their discovery (‘rve’, 'RVP', and 'RVT_1') and additionally with ‘RNase_H’
(PF00075.19).tRNAmodelspredictedbytRNAscan-SE(v1.4)45werescannedagainst
the consensus elements using blastn (v2.2.30+)46 and the best matching region
annotated. HMMs compiled from the GyDB 2.0 collection47 were also tested,
alongside the entire PfamA and PfamB HMM sets. CCHC-type Zinc finger
annotations,knowntobeassociatedwithretroviralgagnucleocapsid(NC)proteins
couldbeidentified,howeverfurtherhitswereoflowconfidenceandwereomitted.
AdditionalputativedomainswerepredictedusingScanProsite48andCD-search49.
S10.3TheBurrofamily
We detected 14 classes of gypsy LTR elements in the Smed genome, all of which
share thecentralprotease (PR), reverse transcriptase (RT), ribonucleaseH (RNAse)
and integrase (INT) module of Ty3-gypsy-like retroelements. However, a
phylogenetic analysis of the conserved PR/RT/Int/RNase module could not assign
any of the 14 classes to known retroelement classes. Themost striking structural
featureof theSmed retroelements is themassive sequenceexpansionofgroups4
(Burro-1), 5 (Burro-2) and 13 (Burro-3), which results in LTR elements of 25 kbp
flankedby5kbpLTRs(comparedtothe5-7kbpoftypicalvertebrateLTRelements).
TheOgre-elementsfoundinplantssofarprovidetheonlyexampleofsimilarly-sized
LTRs,butprimingby tRNAPROinsteadof tRNAARG inOGREsand the lackofanopen
reading frame upstream of the likely nucleocapsid protein (zinc knuckle domains)
indicatethattheSmedelementsarelikelynotrelatedtoOGREelements50.Further,
thepresenceofputativenucleaseandBIR-domainsthatarenotpartof thetypical
WWW.NATURE.COM/NATURE | 22
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
domaincomplementofMetaviridaefurther indicatethatthegiantSmedgypsyLTR
elementsrepresentlikelynovelbranchesoftheMetaviridae,theunusualcomplexity
ofwhichharboursthepotentialforcomplexlifecyclesorinteractionswiththehost
transposondefencemachinery.
S10.4LTRelementexpressionanalysis
LTR element expression was estimated by TETranscripts51. Briefly, TETranscripts
estimatesboth, theexpressionof transposableelements (TE) andgene transcripts
takingasinputsTEandgeneannotationsandalignmentfiles(i.e.bamfiles).RNA-Seq
reads (SRR5408394)weremapped to the genomeusing STAR (v2.5.2a_modified)32
with--winAnchorMultimapNmax100,--outFilterMultimapNmax100.
AcustomGTFcontainingcoordinatesofSmed_v6transcriptsthatuniquelymapped
tothegenomebyGMAP15(--chimera-margin=200,--min-trimmed-coverage=0.6and
–min-identity=0.6)inadditiontothegenomewereusedasinputtobuildSTARindex.
Afterthereadalignment,atablewithrawcountswasgeneratedusingTETranscripts
(defaultparameters).
S10.5LTRelementdivergenceanalysis
Kimurasubstitutionlevelsofmaskedregionsagainsttheircorrespondingmodelare
automatically calculated by RepeatMasker and output by default. These were
compiled for each (sub-) cluster and plotted as a means of assessing their
divergence. LTR element agewas calculatedusingmutation rates estimated forC.
elegans52 (per generation: 2.70E-09 / per year 2.46375E-07). Assuming C. elegans
mutationrates,LTRelementinvasionwascalculatedtobe~10to60Mioyearsold
(SupplementaryTable1).
S10.6Solo/pairedLTRelementabundanceanalysis
Wehave estimated the proportion of solo LTRs and complete or nearly complete
LTR/Gypsy.SoloLTRsweredefinedasevents that lackany traceof internal region
andhavealengthofatleast60%ofthespecificconsensussequence.Completeor
nearlycompleterepeatelementsweredefinedaseventsthathaveLTRsflankingan
internal regionandanoverall length (LTRsplus internal region)ofat least60%of
theconsensussequence.
WWW.NATURE.COM/NATURE | 23
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
S10.7Scaffoldendanalysis
Toexaminetheeffectofdifferentrepeatclassesontheexistingassembly, their
occurrence-frequency at scaffold ends was compared to the expected chance
occurrence of the class. For each scaffold in the assembly, the terminal 1 kbp
endswereexaminedforoverlapwithrepeatannotationsusingBEDTools18and
scoredasrepetitiveifmorethan600bpwerecoveredbyrepeatannotations.For
calculating the sum of repetitive sequence length, only the actual bases
overlapping the repeat annotation were counted. Chance occurrence was
determinedbyexamininga setof1,000randomgenomic regions (1kbpeach)
forrepeatoverlapannotationasabove.Scaffoldterminiwereexcludedfromthe
randomgenomicregionsset.
S11:AAT-RepeatAnalysis
S11.1AAT-microsatellitepredictionandreadalignmentbreakanalysis
Theanalysisofmicrosatelliteswasbasedona tandemrepeat finderannotationof
the dd_Smed_g4 assembly. All variations of AAT* patternswere converted into a
Marvel interval trackand incorporated intoaMarveldd_Smed_g4contigdatabase.
Patched P6C4 reads (coverage 20x) were mappedwith DALIGNER on the
dd_Smed_g4 assembly. For each of the AAT* regions, the alignments of P6/C4
reads were analyzed and the fraction breaking within the AAT* region versus all
readsspanning500basesleftandrightoftheAAT*regionwasdetermined.Breaking
andspanningreadswerebinned(binsize:180bp)accordingto lengthoftheAAT*
repeatwithinthe interval[100,1000].Bin1:100-280bp,avg166bp;bin2:280-460
bp, avg351bp;bin3:460-640bp, avg536bp;bin4:640-820bp, avg719bp;bin5:
820-1000bp,avg900bp.
S11.2MicrosatelliterepeatlengthestimationbyCircularConsensus
Sequencing(CCS)
Inordertodeterminetheoriginofthelengthvariationscausingthereadalignment
breaksoverAT-regions,circularconsensussequencing(CCS)53readswithaminimal
WWW.NATURE.COM/NATURE | 24
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
genomecoverage(<1x)weregenerated.For2kbpCCSlibrary,2µgofAMPureXP
beadpurifiedhighmolecularweightgenomicDNAwasshearedto2kbpfragments
usingclearminiTUBEsontheCovarisS2instrumentaccordingtothemanufacturers
conditions.Themajorityoffragmentswereabout3.9kbpinsize.Ofthis750ngof
shearedgenomicDNAwentintoPacBioSMRTbelllibrarypreparationasdescribedin
“Procedure and Checklist – 2 kbp template preparation and sequencing”with the
exceptionthatthebluntsequencingadapterwasused inafinalconcentrationof5
µM.Primerdimersandshort fragmentswereremovedafter librarypreparationby
2x0.6XAMPureXPbeadwashingsteps.SMRTbelllibrarieswereloadedontoPacBio
RSII SMRT cells after primer annealing and polymerase binding, applying the
MagBead loading protocol. Sequencing was done on the RSII instrument with P6
polymerase and C4 sequencing chemistry and 360 min movies to generate
polymerase reads that lead to circular consensus sequences (CCS). The sequences
weresubmittedtotheSRAunderaccessionSRR5408393.
S11.3OverlaplengthvariationinCCSandP6/C4reads
CCS raw reads and patched P6/C4 reads were both mapped to the dd_Smed_g4
assemblywithDALIGNER. TheCCS consensus readswere generatedbypbccswith
the use of the default parameters andwere only used to detect uniquemapping
locationsforCCSreads,whileambiguouslymappedP6/C4readswereexcludedfrom
theanalysis.
Fromthis,AATrepeatswerepredictedasfollows:
1. AAT-regionswere selected,by choosing regions that canbe spanned (~100
bp anchor right and left of AAT region) by uniquely mapping (see S11.3)
circularconsensussequences(CCS).Thisresulted in1,141uniqueregions in
thedd_Smed_g4assembly,withlengthbinsrangingfrom300–1000bp.
2. To eliminate sequencing coverage differences between different regions of
thegenome,5randomCCSand5randomP6/C4readswereselectedforeach
ofthose1,141dd_Smed_g4regions.
3. BoxplotsforthefractionofCCSoverlaplengthreadsvsoverlaplengthinCCS
consensusreadandfractionofP6/C4overlaplengthvsdd_Smed_g4overlap
lengthwerecreated.
WWW.NATURE.COM/NATURE | 25
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
4. Theplotincludesonlylengthbinsintherangeof400-800,because
a. Thereareonlyasmallnumberofeventsinbinsof900-1,000
b. For bin 300 the difference between unique andAAT regions is very
smallasAATregionscanhaveupto200anchorbases.
ThevariancebetweentheCCSreadsanduniquesegmentsofthegenomecompared
tothevariancebetweenstandardP6/C4readsanduniquesegmentsofthegenome
wereunder theerror-rateofstandardP6/C4PacBioreads.Thesamewasthecase
forthevariancebetweentheCCSreadsandATT*regionscomparedtothevariance
of standard P6/C4 and AAT* regions. However, if the AAT* loci within the
dd_Smed_g4assemblydidvary,itwouldbeexpectedthatthisvariancewasoutside
theexpectederror-rateofPacBioreads.Singleregionswerefoundthatdidhavea
high variance. However, at this point we cannot exclude that these were due to
differences between haplotype differences being observed. Therefore, a biological
source for the variance within AAT* loci could not be found and it can only be
concludedthattheseareofatechnicalorigindueresultingfrompairwisealignment
ofrepeatregions.
S11.4GenomiccontextofAT-richregions
To characterise the genomic location of AT-rich microsatellite repeats, genomic
features were defined on basis of the mapping results of the Smes_v1
transcriptome13againstthedd_Smed_g4assembly.Then,locationsofthepreviously
defined AT-rich microsatellite repeats were extracted using IntersectBed18. An
overlapofat least60%of thequery (>100bpevents) toagivengenomic feature
wasrequiredtobescoredaseitherintergenic,intronicorexonic.
S12:AssemblyHeterozygosity
S12.1Drosophilaassembly
For the comparison between the dd_Smed_g4 assembly and a D. melanogaster
assembly,weused50xcoverageofthepubliclyavailableSRAdataset(SRX4999318)
withareadlengthcut-offof14kbp.TheresultingstandardMARVELassemblyhada
WWW.NATURE.COM/NATURE | 26
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
contigN50of20Mbpandthe longestcontigwas28Mbp.Theassemblydisplayed
almostnoalternativepaths(bubbles)intheoverlapgraph,indicatingahighlevelof
homozygositywithin thedataset.A representative 1.7Mbpportionof the longest
contigwasusedtoillustratethedifferencesbetweenD.melanogasterandSmed in
Fig.2e.
S12.2Dotplotanalysis
TheanalysisandcorrespondingdotplotinFig.2fwasgeneratedbyMummer
(v3.2.3)54.
S12.3Coverageanalysisaltvs.maincontigs
To quantify genome-wide coverage differences between alternative and main
contigs,weonlyconsideredindelsasregionsofunambiguousdifferencebetweenalt
andmain contigs. For each indel region, the fraction of breaking read alignments
versusthecoverageofspanningreadswasquantified.OnlypositionsofnonAT-rich
indelsoflength>99baseswereconsidered.Thecoverageofbreakingandspanning
alignments was limited to [5, 25] and indel lengths had to be consistent at each
position. TheanalysiswasbasedonMARVELassembled contigs (no correction,no
scaffolding), and on the trimmed P6/C4 reads. 1,509 alternative contigs (bubbles)
that mapped to 1,495 unique locations on 695 contigs in dd_Smed_g4 were
considered (14alternative regions (~1%) representednon-uniformstructuressuch
asstackedalternativecontigs).Alternativecontigshadanaverageof13breakevents
(includingAAT*patternsandinsertsofanysize)withamedianspacingof6,676bp.
S13:GeneAnnotation
S13.1Geneannotationandgenenumberestimate
Thetranscriptomeofthesexualstrain(Smes_v1,PCFL)wasmappedtothe
dd_Smed_g4assemblyusingGMAP15withaminimumquerycoverageof0.6and
minimumidentityof0.6.Ourtranscriptomecontains~31000non-redundantand
likelyprotein-codingtranscripts.Thisnumbercorrespondswelltoaprevious
estimateofgenenumbersintheplanariangenome54.However,theactualnumber
WWW.NATURE.COM/NATURE | 27
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
ofSmedgenesislikelysubstantiallylowerduetofragmentedtranscriptsor
transcribedrepeatelements.
S13.2Genestructure
Gene structure statistics were calculated for a set of high-confidence transcripts,
identifiedonbasisofi)uniquemappinglocationinthedd_Smed_g4assemblyandii)
havingahomologueinatleast4otherspecies.Intotal,12,413transcriptsfulfilledall
criteria.Themedian/meantranscriptsizewas1,545/1,875bpandmappedlength
1,534 / 1,866 bp, respectively. Median/mean gene size was 7,473 / 15,395 bp,
numberof4/5.76introns,andintronlengthof1,329/2,694,respectively.
S13.3Orthologyanalysis
ToestablishanorthologymodelofSmedgenes,weperformedapair-wiseBLASTof
(a) 20 selected Ensembl reference proteomes and (b) the proteome of Smed. To
definethelatter,weusedtheSmes_v1.PCFLtranscriptomeofthesexualstrainand
extracted justtheprimaryOpenReadingFrame(ORF) (i.e. longestnon-overlapping
ORF)ofeachtranscript.BLASTsearchwasperformedusingblastpusingaBLOSUM45
similarity matrix and e-value cutoff of 1E-3. BLAST results were pre-filtered for
subjectandquerycoverageofatleast10%andpost-filteredforreciprocalbesthits.
Resultingdatawere clustered intoorthology groups (i.e. graph components)using
the igraph package in R56. Orthology groups were assessed with respect to
presence/absenceofagivenspeciesand/orcombinationofspecies.
S14:DivergenceAnalysis
S14.1Divergenceanalysis
Orthologuegroupsof52highlyconservedproteinsacross22speciesweregenerated
usingtheorthoMCLpipeline(fordetailsseeS16.1)(SupplementaryTable2).Briefly,
orthogroupsthatcontainedallspeciesandwhereonlyoneortwospecieshadmore
than1orthologuewereselected.Ambiguousaminoacidresidues (letter“X”)were
first removed from each of the sequences. Each orthologue set was then aligned
using PRANK57 (v10603, parameters: 10 iterations, +F) and trimmed using trimAl58
(v1.4rev15,parameters:gappyout).Theorthologuesetswerethenconcatenatedto
WWW.NATURE.COM/NATURE | 28
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
formasupermatrixusingaperlscript.Maximumlikelihoodanalysiswasperformed
using RaxML (v8.2.959 with the PROTCATLG model. The topology of non-
Platyhelminthesspecieswasconstrainedusingaguidetree(basedonthephylogeny
byCannonetal.2016)60.
S15:SyntenyAnalysis
S15.1Syntenycomparisons
We used the dd_Smed_g4 assembly as the reference genome and computed
pairwisegenomealignmentstootherPlatyhelminthes(query)genomesusinglastz61
(v1.03.54;parametersK=2400L=3000Y=3400H=2000,HoxD55scoringmatrix)and
thechain/netpipeline62(parameterschainMinScore1000,chainLinearGaploose).In
addition, we used highly-sensitive local alignments with lastz parameters K=1500
L=2500 and W=5 to find co-linear alignments in the un-aligning regions that are
spannedbylocalalignments(gapsinthechains)63.Thesameprocedurewasapplied
to align theM. lignano genome64 to other Platyhelminthes including Smed. For
comparison,we also aligned the human genome against frog (xenTro7 assembly),
zebrafish(danRer10)andlamprey(petMar2).
S16:PlanarianSpecificGeneSequences
S16.1Identificationofplanarian-specificgenes
To search for putative novel or planarian-specific genes in Smed,wemade use of
OrthoMCL,which groups putative orthologues and paralogs using aMarkov Chain
algorithm65.Here,weappliedOrthoMCL(v2.0.9)totheanalysisofasetof27species
(Supplementary Table 3) representing the taxonomic groups Annelida (n=1),
Arthropoda(n=4),Chordata(n=3),Cnidaria(n=1),Ctenophora(n=1),Mollusca(n=2),
Nematoda (n=2) and Platyhelminthes (n=13,which included the transcriptomes of
n=6 planarian species). To generate the input similarity matrix for OrthoMCL, we
carriedoutanall-against-allreciprocalblastpanalysisamongstthepredictedprotein
sets of the 27 species with -matrix “BLOSUM45” and -evalue 1e-5 parameters.
OrthoMCL was run with a parameter of “-I 1.5”. Only OrthoMCL
WWW.NATURE.COM/NATURE | 29
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
orthologue/paralogue groups comprised exclusively of flatworm entries were
consideredforfurtheranalysis.
Next, theSmed sequences(derivedfromSmes_v1.PCFLORFs)wereextractedfrom
flatworm-specific orthogroups and further sub-filtered according to the following
criteria:
1. ORFlengthequalorgreaterthan160aminoacids.
2. Similarity (blastp -evalue lower or equal to 1e-15) to at least one other
distantplanarianspecies(Dlac,Pten,PnigorPtor).
3. NoRefseqBLASThit(-evaluelowerorequalto1e-3)innon-Platyhelminthes
species(RefSeqBLASTdatabasesnapshotfromMay24th2016).
4. Mappinglocationinthedd_Smed_g4assemblyhaslowoverlapwithrepeats
(lessthan60%ofORFlength).
5. Mapping location in the dd_Smed_g4 assembly has no other transcripts
mappedwithin50bpupstreamand150bpdownstreaminordertoexclude
fragmented transcripts. Manual inspection confirmed that the latter two
filtering criteria greatly reduced the presence of transcript fragments
representing poorly conserved protein regions, yet the presence of some
suchsequencesinthefinaldatasetispossible.
Altogether,1,165transcriptspassedthefilteringandwethereforerefertothisdata
set as flatworm-specific genes (Supplementary Table 4). Given the focus on
transcribed sequences and the additional application of ORF length criteria, these
largelyrepresentactivelytranscribed,protein-codinggenesthathavenodetectable
sequence similarity outside the phylum by standard BLAST criteria (step 3). Our
analysis underestimates the content of “new” genes in the Smed genome, since
recently repeat-derived sequences are excluded by filtering (step 4) and the
requirement for similarity in distant planarian species (step 2) removes recently
generated or rapidly evolving gene sequences. Given the significant sequence
divergencebetweenplanarianspeciesatthetranscriptomelevel(notshown),itisin
factvery likelythatmanysuchgenesexistandthatplanariansthereforeconstitute
aninterestingmodelsystemforstudyingtheevolutionofnewgenes.
WWW.NATURE.COM/NATURE | 30
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
S16.2Domainpredictions
Domainsofflatworm-specificgeneswerepredictedusingSMART66(“Normalmode”
withthecheckedoptions“Outlierhomologuesandhomologuesofknownstructure”,
“PFAMdomains”,“signalpeptides”and“internalrepeats”)andInterProScan67with
SUPERFAMILYandPfamapplications.
S16.3Expressionanalysis
To check the expression of flatworm-specific genes in Smed, transcript expression
data sets for embryogenesis (SRA accession No.: SRR3629913 - SRR3629944),
regeneration (SRR2051616 - SRR2051633) and stem cells (SRR2051616 -
SRR2051633)werefetchedfromtheNCBI–SRAdatabase.
All data were mapped to the dd_Smed_g4 assembly using STAR32 (standard
parameters) and a custom GTF file with genomic coordinates of mapped ORFs.
Differential expression analysis was carried out using the DESeq268 package for
datasets with replicates (stem cells and embryogenesis). Log fold change greater
than0.5orlowerthat-0.5wereusedascut-offfortheregenerationdataset.
S17:GeneLossinPlanarians
S17.1Lostgenesidentificationandverification
ThedetectionofhighconfidencegenelosseventsinSmedwasperformedusingan
in-housecustompipeline.Briefly,weperformedapair-wisereciprocalBLASTsearch
against (a)Ensembl referenceproteomes (seeSupplementaryTable3 fordatabase
versions) and (b) 7proteomespredicted fromassembledplanarian transcriptomes
(including both Smed_v6 and Smes_v1). The inclusion of other planarian
transcriptomesprovidedanimportantguardagainstfalsepositivelosseventsdueto
accidental absence from one transcriptome. Orthology group clusters lacking any
planarian species were then further filtered and validated using the dd_Smed_g4
genomeassembly,asdescribedindetailbelow:
Step1–PairwisereciprocalBLASTandcoveragefiltering
WWW.NATURE.COM/NATURE | 31
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
For the pairwise reciprocal BLAST search, just the primary (i.e. longest non-
overlapping open reading frame) of the longest transcript of each Trinity graph
componentwasusedas input.BLASTsearcheswereperformedusingblastpwitha
BLOSUM45substitutionmatrixandane-valuecutoffof1e-3.
Step2-Orthologygroupconstruction
BLASTresultswerepre-filteredforsubjectandquerycoverageofatleast10%and
post-filtered for reciprocal best hits. Resulting data were clustered into orthology
groups(i.e.graphcomponents)usingtheigraphpackageinR56.
Orthology groups were assessed with respect to species presence/absence, and
groupscontainingatleast9non-flatwormspecies,noSmedandnootherplanarian
specieswereconsideredtobeputativelylostinSmed.
Step3-Genomiclossverificationusingexonerate
ToindependentlyverifythelossofsuchgenesintheSmedgenome,allotherprotein
members of planarian-deficient orthology groups were mapped onto the
dd_Smed_g4 assembly using exonerate (--showvulgar no --showalignment no --
model protein2genome -s 40 --showtargetgff yes --percent 2). Genes were
consideredasgenuinely lost ifnopoint inthegenomewascoveredbyoverlapping
exoneratealignmentsofmorethan2differentnon-planarianspecies.Thisthreshold
wasnecessary inordertofilterforspuriousalignmentsandresidual2-hitpositions
invariablyhadneitherRNAseqcoveragenorcodingpotential,asevidencedbyalack
ofblastxhitsagainstNCBInrdatabaseofthoseintervals.
Overall,ourpipelinebasesthedefinitionofgenelossontheabsenceofdetectable
sequencesimilaritybothin6independentlyassembledplanariantranscriptomesand
intheSmedgenome.Further,the limitationtoorthogroupswith>9non-flatworm
members generally restricts theanalysis to genes that arehighly conservedat the
sequencelevel.Althoughfalsepositivesareconsequentlyhighlyunlikely,wecannot
ruleoutthepresenceofhomologuesthat,onlyinplanarians,havedivergedbeyond
thepointofdetectablehomologybyconventionalsequencecomparisons.
Step4-Validationoflostgenecandidates
WWW.NATURE.COM/NATURE | 32
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
Theproteintranslationsofputativelylostgeneswereadditionallymanuallysearched
against Annelid, Molluscan and Platyhelminth transcriptomes using tblastn
(parameters:-wordsize2-matrixBLOSUM45-soft_maskingTRUE-evalue100).Hits
wereinspectedmanuallytoverifyabsenceintherespectivespecies(Supplementary
Table 5). The fact that we did not recover the previously reported loss of the
centrosome proteins69 indicates that our focus on deeply conserved geneswith a
highdegreeofsequenceconservationlikelyunderestimatestheextentofgeneloss
inSmed.
S17.2Essentialityassessmentoflostgenes
Thehumanand/ormousehomologuesofgeneslostinSmedwereusedtoquerythe
Database of Essential Genes (DEG v13.3)70. Positive hits were annotated as
“essential”inSupplementaryTable5.
S17.3Mad1/Mad2multiplesequencealignments
Multiple sequence alignments for Mad1 and Mad2 were generated using the
Constraint-based Multiple Alignment Tool (COBALT)71 (standard parameters) and
resulting alignments were rendered using the BoxShade website with standard
parameters (http://www.ch.embnet.org/software/BOX_form.html). Furthermore,
the COBALT alignments were imported into Geneious72 (v9.1.5 Build 2016-06-11
21:21,http://www.geneious.com)andsequencesimilaritymatricesbetweenspecies
were computedbasedonBLOSUM62 similaritymatrix (threshold0). The similarity
matriceswererenderedusingmatrix2png73.
S18:PlanarianSACFunction
S18.1Quantificationofmitoticindexindissociatedcellpreparations
RNAiandnocodazoletreatment
RNA interference was performed as previously described74 with some minor
modifications. In vitro synthesized dsRNA was diluted 1:20 in H2O and quantified
against a column purified standard eGFP control dsRNA (1 µg/µl) on a 0.7 % TAE
agarose gel using ImageStudio Lite (v5.2.5, LI-COR Biosciences). The quantified
WWW.NATURE.COM/NATURE | 33
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
dsRNAwasthenaddedtoliverpastetoafinalconcentrationof1µg/µlandfrozenat
-80°Cinsmallaliquots.Theadditionofredfooddyeastracerwasomitted,because
itgeneratedstrongfluorescentbackgroundintherhodaminechannel(see“Imaging
anddataanalysis”).
Foreachexperimentalreplicate,20animals(~0.5mm,asexualCIW4strain)werefed
3x with dsRNA food every third day. On the eighth day, worms were split into 2
batchesof10animalsandeithertransferredtodisheswith50ng/mlnocodazole+1
% DMSO (v/v) in planarian water (+gentamycin) or 1 % DMSO (v/v) in planarian
water(+gentamycin)asavehiclecontrol.Thewormsweretreatedfor24hoursand
thenmacerated.
Animalmaceration,proteinaseKtreatmentandformaldehydefixation
Animals were passively macerated in 4 ml of maceration solution (glycerol:glacial
aceticacid:H2Oinaratioof1:1:13(w/v/v))75for10minonthebenchatRT,followed
by10minonaverticalrotatoratRTuntilthetissuewascompletelydissociated.The
lysatewasfilteredthroughCellTrics®50µmfilters(Sysmex,Cat.No.:04-0042-2317).
Dependingonthesizeoftheanimals,thecellsuspensionwasdiluted1:4(small,7-8
mm) to1:10(large,16-18mm)usingmacerationsolutionand120µlofthediluted
cellsuspensionwastransferredintoeachwellofablack96-wellglassbottomplate
(Greiner Bio-One, Cat. No.: 655090). For each condition, five or six separatewells
wereusedastechnicalreplicates.Thecellswerepelletedfor5minat2,000xgatRT.
Then 44µl of 16% formaldehyde (PFA) in 1X PBSwere added and the cellswere
fixedfor20-25minatRTandrinsedoncewith200µl1XPBS,followedbyawashin
1XPBSfor5min.Afterfixation,thecellsweretreatedwith2µg/mlProteinaseKin
1XPBSfor10minatRT,rinsedoncewith1XPBSandpost-fixedwith4%PFAin1X
PBS for 10min at RT. Then, the cells were rinsed once in 1X PBS andwashed in
PBSTx0.1(1XPBS;0.1%(v/v)TritonX-100)for5min.
Riboprobehybridization,washing
TheriboprobehybridizationbufferswerepreparedaccordingtoPearsonetal.76.
The cells were incubated 5 min in 1:1 (v/v) PBSTx0.1:WashHyb and subsequently
blocked inPreHyb for45-60minat56 °C.PreHybwas removedandreplacedwith
WWW.NATURE.COM/NATURE | 34
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
pre-warmedHybcontainingDIG-labeledriboprobeagainstsmedwi-1andhybridized
over 16 h at 56 °C. Hybwas removed and the cells were briefly rinsedwith pre-
warmedWashHyb. The cellswere then successivelywashed once 5minwith pre-
warmed1:1SSC/WashHyb,followedby2x15min2XSSC+0.1%(v/v)TritonX-100,
2x15min0.2XSSC+0.1%(v/v)TritonX-100.Then,thecellswerebroughttoRTand
rinsedandwashedoncewithPBSTx0.1 for10min.Endogenousperoxidaseactivity
wasinactivatedwith100mMsodiumazideinPBSTx0.1for15minatRTandrinsed
onceandwashedtwicewithPBSTx0.1for10mineach.
Tyramidesignalamplification
Thecellswereblocked1hwith5%(v/v)horseserum+0.5%(v/v)RocheWestern
BlockingReagent(RWBR;Roche,CatNo.:11921673001)inPBSTx0.1atRTandthen
incubatedwithanti-DIG-PODFab fragments (Roche,CatNo.:11207733910) in1%
(v/v)horseserum/0.1%(v/v)RWBRinPBSTx0.1for2hatRT.Afterwashing3x20
minwithPBSTx0.1thesignalwasdevelopedwithrhodamine-tyramideinTSAbuffer
(2MNaCl,0.1Mboricacid,pH8.5)containing0.006%H2O2for10min.
Antibodystaining
Antibody staining was essentially performed as previously described with some
modifications77. After removing the TSA buffer and washing 3x for 10 min with
PBSTx0.1,primaryantibody(1:1,000anti-HistoneH3antibody(rabbit,phosphoS10+
T11)[E173];Abcam,Cat.No.:ab32107)in1%goatserum+0.1%RWBRwasadded
andincubated16hat4°C.Afterwashing4x10minwithPBSTx0.1,thesolutionwas
replaced with secondary antibody in 1% (v/v) goat serum + 0.1 % (v/v) RWBR
(1:1,000 goat anti-rabbit IgG (H+L), Alexa Fluor 647; ThermoFisher Scientific, Cat.
No.: A-21245) and DAPI (1 µg/ml) and incubated 2-3 h at RT in the dark. The
secondaryantibodywaswashedofffor3x20minwithPBSTx0.1andtheplateswere
storedin0.02%(w/v)sodiumazidein1XPBSat4°Cinthedark.
Imaginganddataanalysis
WWW.NATURE.COM/NATURE | 35
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
TheplateswereimagedonanOperettaHigh-ContentImagingSystem(PerkinElmer)
usinga20xlensandthefollowingexcitation/emissionfilters:DAPI(Ex.360-400/Em.
410-480nm),Rhodamine(520-550/560-630nm)andfarred(620-640/650-760nm).
Foreachwell, five separate locationswere imaged. Imageanalysiswasperformed
usingtheintegratedHarmonyHighContentImagingandAnalysisSoftwareandFIJI78.
Statisticalanalysis
Comparisons of group means (± S.D.) for the effect of RNAi interference of SAC
componentsontheabilityoftheplanarianneoblaststoelicitaSACresponseupon
nocodazoletreatment,atwo-wayanalysisofvariance(ANOVA)wasperformedand
incasesofsignificance,followedbyDunnett’stest.
Two-way ANOVA revealed a significant effect of SAC component RNAi on the
activationoftheSACbynocodazoletreatment(F(4,24)=124.7,P<0.0001).Forthe
testedRNAitargets,rod,zwilch,zw10yieldedsignificantreductionsofanti-H3ser10P
-positive cells (M-phase marker) upon nocodazole treatment (P < 0.0001), while
bub3 did not differ from control RNAi (egfp) (P = 0.78). RNAi itself did not
significantlyalterthebaselinenumberofcellsinM-phaseintheDMSOcontrols.
All statistical analyseswereperformedusingGraphPadPrismversion7.0c forMac
OSX,GraphPadSoftware,LaJollaCaliforniaUSA,www.graphpad.com.
S19:MiscellaneousExperimentalMethods
S19.1Animalhusbandry
Sexual(S2)andasexual(CIW4)strainsofSmedwerekeptinplasticcontainersin1X
Montjuïc salt water with 25 mg/L gentamycin sulfate. The animals were fed
homogenizedorganiccalfliverpaste.
S19.2Transcriptcloning
Target cDNA fragments were PCR-amplified from Smed cDNA using KAPA HiFi
HotStart DNA Polymerase (Kapa Biosystems), TAE agarose gel purified and cloned
intopPR-T4P79.Thecloned insertwasused forpreparationof riboprobetemplates
for insituhybridizationanddsRNAproduction forRNA interference.ThePlanMine
WWW.NATURE.COM/NATURE | 36
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
transcript IDs and gene specific forward and reverse primers used for cloning are
providedintheSupplementaryTable7.
S19.3Wholemountinsituhybridization
Wholemount in situ hybridization (WISH)was essentially performed as previously
described76,80.
S19.4Karyotyping
Forkaryotyping,animals (~5mm)ofsexualSmedwere fedthreetimeseverythird
dayusingcalfliverpaste.Thethirddayafterthelastfeed,theplanarianwaterwas
supplementedwith50ng/mlnocodazol (SigmaCat#M1404)and1% (v/v)DMSO
(SigmaCat # 276855) for a final 24h incubationperiod. Subsequently, 15 animals
weremacerated in5ml6.6% (v/v) glacial acetic acid supplementedwith5µg/ml
Hoechst 34580 (ThermoFisher Scientific Cat # H21486). Single cell dispersion was
achievedbycarefullyrotatingthemacerationsolutionforthelast10minofthe20
mintotalincubationperiod.Subsequently,50µlofthecellsuspensionwaspipetted
asasingledropona0.17mmthick/No.1.5borosilicatecoverslip(Menzel).Toassist
chromosomespreading,100µlof0.05%(v/v)TritonX-100solutionwasaddeddrop-
wisetotheslide.Chromosomespreadswereagedat37°Cfor60minuntiltheliquid
completely evaporated. Coverslips were mounted in 80 % (v/v) glycerol prior to
imaging.ChromosomeswereimagedonanAndorRevolutionWDBorealisconfocal
spinning disc system: The Olympus IX83 stand was equipped with an Andor iXon
Ultra 888 EMCCD camera and an Olympus 150x U Apochromat NA 1.45 Oil
immersionobjective.Hoechst34580stainedchromosomeswereexcitedwitha405
nmlaserdiodeanda452/45bandpassfilterwasusedtodetecttheemissionlight.
Imagestackswere3DdeconvolvedusingImageJ/Fijiplugins78,81andfinallymaximum
projectedforbettervisualization.
SupplementaryTablesSupplementaryTable1-CalculatedKimuradistancesandageforSmedLTR
elements
SupplementaryTable2-MARVELcomputationalrequirements&assemblystats
WWW.NATURE.COM/NATURE | 37
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
SupplementaryTable3-Conservedgenesusedfordivergenceanalysis
SupplementaryTable4-Speciesanddatabaseversionsusedfororthologyanalysis
SupplementaryTable5-Planarianspecificgenes
SupplementaryTable6-Lostgenegenes&essentialityanalysis
Supplementary Table 7 - PlanMine identifiers for transcripts andprimers used for
cloningusedinthisstudy
References1. Garcia-Fernàndez,J.,Baguñà,J.&Saló,E.Genomicorganizationand
expressionoftheplanarianhomeoboxgenesDth-1andDth-2.Development
118,241–253(1993).
2. Eid,J.etal.Real-timeDNAsequencingfromsinglepolymerasemolecules.
Science323,133–8(2009).
3. Putnam,N.H.etal.Chromosome-scaleshotgunassemblyusinganinvitro
methodforlong-rangelinkage.GenomeRes26,342–50(2016).
4. Sheffner,A.L.Thereductioninvitroinviscosityofmucoproteinsolutionsbya
newmucolyticagent,N-acetyl-L-cysteine.AnnNYAcadSci106,298–310
(1963).
5. Murray,M.G.&Thompson,W.F.Rapidisolationofhighmolecularweight
plantDNA.NucleicAcidsRes8,4321–5(1980).
6. Adams,M.D.etal.ThegenomesequenceofDrosophilamelanogaster.
Science287,2185–95(2000).
7. Koren,S.etal.Canu:scalableandaccuratelong-readassemblyviaadaptivek-
merweightingandrepeatseparation.GenomeRes27,722–736(2017).
8. Myers,G.EfficientLocalAlignmentDiscoveryamongstNoisyLongReads.
AlgorithmsbioinformaticsWABI2014LectnotesComputSci8701,52–67
(2014).
9. Chaisson,M.J.&Tesler,G.Mappingsinglemoleculesequencingreadsusing
basiclocalalignmentwithsuccessiverefinement(BLASR):applicationand
theory.BMCBioinformatics13,238(2012).
WWW.NATURE.COM/NATURE | 38
SUPPLEMENTARY INFORMATIONRESEARCHdoi:10.1038/nature25473
10. English,A.C.etal.Mindthegap:upgradinggenomeswithPacificBiosciences
RSlong-readsequencingtechnology.PLoSOne7,e47768(2012).
11. Walker,B.J.etal.Pilon:anintegratedtoolforcomprehensivemicrobial
variantdetectionandgenomeassemblyimprovement.PLoSOne9,e112963
(2014).
12. Langmead,B.&Salzberg,S.L.Fastgapped-readalignmentwithBowtie2.Nat
Methods9,357–9(2012).
13. Brandl,H.etal.PlanMine--amineableresourceofplanarianbiologyand
biodiversity.NucleicAcidsRes44,D764-73(2016).
14. Grabherr,M.G.etal.Full-lengthtranscriptomeassemblyfromRNA-Seqdata
withoutareferencegenome.NatBiotechnol29,644–52(2011).
15. Wu,T.D.&Watanabe,C.K.GMAP:agenomicmappingandalignment
programformRNAandESTsequences.Bioinformatics21,1859–75(2005).
16. Yu,G.,Wang,L.-G.,Han,Y.&He,Q.-Y.clusterProfiler:anRpackagefor
comparingbiologicalthemesamonggeneclusters.OMICS16,284–7(2012).
17. Robb,S.M.C.,Gotting,K.,Ross,E.&SánchezAlvarado,A.SmedGD2.0:The
Schmidteamediterraneagenomedatabase.Genesis53,535–546(2015).
18. Quinlan,A.R.&Hall,I.M.BEDTools:aflexiblesuiteofutilitiesforcomparing
genomicfeatures.Bioinformatics26,841–2(2010).
19. Kelley,D.R.&Salzberg,S.L.Detectionandcorrectionoffalsesegmental
duplicationscausedbygenomemis-assembly.GenomeBiol11,R28(2010).
20. Parsons,J.D.Miropeats:graphicalDNAsequencecomparisons.ComputAppl
Biosci11,615–619(1995).
21. Koutsovoulos,G.etal.Noevidenceforextensivehorizontalgenetransferin
thegenomeofthetardigradeHypsibiusdujardini.ProcNatlAcadSci113,