11/19/15
2015 - BMMB 852D: Applied Bioinformatics
Week 13, Lecture 26
István Albert
Biochemistry and Molecular Biology and Bioinformatics Consulting Center
Penn State
Interval related tasks
Intervals are not one-dimensional points! Make sure to specify each task precisely.
• For each feature find the intervals from another dataset that are close to or overlapping with it
• For each interval on one strand find the closest on the other strand
Stated this way, these tasks may not be sufficiently well defined.
Important details
• What are the anchor points (the locations that represent the intervals)?
• Which direction does the comparison proceed: upstream or downstream?
• What gets reported? Often we need to create another, transformed interval dataset that conforms to what we actually need
[Figure labels: midpoint, starts, upstream]
Computing Interval Overlaps
• Unexpectedly complex task as it needs to account for various types of positioning:
– full containment of either interval
– partial overlaps
[Figure: target interval from X to Y]
Neat and useful formulas (X, Y is the target interval; start, end refer to the query):
• midpoint = (start + end) // 2 (with integer division)
• overlap condition: (start < Y) and (end > X)
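The two formulas above can be tried out directly. The following is a minimal sketch, assuming half-open coordinates (start included, end excluded, as in BED); the function names are made up for illustration.

```python
def midpoint(start, end):
    """Integer midpoint of the query interval."""
    return (start + end) // 2


def overlaps(start, end, x, y):
    """True if the query (start, end) overlaps the target (x, y).

    Assumes half-open coordinates, so intervals that only touch
    (end == x or start == y) do not count as overlapping.
    """
    return start < y and end > x


print(midpoint(100, 201))            # 150
print(overlaps(120, 150, 100, 200))  # True: query fully contained in target
print(overlaps(180, 250, 100, 200))  # True: partial overlap
print(overlaps(200, 250, 100, 200))  # False: the intervals only touch
```

The single condition covers full containment in either direction as well as partial overlaps, which is why no case-by-case logic is needed.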
Overlap/intersect
• Two features are said to overlap or intersect if they share at least one base in common.
[Figure: Feature A, Feature B and Feature C drawn along the genome]
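"Sharing at least one base" can be made concrete by counting the shared bases. This is a small sketch, assuming half-open coordinates; the three feature coordinates are invented for illustration and are not taken from the figure.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of bases shared by two half-open intervals (0 if none)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


# Hypothetical coordinates, chosen only to illustrate the definition.
feature_a = (100, 200)
feature_b = (150, 250)
feature_c = (300, 400)

print(overlap_length(*feature_a, *feature_b))  # 50, so A and B overlap
print(overlap_length(*feature_a, *feature_c))  # 0, so A and C do not
```

Two features overlap exactly when this count is at least 1.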
Interval representation
• Binning → redundantly storing data at different zoom levels. Originally implemented in the UCSC genome browser (also used in BAM and BedTools)
• A different option → interval tree, usually available as a library in most programming languages
• Programming tip: for intervals that are not radically different in size, a sort by start coordinate followed by a binary search will be efficient
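The programming tip can be sketched as follows. This is one possible reading of it, not a reference implementation: the class name is hypothetical, coordinates are assumed half-open, and the length of the longest interval is used to bound how far to the left a candidate can start.

```python
from bisect import bisect_left, bisect_right


class SortedIntervals:
    """Intervals sorted by start, queried with binary search."""

    def __init__(self, intervals):
        self.intervals = sorted(intervals)
        self.starts = [s for s, _ in self.intervals]
        # Longest interval: no overlapping candidate can start further
        # than this to the left of the target start.
        self.max_len = max((e - s for s, e in self.intervals), default=0)

    def overlapping(self, x, y):
        """Return the intervals that overlap the target (x, y)."""
        lo = bisect_right(self.starts, x - self.max_len)  # start > x - max_len
        hi = bisect_left(self.starts, y)                  # start < y
        return [(s, e) for s, e in self.intervals[lo:hi] if e > x]


index = SortedIntervals([(10, 20), (15, 30), (40, 50), (5, 8)])
print(index.overlapping(18, 42))  # [(10, 20), (15, 30), (40, 50)]
print(index.overlapping(30, 40))  # []
```

The scan between `lo` and `hi` stays short only when the intervals are of similar size. A single very long interval inflates `max_len` and widens every search window, which is the situation where binning or an interval tree pays off.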
BedTools
• High performance software package that operates on multiple interval oriented data formats: BED, GFF, SAM, BAM and VCF
• Download and install bedtools: http://bedtools.readthedocs.org/en/latest/
Quinlan AR and Hall IM, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, (2010)
BedTools concepts
• There are many (25 and growing) tools/actions with different names
• Most tools write to the standard output
• The - (minus) character specifies the standard input
• Can be chained with pipes like all UNIX commands
• Most tools write their help when invoked, others need the -h flag
• Flag options can substantially change the output format
Excellent documentation
Basic concepts
• For any operation that requires two files the tools will require a file A and a file B
• Each element in file A is matched against each element in file B
• File B is loaded into memory, so try to make that the smaller file
(for example, the A file contains the reads and the B file contains the features)
Bedtools concepts
• The old style mode contains a different tool for each task (the manual covers these tools):
– intersectBed
– windowBed
– closestBed
• A new style mode that contains only one tool that takes commands like samtools:
– bedtools intersect
– bedtools window
– bedtools closest
A few BedTools operators
– slop (extend)
– flank
– merge
– subtract
– complement
[Figure legend: blue → before, red → after]
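To show what two of these operators compute, here is a minimal Python sketch of merge and complement on a single chromosome. It only illustrates the idea and makes no claim about how bedtools implements it; coordinates are assumed half-open, touching intervals are merged (a choice made for this sketch), and `chrom_length` is a hypothetical parameter.

```python
def merge(intervals):
    """Combine overlapping or touching intervals into single intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def complement(intervals, chrom_length):
    """Return the regions of the chromosome not covered by any interval."""
    gaps = []
    cursor = 0
    for start, end in merge(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps


features = [(1, 5), (3, 8), (10, 12)]
print(merge(features))           # [(1, 8), (10, 12)]
print(complement(features, 15))  # [(0, 1), (8, 10), (12, 15)]
```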
Essential feature: Strand Awareness
• Some tools take a -l (left), -r (right) parameter that will have a different effect if the "stranded" mode is turned on
1. default mode: left, right are interpreted on the forward strand's coordinate system
2. stranded mode: left, right are interpreted in the transcriptional direction 5' to 3'
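The difference between the two modes can be sketched in a few lines of Python. The function below is hypothetical and only mimics the idea of a "left" flank; it assumes half-open coordinates and does not clip at the chromosome end.

```python
def flank_left(start, end, strand, size, stranded=False):
    """Return the region of `size` bases on the 'left' side of a feature.

    Default mode: left means lower coordinates on the forward strand.
    Stranded mode: left means upstream (5') of the feature, which for a
    feature on the minus strand lies at higher coordinates.
    """
    if stranded and strand == "-":
        return (end, end + size)
    return (max(0, start - size), start)


print(flank_left(1000, 2000, "+", 250))                 # (750, 1000)
print(flank_left(1000, 2000, "-", 250))                 # (750, 1000)
print(flank_left(1000, 2000, "-", 250, stranded=True))  # (2000, 2250)
```

For a feature on the forward strand both modes give the same answer; the result changes only for features on the reverse strand.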
Interval intersection (find overlaps)
• The most important functionality of the toolset
• Other functionality of bedtools could probably be implemented by your programs
• Efficiently intersecting intervals is an algorithmically more complex problem
bedtools intersect
• Different flags can produce richer outputs
• There are variants such as closest/window that are similar in functionality to intersect
• Sometimes the solution to getting what you want is to create intervals of length 1 around the feature of interest
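The last point, reducing a feature to an interval of length 1, might look like the sketch below. The function name and the `anchor` values are assumptions made for illustration; coordinates are half-open, and "start" is interpreted as the 5' end of the feature.

```python
def anchor_interval(start, end, strand, anchor="start"):
    """Collapse a feature to a 1 bp interval at the chosen anchor point."""
    if anchor == "midpoint":
        pos = (start + end) // 2
    elif anchor == "start":
        # 5' end: first base on the + strand, last base on the - strand.
        pos = start if strand == "+" else end - 1
    else:
        raise ValueError(f"unknown anchor: {anchor}")
    return (pos, pos + 1)


print(anchor_interval(1000, 2000, "+"))              # (1000, 1001)
print(anchor_interval(1000, 2000, "-"))              # (1999, 2000)
print(anchor_interval(1000, 2000, "+", "midpoint"))  # (1500, 1501)
```

Once every feature is a single base, a distance or closest-feature comparison no longer depends on how long the original features were.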
Next: Bedtools Tutorial by Aaron Quinlan
Material taught at Cold Spring Harbor summer workshops: http://quinlanlab.org/tutorials/cshl2014/bedtools.html
[Figure slides: Regions not covered by intervals; Merging overlapping intervals; Genome wide coverage]
Homework 26
Create an ebola feature file that has only the features annotated as genes. Then using this file:
1. Create a new interval file that contains only the genomic regions that are NOT covered by genes (complement).
2. Create an interval file that contains only the 250 bp long regions that are upstream of each gene (flank). Call these promoter regions.
3. Create a fasta file that contains the sequences for the promoter regions that you extracted in step 2 (getfasta).
In your homework show the commands and a screenshot of IGV that shows the intervals.