2015 - BMMB 852D: Applied Bioinformatics
Week 13, Lecture 25
István Albert
Biochemistry and Molecular Biology and Bioinformatics Consulting Center
Penn State
Genome representation concepts
• At the simplest level of abstraction the genome is represented by a one-dimensional “space” (lines)
• The genome is two-stranded → a line corresponds to each strand
• Each strand has a polarity → each line has a direction
• Strands (lines) are paired
• The smallest unit is one base → one integer on the number line
• Annotations (features) are segments (coordinates) on each line
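Genomic coordinates: brief overview
DNA is two-stranded and directional, but there is only one coordinate system.
Standard formats use start < end even for the reverse strand.
The upstream region lies before the 5' end relative to the direction of transcription.
[Figure: the two strands, each labeled 5' and 3', with coordinates 200 and 300 marked; labels: "upstream for the forward strand" and "upstream for the reverse strand"]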
Coordinate systems
• 0 based → 0, 1, 2, … 9
• 1 based → 1, 2, 3, … 10
Typically
• 0 based are non-inclusive: 10:20 → [10, 20)
• 1 based include both ends: 10:20 → [10, 20]
Comparing coordinate systems
Vote for what you think is better: 1 based indexing or 0 based indexing.
Cases to compare in each system:
• Third element
• First ten
• Second ten
• Third ten
• One base long interval starting at the 10th element
• Length of an interval
• Five elements starting at index 1000
• Empty interval
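The cases listed above can be written out in both conventions. The sketch below is only an illustration, not the original slide's table: it assumes 0-based intervals are half-open [start, end) and 1-based intervals are closed [start, end], and it reads "index 1000" in each system's own numbering. The function names are hypothetical.

```python
def length_half_open(start, end):
    """Length of a 0-based, half-open interval [start, end)."""
    return end - start


def length_closed(start, end):
    """Length of a 1-based, closed interval [start, end]."""
    return end - start + 1


# (case, 0-based half-open interval, 1-based closed interval)
cases = [
    ("third element", (2, 3), (3, 3)),
    ("first ten", (0, 10), (1, 10)),
    ("second ten", (10, 20), (11, 20)),
    ("third ten", (20, 30), (21, 30)),
    ("one base starting at the 10th element", (9, 10), (10, 10)),
    ("five elements starting at index 1000", (1000, 1005), (1000, 1004)),
]

for name, half_open, closed in cases:
    # Both notations must describe the same number of bases.
    assert length_half_open(*half_open) == length_closed(*closed)
    print(f"{name}: 0-based {half_open}, 1-based {closed}")

# An empty interval is simply start == end in the half-open form,
# while the closed form needs the awkward end == start - 1.
empty_half_open = (10, 10)
empty_closed = (10, 9)
```

In the half-open form the length is a plain subtraction and consecutive blocks share a boundary (10 ends the first ten and starts the second); the closed form needs a +1 and separate boundary values.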
Fundamental interval formats
• SAM/BAM → Sequence Alignment Map
• VCF/BCF → for variant calls
• BED/GFF → gene annotation representation
• BEDgraph, Wiggle → values over intervals
What is a genomic feature?
• Feature: a genomic region (interval) associated with a certain annotation (description).
Typical attributes to describe a feature
1. chromosome
2. start
3. end
4. strand
5. name
Why do we have so many variants? There is no good rational reason… history, I guess.
Values on intervals
• A single value characterizes an entire interval → score (value) for the interval
• Continuous values → a different value for each base of the interval → analogous to a series of 1 bp long intervals
Different data representation formats
http://genome.ucsc.edu/FAQ/FAQformat.html
Two commonly used formats
• BED: UCSC genome browser → 0 based, non-inclusive → also used to display tracks in the genome browser (US “standard”) (variants: bigBed, bedgraph)
• GFF: Sanger Institute in Great Britain → 1 based, inclusive indexing system (“European standard”) (variants: GTF, GFF2.0)
BED format
Search for BED format
Tab separated, 3 required and 9 optional columns. Lower numbered fields must be filled.
1. chrom (name of the chromosome, sequence id)
2. chromStart (starting position on the chromosome)
3. chromEnd (end position on the chromosome, note this base is not included!)
4. name (feature name)
5. score (between 0 and 1000)
6. strand (+ or -)
7. thickStart (the starting position at which the feature is drawn thickly)
8. thickEnd (the ending position at which the feature is drawn thickly)
9. itemRGB (RGB color → 255,0,0; display color of the data contained)
10. blockCount (the number of blocks (exons) in the BED line)
11. blockSizes (a comma-separated list of the block sizes)
12. blockStarts (a comma-separated list of the block starts, relative to chromStart)
These files may also take a track definition line that is visualization specific.
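A minimal sketch of reading the first six BED columns, to show how the "lower numbered fields must be filled" rule and the excluded end base play out in practice. The names `BedRecord` and `parse_bed_line` are hypothetical, and the example line is made up for illustration.

```python
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    chrom: str
    start: int                    # 0-based, this base is included
    end: int                      # this base is NOT included
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None


def parse_bed_line(line: str) -> BedRecord:
    """Parse the first six columns of a tab-separated BED line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, chromStart, chromEnd")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    # Optional columns are positional: column 6 can only exist if 4 and 5 do.
    name = fields[3] if len(fields) > 3 else None
    score = int(fields[4]) if len(fields) > 4 else None
    strand = fields[5] if len(fields) > 5 else None
    return BedRecord(chrom, start, end, name, score, strand)


rec = parse_bed_line("chr1\t999\t2000\texon1\t0\t+")
length = rec.end - rec.start  # half-open, so no +1 is needed
```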
BedGraph format
Tab separated, 4 required columns.
1. chrom (name of the chromosome, sequence id)
2. chromStart (starting position on the chromosome)
3. chromEnd (end position on the chromosome, note this base is not included!)
4. dataValue (value of the data for that region)
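A bedGraph line assigns one value to a whole interval, which is equivalent to a run of 1 bp intervals that all carry that value. The sketch below expands lines that way; `expand_bedgraph` is a hypothetical helper and the two input lines are invented for the example.

```python
def expand_bedgraph(lines):
    """Yield (chrom, position, value) for every base covered by bedGraph lines.

    Positions are 0-based and chromEnd is not included.
    """
    for line in lines:
        line = line.strip()
        # Skip blank lines and visualization-specific header lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split("\t")[:4]
        for pos in range(int(start), int(end)):
            yield chrom, pos, float(value)


example = ["chr1\t100\t103\t1.5", "chr1\t103\t104\t4"]
per_base = list(expand_bedgraph(example))
```

Because the end base is excluded, the first line covers positions 100, 101 and 102, and the second line starts at 103 without overlapping it.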
GFF format
Search for GFF3 → http://www.sequenceontology.org/gff3.shtml
Tab separated with 9 columns. Missing attributes may be replaced with a dot → .
1. Seqid (usually chromosome)
2. Source (where the data is coming from)
3. Type (usually a term from the Sequence Ontology)
4. Start (interval start relative to the seqid)
5. End (interval end relative to the seqid)
6. Score (the score of the feature, a floating point number)
7. Strand (+ or -)
8. Phase (used to indicate reading frame for coding sequences)
9. Attributes (semicolon separated attributes → Name=ABC;ID=1)
People like to stuff a lot of information into the attributes column.
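Since the 9th column carries most of the information, here is a small sketch of splitting it into key/value pairs. It assumes GFF3-style `key=value` pairs separated by semicolons, comma-separated multiple values, and percent-encoded special characters; `parse_gff3_attributes` is a hypothetical name, not part of any library.

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9: str) -> dict:
    """Split a GFF3 attributes column into {key: [values]}."""
    attrs = {}
    if column9 in ("", "."):
        return attrs  # a dot marks a missing column
    for pair in column9.strip().strip(";").split(";"):
        pair = pair.strip()
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Several values for one key are separated by commas.
        attrs[key] = [unquote(v) for v in value.split(",")]
    return attrs


attrs = parse_gff3_attributes("Name=ABC;ID=1")
```

Every value is kept as a list so that single and multiple values are handled the same way.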
Wiggle format
• Two versions → fixedStep and variableStep, each trying to optimize the amount of data storage
fixedStep chrom=chr1 start=100 step=1
10
15
11
22
… … …
variableStep chrom=chr1
100 10
101 15
102 11
103 22
variableStep chrom=chr2
2000 23
2005 40
… … …
Wiggle is a nasty format: it looks simpler than it is. Please avoid it.
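Part of what makes wiggle awkward is that the meaning of a data line depends on the most recent declaration line. The sketch below reads both flavors into (chrom, position, value) records. It is deliberately incomplete: it ignores the optional `span` parameter, and `parse_wiggle` is a hypothetical helper. Wiggle positions are 1-based.

```python
def parse_wiggle(lines):
    """Yield (chrom, position, value) from fixedStep/variableStep wiggle lines.

    Positions are 1-based. The optional span parameter is ignored here.
    """
    mode = chrom = pos = step = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        parts = line.split()
        if parts[0] in ("fixedStep", "variableStep"):
            # Declaration line: sets the state for the data lines that follow.
            mode = parts[0]
            params = dict(p.split("=", 1) for p in parts[1:])
            chrom = params["chrom"]
            if mode == "fixedStep":
                pos = int(params["start"])
                step = int(params.get("step", 1))
        elif mode == "fixedStep":
            # One value per line; the position is implied by start and step.
            yield chrom, pos, float(parts[0])
            pos += step
        elif mode == "variableStep":
            # Each line carries its own position and value.
            yield chrom, int(parts[0]), float(parts[1])
        else:
            raise ValueError("data line found before any declaration line")


wig = """fixedStep chrom=chr1 start=100 step=1
10
15
variableStep chrom=chr2
2000 23
2005 40
""".splitlines()
records = list(parse_wiggle(wig))
```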
We may have data in different coordinate systems!
Being “off by one” is one of the most common errors in bioinformatics.
Conversion from GFF to BED
(start, end) → (start - 1, end)
Conversion from BED to GFF
(start, end) → (start + 1, end)
Note that there will be differences when selecting positions that depend on the END coordinate!
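The two conversion rules above as a short sketch (the function names are hypothetical). Only the start changes; the end keeps the same number, although it means "last included base" in GFF and "first excluded base" in BED.

```python
def gff_to_bed(start, end):
    """1-based closed [start, end] -> 0-based half-open [start, end)."""
    return start - 1, end


def bed_to_gff(start, end):
    """0-based half-open [start, end) -> 1-based closed [start, end]."""
    return start + 1, end


gff_interval = (1000, 2000)
bed_interval = gff_to_bed(*gff_interval)   # (999, 2000)
round_trip = bed_to_gff(*bed_interval)     # back to (1000, 2000)

# The number of bases is unchanged by the conversion.
gff_length = gff_interval[1] - gff_interval[0] + 1
bed_length = bed_interval[1] - bed_interval[0]
```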
Handling coordinates relative to intervals
What are the coordinates of the bases preceding and following the interval? Seems trivial, and it is, with a catch.
GFF [start, end] → base before start is at start - 1
BED [start, end) → base before start is at start - 1
GFF [start, end] → next base after end is at end + 1
BED [start, end) → next base after end is at end
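The catch can be checked with a few lines. The sketch below returns the flanking positions in each format's own coordinate system and then confirms that, for the same physical interval, both answers point at the same bases. `flanking_bases` is a hypothetical helper.

```python
def flanking_bases(start, end, fmt):
    """Return (position before the interval, position after the interval),
    expressed in the coordinate system of the given format."""
    if fmt == "GFF":   # 1-based, end included
        return start - 1, end + 1
    if fmt == "BED":   # 0-based, end not included
        return start - 1, end
    raise ValueError(f"unknown format: {fmt}")


# The same physical interval written in both formats.
gff_before, gff_after = flanking_bases(1000, 2000, "GFF")
bed_before, bed_after = flanking_bases(999, 2000, "BED")

# A 0-based position p is the 1-based position p + 1.
same_bases = (bed_before + 1, bed_after + 1) == (gff_before, gff_after)
```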
Representing interval relationships
• We have a gene with three splicing variants
The gene starts at 1000 and ends at 8000; each exon is 1 kb long and is separated from the next by 1 kb.
How do we represent this relationship?
Data representation
• Both BED and GFF files can represent them
• Two common versions of GFF → GTF2 and GFF3 (note: tool documentation can often be wrong and show a weird combination of these two formats)
• In GFF the content of the ATTRIBUTE (9th) column specifies the relationship between features
GTF/GFF formats
GTF attributes:
– gene_id value; a globally unique identifier for the genomic source of the transcript
– transcript_id value; a globally unique identifier for the predicted transcript
gene_id "G1"; transcript_id "T1";
GFF attributes:
ID=exon1;Parent=T1
See the GFF3 site for the exact specification of what these mean.
Important: more than one parent may be listed!
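A small sketch of writing the 9th column in each flavor, using the identifiers from the slide (G1, T1, exon1). The helper names are hypothetical, and T2 is an invented second transcript used only to show multiple parents.

```python
def gtf_attributes(gene_id, transcript_id):
    """GTF style: space-separated key and quoted value, each ended by a semicolon."""
    return f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'


def gff3_attributes(feature_id, parents=()):
    """GFF3 style: key=value pairs joined by semicolons; parents joined by commas."""
    parts = [f"ID={feature_id}"]
    if parents:
        parts.append("Parent=" + ",".join(parents))
    return ";".join(parts)


gtf_col = gtf_attributes("G1", "T1")
gff_col = gff3_attributes("exon1", ["T1"])
shared_exon = gff3_attributes("exon1", ["T1", "T2"])  # one exon, two transcripts
```

In GFF3 a shared exon is written once with several parents, whereas GTF repeats the exon line once per transcript.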
Example interval as GTF
A distinct line is entered for each exon, repeated for each transcript
Example interval as GFF3
The same exon may be part of different transcripts (parents)
Example interval in BED
From the BED format specification
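In BED, each transcript becomes a single 12-column line whose blocks are the exons. The sketch below builds such lines for the gene described earlier (1000 to 8000, 1 kb exons separated by 1 kb). Several things here are assumptions made for illustration: the exon coordinates are read as 0-based half-open, the chromosome name is chr1, and the three variants T1, T2 and T3 (one full transcript and two that each skip an internal exon) are invented. `bed12_line` is a hypothetical helper.

```python
def bed12_line(chrom, name, strand, exons):
    """Build a 12-column BED line from a list of (start, end) exons.

    Exons are 0-based, half-open intervals on the chromosome.
    """
    exons = sorted(exons)
    start, end = exons[0][0], exons[-1][1]
    block_sizes = ",".join(str(e - s) for s, e in exons)
    # Block starts are relative to chromStart, not to the chromosome.
    block_starts = ",".join(str(s - start) for s, _ in exons)
    fields = [chrom, start, end, name, 0, strand,
              start, end, "0,0,0", len(exons), block_sizes, block_starts]
    return "\t".join(str(f) for f in fields)


# Assumed exon layout for the lecture's example gene.
exons = [(1000, 2000), (3000, 4000), (5000, 6000), (7000, 8000)]

# Hypothetical splice variants.
variants = {
    "T1": exons,
    "T2": [exons[0], exons[2], exons[3]],   # skips the second exon
    "T3": [exons[0], exons[1], exons[3]],   # skips the third exon
}

bed_lines = [bed12_line("chr1", name, "+", ex) for name, ex in variants.items()]
print("\n".join(bed_lines))
```

All three variants keep the first and last exon, so the first block starts at 0 and the last block ends at chromEnd in every line.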
Visualizing in IGV
Homework 25
• Create and visualize in IGV an interval file that contains three splice variants of a 1 kb long gene with 5 exons.
• Show the file and a screenshot