BC Cancer Agency
Genome Sciences Centre
Of Mice and Motifs and Best Laid Plans
Michael Smith Genome Sciences Centre
Terry Fox Laboratory
British Columbia Cancer Agency, Vancouver
CMMT, Vancouver
Of mice…
• To measure gene expression levels in tissues of developing mice to gain insight into the normal development process.
• To develop supporting technologies and techniques to improve the process for generating and analysing these data.
LongSAGE (Saha et al., 2002)
Data is:
• not constrained to known transcripts – novel gene discovery
• digital in nature
• easy to transfer
Overview of SAGE data
• 2 datasets
– October Freeze
  • 72 21-mer libraries
  • 8.55 million tags
  • 924,392 unique tag types
  • 49 tissues, 25 developmental stages
– January Freeze
  • 105 21-mer libraries (92 fully sequenced)
  • 11.65 million tags
  • 1,235,833 unique tag types
Processing SAGE data
Raw library → Tag clustering → Assign confidence scores → Assign tags to transcripts → Localise transcripts on genome → Analysis tools: DiscoverySpace
Tag Clustering
• Tag types cluster in tag space
– Colinge and Feger, 2001
• PCR error + sequencing error
– Akmaev and Wang, 2004
• We have used real PHRED values to quantify p-values per tag type
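A minimal sketch of how PHRED values can be turned into a per-tag error estimate. A PHRED score Q corresponds to a base-call error probability of 10^(-Q/10); the function names and the assumption that base-call errors are independent are illustrative only, and the p-value calculation mentioned on the slide is not reproduced here.

```python
def phred_to_error(q: float) -> float:
    """Convert a PHRED quality score Q into a base-call error probability."""
    return 10 ** (-q / 10)


def tag_sequence_error(phred_scores: list[int]) -> float:
    """Probability that at least one base of a tag was miscalled.

    ASSUMPTION: base-call errors are independent across positions.
    """
    p_all_correct = 1.0
    for q in phred_scores:
        p_all_correct *= 1.0 - phred_to_error(q)
    return 1.0 - p_all_correct


# Hypothetical 21-base tag with Q40 at every position.
example_error = tag_sequence_error([40] * 21)
```

Filtering on a threshold for this value is one way to discard tags with low sequence quality.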
Filtering out tags with low sequence quality reduces error rate
[Figure: error rate (y-axis, 5.0% to 10.0%) against 1 - sequence quality (x-axis, 0.001% to 100%)]
Tag/ Tag Type Confidence
• Individual Tag Error = (Base Library Error) combined with (Tag Sequence Error)
• Combine Individual Tag Errors to generate Tag Type errors for each library
• Combine Tag Type errors from each library to generate Tag Type error for the metalibrary
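The slide does not say how the errors are combined, so the following is only a sketch under stated assumptions: error sources are treated as independent, and a tag type is treated as spurious only if every observation of it is an error. All function names are hypothetical.

```python
from math import prod


def individual_tag_error(library_error: float, sequence_error: float) -> float:
    # ASSUMPTION: the base library error and the tag sequence error are
    # independent, so a tag is correct only if it escapes both.
    return 1.0 - (1.0 - library_error) * (1.0 - sequence_error)


def tag_type_error(tag_errors: list[float]) -> float:
    # ASSUMPTION: within one library, a tag type is spurious only if every
    # individual observation of it is erroneous.
    return prod(tag_errors)


def metalibrary_tag_type_error(per_library_errors: list[float]) -> float:
    # ASSUMPTION: libraries are independent, so the same product rule applies
    # when pooling libraries into the metalibrary.
    return prod(per_library_errors)
```

Under these assumptions a tag type seen many times, or in many libraries, quickly gains confidence even when each single observation is uncertain.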
CMOST: Tag Mapping
• SAGE library tags are mapped against virtual tag databases: RefSeq, Genome, Ensembl Transcripts, Mitochondrion, MGC
• Tag “modification”: single base permutation, addition, deletion
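A sketch of what tag "modification" could look like in code, reading "permutation" as a single-base substitution (an assumption). The functions and the dictionary layout of the virtual tag database are hypothetical and are not CMOST's actual interface.

```python
BASES = "ACGT"


def single_base_variants(tag: str) -> set[str]:
    """All sequences one substitution, insertion or deletion away from tag."""
    variants = set()
    for i in range(len(tag)):
        for base in BASES:
            if base != tag[i]:
                variants.add(tag[:i] + base + tag[i + 1:])  # substitution
        variants.add(tag[:i] + tag[i + 1:])  # deletion
    for i in range(len(tag) + 1):
        for base in BASES:
            variants.add(tag[:i] + base + tag[i:])  # insertion
    variants.discard(tag)
    return variants


def map_tag(tag: str, virtual_tags: dict[str, set[str]]) -> tuple[set[str], str]:
    """Try an exact match first, then fall back to single-base variants."""
    if tag in virtual_tags:
        return virtual_tags[tag], "exact"
    hits: set[str] = set()
    for variant in single_base_variants(tag):
        hits |= virtual_tags.get(variant, set())
    status = "modified" if hits else "unmapped"
    return hits, status
```

Insertions and deletions change the tag length here; a fixed-length tag scheme would need to re-trim or pad the variant, which this sketch leaves out.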
Tag Localization
[Diagrams: the Tag Mapper places tags against MGC, RefSeq and Genome relative to annotated exons; three outcomes are shown: known exon, novel gene/exon (?), and ambiguous mapping]
Abundant tags more likely to map
Coverage of Transcript Databases
Data source           Number of transcripts   Number observable (multiple)   Number observed (multiple)   Number observable (single)   Number observed (single)
Ensembl (known)       25,226                  24,674                         21,277                       19,536                       14,334
Ensembl (predicted)   8,317                   7,598                          4,455                        5,122                        1,308
RefSeq NM             17,720                  17,319                         15,008                       16,416                       13,076
MGC                   14,594                  14,518                         14,225                       9,413                        7,479
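Reading the two "observed" columns of the coverage table as counts (an assumption, since the values are far above 100), the percentage of observable transcripts actually observed can be worked out as below. The dictionary layout and function names are illustrative.

```python
# Counts copied from the coverage table, per data source:
# (observable_multiple, observed_multiple, observable_single, observed_single)
COVERAGE = {
    "Ensembl (known)": (24674, 21277, 19536, 14334),
    "Ensembl (predicted)": (7598, 4455, 5122, 1308),
    "RefSeq NM": (17319, 15008, 16416, 13076),
    "MGC": (14518, 14225, 9413, 7479),
}


def percent_observed(observed: int, observable: int) -> float:
    return 100.0 * observed / observable


def coverage_report(table: dict) -> dict:
    """Map each data source to (% observed multiple, % observed single)."""
    report = {}
    for source, (able_m, seen_m, able_s, seen_s) in table.items():
        report[source] = (
            percent_observed(seen_m, able_m),
            percent_observed(seen_s, able_s),
        )
    return report


if __name__ == "__main__":
    for source, (multi, single) in coverage_report(COVERAGE).items():
        print(f"{source:20s} multiple: {multi:5.1f}%  single: {single:5.1f}%")
```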
Is wider sampling better than very deep sampling ?
• 120,000 tags per library ~ equivalent to chip experiment (Lu et al, 2004)
• Ideally, would like 300,000-400,000 tags sampled to recover most genes
• Benefit to sampling a greater number of tissue/stage combinations
[Figure: number of NM RefSeq genes observed (y-axis, 0 to 12,000) against sampling depth (x-axis, 0 to 900,000)]
GO Analysis of 177 common genes
• 38% – metabolism
• 19% – cell growth and/or maintenance
• 13% – transport
• 6% – cell communication
Where do the tags map ?
Location                     Gene evidence   All (A > 0)   A > 1     A > 10   A > 60   A > 1000
Number of unique locations   -               261,134       106,961   25,829   8,855    424
Annotated exon               Known           12.1%         17.9%     23.8%    28.3%    34.7%
Annotated exon               Novel           0.9%          1.2%      1.2%     1.1%     0.7%
Annotated UTR                Known           8.0%          14.6%     30.9%    46.0%    58.0%
Annotated UTR                Novel           0.3%          0.5%      1.0%     1.2%     1.4%
Intron                       Known           20.0%         14.3%     4.4%     1.8%     1.2%
Intron                       Novel           1.5%          1.1%      0.4%     0.2%     0%
Putative UTR                 Known           0.5%          0.7%      0.8%     0.5%     0.5%
Putative UTR                 Novel           0.2%          0.2%      0.2%     0.2%     0%
Intergenic                   -               56.3%         49.5%     37.4%    20.8%    3.5%
How many genes observed ?
• 107k transcripts covering 18.6k high quality annotated genes
• 14k transcripts covering 4k predicted RefSeq and ENSEMBL genes
• ~21k genes observed
What are the “intergenic tags” ?
• 140k tags unaccounted for…
• Novel genes ?
• 24k transcripts covering 12k UNIGENE and ENSEMBL EST genes
• 36% map antisense to annotated genes
• Many are singletons
Singletons
• Unannotated singletons – no genes, ESTs
• 81% success rate for meta-singletons
• 74% success rate for library singletons
Summary
• The majority of singletons represent bona fide transcriptional elements
• We have identified novel transcripts
• Evidence of differentially regulated variants resulting in different protein
• Data providing functional annotation
… and motifs…
• The transcription of a gene is dependent on at least
– 1) the DNA binding factors present in the nucleus at a given time and
– 2) the DNA sequences, or cis-regulatory motifs, present in the gene region to which these factors can bind
Our goal is to attempt to identify the regulatory motifs
Project Goals
• High quality in-silico discovery of gene regulatory elements on a genome wide scale
Approach based on:
• Overrepresentation of similar DNA motifs in upstream sequences of genes with the same regulatory control
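One simple way to make "overrepresentation" concrete is a hypergeometric test on how many upstream sequences contain a motif. This is a sketch under assumptions: exact string matching stands in for real (degenerate) motif models, and the choice of test is illustrative rather than what the pipeline uses.

```python
from math import comb


def count_hits(motif: str, sequences: list[str]) -> int:
    """Number of sequences containing the motif at least once (exact match)."""
    return sum(motif in seq for seq in sequences)


def hypergeom_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) when n sequences are drawn from N, of which K contain the motif."""
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    )
    return favourable / total


def overrepresentation_p(motif: str, target: list[str], background: list[str]) -> float:
    """p-value for the motif being enriched in target, a subset of background."""
    return hypergeom_tail(
        count_hits(motif, target),
        len(target),
        count_hits(motif, background),
        len(background),
    )
```

A small p-value suggests the motif occurs in the co-regulated set more often than expected from the background upstream sequences.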
Our Method
• Use orthologous genes, i.e. the equivalent genes in different organisms.
• Use regions from genes which display strong co-expression (infer co-regulation).
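The pipeline diagram later in the deck names Pearson distance as the measure for identifying reliably co-expressed genes. A minimal sketch, assuming the common definition of Pearson distance as 1 minus the Pearson correlation of two expression profiles (the deck does not give the formula):

```python
from math import sqrt


def pearson_correlation(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def pearson_distance(x: list[float], y: list[float]) -> float:
    # ASSUMPTION: distance = 1 - r, so 0 means perfectly co-expressed
    # and 2 means perfectly anti-correlated.
    return 1.0 - pearson_correlation(x, y)
```

Gene pairs whose distance falls below a chosen cut-off would then be treated as co-expressed; the cut-off itself is not stated in the slides.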
Orthologues From ComparaDB
E. Birney et al., Nucl. Acids Res. 32 (2004)
M. Clamp et al., Nucl. Acids Res. 31 (2003)
ActinAlphaCardiac
Multiple Sequence Alignment
ActinAlphaCardiac
Co-expression datasets
1. Cancer Genome Anatomy Project; Gene Expression Omnibus
2. Gene Expression Omnibus
3. Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science 2003, 302(5643):249-255.
The Regulatory Element Pipeline
[Diagram with six stages: Gene Expression Data, Sequence Identification, Algorithm Implementation, Post-Processing, Known Resources, Accuracy Assessment. Components shown:
• Co-expressed gene pipeline: SAGE data, Affy data, cDNA array data; identify reliably co-expressed genes (Pearson distance); gene list; co-expression
• Sequence set builder: sequence data from EnsEMBL (vXX); orthology data from 'compara'; synthetic orthologue generator (DUNE); generate FASTA files for motif discovery; FASTA sequence sets for each target gene (target gene sequence set, background sequence set, 'null' distribution sequence set)
• Pipeline/cluster: manager for discovery pipeline jobs; method 'wrapper' with motif discovery application and pre- and post-processing; methods: (W)CONSENSUS, MotifSampler, MEME, MDmodule, Gibbs Motif Sampler, BioProspector, ANN-Spec, AlignACE, ...
• Post-processing: original output files from discovery methods; convert results; output files in common text format; assign method-independent motif significance; filter motifs by p-value; filtered motifs; identify known motifs; motif clustering; module detection; visualize results
• cisRED: individual motifs, nonredundant motif clusters, modules, versions; results; visualization
• 'Known' resources: TRANSFAC, JASPAR, user PFMs, site seqs, ...; Gene Ontology; protein-protein binding; literature mining; known regulatory motifs from literature, in common format
• Accuracy assessment framework (AAF): accuracy assessment, accuracy results, training/optimizing]
Pipeline Core: Parallel Multi-Method Pipeline
[Diagram: input MFA files 1..N and background (Bck) files 1..N are converted into input files in the format specific to each method (1, 2, ..., M). The set of motif discovery algorithms (WCONSENSUS, PHYLOCON, TEIRESIAS, MOTIFSAMPLER, MEME, MDMODULE, GIBBS, CONSENSUS, BIOPROSPECTOR, ANNSPEC, etc.) runs on the HPC cluster: 368 CPUs running #Genes × #[Algorithm, Parameter set] jobs. Raw (method-dependent) output is converted into standardized, method-independent results.]
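The job count "#Genes × #[Algorithm, Parameter set]" is a Cartesian product, which can be sketched as follows. The job dictionary layout and the parameter names are hypothetical and do not describe the GSC job manager.

```python
from itertools import product


def enumerate_jobs(genes: list[str], method_parameter_sets: list[tuple[str, dict]]) -> list[dict]:
    """One job per (gene, [algorithm, parameter set]) combination."""
    return [
        {"gene": gene, "algorithm": algorithm, "params": params}
        for gene, (algorithm, params) in product(genes, method_parameter_sets)
    ]


# Hypothetical inputs: three target genes, three [algorithm, parameter set] pairs.
example_genes = ["geneA", "geneB", "geneC"]
example_methods = [
    ("MEME", {"width": 8}),
    ("MEME", {"width": 12}),
    ("GIBBS", {"width": 10}),
]
example_jobs = enumerate_jobs(example_genes, example_methods)
```

Because each job is independent of the others, the list can be handed to a cluster scheduler in any order.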
Method Independent Scoring
[Diagram: discovery output from the pipeline core feeds a scoring function (for target sequence hit). Elements shown, with weights w1, w2, w3, ..., wn:
• width w
• sequence similarity (weighted); sequence weights based on phylogenetic distance or co-expression
• information content profile; base frequencies compared to whole genome
• "known" information content profiles from Transfac and JASPAR
• number of sequences with hits vs number of sequences in input file
• number of input sequences
• determine SNP profile for all species sequence]
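An information content profile can be computed per motif position as the relative entropy of the observed base frequencies against background frequencies. The sketch below uses that standard definition; whether the pipeline uses exactly this form is an assumption, and the uniform background is a placeholder for whole-genome base frequencies.

```python
from math import log2

# Placeholder background; real use would substitute whole-genome frequencies.
UNIFORM_BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def column_information(counts: dict[str, float], background: dict[str, float]) -> float:
    """Relative entropy (in bits) of one motif column against the background."""
    total = sum(counts.values())
    info = 0.0
    for base, count in counts.items():
        if count > 0:  # a zero count contributes nothing to the sum
            p = count / total
            info += p * log2(p / background[base])
    return info


def information_content_profile(pfm: list[dict[str, float]], background: dict[str, float]) -> list[float]:
    """Information content of every column of a position frequency matrix."""
    return [column_information(column, background) for column in pfm]
```

With a uniform background, a fully conserved column scores 2 bits and a column with equal base counts scores 0.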
Cumulative distributions of MI scores
HitPlotter: 1500bp
...
TATA box
Co-occurring Motifs
• Red and Blue motifs co-occur in the promoter regions of these two genes
• The separation of the two motifs may be constrained
• Use co-occurrence motifs to define regulatory modules
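A sketch of how co-occurring motif pairs with a constrained separation could be detected. Exact string matching and the `max_separation` parameter are simplifying assumptions; gene names and sequences in any usage are hypothetical.

```python
def find_positions(motif: str, sequence: str) -> list[int]:
    """Start positions of every (possibly overlapping) exact motif match."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def co_occurrences(
    motif_a: str,
    motif_b: str,
    promoters: dict[str, str],
    max_separation: int,
) -> dict[str, list[tuple[int, int]]]:
    """Per gene, the (pos_a, pos_b) pairs lying within max_separation bases."""
    result = {}
    for gene, sequence in promoters.items():
        pairs = [
            (a, b)
            for a in find_positions(motif_a, sequence)
            for b in find_positions(motif_b, sequence)
            if abs(a - b) <= max_separation
        ]
        if pairs:
            result[gene] = pairs
    return result
```

Genes that share the same motif pair at a similar spacing are candidates for a common regulatory module.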
Putting it all together…
[Diagram: the regulatory element pipeline, annotated with: Gene Specific Motifs and Modules; Tissue Specific Gene Expression Patterns; Tissue Specific Motifs and Modules]
…And best laid plans
• But, Mousie, thou art no thy lane,
In proving foresight may be vain;
The best-laid schemes o' mice an' men
Gang aft agley,
An' lea'e us nought but grief an' pain,
For promis'd joy!
– Robert Burns
The Moral of the Story
• If you’re a mouse, don’t make your home in a farmer’s field – build it next to the field!
• Risk Management!
• What are the issues associated with running a large bioinformatics activity ?
Running a bioinformatics group
• What does everyone do ?
• How are they doing it ?
• Are they talking to the right people ?
• Have they got the right requirements ?
• Is anyone waiting for information ?
• Are they running on schedule ?
• Is there an issue that needs escalating ?
• Are there HR, training, management, coaching issues that need to be addressed ?
Organizational Complexity
• The organizational complexity of bioinformatics projects has increased:
– Made up of larger teams
– Have multiple stakeholders
– Contain many organizational layers
Technical Complexity
• Number of databases increasing
• Number of methods increasing
• Body of knowledge is developing rapidly
• Requirements change rapidly
• Must be well-read in a large number of fields
Common Statements
– “Things change all the time - it’s impossible to plan”
– “I’d like you to do some analysis”
– “I don’t have time to plan”
– “We’ll figure it out as we go along”
– Not so common – “An ounce of prevention is worth a pound of cure”
Software Engineering Management
• Large body of knowledge
• Requirements engineering
• Architecture and Design
• Validation
• Change management
• Risk management
Solutions at the GSC
• CM controls
• Bug tracking controls
• Some validation controls
• Various levels of design and architecture
• Implementation of structured engineering process under way to define, track and manage work
– Requirements control
– Risk/Change management
• ~90% of work performed by the group can be planned or have a LOE assigned
• Some areas harder – finishing a genome, algorithm development, exploratory analysis
• There is always a schedule and a budget
Process controls risk… but at a cost
RE = P(L) * S(L)
P(L) = probability of loss
S(L) = size of loss
[Figure: risk exposure (RE, y-axis) against time and effort invested in plans (x-axis), with two curves: RE due to inadequate planning and RE due to market share erosion]
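A worked sketch of the risk exposure formula, RE = P(L) * S(L), summed over the two risks named in the figure. Every probability and loss size below is a made-up number chosen only to show the trade-off; none comes from the talk.

```python
def risk_exposure(p_loss: float, size_loss: float) -> float:
    """RE = P(L) * S(L)."""
    return p_loss * size_loss


# HYPOTHETICAL (P(L), S(L)) pairs per planning level, for illustration only.
SCENARIOS = {
    "little planning": {
        "inadequate_planning": (0.60, 100.0),
        "market_share_erosion": (0.05, 100.0),
    },
    "moderate planning": {
        "inadequate_planning": (0.20, 100.0),
        "market_share_erosion": (0.15, 100.0),
    },
    "heavy planning": {
        "inadequate_planning": (0.05, 100.0),
        "market_share_erosion": (0.60, 100.0),
    },
}


def total_exposure(risks: dict[str, tuple[float, float]]) -> float:
    """Sum of RE over every risk in one scenario."""
    return sum(risk_exposure(p, s) for p, s in risks.values())


def best_scenario(scenarios: dict) -> str:
    """Name of the scenario with the lowest total risk exposure."""
    return min(scenarios, key=lambda name: total_exposure(scenarios[name]))
```

With these illustrative numbers the lowest total exposure sits at the intermediate level of planning, which is the shape of trade-off the slide title describes.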
Hacky Scripts/Code have their place
• Ideal for prototyping
– Only prototype when you are trying to get a handle on things
• Throw away the prototype, when you’re done experimenting!
• …but stop and think!
And standards…
Mouse Atlas
Steven Jones, Genome Sciences Centre
Asim Siddiqui, Scott Zuyderduyn, Richard Varhol, Derek Leung, Kevin Teague, Lisa Lee, Anita Landry
Elizabeth M. Simpson, CMMT
Robert Xie, Slavita Bohacec, Byron Kuo
Adrian Burke, GenomeBC
Caroline Astell, Project Manager
Pamela Hoodless, Terry Fox Laboratory
Jim Rupert, Mona Wu, Rebecca Cullum
Cheryl Helgason, Cancer Endocrinology
Brad Hoffman, Teresa Ruiz de Alagara, Ida Zhang
Marco Marra, Genome Sciences Centre
Jaswinder Khattra, Allen Delaney, Jennifer Asano, Susanna Chan
Gregory Riggins, Johns Hopkins
CisRed
GSC
Marco Marra, Gordon Robertson, Richard Varhol, Kevin Teague, Obi Griffith, Erin Pleasance, Debra Fulton, Keven Lin, Mikhail Bilenky, Neil Roberston, Monica Sluemer, Stephen Montgomery, Asim Siddiqui
Ian Holmes, UC Berkeley
Ewan Birney, EBI
Stanford University
Rick Myers, Nathan Trinklein, Shelley Force Alldred, Sarah Hartman
www.mouseAtlas.org
www.cisRed.org