Of Mice and Motifs and Best Laid Plans

Michael Smith Genome Sciences Centre and Terry Fox Laboratory
British Columbia Cancer Agency, Vancouver

CMMT, Vancouver

Of mice…

• To measure gene expression levels in tissues of developing mice to gain insight into the normal development process.

• To develop supporting technologies and techniques to improve the process for generating and analysing these data.



LongSAGE (Saha et al, 2002)

Data is:
• not constrained to known transcripts – novel gene discovery
• digital in nature
• easy to transfer



Overview SAGE data

• 2 datasets
– October Freeze
  • 72 21-mer libraries
  • 8.55 million tags
  • 924,392 unique tag types
  • 49 tissues, 25 developmental stages
– January Freeze
  • 105 21-mer libraries (92 fully sequenced)
  • 11.65 million tags
  • 1,235,833 unique tag types



Processing SAGE data

Raw Library → Tag Clustering → Assign confidence scores → Assign tags to transcripts → Localise transcripts on genome

Analysis Tools: DiscoverySpace



Tag Clustering

• Tag types cluster in tag space
– Colinge and Feger, 2001
• PCR error + sequencing error
– Akmaev and Wang, 2004
• We have used real PHRED values to quantify p-values per tag type
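As a rough illustration of how PHRED values translate into a per-tag error estimate, the sketch below uses the standard PHRED definition (error probability = 10^(-Q/10)) and assumes base-call errors are independent. The function names and the example tag are hypothetical, and this is not necessarily the calculation used for the p-values on the slide.

```python
def phred_to_error(q: float) -> float:
    """Convert a PHRED quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def tag_error_probability(phred_scores: list[int]) -> float:
    """Probability that at least one base of a tag is miscalled.

    Assumption (not from the slides): base-call errors are independent.
    """
    p_all_correct = 1.0
    for q in phred_scores:
        p_all_correct *= 1.0 - phred_to_error(q)
    return 1.0 - p_all_correct


if __name__ == "__main__":
    # Hypothetical 21-base tag with a uniform quality of 30 at every base.
    print(tag_error_probability([30] * 21))
```

Under this assumption a tag is only as reliable as its weakest bases, which is consistent with filtering on sequence quality.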



Filtering out tags with low sequence quality reduces error rate

[Figure: error rate (y-axis, 5.0% to 10.0%) against 1 - sequence quality (x-axis, 0.001% to 100%, logarithmic scale)]



Tag/ Tag Type Confidence

• Individual Tag Error = (Base Library Error) combined with (Tag Sequence Error)

• Combine Individual Tag Errors to generate Tag Type errors for each library

• Combine Tag Type errors from each library to generate Tag Type error for the metalibrary
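The slide does not say how the errors are combined, so the following is only one plausible reading, sketched under explicit assumptions: errors are independent, a tag is wrong if either the library-level or the sequence-level error occurs, and a tag type is spurious only if every tag supporting it is wrong. All function names are hypothetical.

```python
from math import prod


def individual_tag_error(base_library_error: float, tag_sequence_error: float) -> float:
    """Error for one tag: wrong if either error source occurs (assumed independent)."""
    return 1.0 - (1.0 - base_library_error) * (1.0 - tag_sequence_error)


def tag_type_error(tag_errors: list[float]) -> float:
    """Error for a tag type in one library.

    Assumption: the tag type is spurious only if all of its tags are erroneous.
    """
    return prod(tag_errors)


def metalibrary_tag_type_error(per_library_errors: list[float]) -> float:
    """Error for a tag type across libraries, under the same assumption."""
    return prod(per_library_errors)


if __name__ == "__main__":
    # Hypothetical numbers: two tags of one type in a library with 1% base error.
    errors = [individual_tag_error(0.01, 0.05), individual_tag_error(0.01, 0.10)]
    print(tag_type_error(errors))
```

With this reading, a tag type seen several times, or in several libraries, quickly becomes more credible than a singleton.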



CMOST: Tag Mapping

SAGE Library → virtual tag databases:
• RefSeq
• Genome
• Ensembl Transcripts
• Mitochondrion
• MGC

Tag “Modification”: single base permutation, addition, deletion
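The tag "modification" step can be pictured as enumerating every sequence one edit away from an observed tag and looking each up in the virtual tag databases. The sketch below reads "permutation" as a single-base substitution, which is an assumption. The function name is hypothetical and this is not CMOST's actual code.

```python
BASES = "ACGT"


def single_base_variants(tag: str) -> set[str]:
    """All sequences one substitution, insertion or deletion away from `tag`."""
    variants = set()
    for i, base in enumerate(tag):
        # Substitution: replace the base at position i with each other base.
        for b in BASES:
            if b != base:
                variants.add(tag[:i] + b + tag[i + 1:])
        # Deletion: drop the base at position i.
        variants.add(tag[:i] + tag[i + 1:])
    # Insertion ("addition"): add a base at every position, including both ends.
    for i in range(len(tag) + 1):
        for b in BASES:
            variants.add(tag[:i] + b + tag[i:])
    variants.discard(tag)  # the unmodified tag is not a variant
    return variants


if __name__ == "__main__":
    print(len(single_base_variants("CATGACGTACGTACGTACGTA")))  # hypothetical 21-mer
```

Insertions and deletions change the tag length, so a real mapper would also need to decide how to trim or pad variants back to 21 bases; that choice is left open here.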



Tag Localization

[Figures: the tag mapper places a tag against the MGC, RefSeq and genome tracks, shown alongside the gene's exons. Three cases are illustrated:]
• Known exon
• Novel gene/exon?
• Ambiguous mapping
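The three localization outcomes shown on these slides (known exon, possible novel gene/exon, ambiguous mapping) can be sketched as a simple classification over genome hits. The data layout, coordinates as plain integers and exons as (start, end) pairs, is hypothetical and is not taken from the tag mapper itself.

```python
def classify_tag(hit_positions: list[int], exons: list[tuple[int, int]]) -> str:
    """Classify a tag from its genome hit positions and annotated exon intervals.

    Assumptions: positions share one coordinate system and exon bounds are inclusive.
    """
    if not hit_positions:
        return "unmapped"
    if len(hit_positions) > 1:
        return "ambiguous mapping"
    position = hit_positions[0]
    if any(start <= position <= end for start, end in exons):
        return "known exon"
    return "novel gene/exon?"


if __name__ == "__main__":
    exons = [(100, 200), (400, 500), (800, 900)]  # hypothetical three-exon gene
    for hits in ([150], [650], [150, 5000], []):
        print(hits, "->", classify_tag(hits, exons))
```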



Abundant tags more likely to map



Coverage of Transcript Databases

Data source | Number of transcripts | Observable (multiple) | Observed (multiple) | Observable (single) | Observed (single)
Ensembl (known) | 25,226 | 24,674 | 21,277 | 19,536 | 14,334
Ensembl (predicted) | 8,317 | 7,598 | 4,455 | 5,122 | 1,308
RefSeq NM | 17,720 | 17,319 | 15,008 | 16,416 | 13,076
MGC | 14,594 | 14,518 | 14,225 | 9,413 | 7,479

The slide heads the two "observed" columns "% observed", but the values given are counts.



Is wider sampling better than very deep sampling ?

• 120,000 tags per library ~ equivalent to chip experiment (Lu et al, 2004)

• Ideally, would like 300,000-400,000 tags sampled to recover most genes

• Benefit to sampling a greater number of tissue/stage combinations

[Figure: number of NM RefSeq genes observed (y-axis, 0 to 12,000) against sampling depth (x-axis, 0 to 900,000)]
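The shape of such a saturation curve can be reasoned about with a standard expectation: if tags are drawn independently, a gene with relative abundance p is seen at least once in n tags with probability 1 - (1 - p)^n. The sketch below sums this over genes. The Zipf-like abundances are purely hypothetical and are not the Mouse Atlas data.

```python
def expected_genes_observed(abundances: list[float], depth: int) -> float:
    """Expected number of distinct genes seen at least once in `depth` sampled tags.

    Assumption: each tag is drawn independently in proportion to gene abundance.
    """
    total = float(sum(abundances))
    return sum(1.0 - (1.0 - a / total) ** depth for a in abundances)


if __name__ == "__main__":
    # Hypothetical Zipf-like abundances for 12,000 genes.
    abundances = [1.0 / rank for rank in range(1, 12001)]
    for depth in (120_000, 400_000, 900_000):
        print(depth, round(expected_genes_observed(abundances, depth)))
```

Because the curve flattens with depth, the same sequencing budget spread over more tissue and stage combinations can recover genes that deeper sampling of one library would not, which is the trade-off this slide raises.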



GO Analysis of 177 common genes

• 38% – metabolism
• 19% – cell growth and/or maintenance
• 13% – transport
• 6% – cell communication



Where do the tags map?

Location | Gene evidence | All (A > 0) | A > 1 | A > 10 | A > 60 | A > 1000
Number of unique locations | - | 261,134 | 106,961 | 25,829 | 8,855 | 424
Annotated exon | Known | 12.1% | 17.9% | 23.8% | 28.3% | 34.7%
Annotated exon | Novel | 0.9% | 1.2% | 1.2% | 1.1% | 0.7%
Annotated UTR | Known | 8.0% | 14.6% | 30.9% | 46.0% | 58.0%
Annotated UTR | Novel | 0.3% | 0.5% | 1.0% | 1.2% | 1.4%
Intron | Known | 20.0% | 14.3% | 4.4% | 1.8% | 1.2%
Intron | Novel | 1.5% | 1.1% | 0.4% | 0.2% | 0%
Putative UTR | Known | 0.5% | 0.7% | 0.8% | 0.5% | 0.5%
Putative UTR | Novel | 0.2% | 0.2% | 0.2% | 0.2% | 0%
Intergenic | - | 56.3% | 49.5% | 37.4% | 20.8% | 3.5%



How many genes observed ?

• 107k transcripts covering 18.6k high quality annotated genes

• 14k transcripts covering 4k predicted RefSeq and ENSEMBL genes

• ~21k genes observed



What are the “intergenic tags”?

• 140k tags unaccounted for…
• Novel genes?
• 24k transcripts covering 12k UNIGENE and ENSEMBL EST genes
• 36% map antisense to annotated genes
• Many are singletons



Singletons

• Unannotated singletons – no genes, ESTs
• 81% success rate for meta-singletons
• 74% success rate for library singletons



Summary

• The majority of singletons represent bona fide transcriptional elements
• We have identified novel transcripts
• Evidence of differentially regulated variants resulting in different protein
• Data providing functional annotation



… and motifs…

• The transcription of a gene is dependent on at least
– 1) the DNA binding factors present in the nucleus at a given time and
– 2) the DNA sequences, or cis-regulatory motifs, present in the gene region to which these factors can bind

Our goal is to attempt to identify the regulatory motifs



Project Goals

• High quality in-silico discovery of gene regulatory elements on a genome wide scale

Approach based on:

• Overrepresentation of similar DNA motifs in upstream sequences of genes with the same regulatory control



Our Method

• Use orthologous genes, i.e. the equivalent genes in different organisms.

• Use regions from genes which display strong co-expression (infer co-regulation).



Orthologues From ComparaDB

E. Birney et al., Nucl. Acids Res. 32 (2004)
M. Clamp et al., Nucl. Acids Res. 31 (2003)

ActinAlphaCardiac



Multiple Sequence Alignment

ActinAlphaCardiac



Co-expression datasets

1. Cancer Genome Anatomy Project; Gene Expression Omnibus
2. Gene Expression Omnibus
3. Stuart JM, Segal E, Koller D, Kim SK: A gene-coexpression network for global discovery of conserved genetic modules. Science 2003, 302(5643):249-255.



The Regulatory Element Pipeline

[Diagram: components listed under the slide's six stage headings; grouping approximate]

• Gene Expression Data
– Co-expressed gene pipeline: SAGE data, Affy data, cDNA array data
– Identify reliably co-expressed genes (Pearson distance)
– Gene list
• Sequence Identification
– Sequence data and orthology data from 'compara'; Co-expr'n, EnsEMBL (vXX)
– Sequence set builder
– Synthetic orthologue generator (DUNE)
– Generate FASTA files for motif discovery
– FASTA sequence sets for each target gene: target gene sequence set, background sequence set, 'null' distn sequence set
• Algorithm Implementation
– Pipeline/cluster, with a manager for discovery pipeline jobs
– Method 'wrapper', with motif discovery application and pre- and post-processing
– Methods: (W)CONSENSUS, MotifSampler, MEME, MDmodule, Gibbs Motif Sampler, BioProspector, ANN-Spec, AlignACE, ...
– Original output files from discovery methods
– Output files in common text format
• Post-Processing
– Convert results
– Assign method-independent motif significance
– Filter motifs by p-value; filtered motifs
– Identify known motifs
– Motif clustering
– Module detection
– Visualize results; visualization
– cisRED: individual motifs, nonredundant motif clusters, modules, versions
– Results
• Known Resources
– 'Known' resources: TRANSFAC, JASPAR, user PFMs, site seqs, ...
– Gene Ontology, protein-protein binding, literature mining
– Known regulatory motifs from literature, in common format
• Accuracy Assessment
– Accuracy assessment framework (AAF)
– Accuracy assessment, accuracy results
– Training / optimizing
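The pipeline's "identify reliably co-expressed genes (Pearson distance)" step can be illustrated with the usual definition, distance = 1 - r, where r is the Pearson correlation between two expression profiles. The threshold, gene names and profiles below are hypothetical, and the sketch does not reproduce the pipeline's own reliability criteria.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x: list[float], y: list[float]) -> float:
    """Return 1 - r, where r is the Pearson correlation of two expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return 1.0 - cov / sqrt(var_x * var_y)


def co_expressed_pairs(profiles: dict[str, list[float]], max_distance: float = 0.2):
    """Gene pairs whose Pearson distance is at most `max_distance` (hypothetical cut-off)."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if pearson_distance(profiles[g1], profiles[g2]) <= max_distance
    ]


if __name__ == "__main__":
    profiles = {  # hypothetical expression values across four libraries
        "geneA": [1.0, 2.0, 3.0, 4.0],
        "geneB": [2.1, 3.9, 6.2, 8.0],
        "geneC": [4.0, 3.0, 2.0, 1.0],
    }
    print(co_expressed_pairs(profiles))
```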



Pipeline Core: Parallel Multi-Method Pipeline

• Inputs: input MFA 1, 2, ... N and background (Bck) files 1, 2, ... N
• Convert: input file, format specific to method 1, 2, ... M
• Set of motif discovery algorithms: WCONSENSUS, PHYLOCON, TEIRESIAS, MOTIFSAMPLER, MEME, MDMODULE, GIBBS, CONSENSUS, BIOPROSPECTOR, ANNSPEC, etc.
• HPC cluster: 368 CPUs running #Genes × #[Algorithm, Parameter set] jobs
• Raw output (method dependent)
• Convert: standardized, method-independent results
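The job count on this slide, #Genes × #[Algorithm, Parameter set], is a Cartesian product of target genes and (method, parameter set) pairs. A minimal sketch of that enumeration follows. The gene identifiers and parameter names are hypothetical, and the scheduling onto the cluster is not modelled.

```python
from itertools import product


def enumerate_jobs(genes, method_parameter_sets):
    """Return one (gene, method, parameters) job per gene and (method, parameter set) pair."""
    return [
        (gene, method, params)
        for gene, (method, params) in product(genes, method_parameter_sets)
    ]


if __name__ == "__main__":
    genes = ["gene_0001", "gene_0002", "gene_0003"]  # hypothetical identifiers
    method_parameter_sets = [
        ("MEME", {"width": 8}),   # parameter names are illustrative only
        ("MEME", {"width": 12}),
        ("GIBBS", {"width": 10}),
    ]
    jobs = enumerate_jobs(genes, method_parameter_sets)
    print(len(jobs), "jobs; first:", jobs[0])
```

Because each job depends only on its own gene and parameter set, the jobs can be distributed across CPUs without coordination, which suits the parallel layout the slide describes.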



Method Independent Scoring

• Pipeline core → discovery output (motif of width w)
• Sequence weights w1, w2, w3, ... wn, based on phylogenetic distance or co-expression
• Sequence similarity (weighted)
• Information content profile
• “Known” information content profiles: Transfac, JASPAR
• # seq with hits vs # seq in input file
• # base freq compared to whole genome
• # input sequences
• Scoring function (for target sequence hit)
• Determine SNP profile for all species sequence
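One ingredient named on this slide, the information content profile, has a standard definition: for each motif column, the relative entropy (in bits) of the observed base frequencies against a background. The sketch below computes it from raw counts with a uniform background by default. It omits pseudocounts and sequence weights, and it is not the scoring function used in the pipeline.

```python
from math import log2

BASES = "ACGT"


def column_information_content(counts: dict[str, float], background=None) -> float:
    """Relative entropy (bits) of one motif column against background base frequencies."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumption: uniform background
    total = sum(counts.get(b, 0) for b in BASES)
    if total == 0:
        raise ValueError("column has no counts")
    ic = 0.0
    for b in BASES:
        p = counts.get(b, 0) / total
        if p > 0:  # treat 0 * log(0) as 0
            ic += p * log2(p / background[b])
    return ic


def information_content_profile(count_matrix: list[dict[str, float]], background=None):
    """Information content of each column of a position frequency matrix (PFM)."""
    return [column_information_content(col, background) for col in count_matrix]


if __name__ == "__main__":
    # Hypothetical 4-column PFM resembling a TATA-like site.
    pfm = [
        {"T": 10},
        {"A": 9, "T": 1},
        {"T": 8, "A": 2},
        {"A": 5, "C": 2, "G": 2, "T": 1},
    ]
    print([round(v, 2) for v in information_content_profile(pfm)])
```

A genome-derived background can be passed in place of the uniform default, in the spirit of the slide's "base freq compared to whole genome" input.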



Cumulative distributions of MI scores



HitPlotter: 1500bp



...

TATA box



Co-occurring Motifs

• Red and Blue motifs co-occur in the promoter regions of these two genes

• The separation of the two motifs may be constrained

• Use co-occurring motifs to define regulatory modules
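The idea that two motifs co-occur with a constrained separation can be sketched as follows: for each gene carrying both motifs, take the closest pair of hits, then ask whether the separations agree across genes. The data layout, the "red"/"blue" labels as dictionary keys, the tolerance and the minimum gene count are all hypothetical. This is not the module-detection method used in the pipeline.

```python
def motif_separations(hits: dict, motif_a: str, motif_b: str) -> dict:
    """Signed separation (b - a) of the closest hit pair, per gene that has both motifs.

    `hits` maps gene -> motif name -> list of promoter positions.
    """
    separations = {}
    for gene, motif_hits in hits.items():
        a_positions = motif_hits.get(motif_a, [])
        b_positions = motif_hits.get(motif_b, [])
        if a_positions and b_positions:
            separations[gene] = min(
                (b - a for a in a_positions for b in b_positions), key=abs
            )
    return separations


def is_candidate_module(separations: dict, min_genes: int = 2, tolerance: int = 10) -> bool:
    """True if enough genes share the motif pair with similar spacing (hypothetical rule)."""
    if len(separations) < min_genes:
        return False
    values = list(separations.values())
    return max(values) - min(values) <= tolerance


if __name__ == "__main__":
    hits = {  # hypothetical promoter hit positions
        "geneA": {"red": [100], "blue": [150]},
        "geneB": {"red": [300], "blue": [355]},
        "geneC": {"red": [10]},
    }
    seps = motif_separations(hits, "red", "blue")
    print(seps, is_candidate_module(seps))
```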



Putting it all together…

[Diagram: the regulatory element pipeline, repeated from the earlier slide, with three added labels:]

• Gene-specific motifs and modules
• Tissue-specific gene expression patterns
• Tissue-specific motifs and modules



…And best laid plans

• But, Mousie, thou art no thy lane,
In proving foresight may be vain;
The best-laid schemes o' mice an' men
Gang aft agley,
An' lea'e us nought but grief an' pain,
For promis'd joy!

– Robert Burns



The Moral of the Story

• If you’re a mouse, don’t make your home in a farmer’s field – build it next to the field!

• Risk Management!
• What are the issues associated with running a large bioinformatics activity?



Running a bioinformatics group

• What does everyone do?
• How are they doing it?
• Are they talking to the right people?
• Have they got the right requirements?
• Is anyone waiting for information?
• Are they running on schedule?
• Is there an issue that needs escalating?
• Are there HR, training, management, coaching issues that need to be addressed?



Organizational Complexity

• The organizational complexity of bioinformatics projects has increased:
– Made up of larger teams
– Have multiple stakeholders
– Contain many organizational layers



Technical Complexity

• Number of databases increasing
• Number of methods increasing
• Body of knowledge is developing rapidly
• Requirements change rapidly
• Must be well-read in a large number of fields



Common Statements

– “Things change all the time - it’s impossible to plan”

– “I’d like you to do some analysis”
– “I don’t have time to plan”
– “We’ll figure it out as we go along”

– Not so common: “An ounce of prevention is worth a pound of cure”



Software Engineering Management

• Large body of knowledge
• Requirements engineering
• Architecture and Design
• Validation
• Change management
• Risk management



Solutions at the GSC

• CM controls
• Bug tracking controls
• Some validation controls
• Various levels of design and architecture
• Implementation of structured engineering process under way to define, track and manage work
– Requirements control
– Risk/Change management



• ~90% of work performed by the group can be planned or have a level of effort (LOE) assigned

• Some areas harder – finishing a genome, algorithm development, exploratory analysis

• There is always a schedule and a budget



Process controls risk… but at a cost

[Figure: risk exposure (RE, y-axis) against time and effort invested in plans (x-axis), with one curve for RE due to inadequate planning and one for RE due to market share erosion]

RE = P(L) * S(L)

P(L) = probability of loss

S(L) = size of loss
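The trade-off in this figure can be made concrete with a small numerical sketch: compute RE = P(L) * S(L) for each of the two risks at several levels of planning effort and look for the effort that minimises their sum. The exponential curve shapes, the loss size of 100 and the effort units are entirely hypothetical and chosen only to give the two curves opposite slopes.

```python
from math import exp


def risk_exposure(probability_of_loss: float, size_of_loss: float) -> float:
    """RE = P(L) * S(L), as defined on the slide."""
    return probability_of_loss * size_of_loss


def re_inadequate_planning(effort: float) -> float:
    """Hypothetical curve: more planning lowers the probability of a planning-related loss."""
    return risk_exposure(exp(-effort / 2.0), 100.0)


def re_market_share_erosion(effort: float) -> float:
    """Hypothetical curve: more time spent planning raises the probability of losing ground."""
    return risk_exposure(1.0 - exp(-effort / 10.0), 100.0)


def best_effort(candidates) -> float:
    """Effort level with the lowest combined risk exposure among the candidates."""
    return min(
        candidates,
        key=lambda e: re_inadequate_planning(e) + re_market_share_erosion(e),
    )


if __name__ == "__main__":
    print("lowest combined RE at effort =", best_effort(range(0, 21)))
```

With these made-up curves the minimum lies at a moderate level of effort rather than at either extreme, which is the point of the slide: process controls risk, but at a cost.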



Hacky Scripts/Code have their place

• Ideal for prototyping
– Only prototype when you are trying to get a handle on things

• Throw away the prototype, when you’re done experimenting!

• …but stop and think!



And standards…



Mouse Atlas

• Steven Jones, Genome Sciences Centre: Asim Siddiqui, Scott Zuyderduyn, Richard Varhol, Derek Leung, Kevin Teague, Lisa Lee, Anita Landry
• Elizabeth M. Simpson, CMMT: Robert Xie, Slavita Bohacec, Byron Kuo
• Adrian Burke, GenomeBC
• Caroline Astell, Project Manager
• Pamela Hoodless, Terry Fox Laboratory: Jim Rupert, Mona Wu, Rebecca Cullum
• Cheryl Helgason, Cancer Endocrinology: Brad Hoffman, Teresa Ruiz de Alagara, Ida Zhang
• Marco Marra, Genome Sciences Centre: Jaswinder Khattra, Allen Delaney, Jennifer Asano, Susanna Chan
• Gregory Riggins, Johns Hopkins



cisRED

• GSC: Marco Marra, Gordon Robertson, Richard Varhol, Kevin Teague, Obi Griffith, Erin Pleasance, Debra Fulton, Keven Lin, Mikhail Bilenky, Neil Roberston, Monica Sluemer, Stephen Montgomery, Asim Siddiqui
• Ian Holmes, UC Berkeley
• Ewan Birney, EBI
• Stanford University: Rick Myers, Nathan Trinklein, Shelley Force Alldred, Sarah Hartman



www.mouseAtlas.org

www.cisRed.org
