Extracting information from European Bioinformatics ...€¦ · European Bioinformatics Institute Clustering methods ... High-throughput methods (“data mining”) European Bioinformatics

Jaak Vilo

DNA expression data analysis 1

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Extracting information frommicroarray data

Jaak Vilo

European Bioinformatics Institute EMBL-EBI

http://www.ebi.ac.uk/microarray/

http://ep.ebi.ac.uk

Lausanne, 1.03.2001

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Microarray Experiment

RT-PCR

RT-PCR

LASER

DNA “Chip”

High glucose

Low glucose

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Gene expression data

Treated sample labeled red (Cy5)Control data labeled green (Cy3)

Competitive hybridization ontochip

Red dot - gene overexpressed intreated sampleGreen dot - gene underexpressed intreated sampleYellow - equally expressedIntensity - “absolute” level

red/green - ratio of expression2 - 2x overexpressed0.5 - 2x underexpressed

log2( red/green ) - “log ratio” 1 2x overexpressed -1 2x underexpressed

Spotted cDNA microarrayStanford University (Yeast,1997)

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Expression data matrix

Gene 1

Gene 2

Gene n

Condition 1(chip nr. 1)

Condition m(chip nr. m)

•Measurements expressed in what units?•cm/inch?•Reality:

Jaak Vilo


ArrayExpress conceptual overview

SAMPLE

ExpressionValue

Hybridization

EXPERIMENT

ARRAYOntology

Database

Reference

e.g. taxonomy

E.g. gene in SWISS-prot

Publication, web resource

ArrayExpress

External links

Simple version of AE object model

Jaak Vilo


ArrayExpress - features

✦ MIAME-compliant (www.MGED.org)

✦ able to import MAML-formatted data

✦ can deal with both raw and processed data

✦ independence of:– experimental platforms

– image analysis methods

– data normalization methods

✦ object model-based query mechanism

✦ will support upcoming OMG standard forexpression data

Key constructs in the AE object model

✦ structured sample descriptions

✦ notion of ExpressionValueSet

✦ several dimensions for ExpressionValues

✦ Transformations working onExpressionValueSets and their dimensions

Jaak Vilo


Structured representation of sample andtreatment relations

Sample sourcePrimary sample 1

Primary sample 2

Derived sample 1

Labeled extract 1

Extract 1

Derived sample 2

A new state ofsample source

Extract 2

Labeled extract 2Hybridizationlabeling

extraction

treatment

treatment

Microarray expression value representation

expression value types

primary images composite imagese.g., green/red ratios

primaryspots

compositespots

primarymeasurements

derivedvalues

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Expression data matrix

Gene 1

Gene 2

Gene n

Condition 1(chip nr. 1)

Condition m(chip nr. m)

Hierarchicalclustering

Visualization:Pseudo-coloring ofexpression data matrix.Method developed byMike Eisen, StanfordUniversity

Jaak Vilo


EPCLUST(cluster Expression profiles)

GENOMESsequence, function,

annotation

SPEXS(Sequence Pattern Exhaustive Search)

novel patterns

URLMAP:provide links

Components of Expression Profilerhttp://www.ebi.ac.uk/microarray/

Expression data

External data, toolspathways, function,

etc.

PATMATCHknown patterns

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute EPCLUST

✦ Cluster-analysis– Hierarchical and K-means clustering

– Many distance measures

✦ Nearest neighbours

✦ Visualisation - many options

✦ Also clusters short sequences...

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Clustering methods

✦ Hierarchical clustering:– complete linkage

– average linkage

– single linkage

✦ Distance measures:– Euclidean

– Correlation based

– Rank correlation

– Manhattan

– ...

✦ Partition-based✦ K-means

– Specify K

– Randomly select “centers”

– Assign genes to centers

– Recalculate centers to“gravity center”

– Iterate until stabilizes

✦ Can get to local minimum

✦ Fast for large datasets

✦ Initial selection of centersPlus many, many more ...

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Unsupervised vs. Supervised

Find groups inherent to data

Find a “classifier” forknown classes

Minimum distance => Single linkageMaximum distance=> Complete linkage

Average distance=> Average linkage(UPGMA, WPGMA)

Keep joining together two closest clusters by using the:

Hierarchical clustering

Cluster sequences:

Cluster matrices:

Jaak Vilo


* Start clustering by choosing K centersrandomlymost distant centersmore...

* Iterate clustering step until no clusterchanges* Deterministic, might get “stuck” inlocal minimum

K-means clustering

New centers - center of gravity for a cluster

Cluster - objects closest to a center

Data selection and analysis “folder”

Jaak Vilo


GENOMES: Yeast

“Cut”

Hierarchical clustering output

“Zoom”

6200 genes, 80 exp.

Monitor size 1600x1200

Laptop: 800x600

Jaak Vilo


6200 genes, 80 exp.

“COLLAPSE”

Developed and implementedin Expression Profiler inOctober 2000 by Jaak Vilo

Monitor size 1600x1200

Laptop: 800x600

75 subtrees

Running times of hierarchical clustering(calculations only, no visualisation)

Running times on 850MHz Linux farm node with 1GB memory10 100 1000

Nr. Of genes Clustering Distances Clustering Distances Clustering Distances10 0 0 0 0 0 0

100 0 0 0 0 0 0.08500 0.05 0.02 0.05 0.13 0.05 2.72

1000 0.23 0.1 0.25 0.85 0.26 11.042000 1 0.39 1.09 4.46 1.17 44.393000 2.46 0.9 2.77 10.23 3.01 99.524000 5.03 1.63 6.19 18.47 6.15 176.785000 8.69 2.64 10.09 30.34 10.3 276.256000 14.17 4.11 16.13 44.21 16.39 397.797000 21.74 5.98 23.88 58.74 23.95 540.988000 30.53 8.42 33.29 76.1 33.47 706.429000 42.36 11.47 43.12 96.82 44.01 893.43

10000 54 14.67 54.75 119.82 56.1 1102.8715000 128.34 39.59 133.72 271.24 139.22 2481.6620000 258.84 74.99 249.26 483.35 284.69 4409.98

On Alpha server distances up to 3x faster. Clustering 1.1-1.2xBig effect depending on compiler (gcc -> cxx)

Jaak Vilo


Running time cont.

Clustering

Distances 10 attrib

Distances100 attrib

In s

eco n

ds

1minute

K-means clustering output

URLMAP:

Jaak Vilo


More features

✦ Upload your own data

✦ Find most similar genes

✦ Find most distant genes

✦ Seed selected genes asK-means centers

✦ Cluster (short) sequences

✦ In future: direct uploadfrom databases

Data uploadfrom files

Matrixseparator

Gene annotationsid annotation

Annotations can contain<A HREF=“…”>links</A>

Sequence data(Other types?)

Jaak Vilo


URLMAP - no need to “cut & paste”

KEGG:

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

2

..1)(),( ∑ =

−=ci

gifigfd

2

..1)(),( ∑ =

−=ci

gifigfd

Euclidean distance

Euclidean squared

Manhattan (city-block) ∑ =−=

cigifigfd

..1||),(

Average distance2

..1)(1),( ∑ =

−=ci

gificgfd

Some standard distance measures

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Pearson correlation

∑ ∑∑= =

=

−−

−−−=

c

i

c

i ii

c

i ii

ggff

ggffgfd

1 1

22

1

)()(

))((1),(

Θ−=−=∑ ∑

∑= =

= cos11),(

1 1

22

1

c

i

c

i ii

c

i ii

gf

gfgfd

If means of each column are 0, then it becomes:

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Chord distance

)1(2),(

1 1

22

1

∑ ∑∑= =

=−=c

i

c

i ii

c

i ii

gf

gfgfd

)cos1(2),( Θ−=gfd

Legendre & Legendre: Numerical Ecology2nd ed.

Euclidean distance between two vectors whose length hasbeen normalized to 1

x

y

f

g

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Rank correlation

)1(

)(61),(

21

−−

−= ∑ =

cc

rankrankgfd

c

i gifi

Rank - smallest has rank1, next 2, etc.

Equal values have rank that is average of the ranks

f = 3 17 12 12 8

rank= 1 5 3.5 3.5 2

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute GENOMES

✦ Started from the need to get upstreamsequences

✦ Overview, annotation, function, links toother databases

✦ Extraction of upstream, sequences relativeto start codons of genes for sets of genes

✦ Querying capabilities

✦ I wish major databases offered all that...

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Automatic TF-binding siteidentification

✦ How to identify putative transcriptionfactor binding sites on a genomic scale?

✦ Traditional approach (case by case basis)

✦ High-throughput methods (“data mining”)

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Organization of a typical yeastpromoter

URS URS TATA I

Coding Region

40 - 120 bp

20 - 700 bp

RNA

40 - 60 bp

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

“Traditional” approach forTF-binding site prediction

✦ Extract bits of DNA where binding occurs

✦ Carefully handpick coregulated genes (fromliterature, experiments, functional class)

✦ Identify conserved motifs in their upstreams

✦ Build complex models, use expensive searchtechniques (Gibbs, EM, alignments)

✦ Not easily scaleable for full genomes?

✦ Very good domain knowledge required

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute High-throughput method

From expression data to regulatory signals✦ Cluster the genes based on expression

measurements for identifying potentially co-regulated sets of genes

✦ Extract upstream sequences to these genes

✦ Search for patterns over-represented in theseclusters

✦ Assess the quality of findings using somestatistical criteria

Brazma, Jonassen, Vilo, Ukkonen: Genome Research 8:1205-1215, 1998

Vilo, Brazma, Jonassen, Robinson, Ukkonen: ISMB-2000, AAAI Press (August 2000)

Jaak Vilo


Cluster of co-expressedgenes, pattern discovery inregulatory regions��BASEPAIRS

%XPRESSION�PROFILES

5PSTREAM�REGIONS

2ETRIEVE

0ATTERN�OVER REPRESENTED�IN�CLUSTER

Problem of “noise”✦ Gene expression measurement accuracy

(about within a factor of 2 in 95% cases)

✦ Clustering result depends on the choice ofmethod and parameters used in each

✦ Does co-expression mean co-regulation?

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute What questions we asked

✦ How to perform systematic discovery?✦ How to assess the quality of the predictions?✦ Do “better” clusters give “better” signals?✦ Is method scaleable for larger genomes?

✦ Want to discover something unique for eachcluster, not just features common to upsreams

Cluster and pattern “strengths”

✦ Cluster strength: Average silhouette valueHow well each object is classified into it’s owncluster. Use average distances within cluster andcompare them to next closest cluster for each object.Value between -1 .. +1 (well classified)

✦ Pattern strength: binomial distributionGiven probability of “tails” on coin, how probableis to observe k or more “tails” out of n trials.Number of tails = nr. of pattern occurrences.

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Silhouette value(Rousseeuw 1987)

* Assign “goodness” to each clustered object* Average silhouette over each cluster or over the clustering* Not a silver bullet

Average distance to members in same cluster

Average distance to members in closest cluster

bi - ai

Max( bi, ai ) Si =

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Computational experiment:clustering

✦ Yeast Saccharomyces cerevisiae, 6221 genes, 80expression conditions for each (from P. Brown’s lab)

✦ No single best clustering method: K-means, vary K ∈2..1000, repeat 10x for each K. Total: ~1000 x K-means

✦ Calculated average “Silhouette” values for all clusters

✦ Select all unique clusters of size 20..100 (~ 52.000)

✦ Could combine several methods, several distance measures

Jaak Vilo


Clustering:Hierarchical;Partitioning basedK-means clustering

Many more:SOM, fuzzy c-means,graph-based, ...

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Computational experiment:pattern discovery

✦ Upstream sequences of length 600bp from ORFstart

✦ Analyze all upstreams of all 52K clusters withSPEXS looking for substrings only (one weekend,~10 PC-s)

✦ Extract all patterns from upstreams of allclusters with probability less than 1% (binomialdistribution, background probability is calculatedsimultaneously from all 6221 upstreams)

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Pattern selection criteriaBinomial distribution

5 out of 25, p = 0.2

Background -ALLupstreamsequences

Cluster: π occurs 3 times

P(π,6) is probabilityof having 3 ormore matches in 6sequences

P(π,6) =0.0989

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Pattern vs cluster “strength”

The pattern probability vs.the average silhouette forthe cluster

The same for randomisedclusters

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

The most unprobable pattern from best clustersPattern Probability Cluster Occurrences Total nr of K

size in cluster occurrences in K-meansAAAATTTT 2.59E-43 96 72 830 60ACGCG 6.41E-39 96 75 1088 50ACGCGT 5.23E-38 94 52 387 40CCTCGACTAA 5.43E-38 27 18 23 220GACGCG 7.89E-31 86 40 284 38TTTCGAAACTTACAAAAAT 2.08E-29 26 14 18 450TTCTTGTCAAAAAGC 2.08E-29 26 14 18 325ACATACTATTGTTAAT 3.81E-28 22 13 18 280GATGAGATG 5.60E-28 68 24 83 84TGTTTATATTGATGGA 1.90E-27 24 13 18 220GATGGATTTCTTGTCAAAA 5.04E-27 18 12 18 500TATAAATAGAGC 1.51E-26 27 13 18 300GATTTCTTGTCAAA 3.40E-26 20 12 18 700GATGGATTTCTTG 3.40E-26 20 12 18 875GGTGGCAA 4.18E-26 40 20 96 180TTCTTGTCAAAAAGCA 5.10E-26 29 13 18 250CGAAACTTACAAA 5.10E-26 29 13 18 290GAAACTTACAAAAATAAA 7.92E-26 21 12 18 650TTTGTTTATATTG 1.74E-25 22 12 18 600ATCAACATACTATTGT 3.62E-25 23 12 18 375ATCAACATACTATTGTTA 3.62E-25 23 12 18 625GAACGCGCG 4.47E-25 20 11 13 260GTTAATTTCGAAAC 7.23E-25 24 12 18 400GGTGGCAAAA 3.37E-24 33 14 31 475ATCTTTTGTTTATATTGA 7.19E-24 19 11 18 675TTTGTTTATATTGATGGA 7.19E-24 19 11 18 475GTGGCAAA 1.14E-23 28 18 137 725

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute One example

Pattern GGTGGCAA was the “best” for this cluster (visualized in here after hierarchical clustering within the cluster)

25 out of 40 ORFs belong to “cytoplasmic degradation”functional class (MIPS), mostly being proteasome subunits

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

GGTGGCAA cluster (proteasome) ORF Gene lengh disruption description

YBL041W PRE7 241 lethal 20S proteasome subunit(beta6) YBR170C NPL4 580 lethal nuclear protein localization factor and ER translocation component YDL126C CDC48 835 lethal microsomal protein of CDC48/PAS1/SEC18 family of ATPases YDL100C 354 similarity to E.coli arsenical pump-driving ATPase YDL097C RPN6 434 lethal subunit of the regulatory particle of the proteasome YDR313C PIB 286 phosphatidylinositol(3)-phosphate binding protein YDR330W 500 similarity to hypothetical S. pombe protein YDR394W RPT3 428 lethal 26S proteasome regulatory subunit YDR427W RPN9 393 viable subunit of the regulatory particle of the proteasome YDR510W SMT3 101 lethal ubiquitin-like protein YER012W PRE1 198 lethal 20S proteasome subunit C11(beta4) YFR004W RPN11 306 lethal 26S proteasome regulatory subunit YFR033C QCR6 147 viable ubiquinol--cytochrome-c reductase 17K protein YFR050C PRE4 266 lethal 20S proteasome subunit(beta7) YFR052W RPN12 274 lethal 26S proteasome regulatory subunit YGL048C RPT6 405 lethal 26S proteasome regulatory subunit YGL036W MTC2 909 viable Mtf1 Two hybrid Clone 2 YGL011C SCL1 252 lethal 20S proteasome subunit YC7ALPHA/Y8 (alpha1) YGR048W UFD1 361 lethal ubiquitin fusion degradation protein YGR135W PRE9 258 viable 20S proteasome subunit Y13 (alpha3) YGR253C PUP2 260 lethal 20S proteasome subunit(alpha5) YIL075C RPN2 945 lethal 26S proteasome regulatory subunit YJL102W MEF2 819 translation elongation factor, mitochondrial YJL053W PEP8 379 viable vacuolar protein sorting/targeting protein YJL036W 423 weak similarity to Mvp1p YJL001W PRE3 215 lethal 20S proteasome subunit (beta1) YJR117W STE24 453 viable zinc metallo-protease YKL145W RPT1 467 lethal 26S proteasome regulatory subunit YKL117W SBA1 216 viable Hsp90 (Ninety) Associated Co-chaperone YLR387C 432 similarity to YBR267w YMR314W PRE5 234 lethal 20S proteasome subunit(alpha6) YOL038W PRE6 254 20S proteasome subunit (alpha4) YOR117W RPT5 434 lethal 26S proteasome regulatory subunit YOR157C PUP1 261 lethal 20S proteasome subunit (beta2) YOR176W HEM15 393 viable ferrochelatase precursor YOR259C RPT4 437 lethal 26S proteasome regulatory subunit YOR317W FAA1 700 viable long-chain-fatty-acid--CoA ligase YOR362C PRE10 288 lethal 20S proteasome subunit C1 (alpha7) YPR103W PRE2 287 lethal 20S proteasome subunit (beta5) YPR108W RPN7 429 subunit of the regulatory particle of the proteasome

Jaak Vilo


GGTGGCAA is a binding site for RPN4

FEBS Lett 1999 Apr 30;450(1-2):27-34

Rpn4p acts as a transcription factor by binding to PACE, a nonamer boxfound upstream of 26S proteasomal and other genes in yeast.Mannhaupt G, Schnall R, Karpov V, Vetter I, Feldmann HAdolf-Butenandt-Institut der Ludwig-Maximilians-Universitat Munchen, Germany.

We identified a new, unique upstream activating sequence(5’-GGTGGCAAA-3’) in the promoters of 26 out of the 32 proteasomalyeast genes characterized to date, which we propose to call proteasome-associated control element. By using the one-hybrid method, we show that thefactor binding to the proteasome-associated control element is Rpn4p, a proteincontaining a C2H2-type finger motif and two acidic domains. ...

YOR261C YOR261C RPN8 protein degradation 26S proteasome regulatory subunit S0005787 1YDL020C YDL020C RPN4 protein degradation, ubiquitin26S proteasome subunit S0002178 1YDL007W YDL007W RPT2 protein degradation 26S proteasome subunit S0002165 1YDL147W YDL147W RPN5 protein degradation 26S proteasome subunit S0002306 1YOL038W YOL038W PRE6 protein degradation 20S proteasome subunit (alpha4) S0005398 1YKL145W YKL145W RPT1 protein degradation, ubiquitin26S proteasome subunit S0001628 1YDL097C YDL097C RPN6 protein degradation 26S proteasome regulatory subunit S0002255 1YDR394W YDR394W RPT3 protein degradation 26S proteasome subunit S0002802 1YBR173C YBR173C UMP1 protein degradation, ubiquitin20S proteasome maturation factor S0000377 1YER012W YER012W PRE1 protein degradation 20S proteasome subunit C11(beta4) S0000814 1YPR108W YPR108W RPN7 protein degradation 26S proteasome regulatory subunit S0006312 1YOR117W YOR117W RPT5 protein degradation 26S proteasome regulatory subunit S0005643 1YJL001W YJL001W PRE3 protein degradation 20S proteasome subunit (beta1) S0003538 1YPR103W YPR103W PRE2 protein degradation 20S proteasome subunit (beta5) S0006307 1YOR157C YOR157C PUP1 protein degradation 20S proteasome subunit (beta2) S0005683 1YGL048C YGL048C RPT6 protein degradation 26S proteasome regulatory subunit S0003016 1YHR200W YHR200W RPN10 protein degradation 26S proteasome subunit S0001243 1YML092C YML092C PRE8 protein degradation 20S proteasome subunit Y7 (alpha2 S0004557 1YIL075C YIL075C RPN2 tRNA processing 26S proteasome subunit) S0001337 1YMR314W YMR314W PRE5 protein degradation 20S proteasome subunit(alpha6) S0004931 1YGR253C YGR253C PUP2 protein degradation 20S proteasome subunit(alpha5) S0003485 1YGR135W YGR135W PRE9 protein degradation 20S proteasome subunit Y13 (alpha3) S0003367 1YFR004W YFR004W RPN11 transcription putative global regulator S0001900 1YOR259C YOR259C RPT4 protein degradation 26S proteasome regulatory subunit S0005785 1YFR052W YFR052W RPN12 protein degradation 26S proteasome regulatory subunit S0001948 1YFR050C YFR050C PRE4 protein degradation proteasome subunit, B type S0001946 1YGL011C YGL011C SCL1 protein degradation 20S proteasome subunit YC7ALPHA/Y8 S0002979 1YDR427W YDR427W RPN9 protein degradation 26S proteasome regulatory subunit S0002835 1YOR362C YOR362C PRE10 protein degradation 20S proteasome subunit C1 (alpha7) S0005889 1YBL041W YBL041W PRE7 protein degradation 20S proteasome subunit S0000137 1YER021W YER021W RPN3 protein degradation 26S proteasome regulatory subunit S0000823 1YER094C YER094C PUP3 protein degradation 20S proteasome subunit (beta3 S0000896 1YGR270W YGR270W YTA7 protein degradation 26S proteasome subunit; ATPase S0003502 1YHR027C YHR027C RPN1 protein degradation 26S proteasome regulatory subunit S0001069 1YER047C YER047C SAP1 mating type switching AAA family protein S0000849 1YGR232W YGR232W unknown unknown S0003464 1

GGTGGCAA - proteasome associated control element

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute SPEXS - Sequence Pattern EXhaustive Search

Jaak Vilo, 1998

✦ User-definable pattern language: substrings (oligos), grouppositions/wildcards, flexible wildcards (PROSITE)

✦ Fast exhaustive search over pattern language

✦ “Lazy suffix tree construction”-like algorithm

✦ Analyze multiple sets of sequences simultaneously

✦ Restrict search to most frequent patterns only (in each set)

✦ Report most frequent patterns, patterns over- orunderrepresented in selected subsets, or patterns significantby various statistical criteria, e.g. by binomial distribution

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute “Lazy” construction of trie

✦ ATACATA$

✦ 12345678

✦ O(n²)

✦ Kurtz, Giegerich

✦ Good in practice

A

T

$

C

{1,3,5,7} {4} {2,6}{8}

T$C

{2,6}{8}

{4}{3,7}

A

Jaak Vilo


SPEXS: pattern discovery basedon pattern trie. Vilo 1998

✦ Substrings

✦ Group characters

✦ Wildcard positions

✦ Restrictions on nr ofeach separately

✦ At least k occurrences

✦ Exact occurrences ofeach pattern

A

TC{1,3,5,7} {4} {2,6}

[CT]

C ∪ T

*A

{3,5,7}

{2,4,6}

ATACATA$12345678

ε

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Results continued...

✦ Totally over 6000 interesting patterns

✦ Many from homologous upstreams => Remove

✦ 1498 most interesting substrings left

✦ How to summarize?

✦ Clustered these based on mutual similarity

✦ Found alignments, consensi, and profiles

✦ Of 62 clusters 48 had patterns matching SCPD(experimentally mapped) binding site database

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Cluster and align patterns

CLUSTER_0009:TGACAGCTGACAGCTGTGACAGCTTGTGACAGAGTGACAACAGTGACACAGTGACAAGTGACATAGTGACATTGTGACATTGTGACATTGTGACATTGTGACAGTGACAGTGTGACAGGTGACACCAGACAGTGACCCTTGACCGATTGACCGTTTGACC

Words aligned asALIGNMENT: basedon pattern TGAC----TGACAGC-----TGACAGCT---GTGACAGC--TTGTGACAG----AGTGACA---ACAGTGACA----CAGTGACA-----AGTGACAT----AGTGACATT----GTGACATT----GTGACAT----TGTGACA----TTGTGACA------GTGACAG----TGTGACAG-----GTGACAC------TGACCCT-----TGACCGA----TTGACCG----TTTGACC---

NO ALIGNMENT FOR:NO: CAGACAG

PROFILE:>/tmp/.12100.profile Profile for sequences in file /tmp/.12100 based on pattern TGAC

- 19 16 10 4 0 0 0 0 0 6 13 19A 1 0 5 0 0 0 20 0 16 0 1 0C 0 2 0 0 0 0 0 20 4 2 3 0G 0 0 0 14 0 20 0 0 0 8 0 0T 0 2 5 2 20 0 0 0 0 4 3 1

CONSENSUS: (not real one )>/tmp/.12100.consensus Profile for sequences in file /tmp/.12100 based on pattern TGACCONSENSUS_PAT /tmp/.12100

Consensus pattern:a[ct][at]GTGACA[GTC][cta]t

AGTGACAACAGTGACACAGTGACACAGTGACACAGTGACGCAGTGAGCAGTGAGCAGTGATTACAGTGTTTACAGTGATTACAGTGATACAGTGACAGTGATACAGTGATTTACAGTTTTACAGTG

ALIGNMENT: based on pattern AGT-----AGTGACA---ACAGTGACA----CAGTGACA----CAGTGAC----ACAGTGAC----GCAGTGA-----GCAGTG-----AGCAGTGA---TTACAGTG---TTTACAGTGA---TTACAGTGA----TACAGTG------ACAGTGA----TACAGTGA--TTTACAGT----TTTACAGTG---

PROFILE:>/tmp/.5679.profile Profile for sequences in file /tmp/.5679 basedon pattern AGT- 13 11 8 3 1 0 0 0 1 5 11 13A 0 0 1 10 0 16 0 0 0 11 0 3C 0 0 0 0 15 0 0 0 0 0 5 0G 0 0 0 3 0 0 16 0 15 0 0 0T 3 5 7 0 0 0 0 16 0 0 0 0

CONSENSUS: (not real one )>/tmp/.5679.consensus Profile for sequences in file /tmp/.5679based on pattern AGTCONSENSUS_PAT /tmp/.5679 ttt[AG]CAGTGAca

Align a group of patterns

Jaak Vilo


-----------------ACCCAGACATCGGGCTTCCAC----------------ACACCCAGACATC--------------------------ACACCCAGACATC--------------------------GAACCCATACACT--------------------------ACACCCAGACCGCG-------------------------GCACCCACACATTT----------------------GCTAAACCCATGCACAGTGACT----------------------ACCCAGACACGCTCGA-------------------CTTCACCCTCATAC---------------------------ACACCCCTTTTCT--------------------------GCACCCAGTCTT---------------------------GCACCCAAACACCTGCATATTTGG---------------GCACCCAATCACC--------------------------ACACCCAGACCTC--------------------------AAACCCACACAT--------------------------TGCACCCATACCTT-------------------------AACACCCAAGCACAG-----------ATCTCTCGCAACG-------------------------ACCTCCGTACATTC-------------------------ACACCTGGACACC--------------------------ACATCCGTACAACGAGAACCCATACATTA----------

---TCCGTAC--- ACCCATAC---CATCCGTAC--- ACCCATACA---ATCCGTA---- ACCCATACAT--ATCCGTACA-- -CCCATAC------CCGTAC--- -CCCATACA-----CCGTACA-- --CCATACAT---TCCGTACAT- -CCCATACAT--ATCCGTACAT- --CCATACA----TCCGTACA-- -AACATAC------CCGTACAT- --ACATACT----TCCGTA---- ---GATACT---ATCCGTAC--- --AGATACT-----CCGTACC-----ACCGTACC-----ACCGTAC-----CACCGTAC-------CCGTACATT----GCGTAG-------GCGTAGG-------CGTAGG---CATCCGTA----ACATCCGT------CATCCGT-----

In SCPD

Discoveredautomatically

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

tctcaTCTCA[TC][CT][tag]catcaaTCTTCATGtcctcGAA[CG]TGCCATCtcaa[ta][CG]CCTA[AT]AatcgTACCTCTaacc[ac]CCCC[CT][CGT][ag]agT[TA][CA]TCCT[CG]ggACAGCTAca[ct][at]GTGACA[GTC][cta]ttt[tc]ACAGT[GT][AT][tc]g[at][ATC]TACACAttttGTCACA[GAT]ggt[gc]ACATT[GC][CT]tgata[TC]TGGTTCtacaTCCGTAC[acg]tta[gca][atc]TAAG[CG][TAG][tga]atAT[TAC]GTTAAgct[ct][at][AG]AAGT[AT][TA]cgtT[AG]TTA[CT][TG][AG]caACTTTATTT[ag]TAACTT[AT]Cat[ACT]CGCTTA[AT]gaa[ca][gat][acg][AG]CGCG[cta][gat][ca]gc[ac][at][GT]ACGCcaaGGTCG[CT]ActgtTAACGAATCGTTtaaga[at][TC]CGTTTA[ag]gt[ta]CGAATA[AG]aaaaaA[CAG][AT]GAATCttct[ac][tc][at]CGACT[CA][ca][cg]aatcCACGAA[gc][ta]gc[ga][ctg][ACG]TACG[AT][atc]tataC[CA]CATAC[AT]t

atat[CT][AG]GCAC[TC][ac]ataGCGCA[GT][ga]cccgGTGGCAA[AC][ag]t[ca][ga][GA]CGGC[TG][GTA][cta]tttta[cat][AGC]AGGG[GT][ctg][ac]agcg[ag][at][ga][ac]GATGAG[AC]t[ag][at]gaTGGATGCc[ta]TGCATGAAca[TG][GC]GTATAcgc[TAG]TATAT[ATC][gat][ag][tg]ggg[ag][ga][ag][AG][TAG]AT[GA]TG[agt][ga][ag]tag[AG]TAGA[TA]A[ga]aaaagtaTAAATAGAGCtgct[at]a[ag][TG][AT]GCC[CG][ac][ac]gaaC[CT]CAAT[AT][tg]tATCCAAGAgaaacaAAACAAA[AT][ca][ac]aatatgtGTAAA[TC]ATttataaaa[gt][CA][GT]AAAA[GA][cg][gac]aaaagt[gt][TC]GAAAG[AG]Tt[at][tac]t[gta][ag]AAAATTTT[tg][tc][at]ttga[at][acg][CA]GGAA[AG]T[gt]gaat[tc][cat][AT][TC]TTC[GA][ACT][ga]tcgg[ct][ctg][gct][ctg]CTTTTT[CTG][TC][atc][tg]cctTTTTCTG[CT][TA]ct[ta][gta][gtc][TG]TCTA[TG][GTC]a[at][ct]taaat[AT]TTTGTG[ta]cat[acg]CTGTG[CT]a[ac]

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute Summary about identified patterns

SCPDOf 1498 patterns315 patterns match sites73 patterns are matched by some site1134 patterns do not match any nor is matched by any

Of 109 factors with total 799 sites85 factors are matched by some of the patterns19 factors match some patterns24 factors do not have matches nor is matched

Of 498 unique sites of total 799 sites238 sites are matched by some of the patterns21 sites match some patterns252 sites do not have matches nor is matched

TRANSFACOf 1498 patterns297 patterns match sites61 patterns are matched by some site1174 patterns do not match any nor is matched by any

Of 351 DB-entries with total 359 sites205 factors are matched by some of the patterns22 factors match some patterns134 factors do not have matches nor is matched

Of 334 unique sites of total 359 sites198 sites are matched by some of the patterns16 sites match some patterns127 sites do not have matches nor is matched

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute PATMATCH

✦ Match your patterns against sequences

✦ Sequences - extracted from “GENOMES”

✦ Visualise matches along the sequence

✦ Visualise pattern by pattern if sequence has amatch

✦ Order sequences according to hierarchicalclustering order from EPCLUST

✦ Show clustering and upstream next to eachother

Jaak Vilo


YOR261C YOR261C RPN8 protein degradation 26S proteasome regulatory subunit S0005787 1YDL020C YDL020C RPN4 protein degradation, ubiquitin26S proteasome subunit S0002178 1YDL007W YDL007W RPT2 protein degradation 26S proteasome subunit S0002165 1YDL147W YDL147W RPN5 protein degradation 26S proteasome subunit S0002306 1YOL038W YOL038W PRE6 protein degradation 20S proteasome subunit (alpha4) S0005398 1YKL145W YKL145W RPT1 protein degradation, ubiquitin26S proteasome subunit S0001628 1YDL097C YDL097C RPN6 protein degradation 26S proteasome regulatory subunit S0002255 1YDR394W YDR394W RPT3 protein degradation 26S proteasome subunit S0002802 1YBR173C YBR173C UMP1 protein degradation, ubiquitin20S proteasome maturation factor S0000377 1YER012W YER012W PRE1 protein degradation 20S proteasome subunit C11(beta4) S0000814 1YPR108W YPR108W RPN7 protein degradation 26S proteasome regulatory subunit S0006312 1YOR117W YOR117W RPT5 protein degradation 26S proteasome regulatory subunit S0005643 1YJL001W YJL001W PRE3 protein degradation 20S proteasome subunit (beta1) S0003538 1YPR103W YPR103W PRE2 protein degradation 20S proteasome subunit (beta5) S0006307 1YOR157C YOR157C PUP1 protein degradation 20S proteasome subunit (beta2) S0005683 1YGL048C YGL048C RPT6 protein degradation 26S proteasome regulatory subunit S0003016 1YHR200W YHR200W RPN10 protein degradation 26S proteasome subunit S0001243 1YML092C YML092C PRE8 protein degradation 20S proteasome subunit Y7 (alpha2 S0004557 1YIL075C YIL075C RPN2 tRNA processing 26S proteasome subunit) S0001337 1YMR314W YMR314W PRE5 protein degradation 20S proteasome subunit(alpha6) S0004931 1YGR253C YGR253C PUP2 protein degradation 20S proteasome subunit(alpha5) S0003485 1YGR135W YGR135W PRE9 protein degradation 20S proteasome subunit Y13 (alpha3) S0003367 1YFR004W YFR004W RPN11 transcription putative global regulator S0001900 1YOR259C YOR259C RPT4 protein degradation 26S proteasome regulatory subunit S0005785 1YFR052W YFR052W RPN12 protein degradation 26S proteasome regulatory subunit S0001948 1YFR050C YFR050C PRE4 protein degradation proteasome subunit, B type S0001946 1YGL011C YGL011C SCL1 protein degradation 20S proteasome subunit YC7ALPHA/Y8 S0002979 1YDR427W YDR427W RPN9 protein degradation 26S proteasome regulatory subunit S0002835 1YOR362C YOR362C PRE10 protein degradation 20S proteasome subunit C1 (alpha7) S0005889 1YBL041W YBL041W PRE7 protein degradation 20S proteasome subunit S0000137 1YER021W YER021W RPN3 protein degradation 26S proteasome regulatory subunit S0000823 1YER094C YER094C PUP3 protein degradation 20S proteasome subunit (beta3 S0000896 1YGR270W YGR270W YTA7 protein degradation 26S proteasome subunit; ATPase S0003502 1YHR027C YHR027C RPN1 protein degradation 26S proteasome regulatory subunit S0001069 1YER047C YER047C SAP1 mating type switching AAA family protein S0000849 1YGR232W YGR232W unknown unknown S0003464 1

GGTGGCAA - proteasome associated control element

PATMATCH - combine pattern matching with expression data

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Global and Local Data MiningSecondary Data Mining

✦ Find global structure by clustering

✦ Find local structure by pattern discovery

✦ Summarize the findings to the sizefeasible for humans to interpret

✦ Find most interesting rules

✦ Find “explanations” for the behavior

✦ Hypothesis generation for wet-lab

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Gene networks

promoter1 gene1promoter2 gene2 promoter3 gene3 promoter4 gene4DNA

RNA

transcription

translation

proteins

transcription factors

Jaak Vilo

DN

A expression data analysis

35

European Bioinformatics InstituteEuropean Bioinformatics Institute

A gene netw

ork

b1

b2

b3

F1

F2

r1r2

European Bioinformatics InstituteEuropean Bioinformatics Institute

lacZ ...Prom

oterO

perator

Repressor

lacIProm

oter

Activator

Glucose

Lactose

Glucose

Galactose +

Galactosidase

Lac-O

peron

Thom

as Schlitt

Jaak Vilo


Eur

opea

n B

ioin

form

atic

s In

stit

ute

Eur

opea

n B

ioin

form

atic

s In

stit

ute

Thanks

Alvis Brazma EBIUgis Sarkans EBIThomas Schlitt EBIInge Jonassen University of BergenEsko Ukkonen University of HelsinkiAlan Robinson EBI

Documents

Extracting information from European Bioinformatics ...€¦ · European Bioinformatics Institute Clustering methods ... High-throughput methods (“data mining”) European Bioinformatics