View
24
Download
0
Category
Preview:
Citation preview
Jaak Vilo
DNA expression data analysis 1
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Extracting information frommicroarray data
Jaak Vilo
European Bioinformatics Institute EMBL-EBI
http://www.ebi.ac.uk/microarray/
http://ep.ebi.ac.uk
Lausanne, 1.03.2001
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Microarray Experiment
RT-PCR
RT-PCR
LASER
DNA “Chip”
High glucose
Low glucose
Jaak Vilo
DNA expression data analysis 2
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Gene expression data
Treated sample labeled red (Cy5)Control data labeled green (Cy3)
Competitive hybridization ontochip
Red dot - gene overexpressed intreated sampleGreen dot - gene underexpressed intreated sampleYellow - equally expressedIntensity - “absolute” level
red/green - ratio of expression2 - 2x overexpressed0.5 - 2x underexpressed
log2( red/green ) - “log ratio” 1 2x overexpressed -1 2x underexpressed
Spotted cDNA microarrayStanford University (Yeast,1997)
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Expression data matrix
Gene 1
Gene 2
Gene n
Condition 1(chip nr. 1)
Condition m(chip nr. m)
•Measurements expressed in what units?•cm/inch?•Reality:
Jaak Vilo
DNA expression data analysis 3
ArrayExpress conceptual overview
SAMPLE
ExpressionValue
Hybridization
EXPERIMENT
ARRAYOntology
Database
Reference
e.g. taxonomy
E.g. gene in SWISS-prot
Publication, web resource
ArrayExpress
External links
Simple version of AE object model
Jaak Vilo
DNA expression data analysis 4
ArrayExpress - features
✦ MIAME-compliant (www.MGED.org)
✦ able to import MAML-formatted data
✦ can deal with both raw and processed data
✦ independence of:– experimental platforms
– image analysis methods
– data normalization methods
✦ object model-based query mechanism
✦ will support upcoming OMG standard forexpression data
Key constructs in the AE object model
✦ structured sample descriptions
✦ notion of ExpressionValueSet
✦ several dimensions for ExpressionValues
✦ Transformations working onExpressionValueSets and their dimensions
Jaak Vilo
DNA expression data analysis 5
Structured representation of sample andtreatment relations
Sample sourcePrimary sample 1
Primary sample 2
Derived sample 1
Labeled extract 1
Extract 1
Derived sample 2
A new state ofsample source
Extract 2
Labeled extract 2Hybridizationlabeling
extraction
treatment
treatment
Microarray expression value representation
expression value types
primary images composite imagese.g., green/red ratios
primaryspots
compositespots
primarymeasurements
derivedvalues
Jaak Vilo
DNA expression data analysis 6
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Expression data matrix
Gene 1
Gene 2
Gene n
Condition 1(chip nr. 1)
Condition m(chip nr. m)
Hierarchicalclustering
Visualization:Pseudo-coloring ofexpression data matrix.Method developed byMike Eisen, StanfordUniversity
Jaak Vilo
DNA expression data analysis 7
EPCLUST(cluster Expression profiles)
GENOMESsequence, function,
annotation
SPEXS(Sequence Pattern Exhaustive Search)
novel patterns
URLMAP:provide links
Components of Expression Profilerhttp://www.ebi.ac.uk/microarray/
Expression data
External data, toolspathways, function,
etc.
PATMATCHknown patterns
Jaak Vilo
DNA expression data analysis 8
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute EPCLUST
✦ Cluster-analysis– Hierarchical and K-means clustering
– Many distance measures
✦ Nearest neighbours
✦ Visualisation - many options
✦ Also clusters short sequences...
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Clustering methods
✦ Hierarchical clustering:– complete linkage
– average linkage
– single linkage
✦ Distance measures:– Euclidean
– Correlation based
– Rank correlation
– Manhattan
– ...
✦ Partition-based✦ K-means
– Specify K
– Randomly select “centers”
– Assign genes to centers
– Recalculate centers to“gravity center”
– Iterate until stabilizes
✦ Can get to local minimum
✦ Fast for large datasets
✦ Initial selection of centersPlus many, many more ...
Jaak Vilo
DNA expression data analysis 9
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Unsupervised vs. Supervised
Find groups inherent to data
Find a “classifier” forknown classes
Minimum distance => Single linkageMaximum distance=> Complete linkage
Average distance=> Average linkage(UPGMA, WPGMA)
Keep joining together two closest clusters by using the:
Hierarchical clustering
Cluster sequences:
Cluster matrices:
Jaak Vilo
DNA expression data analysis 10
* Start clustering by choosing K centersrandomlymost distant centersmore...
* Iterate clustering step until no clusterchanges* Deterministic, might get “stuck” inlocal minimum
K-means clustering
New centers - center of gravity for a cluster
Cluster - objects closest to a center
Data selection and analysis “folder”
Jaak Vilo
DNA expression data analysis 11
GENOMES: Yeast
“Cut”
Hierarchical clustering output
“Zoom”
6200 genes, 80 exp.
Monitor size 1600x1200
Laptop: 800x600
Jaak Vilo
DNA expression data analysis 12
6200 genes, 80 exp.
“COLLAPSE”
Developed and implementedin Expression Profiler inOctober 2000 by Jaak Vilo
Monitor size 1600x1200
Laptop: 800x600
75 subtrees
Running times of hierarchical clustering(calculations only, no visualisation)
Running times on 850MHz Linux farm node with 1GB memory10 100 1000
Nr. Of genes Clustering Distances Clustering Distances Clustering Distances10 0 0 0 0 0 0
100 0 0 0 0 0 0.08500 0.05 0.02 0.05 0.13 0.05 2.72
1000 0.23 0.1 0.25 0.85 0.26 11.042000 1 0.39 1.09 4.46 1.17 44.393000 2.46 0.9 2.77 10.23 3.01 99.524000 5.03 1.63 6.19 18.47 6.15 176.785000 8.69 2.64 10.09 30.34 10.3 276.256000 14.17 4.11 16.13 44.21 16.39 397.797000 21.74 5.98 23.88 58.74 23.95 540.988000 30.53 8.42 33.29 76.1 33.47 706.429000 42.36 11.47 43.12 96.82 44.01 893.43
10000 54 14.67 54.75 119.82 56.1 1102.8715000 128.34 39.59 133.72 271.24 139.22 2481.6620000 258.84 74.99 249.26 483.35 284.69 4409.98
On Alpha server distances up to 3x faster. Clustering 1.1-1.2xBig effect depending on compiler (gcc -> cxx)
Jaak Vilo
DNA expression data analysis 13
Running time cont.
Clustering
Distances 10 attrib
Distances100 attrib
In s
eco n
ds
1minute
K-means clustering output
URLMAP:
Jaak Vilo
DNA expression data analysis 14
More features
✦ Upload your own data
✦ Find most similar genes
✦ Find most distant genes
✦ Seed selected genes asK-means centers
✦ Cluster (short) sequences
✦ In future: direct uploadfrom databases
Data uploadfrom files
Matrixseparator
Gene annotationsid annotation
Annotations can contain<A HREF=“…”>links</A>
Sequence data(Other types?)
Jaak Vilo
DNA expression data analysis 15
URLMAP - no need to “cut & paste”
KEGG:
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
2
..1)(),( ∑ =
−=ci
gifigfd
2
..1)(),( ∑ =
−=ci
gifigfd
Euclidean distance
Euclidean squared
Manhattan (city-block) ∑ =−=
cigifigfd
..1||),(
Average distance2
..1)(1),( ∑ =
−=ci
gificgfd
Some standard distance measures
Jaak Vilo
DNA expression data analysis 16
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Pearson correlation
∑ ∑∑= =
=
−−
−−−=
c
i
c
i ii
c
i ii
ggff
ggffgfd
1 1
22
1
)()(
))((1),(
Θ−=−=∑ ∑
∑= =
= cos11),(
1 1
22
1
c
i
c
i ii
c
i ii
gf
gfgfd
If means of each column are 0, then it becomes:
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Chord distance
)1(2),(
1 1
22
1
∑ ∑∑= =
=−=c
i
c
i ii
c
i ii
gf
gfgfd
)cos1(2),( Θ−=gfd
Legendre & Legendre: Numerical Ecology2nd ed.
Euclidean distance between two vectors whose length hasbeen normalized to 1
x
y
f
g
Jaak Vilo
DNA expression data analysis 17
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Rank correlation
)1(
)(61),(
21
−−
−= ∑ =
cc
rankrankgfd
c
i gifi
Rank - smallest has rank1, next 2, etc.
Equal values have rank that is average of the ranks
f = 3 17 12 12 8
rank= 1 5 3.5 3.5 2
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute GENOMES
✦ Started from the need to get upstreamsequences
✦ Overview, annotation, function, links toother databases
✦ Extraction of upstream, sequences relativeto start codons of genes for sets of genes
✦ Querying capabilities
✦ I wish major databases offered all that...
Jaak Vilo
DNA expression data analysis 18
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Automatic TF-binding siteidentification
✦ How to identify putative transcriptionfactor binding sites on a genomic scale?
✦ Traditional approach (case by case basis)
✦ High-throughput methods (“data mining”)
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Organization of a typical yeastpromoter
URS URS TATA I
Coding Region
40 - 120 bp
20 - 700 bp
RNA
40 - 60 bp
Jaak Vilo
DNA expression data analysis 19
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
“Traditional” approach forTF-binding site prediction
✦ Extract bits of DNA where binding occurs
✦ Carefully handpick coregulated genes (fromliterature, experiments, functional class)
✦ Identify conserved motifs in their upstreams
✦ Build complex models, use expensive searchtechniques (Gibbs, EM, alignments)
✦ Not easily scaleable for full genomes?
✦ Very good domain knowledge required
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute High-throughput method
From expression data to regulatory signals✦ Cluster the genes based on expression
measurements for identifying potentially co-regulated sets of genes
✦ Extract upstream sequences to these genes
✦ Search for patterns over-represented in theseclusters
✦ Assess the quality of findings using somestatistical criteria
Brazma, Jonassen, Vilo, Ukkonen: Genome Research 8:1205-1215, 1998
Vilo, Brazma, Jonassen, Robinson, Ukkonen: ISMB-2000, AAAI Press (August 2000)
Jaak Vilo
DNA expression data analysis 20
Cluster of co-expressedgenes, pattern discovery inregulatory regions����BASEPAIRS
%XPRESSION�PROFILES
5PSTREAM�REGIONS
2ETRIEVE
0ATTERN�OVER REPRESENTED�IN�CLUSTER
Problem of “noise”✦ Gene expression measurement accuracy
(about within a factor of 2 in 95% cases)
✦ Clustering result depends on the choice ofmethod and parameters used in each
✦ Does co-expression mean co-regulation?
Jaak Vilo
DNA expression data analysis 21
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute What questions we asked
✦ How to perform systematic discovery?✦ How to assess the quality of the predictions?✦ Do “better” clusters give “better” signals?✦ Is method scaleable for larger genomes?
✦ Want to discover something unique for eachcluster, not just features common to upsreams
Cluster and pattern “strengths”
✦ Cluster strength: Average silhouette valueHow well each object is classified into it’s owncluster. Use average distances within cluster andcompare them to next closest cluster for each object.Value between -1 .. +1 (well classified)
✦ Pattern strength: binomial distributionGiven probability of “tails” on coin, how probableis to observe k or more “tails” out of n trials.Number of tails = nr. of pattern occurrences.
Jaak Vilo
DNA expression data analysis 22
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Silhouette value(Rousseeuw 1987)
* Assign “goodness” to each clustered object* Average silhouette over each cluster or over the clustering* Not a silver bullet
Average distance to members in same cluster
Average distance to members in closest cluster
bi - ai
Max( bi, ai ) Si =
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Computational experiment:clustering
✦ Yeast Saccharomyces cerevisiae, 6221 genes, 80expression conditions for each (from P. Brown’s lab)
✦ No single best clustering method: K-means, vary K ∈2..1000, repeat 10x for each K. Total: ~1000 x K-means
✦ Calculated average “Silhouette” values for all clusters
✦ Select all unique clusters of size 20..100 (~ 52.000)
✦ Could combine several methods, several distance measures
Jaak Vilo
DNA expression data analysis 23
Clustering:Hierarchical;Partitioning basedK-means clustering
Many more:SOM, fuzzy c-means,graph-based, ...
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Computational experiment:pattern discovery
✦ Upstream sequences of length 600bp from ORFstart
✦ Analyze all upstreams of all 52K clusters withSPEXS looking for substrings only (one weekend,~10 PC-s)
✦ Extract all patterns from upstreams of allclusters with probability less than 1% (binomialdistribution, background probability is calculatedsimultaneously from all 6221 upstreams)
Jaak Vilo
DNA expression data analysis 24
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Pattern selection criteriaBinomial distribution
5 out of 25, p = 0.2
Background -ALLupstreamsequences
Cluster: π occurs 3 times
P(π,6) is probabilityof having 3 ormore matches in 6sequences
P(π,6) =0.0989
Jaak Vilo
DNA expression data analysis 25
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Pattern vs cluster “strength”
The pattern probability vs.the average silhouette forthe cluster
The same for randomisedclusters
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
The most unprobable pattern from best clustersPattern Probability Cluster Occurrences Total nr of K
size in cluster occurrences in K-meansAAAATTTT 2.59E-43 96 72 830 60ACGCG 6.41E-39 96 75 1088 50ACGCGT 5.23E-38 94 52 387 40CCTCGACTAA 5.43E-38 27 18 23 220GACGCG 7.89E-31 86 40 284 38TTTCGAAACTTACAAAAAT 2.08E-29 26 14 18 450TTCTTGTCAAAAAGC 2.08E-29 26 14 18 325ACATACTATTGTTAAT 3.81E-28 22 13 18 280GATGAGATG 5.60E-28 68 24 83 84TGTTTATATTGATGGA 1.90E-27 24 13 18 220GATGGATTTCTTGTCAAAA 5.04E-27 18 12 18 500TATAAATAGAGC 1.51E-26 27 13 18 300GATTTCTTGTCAAA 3.40E-26 20 12 18 700GATGGATTTCTTG 3.40E-26 20 12 18 875GGTGGCAA 4.18E-26 40 20 96 180TTCTTGTCAAAAAGCA 5.10E-26 29 13 18 250CGAAACTTACAAA 5.10E-26 29 13 18 290GAAACTTACAAAAATAAA 7.92E-26 21 12 18 650TTTGTTTATATTG 1.74E-25 22 12 18 600ATCAACATACTATTGT 3.62E-25 23 12 18 375ATCAACATACTATTGTTA 3.62E-25 23 12 18 625GAACGCGCG 4.47E-25 20 11 13 260GTTAATTTCGAAAC 7.23E-25 24 12 18 400GGTGGCAAAA 3.37E-24 33 14 31 475ATCTTTTGTTTATATTGA 7.19E-24 19 11 18 675TTTGTTTATATTGATGGA 7.19E-24 19 11 18 475GTGGCAAA 1.14E-23 28 18 137 725
Jaak Vilo
DNA expression data analysis 26
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute One example
Pattern GGTGGCAA was the “best” for this cluster (visualized in here after hierarchical clustering within the cluster)
25 out of 40 ORFs belong to “cytoplasmic degradation”functional class (MIPS), mostly being proteasome subunits
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
GGTGGCAA cluster (proteasome) ORF Gene lengh disruption description
YBL041W PRE7 241 lethal 20S proteasome subunit(beta6) YBR170C NPL4 580 lethal nuclear protein localization factor and ER translocation component YDL126C CDC48 835 lethal microsomal protein of CDC48/PAS1/SEC18 family of ATPases YDL100C 354 similarity to E.coli arsenical pump-driving ATPase YDL097C RPN6 434 lethal subunit of the regulatory particle of the proteasome YDR313C PIB 286 phosphatidylinositol(3)-phosphate binding protein YDR330W 500 similarity to hypothetical S. pombe protein YDR394W RPT3 428 lethal 26S proteasome regulatory subunit YDR427W RPN9 393 viable subunit of the regulatory particle of the proteasome YDR510W SMT3 101 lethal ubiquitin-like protein YER012W PRE1 198 lethal 20S proteasome subunit C11(beta4) YFR004W RPN11 306 lethal 26S proteasome regulatory subunit YFR033C QCR6 147 viable ubiquinol--cytochrome-c reductase 17K protein YFR050C PRE4 266 lethal 20S proteasome subunit(beta7) YFR052W RPN12 274 lethal 26S proteasome regulatory subunit YGL048C RPT6 405 lethal 26S proteasome regulatory subunit YGL036W MTC2 909 viable Mtf1 Two hybrid Clone 2 YGL011C SCL1 252 lethal 20S proteasome subunit YC7ALPHA/Y8 (alpha1) YGR048W UFD1 361 lethal ubiquitin fusion degradation protein YGR135W PRE9 258 viable 20S proteasome subunit Y13 (alpha3) YGR253C PUP2 260 lethal 20S proteasome subunit(alpha5) YIL075C RPN2 945 lethal 26S proteasome regulatory subunit YJL102W MEF2 819 translation elongation factor, mitochondrial YJL053W PEP8 379 viable vacuolar protein sorting/targeting protein YJL036W 423 weak similarity to Mvp1p YJL001W PRE3 215 lethal 20S proteasome subunit (beta1) YJR117W STE24 453 viable zinc metallo-protease YKL145W RPT1 467 lethal 26S proteasome regulatory subunit YKL117W SBA1 216 viable Hsp90 (Ninety) Associated Co-chaperone YLR387C 432 similarity to YBR267w YMR314W PRE5 234 lethal 20S proteasome subunit(alpha6) YOL038W PRE6 254 20S proteasome subunit (alpha4) YOR117W RPT5 434 lethal 26S proteasome regulatory subunit YOR157C PUP1 261 lethal 20S proteasome subunit (beta2) YOR176W HEM15 393 viable ferrochelatase precursor YOR259C RPT4 437 lethal 26S proteasome regulatory subunit YOR317W FAA1 700 viable long-chain-fatty-acid--CoA ligase YOR362C PRE10 288 lethal 20S proteasome subunit C1 (alpha7) YPR103W PRE2 287 lethal 20S proteasome subunit (beta5) YPR108W RPN7 429 subunit of the regulatory particle of the proteasome
Jaak Vilo
DNA expression data analysis 27
GGTGGCAA is a binding site for RPN4
FEBS Lett 1999 Apr 30;450(1-2):27-34
Rpn4p acts as a transcription factor by binding to PACE, a nonamer boxfound upstream of 26S proteasomal and other genes in yeast.Mannhaupt G, Schnall R, Karpov V, Vetter I, Feldmann HAdolf-Butenandt-Institut der Ludwig-Maximilians-Universitat Munchen, Germany.
We identified a new, unique upstream activating sequence(5’-GGTGGCAAA-3’) in the promoters of 26 out of the 32 proteasomalyeast genes characterized to date, which we propose to call proteasome-associated control element. By using the one-hybrid method, we show that thefactor binding to the proteasome-associated control element is Rpn4p, a proteincontaining a C2H2-type finger motif and two acidic domains. ...
YOR261C YOR261C RPN8 protein degradation 26S proteasome regulatory subunit S0005787 1YDL020C YDL020C RPN4 protein degradation, ubiquitin26S proteasome subunit S0002178 1YDL007W YDL007W RPT2 protein degradation 26S proteasome subunit S0002165 1YDL147W YDL147W RPN5 protein degradation 26S proteasome subunit S0002306 1YOL038W YOL038W PRE6 protein degradation 20S proteasome subunit (alpha4) S0005398 1YKL145W YKL145W RPT1 protein degradation, ubiquitin26S proteasome subunit S0001628 1YDL097C YDL097C RPN6 protein degradation 26S proteasome regulatory subunit S0002255 1YDR394W YDR394W RPT3 protein degradation 26S proteasome subunit S0002802 1YBR173C YBR173C UMP1 protein degradation, ubiquitin20S proteasome maturation factor S0000377 1YER012W YER012W PRE1 protein degradation 20S proteasome subunit C11(beta4) S0000814 1YPR108W YPR108W RPN7 protein degradation 26S proteasome regulatory subunit S0006312 1YOR117W YOR117W RPT5 protein degradation 26S proteasome regulatory subunit S0005643 1YJL001W YJL001W PRE3 protein degradation 20S proteasome subunit (beta1) S0003538 1YPR103W YPR103W PRE2 protein degradation 20S proteasome subunit (beta5) S0006307 1YOR157C YOR157C PUP1 protein degradation 20S proteasome subunit (beta2) S0005683 1YGL048C YGL048C RPT6 protein degradation 26S proteasome regulatory subunit S0003016 1YHR200W YHR200W RPN10 protein degradation 26S proteasome subunit S0001243 1YML092C YML092C PRE8 protein degradation 20S proteasome subunit Y7 (alpha2 S0004557 1YIL075C YIL075C RPN2 tRNA processing 26S proteasome subunit) S0001337 1YMR314W YMR314W PRE5 protein degradation 20S proteasome subunit(alpha6) S0004931 1YGR253C YGR253C PUP2 protein degradation 20S proteasome subunit(alpha5) S0003485 1YGR135W YGR135W PRE9 protein degradation 20S proteasome subunit Y13 (alpha3) S0003367 1YFR004W YFR004W RPN11 transcription putative global regulator S0001900 1YOR259C YOR259C RPT4 protein degradation 26S proteasome regulatory subunit S0005785 1YFR052W YFR052W RPN12 protein degradation 26S proteasome regulatory subunit S0001948 1YFR050C YFR050C PRE4 protein degradation proteasome subunit, B type S0001946 1YGL011C YGL011C SCL1 protein degradation 20S proteasome subunit YC7ALPHA/Y8 S0002979 1YDR427W YDR427W RPN9 protein degradation 26S proteasome regulatory subunit S0002835 1YOR362C YOR362C PRE10 protein degradation 20S proteasome subunit C1 (alpha7) S0005889 1YBL041W YBL041W PRE7 protein degradation 20S proteasome subunit S0000137 1YER021W YER021W RPN3 protein degradation 26S proteasome regulatory subunit S0000823 1YER094C YER094C PUP3 protein degradation 20S proteasome subunit (beta3 S0000896 1YGR270W YGR270W YTA7 protein degradation 26S proteasome subunit; ATPase S0003502 1YHR027C YHR027C RPN1 protein degradation 26S proteasome regulatory subunit S0001069 1YER047C YER047C SAP1 mating type switching AAA family protein S0000849 1YGR232W YGR232W unknown unknown S0003464 1
GGTGGCAA - proteasome associated control element
Jaak Vilo
DNA expression data analysis 28
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute SPEXS - Sequence Pattern EXhaustive Search
Jaak Vilo, 1998
✦ User-definable pattern language: substrings (oligos), grouppositions/wildcards, flexible wildcards (PROSITE)
✦ Fast exhaustive search over pattern language
✦ “Lazy suffix tree construction”-like algorithm
✦ Analyze multiple sets of sequences simultaneously
✦ Restrict search to most frequent patterns only (in each set)
✦ Report most frequent patterns, patterns over- orunderrepresented in selected subsets, or patterns significantby various statistical criteria, e.g. by binomial distribution
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute “Lazy” construction of trie
✦ ATACATA$
✦ 12345678
✦ O(n²)
✦ Kurtz, Giegerich
✦ Good in practice
A
T
$
C
{1,3,5,7} {4} {2,6}{8}
T$C
{2,6}{8}
{4}{3,7}
A
Jaak Vilo
DNA expression data analysis 29
SPEXS: pattern discovery basedon pattern trie. Vilo 1998
✦ Substrings
✦ Group characters
✦ Wildcard positions
✦ Restrictions on nr ofeach separately
✦ At least k occurrences
✦ Exact occurrences ofeach pattern
A
TC{1,3,5,7} {4} {2,6}
[CT]
C ∪ T
*A
{3,5,7}
{2,4,6}
ATACATA$12345678
ε
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Results continued...
✦ Totally over 6000 interesting patterns
✦ Many from homologous upstreams => Remove
✦ 1498 most interesting substrings left
✦ How to summarize?
✦ Clustered these based on mutual similarity
✦ Found alignments, consensi, and profiles
✦ Of 62 clusters 48 had patterns matching SCPD(experimentally mapped) binding site database
Jaak Vilo
DNA expression data analysis 30
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Cluster and align patterns
CLUSTER_0009:TGACAGCTGACAGCTGTGACAGCTTGTGACAGAGTGACAACAGTGACACAGTGACAAGTGACATAGTGACATTGTGACATTGTGACATTGTGACATTGTGACAGTGACAGTGTGACAGGTGACACCAGACAGTGACCCTTGACCGATTGACCGTTTGACC
Words aligned asALIGNMENT: basedon pattern TGAC----TGACAGC-----TGACAGCT---GTGACAGC--TTGTGACAG----AGTGACA---ACAGTGACA----CAGTGACA-----AGTGACAT----AGTGACATT----GTGACATT----GTGACAT----TGTGACA----TTGTGACA------GTGACAG----TGTGACAG-----GTGACAC------TGACCCT-----TGACCGA----TTGACCG----TTTGACC---
NO ALIGNMENT FOR:NO: CAGACAG
PROFILE:>/tmp/.12100.profile Profile for sequences in file /tmp/.12100 based on pattern TGAC
- 19 16 10 4 0 0 0 0 0 6 13 19A 1 0 5 0 0 0 20 0 16 0 1 0C 0 2 0 0 0 0 0 20 4 2 3 0G 0 0 0 14 0 20 0 0 0 8 0 0T 0 2 5 2 20 0 0 0 0 4 3 1
CONSENSUS: (not real one )>/tmp/.12100.consensus Profile for sequences in file /tmp/.12100 based on pattern TGACCONSENSUS_PAT /tmp/.12100
Consensus pattern:a[ct][at]GTGACA[GTC][cta]t
AGTGACAACAGTGACACAGTGACACAGTGACACAGTGACGCAGTGAGCAGTGAGCAGTGATTACAGTGTTTACAGTGATTACAGTGATACAGTGACAGTGATACAGTGATTTACAGTTTTACAGTG
ALIGNMENT: based on pattern AGT-----AGTGACA---ACAGTGACA----CAGTGACA----CAGTGAC----ACAGTGAC----GCAGTGA-----GCAGTG-----AGCAGTGA---TTACAGTG---TTTACAGTGA---TTACAGTGA----TACAGTG------ACAGTGA----TACAGTGA--TTTACAGT----TTTACAGTG---
PROFILE:>/tmp/.5679.profile Profile for sequences in file /tmp/.5679 basedon pattern AGT- 13 11 8 3 1 0 0 0 1 5 11 13A 0 0 1 10 0 16 0 0 0 11 0 3C 0 0 0 0 15 0 0 0 0 0 5 0G 0 0 0 3 0 0 16 0 15 0 0 0T 3 5 7 0 0 0 0 16 0 0 0 0
CONSENSUS: (not real one )>/tmp/.5679.consensus Profile for sequences in file /tmp/.5679based on pattern AGTCONSENSUS_PAT /tmp/.5679 ttt[AG]CAGTGAca
Align a group of patterns
Jaak Vilo
DNA expression data analysis 31
-----------------ACCCAGACATCGGGCTTCCAC----------------ACACCCAGACATC--------------------------ACACCCAGACATC--------------------------GAACCCATACACT--------------------------ACACCCAGACCGCG-------------------------GCACCCACACATTT----------------------GCTAAACCCATGCACAGTGACT----------------------ACCCAGACACGCTCGA-------------------CTTCACCCTCATAC---------------------------ACACCCCTTTTCT--------------------------GCACCCAGTCTT---------------------------GCACCCAAACACCTGCATATTTGG---------------GCACCCAATCACC--------------------------ACACCCAGACCTC--------------------------AAACCCACACAT--------------------------TGCACCCATACCTT-------------------------AACACCCAAGCACAG-----------ATCTCTCGCAACG-------------------------ACCTCCGTACATTC-------------------------ACACCTGGACACC--------------------------ACATCCGTACAACGAGAACCCATACATTA----------
---TCCGTAC--- ACCCATAC---CATCCGTAC--- ACCCATACA---ATCCGTA---- ACCCATACAT--ATCCGTACA-- -CCCATAC------CCGTAC--- -CCCATACA-----CCGTACA-- --CCATACAT---TCCGTACAT- -CCCATACAT--ATCCGTACAT- --CCATACA----TCCGTACA-- -AACATAC------CCGTACAT- --ACATACT----TCCGTA---- ---GATACT---ATCCGTAC--- --AGATACT-----CCGTACC-----ACCGTACC-----ACCGTAC-----CACCGTAC-------CCGTACATT----GCGTAG-------GCGTAGG-------CGTAGG---CATCCGTA----ACATCCGT------CATCCGT-----
In SCPD
Discoveredautomatically
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
tctcaTCTCA[TC][CT][tag]catcaaTCTTCATGtcctcGAA[CG]TGCCATCtcaa[ta][CG]CCTA[AT]AatcgTACCTCTaacc[ac]CCCC[CT][CGT][ag]agT[TA][CA]TCCT[CG]ggACAGCTAca[ct][at]GTGACA[GTC][cta]ttt[tc]ACAGT[GT][AT][tc]g[at][ATC]TACACAttttGTCACA[GAT]ggt[gc]ACATT[GC][CT]tgata[TC]TGGTTCtacaTCCGTAC[acg]tta[gca][atc]TAAG[CG][TAG][tga]atAT[TAC]GTTAAgct[ct][at][AG]AAGT[AT][TA]cgtT[AG]TTA[CT][TG][AG]caACTTTATTT[ag]TAACTT[AT]Cat[ACT]CGCTTA[AT]gaa[ca][gat][acg][AG]CGCG[cta][gat][ca]gc[ac][at][GT]ACGCcaaGGTCG[CT]ActgtTAACGAATCGTTtaaga[at][TC]CGTTTA[ag]gt[ta]CGAATA[AG]aaaaaA[CAG][AT]GAATCttct[ac][tc][at]CGACT[CA][ca][cg]aatcCACGAA[gc][ta]gc[ga][ctg][ACG]TACG[AT][atc]tataC[CA]CATAC[AT]t
atat[CT][AG]GCAC[TC][ac]ataGCGCA[GT][ga]cccgGTGGCAA[AC][ag]t[ca][ga][GA]CGGC[TG][GTA][cta]tttta[cat][AGC]AGGG[GT][ctg][ac]agcg[ag][at][ga][ac]GATGAG[AC]t[ag][at]gaTGGATGCc[ta]TGCATGAAca[TG][GC]GTATAcgc[TAG]TATAT[ATC][gat][ag][tg]ggg[ag][ga][ag][AG][TAG]AT[GA]TG[agt][ga][ag]tag[AG]TAGA[TA]A[ga]aaaagtaTAAATAGAGCtgct[at]a[ag][TG][AT]GCC[CG][ac][ac]gaaC[CT]CAAT[AT][tg]tATCCAAGAgaaacaAAACAAA[AT][ca][ac]aatatgtGTAAA[TC]ATttataaaa[gt][CA][GT]AAAA[GA][cg][gac]aaaagt[gt][TC]GAAAG[AG]Tt[at][tac]t[gta][ag]AAAATTTT[tg][tc][at]ttga[at][acg][CA]GGAA[AG]T[gt]gaat[tc][cat][AT][TC]TTC[GA][ACT][ga]tcgg[ct][ctg][gct][ctg]CTTTTT[CTG][TC][atc][tg]cctTTTTCTG[CT][TA]ct[ta][gta][gtc][TG]TCTA[TG][GTC]a[at][ct]taaat[AT]TTTGTG[ta]cat[acg]CTGTG[CT]a[ac]
Jaak Vilo
DNA expression data analysis 32
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute Summary about identified patterns
SCPDOf 1498 patterns315 patterns match sites73 patterns are matched by some site1134 patterns do not match any nor is matched by any
Of 109 factors with total 799 sites85 factors are matched by some of the patterns19 factors match some patterns24 factors do not have matches nor is matched
Of 498 unique sites of total 799 sites238 sites are matched by some of the patterns21 sites match some patterns252 sites do not have matches nor is matched
TRANSFACOf 1498 patterns297 patterns match sites61 patterns are matched by some site1174 patterns do not match any nor is matched by any
Of 351 DB-entries with total 359 sites205 factors are matched by some of the patterns22 factors match some patterns134 factors do not have matches nor is matched
Of 334 unique sites of total 359 sites198 sites are matched by some of the patterns16 sites match some patterns127 sites do not have matches nor is matched
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute PATMATCH
✦ Match your patterns against sequences
✦ Sequences - extracted from “GENOMES”
✦ Visualise matches along the sequence
✦ Visualise pattern by pattern if sequence has amatch
✦ Order sequences according to hierarchicalclustering order from EPCLUST
✦ Show clustering and upstream next to eachother
Jaak Vilo
DNA expression data analysis 33
YOR261C YOR261C RPN8 protein degradation 26S proteasome regulatory subunit S0005787 1YDL020C YDL020C RPN4 protein degradation, ubiquitin26S proteasome subunit S0002178 1YDL007W YDL007W RPT2 protein degradation 26S proteasome subunit S0002165 1YDL147W YDL147W RPN5 protein degradation 26S proteasome subunit S0002306 1YOL038W YOL038W PRE6 protein degradation 20S proteasome subunit (alpha4) S0005398 1YKL145W YKL145W RPT1 protein degradation, ubiquitin26S proteasome subunit S0001628 1YDL097C YDL097C RPN6 protein degradation 26S proteasome regulatory subunit S0002255 1YDR394W YDR394W RPT3 protein degradation 26S proteasome subunit S0002802 1YBR173C YBR173C UMP1 protein degradation, ubiquitin20S proteasome maturation factor S0000377 1YER012W YER012W PRE1 protein degradation 20S proteasome subunit C11(beta4) S0000814 1YPR108W YPR108W RPN7 protein degradation 26S proteasome regulatory subunit S0006312 1YOR117W YOR117W RPT5 protein degradation 26S proteasome regulatory subunit S0005643 1YJL001W YJL001W PRE3 protein degradation 20S proteasome subunit (beta1) S0003538 1YPR103W YPR103W PRE2 protein degradation 20S proteasome subunit (beta5) S0006307 1YOR157C YOR157C PUP1 protein degradation 20S proteasome subunit (beta2) S0005683 1YGL048C YGL048C RPT6 protein degradation 26S proteasome regulatory subunit S0003016 1YHR200W YHR200W RPN10 protein degradation 26S proteasome subunit S0001243 1YML092C YML092C PRE8 protein degradation 20S proteasome subunit Y7 (alpha2 S0004557 1YIL075C YIL075C RPN2 tRNA processing 26S proteasome subunit) S0001337 1YMR314W YMR314W PRE5 protein degradation 20S proteasome subunit(alpha6) S0004931 1YGR253C YGR253C PUP2 protein degradation 20S proteasome subunit(alpha5) S0003485 1YGR135W YGR135W PRE9 protein degradation 20S proteasome subunit Y13 (alpha3) S0003367 1YFR004W YFR004W RPN11 transcription putative global regulator S0001900 1YOR259C YOR259C RPT4 protein degradation 26S proteasome regulatory subunit S0005785 1YFR052W YFR052W RPN12 protein degradation 26S proteasome regulatory subunit S0001948 1YFR050C YFR050C PRE4 protein degradation proteasome subunit, B type S0001946 1YGL011C YGL011C SCL1 protein degradation 20S proteasome subunit YC7ALPHA/Y8 S0002979 1YDR427W YDR427W RPN9 protein degradation 26S proteasome regulatory subunit S0002835 1YOR362C YOR362C PRE10 protein degradation 20S proteasome subunit C1 (alpha7) S0005889 1YBL041W YBL041W PRE7 protein degradation 20S proteasome subunit S0000137 1YER021W YER021W RPN3 protein degradation 26S proteasome regulatory subunit S0000823 1YER094C YER094C PUP3 protein degradation 20S proteasome subunit (beta3 S0000896 1YGR270W YGR270W YTA7 protein degradation 26S proteasome subunit; ATPase S0003502 1YHR027C YHR027C RPN1 protein degradation 26S proteasome regulatory subunit S0001069 1YER047C YER047C SAP1 mating type switching AAA family protein S0000849 1YGR232W YGR232W unknown unknown S0003464 1
GGTGGCAA - proteasome associated control element
PATMATCH - combine pattern matching with expression data
Jaak Vilo
DNA expression data analysis 34
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Global and Local Data MiningSecondary Data Mining
✦ Find global structure by clustering
✦ Find local structure by pattern discovery
✦ Summarize the findings to the sizefeasible for humans to interpret
✦ Find most interesting rules
✦ Find “explanations” for the behavior
✦ Hypothesis generation for wet-lab
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Gene networks
promoter1 gene1promoter2 gene2 promoter3 gene3 promoter4 gene4DNA
RNA
transcription
translation
proteins
transcription factors
Jaak Vilo
DN
A expression data analysis
35
European Bioinformatics InstituteEuropean Bioinformatics Institute
A gene netw
ork
b1
b2
b3
F1
F2
r1r2
European Bioinformatics InstituteEuropean Bioinformatics Institute
lacZ ...Prom
oterO
perator
Repressor
lacIProm
oter
Activator
Glucose
Lactose
Glucose
Galactose +
Galactosidase
Lac-O
peron
Thom
as Schlitt
Jaak Vilo
DNA expression data analysis 36
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Eur
opea
n B
ioin
form
atic
s In
stit
ute
Thanks
Alvis Brazma EBIUgis Sarkans EBIThomas Schlitt EBIInge Jonassen University of BergenEsko Ukkonen University of HelsinkiAlan Robinson EBI
Recommended