Clustering and machine learning for gene expression data
Torgeir R. Hvidsten
Linnaeus Centre for Bioinformatics
Torgeir R. Hvidsten, 2006.02.17
Machine learning: to learn general concepts from examples
[Figure: diagram relating the real world, data (feature space) and knowledge (classes). Labels: data collection, abstraction, machine learning, and "assumed functional relationship partially described by the examples".]
Gene Ontology
An ordered, controlled vocabulary, organized in a taxonomy, for describing the molecular role of gene products
• Molecular function: the tasks performed by individual gene products
• Biological process: broad biological goals that are accomplished by ordered assemblies of molecular functions
• Cellular component: subcellular structures, locations, and macromolecular complexes
Protein structure classification (CATH)
Microarray
Hybridization
Image after scanning
Numerical data
Gene/Expr  E1     E2     E3     E4     E5     E6     E7     E8     E9     E10    …  EM
G1         -0.47  -3.32  -0.81  0.11   -0.60  -1.36  -1.03  -1.84  -1.00  -0.60  …  -0.94
G2         0.66   0.07   0.20   0.29   -0.89  -0.45  -0.29  -0.29  -0.15  -0.45  …  -0.42
G3         0.14   -0.04  0.00   -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81  …  -1.12
G4         -0.04  0.00   -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76  …  -0.62
G5         0.28   0.37   0.11   -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29  …  -0.74
G6         0.54   0.53   0.16   0.14   0.20   -0.34  -0.38  -0.36  -0.49  -0.58  …  -1.47
G7         0.20   0.14   0.00   0.11   -0.34  -0.03  0.04   -0.76  -0.81  -1.12  …  -1.36
G8         0.40   0.43   0.18   0.00   -0.14  0.29   0.07   -0.79  -0.81  -0.92  …  -1.22
G9         0.01   0.46   0.28   -0.34  -0.23  -0.36  -0.45  -0.64  -0.79  -1.22  …  -1.09
…          …      …      …      …      …      …      …      …      …      …      …  …
GN         -0.23  0.04   0.00   -0.30  -0.29  -0.45  -0.97  -2.06  -0.89  -1.22  …  -0.97
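Each entry is a log ratio of the two channel intensities, e.g. log(2.3/2.4) = log("Red/Green")
Number of experiments (columns): M < 100
Number of genes (rows): N ≈ 10000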
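As a small illustration of how one entry of such a table could be computed, the sketch below takes the log of a red/green intensity ratio. The slide does not state the base of the logarithm; base 2 is assumed here because it is common for microarray data, and the function name `log_ratio` is hypothetical.

```python
import math


def log_ratio(red: float, green: float, base: float = 2.0) -> float:
    """Log of the red/green intensity ratio for one spot (base 2 is an assumption)."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log(red / green, base)


# The slide's example intensities: red = 2.3, green = 2.4.
value = log_ratio(2.3, 2.4)
print(f"log ratio = {value:.3f}")  # negative, because green is stronger than red
```

A negative value means the green channel dominates, a positive value that the red channel dominates, and equal intensities give 0.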
Data analysis goals
What to study?
• Classes of experiments: changes in expression levels in tissue samples that differ in e.g. disease, treatment or environmental effects
• Classes of genes: expression profiles of genes with similar biological function
• Both of the above
Data analysis methods
• Unsupervised learning (clustering, class discovery): used to “discover” natural groups of genes/experiments, e.g.
  – discover subclasses of a form of cancer that is clinically homogeneous
• Supervised learning: used to “learn” a model of a set of predefined classes of genes/experiments, e.g.
  – diagnosis of cancer/subclasses of cancer
Clustering analysis
Need to define:
• a measure of similarity
• an algorithm that uses the measure of similarity to discover natural groups in the data
The number of ways to divide n items into k clusters is approximately k^n / k!
Example (n = 500, k = 10): 10^500 / 10! ≈ 2.756 × 10^493
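The size of this number is easy to check with exact integer arithmetic. The sketch below evaluates the slide's estimate k^n / k! for n = 500 and k = 10; the function name is hypothetical, and integer division is used, which does not affect the leading digits.

```python
from math import factorial


def approx_num_clusterings(n: int, k: int) -> int:
    """The estimate k**n / k! of the number of ways to split n items into k clusters."""
    return k**n // factorial(k)


count = approx_num_clusterings(500, 10)
exponent = len(str(count)) - 1      # power of ten of the leading digit
mantissa = count / 10**exponent     # leading digits as a float
print(f"{mantissa:.3f} x 10^{exponent}")
```

Exhaustively scoring every clustering is therefore out of the question, which is why heuristic algorithms are used.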
Measure of similarity
What is similar? Euclidean distance
[Figure: two profiles plotted against the axes E1 and E2, separated by the distance d]
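A minimal sketch of the Euclidean distance between two expression profiles follows. The two vectors are the first ten values of rows G2 and G3 from the numerical data table; the function name is hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Square root of the summed squared differences between two profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Experiments E1..E10 for genes G2 and G3.
g2 = [0.66, 0.07, 0.20, 0.29, -0.89, -0.45, -0.29, -0.29, -0.15, -0.45]
g3 = [0.14, -0.04, 0.00, -0.15, -0.58, -0.30, -0.18, -0.38, -0.49, -0.81]

d = euclidean_distance(g2, g3)
print(f"d(G2, G3) = {d:.3f}")
```

The smaller the distance, the more similar the two profiles are considered to be.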
Hierarchical clustering
• INPUT: n genes/experiments
• Consider each gene/experiment as an individual cluster and initialize an n × n distance matrix d
• Repeat until a single cluster remains
  – identify the two most similar clusters in d (i.e. smallest number in d)
  – merge the two most similar clusters and update the matrix (i.e. substitute the two clusters with the new cluster)
• OUTPUT: A tree of merged genes/experiments (called a dendrogram)
Hierarchical clustering
Intercluster similarity measures: (a) single linkage, (b) complete linkage and (c) average linkage
Example of hierarchical clustering: languages of Europe
Distance: how often the words for the same number start with a different first letter in the two languages, e.g.
d(E, N) = 2, d(E, Du) = 7, d(Sp, I) = 1
Intercluster strategy: SINGLE LINKAGE
Iteration 1
     E   N   Da  Du  G   Fr  Sp  I   P   H   Fi
E    0
N    2   0
Da   2   1   0
Du   7   5   6   0
G    6   4   5   5   0
Fr   6   6   6   9   7   0
Sp   6   6   5   9   7   2   0
I    6   6   5   9   7   1   1   0
P    7   7   6   10  8   5   3   4   0
H    9   8   8   8   9   10  10  10  10  0
Fi   9   9   9   9   9   9   9   9   9   8   0
[Figure: dendrogram so far (height axis 1 to 8); leaves shown: I, Fr]
Iteration 2
        (I Fr)  E   N   Da  Du  G   Sp  P   H   Fi
(I Fr)  0
E       6       0
N       6       2   0
Da      5       2   1   0
Du      9       7   5   6   0
G       7       6   4   5   5   0
Sp      1       6   6   5   9   7   0
P       4       7   7   6   10  8   3   0
H       10      9   8   8   8   9   10  10  0
Fi      9       9   9   9   9   9   9   9   8   0
[Figure: dendrogram so far (height axis 1 to 8); leaves shown: I, Fr, Da, N]
Iteration 3
        (Da N)  (I Fr)  E   Du  G   Sp  P   H   Fi
(Da N)  0
(I Fr)  5       0
E       2       6       0
Du      5       9       7   0
G       4       7       6   5   0
Sp      5       1       6   9   7   0
P       6       4       7   10  8   3   0
H       8       10      9   8   9   10  10  0
Fi      9       9       9   9   9   9   9   8   0
[Figure: dendrogram so far (height axis 1 to 8); leaves shown: I, Fr, Da, N, Sp]
Iteration 4
           (Sp I Fr)  (Da N)  E   Du  G   P   H   Fi
(Sp I Fr)  0
(Da N)     5          0
E          6          2       0
Du         9          5       7   0
G          7          4       6   5   0
P          3          6       7   10  8   0
H          10         8       9   8   9   10  0
Fi         9          9       9   9   9   9   8   0
[Figure: dendrogram so far (height axis 1 to 8); leaves shown: I, Fr, Da, N, Sp, E]
Iteration 5
           (E Da N)  (Sp I Fr)  Du  G   P   H   Fi
(E Da N)   0
(Sp I Fr)  5         0
Du         5         9          0
G          4         7          5   0
P          6         3          10  8   0
H          8         10         8   9   10  0
Fi         9         9          9   9   9   8   0
[Figure: dendrogram so far (height axis 1 to 8); leaves shown: I, Fr, Da, N, Sp, E, P]
Iteration 6
             (P Sp I Fr)  (E Da N)  Du  G   H   Fi
(P Sp I Fr)  0
(E Da N)     5            0
Du           9            5         0
G            7            4         5   0
H            10           8         8   9   0
Fi           9            9         9   9   8   0
[Figure: dendrogram so far (height axis 1 to 8); leaves shown: I, Fr, Da, N, Sp, E, P, G]
Iteration 7
             (G E Da N)  (P Sp I Fr)  Du  H   Fi
(G E Da N)   0
(P Sp I Fr)  5           0
Du           5           9            0
H            8           10           8   0
Fi           9           9            9   8   0
[Figure: dendrogram so far (height axis 1 to 8); leaves shown: I, Fr, Da, N, Sp, E, P, G, Du]
Iteration 8
               (Du G E Da N)  (P Sp I Fr)  H   Fi
(Du G E Da N)  0
(P Sp I Fr)    5              0
H              8              10           0
Fi             9              9            8   0
[Figure: dendrogram so far (height axis 1 to 8); leaves shown: I, Fr, Da, N, Sp, E, P, G, Du]
Iteration 9
                         (P Sp I Fr Du G E Da N)  H   Fi
(P Sp I Fr Du G E Da N)  0
H                        8                        0
Fi                       9                        8   0
[Figure: dendrogram so far (height axis 1 to 8); leaves shown: I, Fr, Da, N, Sp, E, P, G, Du, H]
Iteration 10
                         (Fi H)  (P Sp I Fr Du G E Da N)
(Fi H)                   0
(P Sp I Fr Du G E Da N)  8       0
[Figure: final dendrogram (height axis 1 to 8); leaves shown: I, Fr, Da, N, Sp, E, P, G, Du, H, Fi]
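The ten iterations can be reproduced with a short sketch of agglomerative clustering under single linkage, using the distance matrix from Iteration 1. This is a deliberately naive illustration and not the lecturer's code. Ties between equally close clusters are broken arbitrarily, so the order of merges at the same height may differ from the slides even though the sequence of merge heights is the same.

```python
from itertools import combinations

LABELS = ["E", "N", "Da", "Du", "G", "Fr", "Sp", "I", "P", "H", "Fi"]
# Lower-triangular distance matrix from Iteration 1 (row i, columns 0..i).
LOWER = [
    [0],
    [2, 0],
    [2, 1, 0],
    [7, 5, 6, 0],
    [6, 4, 5, 5, 0],
    [6, 6, 6, 9, 7, 0],
    [6, 6, 5, 9, 7, 2, 0],
    [6, 6, 5, 9, 7, 1, 1, 0],
    [7, 7, 6, 10, 8, 5, 3, 4, 0],
    [9, 8, 8, 8, 9, 10, 10, 10, 10, 0],
    [9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 0],
]


def single_linkage(labels, lower):
    """Return the list of merges as (sorted member labels, merge height)."""

    def dist(i, j):
        return lower[max(i, j)][min(i, j)]

    def linkage(a, b):
        # Single linkage: smallest distance between any two members.
        return min(dist(i, j) for i in a for j in b)

    clusters = [(i,) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        # Find the two most similar clusters.
        p, q = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        a, b = clusters[p], clusters[q]
        merges.append((sorted(labels[i] for i in a + b), linkage(a, b)))
        # Substitute the two clusters with the merged one.
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [a + b]
    return merges


merges = single_linkage(LABELS, LOWER)
heights = [h for _, h in merges]
for members, h in merges:
    print(h, members)
```

Recomputing every inter-cluster distance from scratch keeps the sketch short; a practical implementation would update the matrix in place, as the algorithm slide describes.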
Any data mining result needs to be consistent BOTH with the data and current knowledge!
Evaluation of clusters
Clusters may be evaluated according to how well they describe current knowledge
[Figure: final dendrogram of the languages (height axis 1 to 8; leaves I, Fr, Da, N, Sp, E, P, G, Du, H, Fi). Legend of language families: Romance, Slavic, Germanic, Ugro-Finnish]
Hierarchical clustering: properties
• Huge memory requirements: stores the n × n matrix
• Running time: O(n^3)
• Deterministic: produces the same clustering each time
• Nice visualization: dendrogram
• Number of clusters can be selected using the dendrogram
K-means clustering
• Split the data into k random clusters
• Repeat
  – calculate the centroid of each cluster
  – (re-)assign each gene/experiment to the closest centroid
  – stop if no new assignments are made
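The steps above can be sketched in plain Python as follows. The helper names, the seeded random initial split and the handling of an emptied cluster (re-seeding it with a random point) are assumptions made for illustration; the slides do not specify them.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def centroid(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty set of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points: Sequence[Point], k: int, seed: int = 0, max_iter: int = 100):
    rng = random.Random(seed)
    # Split the data into k random clusters of near-equal size.
    assignment: List[int] = [i % k for i in range(len(points))]
    rng.shuffle(assignment)
    centroids: List[Point] = []
    for _ in range(max_iter):
        # Calculate the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # Assumption: an emptied cluster is re-seeded with a random point.
            centroids.append(centroid(members) if members else rng.choice(list(points)))
        # (Re-)assign each point to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Stop if no new assignments are made.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids


# Hypothetical two-dimensional data with two well-separated groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = k_means(data, k=2)
print(assignment, centroids)
```

Because the initial split is random, different seeds can give different results on less clearly separated data, which is why k-means is usually run several times.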
Example of K-means: two dimensions
[Figure sequence, K = 2; centroids are marked with x]
• Initial clusters
• Iteration 1: calculate centroids, then (re-)assign
• Iteration 2: calculate centroids, then (re-)assign
• Iteration 3: calculate centroids, then (re-)assign. No new assignments! STOP
K-means: properties
• Low memory usage
• Running time: O(n) per iteration (for a fixed number of clusters)
• Improves iteratively: not trapped in previous mistakes
• Non-deterministic: will in general produce different clusters with different initializations
• Number of clusters must be decided in advance
Hierarchical vs. k-means
• Hierarchical clustering:
  – computationally expensive -> relatively small data sets
  – nice visualization, no. of clusters can be selected
  – deterministic
  – cannot correct early "mistakes"
• K-means:
  – computationally efficient -> large data sets
  – predefined no. of clusters
  – non-deterministic -> should be run several times
  – iterative improvement
• Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k=2 -> best of both worlds!
Supervised learning
• Uses examples of known classes to learn a model
• Examples are expression profiles of genes with known classes (clinical state or function)
• The model can be e.g.
  – hyperplanes separating classes in n dimensions
  – artificial neural networks
  – decision trees
  – IF-THEN rules
• Can be used for e.g.
  – diagnostics
  – predicting gene function for unknown genes
Support Vector Machines
[Figure: two classes separated by a maximum-margin separating "hyperplane"; labels: support vectors, soft margin]
Artificial neural networks
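[Figure: network with an input layer (x1, x2, x3, x4) and an output layer computing f(x)]
[Figure: a single unit with inputs x1 … xn and weights w1 … wn]
f(x) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i > 0 \\ -1 & \text{otherwise} \end{cases}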
Decision tree learning
Class knowledge:
Group 1: Nordic countries
Group 2: UK, France, Greece, Spain, Portugal
Group 3: Benelux countries, Switzerland, Austria, Italy, Germany
[Figure: decision tree]
Christian Democrats > 16?
  Yes: Group 3
  No: Agrarians > 4?
    Yes: Group 1
    No: Group 2

Rule learning: Rough sets
Agrarians([4, *)) AND Christian Democrats([*, 16)) => Class(1)
Agrarians([*, 4)) AND Christian Democrats([*, 16)) => Class(2)
Christian Democrats([16, *)) => Class(3)
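The tree and the rule set encode the same decision logic, which a small sketch makes explicit. The function names are hypothetical, the units of the two attributes are not given on the slides, and the input values below are made up for illustration. Note that the tree tests "> 16" and "> 4" while the rules use the half-open intervals [16, *) and [4, *), so the two can disagree exactly on a boundary value.

```python
def classify_by_tree(christian_democrats: float, agrarians: float) -> int:
    """Follow the decision tree from the root to a leaf and return the group."""
    if christian_democrats > 16:
        return 3
    if agrarians > 4:
        return 1
    return 2


def classify_by_rules(christian_democrats: float, agrarians: float) -> int:
    """Apply the three IF-THEN rules with their half-open intervals."""
    if christian_democrats >= 16:                    # Christian Democrats([16, *))
        return 3
    if agrarians >= 4:                               # Agrarians([4, *)) AND CD([*, 16))
        return 1
    return 2                                         # Agrarians([*, 4)) AND CD([*, 16))


# Hypothetical attribute values (Christian Democrats, Agrarians).
examples = [(20.0, 0.0), (5.0, 10.0), (5.0, 1.0)]
for cd, ag in examples:
    print(cd, ag, classify_by_tree(cd, ag), classify_by_rules(cd, ag))
```

Each path from the root to a leaf corresponds to one rule, which is why a tree can always be rewritten as a rule set.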
Supervised vs. clustering
Clustering
+ class discovery
+ robust towards incorrect knowledge
Supervised
+ evaluation
+ predictive/descriptive model
+ based on actual knowledge rather than idealized hypotheses
Predicting biological process from gene expression time profiles
Papers:
I. T. R. Hvidsten, A. Lægreid and J. Komorowski. Learning rule-based models of biological process from gene expression time profiles using gene ontology, Bioinformatics, 19(9): 1116-23, 2003.
II. A. Lægreid, T. R. Hvidsten, H. Midelfart, J. Komorowski and A. K. Sandvik. Predicting Gene Ontology Biological Process From Temporal Gene Expression Patterns, Genome Research, 13(5): 965-979, 2003.
Hierarchical clustering
Iyer et al., The transcriptional program in the response of human fibroblasts to serum, Science, 283(5398): 83-87, 1999
Gene Ontology vs. expression clustering
Methodology
1. Annotation
Gene  0HR   15MIN  30MIN  1HR    2HR    4HR    6HR    8HR    12HR   16HR   20HR   24HR   Process
g1    0.00  -0.47  -3.32  -0.81  0.11   -0.60  -1.36  -1.03  -1.84  -1.00  -0.60  -0.94  Unknown
g2    0.00  0.66   0.07   0.20   0.29   -0.89  -0.45  -0.29  -0.29  -0.15  -0.45  -0.42  Transport and defense response
g3    0.00  0.14   -0.04  0.00   -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81  -1.12  Cell cycle control
g4    0.00  -0.04  0.00   -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76  -0.62  Positive control of cell proliferation
g5    0.00  0.28   0.37   0.11   -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29  -0.74  Positive control of cell proliferation
...   ...   ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
[Figure: ontology fragment with the processes Positive control of cell proliferation, Defense response, Cell cycle control and Transport, with genes g2, g2, g3, g4 and g5 attached]
2. Extracting features for learning
[Figure: expression time profile; x axis 0 to 24, y axis -2 to 1.5]
3. Inducing minimal decision rules using rough sets
0 - 4(Increasing) AND 6 - 10(Decreasing) AND 14 - 18(Constant) => GO(cell proliferation)
4. The function of uncharacterized genes is predicted using the rules!
Rule Induction
• IF-part (antecedent, premise): the minimal set of discrete changes in expression needed to uphold the discriminatory power of the full data set
• THEN-part (consequent): all functions of genes described by the premise-side
• We want rules that describe the expression profiles of several genes with one or a few functions
– accuracy: the fraction of genes matching the IF-part that are annotated with the process in the THEN-part
– coverage: the fraction of genes annotated with the process in the THEN-part that match the IF-part
IF 0 - 4(Constant) AND 0 - 10(Increasing)
THEN GO(prot. met. and mod.) OR GO(mesoderm develop.) OR GO(prot. biosynt.)
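The two rule-quality measures defined above reduce to simple set arithmetic. In the sketch below, `matching` stands for the genes whose profiles satisfy the IF-part and `annotated` for the genes annotated with the process in the THEN-part; the gene names and set contents are hypothetical.

```python
from typing import Set


def rule_accuracy(matching: Set[str], annotated: Set[str]) -> float:
    """Fraction of genes matching the IF-part that carry the THEN-part annotation."""
    return len(matching & annotated) / len(matching)


def rule_coverage(matching: Set[str], annotated: Set[str]) -> float:
    """Fraction of genes with the THEN-part annotation that match the IF-part."""
    return len(matching & annotated) / len(annotated)


# Hypothetical example: 4 genes match the rule, 6 genes have the annotation,
# and 3 genes are in both sets.
matching = {"gA", "gB", "gC", "gD"}
annotated = {"gB", "gC", "gD", "gE", "gF", "gG"}

print(rule_accuracy(matching, annotated))  # 3 of 4 matching genes are annotated
print(rule_coverage(matching, annotated))  # 3 of 6 annotated genes are matched
```

A rule that matches very few genes can reach high accuracy with low coverage, so both measures are needed to judge it.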
Rule example
Rule:
0 - 4(Constant) AND 0 - 10(Increasing) => GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)
Covered genes: M35296, J02783, D13748, X05130, X60957
[Figure: expression profiles of the covered genes; x axis 0 to 24, y axis -1 to 3]
Classification
[Figure: the expression profile of gene X60957 (x axis 0 to 24, y axis -1 to 3) is matched against the list of IF … THEN … rules; figure annotations +4, +1, +1 mark vote contributions]
One of the rules in the list:
IF 0 - 4(Constant) AND 0 - 10(Increasing) THEN GO(protein metabolism and modification) OR GO(mesoderm development) OR GO(protein biosynthesis)

Process                               Votes
protein metabolism and modification   6
mesoderm development                  3
proteolysis and peptidolysis          2
transcription                         1
protein biosynthesis                  1
vision                                1
…

Votes are normalized and processes with vote fractions higher than a selection threshold are chosen as predictions
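The voting step can be sketched as follows. The vote counts are the six processes listed on the slide; the entries elided by "…" are unknown and are left out, so the fractions here are only illustrative. The threshold value of 0.2 and the function name are assumptions.

```python
from typing import Dict, List


def predict_processes(votes: Dict[str, int], threshold: float) -> List[str]:
    """Normalize votes to fractions and keep processes above the threshold."""
    total = sum(votes.values())
    fractions = {process: count / total for process, count in votes.items()}
    return [process for process, fraction in fractions.items() if fraction > threshold]


# Votes listed on the slide (the elided remainder is not included).
votes = {
    "protein metabolism and modification": 6,
    "mesoderm development": 3,
    "proteolysis and peptidolysis": 2,
    "transcription": 1,
    "protein biosynthesis": 1,
    "vision": 1,
}

predictions = predict_processes(votes, threshold=0.2)  # threshold is hypothetical
print(predictions)
```

Raising the threshold yields fewer but more confident predictions; lowering it yields more predictions at the cost of more false positives.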
Evaluation
Evaluation technique:
  – divide examples into training set and test set
  – cross validation
Evaluation measures:
  – accuracy = (TP+TN)/(TP+FN+TN+FP)
  – sensitivity = TP/(TP+FN)
  – specificity = TN/(TN+FP)
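A minimal sketch of the three measures, computed from true and predicted class labels, is given below. The function names and the twelve-example test set are hypothetical.

```python
from typing import Sequence, Tuple


def confusion_counts(actual: Sequence[bool], predicted: Sequence[bool]) -> Tuple[int, int, int, int]:
    """Return (TP, FN, TN, FP) for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    return tp, fn, tn, fp


def accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    return (tp + tn) / (tp + fn + tn + fp)


def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)


# Hypothetical test set: 3 positives, 9 negatives; one positive is missed.
actual = [True] * 3 + [False] * 9
predicted = [True, True, False] + [False] * 9

tp, fn, tn, fp = confusion_counts(actual, predicted)
print(accuracy(tp, fn, tn, fp), sensitivity(tp, fn), specificity(tn, fp))
```

With few positives, accuracy alone can look good even when a third of the positives are missed, which is why sensitivity and specificity are reported separately.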
Threshold selection
[Figure: test set genes g1 to g12 placed along an axis showing the fraction of votes for “protein biosynthesis” (up to 1); markers distinguish genes with function “protein biosynthesis” from genes with a different function; two cut-offs are drawn, Threshold 1 and Threshold 2]
The two thresholds give Sensitivity = 2/3, Specificity = 1 and Sensitivity = 1, Specificity = 2/3
sensitivity: TP/(TP+FN)
specificity: TN/(TN+FP)
ROC analysis and classifier evaluation
[Figure: ROC curve; y axis sensitivity (0 to 1), x axis 1 - specificity (0 to 1); labels: no discrimination, perfect discrimination, AUC]
• ROC: the receiver operating characteristic curve results from plotting sensitivity against 1 - specificity for all possible thresholds
  – sensitivity: TP/(TP+FN)
  – specificity: TN/(TN+FP)
• AUC: area under the ROC curve
• Cross validation (CV)
  – systematic division of data into training and test sets
  – CV estimates are interpreted as the classification performance expected on new, unseen data
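To make the construction concrete, the sketch below sweeps a threshold over a set of scores (for instance vote fractions), collects the (1 - specificity, sensitivity) points and integrates them with the trapezoidal rule. The scores, labels and function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float]]:
    """(1 - specificity, sensitivity) for every threshold, from strict to lenient."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical scores; label 1 marks genes that truly have the function.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
area = auc(points)
print(points)
print(f"AUC = {area:.3f}")
```

An AUC of 0.5 corresponds to the no-discrimination diagonal and an AUC of 1 to perfect discrimination.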
Over all classes:
Coverage = TP/(TP+FN)
Precision = TP/(TP+FP)
Coverage: 84%, Precision: 50%
Coverage: 71%, Precision: 60%
Coverage: 39%, Precision: 90%
*Iyer et al.
Cross validation estimates
PROCESS                                AUC   SE    P-VALUE
Ion homeostasis                        1.00  0.00  0.008
Protein targeting                      0.99  0.03  0.000
Blood coagulation                      0.96  0.08  0.000
DNA metabolism                         0.94  0.09  0.000
Intracellular signaling cascade        0.94  0.06  0.000
Cell cycle                             0.93  0.04  0.000
Energy pathways                        0.93  0.12  0.004
Oncogenesis                            0.92  0.11  0.000
Circulation                            0.91  0.11  0.001
Cell death                             0.90  0.10  0.000
Developmental processes                0.90  0.07  0.000
Defense (immune) response              0.88  0.05  0.000
Transcription                          0.88  0.11  0.002
Cell adhesion                          0.87  0.09  0.002
Stress response                        0.86  0.15  0.002
Protein metabolism and modification    0.85  0.10  0.000
Cell motility                          0.84  0.11  0.000
Cell surface rec linked signal transd  0.82  0.15  0.005
Lipid metabolism                       0.81  0.14  0.000
Cell organization and biogenesis       0.79  0.11  0.000
Cell proliferation                     0.79  0.06  0.002
Transport                              0.79  0.17  0.001
Amino acid and derivative metabolism   0.69  0.06  0.288
AVERAGE                                0.88  0.09