Classification by Machine Learning Approaches
Michael J. Kerner – kerner@cbs.dtu.dk
Center for Biological Sequence Analysis, Technical University of Denmark

Outline

• Introduction to Machine Learning

• Datasets, Features

• Feature Selection

• Machine Learning Approaches (Classifiers)

• Model Evaluation and Interpretation

• Examples, Exercise


Machine Learning – Data Driven Prediction

To Learn: “to gain knowledge or understanding of or skill in by study, instruction, or experience”

(Merriam Webster English Dictionary, 2005)

Machine Learning: Learning the theory automatically from the data, through a process of inference, model fitting, or learning from examples:

Automated extraction of useful information from a body of data by building good probabilistic models.

Ideally suited for areas with lots of data in the absence of a general theory.


Why do we need Machine Learning?

• Some tasks cannot be defined well, except by examples (e.g. recognition of faces or people).

• Large amounts of data may have hidden relationships and correlations. Only automated approaches may be able to detect these.

• The amount of knowledge about a certain problem / task may be too large for explicit encoding by humans (e.g. in medical diagnostics).

• Environments change over time, and new knowledge is constantly being discovered. A continuous redesign of the systems “by hand” may be difficult.


The Machine Learning Approach

[Diagram: Input Data (e.g. Gene Expression Profiles, …) → Machine Learning (ML Classifier) → Prediction: Yes / No]


Machine Learning

• Learning Task:
  – What do we want to learn or predict?

• Data and assumptions:
  – What data do we have available?
  – What is their quality?
  – What can we assume about the given problem?

• Representation:
  – What is a suitable representation of the examples to be classified?

• Method and Estimation:
  – Are there possible hypotheses?
  – Can we adjust our predictions based on the given results?

• Evaluation:
  – How well does the method perform?
  – Might another approach/model perform better?


Learning Tasks

• Classification:
  – Prediction of an item class.

• Forecasting:
  – Prediction of a parameter value.

• Characterization:
  – Find hypotheses that describe groups of items.

• Clustering:
  – Partitioning of the (unassigned) data set into clusters with common properties. (Unsupervised learning)


Emergence of Large Datasets

Dataset examples:

• Image processing
• Spam email detection
• Text mining
• DNA micro-array data
• Protein function
• Protein localization
• Protein-protein interaction
• …


Dataset Examples

Edible or poisonous ?


Dataset Examples


mRNA Splicing


mRNA Splice Site Prediction


Protein Function Prediction: ProtFun

• Predict as many biologically relevant features as we can from the sequence

• Train artificial neural networks for each category

• Assign a probability for each category from the NN outputs


############## ProtFun 2.2 predictions ########

>KCNA1_HUMAN

# Functional category Prob Odds

Amino_acid_biosynthesis 0.042 1.893

Biosynthesis_of_cofactors 0.119 1.654

Cell_envelope 0.031 0.507

Cellular_processes 0.027 0.373

Central_intermediary_metabolism 0.046 0.731

Energy_metabolism 0.036 0.395

Fatty_acid_metabolism 0.019 1.485

Purines_and_pyrimidines 0.214 0.879

Regulatory_functions 0.013 0.083

Replication_and_transcription 0.019 0.073

Translation 0.129 2.925

Transport_and_binding =>0.717 1.748

# Enzyme/nonenzyme Prob Odds

Enzyme 0.231 0.807

Nonenzyme =>0.769 1.078

# Enzyme class Prob Odds

Oxidoreductase (EC 1.-.-.-) 0.040 0.193

Transferase (EC 2.-.-.-) 0.056 0.163

Hydrolase (EC 3.-.-.-) 0.062 0.195

Lyase (EC 4.-.-.-) 0.020 0.430

Isomerase (EC 5.-.-.-) 0.010 0.321

Ligase (EC 6.-.-.-) 0.017 0.326

# Gene Ontology category Prob Odds

Signal_transducer 0.061 0.284

Receptor 0.055 0.323

Hormone 0.001 0.206

Structural_protein 0.002 0.086

Transporter 0.469 4.299

Ion_channel 0.207 3.633

Voltage-gated_ion_channel =>0.280 12.736

Cation_channel 0.348 7.560

Transcription 0.163 1.270

Transcription_regulation 0.166 1.331

Stress_response 0.011 0.125

Immune_response 0.031 0.370

Growth_factor 0.005 0.372

Metal_ion_transport 0.159 0.345


Emergence of Large Datasets

Complexity of datasets:

• Many instances (examples)

• Instances with multiple features (properties / characteristics)

• Dependencies between the features (correlations)


Data Preprocessing

Instance selection:
  – Remove identical / inconsistent / incomplete instances (e.g. reduction of homologous genes, removal of wrongly annotated genes)

Feature transformation / selection:
  – Projection techniques (e.g. principal components analysis)
  – Compression techniques (e.g. minimum description length)
  – Feature selection techniques


Benefits of Feature Selection

• Attain good and often even better classification performance using a small subset of features
  – Less noise in the data

• Provide more cost-effective classifiers
  – Fewer features to take into account → smaller datasets → faster classifiers

• Identification of (biologically) relevant features for the given problem


Feature Selection

[Diagram comparing the filter approach and the wrapper approach. Box labels: All Features, Feature Subset Selection, Feature Subset Search Algorithm, Selection Criterion, Selected Features, Evaluation, Learning Algorithm, Optimal Features.]


Filter Approach

• Independent of the classification model
• A relevance measure for each feature is calculated
• Features with a value lower than a selected threshold t will be removed

Example: Feature-class entropy
• Measures the “uncertainty” about the class when observing feature i

f1  f2  f3  f4  class   |   f1  f2  f3  f4  class
1   0   1   1   1       |   1   0   0   0   0
0   1   1   0   1       |   0   0   1   0   0
1   0   1   0   1       |   1   1   0   1   0
0   1   0   1   1       |   0   1   0   1   0
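The feature-class entropy of the example table can be worked through in a few lines. This is a minimal sketch, assuming the measure is the conditional entropy H(class | feature i) in bits; the slide does not spell out the exact formula, and the function names are illustrative.

```python
import math
from collections import Counter

# Rows of the example table: (f1, f2, f3, f4, class).
ROWS = [
    (1, 0, 1, 1, 1), (0, 1, 1, 0, 1), (1, 0, 1, 0, 1), (0, 1, 0, 1, 1),
    (1, 0, 0, 0, 0), (0, 0, 1, 0, 0), (1, 1, 0, 1, 0), (0, 1, 0, 1, 0),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def feature_class_entropy(rows, i):
    """H(class | feature i): class uncertainty left after observing feature i.

    Assumed definition; the slide only names the measure.
    """
    total = len(rows)
    result = 0.0
    for value in {row[i] for row in rows}:
        subset = [row[-1] for row in rows if row[i] == value]
        result += len(subset) / total * entropy(subset)
    return result


ENTROPIES = {f"f{i + 1}": feature_class_entropy(ROWS, i) for i in range(4)}
for name, h in ENTROPIES.items():
    print(f"{name}: H(class | {name}) = {h:.3f} bits")
```

Under this definition f3 leaves the least uncertainty about the class (about 0.811 bits), while f1, f2 and f4 leave the full 1 bit. Because a lower entropy means a more informative feature, a filter that removes features below a threshold would more naturally use a relevance score such as H(class) minus H(class | feature i).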


Wrapper approach

• Specific to a classification algorithm
• The search for a good feature subset is guided by a search algorithm
• The algorithm uses the evaluation of the classifier as a guide to find good feature subsets
• Search algorithm examples: sequential forward or backward search, genetic algorithms

Sequential backward elimination
  – Starts with the set of all features
  – Iteratively discards the feature whose removal results in the best classification performance


Wrapper approach

Full feature set: f1,f2,f3,f4

f2,f3,f4  0.7      f1,f3,f4  0.8      f1,f2,f4  0.1      f1,f2,f3  0.75

f3,f4  0.85        f1,f4  0.1         f1,f3  0.8

f4  0.2            f3  0.7
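The elimination steps in this example can be traced with a short sketch. It assumes the numbers on the slide are classification performances of the listed subsets; the lookup table stands in for training and evaluating a real classifier, and the function names are hypothetical.

```python
# Scores copied from the slide. In a real wrapper each score would come from
# training and evaluating the classifier on that feature subset (assumption:
# the lookup table below is only a stand-in for that step).
SCORES = {
    frozenset({"f2", "f3", "f4"}): 0.70,
    frozenset({"f1", "f3", "f4"}): 0.80,
    frozenset({"f1", "f2", "f4"}): 0.10,
    frozenset({"f1", "f2", "f3"}): 0.75,
    frozenset({"f3", "f4"}): 0.85,
    frozenset({"f1", "f4"}): 0.10,
    frozenset({"f1", "f3"}): 0.80,
    frozenset({"f4"}): 0.20,
    frozenset({"f3"}): 0.70,
}


def evaluate(subset):
    """Return the (pre-computed) classification performance of a subset."""
    return SCORES[frozenset(subset)]


def backward_elimination(features, evaluate):
    """Greedily drop the feature whose removal gives the best score."""
    current = frozenset(features)
    path = []
    while len(current) > 1:
        candidates = [current - {f} for f in sorted(current)]
        current = max(candidates, key=evaluate)
        path.append((current, evaluate(current)))
    return path


PATH = backward_elimination(["f1", "f2", "f3", "f4"], evaluate)
BEST_SUBSET, BEST_SCORE = max(PATH, key=lambda step: step[1])
for subset, score in PATH:
    print(sorted(subset), score)
print("best:", sorted(BEST_SUBSET), BEST_SCORE)
```

The search visits f1,f3,f4 (0.8), then f3,f4 (0.85), then f3 (0.7), so the best subset seen along the way is f3,f4. Only 9 of the 14 possible proper subsets are ever evaluated, which is the point of the greedy search.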


Classification Methods

- Decision trees

- Hidden Markov Models (HMMs)

- Support vector machines

- Artificial Neural Networks

- Bayesian methods

- …


Decision Trees

• Simple, practical and easy to interpret
• Given a set of instances (with a set of features), a tree is constructed with internal nodes as the features and the leaves as the classes


Example Dataset: Shall we play golf?

Instance   Attributes / Features                        Class
day        outlook    temperature  humidity  windy      Play Golf?
1          sunny      hot          high      FALSE      no
2          sunny      hot          high      TRUE       no
3          overcast   hot          high      FALSE      yes
4          rainy      mild         high      FALSE      yes
5          rainy      cool         normal    FALSE      yes
6          rainy      cool         normal    TRUE       no
7          overcast   cool         normal    TRUE       yes
8          sunny      mild         high      FALSE      no
9          sunny      cool         normal    FALSE      yes
10         rainy      mild         normal    FALSE      yes
11         sunny      mild         normal    TRUE       yes
12         overcast   mild         high      TRUE       yes
13         overcast   hot          normal    FALSE      yes
14         rainy      mild         high      TRUE       no
today      sunny      cool         high      TRUE       ?


Example: Shall we play golf today?

WEKA data file (arff format) :

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no


Feature compositions

[Figure: YES / NO composition for each attribute value (sunny, overcast, rainy; hot, cool, mild; high, normal; True, False) and for the whole dataset.]
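The compositions behind this figure can be recomputed from the golf dataset. The sketch below counts yes / no instances per attribute value and, as an added illustration that is not on the slide, derives the information gain of each attribute, one common criterion for choosing the attribute at the root of a decision tree.

```python
import math
from collections import Counter, defaultdict

FEATURES = ["outlook", "temperature", "humidity", "windy"]

# The 14 instances of the golf dataset: four feature values, then the class.
DATA = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]


def composition(data, feature_index):
    """Count classes separately for every value of one feature."""
    counts = defaultdict(Counter)
    for row in data:
        counts[row[feature_index]][row[-1]] += 1
    return counts


def entropy(class_counts):
    """Shannon entropy (in bits) of a Counter of class frequencies."""
    total = sum(class_counts.values())
    return -sum(n / total * math.log2(n / total) for n in class_counts.values() if n)


def information_gain(data, feature_index):
    """Reduction in class entropy obtained by splitting on one feature."""
    overall = Counter(row[-1] for row in data)
    remainder = sum(
        sum(counts.values()) / len(data) * entropy(counts)
        for counts in composition(data, feature_index).values()
    )
    return entropy(overall) - remainder


GAINS = {name: information_gain(DATA, i) for i, name in enumerate(FEATURES)}
for i, name in enumerate(FEATURES):
    per_value = {value: dict(c) for value, c in composition(DATA, i).items()}
    print(f"{name}: {per_value}  gain = {GAINS[name]:.3f}")
```

On this dataset outlook has the largest gain, which is consistent with it being a good first attribute to split on.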


Decision Trees

J48 pruned tree
------------------

outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves : 5
Size of the tree : 8

Callouts in the figure: Attributes / Features, Attribute Values, Classes
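Reading the tree amounts to following one branch per attribute test until a leaf is reached. As a minimal sketch, the J48 tree above can be hand-translated into nested conditions (the function name is illustrative; this is not WEKA output) and applied to the unlabeled "today" instance from the golf dataset.

```python
def classify(instance):
    """Follow the pruned J48 tree from the root to a leaf."""
    if instance["outlook"] == "sunny":
        return "no" if instance["humidity"] == "high" else "yes"
    if instance["outlook"] == "overcast":
        return "yes"
    # Remaining branch: outlook = rainy
    return "no" if instance["windy"] == "TRUE" else "yes"


# The "today" row of the dataset; temperature is never tested by this tree.
today = {"outlook": "sunny", "temperature": "cool", "humidity": "high", "windy": "TRUE"}
PREDICTION = classify(today)
print("Play golf today?", PREDICTION)
```

For today (sunny, high humidity) the tree predicts "no".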


Artificial Neural Networks (ANNs)

Artificial Neuron

Neural Network


Overfitting

Overfitting: A classifier that performs well on the training examples, but poorly on new examples.

Training and testing on the same data will generally produce a classifier that looks good on this dataset but is heavily overfitted.

To avoid overfitting:
• Use separate training and testing data
• Use cross-validation
• Use the simplest model possible


Performance Evaluation

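Cross-Validation (10-fold)

[Diagram: the data are split into a training set (9/10) and a test set (1/10); the ML classifier is built on the training set and its performance is evaluated on the test set; this is repeated 10×.]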
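The splitting scheme in this diagram can be sketched without any library. The helper below is hypothetical and deliberately simple: it assigns instances to folds in a fixed, interleaved order, whereas practical tools usually shuffle or stratify the data first.

```python
def k_fold_splits(n_instances, k=10):
    """Yield (train_indices, test_indices) for each of the k folds.

    Every instance lands in the test set exactly once; the remaining
    instances form the training set for that round.
    """
    folds = [list(range(start, n_instances, k)) for start in range(k)]
    for test_indices in folds:
        held_out = set(test_indices)
        train_indices = [i for i in range(n_instances) if i not in held_out]
        yield train_indices, test_indices


# Example with 20 instances: each round trains on 18 and tests on 2.
SPLITS = list(k_fold_splits(20, k=10))
for train, test in SPLITS[:3]:
    print(f"train on {len(train)} instances, test on {test}")
```

The performance measured on the ten test sets is then typically averaged to give one estimate for the classifier.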


Performance Evaluation

Confusion Matrix

TP True Positives

TN True Negatives

FP False Positives

FN False Negatives

                     Predicted Label
                     positive   negative
Known    positive    TP         FN
Label    negative    FP         TN
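Filling in the four cells is a matter of comparing each known label with the corresponding prediction. A minimal sketch, with made-up labels and an illustrative function name:

```python
def confusion_counts(known, predicted, positive="yes"):
    """Count TP, FN, FP and TN for a two-class problem."""
    tp = fn = fp = tn = 0
    for k, p in zip(known, predicted):
        if k == positive and p == positive:
            tp += 1  # known positive, predicted positive
        elif k == positive:
            fn += 1  # known positive, predicted negative
        elif p == positive:
            fp += 1  # known negative, predicted positive
        else:
            tn += 1  # known negative, predicted negative
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


# Hypothetical labels, for illustration only.
known = ["yes", "yes", "yes", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no"]
COUNTS = confusion_counts(known, predicted)
print(COUNTS)
```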


Performance Evaluation

• Precision (PPV): TP / (TP + FP)
  – Percentage of correct positive predictions

• Recall / Sensitivity: TP / (TP + FN)
  – Percentage of positively labeled instances, also predicted as positive

• Specificity: TN / (TN + FP)
  – Percentage of negatively labeled instances, also predicted as negative

• Accuracy: (TP + TN) / (TP + TN + FP + FN)
  – Percentage of correct predictions

• Correlation Coefficient: cc = (TP * TN – FP * FN) / sqrt( (TP+FP) * (FP+TN) * (TN+FN) * (FN+TP) )
  – -1 ≤ cc ≤ 1
  – cc = 1: no FP or FN
  – cc = 0: random
  – cc = -1: only FP and FN
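These measures follow directly from the four confusion-matrix counts. The sketch below uses made-up counts and takes the correlation coefficient to be the Matthews correlation coefficient, whose denominator is the square root of the four-factor product.

```python
import math


def metrics(tp, tn, fp, fn):
    """Compute the performance measures from the confusion-matrix counts.

    No guard against empty rows or columns: a zero denominator raises
    ZeroDivisionError in this sketch.
    """
    denominator = math.sqrt((tp + fp) * (fp + tn) * (tn + fn) * (fn + tp))
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "cc": (tp * tn - fp * fn) / denominator,
    }


# Hypothetical counts, for illustration only.
RESULT = metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in RESULT.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 0.85 while the correlation coefficient is about 0.70; the two can diverge much further when the classes are unbalanced.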


ROC – Receiver Operating Characteristic

[Plot axes. x: False Positive Rate, (1 - Specificity) = FP / (FP + TN); y: True Positive Rate, Sensitivity = TP / (TP + FN)]
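A ROC curve is traced by lowering the decision threshold step by step and recording the false and true positive rates at each step. The sketch below uses made-up classifier scores and illustrative function names, and it assumes all scores are distinct.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold.

    labels are 1 (positive) or 0 (negative). Assumes distinct scores;
    tied scores would have to be processed as one group.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
POINTS = roc_points(scores, labels)
AUC = area_under_curve(POINTS)
print(POINTS)
print(f"AUC = {AUC:.3f}")
```

A random classifier follows the diagonal (area 0.5), and a perfect one passes through the top-left corner (area 1).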


ROC – Receiver Operating Characteristic

[Plot axes. x: 1 - Specificity; y: Sensitivity]


Case Study - Splice Site Prediction

Splice site prediction:

Correctly identify the borders of introns and exons in genes (splice sites)

• Important for gene prediction

• Split up into 2 tasks:
  – Donor prediction (exon -> intron)
  – Acceptor prediction (intron -> exon)


Case Study - Splice Site Prediction

• Splice sites are characterized by a conserved dinucleotide in the intron part of the sequence

– Donor sites: GT

– Acceptor sites: AG

• Classification problem:
  – Distinguish between true GT, AG and false GT, AG.


Case Study - Splice Site Prediction

Features:

• Position dependent features
  e.g. an A on position 1, C on position 17, …

• Position independent features
  e.g. subsequence “TCG” occurs, “GAG” occurs, …

atcgatcagtatcgat GT ctgagctatgag

[Figure: the example sequence is shown once for each feature type, with positions 1, 2, 3, 17 and 28 marked.]
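Both feature types can be extracted from the example sequence in a few lines. This sketch assumes that positions are numbered over the flanking context only, skipping the GT itself, which is the numbering under which position 1 is an A and position 17 is a C; the function names are illustrative.

```python
UPSTREAM = "atcgatcagtatcgat"
SITE = "gt"
DOWNSTREAM = "ctgagctatgag"

# Assumed numbering: positions count the flanking context, not the GT itself.
CONTEXT = UPSTREAM + DOWNSTREAM


def position_dependent(context):
    """Set of (position, base) pairs, with positions starting at 1."""
    return {(pos, base) for pos, base in enumerate(context, start=1)}


def position_independent(sequence, k=3):
    """Set of all subsequences of length k occurring anywhere in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


PD = position_dependent(CONTEXT)
PI = position_independent(UPSTREAM + SITE + DOWNSTREAM)

print((1, "a") in PD, (17, "c") in PD)  # position dependent features
print("tcg" in PI, "gag" in PI)         # position independent features
```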


Original Data – Human Acceptor Splice Sites

>HUMGLUT4B_3535
GGGCCCCTAGCGGAAGGAAAAAAATCATGGTTCCATGTGACATGCTGTGTCTTTGTGTCTGCCTGTTCAGGATGGGGAACCCCCTCAGCA
>HUMGLUT4B_3763
GAGGACAGGTGTCTCGGGGGTGGTGGAAAGGGGACGGTCTGCAGGAAATCTGTCCTCTGCTGTCCCCCAGGTGATTGAACAGAGCTACAA
>HUMGLUT4B_4028
TGGGGGAAACAGGAAGGGAGCCACTGCTGGGTGCCCTCACCCTCACAGCCTCACTCTGTCTGCCTGCCAGGAAAAGGGCCATGCTGGTCA
>HUMGLUT4B_4276
TGGGCTTTCAGATGGGAATGGACACCTGCCCTCAGCCCTCTCTTCTTCCCTCGCCCAGGGCTGACATCAGGGCTGGTGCCCATGTACGTG
>HUMGLUT4B_4507
ATATGGTGGGCTTCCAAGGTAAGGCAGAAGGGCTGAGTGACCTGCCTTCTTTCCCAACCTTCTCCCACAGGTGCTGGGCTTGGAGTCCCT
>HUMGLUT4B_4775
GCCTCCGCCTCATCTTGCTAGCACCTGGCTTCCTCTCAGGTCCCCTCAGGCCTGACCTTCCCTTCTCCAGGTCTGAAGCGCCTGACAGGC
>HUMGLUT4B_5125
CCAGCCTGTTGTGGCTGGAGTAGAGGAAGGGGCATTCCTGCCATCACTTCTTCTTCTCCCCCACCTCTAGGTTTTCTATTATTCGACCAG
>HUMGLUT4B_5378
CCTCACCCACGCGGCCCCTCCTACTTCCCGTGCCCAAAAGGCTGGGGTCAAGCTCCGACTCTCCCCGCAGGTGTTGTTGGTGGAGCGGGC
>HUMGLUT4B_5995
CTGAGTTGAGGGCAAGGGAAGATCAGAAAGGCCTCAACTGGATTCTCCACCCTCCCTGTCTGGCCCCTAGGAGCGAGTTCCAGCCATGAG
>HUMGLUT4B_6716
CTGGTTGCCTGAAACTACCCCTTCCCTCCCCACCTCACTCCGTCAACACCTCTTTCTCCACCTGTCCCAGGAGGCTATGGGGCCCTACGT
>HSRPS6G_1493
CTTTGTAGATGGCTCTACAATTACCTGTATAGATAGTTTCGTAAACTATTTCCCCCCTTTTAATCCTTAGCTGAACATCTCCTTCCCAGC
[...]


Arff Data File - WEKA

@RELATION splice-train

@ATTRIBUTE -68_A {0,1}
@ATTRIBUTE -68_T {0,1}
@ATTRIBUTE -68_C {0,1}
@ATTRIBUTE -68_G {0,1}
@ATTRIBUTE -67_A {0,1}
@ATTRIBUTE -67_T {0,1}
@ATTRIBUTE -67_C {0,1}
@ATTRIBUTE -67_G {0,1}
[...]
@ATTRIBUTE 20_A {0,1}
@ATTRIBUTE 20_T {0,1}
@ATTRIBUTE 20_C {0,1}
@ATTRIBUTE 20_G {0,1}
@ATTRIBUTE class {true,false}

@DATA
0,0,0,1,0,0,0,1, [...] ,1,0,0,0,true
0,0,0,1,1,0,0,0, [...] ,1,0,0,0,true
0,1,0,0,0,0,0,1, [...] ,1,0,0,0,true
0,1,0,0,0,0,0,1, [...] ,0,0,0,1,true
[...]
1,0,0,0,0,1,0,0, [...] ,0,1,0,0,true
0,0,0,1,0,0,1,0, [...] ,0,0,1,0,true
0,0,1,0,0,0,1,0, [...] ,0,0,0,1,true
0,0,1,0,0,0,1,0, [...] ,0,0,1,0,true

The original sequence files in FASTA format have been converted to represent the four DNA bases in a binary fashion:

A:   1 0 0 0
T:   0 1 0 0
C:   0 0 1 0
G:   0 0 0 1
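The conversion from a FASTA sequence to one ARFF data row can be sketched as follows. The layout is an assumption inferred from the attribute names (-68 to 20) and the 90-nucleotide sequences: 68 upstream positions, the AG dinucleotide itself left out, then 20 downstream positions. The function names are illustrative and this is not the script used to build the course files.

```python
# Binary code per base, in the attribute order A, T, C, G used by the ARFF file.
ENCODING = {"A": (1, 0, 0, 0), "T": (0, 1, 0, 0), "C": (0, 0, 1, 0), "G": (0, 0, 0, 1)}


def encode(sequence):
    """Concatenate the 4-bit code of every base in the sequence."""
    bits = []
    for base in sequence.upper():
        bits.extend(ENCODING[base])
    return bits


def acceptor_features(sequence, upstream=68, site_length=2):
    """Encode a 90-nt acceptor window, skipping the AG itself (assumed layout)."""
    context = sequence[:upstream] + sequence[upstream + site_length:]
    return encode(context)


# First sequence of the acceptor site file (HUMGLUT4B_3535).
SEQ = (
    "GGGCCCCTAGCGGAAGGAAAAAAATCATGGTTCCATGTGACATGC"
    "TGTGTCTTTGTGTCTGCCTGTTCAGGATGGGGAACCCCCTCAGCA"
)
FEATURES = acceptor_features(SEQ)
print(len(FEATURES))                     # number of binary features
print(",".join(map(str, FEATURES[:8])))  # start of the ARFF data row
```

Under this assumption the first eight and last four values agree with the first @DATA row shown above (0,0,0,1,0,0,0,1 … 1,0,0,0).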


Case Study - Splice Site Prediction

• Local context of 88 nucleotides around the splice site

• 88 position dependent features

• A=1000, T=0100, C=0010, G=0001

→ 88 × 4 = 352 binary features

• Reduce the dataset to contain fewer but relevant features


352 Binary features


15 Binary features


Case Study – Splice Site Sequence Logos

[Figure: sequence logos for acceptor sites and donor sites, with positions labelled relative to the splice site.]


Exercise:

• Building a prediction tool for human mRNA splice sites

• Feature selection for classification of splice sites

• Tool: The WEKA machine learning toolkit.

• Go to http://www.cbs.dtu.dk/~kerner/GeneDisc_Course_2007_MJK/ and follow the instructions


Acknowledgements

Slides and Exercises Adapted from and inspired by:

Søren Brunak

David Gilbert, Aik Choon Tan

Yvan Saeys