Cancerous Tissue Classification (Using Microarray Gene Expression) Meenal Goyal Pankhuri Goyal

Classification of cancerous and non cancerous tissues


Binary classification of cancerous and non-cancerous tissues



Background

● Decoding gene expression is an important active research area in molecular biology and bioinformatics.

● Microarray technology is used to measure gene expression levels in different cells.

● Applications:
   ○ Tissue classification (cancer vs. non-cancer).
   ○ Identifying novel targets for drug design.
   ○ Extracting and analysing patterns.


Problem

● Binary classification of cancerous and normal tissue.

● Investigate feature selection and classification (supervised and unsupervised) algorithms.

● Detecting cancer in its early stages improves diagnosis, prognosis, and treatment planning.

● Challenges:
   ○ High dimension of the input features.
   ○ Limited number of tissue samples.


Dataset

● GSE3 (renal clear cell carcinoma):
   ○ Modality: numeric
   ○ Number of features: 36,864 genes
   ○ Number of samples: 81 cancerous and 90 normal

● High-dimensional feature space, not sparse.

● Cell (i, j) of the data matrix holds the expression level of gene j in tissue i.


Classification Pipeline

● Dataset: GSE3

● Feature selection:
   1. T-test
   2. Volcano plot
   3. mRMR
   4. PCA
   5. Weighted k-means (Fisher weights)

● Supervised learning (KNN, SVM, boosting)

● Unsupervised learning (k-means, hierarchical clustering)

● Model for GSE3

● Resulting error rate and accuracy


Feature Selection Methods


T-Test

● T-scores are computed for each feature.

● Null hypothesis: both classes have equal means.

● P-value: the probability of obtaining a result at least as extreme as the one observed if the null hypothesis is true.

● Features with p-values <= 0.01 are selected.

● GSE3 data: 916 features selected.
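
The selection rule on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes Welch's unequal-variance t-statistic (the slide does not say which variant was used) and approximates the two-sided p-value with a standard normal distribution, which is only reasonable for class sizes such as the 81 and 90 samples here. The function names and the toy data are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two samples (assumed variant)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

def approx_p_value(t):
    """Two-sided p-value using a normal approximation to the t-distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

def select_by_t_test(X, y, alpha=0.01):
    """Return indices of features (columns of X) whose p-value is <= alpha.

    X is a list of samples (rows of expression values); y holds 0/1 labels.
    """
    selected = []
    for j in range(len(X[0])):
        cancer = [row[j] for row, label in zip(X, y) if label == 1]
        normal = [row[j] for row, label in zip(X, y) if label == 0]
        if approx_p_value(welch_t(cancer, normal)) <= alpha:
            selected.append(j)
    return selected

# Toy data: feature 0 differs between the classes, feature 1 does not.
X = [[5.0, 1.0], [5.2, 1.2], [4.9, 0.9], [5.1, 1.1],
     [1.0, 1.1], [1.1, 0.9], [0.9, 1.2], [1.2, 1.0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
print(select_by_t_test(X, y))
```

For small sample sizes the normal approximation understates p-values, so an exact t-distribution would be the safer choice there.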


Volcano Plot

● Dataset: GSE3

● P-values < 0.01

● Fold change = 2

● Features extracted: 492
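
A volcano plot places each gene at (log2 fold change, -log10 p-value), and genes that pass both thresholds are kept. The sketch below shows one way to apply the two cut-offs. It assumes that fold change is the ratio of class means, that the threshold is symmetric (at least 2-fold up or down), and that p-values come from a separate test. All names and numbers are hypothetical.

```python
import math

def log2_fold_change(cancer_values, normal_values):
    """log2 of the ratio of class means (assumes strictly positive values)."""
    mean_cancer = sum(cancer_values) / len(cancer_values)
    mean_normal = sum(normal_values) / len(normal_values)
    return math.log2(mean_cancer / mean_normal)

def volcano_point(p_value, log2_fc):
    """Coordinates of one gene on the plot: (log2 fold change, -log10 p)."""
    return log2_fc, -math.log10(p_value)

def volcano_select(p_values, log2_fcs, p_max=0.01, min_fold=2.0):
    """Indices of genes with p < p_max and at least a min_fold change in either direction."""
    cutoff = math.log2(min_fold)
    return [j for j, (p, lfc) in enumerate(zip(p_values, log2_fcs))
            if p < p_max and abs(lfc) >= cutoff]

# Hypothetical p-values and expression values for three genes.
p_values = [0.001, 0.001, 0.2]
log2_fcs = [
    log2_fold_change([8.0, 8.0], [2.0, 2.0]),  # 4-fold change, small p-value
    log2_fold_change([3.0, 3.0], [2.0, 2.0]),  # 1.5-fold change, small p-value
    log2_fold_change([8.0, 8.0], [2.0, 2.0]),  # 4-fold change, large p-value
]
print(volcano_select(p_values, log2_fcs))
```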


Minimum redundancy-maximum relevance (MRMR)

● F-test value is defined by

● The top 20 features are selected by F-test score.

● The remaining 130 features are extracted using a linear incremental search algorithm (MRMR-FDM).

● Total features selected for GSE3 data: 150


PCA

● For GSE3, the top 3,000 dimensions from the two-sample t-test are selected as input to PCA.

● Features selected: 170


Weighted-kMeans (using Fisher Weights)

● The top 10,000 features from the two-sample t-test are selected for Fisher analysis.

● Fisher score calculated by $F(w) = \frac{(u_1 - u_2)^2}{s_1^2 + s_2^2}$

● Weighted k-means applied on the feature space, using Fisher values as weights.

● The centroid of each cluster is selected as a desired feature.

● Total features for GSE3 dataset: 200
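
The Fisher score on this slide can be computed per feature as in the sketch below. It reads u1, u2 as the class means and s1, s2 as the class standard deviations, and it assumes population variances, since the slide does not say which estimator was used. The weighted k-means step is not shown. Function names and the toy data are hypothetical.

```python
from statistics import mean, pvariance

def fisher_score(class1_values, class2_values):
    """Fisher score of one feature: (u1 - u2)^2 / (s1^2 + s2^2)."""
    u1, u2 = mean(class1_values), mean(class2_values)
    var1, var2 = pvariance(class1_values), pvariance(class2_values)
    return (u1 - u2) ** 2 / (var1 + var2)

def fisher_scores(X, y):
    """Fisher score of every column of X, given 0/1 labels y."""
    scores = []
    for j in range(len(X[0])):
        class1 = [row[j] for row, label in zip(X, y) if label == 1]
        class2 = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(fisher_score(class1, class2))
    return scores

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[4.0, 1.0], [6.0, 3.0], [1.0, 1.0], [3.0, 3.0]]
y = [1, 1, 0, 0]
print(fisher_scores(X, y))
```

A larger score means the class means are far apart relative to the within-class spread, which is why the score is usable as a feature weight.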


Classification Algorithms


K-nearest neighbours (k-NN)

● Test / train data divided using:
   ○ Holdout -> test : train = 1 : 1
   ○ K-fold -> k = 5, test : train = 1 : 4

● K parameter varied from k = 1 to 10.

● Distance metric: Euclidean
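
A minimal sketch of k-NN with Euclidean distance, evaluated on a 1 : 1 holdout split for several values of k. The data and function names are hypothetical, and ties in the vote are broken in favour of the nearer neighbours, which is an assumption rather than something the slides specify.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k):
    """Majority label among the k training samples nearest to query."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def error_rate(train_X, train_y, test_X, test_y, k):
    """Fraction of test samples that are misclassified."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != target
                for x, target in zip(test_X, test_y))
    return wrong / len(test_y)

# Toy holdout split with test : train = 1 : 1.
train_X = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]]
train_y = [0, 0, 1, 1]
test_X = [[0.1, 0.1], [0.3, 0.0], [4.8, 5.2], [5.2, 5.1]]
test_y = [0, 0, 1, 1]

for k in range(1, 4):
    print(k, error_rate(train_X, train_y, test_X, test_y, k))
```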


KNN misclassification error rate plotted for all feature selection methods


Support Vector Machine (SVM)

● Test / train data divided using:
   ○ Holdout -> test : train = 0.2
   ○ K-fold -> k = 5, test : train = 1 : 4

● Kernel functions used:
   ○ Linear
   ○ Polynomial: order = 2
   ○ Radial

● C parameter varied from 0.01 to 0.3 (for the linear kernel, holdout method).
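
To illustrate the role of the C parameter, the sketch below trains a linear soft-margin SVM by plain subgradient descent on the objective 0.5*||w||^2 + C * (sum of hinge losses). The slides do not state which solver was used, so this optimiser is a stand-in, and the polynomial and radial kernels are not covered. Labels are +1 / -1; the learning rate, epoch count, and toy data are hypothetical.

```python
def train_linear_svm(X, y, C=0.3, lr=0.01, epochs=200):
    """Minimise 0.5*||w||^2 + C * sum(hinge losses) by subgradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)   # gradient of the regulariser 0.5*||w||^2
        grad_b = 0.0
        for row, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, row)) + b)
            if margin < 1:  # sample is inside the margin or misclassified
                grad_w = [g - C * label * xi for g, xi in zip(grad_w, row)]
                grad_b -= C * label
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def svm_predict(w, b, row):
    """Predicted label (+1 or -1) from the sign of the decision function."""
    return 1 if sum(wi * xi for wi, xi in zip(w, row)) + b >= 0 else -1

# Toy linearly separable data with labels +1 / -1.
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]]
y = [1, 1, -1, -1]

# Sweep C over the range quoted on the slide.
for C in (0.01, 0.1, 0.3):
    w, b = train_linear_svm(X, y, C=C)
    errors = sum(svm_predict(w, b, row) != label for row, label in zip(X, y))
    print(C, errors / len(y))
```

A small C tolerates more margin violations in exchange for a wider margin; a large C penalises violations more heavily.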


Misclassification error rate vs c-parameter for all feature selection methods. (Linear kernel)


Accuracy matrix for SVM (one panel per feature selection method: T-test, volcano plot, MRMR, PCA, weighted k-means using Fisher weights)

Best accuracy is observed with the linear kernel in all cases.


AdaBoost

● Test / train data divided using holdout with ratio 1 : 1.

● Weak learner: decision tree

● Number of weak learners used: 100
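
A minimal sketch of AdaBoost with 100 weak learners. It assumes the weak learner is a depth-1 decision tree (a decision stump), since the slides do not give the tree depth, and it uses labels +1 / -1. The function names and the one-dimensional toy data are hypothetical.

```python
import math

def stump_predict(stump, row):
    """Predict +1 / -1 with a stump (error, feature, threshold, polarity)."""
    _, feature, threshold, polarity = stump
    return polarity if row[feature] <= threshold else -polarity

def best_stump(X, y, weights):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                candidate = (0.0, feature, threshold, polarity)
                error = sum(w for w, row, target in zip(weights, X, y)
                            if stump_predict(candidate, row) != target)
                if best is None or error < best[0]:
                    best = (error, feature, threshold, polarity)
    return best

def adaboost_fit(X, y, n_learners=100):
    """Return a list of (alpha, stump) pairs."""
    weights = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(n_learners):
        stump = best_stump(X, y, weights)
        error = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalise.
        weights = [w * math.exp(-alpha * target * stump_predict(stump, row))
                   for w, row, target in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can classify correctly.
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
ensemble = adaboost_fit(X, y, n_learners=100)
print([adaboost_predict(ensemble, row) for row in X])
```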


K-Means

● Test / train set divided as:
   ○ Holdout -> test : train = 1 : 1
   ○ K-fold -> k = 5, test : train = 1 : 4

● K parameter varied from k = 1 to 5.

Objective function
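
The formula for the objective did not survive in this transcript. The sketch below assumes the standard k-means objective, the sum of squared Euclidean distances from each point to its assigned centroid, and minimises it with Lloyd's algorithm. Initial centroids are passed in explicitly to keep the example deterministic; all names and data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def update(points, labels, centroids):
    """Mean of each cluster; an empty cluster keeps its old centroid."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centroids.append([sum(col) / len(members) for col in zip(*members)])
        else:
            new_centroids.append(old)
    return new_centroids

def objective(points, labels, centroids):
    """Within-cluster sum of squared distances (assumed objective)."""
    return sum(squared_distance(p, centroids[label])
               for p, label in zip(points, labels))

def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and update until stable."""
    centroids = [list(c) for c in init_centroids]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels, centroids = kmeans(points, init_centroids=[points[0], points[2]])
print(labels, centroids, objective(points, labels, centroids))
```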


Misclassification error vs k for all feature selection methods.


Hierarchical Clustering

● A cancer type can contain an arbitrary number of subtypes, and it is usually unknown how many subtypes a specific cancer has or what they are.

● Green, black, and red in the heat maps indicate low, medium, and high expression, respectively, of the corresponding gene in the sample.

● Accuracy is lower than that of the other algorithms.


Figure panels, one per feature selection method: T-test, volcano plot, MRMR, PCA, weighted k-means


Thank you