Evolutionary Algorithms for Finding Optimal Gene Sets in Microarray Prediction
J. M. Deutsch
Presented by: Shruti Sharma
Distinguishing one cell type from another is very important and can be used in many diseases, such as cancer.
cDNA and oligonucleotide microarrays have been used successfully for this purpose, i.e. distinguishing one cell type from another.
Approaches Used to Classify Microarray Data
Artificial Neural Network
Logistic Regression
Support Vector Machines
Coupled Two-Way Clustering
Weighted Votes-Neighborhood Analysis
Feature Selection Technique
Choosing the Predictor Genes
Choosing the predictor genes is very important for classifying samples using microarrays.
Too few genes
Too many genes.
Choosing an optimal set of genes
Neighborhood Analysis
Principal Component Analysis
Gene Shaving
Gene selection is an important part of the prediction algorithm.
Not all high-ranked genes are chosen.
A replication algorithm, as used in quantum simulations and protein folding, is applied.
It considers groups of predictors and finds the relevant genes.
It adds and deletes genes continuously until optimal performance is achieved.
For small round blue cell tumors the number of genes was reduced from 96 to 15; similarly good results were obtained for B-cell lymphoma.
GESSES (Genetic Evolution of Sub Sets of Expressed Data)
Overview of the Algorithm
The predictors that make the fewest mistakes on test data are the best predictors.
This algorithm uses a scoring scheme that gives higher scores when more data points are correctly classified.
LOOCV (leave-one-out cross-validation) is used to calculate the score for a predictor.
The data is separated into clusters; each cluster corresponds to one type of cancer.
Wrapper method: search for the highest-scoring predictor within a subset of the genes.
Filter method: filter the gene pool and collect the most likely candidate genes.
Terminology
Dt = {D1, D2, …}: sample of microarray training data.
N: number of genes in the training data.
Gt: complete set of genes (1 through N).
Gα: subset of the complete gene set (α1, α2, …, αm).
T: each sample D has a classification of type T.
Predictor
P: a predictor, a function that takes data D and outputs a type T.
A k-nearest-neighbour search is used to construct the predictor.
The training data Dt is compared with the target data D by computing the Euclidean distance between D and each vector in the training set.
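The k-nearest-neighbour predictor described on this slide can be sketched in Python (the gene-expression vectors and class labels below are made-up toy values, and k = 3 is an illustrative choice):

```python
import numpy as np

def knn_predict(D, Dt, labels, k=3):
    """Predict the class of sample D (a vector of expression levels for
    the chosen gene subset) by majority vote among the k training
    samples in Dt closest to D in Euclidean distance."""
    dists = np.linalg.norm(Dt - D, axis=1)   # distance to each training vector
    nearest = np.argsort(dists)[:k]          # indices of the k nearest samples
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)  # majority class among the k

# Toy example: two classes separated along the first gene.
Dt = np.array([[0.1, 1.0], [0.2, 0.9], [0.9, 0.2], [1.0, 0.1]])
labels = ["ALL", "ALL", "AML", "AML"]
print(knn_predict(np.array([0.15, 0.95]), Dt, labels))  # -> ALL
```

The distance computation mirrors the slide: each training vector is compared to the target sample D in the subspace of selected genes.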
Scoring Function
Iteratively single out one data point, treating it as pseudo test data. If the point is predicted correctly, add 1 to the score.
Consider the shortest distances, denoted d1, d2, …, dt.
Take the two shortest distances di and dj and add C·|di² − dj²|,
where C is a constant chosen so that each added term is ≪ 1.
Looping over the entire data set, we compute the total score.
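The leave-one-out scoring loop above can be sketched as follows; note this is one reading of the slides (in particular, whether the C-term is added for every held-out point or only some is an assumption here):

```python
import numpy as np

def loocv_score(Dt, labels, C=1e-4):
    """Leave-one-out score for a gene subset: hold out each sample in
    turn, classify it by its nearest neighbour among the rest, and add
    1 for a correct prediction plus a small term C*|di^2 - dj^2| built
    from the two shortest distances di, dj (C is chosen so each added
    term is much smaller than 1)."""
    score = 0.0
    for i in range(len(Dt)):
        dists = np.linalg.norm(Dt - Dt[i], axis=1)
        dists[i] = np.inf                          # exclude the held-out point
        order = np.argsort(dists)
        di, dj = dists[order[0]], dists[order[1]]  # two shortest distances
        if labels[order[0]] == labels[i]:          # nearest-neighbour call correct?
            score += 1.0
        score += C * abs(di**2 - dj**2)            # small tie-breaking term, << 1
    return score

# Two well-separated clusters: every held-out point is classified correctly,
# so the score is 4 plus a small distance-dependent contribution.
Dt = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = ["t1", "t1", "t2", "t2"]
```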
Initial Gene Pool
To distinguish one class from another, we rank all genes by how well their expression levels separate the training samples.
Some genes give high levels for one type (ti) and low levels for another (tj).
Sometimes the two classes separate cleanly, but often their expression levels overlap.
Genes with overlaps are ranked low.
The best M genes are chosen as the initial pool.
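One simple way to realize this ranking (a sketch; the paper's exact separation criterion may differ) is to count how many training samples fall in the region where the two classes' expression ranges overlap:

```python
import numpy as np

def overlap_count(expr, labels, hi_class):
    """Count training samples whose expression level for one gene falls
    inside the interval where the two classes' ranges overlap. Genes
    whose values separate the classes cleanly score 0 and rank highest;
    heavy overlap gives a large count and a low rank."""
    hi = expr[labels == hi_class]
    lo = expr[labels != hi_class]
    lo_edge = max(hi.min(), lo.min())   # start of the overlap interval
    hi_edge = min(hi.max(), lo.max())   # end of the overlap interval
    if hi_edge < lo_edge:               # ranges are disjoint: perfect separation
        return 0
    return int(np.sum((expr >= lo_edge) & (expr <= hi_edge)))

# Toy data: gene 0 separates the two types cleanly, gene 1 overlaps.
Dt = np.array([[5.0, 1.0], [4.5, 2.0], [1.0, 1.5], [0.5, 2.5]])
labels = np.array(["t1", "t1", "t2", "t2"])
counts = [overlap_count(Dt[:, g], labels, "t1") for g in range(Dt.shape[1])]
M = 1
best = np.argsort(counts)[:M]           # keep the best M genes (lowest overlap)
```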
Replication Algorithm
Suppose there is a group of n gene subspaces ε = {G1, G2, …, Gn}.
1. For each G ∈ ε, produce a new subspace as follows:
A. The set G has genes {g1, g2, …, gm}. We randomly mutate the genes:
ADD: choose a gene gr at random and add it to G, producing a new set G′ of genes {g1, g2, …, gm, gr}.
DELETE: randomly delete a gene from G; the new set has m − 1 genes.
KEEP: leave G unchanged.
Algorithm Continued
B. Compute the difference in score between the original set G and the mutated set G′:
δS = SG′ − SG
C. Compute a weight for G′: w = exp(β δS),
where β is the inverse temperature.
2. Let Z be the sum of these weights; normalize the weights by multiplying them by n/Z.
3. Replicate all the subspaces according to their weights w.
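Steps 1-3 above can be sketched as one generation of the replication loop (the scoring function is passed in as `score`, and the equal-probability choice among ADD/DELETE/KEEP is an illustrative assumption):

```python
import math
import random

def replicate(ensemble, gene_pool, score, beta=1.0):
    """One generation of the replication algorithm: randomly mutate each
    gene subset (ADD a random gene, DELETE one, or KEEP it unchanged),
    weight the mutant by exp(beta * deltaS) with deltaS = S_G' - S_G,
    normalize the weights to sum to n, and replicate each subset in
    proportion to its normalized weight."""
    mutated, weights = [], []
    for G in ensemble:
        Gp = set(G)
        move = random.choice(["add", "delete", "keep"])
        if move == "add":
            Gp.add(random.choice(gene_pool))
        elif move == "delete" and len(Gp) > 1:
            Gp.discard(random.choice(sorted(Gp)))
        dS = score(Gp) - score(G)            # deltaS = S_G' - S_G
        mutated.append(frozenset(Gp))
        weights.append(math.exp(beta * dS))  # w = exp(beta * deltaS)
    n, Z = len(ensemble), sum(weights)
    new_ensemble = []
    for G, w in zip(mutated, weights):
        w = w * n / Z                        # normalized so weights sum to n
        copies = int(w) + (1 if random.random() < w - int(w) else 0)
        new_ensemble.extend([G] * copies)    # replicate in proportion to w
    return new_ensemble
```

High-scoring subsets receive weight above 1 and tend to be duplicated; low-scoring ones receive weight below 1 and tend to disappear, which keeps the ensemble size near n.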
Annealing
As the system evolves, the scoring function gives similar results for different predictors.
To improve convergence, the temperature is made a function of the spread in scores.
The schedule that worked best was to lower the temperature along with the fluctuation of the score from predictor to predictor.
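One way to realize such a schedule (an assumption consistent with the slides, not the paper's exact formula) is to make the temperature proportional to the standard deviation of the scores across predictors:

```python
import statistics

def annealed_beta(scores, floor=1e-6):
    """Inverse temperature for the next generation: the temperature is
    taken proportional to the spread (standard deviation) of the scores
    across predictors, so beta rises as the predictors converge. The
    floor avoids division by zero once every predictor scores the same."""
    spread = statistics.pstdev(scores)
    return 1.0 / max(spread, floor)
```

As the ensemble converges and score fluctuations shrink, β grows, making the replication step increasingly selective.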
Deterministic evolution
This is computationally expensive but performs better than the statistical method.
The statistical method does not explore all combinations of genes and so can miss optimal gene combinations.
1. Construct all distinct unions of the G's in the ensemble ε with individual genes gi from the initial pool, i.e. {g1, g2, …, gm, gi}.
2. Sort all these combinations by their score, keeping the top n of them.
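Steps 1 and 2 can be sketched directly (the tiny gene pool and the `informative`-gene scoring below are made up for illustration):

```python
def deterministic_step(ensemble, gene_pool, score, n_top):
    """One deterministic-evolution step: form every distinct union of a
    subset G in the ensemble with a single gene g_i from the initial
    pool, sort the candidates by score, and keep the n_top best."""
    candidates = {frozenset(G) | {g} for G in ensemble for g in gene_pool}
    return sorted(candidates, key=score, reverse=True)[:n_top]

# Toy run: score a subset by how many "informative" genes it contains.
informative = {"g1", "g4"}
score = lambda G: len(G & informative)
ensemble = [frozenset({"g1"}), frozenset({"g2"})]
pool = ["g1", "g2", "g3", "g4"]
top = deterministic_step(ensemble, pool, score, n_top=3)
```

Because every union is scored, no single-gene extension of the current ensemble can be missed, which is the advantage over the statistical method at the cost of more score evaluations.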
Small Round Blue Cell Tumors
Hard to classify with current techniques.
Hard to diagnose correctly because they all appear similar.
A previous study used 63 samples for training and tested on 20 using principal component analysis.
It reduced the number of genes to 96 yet classified the data perfectly.
Here the same data set of 2308 genes was used, and the initial pool of genes was constructed by considering how well each gene discriminates cancer type i from type j.
For each combination the top 10 genes best able to discriminate were selected.
The statistical method was then repeated.
Statistical Algorithm Using All Mutational Moves
The number of dimensions rises to a maximum of about 16.
The number of wrong classifications per predictor decreases from about 9 to 0.5.
By the end of the process the data was classified successfully.
At this point the temperature falls rapidly.
Deterministic Evolution method
Start with an initial pool of 90 genes.
15 overlapped, leaving 75 in total.
n_top = 150.
Of the top 100 predictors all predicted the test data perfectly.
Average number of genes in predictor was 12.7.
[Figure: average number of genes in a predictor]
Leukemia
There are two types of leukemia:
1. Acute Lymphoblastic Leukemia (ALL)
2. Acute Myeloid Leukemia (AML)
Bone marrow samples were taken from the different cell types.
Each sample was analyzed on Affymetrix microarrays measuring expression levels of 7129 genes.
The data was divided into 38 training data points and 34 test points.
Statistical Algorithm Using All Mutational Moves
The average number of dimensions in a predictor rises to more than 14 by iteration 63.
By iteration 80 it declines to a dimension of only 2.
The average number of mistakes remains constant at 1.
The method predicts the training data perfectly.
Conclusion
With GESSES we could distinguish diffuse large B-cell lymphoma from B-cell lymphoma.
Recently, microarray data has been used to distinguish 14 different kinds of tumors.
GESSES can help make microarray data practical for cancer diagnosis and for many other diseases.