Evolutionary Algorithms for Finding Optimal Gene Sets in Microarray Prediction
J. M. Deutsch
Presented by: Shruti Sharma
Distinguishing one cell type from another is very important and can be used in many diseases, such as cancer.
cDNA and oligonucleotide microarrays have been used successfully for this purpose, i.e. distinguishing one cell type from another.
Approaches Used to Classify Microarray Data
Artificial Neural Network
Logistic Regression
Support Vector Machines
Coupled Two-Way Clustering
Weighted Votes-Neighborhood Analysis
Feature Selection Technique
Choosing the Predictor Genes
Choosing the predictor genes is very important for classifying samples using microarrays.
Too few genes
Too many genes.
Choosing an optimal set of genes
Neighborhood Analysis
Principal Component Analysis
Gene Shaving
Gene selection is an important part of the prediction algorithm.
Not all high-ranked genes are chosen.
A replication algorithm, as used in quantum simulations and protein folding, is applied.
It considers groups of predictors and finds the relevant genes.
It adds and deletes genes continuously until optimal performance is achieved.
For small round blue cell tumors the number of genes was reduced from 96 to 15; similarly good results were obtained for B-cell lymphoma.
GESSES (Genetic Evolution of Sub Sets of Expressed Data)
Overview of the Algorithm
The predictors that make the fewest mistakes on test data are the best predictors.
This algorithm uses a scoring scheme that gives higher scores when more data points are correctly classified.
LOOCV (leave-one-out cross-validation) is used to calculate the score for a predictor.
The data is separated into clusters; each cluster corresponds to one type of cancer.
Wrapper method: search for the highest-scoring predictor within a subset of the genes.
Filter method: filter the gene pool and collect the most likely candidate genes.
Terminology
Dt = {D1, D2, …}: sample of microarray training data.
N: number of genes in the training data.
Gt: complete set of genes (1 through N).
Gα: subset of the complete gene set (α1, α2, …, αm).
T: each sample D has a classification of type T.
Predictor
P: a predictor, a function that takes data D and outputs a type T.
A k-nearest-neighbour search is used to construct the predictor.
The training data Dt is compared with the target data D by computing the Euclidean distance between D and each vector in the training set.
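The k-nearest-neighbour predictor described on this slide can be sketched in Python (the gene-expression vectors and class labels below are made-up toy values, and k = 3 is an illustrative choice):

```python
import numpy as np

def knn_predict(D, Dt, labels, k=3):
    """Predict the class of sample D (a vector of expression levels for
    the chosen gene subset) by majority vote among the k training
    samples in Dt closest to D in Euclidean distance."""
    dists = np.linalg.norm(Dt - D, axis=1)   # distance to each training vector
    nearest = np.argsort(dists)[:k]          # indices of the k nearest samples
    votes = [labels[i] for i in nearest]
    return max(set(votes), key=votes.count)  # majority class among the k

# Toy example: two classes separated along the first gene.
Dt = np.array([[0.1, 1.0], [0.2, 0.9], [0.9, 0.2], [1.0, 0.1]])
labels = ["ALL", "ALL", "AML", "AML"]
print(knn_predict(np.array([0.15, 0.95]), Dt, labels))  # -> ALL
```

The distance computation mirrors the slide: each training vector is compared to the target sample D in the subspace of selected genes.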
Scoring Function
Iteratively single out one data point, treating it as pseudo test data. If the point is predicted correctly, add 1 to the score.
Consider the shortest distances, denoted d1, d2, …, dt.
Take the two shortest distances di and dj and add C·|di² − dj²|,
where C is a constant chosen so that each added term is ≪ 1.
Looping over the entire data set, we compute the total score.
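The leave-one-out scoring loop above can be sketched as follows; note this is one reading of the slides (in particular, whether the C-term is added for every held-out point or only some is an assumption here):

```python
import numpy as np

def loocv_score(Dt, labels, C=1e-4):
    """Leave-one-out score for a gene subset: hold out each sample in
    turn, classify it by its nearest neighbour among the rest, and add
    1 for a correct prediction plus a small term C*|di^2 - dj^2| built
    from the two shortest distances di, dj (C is chosen so each added
    term is much smaller than 1)."""
    score = 0.0
    for i in range(len(Dt)):
        dists = np.linalg.norm(Dt - Dt[i], axis=1)
        dists[i] = np.inf                          # exclude the held-out point
        order = np.argsort(dists)
        di, dj = dists[order[0]], dists[order[1]]  # two shortest distances
        if labels[order[0]] == labels[i]:          # nearest-neighbour call correct?
            score += 1.0
        score += C * abs(di**2 - dj**2)            # small tie-breaking term, << 1
    return score

# Two well-separated clusters: every held-out point is classified correctly,
# so the score is 4 plus a small distance-dependent contribution.
Dt = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = ["t1", "t1", "t2", "t2"]
```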
Initial Gene Pool
To distinguish one class from another, we rank all genes by how well their expression levels separate the training samples.
Some genes give high levels for one type (ti) and low levels for another (tj).
Sometimes the two classes separate cleanly, but often their expression levels overlap.
Genes with overlaps are ranked low.
The best M genes are chosen as the initial pool.
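One simple way to realize this ranking (a sketch; the paper's exact separation criterion may differ) is to count how many training samples fall in the region where the two classes' expression ranges overlap:

```python
import numpy as np

def overlap_count(expr, labels, hi_class):
    """Count training samples whose expression level for one gene falls
    inside the interval where the two classes' ranges overlap. Genes
    whose values separate the classes cleanly score 0 and rank highest;
    heavy overlap gives a large count and a low rank."""
    hi = expr[labels == hi_class]
    lo = expr[labels != hi_class]
    lo_edge = max(hi.min(), lo.min())   # start of the overlap interval
    hi_edge = min(hi.max(), lo.max())   # end of the overlap interval
    if hi_edge < lo_edge:               # ranges are disjoint: perfect separation
        return 0
    return int(np.sum((expr >= lo_edge) & (expr <= hi_edge)))

# Toy data: gene 0 separates the two types cleanly, gene 1 overlaps.
Dt = np.array([[5.0, 1.0], [4.5, 2.0], [1.0, 1.5], [0.5, 2.5]])
labels = np.array(["t1", "t1", "t2", "t2"])
counts = [overlap_count(Dt[:, g], labels, "t1") for g in range(Dt.shape[1])]
M = 1
best = np.argsort(counts)[:M]           # keep the best M genes (lowest overlap)
```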
Replication Algorithm
Suppose there is a group of n gene subspaces ε = {G1, G2, …, Gn}.
1. For each G ∈ ε, produce a new subspace as follows:
A. The set G has genes {g1, g2, …, gm}. We randomly mutate the genes:
ADD: choose a gene gr at random and add it to G, producing a new set G′ of genes {g1, g2, …, gm, gr}.
DELETE: randomly delete a gene from G; the new set has m − 1 genes.
KEEP: leave G unchanged.
Algorithm Continued
B. Compute the difference in score between the original set G and the mutated set G′:
δS = SG′ − SG
C. Compute a weight for G′: w = exp(β δS),
where β is the inverse temperature.
2. Let Z be the sum of these weights; normalize the weights by multiplying them by n/Z.
3. Replicate all the subspaces according to their weights w.
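Steps 1-3 above can be sketched as one generation of the replication loop (the scoring function is passed in as `score`, and the equal-probability choice among ADD/DELETE/KEEP is an illustrative assumption):

```python
import math
import random

def replicate(ensemble, gene_pool, score, beta=1.0):
    """One generation of the replication algorithm: randomly mutate each
    gene subset (ADD a random gene, DELETE one, or KEEP it unchanged),
    weight the mutant by exp(beta * deltaS) with deltaS = S_G' - S_G,
    normalize the weights to sum to n, and replicate each subset in
    proportion to its normalized weight."""
    mutated, weights = [], []
    for G in ensemble:
        Gp = set(G)
        move = random.choice(["add", "delete", "keep"])
        if move == "add":
            Gp.add(random.choice(gene_pool))
        elif move == "delete" and len(Gp) > 1:
            Gp.discard(random.choice(sorted(Gp)))
        dS = score(Gp) - score(G)            # deltaS = S_G' - S_G
        mutated.append(frozenset(Gp))
        weights.append(math.exp(beta * dS))  # w = exp(beta * deltaS)
    n, Z = len(ensemble), sum(weights)
    new_ensemble = []
    for G, w in zip(mutated, weights):
        w = w * n / Z                        # normalized so weights sum to n
        copies = int(w) + (1 if random.random() < w - int(w) else 0)
        new_ensemble.extend([G] * copies)    # replicate in proportion to w
    return new_ensemble
```

High-scoring subsets receive weight above 1 and tend to be duplicated; low-scoring ones receive weight below 1 and tend to disappear, which keeps the ensemble size near n.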
Annealing
As the system evolves, the scoring function gives similar results for different predictors.
To improve convergence, the temperature is made a function of the spread in scores.
The schedule that worked best was to lower the temperature along with the fluctuation of the score from predictor to predictor.
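One way to realize such a schedule (an assumption consistent with the slides, not the paper's exact formula) is to make the temperature proportional to the standard deviation of the scores across predictors:

```python
import statistics

def annealed_beta(scores, floor=1e-6):
    """Inverse temperature for the next generation: the temperature is
    taken proportional to the spread (standard deviation) of the scores
    across predictors, so beta rises as the predictors converge. The
    floor avoids division by zero once every predictor scores the same."""
    spread = statistics.pstdev(scores)
    return 1.0 / max(spread, floor)
```

As the ensemble converges and score fluctuations shrink, β grows, making the replication step increasingly selective.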
Deterministic evolution
This is computationally expensive but performs better than the statistical method.
The statistical method does not explore all combinations of genes and so can miss optimal gene combinations.
1. Construct all distinct unions of the G's in the ensemble ε with individual genes gi from the initial pool, i.e. {g1, g2, …, gm, gi}.
2. Sort all these combinations by their score, keeping the top n of them.
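Steps 1 and 2 can be sketched directly (the tiny gene pool and the `informative`-gene scoring below are made up for illustration):

```python
def deterministic_step(ensemble, gene_pool, score, n_top):
    """One deterministic-evolution step: form every distinct union of a
    subset G in the ensemble with a single gene g_i from the initial
    pool, sort the candidates by score, and keep the n_top best."""
    candidates = {frozenset(G) | {g} for G in ensemble for g in gene_pool}
    return sorted(candidates, key=score, reverse=True)[:n_top]

# Toy run: score a subset by how many "informative" genes it contains.
informative = {"g1", "g4"}
score = lambda G: len(G & informative)
ensemble = [frozenset({"g1"}), frozenset({"g2"})]
pool = ["g1", "g2", "g3", "g4"]
top = deterministic_step(ensemble, pool, score, n_top=3)
```

Because every union is scored, no single-gene extension of the current ensemble can be missed, which is the advantage over the statistical method at the cost of more score evaluations.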
Small Round Blue Cell Tumors
Hard to classify with current techniques.
Hard to diagnose correctly because they all appear similar.
A previous study used 63 samples for training and tested on 20 using principal component analysis.
It reduced the number of genes to 96 yet classified the data perfectly.
Here the same data set of 2308 genes was used, and the initial pool of genes was constructed by considering how well each gene discriminates cancer type i from type j.
For each combination the top 10 genes best able to discriminate were selected.
The statistical method was then repeated.
Statistical Algorithm Using All Mutational Moves
The number of dimensions rises to a maximum of about 16.
The number of wrong classifications per predictor decreases from about 9 to 0.5.
By the end of the process the data was classified successfully.
At this point the temperature falls rapidly.
Deterministic Evolution method
Start with an initial pool of 90 genes.
15 overlapped, leaving 75 in total.
n_top = 150.
Of the top 100 predictors all predicted the test data perfectly.
Average number of genes in predictor was 12.7.
[Figure: average number of genes in a predictor]
Leukemia
There are two types of leukemia:
1. Acute Lymphoblastic Leukemia (ALL)
2. Acute Myeloid Leukemia (AML)
Bone marrow samples were taken from the different cell types.
Each sample was analyzed on Affymetrix microarrays measuring expression levels of 7129 genes.
The data was divided into 38 training data points and 34 test points.
Statistical Algorithm Using All Mutational Moves
The average number of dimensions in a predictor rises to more than 14 by iteration 63.
By iteration 80 it declines to a dimension of only 2.
The average number of mistakes remains constant at 1.
The method predicts the training data perfectly.
Conclusion
With GESSES we could distinguish diffuse large B-cell lymphoma from B-cell lymphoma.
Recently, microarray data has been used to distinguish 14 different kinds of tumors.
GESSES can help make microarray data practical for cancer diagnosis and for many other diseases.