Computational Discrete Mathematics and Statistics for
Molecular Array Data
Bill Shannon
Washington University
School of Medicine
Molecular Biology
“How Genes Work”, http://www.nigms.nih.gov
Microarrays
[Figure: messenger RNA* levels of genes A, B, and C in a normal cell and in a tumor cell]
*Brenner, Jacob, Meselson (1961) An unstable intermediate carrying information from genes to ribosomes for protein synthesis. Nature, 190:576-581.
Microarrays (Leukemia PPG)
35 Probes Selected from ~50,000
Array Data Present New Data Analysis Challenges
(Curse of Dimensionality)
• Model inaccuracy (error) grows very quickly as the number of dimensions increases
– sparseness (describing the data becomes impossible)
– model complexity (too many interaction terms, non-linear effects, etc. to consider)
– random multicollinearity (spurious correlations)
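The spurious-correlation point can be illustrated with a small simulation. This is a minimal sketch, not taken from the slides: the function names, the sample size of 10, and the gene counts are assumptions chosen for illustration. Every "gene" is pure noise, yet the largest correlation with a random outcome grows as more genes are screened.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def max_spurious_correlation(n_samples, n_genes, seed=0):
    """Largest |r| between a random outcome and pure-noise 'genes'."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_genes):
        gene = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(gene, outcome)))
    return best

# Hypothetical sizes: 10 samples, screened against 5 and then 2000 noise genes.
few = max_spurious_correlation(n_samples=10, n_genes=5)
many = max_spurious_correlation(n_samples=10, n_genes=2000)
print(f"max |r| with 5 noise genes:    {few:.2f}")
print(f"max |r| with 2000 noise genes: {many:.2f}")
```

With P >> N, some noise gene will almost always look strongly correlated with the outcome by chance alone.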
Regression (Curse of Dimensionality)
• y = f(x) + error
• sparseness = little local signal
– model parameters not estimated accurately
– unstable models over-fit data (not generalizable)
• Non-parametric methods (e.g., CART, neural nets)
– require a lot of model searching
– use up degrees of freedom rapidly
– little or no information left to determine significance
Cluster Analysis (Curse of Dimensionality)
• Find structure in data
• Many cluster results with same goodness-of-fit
• Deciding among the models is impossible.
Classification Models (Curse of Dimensionality)
• Predict group membership (e.g., tumor versus normal)
• Three broad categories
– geometric methods (discriminant analysis, CART)
– probabilistic methods (Bayesian)
– algorithmic methods (neural networks, k-NN)
• Require training/validation datasets
Other Methods (Curse of Dimensionality)
• Resampling (cross validation, bootstrapping), model averaging (bagging), or iterative re-weighting (boosting)
• Multiple testing adjustment such as false discovery rate or permutation testing
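As one concrete instance of a false discovery rate adjustment, the Benjamini-Hochberg step-up procedure can be sketched as follows. The slides do not say which FDR method was intended, so the choice of procedure, the function name, and the p-values below are assumptions for illustration only.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of hypotheses rejected at the given FDR level."""
    m = len(p_values)
    # Rank hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Find the largest rank whose p-value is under its step-up threshold.
        if p_values[idx] <= fdr * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])

# Hypothetical per-gene p-values.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p_values, fdr=0.05)
unadjusted = [i for i, p in enumerate(p_values) if p <= 0.05]
print("rejected after FDR control:", rejected)
print("rejected without adjustment:", unadjusted)
```

In this toy input, five genes pass an unadjusted 0.05 threshold but only the two smallest p-values survive the adjustment.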
Mantel Statistics
• Transform the standard NxP data matrix into an NxN matrix of pairwise distances or similarities between subjects
• Analyzing the NxN matrix instead of the NxP data matrix (P >> N) avoids the curse of dimensionality
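Mantel Statistics
Distance between samples i and i' over all P genes:

$$d_{i,i'} = \sqrt{(x_{i,1} - x_{i',1})^2 + (x_{i,2} - x_{i',2})^2 + \cdots + (x_{i,P} - x_{i',P})^2}$$

$$
\begin{array}{c|cccc}
\text{Sample} & G_1 & G_2 & \cdots & G_P \\ \hline
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,P} \\
2 & x_{2,1} & x_{2,2} & \cdots & x_{2,P} \\
3 & x_{3,1} & x_{3,2} & \cdots & x_{3,P} \\
\vdots & & & & \\
N & x_{N,1} & x_{N,2} & \cdots & x_{N,P}
\end{array}
\;\longrightarrow\;
D_P = \begin{pmatrix}
0 \\
d_{2,1} & 0 \\
d_{3,1} & d_{3,2} & 0 \\
\vdots \\
d_{N,1} & d_{N,2} & d_{N,3} & \cdots & 0
\end{pmatrix}
$$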
Shannon (2008) Cluster Analysis, in Handbook of Statistics, Vol. 27, eds. Rao, Rao, Miller.
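The transformation from an NxP data matrix to an NxN distance matrix can be sketched in a few lines. This is a minimal illustration assuming Euclidean distance; the function name and the three toy samples are hypothetical.

```python
import math

def distance_matrix(data):
    """N x N Euclidean distance matrix from an N x P list of rows."""
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            dist[i][j] = dist[j][i] = d  # symmetric, zero diagonal
    return dist

# Hypothetical data: N = 3 samples measured on P = 4 genes.
samples = [
    [0.0, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 6.0, 8.0],
]
D = distance_matrix(samples)
for row in D:
    print([round(d, 2) for d in row])
```

The size of the result depends only on N, however many genes P are measured.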
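Mantel Statistics
Distance matrix computed from a subset of k << P genes:

$$
\begin{array}{c|cccc}
\text{Sample} & G_{(1)} & G_{(2)} & \cdots & G_{(k)} \\ \hline
1 & x_{1,(1)} & x_{1,(2)} & \cdots & x_{1,(k)} \\
2 & x_{2,(1)} & x_{2,(2)} & \cdots & x_{2,(k)} \\
3 & x_{3,(1)} & x_{3,(2)} & \cdots & x_{3,(k)} \\
\vdots & & & & \\
N & x_{N,(1)} & x_{N,(2)} & \cdots & x_{N,(k)}
\end{array}
\;\longrightarrow\;
D_{k \ll P} = \begin{pmatrix}
0 \\
d_{2,1} & 0 \\
d_{3,1} & d_{3,2} & 0 \\
\vdots \\
d_{N,1} & d_{N,2} & d_{N,3} & \cdots & 0
\end{pmatrix}
$$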
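Mantel Statistics
Signal + Noise Genes:

$$
D_P = \begin{pmatrix}
0 \\
d_{2,1} & 0 \\
d_{3,1} & d_{3,2} & 0 \\
\vdots \\
d_{N,1} & d_{N,2} & d_{N,3} & \cdots & 0
\end{pmatrix}
$$

Signal Genes Only:

$$
D_{k \ll P} = \begin{pmatrix}
0 \\
d_{2,1} & 0 \\
d_{3,1} & d_{3,2} & 0 \\
\vdots \\
d_{N,1} & d_{N,2} & d_{N,3} & \cdots & 0
\end{pmatrix}
$$

Mantel correlation, with sums running over all sample pairs i < j:

$$
\rho(D_P, D_{k \ll P}) =
\frac{\sum_{i<j} \left(d^{P}_{i,j} - \bar{d}^{P}\right)\left(d^{k \ll P}_{i,j} - \bar{d}^{k \ll P}\right)}
{\sqrt{\sum_{i<j} \left(d^{P}_{i,j} - \bar{d}^{P}\right)^2 \; \sum_{i<j} \left(d^{k \ll P}_{i,j} - \bar{d}^{k \ll P}\right)^2}}
$$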
Mantel Statistics
Correlating D_P with D_{k<<P} avoids the curse of dimensionality!
A positive Mantel correlation indicates that the genes in D_{k<<P} contain the same information as the genes in D_P
Shannon, Watson, et al. (2002). Mantel statistics to correlate gene expression levels from microarrays with clinical covariates. Genet Epidemiology 23: 87-96.
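A Mantel correlation with a permutation p-value can be sketched as below. This is an illustration under assumptions, not the authors' implementation: it uses a Pearson correlation over the lower-triangle entries, a one-sided test, and simultaneous shuffling of the rows and columns of one matrix. All function names and the toy matrices are hypothetical.

```python
import math
import random

def lower_triangle(dist):
    """Entries below the diagonal, one per sample pair."""
    return [dist[i][j] for i in range(len(dist)) for j in range(i)]

def mantel(dist_a, dist_b):
    """Pearson correlation between matching entries of two distance matrices."""
    a, b = lower_triangle(dist_a), lower_triangle(dist_b)
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

def mantel_p_value(dist_a, dist_b, n_perm=999, seed=0):
    """Observed correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    observed = mantel(dist_a, dist_b)
    n = len(dist_a)
    hits = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        # Relabel the samples of one matrix (rows and columns together).
        permuted = [[dist_a[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if mantel(permuted, dist_b) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

def line_distances(points):
    """Toy helper: distances between points on a line."""
    return [[abs(p - q) for q in points] for p in points]

dist_a = line_distances([0, 1, 2, 5, 9])
dist_b = line_distances([0, 2, 4, 10, 18])  # same layout, doubled scale
r, p = mantel_p_value(dist_a, dist_b)
print(f"Mantel r = {r:.3f}, permutation p = {p:.3f}")
```

Because the entries of a distance matrix are not independent, significance is judged by permuting sample labels rather than by a standard correlation test. With 999 permutations the smallest attainable p-value is 0.001.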
GA-Mantel
• Search algorithm to find signal genes
• Solution representation
– list of genes (10 123 456 798 835 888 923)
– binary vector {0000100110000….00010}
• Each solution maps to a Mantel correlation value
– Assumption: the larger the correlation, the more signal genes in the solution
• Selection keeps solutions with high Mantel correlation
Grefenstette, Thompson, Shannon, and Steinmeyer (2005): Genetic algorithms for feature selection using Mantel correlation scoring. Interface: Classification and Clustering 37th Symposium on the Interface. St. Louis, MO
Recombination
Mutation
Gene Subset Selection
• Given:
– a data set comprising N microarray experiments with g genes
• Find:
– a subset of genes that captures relevant relationships among the experiments
• Goal:
– reduce data for further analysis
– identify meaningful biological markers for diagnosis
Genetic Algorithm
1. Randomly generate an initial population
2. Do until a stopping criterion is met:
   Select individuals to be parents (biased by fitness).
   Produce offspring by recombination/mutation.
   Select individuals to die (biased by fitness).
   End Do.
3. Return a result.
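The loop above can be sketched as a small steady-state genetic algorithm over fixed-length gene subsets. The slides do not specify the operators, so the tournament selection, the union-based recombination, the 0.2 mutation rate, and the toy fitness function are all assumptions made for illustration.

```python
import random

def run_ga(n_genes, subset_size, fitness, pop_size=50, generations=200, seed=0):
    """Steady-state GA over gene subsets; returns (best subset, best score)."""
    rng = random.Random(seed)
    # 1. Randomly generate an initial population of gene subsets.
    population = [rng.sample(range(n_genes), subset_size) for _ in range(pop_size)]
    scores = [fitness(ind) for ind in population]

    def tournament(pick_best):
        """Compare two distinct individuals; return the better or the worse."""
        a, b = rng.sample(range(pop_size), 2)
        if (scores[a] >= scores[b]) == pick_best:
            return a
        return b

    # 2. Loop until the stopping criterion (a fixed budget here) is met.
    for _ in range(generations * pop_size):
        mum, dad = population[tournament(True)], population[tournament(True)]
        # Recombination: draw the child's genes from the parents' union.
        pool = sorted(set(mum) | set(dad))
        child = rng.sample(pool, subset_size)
        # Mutation: occasionally swap one gene for a random unused gene.
        if rng.random() < 0.2:
            new_gene = rng.randrange(n_genes)
            if new_gene not in child:
                child[rng.randrange(subset_size)] = new_gene
        # Death: the child replaces the loser of a fitness tournament.
        loser = tournament(False)
        population[loser], scores[loser] = child, fitness(child)

    # 3. Return the best subset found.
    best = max(range(pop_size), key=scores.__getitem__)
    return sorted(population[best]), scores[best]

# Toy fitness (hypothetical): count how many of genes 0-9 are in the subset.
signal = set(range(10))
best_subset, best_score = run_ga(
    n_genes=200, subset_size=5, fitness=lambda ind: len(signal & set(ind))
)
print(best_subset, best_score)
```

In GA-Mantel the toy fitness would be replaced by the Mantel correlation between the subset's distance matrix and the full distance matrix.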
Fitness Evaluation for Gene Selection
• Calculate D_P using all genes
• For each Subset(k) in current population:
– Calculate D_{k<<P}
– Correlate D_P with D_{k<<P}
• Use Mantel Correlation as fitness to select next population of solutions
• Permute to compute P-values
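The fitness evaluation can be sketched as follows: compute D_P once, then score each candidate subset by correlating its distance matrix with D_P. This is a minimal sketch assuming Euclidean distances and a Pearson-type Mantel correlation; the function names and the toy data (12 samples, 3 signal genes with a mean shift of 6, 20 noise genes) are hypothetical.

```python
import math
import random

def distance_matrix(data, genes):
    """Euclidean distances between samples, using only the listed genes."""
    n = len(data)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            d = math.sqrt(sum((data[i][g] - data[j][g]) ** 2 for g in genes))
            dist[i][j] = dist[j][i] = d
    return dist

def mantel_correlation(dist_a, dist_b):
    """Pearson correlation of the lower-triangle entries of two matrices."""
    n = len(dist_a)
    a = [dist_a[i][j] for i in range(n) for j in range(i)]
    b = [dist_b[i][j] for i in range(n) for j in range(i)]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den

def mantel_fitness(subset, data, full_dist):
    """Fitness of a gene subset: correlation of D_{k<<P} with D_P."""
    return mantel_correlation(full_dist, distance_matrix(data, subset))

# Toy data (hypothetical): 12 samples, genes 0-2 carry signal, 3-22 are noise.
rng = random.Random(1)
data = []
for sample in range(12):
    shift = 6.0 if sample >= 6 else 0.0
    signal_part = [rng.gauss(shift, 1) for _ in range(3)]
    noise_part = [rng.gauss(0, 1) for _ in range(20)]
    data.append(signal_part + noise_part)

full_dist = distance_matrix(data, range(23))  # D_P, computed once
fit_signal = mantel_fitness([0, 1, 2], data, full_dist)
fit_noise = mantel_fitness([10, 11, 12], data, full_dist)
print(f"signal subset: {fit_signal:.3f}, noise subset: {fit_noise:.3f}")
```

Computing D_P once outside the loop matters in practice, since it is the only step whose cost grows with the full gene count P.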
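A weight (Wt) of 1 on every gene gives D_P:

$$
\begin{array}{c|cccc}
 & Wt = 1 & Wt = 1 & \cdots & Wt = 1 \\
\text{Sample} & & & & \\ \hline
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,P} \\
2 & x_{2,1} & x_{2,2} & \cdots & x_{2,P} \\
3 & x_{3,1} & x_{3,2} & \cdots & x_{3,P} \\
\vdots & & & & \\
N & x_{N,1} & x_{N,2} & \cdots & x_{N,P}
\end{array}
\;\longrightarrow\;
D_P = \begin{pmatrix}
0 \\
d_{2,1} & 0 \\
d_{3,1} & d_{3,2} & 0 \\
\vdots \\
d_{N,1} & d_{N,2} & d_{N,3} & \cdots & 0
\end{pmatrix}
$$

Binary weights (a weight of 0 drops a gene) give D_{k<<P}:

$$
\begin{array}{c|cccc}
 & Wt = 1 & Wt = 0 & \cdots & Wt = 1 \\
\text{Sample} & & & & \\ \hline
1 & x_{1,1} & x_{1,2} & \cdots & x_{1,K} \\
2 & x_{2,1} & x_{2,2} & \cdots & x_{2,K} \\
3 & x_{3,1} & x_{3,2} & \cdots & x_{3,K} \\
\vdots & & & & \\
N & x_{N,1} & x_{N,2} & \cdots & x_{N,K}
\end{array}
\;\longrightarrow\;
D_{k \ll P} = \begin{pmatrix}
0 \\
d_{2,1} & 0 \\
d_{3,1} & d_{3,2} & 0 \\
\vdots \\
d_{N,1} & d_{N,2} & d_{N,3} & \cdots & 0
\end{pmatrix}
$$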
GA on Artificial Data
• Simulated data:
– 100 experiments with 10,000 genes
   • 100 signal genes
   • 9900 noise genes
– Two groups
   • Group 1 has signal genes sampled from N(0, 1)
   • Group 2 has signal genes sampled from N(1, 1)
• GA Parameters
– population size 200
– generations 200
• Outcome measures (averaged over 10 runs of the GA)
– prevalence (signal, noise) – number of signal and noise genes in GA answer
– correlation (signal, noise) – correlation of best subset distance matrix with the ‘full’ distance matrix
– coverage – number of signal genes identified over all GA runs
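The simulated data and two of the outcome measures can be sketched as below. Several details are assumptions, since the slides do not state them: the two groups are taken to be equal in size, noise genes are drawn from N(0, 1) in both groups, and signal genes occupy the first columns. The GA answers used to show prevalence and coverage are made up, and the demo call uses a scaled-down gene count to keep it fast.

```python
import random

def simulate(n_experiments=100, n_genes=10_000, n_signal=100, seed=0):
    """Rows are experiments; the first n_signal columns are signal genes."""
    rng = random.Random(seed)
    half = n_experiments // 2  # assumption: equal group sizes
    data = []
    for sample in range(n_experiments):
        mean = 0.0 if sample < half else 1.0  # group 1 vs group 2
        row = [rng.gauss(mean, 1) for _ in range(n_signal)]
        # Assumption: noise genes are N(0, 1) regardless of group.
        row += [rng.gauss(0, 1) for _ in range(n_genes - n_signal)]
        data.append(row)
    return data

def prevalence(answer, n_signal):
    """(signal, noise) gene counts in one GA answer."""
    n_sig = sum(1 for g in answer if g < n_signal)
    return n_sig, len(answer) - n_sig

def coverage(answers, n_signal):
    """Number of distinct signal genes found over all GA runs."""
    return len({g for answer in answers for g in answer if g < n_signal})

# Scaled-down demo: 100 experiments, 1,000 genes, 10 of them signal.
data = simulate(n_genes=1000, n_signal=10)
answers = [[0, 3, 500], [3, 7, 999]]  # hypothetical GA answers from two runs
print(prevalence(answers[0], n_signal=10))
print(coverage(answers, n_signal=10))
```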
GA on Artificial Data
Length = 30
Prevalence: mean number of signal genes = 22.9 (0.7), i.e., 76.3% (std 0.53%)
Correlation: mean rho for best subsets = 0.787 (0.009), p-value < 0.0001
Coverage: total signal genes identified across 10 runs = 65/100
Observation: solutions tend to converge to similar subsets. The same 4 signal genes appear in 90% of runs.
GA on Golub Data Set
• Data set: Golub training set (38 x 7129)
• Two Groups:
– 27 samples from ALL patients
– 11 samples from AML patients
• GA searched for subsets of fixed length (10 to 50)
• population = 200, generations = 200
• Mantel correlation tends to increase with subset size

Length   Final Mantel Corr   p-value
10       0.926 (0.005)       < 0.00001
20       0.954 (0.004)       < 0.00001
30       0.967 (0.002)       < 0.00001
40       0.975 (0.002)       < 0.00001
50       0.979 (0.002)       < 0.00001
Significant Feature Subsets
[Figure, two panels: clustering of samples using all genes; clustering of samples using 50 genes from GA]
Letting GA Select Subset Length
• Data set: Golub training set (38 x 7129)
• GA searched over variable-length subsets (min=5, max=50)
• Fitness penalty = d * length / 50
• population = 200, generations = 200
• Tradeoff between length of subsets and correlation score

Length Penalty d   Pop Final Len   Best Final Length   Final Mantel Corr
0.00               48.6            49.0                0.979 (0.001)
0.25               17.4            25.0                0.954 (0.005)
0.50               10.7            16.1                0.939 (0.009)
1.00               7.7             10.6                0.922 (0.009)
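The effect of the length penalty can be worked through with a short calculation. The slide gives only the penalty term, so subtracting it from the Mantel correlation is an assumption, as is the function name. The inputs reuse correlations and lengths from the table purely as illustrative values.

```python
def penalized_fitness(mantel_corr, length, d, max_length=50):
    """Assumed form: Mantel correlation minus the penalty d * length / max_length."""
    return mantel_corr - d * length / max_length

# Illustrative comparison at d = 0.25: a long subset vs. a shorter one.
long_subset = penalized_fitness(mantel_corr=0.979, length=49, d=0.25)
short_subset = penalized_fitness(mantel_corr=0.954, length=25, d=0.25)
print(f"length 49: {long_subset:.3f}")
print(f"length 25: {short_subset:.3f}")
```

Under this assumed form, the shorter subset scores higher once d > 0, even though its raw correlation is lower, which is the tradeoff the table shows.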
Data Reduction
• Observation: GA appears to repeatedly converge to same regions of feature space
– In 50 runs, 954/1546 (61%) of "noise" genes appear more than once in feature sets
• GA can also be used to find feature subsets that minimize rho
• pop = 200
• length = 50
• data set = Golub
• GA finds subsets with rho = 0 within 50 generations
GA in Experimental Data Analysis
• Graft Versus Host Disease (GVHD) in bone marrow transplantation (leukemia)
• T-cells in the transplanted bone marrow see the recipient as foreign and initiate an immune response that destroys host organs
• Regulatory T-cells (Treg) suppress immune response
• Choi and DiPersio are studying the genetic mechanisms of Treg regulation
Mouse Array Experiment
GROUP   TREATMENT            ARRAYS
1       Naïve Treg           dec1, dec5
2       Activated Treg       dec2, dec6, dec10
3       PBST (Control)       dec3, dec7, dec11
4       Decitabine treated   dec4, dec8, dec12

~$1,000 per array in total costs: $12,000 worth of data, including dec9, which did not work
Mouse Array Experiment
• Identify probes (genes) with similar mRNA levels between groups (gene by phenotype analysis)
[Figure panels: Act+Dec vs. Naïve vs. Control; Naïve+Dec vs. Act vs. Control]
Summary
• GA-Mantel effective at identifying signal genes
• Longer gene subsets associated with higher scores
– tradeoff: higher correlations vs. smaller subsets
– requires constraining growth of subsets in GA
• GA effective at identifying noise genes
• GA-Mantel can find genes associated with phenotype
Future Directions
• RFA CA-08-005 (under review)
– Optimize algorithm to improve coverage of solution space
– Multiple solutions
– Combine solutions (weak hierarchies)
• Lung disease R01 (to be submitted)
– Microarrays to identify disease subgroups across the bronchitis/emphysema continuum
Weak Hierarchies
Day, McMorris (2003) Axiomatic Consensus Theory in Group Choice and Biomathematics, SIAM Frontiers in Applied Mathematics, Philadelphia, PA.