1 Machine Learning for Functional Genomics I Matt Hibbs

http://cbfg.jax.org

Slide 2: Central Dogma
  [Diagram labels: DNA, Gene Expression, Proteins, Phenotypes]

Slide 3: Functional Genomics
  - Identify the roles played by genes/proteins
  [Figure: Sealfon et al., 2006]

Slide 4: Gene Expression Microarrays
  - Simultaneous measurements of mRNA abundance levels for every gene in a genome
  [Figure axes: Genes, Conditions]

Slide 5: Gene Expression Microarrays
  - Simultaneous measurements of mRNA abundance levels for every gene in a genome in thousands of conditions
  - Rich functional information in these data, but how can we utilize the entire compendia?

Slide 6: Biological Data Explosion
  - Huge repositories of biological data are not directly translating into knowledge
  [Figure panels: mouse genes with known process association (# of genes vs. year); publicly available microarrays in GEO (# of measurements vs. year)]

Slide 7: Why is there a Data-Knowledge Gap?
  - Many datasets are analyzed only once
    - Initial publication looks for hypothesis
  - Need standards for naming, formats, collection
  - Data should be aggregated and integrated
    - Modestly significant clues seen repeatedly can become convincing: a preponderance of circumstantial evidence
  - Scale of this problem overwhelms traditional biology

Slide 8: Scalable Artificial Intelligence
  - Computer science is really a study in scalability
  - Use machine learning and data mining techniques to quickly identify important patterns

Slide 9: Amazon Recommendations

Slide 10: Amazon Recommendations
  [Diagram labels: Observe Browsing Patterns and Account Activity; Purchase History, Item Rankings; Machine Learning (Bayesian networks); Recommendations]
  - Compare your purchase history to all other customers
  - Find commonalities between profiles
  - Predict potential purchases

Slide 11: Gene Function Prediction
  [Diagram labels, Amazon: Observe Browsing Patterns and Account Activity; Purchase History, Item Rankings; Machine Learning (Bayesian networks); Recommendations]
  [Diagram labels, gene function: Laboratory Experiments; Genome Scale Data, MGI Annotations; Machine Learning (Bayesian networks); Predictions]

Slide 12: Challenges for AI from Biology
  - Input data is noisy, heterogeneous, constantly evolving
  - Current knowledge is incomplete and biased
  - Can be difficult to determine accuracy

Slide 13: Promise of Computational Functional Genomics
  [Diagram labels: Data & Existing Knowledge, Computational Approaches, Predictions, Laboratory Experiments]

Slide 14: Reality of Computational Functional Genomics
  [Diagram labels: Data & Existing Knowledge, Computational Approaches, Predictions, Laboratory Experiments]

Slide 15: Computational Solutions
  - Machine learning & data mining
    - Use existing data to make new predictions
    - Similarity search algorithms
    - Bayesian networks
    - Support vector machines
    - etc.
    - Validate predictions with follow-up lab work
  - Visualization & exploratory analysis
    - Seeing and interacting with data important
    - Show data so that questions can be answered
    - Scalability, incorporate statistics, etc.

Slide 16: Computational Solutions (same text as Slide 15)

Slide 17: Similarity Search Approach
  - Re-frame analysis as exploratory search
  [Diagram labels: Data Collection, Query Genes, Search Algorithm (SPELL), Relevant Datasets, Related Genes]

Slide 18: Context-Sensitive Search Process
  Key insights: Signal Balancing, Correlation Comparability
  X = U Σ V^t

Slide 19: Context-Sensitive Search Process
  Key insights: Signal Balancing, Correlation Comparability
  X = U Σ V^t

Slide 20: Dataset relevance weighting
  - Calculate correlation measure among query for each dataset; this is each dataset's weight
  - Query genes: Q = {YQG1, YQG2, YQG3}
  [Figure: datasets with weights 0.15, 0.82, 0.05, 0.55; rows YQG1, YQG2, YQG3]

Slide 21: Identify Novel Partners
  - Calculate weighted distance score for all other genes to the query set
  - Query genes: Q = {YQG1, YQG2, YQG3}
  [Figure: datasets with weights 0.15, 0.82, 0.05, 0.55; rows YQG1, YQG2, YQG3, geneA, geneB, geneC]

Slide 22: Identify Novel Partners
  - Calculate weighted distance score for all other genes to the query set
  - Query genes: Q = {YQG1, YQG2, YQG3}
  [Figure: datasets with weights 0.15, 0.82, 0.05, 0.55; rows YQG1, YQG2, YQG3, geneA, geneB, geneC, ordered from best score to worst score]
  + Takes advantage of functional diversity
  + Addresses statistical concerns
  + Fast running times [O(GDQ^2)] (ms per query)
  + Top results are candidates for investigation
  + Search process is iterative to refine results

Slide 23: Context-Sensitive Search Process
  Key insights: Signal Balancing, Correlation Comparability
  X = U Σ V^t

Slide 24: Singular Value Decomposition (SVD)
  - Projects data into another orthonormal basis
  - Correlations in U (rather than X) often contain better biological signals
  [Diagram labels: Data, SVD, Signal Balancing]

Slide 25: Signal Balancing
  [Diagram labels: SVD, Signal Balancing]

Slide 26: Signal Balancing
  - Use correlations among left singular vectors
  - Downweights dominant patterns, amplifies subtle patterns
  - Top eigengenes dominate data
    - Sometimes correspond to systematic bias
    - Often correspond to common biological processes, e.g. ribosome biogenesis, etc.
  - Accuracy of signal balancing improved over re-projection

Slide 27: Context-Sensitive Search Process
  Key insights: Signal Balancing, Correlation Comparability
  X = U Σ V^t

Slide 28: Between-dataset normalization
  - Commonly used Pearson correlation yields greatly different distributions of correlation
  - These differences complicate comparisons
  [Figure: histograms of Pearson correlations between all pairs of genes; DeRisi et al., 1997; Primig et al., 2000]

Slide 29: Between-dataset normalization
  - Fisher Z-transform, Z-score equalizes distributions
  - Increases comparability between datasets
  [Figure: histograms of Z-scores between all pairs of genes]

Slide 30: SPELL Algorithm Overview
  Hibbs MA, Hess DC, Myers CL, Huttenhower C, Li K, Troyanskaya OG. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics, 2007.

Slide 31: Web Interface
  http://spell.princeton.edu

Slide 32: Evaluation of Performance
  - Leave-k-in cross validation / bootstrapping
  - Results averaged across 125 diverse GO biological process terms (defined in the GRIFn system, Myers et al., 2006)
  - Many predictions also verified through experimental validations in other studies
    - Hibbs et al., Bioinf, 2007
    - Hess et al., PLoS Gen, 2009
    - Hibbs*, Myers*, Huttenhower*, et al., PLoS Comp Biol, 2009

Slide 33: Search Accuracy
  - Perform leave-k-in cross-validation
  [Diagram labels: Genes with common function, For all pairs, Order Genome, Rank Average, Master List]

Slide 34: Search Accuracy
  - Precision-Recall Curve, computed from the Master List
  - Precision = TP / (TP + FP)
  - Recall = TP / (TP + FN)
  [Figure axes: Precision and Recall, each from 0 to 1]

Slide 35: Accuracy of Context-Sensitive Search

Slide 36: Sample & Query Size Effects
  - Even relatively small sample sizes produce similar results (1000 samples used for all other tests)
  - Significant performance gain between 2 and 3 query genes, little change beyond (5 query genes used for all other tests)

Slide 37: Effect of Signal Balancing
  - Signal balancing further improves context-specific search performance
  - Improvement is robust to missing value imputation method

Slide 38: Effects of Signal Balancing
  [Panel labels: n% re-projection, n% balanced, signal balanced]

Slide 39: Effects of Signal Balancing
  [Panel labels: n% re-projection, n% balanced]

Slide 40: Specific Performance

Slide 41: Computational Solutions
  - Machine learning & data mining
    - Use existing data to make new predictions
    - Similarity search algorithms
    - Bayesian networks
    - Support vector machines
    - etc.
    - Validate predictions with follow-up lab work
  - Visualization & exploratory analysis
    - Seeing and interacting with data important
    - Show data so that questions can be answered
    - Scalability, incorporate statistics, etc.

Slide 42: Function Prediction Evaluation
  - Cross-validation based on known biology
    - Most often used method in literature
    - Results are useful, but can be biased
  - Laboratory evaluation
    - More accurate, more difficult
    - Ultimate goal of functional genomics
    - Identify novel biology
    - Publish biological corpus
  Huttenhower C*, Hibbs MA*, Myers CL* et al. The impact of incomplete knowledge on gene function prediction. Bioinformatics, 2009.

Slide 43: Promise of Computational Functional Genomics
  [Diagram labels: Data & Existing Knowledge, Computational Approaches, Predictions, Laboratory Experiments]

Slide 44: [no text]

Slide 45: Petite Frequency Assay

Slide 46: Petite Frequency Phenotypes for Predictions

Slide 47: Overall Result Summary

Slide 48: Double mutant petite freq.

Slide 49: Mitochondrial Motility

Slide 50: Respiratory Growth Rate

Slide 51: Biological Benefits of Computational Direction
  - Effective
    - Candidate prioritization
    - 6 months of work vs. 8 years for whole genome screen
  - Unbiased (actually, just less biased)
    - Both uncharacterized genes and genes with known function predicted and verified
      - 40 of 75 (53%) for genes with known function
      - 60 of 118 (51%) for uncharacterized genes
    - Testing only mitochondrial localized proteins would miss 43% of our discoveries
      - 59% accuracy among mitochondria localized
      - 44% accuracy among non-mitochondria localized

Slide 52: Computational Expectations
  [Panel labels: Original Gold Standard, Experimental Results]

Slide 53: Complementary Computational Approaches

Slide 54: Computational Reality
  [Panel labels: Original Gold Standard, Experimental Results]

Slide 55: Method Comparison
  Input Data: Microarrays Only | Diverse Data
  Algorithmic Approach: Context-specific search | Bayesian integration
  Details:
    (1) heavily cross-validated, only pos. correlation, uses signal balanced data
    (2) naive Bayes inference after training, pairwise correlations binned
    (3) naive Bayes inference after training, each data type converted to pairwise scores

Slide 56: Method Accuracy is Biologically Diverse

Slide 57: Underlying Data Changes Predictions

Slide 58: Methods Converge During Iteration

Slide 59: Computational Lessons
  - Underlying data, choice of algorithm important
    - Data affects which biological areas can be studied
    - Algorithm affects biological context, nature of results
    - Possible for many combinations to be accurate
  - Utilizing an ensemble of methods broadens scope and reliability
    - Iteration in an ensemble can lead to converging predictions
  - Evaluating the results of computational prediction methods is not as simple as recapitulating GO

Slide 60: Conclusions
  - Microarray search system (& Bayesian data integration) produce good predictions of gene function
  - Experimental verification of predictions is important
    - 109 novel gene functions discovered
    - Subtle phenotypes important to consider
  - Big challenge: Make this work in mammals

Slide 61: Acknowledgements
  Hibbs Lab: Karen Dowell, Tongjun Gu, Al Simons
  Olga Troyanskaya Lab: Patrick Bradley, Maria Chikina, Yuanfang Guan, Chad Myers, David Hess, Florian Markowetz, Edo Airoldi, Curtis Huttenhower
  Kai Li Lab: Grant Wallace
  Amy Caudy
  Maitreya Dunham
  Botstein, Kruglyak, Broach, Rose labs
  Kyuson Yun
  Carol Bult
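
Supplementary sketch: the context-sensitive search described on Slides 18 to 29 (dataset relevance weighting, Fisher Z-transform normalization, optional signal balancing via SVD) might be approximated as below. This is a minimal illustration under stated assumptions, not the SPELL implementation: the function names (`signal_balance`, `normalized_correlations`, `spell_like_search`) are hypothetical, the dataset weight is assumed to be the mean normalized correlation among query-gene pairs clipped at zero, and the final score is assumed to be a weighted average of each gene's mean normalized correlation to the query. It also assumes at least two query genes, no constant rows, and more genes than conditions in each dataset.

```python
import numpy as np


def signal_balance(X):
    # X = U S Vt. Correlating rows of U instead of rows of X drops the
    # singular values, which downweights the dominant patterns.
    # Assumes more genes (rows) than conditions (columns).
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U


def normalized_correlations(X):
    # Pearson correlation between every pair of gene rows.
    r = np.corrcoef(X)
    # Clip so arctanh stays finite (the diagonal is exactly 1).
    r = np.clip(r, -0.9999, 0.9999)
    z = np.arctanh(r)  # Fisher Z-transform
    # Z-score over off-diagonal entries so datasets share a common scale.
    off_diag = ~np.eye(z.shape[0], dtype=bool)
    return (z - z[off_diag].mean()) / z[off_diag].std()


def spell_like_search(datasets, query, balance=False):
    # datasets: list of (genes x conditions) arrays with the same gene order.
    # query: list of at least two row indices (the query genes).
    query = list(query)
    weights, per_dataset = [], []
    for X in datasets:
        data = signal_balance(X) if balance else X
        Z = normalized_correlations(data)
        # Assumed weight: mean normalized correlation among query pairs,
        # clipped at zero so uninformative datasets contribute nothing.
        among_query = Z[np.ix_(query, query)]
        pairs = among_query[np.triu_indices(len(query), k=1)]
        weights.append(max(float(pairs.mean()), 0.0))
        # Each gene's mean normalized correlation to the query genes.
        per_dataset.append(Z[:, query].mean(axis=1))
    weights = np.array(weights)
    if weights.sum() == 0:
        weights = np.ones_like(weights)  # fall back to equal weights
    # Weighted average across datasets, one score per gene.
    scores = weights @ np.array(per_dataset) / weights.sum()
    # Best-scoring genes first, query genes excluded.
    ranking = [int(g) for g in np.argsort(-scores) if g not in query]
    return weights, ranking
```

Calling `spell_like_search(datasets, query=[0, 1, 2])` returns one weight per dataset and a ranked list of non-query gene indices; passing `balance=True` computes correlations on the left singular vectors instead of the raw data.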
