Machine Learning for Functional Genomics II Matt Hibbs http://cbfg.jax.org


Page 1

Machine Learning for Functional Genomics II

Matt Hibbs

http://cbfg.jax.org

Page 2

Functional Genomics

• Identify the roles played by genes/proteins

Sealfon et al., 2006.

Page 3

Promise of Computational Functional Genomics

(Diagram: Data & Existing Knowledge → Computational Approaches → Predictions → Laboratory Experiments)

Page 4

Computational Solutions

• Machine learning & data mining
  – Use existing data to make new predictions
    • Similarity search algorithms
    • Bayesian networks
    • Support vector machines
    • etc.
  – Validate predictions with follow-up lab work

• Visualization & exploratory analysis
  – Seeing and interacting with data is important
  – Show data so that questions can be answered
    • Scalability, incorporating statistics, etc.

Page 5

Computational Solutions

• Machine learning & data mining
  – Use existing data to make new predictions
    • Similarity search algorithms
    • Bayesian networks
    • Support vector machines
    • etc.
  – Validate predictions with follow-up lab work

• Visualization & exploratory analysis
  – Seeing and interacting with data is important
  – Show data so that questions can be answered
    • Scalability, incorporating statistics, etc.

Page 6

Bayesian Networks

• Encodes dependence relationships between observed and unobserved events

(Diagram: "Raining?" linked to "Jim brought umbrella", "Cloudy this morning", "Rain in forecast")

Page 7

Bayesian Network Overview

• Graphical representation of relationships
  – Probabilistic information from data to concepts

Page 8

Bayesian Network Overview

• Graphical representation of relationships
  – Probabilistic information from data to concepts

Page 9

Bayesian Network Overview

P(FR | CE, AP, Y2H)

P(FR | CE=yes, AP=yes, Y2H=yes)
  = α P(FR) P(CE=yes|FR) Σ_PI P(PI|FR) P(AP=yes|PI) P(Y2H=yes|PI)

Bayes’ Rule: P(A|B) ∝ P(A) P(B|A)

P(FR=yes) + P(FR=no) = 0.0105α + 0.0216α
P(FR=yes) = 0.327 (up from the 0.10 prior)
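The computation on this slide can be sketched in code. The CPT values passed to the function below are hypothetical placeholders; only the unnormalized totals (0.0105 and 0.0216) come from the slide.

```python
def posterior_fr(prior, p_ce, p_pi, p_ap, p_y2h):
    """P(FR | CE=yes, AP=yes, Y2H=yes), summing out the hidden node PI.
    Each argument is a dict of hypothetical CPT entries keyed by 'yes'/'no'."""
    unnorm = {}
    for fr in ("yes", "no"):
        # Sum over the hidden physical-interaction node PI
        hidden = sum(p_pi[fr][pi] * p_ap[pi] * p_y2h[pi] for pi in ("yes", "no"))
        unnorm[fr] = prior[fr] * p_ce[fr] * hidden
    z = sum(unnorm.values())                    # normalizes out alpha
    return {fr: v / z for fr, v in unnorm.items()}

# The normalization step with the slide's own numbers:
unnorm = {"yes": 0.0105, "no": 0.0216}
p_yes = unnorm["yes"] / sum(unnorm.values())    # ≈ 0.327, up from the 0.10 prior
```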

Page 10

Naïve Bayes

• No internal hidden nodes
• Greatly simplifies the problem; reduces computational complexity and time
• Imposes an independence assumption

Page 11

Naïve Bayes

P(FR | D1, D2, D3, D4)
  = α P(FR) P(D1|FR) P(D2|FR) P(D3|FR) P(D4|FR)

Bayes’ Rule: P(A|B) ∝ P(A) P(B|A)

Assumes that all measures are independent

Page 12

Learning Naïve Bayes Nets

         FR = yes   FR = no
counts   100        900
prob.    0.1        0.9

FR    # D1 = yes   # D1 = no   P(D1=yes | FR)
yes   70           30          0.7
no    300          600         0.33
…
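Learning the CPTs from such a count table is a matter of taking fractions. A minimal sketch, using the table's numbers:

```python
def learn_cpt(counts):
    """counts: {(fr, d1): n} pair counts from the gold standard.
    Returns the prior P(FR) and the CPT entry P(D1=yes | FR)."""
    fr_total = {fr: counts[(fr, "yes")] + counts[(fr, "no")] for fr in ("yes", "no")}
    total = sum(fr_total.values())
    prior = {fr: n / total for fr, n in fr_total.items()}          # P(FR)
    cpt = {fr: counts[(fr, "yes")] / fr_total[fr] for fr in fr_total}  # P(D1=yes|FR)
    return prior, cpt

# The slide's table:
counts = {("yes", "yes"): 70, ("yes", "no"): 30,
          ("no", "yes"): 300, ("no", "no"): 600}
prior, cpt = learn_cpt(counts)   # prior: 0.1 / 0.9; cpt: 0.7 / 0.33
```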

Page 13

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 14

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 15

Gold Standard Construction

• Gene Ontology annotations used to define known functional relationships
• Threshold for positive relationships
• Threshold for negative relationships

Myers et al., 2006

Page 16

Gold Standard Used For Training

(Figure: Global Gold Standard — positive and negative relationships)

Page 17

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 18

Gene-Gene Scores

• Binary data
  – PPI, co-localization, synthetic lethality
  – Can use binary scores
  – Can use profiles to generate scores (dot product)

• Continuous data
  – Profile distance metrics

• Binning results
  – Converts everything to the discrete case

Page 19

Distance Metrics

• Choice of distance measure is important for quantifying relationships in datasets
• Pair-wise metrics – compare vectors of numbers
  – e.g. genes x & y, each with n measurements

Euclidean Distance
Pearson Correlation
Spearman Correlation
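The three metrics can be sketched as follows, for two profiles of equal length. This is a plain-Python version; the rank step uses simple ordinal ranks with no tie-averaging.

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation: centered dot product over the product of norms."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    num = sum(a * b for a, b in zip(xc, yc))
    return num / sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))

def ranks(v):
    """Ordinal rank of each value (no tie-averaging in this sketch)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for pos, i in enumerate(order):
        r[i] = float(pos)
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Spearman is robust to monotone but nonlinear relationships (e.g. y = x³ gives Spearman 1 but Pearson below 1), which matters when comparing expression datasets with different dynamic ranges.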

Page 20

Distance Metrics

Euclidean Distance

Pearson Correlation

Spearman Correlation

Page 21

Sensible Binning

• The commonly used Pearson correlation yields greatly different distributions of correlations
• These differences complicate comparisons

(Figure: histograms of Pearson correlations between all pairs of genes; DeRisi et al., 97; Primig et al., 00)

Page 22

Sensible Binning

• Fisher Z-transform, then Z-score, equalizes the distributions
• Increases comparability between datasets

(Figure: histograms of Z-scores between all pairs of genes)
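A sketch of the transform: Fisher's z = atanh(r) spreads out correlations near ±1, and a per-dataset z-score then puts datasets on a common scale.

```python
from math import atanh, log

def fisher_z(r, eps=1e-7):
    """Fisher transform z = atanh(r) = 0.5 * ln((1+r)/(1-r)).
    Correlations are clipped just inside (-1, 1) to avoid infinities."""
    r = max(-1 + eps, min(1 - eps, r))
    return atanh(r)

def z_scores(values):
    """Standardize one dataset's transformed correlations to mean 0, sd 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]
```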

Page 23

Pre-calculation and Storage

• Pair-wise distances only need to be calculated once, even if using different binnings
• Typical mouse microarray: ~5–20k genes
  • ~16M pair-wise distances
  • ~50–700 MB of storage for one dataset
  • ~800 datasets in GEO
  • ~200 GB for all datasets
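The arithmetic behind these figures can be checked directly; the ~4 bytes per stored distance below is an assumption (a single-precision float), not a number from the slide.

```python
def pair_count(n_genes):
    """Unordered gene pairs: n choose 2."""
    return n_genes * (n_genes - 1) // 2

pairs = pair_count(5700)             # ≈ 16.2M pairs, matching the slide's "~16M"
mb_one_dataset = pairs * 4 / 1e6     # ≈ 65 MB, within the quoted 50–700 MB range
gb_all = 800 * 250 / 1e3             # ~200 GB if datasets average ~250 MB each
```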

Page 24

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 25

Counting & Learning

• Conceptually straightforward
• Counting
  – Just look at all of the pairs in each dataset, see which bin each falls into, and increment a counter
  – But… you need to do this 16M times per dataset
  – “Dumb” parallelization – each dataset is independent
• Learning CPTs
  – Fractions based on counts

Page 26

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 27

Inference

• Also pretty straightforward
  – For all pairs of genes…
    • For each dataset
      – Look up the value from the pre-calculated distances
      – Determine the bin and value from the CPT
      – Multiply the probability into the product
    • Do this for FR=yes and FR=no
    • Normalize out α
    • Store the result
• ~1.5 GB result file

Page 28

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 29

Evaluation Metrics

• TPs, FPs, TNs, FNs
• Agnostic to pairs not appearing in the standard
• ROC curves: Sensitivity–Specificity
• PR curves: Precision–Recall

Page 30

Precision Recall Curves

Over the ordered predictions:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

(Plot: precision (y, 0 to 1) vs. recall (x, 0 to 1))
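Building the curve from the ordered predictions can be sketched as a single pass down the ranked list:

```python
def pr_curve(scores, labels):
    """One (precision, recall) point per prediction, in decreasing score order.
    labels are 1 for gold-standard positives, 0 for negatives."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)               # TP + FN is fixed by the gold standard
    tp = fp = 0
    points = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp),    # precision = TP / (TP + FP)
                       tp / total_pos))   # recall    = TP / (TP + FN)
    return points
```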

Page 31

Summary Statistics

• AUC – area under the (ROC) curve
  – equivalent to the Mann–Whitney U statistic
• Average Precision – average of the precisions calculated at each true positive
  – a quantized version of the area under the precision–recall curve (AUPRC)
• Precision @ n% recall
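Average precision as defined above, sketched in code: walk down the ranked list and average the precision values taken at each true positive.

```python
def average_precision(scores, labels):
    """Mean of the precision values at each true positive, in score order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            tp += 1
            precisions.append(tp / rank)   # precision at this true positive
    return sum(precisions) / len(precisions)
```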

Page 32

Cross Validation

Page 33

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 34

Graph Analysis for Predictions

ci = confidence of function for gene gi
S = set of genes in the function
G = set of all genes
wi,j = weight of edge
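The slide's exact formula is in the figure rather than the text, so the sketch below shows one plausible neighborhood score under those definitions: gene gi's confidence for a function is its edge weight into the function's gene set S, normalized by its total edge weight across G.

```python
def function_confidence(weights, s, gene):
    """Hypothetical neighborhood score for one gene and one function.
    weights: {gene: {neighbor: w_ij}} from the probabilistic network.
    s: set of genes annotated to the function."""
    edges = weights[gene]
    total = sum(edges.values())                          # weight into all of G
    in_s = sum(w for j, w in edges.items() if j in s)    # weight into S
    return in_s / total if total else 0.0
```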

Page 35

Steps for Our Evaluation

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 36

Bayesian Network Integration

Gene expression:
• Gene expression dataset 1
• Gene expression dataset 2
• Gene expression dataset N

Physical interactions:
• Yeast two-hybrid dataset 1
• Co-precipitation dataset 1

Other:
• Transcription factor binding sites
• Localization
• Curated literature

Genetic interactions:
• Synthetic lethality dataset
• Synthetic rescue dataset

Myers et al., 2005; Huttenhower et al., 2006; Guan et al., 2008

• New genes predicted to interact with known mitochondrial genes
• Data integration via a Bayesian network
• User-selected query focuses the search
• Probabilistic, weighted networks of gene function
• Results displayed

Page 37

Basic Approach Applied Several Times

Myers et al., 2005; 2007

Guan et al., 2008

Huttenhower et al., 2007

Huttenhower et al., 2009

Page 38

Limitations and Improvements

• Original work designed for yeast, and for a general notion of “functionally related”
  – Ignores the reality that some genes are related only under certain conditions
  – Treats multi-cellular organisms as big single-celled organisms

• Increased specificity can be used to improve results
  – The 2nd iteration of bioPIXIE included biological processes in its gold standards
  – Currently working on a 2nd-generation mouseNET to account for tissue and developmental stages

Page 39

General mouseNET Approach

Page 40

Global Gold Standard

(Figure: positive and negative relationships)

Page 41

Specific Gold Standards

• Not all datasets capture all functional relationships
  – Process/pathway specific
• Functionally related genes aren’t always functionally related
  – Tissue specific
  – Developmental stage specific

Page 42

Specific Gold Standard Construction

(Figure: Global Gold Standard vs. Specific Gold Standard — positive and negative relationships)

Page 43

Tissue/Stage Gold Standards

• Based on data from GXD
• Cross-reference Theiler stages with the mammalian anatomy hierarchy
• 729 total intersections
  – ranging from 50 to ~3500 genes
  – not including post-natal stages

Page 44

Initial Computational Evaluations

Page 45

Preliminary Results

• Running 4-fold cross validation using tissue/stage-specific GO-based gold standards

(Plots: training evaluation; test evaluation)

Page 46

Preliminary Results

• Accounting for developmental stage helps

(Plots: training evaluation; test evaluation)

Page 47

Preliminary Results

• Many specific tissue/stage combinations are overfitting

(Plots: training evaluation; test evaluation)

Page 48

Preliminary Results

• Folds were randomly generated and are biased; need to balance positives and negatives

Page 49

New Visualization Interface

• Graphle

Page 50

Simple Things, Long Times

• No single step is too complicated
• Mostly O(G²D)
  – 16M * 800 * 4
• Evaluating one fold takes ~7 hours
• So far have results for ~200 tissue/stages
  – Should take ~3 days on the cluster
  – Actually took ~15 days

Page 51

Bayesian network utility

• Bayesian networks are a powerful tool
• Currently improving on the existing MouseNET project by incorporating tissue/stage information
• Preliminary results are promising, but the standards may be too limited
• A multiple-stage process may be useful
  – predict tissue/stage-specific expression
  – use these predictions in functional gold standards
  – use a continuous gold standard?

Page 52

Computational Solutions

• Machine learning & data mining
  – Use existing data to make new predictions
    • Similarity search algorithms
    • Bayesian networks
    • Support vector machines
    • etc.
  – Validate predictions with follow-up lab work

• Visualization & exploratory analysis
  – Seeing and interacting with data is important
  – Show data so that questions can be answered
    • Scalability, incorporating statistics, etc.

Page 53

From Relationships to Phenotypes

• Use the outputs of Bayesian data integration as inputs to a phenotype prediction problem
• For each gene, the vector of relationship probabilities is used as its feature vector
• Use a Support Vector Machine (SVM) to classify genes involved in a phenotype vs. not involved in that phenotype
• Process repeated for hundreds of phenotypes

Page 54

SVM Methodology

• Every feature vector is thought of as a point in space
• Points nearer to each other tend to belong to the same class
• In our case, we have a ~20k-dimensional space where each point is a gene, and its location is determined by the relationship probabilities

(Matrix: ~20K genes × ~20K probabilities)

Page 55

SVM Methodology

Page 56

SVM Methodology

Page 57

SVM Methodology

Page 58

SVM Methodology

Page 59

SVM Methodology

Page 60

SVM Methodology

Page 61

Software for SVM

• SVMlight & SVMperf from Cornell
• http://svmlight.joachims.org/
• Several simple kernels implemented; additional code can be written to use custom kernels
• The “perf” version maximizes different statistics (AUC, precision, etc.)

Page 62

Phenotype Predictions

• Using the MGI phenotype info as a starting point, predicted genes for ~1150 phenotypes

Page 63

Phenotype Predictions

• Every gene with at least one annotated allele is considered “involved” with the phenotype

Page 64

Phenotype Predictions

• Selected phenotypes with >30 annotations, <500 annotations, non-identical
• An SVM trained for each phenotype
• Classification predictions created for all tested phenotypes
• Prediction performance can be assessed computationally

Page 65

Evaluation Metrics

• TPs, FPs, TNs, FNs
• Agnostic to pairs not appearing in the standard
• ROC curves: Sensitivity–Specificity (TPR–FPR)
• PR curves: Precision–Recall

Page 66

PR-curves vs. ROC curves

• ROC gives you credit for correctly predicting negatives
• For function/phenotype predictions, we are realistically only concerned with positives
• Further, we care most about high-confidence positives
• PR-curves are better at showing this

Page 67

Performance Measurements

• On average, a 10-fold improvement over random
  – Median: ~7.5-fold over random
  – Max ~100, min ~1
• Some phenotypes we can predict well; others, not so much

Page 68

Some Top Phenotypes

• Arrested B cell differentiation
• Abnormal joint morphology
• Abnormal cell cycle checkpoint
• Decreased circulating hormone level
• Abnormal liver development
• …

Page 69

Some Bottom Phenotypes

• hepatoma
• head bobbing
• disheveled coat
• necrosis
• increased glycogen level
• lethargy
• …

Page 70

PR vs. ROC

AUC = 0.63; ~45-fold improvement

Page 71

PR vs. ROC

AUC = 0.70; ~3-fold improvement

Page 72

Some Interesting Phenotypes

Page 73

Laboratory Evaluation

• Computational evaluation is helpful, but not the real goal
• Cheryl has been kindly testing two predictions related to bone phenotypes
  – Timp2 and Abcg8
  – Timp2-/- female mice have decreased bone density, and possible morphological defects
  – Abcg8-/- male mice have increased bone density

Page 74

Timp2 Preliminary Results

Timp2-/-, 5 days old

Page 75

Results are complementary to Quantitative Genetics

Page 76

Conclusions & Plans

• Bayes nets and SVMs are powerful tools
• Careful construction of training sets (gold standards) is key
• Computational evaluations need to be appropriate to the problem context
• Laboratory evaluations are critical
• Complementary approaches are good

Page 77

Acknowledgements

• Hibbs Lab
  – Karen Dowell
  – Tongjun Gu
  – Al Simons
• Olga Troyanskaya Lab
  – Patrick Bradley
  – Maria Chikina
  – Yuanfang Guan
• Chad Myers
• David Hess
• Florian Markowetz
• Edo Airoldi
• Curtis Huttenhower
• Kai Li Lab
  – Grant Wallace
• Amy Caudy
• Maitreya Dunham
• Botstein, Kruglyak, Broach, Rose labs
• Kyuson Yun
• Carol Bult