Machine Learning for Functional Genomics II Matt Hibbs http://cbfg.jax.org


Page 1

Machine Learning for Functional Genomics II

Matt Hibbs

http://cbfg.jax.org

Page 2

Functional Genomics

• Identify the roles played by genes/proteins

Sealfon et al., 2006.

Page 3

Promise of Computational Functional Genomics

(Diagram: Data & Existing Knowledge → Computational Approaches → Predictions → Laboratory Experiments)

Page 4

Computational Solutions

• Machine learning & data mining
  – Use existing data to make new predictions
    • Similarity search algorithms
    • Bayesian networks
    • Support vector machines
    • etc.
  – Validate predictions with follow-up lab work

• Visualization & exploratory analysis
  – Seeing and interacting with data is important
  – Show data so that questions can be answered
    • Scalability, incorporating statistics, etc.

Page 5

Computational Solutions

• Machine learning & data mining
  – Use existing data to make new predictions
    • Similarity search algorithms
    • Bayesian networks
    • Support vector machines
    • etc.
  – Validate predictions with follow-up lab work

• Visualization & exploratory analysis
  – Seeing and interacting with data is important
  – Show data so that questions can be answered
    • Scalability, incorporating statistics, etc.

Page 6

Bayesian Networks

• Encodes dependence relationships between observed and unobserved events

(Diagram: "Raining?" linked to "Jim brought umbrella", "Cloudy this morning", "Rain in forecast")

Page 7

Bayesian Network Overview

• Graphical representation of relationships
  – Probabilistic information from data to concepts

Page 8

Bayesian Network Overview

• Graphical representation of relationships
  – Probabilistic information from data to concepts

Page 9

Bayesian Network Overview

P(FR | CE, AP, Y2H)

P(FR | CE=yes, AP=yes, Y2H=yes)
  = α P(FR) P(CE=yes|FR) Σ_PI P(PI|FR) P(AP=yes|PI) P(Y2H=yes|PI)

Bayes’ Rule: P(A|B) ∝ P(A) P(B|A)

P(FR=yes) + P(FR=no) = 0.0105α + 0.0216α
P(FR=yes) = 0.327 (up from the 0.10 prior)
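The computation on this slide can be sketched in code. The CPT values passed to the function below are hypothetical placeholders; only the unnormalized totals (0.0105 and 0.0216) come from the slide.

```python
def posterior_fr(prior, p_ce, p_pi, p_ap, p_y2h):
    """P(FR | CE=yes, AP=yes, Y2H=yes), summing out the hidden node PI.
    Each argument is a dict of hypothetical CPT entries keyed by 'yes'/'no'."""
    unnorm = {}
    for fr in ("yes", "no"):
        # Sum over the hidden physical-interaction node PI
        hidden = sum(p_pi[fr][pi] * p_ap[pi] * p_y2h[pi] for pi in ("yes", "no"))
        unnorm[fr] = prior[fr] * p_ce[fr] * hidden
    z = sum(unnorm.values())                    # normalizes out alpha
    return {fr: v / z for fr, v in unnorm.items()}

# The normalization step with the slide's own numbers:
unnorm = {"yes": 0.0105, "no": 0.0216}
p_yes = unnorm["yes"] / sum(unnorm.values())    # ≈ 0.327, up from the 0.10 prior
```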

Page 10

Naïve Bayes

• No internal hidden nodes
• Greatly simplifies the problem; reduces computational complexity and time
• Imposes an independence assumption

Page 11

Naïve Bayes

P(FR | D1, D2, D3, D4)
  = α P(FR) P(D1|FR) P(D2|FR) P(D3|FR) P(D4|FR)

Bayes’ Rule: P(A|B) ∝ P(A) P(B|A)

Assumes that all measures are independent

Page 12

Learning Naïve Bayes Nets

         FR = yes   FR = no
counts   100        900
prob.    0.1        0.9

FR    # D1 = yes   # D1 = no   P(D1=yes | FR)
yes   70           30          0.7
no    300          600         0.33
…
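Learning the CPTs from such a count table is a matter of taking fractions. A minimal sketch, using the table's numbers:

```python
def learn_cpt(counts):
    """counts: {(fr, d1): n} pair counts from the gold standard.
    Returns the prior P(FR) and the CPT entry P(D1=yes | FR)."""
    fr_total = {fr: counts[(fr, "yes")] + counts[(fr, "no")] for fr in ("yes", "no")}
    total = sum(fr_total.values())
    prior = {fr: n / total for fr, n in fr_total.items()}          # P(FR)
    cpt = {fr: counts[(fr, "yes")] / fr_total[fr] for fr in fr_total}  # P(D1=yes|FR)
    return prior, cpt

# The slide's table:
counts = {("yes", "yes"): 70, ("yes", "no"): 30,
          ("no", "yes"): 300, ("no", "no"): 600}
prior, cpt = learn_cpt(counts)   # prior: 0.1 / 0.9; cpt: 0.7 / 0.33
```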

Page 13

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 14

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 15

Gold Standard Construction

• Gene Ontology annotations used to define known functional relationships
• Threshold for positive relationships
• Threshold for negative relationships

Myers et al., 2006

Page 16

Gold Standard Used For Training

(Figure: Global Gold Standard — positive and negative relationships)

Page 17

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 18

Gene-Gene Scores

• Binary data
  – PPI, co-localization, synthetic lethality
  – Can use binary scores
  – Can use profiles to generate scores (dot product)

• Continuous data
  – Profile distance metrics

• Binning results
  – Converts everything to the discrete case

Page 19

Distance Metrics

• Choice of distance measure is important for quantifying relationships in datasets
• Pair-wise metrics – compare vectors of numbers
  – e.g. genes x & y, each with n measurements

Euclidean Distance
Pearson Correlation
Spearman Correlation
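The three metrics can be sketched as follows, for two profiles of equal length. This is a plain-Python version; the rank step uses simple ordinal ranks with no tie-averaging.

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation: centered dot product over the product of norms."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [a - mx for a in x]
    yc = [b - my for b in y]
    num = sum(a * b for a, b in zip(xc, yc))
    return num / sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))

def ranks(v):
    """Ordinal rank of each value (no tie-averaging in this sketch)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for pos, i in enumerate(order):
        r[i] = float(pos)
    return r

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Spearman is robust to monotone but nonlinear relationships (e.g. y = x³ gives Spearman 1 but Pearson below 1), which matters when comparing expression datasets with different dynamic ranges.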

Page 20

Distance Metrics

Euclidean Distance

Pearson Correlation

Spearman Correlation

Page 21

Sensible Binning

• The commonly used Pearson correlation yields greatly different distributions of correlations
• These differences complicate comparisons

(Figure: histograms of Pearson correlations between all pairs of genes; DeRisi et al., 97; Primig et al., 00)

Page 22

Sensible Binning

• Fisher Z-transform, then Z-score, equalizes the distributions
• Increases comparability between datasets

(Figure: histograms of Z-scores between all pairs of genes)
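A sketch of the transform: Fisher's z = atanh(r) spreads out correlations near ±1, and a per-dataset z-score then puts datasets on a common scale.

```python
from math import atanh, log

def fisher_z(r, eps=1e-7):
    """Fisher transform z = atanh(r) = 0.5 * ln((1+r)/(1-r)).
    Correlations are clipped just inside (-1, 1) to avoid infinities."""
    r = max(-1 + eps, min(1 - eps, r))
    return atanh(r)

def z_scores(values):
    """Standardize one dataset's transformed correlations to mean 0, sd 1."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / sd for v in values]
```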

Page 23

Pre-calculation and Storage

• Pair-wise distances only need to be calculated once, even if using different binnings
• Typical mouse microarray: ~5–20k genes
  • ~16M pair-wise distances
  • ~50–700 MB of storage for one dataset
  • ~800 datasets in GEO
  • ~200 GB for all datasets
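The arithmetic behind these figures can be checked directly; the ~4 bytes per stored distance below is an assumption (a single-precision float), not a number from the slide.

```python
def pair_count(n_genes):
    """Unordered gene pairs: n choose 2."""
    return n_genes * (n_genes - 1) // 2

pairs = pair_count(5700)             # ≈ 16.2M pairs, matching the slide's "~16M"
mb_one_dataset = pairs * 4 / 1e6     # ≈ 65 MB, within the quoted 50–700 MB range
gb_all = 800 * 250 / 1e3             # ~200 GB if datasets average ~250 MB each
```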

Page 24

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 25

Counting & Learning

• Conceptually straightforward
• Counting
  – Just look at all of the pairs in each dataset, see which bin each falls into, and increment a counter
  – But… you need to do this 16M times per dataset
  – “Dumb” parallelization – each dataset is independent
• Learning CPTs
  – Fractions based on counts

Page 26

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 27

Inference

• Also pretty straightforward
  – For all pairs of genes…
    • For each dataset
      – Look up the value from the pre-calculated distances
      – Determine the bin and value from the CPT
      – Multiply the probability into the product
    • Do this for FR=yes and FR=no
    • Normalize out α
    • Store the result
• ~1.5 GB result file

Page 28

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 29

Evaluation Metrics

• TPs, FPs, TNs, FNs
• Agnostic to pairs not appearing in the standard
• ROC curves: Sensitivity–Specificity
• PR curves: Precision–Recall

Page 30

Precision Recall Curves

Over the ordered predictions:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

(Plot: precision (y, 0 to 1) vs. recall (x, 0 to 1))
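Building the curve from the ordered predictions can be sketched as a single pass down the ranked list:

```python
def pr_curve(scores, labels):
    """One (precision, recall) point per prediction, in decreasing score order.
    labels are 1 for gold-standard positives, 0 for negatives."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_pos = sum(labels)               # TP + FN is fixed by the gold standard
    tp = fp = 0
    points = []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp),    # precision = TP / (TP + FP)
                       tp / total_pos))   # recall    = TP / (TP + FN)
    return points
```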

Page 31

Summary Statistics

• AUC – area under the (ROC) curve
  – equivalent to the Mann–Whitney U statistic
• Average Precision – average of the precisions calculated at each true positive
  – a quantized version of the area under the precision–recall curve (AUPRC)
• Precision @ n% recall
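Average precision as defined above, sketched in code: walk down the ranked list and average the precision values taken at each true positive.

```python
def average_precision(scores, labels):
    """Mean of the precision values at each true positive, in score order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            tp += 1
            precisions.append(tp / rank)   # precision at this true positive
    return sum(precisions) / len(precisions)
```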

Page 32

Cross Validation

Page 33

Steps for Bayesian network integration

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 34

Graph Analysis for Predictions

ci = confidence of function for gene gi
S = set of genes in the function
G = set of all genes
wi,j = weight of edge
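The slide's exact formula is in the figure rather than the text, so the sketch below shows one plausible neighborhood score under those definitions: gene gi's confidence for a function is its edge weight into the function's gene set S, normalized by its total edge weight across G.

```python
def function_confidence(weights, s, gene):
    """Hypothetical neighborhood score for one gene and one function.
    weights: {gene: {neighbor: w_ij}} from the probabilistic network.
    s: set of genes annotated to the function."""
    edges = weights[gene]
    total = sum(edges.values())                          # weight into all of G
    in_s = sum(w for j, w in edges.items() if j in s)    # weight into S
    return in_s / total if total else 0.0
```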

Page 35

Steps for Our Evaluation

• Construct a gold standard
• Convert data to pair-wise format
• Count positive/negative pairs in each dataset
• Create CPTs to define the Bayes net
• Inference to calculate all pair-wise probabilities
• Evaluate performance
• Predict functions given the network

Page 36

Bayesian Network Integration

Gene expression:
• Gene expression dataset 1
• Gene expression dataset 2
• Gene expression dataset N

Physical interactions:
• Yeast two-hybrid dataset 1
• Co-precipitation dataset 1

Other:
• Transcription factor binding sites
• Localization
• Curated literature

Genetic interactions:
• Synthetic lethality dataset
• Synthetic rescue dataset

Myers et al., 2005; Huttenhower et al., 2006; Guan et al., 2008

• New genes predicted to interact with known mitochondrial genes
• Data integration via a Bayesian network
• User-selected query focuses the search
• Probabilistic, weighted networks of gene function
• Results displayed

Page 37

Basic Approach Applied Several Times

Myers et al., 2005; 2007

Guan et al., 2008

Huttenhower et al., 2007

Huttenhower et al., 2009

Page 38

Limitations and Improvements

• Original work designed for yeast, and for a general notion of “functionally related”
  – Ignores the reality that some genes are related only under certain conditions
  – Treats multi-cellular organisms as big single-celled organisms

• Increased specificity can be used to improve results
  – The 2nd iteration of bioPIXIE included biological processes in its gold standards
  – Currently working on a 2nd-generation mouseNET to account for tissue and developmental stages

Page 39

General mouseNET Approach

Page 40

Global Gold Standard

(Figure: positive and negative relationships)

Page 41

Specific Gold Standards

• Not all datasets capture all functional relationships
  – Process/pathway specific
• Functionally related genes aren’t always functionally related
  – Tissue specific
  – Developmental stage specific

Page 42

Specific Gold Standard Construction

(Figure: Global Gold Standard vs. Specific Gold Standard — positive and negative relationships)

Page 43

Tissue/Stage Gold Standards

• Based on data from GXD
• Cross-reference Theiler stages with the mammalian anatomy hierarchy
• 729 total intersections
  – ranging from 50 to ~3500 genes
  – not including post-natal stages

Page 44

Initial Computational Evaluations

Page 45

Preliminary Results

• Running 4-fold cross validation using tissue/stage-specific GO-based gold standards

(Plots: training evaluation; test evaluation)

Page 46

Preliminary Results

• Accounting for developmental stage helps

(Plots: training evaluation; test evaluation)

Page 47

Preliminary Results

• Many specific tissue/stage combinations are overfitting

(Plots: training evaluation; test evaluation)

Page 48

Preliminary Results

• Folds were randomly generated and are biased; need to balance positives and negatives

Page 49

New Visualization Interface

• Graphle

Page 50

Simple Things, Long Times

• No single step is too complicated
• Mostly O(G²D)
  – 16M * 800 * 4
• Evaluating one fold takes ~7 hours
• So far have results for ~200 tissue/stages
  – Should take ~3 days on the cluster
  – Actually took ~15 days

Page 51

Bayesian network utility

• Bayesian networks are a powerful tool
• Currently improving on the existing MouseNET project by incorporating tissue/stage information
• Preliminary results are promising, but the standards may be too limited
• A multiple-stage process may be useful
  – predict tissue/stage-specific expression
  – use these predictions in functional gold standards
  – use a continuous gold standard?

Page 52

Computational Solutions

• Machine learning & data mining
  – Use existing data to make new predictions
    • Similarity search algorithms
    • Bayesian networks
    • Support vector machines
    • etc.
  – Validate predictions with follow-up lab work

• Visualization & exploratory analysis
  – Seeing and interacting with data is important
  – Show data so that questions can be answered
    • Scalability, incorporating statistics, etc.

Page 53

From Relationships to Phenotypes

• Use the outputs of Bayesian data integration as inputs to a phenotype prediction problem
• For each gene, the vector of relationship probabilities is used as its feature vector
• Use a Support Vector Machine (SVM) to classify genes involved in a phenotype vs. not involved in that phenotype
• Process repeated for hundreds of phenotypes

Page 54

SVM Methodology

• Every feature vector is thought of as a point in space
• Points nearer to each other tend to belong to the same class
• In our case, we have a ~20k-dimensional space where each point is a gene, and its location is determined by the relationship probabilities

(Matrix: ~20K genes × ~20K probabilities)

Page 55

SVM Methodology

Page 56

SVM Methodology

Page 57

SVM Methodology

Page 58

SVM Methodology

Page 59

SVM Methodology

Page 60

SVM Methodology

Page 61

Software for SVM

• SVMlight & SVMperf from Cornell
• http://svmlight.joachims.org/
• Several simple kernels implemented; additional code can be written to use custom kernels
• The “perf” version maximizes different statistics (AUC, precision, etc.)

Page 62

Phenotype Predictions

• Using the MGI phenotype info as a starting point, predicted genes for ~1150 phenotypes

Page 63

Phenotype Predictions

• Every gene with at least one annotated allele is considered “involved” with the phenotype

Page 64

Phenotype Predictions

• Selected phenotypes with >30 annotations, <500 annotations, non-identical
• An SVM trained for each phenotype
• Classification predictions created for all tested phenotypes
• Prediction performance can be assessed computationally

Page 65

Evaluation Metrics

• TPs, FPs, TNs, FNs
• Agnostic to pairs not appearing in the standard
• ROC curves: Sensitivity–Specificity (TPR–FPR)
• PR curves: Precision–Recall

Page 66

PR-curves vs. ROC curves

• ROC gives you credit for correctly predicting negatives
• For function/phenotype predictions, we are realistically only concerned with positives
• Further, we care most about high-confidence positives
• PR-curves are better at showing this

Page 67

Performance Measurements

• On average, a 10-fold improvement over random
  – Median: ~7.5-fold over random
  – Max ~100, min ~1
• Some phenotypes we can predict well; others, not so much

Page 68

Some Top Phenotypes

• Arrested B cell differentiation
• Abnormal joint morphology
• Abnormal cell cycle checkpoint
• Decreased circulating hormone level
• Abnormal liver development
• …

Page 69

Some Bottom Phenotypes

• hepatoma
• head bobbing
• disheveled coat
• necrosis
• increased glycogen level
• lethargy
• …

Page 70

PR vs. ROC

AUC = 0.63; ~45-fold improvement

Page 71

PR vs. ROC

AUC = 0.70; ~3-fold improvement

Page 72

Some Interesting Phenotypes

Page 73

Laboratory Evaluation

• Computational evaluation is helpful, but not the real goal
• Cheryl has been kindly testing two predictions related to bone phenotypes
  – Timp2 and Abcg8
  – Timp2-/- female mice have decreased bone density, and possible morphological defects
  – Abcg8-/- male mice have increased bone density

Page 74

Timp2 Preliminary Results

Timp2-/-, 5 days old

Page 75

Results are complementary to Quantitative Genetics

Page 76

Conclusions & Plans

• Bayes nets and SVMs are powerful tools
• Careful construction of training sets (gold standards) is key
• Computational evaluations need to be appropriate to the problem context
• Laboratory evaluations are critical
• Complementary approaches are good

Page 77

Acknowledgements

• Hibbs Lab
  – Karen Dowell
  – Tongjun Gu
  – Al Simons
• Olga Troyanskaya Lab
  – Patrick Bradley
  – Maria Chikina
  – Yuanfang Guan
• Chad Myers
• David Hess
• Florian Markowetz
• Edo Airoldi
• Curtis Huttenhower
• Kai Li Lab
  – Grant Wallace
• Amy Caudy
• Maitreya Dunham
• Botstein, Kruglyak, Broach, Rose labs
• Kyuson Yun
• Carol Bult