L10: Classic RF to uncover biological interactions

Kirill Bessonov
GBIO0002

Nov 24th 2015

Talk Plan

• Trees
  – Basic concepts
  – Examples

• Tree-based algorithms
  – Regression trees
  – Random Forest

• Practical on RF
  – RF variable selection

• Networks
  – Network vocabulary
  – Biological networks

Data Structures

• Arrangement of data in a computer's memory
• Convenient access by algorithms
• Main types
  – Arrays
  – Lists
  – Stack
  – Queue
  – Binary trees
  – Graph

[Figure: examples of a stack, a queue, a tree, and a graph]

Trees

• A tree is a data structure with
  – hierarchical relationships

• Basic elements
  – Nodes (N)
    • Variables
    • Features
    • e.g. files, genes, cities
  – Edges (E)
    • Directed links
      – From lower to higher depth

Nodes (N)

• Usually defined by one variable
• Are selected from the variables {x1 … xp}
  – Different selection criteria
    • E.g. strength of association with the response (Y~X)
    • E.g. “best” split
    • Others

• A node variable can take many forms
  – Question
  – Feature
  – Data point

Binary splits

• A node can be split in
  – Two ways
    • Binary split
    • Two child nodes
  – Multiple ways
    • Multi-split
    • Several child nodes

• Travelling from top to bottom of a tree
  – Like being lost in a cave maze

Edges (E)

• Edges connect
  – Parent and child nodes (parent → child)
  – Directional
• Do not have weight
• Represent node splits

[Figure: a parent node linked to its children]

Tree types

• Decision
  – To move to the next node, a decision must be taken
• Classification
  – Predict a class
    • Use input to predict the output class label
    • i.e. classify the input
• Regression
  – Predict an output value
    • i.e. use input to predict a numeric output

Predicting response(s)

• Input: data on a sample; Nodes: variables
• Travel from the root down to the child nodes


Decision Tree example

Classification tree

• Leaf nodes predict the class of the input (i.e. a customer)
• Output class label: a yes or no answer

Would a banking customer accept a personal loan? Classes = {yes, no}

Classification example

Name   Cough  Fever  Weight  Pain  Class
Marie  yes    yes    skinny  none
Jean   no     no     normal  none
Marc   yes    no     normal  none

Classification example

Name   Cough  Fever  Weight  Pain  Class
Marie  yes    yes    skinny  none  flu
Jean   no     no     normal  none  none
Marc   yes    no     normal  none  cold
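The labelled table can be read as a very small classification tree. The sketch below is only an illustration: the function name `classify_patient` and its two splits (Fever first, then Cough) are assumptions chosen to reproduce the three rows of the table, not a tree given in the lecture.

```r
# Hypothetical two-level classification tree consistent with the table:
# split on Fever first, then on Cough.
classify_patient <- function(cough, fever) {
  if (fever == "yes") {
    "flu"            # leaf: fever present
  } else if (cough == "yes") {
    "cold"           # leaf: cough without fever
  } else {
    "none"           # leaf: neither symptom
  }
}

patients <- data.frame(
  name  = c("Marie", "Jean", "Marc"),
  cough = c("yes", "no", "yes"),
  fever = c("yes", "no", "no"),
  stringsAsFactors = FALSE
)

# Send each patient from the root down to a leaf
patients$class <- mapply(classify_patient, patients$cough, patients$fever,
                         USE.NAMES = FALSE)
print(patients)
```

With only three samples many other trees would fit equally well; a real tree would pick its splits with a purity criterion.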

Regression trees

• Predict an outcome
  – e.g. the price of a car

Trees

• Purpose 1: recursively partition data
  – Cut the data space with perpendicular hyper-planes (w)

• Purpose 2: classify data
  – Class label at the leaf node
  – E.g. will a potential customer respond to a direct mailing?
    • Predicted binary class: YES or NO

Source: DECISION TREES by Lior Rokach

Tree growth and splitting

• In the top-down approach
  – Assign all data to the root node
• Select attribute(s)/feature(s) to split the node
• Stop tree growth when
  – Max depth is reached
  – Splitting criterion is not met

[Figure: a tree split on the selected feature(s), X>x / X<x and then Y>y / Y<y, ending in leaf (terminal) nodes]

Recursive splitting (1)

[Figure: data and successive splits]

Recursive splitting (2)

Each node corresponds to a quadrant

[Figure: data space partitioned into quadrants; axes: Variable 1 and Variable 2]


Recursive splitting (3)

Stopping rules

• When is splitting enough?
  – Max depth reached
    • No new levels wanted
  – Node contains too few samples
    • Prone to unreliable results
  – Further splits do not improve the purity of child nodes
  – Association of Y~X is below threshold

Other tree uses

• Trees can also be used for
  – Clustering
  – Hierarchy determination
    • E.g. phylogenetic trees
• Convenient visualization
  – Effective visual condensation of the clustering results
• Gene Ontology
  – Directed acyclic graph (DAG)
  – Example of functional hierarchy


GO tree example


Alignment trees

Tree Ensembles
Random Forest (RF)

Random Forest

• Introduced by Breiman in 1999
  – Ensemble of tree predictors
  – Let each tree “vote” for the most popular class
• Significant performance improvement
  – Over previous classification algorithms
    • CART and C4.5
• Relies on randomness
• Variable selection based on purity of a split
  – GINI index

Random Forests

Randomly build an ensemble of trees

1. Bootstrap a sample of data, start building a tree
2. Create a node by
   1. Randomly selecting m variables from M
   2. Keeping m constant (except for terminal nodes)
3. Split the node based on the m variables
4. Grow the tree until no more splits are possible
5. Repeat steps 1-4 n times
   – Generates an ensemble of trees
6. Calculate variable importance for each predictor variable X

Random Forest animation

[Animation: three trees (1, 2, 3) are grown from bootstrap samples of a data table with variables {A, B, C, D} and rows A1 B1 C1 D1 to A7 B7 C7 D7; each node splits (<X / >X) on a variable chosen from a random subset of the variables]

Building a forest

• Forest
  – Collection of several trees
• Random Forest
  – Aggregation of several decision trees
• Logic
  – Single tree: too variable performance
  – Forest of trees: good and stable performance
    • Predictions averaged over several trees

Splits

• Based on the split purity criterion of a node
  – Gini impurity criterion (GINI index)
  – Measures impurity of the outputs after a split
• A split of a node is made on the variable m
  – with the lowest Gini Index split (see "GINI Index of a split")

$$Gini(N) = 1 - \sum_{j} p_j^2$$

where j is a class and p_j is the probability of class j (proportion of samples with class j)

Goodness of split

Which two splits would give
  – the highest purity?
  – the lowest Gini Index?

GINI Index calculation

• At a given tree node, p(normal) = 0.4 and p(asthma) = 0.6. Calculate the node GINI Index.

$$Gini\ Index = 1 - (0.4^2 + 0.6^2) = 0.48$$

• What is the GINI index of a “pure” node?
  – Gini Index = 0
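The node calculation above can be checked with a few lines of base R. This is a minimal sketch; the function name `gini_index` is an assumption introduced here for illustration.

```r
# Gini index of a node from its class proportions p_j: 1 - sum(p_j^2)
gini_index <- function(p) {
  stopifnot(abs(sum(p) - 1) < 1e-8)   # proportions must sum to 1
  1 - sum(p^2)
}

gini_index(c(normal = 0.4, asthma = 0.6))   # mixed node from the slide: 0.48
gini_index(c(normal = 1.0, asthma = 0.0))   # pure node: 0
```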

GINI Index of a split

• When building a tree, the GINI Index of a split is used instead
• Given node N and a splitting value j*
  – the left child (N_left) has S_left samples
  – the right child (N_right) has S_right samples

$$Gini_{split}(N, j^*) = \frac{S_{left}}{S_{left} + S_{right}}\, Gini(N_{left}) + \frac{S_{right}}{S_{left} + S_{right}}\, Gini(N_{right})$$

GINI Index split

• Given tax evasion data sorted by the income variable, choose the best split value based on the GINI Index split.

Income   Yes (<=)  Yes (>)  No (<=)  No (>)  GINI Index split
10K      0         3        0        7       0.42
20K      0         3        1        6       0.4
30K      0         3        2        5       0.375
40K      0         3        3        4       0.343
50K      1         2        3        4       0.417
60K      2         1        3        4       0.4
70K      3         0        3        4       0.3
80K      3         0        4        3       0.343
90K      3         0        5        2       0.375
100K     3         0        6        1       0.4
110K     3         0        7        0       0.42
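The tabulated values can be recomputed from the class counts, assuming the split index is the sample-weighted average of the Gini indices of the two child nodes (this weighting reproduces the numbers in the table). The helper names `gini_counts` and `gini_split` are illustrative, not part of any library.

```r
# Gini index of a node from its class counts
gini_counts <- function(counts) {
  n <- sum(counts)
  if (n == 0) return(0)            # an empty child contributes nothing
  1 - sum((counts / n)^2)
}

# Sample-weighted Gini index of a split into a left (<=) and a right (>) child
gini_split <- function(left, right) {
  n <- sum(left) + sum(right)
  sum(left) / n * gini_counts(left) + sum(right) / n * gini_counts(right)
}

income <- paste0(seq(10, 110, by = 10), "K")
yes_le <- c(0, 0, 0, 0, 1, 2, 3, 3, 3, 3, 3)   # Yes, income <= split value
no_le  <- c(0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7)   # No,  income <= split value
yes_gt <- 3 - yes_le                           # Yes, income >  split value
no_gt  <- 7 - no_le                            # No,  income >  split value

g <- mapply(function(yl, nl, yg, ng) gini_split(c(yl, nl), c(yg, ng)),
            yes_le, no_le, yes_gt, no_gt)
names(g) <- income
print(round(g, 3))

names(g)[which.min(g)]   # "70K": the split with the lowest Gini index split
```

The minimum (0.3) is reached at 70K, where the right child contains only "No" samples.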

mtry parameter

• To build each node, m variables are selected
• Specified by the mtry parameter
• Allows building different trees
  – Randomized selection of variables at each split
  – Gives heterogeneity to a forest
• Given X={A,B,C,D,E,F,G} and mtry=2
  – Node N1 = {A,B}
  – Node N2 = {C,G}
  – Node N3 = {A,D}
  – …
• Default mtry = sqrt(p) for classification (p/3 for regression)
  – p: number of predictor variables
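The per-node random selection can be imitated with `sample()`. This is a sketch of the selection step only (no tree is grown); the seed and the node labels are arbitrary choices made for the example.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
X    <- c("A", "B", "C", "D", "E", "F", "G")
mtry <- floor(sqrt(length(X)))   # sqrt(7) rounded down = 2

# Draw an independent random subset of mtry candidate variables for each node
candidates <- lapply(1:3, function(node) sample(X, mtry))
names(candidates) <- c("N1", "N2", "N3")
print(candidates)
```

Only the variables drawn for a node compete for its split, which is why two trees built on similar data can still differ.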

RF stopping criteria

• RF: ensemble of non-pruned decision trees
• Grow each tree until
  – Node has maximum purity (GINI index = 0)
    • All samples of the same class
  – No more samples for the next split
    • 1 sample in a node
• Greedy split selection
• randomForest library min samples per node
  – 1 in classification
  – 5 in regression

Performance comparison

• RF handles
  – missing values
  – continuous and categorical predictors (i.e. X)
  – high-dimensional datasets where p>>N
• Improves over single tree performance

* Lower is better
Source: Breiman, Leo. "Statistical modeling: The two cultures." Quality control and applied statistics 48.1 (2003): 81-82.

RF variable importance (1)

• Need to estimate the “importance” of each predictor {x1 … xp} in predicting a response y
• Ranking of predictors
• Variable importance measure (VIM)
  – Classification: misclassification rate (MR)
  – Regression: mean square error (MSE)
• VIM: the mean increase in the error (MR or MSE) over the forest when the values of a predictor are randomly permuted in the OOB samples

Out-of-the-bag (OOB) samples

• Dataset divided into
  – Bootstrap (i.e. training)
  – OOB (i.e. testing)
• Trees are built on the bootstrap sample
• Predictions are made on the OOB sample
• Benefits
  – Avoids over-fitting
    • False results

[Figure: for each tree the data are divided into bootstrap and OOB samples; the OOB predictions yield the VIMs, e.g. X1 … 1.5, X2 … 1.2, X3 … -0.5]
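The division into bootstrap and OOB samples for a single tree can be sketched in base R. The sample size `n = 10` and the seed are hypothetical values chosen for the example.

```r
set.seed(42)                     # arbitrary seed, only for reproducibility
n   <- 10                        # hypothetical number of samples
ids <- seq_len(n)

# Bootstrap sample: n draws with replacement (used to grow one tree)
boot_ids <- sample(ids, n, replace = TRUE)

# OOB samples: every sample that was never drawn (used to test that tree)
oob_ids <- setdiff(ids, boot_ids)

print(sort(boot_ids))
print(oob_ids)

# Chance that a given sample is left out of one bootstrap sample;
# for large n this tends to exp(-1), about 0.37
(1 - 1 / n)^n
```

Because every tree draws its own bootstrap sample, every tree also has its own OOB set.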

RF variable importance (2)

1. Predict classes of the OOB samples using each tree of the RF
2. Calculate the misclassification rate = out-of-bag error rate (OOBerror_obs) for each tree
3. For each variable in the tree, permute the variable's values
4. Using the tree, compute the permutation-based out-of-bag error (OOBerror_perm)
5. Aggregate OOBerror_perm over all trees
6. Compute the final VIM
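The permutation idea behind these steps can be illustrated without growing a forest. The sketch below deliberately swaps in simpler pieces, all of which are assumptions made for the example: a single linear model (`lm`) stands in for one tree, a held-out half of simulated data stands in for the OOB sample, and the error is the MSE. There is no aggregation over trees (step 5).

```r
set.seed(7)                                  # arbitrary seed
n  <- 200
x1 <- rnorm(n)                               # informative predictor
x2 <- rnorm(n)                               # pure noise predictor
y  <- 3 * x1 + rnorm(n, sd = 0.5)
dat <- data.frame(y, x1, x2)

train <- dat[1:100, ]                        # stand-in for the bootstrap sample
test  <- dat[101:200, ]                      # stand-in for the OOB sample

fit <- lm(y ~ x1 + x2, data = train)         # stand-in for one tree
mse <- function(d) mean((d$y - predict(fit, newdata = d))^2)

err_obs <- mse(test)                         # observed error (step 2)

vim <- sapply(c("x1", "x2"), function(v) {
  permuted <- test
  permuted[[v]] <- sample(permuted[[v]])     # permute one predictor (step 3)
  mse(permuted) - err_obs                    # increase in error (steps 4 and 6)
})
print(vim)
```

Permuting the informative predictor `x1` should raise the error markedly, while permuting the noise predictor `x2` should change it very little (possibly by a slightly negative amount).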

RF variable importance (3)

• The VIM is mathematically defined as

$$VIM(x_j) = \frac{1}{ntree} \sum_{t=1}^{ntree} \left( OOBerror_{perm}^{(t)}(x_j) - OOBerror_{obs}^{(t)} \right)$$

• VIM is not restricted to positive values
  – Thus, VIM could be negative
    • Means that the variable is not informative
• Ideal situation: OOBerror_perm > OOBerror_obs

Aims of variable selection

• Find variables related to the response
  – E.g. predictive of class with highest probability
• Simplify the problem
  – Summarize the dataset with fewer variables
  – Decrease dimensionality

RFs in R
randomForest library

Titanic example

randomForest

• Well implemented library
• Good performance
• Main functions
  – randomForest()
  – importance()
• Install library
  – install.packages("randomForest", repos="http://cran.freestatistics.org")
• Load titanic_example.Rdata

randomForest

Which factor was the most important in survival of passengers?

library(randomForest)

# read the Titanic passenger data (text columns such as sex are read as factors)
titanic_data = read.table(file="http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt", header=TRUE, sep=",", stringsAsFactors=TRUE)

# 75% of the passengers for training, the remaining 25% for testing
n = nrow(titanic_data)
train_idx = sample(1:n, floor(0.75*n))
test_idx = setdiff(1:n, train_idx)

titanic_data.train = titanic_data[train_idx,]
titanic_data.test = titanic_data[test_idx,]

# classification forest: survival ~ passenger class + sex + age
titanic.survival.train.rf = randomForest(as.factor(survived) ~ pclass + sex + age, data=titanic_data.train, ntree=1000, importance=TRUE, na.action=na.omit)

# confusion matrix stored in the fitted forest (based on OOB predictions)
c_matrix = titanic.survival.train.rf$confusion[1:2,1:2]

print("accuracy in training data")
sum( diag(c_matrix) ) / sum(c_matrix)

imp = importance(titanic.survival.train.rf)
print(imp)

# output (columns 0 and 1: importance per class of survived)
        0        1        MeanDecreaseAccuracy MeanDecreaseGini
pclass  25.08341 27.00470 30.59483             16.38874
sex     77.77933 82.17791 84.19724             74.51846
age     22.82038 22.48106 30.55145             18.07370

varImpPlot(titanic.survival.train.rf)

Variable importance

• Which variable was the most important in survival of passengers?


RF to uncover networks of interactions

One to many

• So far we have seen one response Y and many predictors X
• Can use RF to predict many Ys sequentially
  – Creates an interaction network

RF to interaction networks (1)

• Can we build networks from tree ensembles?
  – Need to consider all possible interactions
    • Y1~X, then Y2~X … Yp~X
  – Need to “shift” Y to a new variable
    • Assign Y to a new variable (previously X)
  – Complete the matrix of interactions (i.e. network)

RF to interaction networks (1)

• For example, given continuous data and variables A, B, C, build an interaction network
  – Consider interaction scenarios
    • A ~ {B,C}
    • B ~ {A,C}
    • C ~ {A,B}
  – Need 3 RF runs giving 3 sets of VIMs
  – Fill out the interaction network matrix A

     A   B   C
A    0
B        0
C            0

Interaction network (p x p matrix)

RF to interaction networks (2)

• Read in input data matrix D with 3 variables
  – Continuous scale
• Aim: variable ranking and response prediction
• Load RFtoNetworks.Rdata

D =
      A      B      C
1     4.41   7.95   6.27
2     11.18  5.76   3.35
3     1.32   7.74   1.11
4     6.82   3.83   3.64
5     10.51  5.05   10.42
6     2.67   7.78   1.83
7     6.24   5.30   6.07
8     6.85   1.56   3.50
9     10.50  4.89   8.44
10    9.15   8.05   9.73

rf = randomForest(A~B+C, data=D, importance=T)
importance(rf,1)
  %IncMSE
B 8.706760
C 8.961513

rf = randomForest(B~A+C, data=D, importance=T)
importance(rf,1)
  %IncMSE
A 9.603829
C 3.271325

rf = randomForest(C~A+B, data=D, importance=T)
importance(rf,1)
  %IncMSE
A 8.830840
B 1.519951

RF to interaction networks (3)

• Fill out the interaction matrix (rows: responses; columns: predictors)

     A         B         C
A    0         8.706760  8.961513
B    9.603829  0         3.271325
C    8.830840  1.519951  0

• The resulting network

[Figure: network of A, B and C with edges weighted by the VIMs: 8.70, 8.96, 9.60, 3.27, 8.83, 1.51]
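Filling the matrix and turning it into a list of weighted edges can be sketched in base R. The %IncMSE values are the ones reported for the three RF runs; the object names `net` and `edges`, and reading each entry as a directed edge from predictor to response, are assumptions made for this illustration.

```r
vars <- c("A", "B", "C")
net  <- matrix(0, nrow = 3, ncol = 3,
               dimnames = list(response = vars, predictor = vars))

# %IncMSE values reported on the slides, one RF run per response (row)
net["A", c("B", "C")] <- c(8.706760, 8.961513)   # A ~ B + C
net["B", c("A", "C")] <- c(9.603829, 3.271325)   # B ~ A + C
net["C", c("A", "B")] <- c(8.830840, 1.519951)   # C ~ A + B
print(net)

# Flatten to a weighted edge list, strongest association first
edges <- as.data.frame(as.table(net), responseName = "vim")
edges <- edges[edges$vim > 0, ]
edges <- edges[order(-edges$vim), ]
print(edges)
```

The matrix is not symmetric (for example A as a predictor of B scores 9.60, while B as a predictor of A scores 8.71), so the network is naturally read as directed.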

Networks of biological interactions

Networks

• What comes to your mind? Related terms?
• Where can we find networks?
• Why should we care to study them?


We are surrounded by networks


Transportation Networks


Computer Networks


Social networks


Internet submarine cable map


Social interaction patterns

PPI (Protein-Protein Interaction Networks)

• Nodes: protein names
• Links: physical binding events


Network Definitions

Network components

• Networks are also called graphs
  – A graph (G) contains
    • Nodes (N): genes, SNPs, cities, PCs, etc.
    • Edges (E): links connecting two nodes

Characteristics

• Networks are
  – Complex
  – Dynamic
  – Can be used to reduce data dimensionality

[Figure: the same network at time = t0 and at time = t]

Topology

• Refers to the connection pattern of a network
  – The pattern of links

Modules

• Sub-networks with
  – Specific topology
  – Function
• Biological context
  – Protein complex
  – Common function
    • E.g. energy production

[Figure: a clique]

Network types

• Directed
  – Edges have directionality
  – Some links are unidirectional
  – Direction matters
    • Going A → B is not the same as B → A
  – Analogous to chemical reactions
    • Forward rate might not be the same as reverse
  – E.g. directed gene regulatory networks (TF → gene)
• Undirected
  – Edges have no directionality
  – Simpler to describe and work with
  – E.g. co-expression networks

Edges Types

[Figure: a graph with N nodes and E edges, drawn as directed and as undirected]

Neighbours of node(s)

• Neighbours(node, order) = {node1 … nodep}
• Neighbours(3,1) = {2,4}
• Neighbours(2,2) = {1,3,5,4}
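The Neighbours(node, order) notation can be sketched as a bounded breadth-first expansion. The slide's figure is not available here, so the 6-node edge list below is an assumption, chosen only because it reproduces the two examples above; the function name `neighbours` is likewise illustrative.

```r
# Assumed edge list of an undirected 6-node example graph (not from the slide)
edges <- rbind(c(1, 2), c(1, 5), c(2, 3), c(2, 5), c(3, 4), c(4, 5), c(4, 6))

# Nodes reachable from `node` in at most `order` steps (the node itself excluded)
neighbours <- function(node, order, edges) {
  reached  <- node
  frontier <- node
  for (step in seq_len(order)) {
    # direct neighbours of the current frontier, following edges both ways
    nxt <- c(edges[edges[, 1] %in% frontier, 2],
             edges[edges[, 2] %in% frontier, 1])
    frontier <- setdiff(nxt, reached)
    reached  <- c(reached, frontier)
  }
  sort(setdiff(reached, node))
}

neighbours(3, 1, edges)   # 2 4
neighbours(2, 2, edges)   # 1 3 4 5
```

Under this reading, the neighbours of order 2 include the first-order neighbours as well.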

Node degree (k)

• The number of edges connected to the node
• k(6) = 1
• k(4) = 3

Connectivity matrix
(also known as adjacency matrix)

[Figure: connectivity matrix A of the example graph]

• Size: N x N
• Binary or weighted


Degree distribution (P(k))

• Determines the statistical properties of uncorrelated networks

source: http://www.network-science.org/powerlaw_scalefree_node_degree_distribution.html
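Node degree, the connectivity matrix and the degree distribution are closely linked, which a small base R sketch can show. The edge list is an assumed undirected 6-node graph (the slide's figure is not available); it was chosen to agree with the degrees k(6) = 1 and k(4) = 3 quoted earlier.

```r
# Assumed edge list of an undirected 6-node example graph (not from the slide)
edges <- rbind(c(1, 2), c(1, 5), c(2, 3), c(2, 5), c(3, 4), c(4, 5), c(4, 6))
N <- max(edges)

# Binary connectivity (adjacency) matrix: N x N, symmetric if undirected
A <- matrix(0, N, N, dimnames = list(1:N, 1:N))
A[edges] <- 1              # entry (i, j) for every edge i-j
A[edges[, 2:1]] <- 1       # mirrored entry (j, i)
print(A)

# Node degree k: number of edges attached to each node
k <- rowSums(A)
print(k)

# Degree distribution P(k): fraction of nodes that have degree k
P_k <- table(k) / N
print(P_k)
```

A weighted network would store edge weights in `A` instead of 0/1, and a directed one would drop the mirrored assignment.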

Topology: random

Degree distribution of nodes is statistically independent

Topology: Scale-free

• Biological processes are characterized by this topology
  – Few hubs (highly connected nodes)
  – Predominance of poorly connected nodes
  – New vertices attach preferentially to highly connected ones
• Barabási, Albert-László, and Réka Albert. "Emergence of scaling in random networks." Science 286.5439 (1999): 509-512.

Topologies: scale-free

Most real networks have a degree distribution that follows a power law, as do

• the sizes of earthquakes
• the sizes of craters on the moon
• solar flares
• the sizes of activity patterns of neuronal populations
• the frequencies of words in most languages
• frequencies of family names
• sizes of power outages
• criminal charges per convict
• and many more

Shortest path (p)

• Indicates the distance between i and j in terms of geodesics (unweighted)
• p(1,3) =
  – {1-5-4-3}
  – {1-5-2-3}
  – {1-2-5-4-3}
  – {1-2-3}
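In an unweighted graph a shortest path can be found with breadth-first search. The sketch below assumes a 6-node undirected edge list that is consistent with the four candidate paths listed above (the slide's figure itself is not available); `shortest_path` is an illustrative helper, not a library function.

```r
# Assumed edge list of an undirected 6-node example graph (not from the slide)
edges <- rbind(c(1, 2), c(1, 5), c(2, 3), c(2, 5), c(3, 4), c(4, 5), c(4, 6))
N <- max(edges)
A <- matrix(0, N, N)
A[edges] <- 1
A[edges[, 2:1]] <- 1

# Breadth-first search: returns one shortest path from `from` to `to`
shortest_path <- function(A, from, to) {
  parent  <- rep(NA_integer_, nrow(A))
  visited <- rep(FALSE, nrow(A))
  visited[from] <- TRUE
  queue <- from
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    if (v == to) break
    for (w in which(A[v, ] == 1 & !visited)) {
      visited[w] <- TRUE
      parent[w]  <- v            # remember how w was reached
      queue <- c(queue, w)
    }
  }
  if (!visited[to]) return(NULL) # no path exists
  path <- to
  while (path[1] != from) path <- c(parent[path[1]], path)
  path
}

p <- shortest_path(A, 1, 3)
print(p)          # 1 2 3
length(p) - 1     # geodesic distance: 2 edges
```

Of the four candidates, {1-2-3} is the only one with two edges, so it is the shortest path between nodes 1 and 3 in this assumed graph.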


Cliques

• A clique of a graph G is a complete subgraph of G– i.e. maximally interconnected subgraph

• The highlighted clique is the maximal clique of size 4 (nodes)
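Whether a set of nodes forms a clique can be checked directly on the adjacency matrix. The graph below is a hypothetical 6-node example, not the graph highlighted on the slide (its largest clique has 3 nodes, not 4), and `is_clique` is an illustrative helper.

```r
# Hypothetical undirected 6-node graph (an assumption, not the slide's figure)
edges <- rbind(c(1, 2), c(1, 5), c(2, 3), c(2, 5), c(3, 4), c(4, 5), c(4, 6))
N <- max(edges)
A <- matrix(0, N, N)
A[edges] <- 1
A[edges[, 2:1]] <- 1

# TRUE if every pair of distinct nodes in `nodes` is linked (complete subgraph)
is_clique <- function(A, nodes) {
  sub <- A[nodes, nodes, drop = FALSE]
  all(sub[upper.tri(sub)] == 1)
}

is_clique(A, c(1, 2, 5))   # TRUE: edges 1-2, 1-5 and 2-5 all exist
is_clique(A, c(2, 3, 4))   # FALSE: nodes 2 and 4 are not linked
```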

“The richest people in the world look for and build networks. Everyone else looks for work.”

– Robert Kiyosaki


Biological context


Biological Networks

Biological examples

• Co-expression
  – For genes that have similar expression profiles
• Directed gene regulatory networks (GRNs)
  – Show directionality between gene interactions
    • Transcription factor → target gene expression
  – Show direction of information flow
  – E.g. transcription factor activating target gene
• Protein-Protein Interaction Networks (PPI)
  – Show physical interaction between proteins
  – Concentrate on binding events
• Others
  – Metabolic, differential, Bayesian, etc.

Biological networks

• Three main classes

Type                                Name   Nodes           Edges                          Resource
molecular interactions              PPI    proteins        physical bonds                 BioGRID
                                    DTI    drugs/targets   physical bonds                 PubChem
functional associations             GI     genes           genetic interactions           BioGRID
                                    ON     Gene Ontology   functional relations           GO
                                    GDA    genes/diseases  associations                   OMIM
functional/structural similarities  Co-Ex  genes           expression profile similarity  GEO, ArrayExpress
                                    PStrS  proteins        structural similarities        PDB

Source: Gligorijević, Vladimir, and Nataša Pržulj. "Methods for biological data integration: perspectives and challenges." Journal of The Royal Society Interface 12.112 (2015): 20150571.

Summary

• Trees are powerful techniques for
  – Response prediction
    • e.g. classification
• Random Forest is a powerful tree ensemble technique for variable selection
• RF can be used to build networks
  – Assess pair-wise variable associations
• Networks are well suited for representing interactions
  – Biological networks are scale-free

References

1) https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
2) Breiman, Leo. "Statistical modeling: The two cultures." Quality control and applied statistics 48.1 (2003): 81-82.
3) Liaw, Andy, and Matthew Wiener. "Classification and regression by randomForest." R news 2.3 (2002): 18-22.
4) Loh, Wei-Yin, and Yu-Shan Shih. "Split selection methods for classification trees." Statistica sinica 7.4 (1997): 815-840.