Using Random Forests to explore a complex Metabolomic data set Susan Simmons Department of Mathematics and Statistics University of North Carolina Wilmington


Page 1: Using Random Forests to explore a complex Metabolomic data set

Using Random Forests to explore a complex Metabolomic data set

Susan Simmons
Department of Mathematics and Statistics
University of North Carolina Wilmington

Page 2: Using Random Forests to explore a complex Metabolomic data set

Collaborators

• Dr. David Banks (Duke)

• Dr. Jacqueline Hughes-Oliver (NC State)

• Dr. Stan Young (NISS)

• Dr. Young Truong (UNC)

• Dr. Chris Beecher (Metabolon)

• Dr. Xiaodong Lin (SAMSI)

Page 3: Using Random Forests to explore a complex Metabolomic data set
Page 4: Using Random Forests to explore a complex Metabolomic data set

Large data sets

• Examples

  – Walmart
    • 20 million transactions daily

  – AT&T
    • 100 million customers and carries 200 million calls a day on its long-distance network

  – Mobil Oil
    • Over 100 terabytes of data with oil exploration

  – Human genome
    • Gigabytes of data

  – IRA

Page 5: Using Random Forests to explore a complex Metabolomic data set

Dimensionality

Page 6: Using Random Forests to explore a complex Metabolomic data set

Dimensionality

• 3,000 metabolites

• 40,000 genes

• 100,000 chemicals

• Try to find the signal in these data sets (and not the noise): data mining

• Examples of data mining techniques: pattern recognition, expert systems, genetic algorithms, neural networks, random forests

Page 7: Using Random Forests to explore a complex Metabolomic data set

Today’s talk

• Focus on classification (supervised learning…use a response to guide the learning process)

• Response is categorical (Each observation belongs to a “class”)

• Interested in relationship between variables and the response

• Short, fat data (instead of long, skinny data)

Page 8: Using Random Forests to explore a complex Metabolomic data set

Long, skinny data

 X   Y   Z
 2   8   9
 3   4   4
 7   5  46
 8   7   3
 4  56  35
 6  58  63
12   9   3
14   2  35
24   1  45
 2   7   4
13  78  25
14  56  34
18   6  89
35   8  56

Page 9: Using Random Forests to explore a complex Metabolomic data set

Short, fat data

n<p problem

 X   Y   Z   S   T   V   M   N   R   Q   L   H   G   K   B   C   W
 4  36   5   8  30   4  35   7   3  78   9   3   1  40   2   5  34
 6   7  34   6   7  67   8  89   8   4   2   6   5   9   8  67   3
 7  46   2   4   5   6   7  58   9   7   9  50   4  45   7   8  45
 8   4   5  65  57  57  42   2   7  23   4   6  76   8   0  56  90

Page 10: Using Random Forests to explore a complex Metabolomic data set

Random Forests

• Developed by Leo Breiman (Berkeley) and Adele Cutler (Utah State)

• Can handle the n<p problem

• Random forests are comparable in accuracy to support vector machines

• Random forests are a combination of tree predictors

Page 11: Using Random Forests to explore a complex Metabolomic data set

Constructing a tree

Observation  Gender  Height (inches)
1            F       60
2            F       66
3            M       68
4            F       70
5            F       66
6            M       72
7            F       64
8            M       67

Page 12: Using Random Forests to explore a complex Metabolomic data set

Tree for previous data set

All observations (N=8)

  Height ≤ 66 (N=4)
    Male: N=0
    Female: N=4

  Height > 66 (N=4)
    Male: N=3
    Female: N=1
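The split shown above can be reproduced with a short search over candidate thresholds. This is a minimal sketch, assuming Gini impurity as the splitting criterion (the slides do not say which criterion was used); the function names `gini` and `best_split` are illustrative only.

```python
# Toy data from the "Constructing a tree" slide: (gender, height in inches).
data = [("F", 60), ("F", 66), ("M", 68), ("F", 70),
        ("F", 66), ("M", 72), ("F", 64), ("M", 67)]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows):
    """Try every observed height t as a threshold (left: height <= t)
    and return (weighted impurity, t) for the lowest-impurity split."""
    best = None
    for t in sorted({h for _, h in rows}):
        left = [g for g, h in rows if h <= t]
        right = [g for g, h in rows if h > t]
        if not left or not right:
            continue  # skip thresholds that leave one side empty
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, t)
    return best

score, threshold = best_split(data)
left = [g for g, h in data if h <= threshold]
right = [g for g, h in data if h > threshold]
print(threshold, left, right)
```

Under this assumption the chosen threshold is 66 inches, with the two observations at exactly 66 inches falling in the left (all-female) node.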

Page 13: Using Random Forests to explore a complex Metabolomic data set

Random Forest

• First, the number of trees to be grown must be specified.

• Also, the number of variables randomly selected at each node must be specified (m).

• Each tree is constructed in the following manner:

1. At each node, randomly select m variables to split on.

Page 14: Using Random Forests to explore a complex Metabolomic data set

Random Forest

2. The node is split using the best split among the selected variables.

3. This process is continued until each node has only one observation, or all the observations belong to the same class.

• Do this for each tree in the “forest”
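The three steps can be sketched in plain Python. This is an illustrative sketch, not the Breiman and Cutler implementation: it assumes numeric predictors, string class labels, and Gini impurity as the split criterion, and it leaves out the bootstrap sampling the talk introduces later. All function names are hypothetical. One simplification is labeled in the code: if none of the m selected variables can separate an impure node, the node becomes a majority-vote leaf.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(X, y, m, rng):
    """Grow one unpruned tree. X: list of feature rows, y: string class labels."""
    # Step 3 stopping rule: the node is pure (covers the one-observation case).
    if len(set(y)) == 1:
        return y[0]
    # Step 1: randomly select m variables at this node.
    candidates = rng.sample(range(len(X[0])), m)
    # Step 2: find the best split among the selected variables.
    best = None
    for j in candidates:
        # Dropping the largest value keeps both children non-empty.
        for t in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        # Simplification: selected variables are constant here, so stop early.
        return Counter(y).most_common(1)[0][0]
    _, j, t, left, right = best
    return (j, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], m, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], m, rng))

def predict(tree, row):
    """Walk from the root to a leaf; internal nodes are (variable, threshold, left, right)."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

def grow_forest(X, y, n_trees, m, seed=0):
    """Repeat the tree-growing procedure for each tree in the forest."""
    rng = random.Random(seed)
    return [grow_tree(X, y, m, rng) for _ in range(n_trees)]

def forest_predict(forest, row):
    """Each tree votes; the most common class wins."""
    return Counter(predict(tree, row) for tree in forest).most_common(1)[0][0]

# Hypothetical two-variable data set with two well-separated classes.
X = [[1, 5], [2, 6], [3, 7], [8, 1], [9, 2], [10, 3]]
y = ["a", "a", "a", "b", "b", "b"]
forest = grow_forest(X, y, n_trees=5, m=1)
print([forest_predict(forest, row) for row in X])
```

The two quantities the slides say must be specified up front appear here as `n_trees` and `m`.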

Page 15: Using Random Forests to explore a complex Metabolomic data set

Example: Cereal Data

Page 16: Using Random Forests to explore a complex Metabolomic data set

N=70 (40 G, 30 K)

  Calories < 100 (2 G, 15 K)
    Fat < 1: 15 K
    Fat > 1: 2 G

  Calories ≥ 100 (38 G, 15 K)
    Carbo < 12: 15 K
    Carbo > 12: 38 G

Page 17: Using Random Forests to explore a complex Metabolomic data set

Random Forest

• Another important feature is that each tree is created using a bootstrap sample of the learning set.

• Each bootstrap sample contains approximately 2/3 of the distinct observations (thus approximately 1/3 is left out).

• For each observation, we can use the trees built without that observation to get an idea of the error rate (each such tree will “vote” on which class the observation belongs to).

• Example
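The "approximately 2/3" figure can be checked with a short simulation. When n observations are drawn with replacement n times, the chance that a given observation is never drawn is (1 - 1/n)^n, which approaches 1/e ≈ 0.368, so roughly 63% of the distinct observations land in each bootstrap sample. The sketch below is only a sanity check; the sample size and seed are arbitrary choices.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement from 0..n-1 (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000

in_bag = set(bootstrap_indices(n, rng))   # distinct observations that were drawn
out_of_bag = set(range(n)) - in_bag       # observations left out of this sample

in_bag_fraction = len(in_bag) / n
print(f"in bag: {in_bag_fraction:.3f}, out of bag: {len(out_of_bag) / n:.3f}")
```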

Page 18: Using Random Forests to explore a complex Metabolomic data set

N=70 (40 G, 30 K)

  Calories < 100 (2 G, 15 K)
    Fat < 1: 15 K
    Fat > 1: 2 G

  Calories ≥ 100 (38 G, 15 K)
    Carbo < 12: 15 K
    Carbo > 12: 38 G

Observation withheld from creating this tree

Calories  Fat  Carbo  Mfr
98        2    10     K
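To see how a single tree "votes" on a withheld observation, the tree on this slide can be written as nested conditions and applied to the row above. This is a sketch of the tree as transcribed; the slide does not say how a fat value of exactly 1 is handled, so treating it as the "Fat > 1" branch is an assumption, and `classify` is an illustrative name.

```python
def classify(calories, fat, carbo):
    """Follow the cereal tree from the slide and return the predicted manufacturer."""
    if calories < 100:
        # Low-calorie branch (2 G, 15 K) splits on fat.
        return "K" if fat < 1 else "G"   # assumption: fat == 1 goes with "Fat > 1"
    # Remaining branch (38 G, 15 K) splits on carbohydrates.
    return "K" if carbo < 12 else "G"

# The observation that was withheld when this tree was built.
withheld = {"calories": 98, "fat": 2, "carbo": 10, "mfr": "K"}

vote = classify(withheld["calories"], withheld["fat"], withheld["carbo"])
is_error = vote != withheld["mfr"]
print(vote, is_error)
```

With 98 calories and 2 g of fat, the observation follows the low-calorie branch and then the "Fat > 1" leaf, so this tree votes G while the true manufacturer is K. Collecting such votes over every tree that left the observation out is what produces the error estimate.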

Page 19: Using Random Forests to explore a complex Metabolomic data set

Random Forest

• This gives us an “out of bag” error rate

• Random forests also give us an idea of which variables are important for classifying individuals.

• Also gives information about outliers

Page 20: Using Random Forests to explore a complex Metabolomic data set

The era of the “omics” sciences

Page 21: Using Random Forests to explore a complex Metabolomic data set

Just a few of the “omics” sciences

• Genomics
• Transcriptomics
• Proteomics
• Metabolomics
• Phenomics
• Toxicogenomics
• Phylomics
• Foldomics
• Kinomics
• Interactomics
• Behavioromics
• Variomics
• Pharmacogenomics

Page 22: Using Random Forests to explore a complex Metabolomic data set

Functional Genomics

Genomics

Transcriptomics

Proteomics

Metabolomics

Page 23: Using Random Forests to explore a complex Metabolomic data set

Metabolomics

• Metabolites are all the small molecules in a cell (e.g. ATP, sugar, pyruvate, urea)

• 3,000 metabolites in the human body (compared to 35,000 genes and approximately 100,000 proteins)

• Most direct measure of cell physiology

• Uses GC/MS and LC/MS to obtain measurements

Page 24: Using Random Forests to explore a complex Metabolomic data set

Data

• Currently only have GC/MS information

• Missing values are very informative (below detection limits)

• Imputed data using uniform random variables from 0 to minimum value

• 105 metabolites

• 58 individuals (42 “disease 1”, 6 “disease 2”, and 10 “controls”)
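The imputation step can be sketched as follows. The slides say only that missing values were replaced by uniform random draws between 0 and the minimum value; this sketch assumes the minimum is taken separately for each metabolite, that missing entries are coded as `None`, and the measurements shown are hypothetical.

```python
import random

def impute_below_detection(values, rng):
    """Replace missing entries (None) with draws from Uniform(0, smallest observed value)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to define the upper bound")
    upper = min(observed)  # assumption: per-metabolite minimum
    return [rng.uniform(0, upper) if v is None else v for v in values]

rng = random.Random(1)  # fixed seed so the run is repeatable
metabolite = [12.4, None, 3.1, 7.8, None, 5.0]  # hypothetical measurements
filled = impute_below_detection(metabolite, rng)
print(filled)
```

Drawing below the smallest observed value keeps the information that these measurements fell under the detection limit, rather than treating them as ordinary mid-range values.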

Page 25: Using Random Forests to explore a complex Metabolomic data set

Confusion matrix

      1   2   3
1    40   1   8
2     0   5   1
3     2   0   1

OOB error = 20.69%
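The reported error rate follows directly from the matrix: the off-diagonal counts are misclassifications. The sketch below redoes the arithmetic. The column totals come out as 42, 6, and 10, which match the class sizes on the Data slide, so it seems likely that columns are true classes and rows are predicted classes; that orientation is an inference, not something the slide states.

```python
# Confusion matrix from the slide; rows and columns are classes 1, 2, 3.
confusion = [
    [40, 1, 8],
    [0, 5, 1],
    [2, 0, 1],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(3))   # diagonal entries
oob_error = (total - correct) / total
column_totals = [sum(row[j] for row in confusion) for j in range(3)]

print(f"OOB error = {oob_error:.2%}")
print("column totals:", column_totals)
```

Under that reading, the overall rate hides a large imbalance: only 1 of the 10 observations in class 3 is classified correctly.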

Page 26: Using Random Forests to explore a complex Metabolomic data set

Outlier

Page 27: Using Random Forests to explore a complex Metabolomic data set

Variable Importance

Page 28: Using Random Forests to explore a complex Metabolomic data set

Visual Data

• Dostat

Page 29: Using Random Forests to explore a complex Metabolomic data set

Conclusions

• Random forests, support vector machines, and neural networks are some of the newest algorithms for understanding large datasets.

• There is still much more to be done.

Page 30: Using Random Forests to explore a complex Metabolomic data set

Thank you