High Throughput Target Identification. Stan Young, NISS; Doug Hawkins, U Minnesota; Christophe Lambert, Golden Helix. Machine Learning, Statistics, and Discovery, 25 June 03

Page 1: Micro Array Literature

High Throughput Target Identification

Stan Young, NISS

Doug Hawkins, U Minnesota

Christophe Lambert, Golden Helix

Machine Learning, Statistics, and Discovery

25 June 03

Page 2: Micro Array Literature

Micro Array Literature

Publication Year   All Journals   PNAS
1992                          0      0
1993                          0      0
1994                          0      0
1995                          4      0
1996                          3      1
1997                          8      2
1998                         37      1
1999                        134      8
2000                        409     34
2001                        773     46

Page 3: Micro Array Literature

Guilt by Association: You are known by the company you keep.

Page 4: Micro Array Literature

Data Matrix

Goal: Associations over the genes.

Guilty Gene

Genes

Tissues

Page 5: Micro Array Literature

Goals

1. Associations.

2. Deep associations – beyond 1st level correlations.

3. Uncover multiple mechanisms.

Page 6: Micro Array Literature

Problems

1. n << p

2. Strong correlations.

3. Missing values.

4. Non-normal distributions.

5. Outliers.

6. Multiple testing.

Page 7: Micro Array Literature

Technical Approach

1. Recursive partitioning.

2. Resampling-based, adjusted p-values.

3. Multiple trees.

Page 8: Micro Array Literature

Recursive Partitioning

Tasks

1. Create classes.

2. How to split.

3. How to stop.

Page 9: Micro Array Literature

Differences:

Recursive Partitioning
• Top-down analysis.
• Can use any type of descriptor.
• Uses biological activities to determine which features matter.
• Produces a classification tree for interpretation and prediction.
• Big N is not a problem!
• Missing values are ok.
• Multiple trees, big p is ok.

Clustering
• Often bottom-up.
• Uses “gestalt” matching.
• Requires an external method for determining the right feature set.
• Difficult to interpret or use for prediction.
• Big N is a severe problem!!

Page 10: Micro Array Literature

Forming Classes, Categories, Groups

Profession          Av. Income
Baseball Players    1.5M
Football Players    1.2M
Doctors             .8M
Dentists            .5M
Lawyers             .23M
Professors          .09M
. . .               . . .

Page 11: Micro Array Literature

Forming Classes from “Continuous” Descriptor

[Figure: number line for the descriptor, with ticks from -3 to 6]

How many “cuts” and where to make them?
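For a single cut, one simple answer to the question above is to try every boundary between adjacent sorted descriptor values and keep the one with the largest two-sample t statistic. The sketch below is illustrative only: `pooled_t`, `best_cut`, `min_size`, and the toy data are hypothetical, and the pooled-variance t statistic is assumed as the splitting criterion because the deck's splitting slide uses a t-test.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t statistic using a pooled standard deviation."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if sp2 == 0:
        return math.inf if mean(a) != mean(b) else 0.0
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

def best_cut(x, y, min_size=2):
    """Return (cut, |t|) for the single cut on descriptor x that best
    separates response y. min_size must be at least 2 so that each side
    has a sample variance."""
    pairs = sorted(zip(x, y))
    best = (None, 0.0)
    for i in range(min_size, len(pairs) - min_size + 1):
        lo_x, hi_x = pairs[i - 1][0], pairs[i][0]
        if lo_x == hi_x:  # cannot cut between tied descriptor values
            continue
        left = [r for _, r in pairs[:i]]
        right = [r for _, r in pairs[i:]]
        t = abs(pooled_t(left, right))
        if t > best[1]:
            best = ((lo_x + hi_x) / 2, t)
    return best

# Toy data, invented for illustration: the response jumps once x exceeds 2.
x = [-3, -2, -1, 0, 1, 2, 3, 4, 5, 6]
y = [0.1, 0.3, 0.2, 0.0, 0.4, 0.2, 2.5, 2.7, 2.4, 2.6]
cut, t_abs = best_cut(x, y)
```

Applying the same scan again inside each resulting class gives further cuts; the number of cuts is then decided by whether each new cut is still significant after adjustment.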

Page 12: Micro Array Literature

Splitting : t-test

Parent node: n = 1650, ave = 0.34, sd = 0.81

Child nodes:
n = 1614, ave = 0.29, sd = 0.73
n = 36, ave = 2.60, sd = 0.9

$t = \frac{\text{Signal}}{\text{Noise}} = \frac{2.60 - 0.29}{0.734 \sqrt{\frac{1}{36} + \frac{1}{1614}}} = 18.68$

TT: NN-CCNN-CC

rP = 2.03E-70

aP = 1.30E-66
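The arithmetic on this slide can be checked from the node summaries alone. The sketch below assumes the 0.734 in the denominator is the pooled standard deviation of the two child nodes; the function names are hypothetical.

```python
import math

def pooled_sd(n_a, sd_a, n_b, sd_b):
    """Pooled standard deviation of two groups from their sizes and sds."""
    return math.sqrt(((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2))

def t_from_summary(mean_a, mean_b, sd_pooled, n_a, n_b):
    """Signal (difference of means) over noise (pooled sd times sqrt(1/n_a + 1/n_b))."""
    return (mean_a - mean_b) / (sd_pooled * math.sqrt(1 / n_a + 1 / n_b))

# Child-node summaries from the slide.
sp = pooled_sd(36, 0.9, 1614, 0.73)               # close to the slide's 0.734
t = t_from_summary(2.60, 0.29, 0.734, 36, 1614)   # close to the slide's 18.68
```

The pooled value reproduces the slide's 0.734 to three decimals, which supports reading the denominator as a pooled standard deviation.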

Page 13: Micro Array Literature

Splitting : F-test

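Parent node: n = 1650, ave = 0.34, sd = 0.81

Child nodes:
n = 1553, ave = 0.21, sd = 0.73
n = 36, ave = 2.60, sd = 0.9
n = 61, ave = 1.29, sd = 0.83

$F = \frac{\text{Signal}}{\text{Noise}} = \frac{\text{Among Var}}{\text{Within Var}} = \frac{\sum_i \sum_j (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2 / df_1}{\sum_i \sum_j (X_{ij} - \bar{X}_{i\cdot})^2 / df_2}$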
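The among/within variance ratio for a three-way split can also be computed from the child-node summaries. This is a minimal sketch under the usual one-way ANOVA definitions (df1 = number of classes minus 1, df2 = total n minus number of classes); `f_from_summary` is a hypothetical name, and because the slide's means and sds are rounded the result is only approximate.

```python
def f_from_summary(groups):
    """groups: list of (n, mean, sd), one tuple per child node.
    Returns (F, df1, df2) for a one-way analysis of variance."""
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    # Among-class sum of squares: each class mean against the grand mean.
    ss_among = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-class sum of squares recovered from each class sd.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df1 = len(groups) - 1
    df2 = n_total - len(groups)
    return (ss_among / df1) / (ss_within / df2), df1, df2

# Child-node summaries from the slide.
children = [(1553, 0.21, 0.73), (36, 2.60, 0.9), (61, 1.29, 0.83)]
f_stat, df1, df2 = f_from_summary(children)
```

With two classes this F is the square of the pooled two-sample t, so the t-test split is the special case of one cut.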

Page 14: Micro Array Literature

How to Stop

Examine each current terminal node.

Stop if no variable/class has a

significant split, multiplicity adjusted.

Page 15: Micro Array Literature

Levels of Multiple Testing

1. Raw p-value.

2. Adjust for class formation, segmentation.

3. Adjust for multiple predictors.

4. Adjust for multiple splits in the tree.

5. Adjust for multiple trees.

Page 16: Micro Array Literature

Understanding observations

NB: Splitting variables govern the process, linked to response variable.

Multiple Mechanisms

Conditionally important descriptors.

Page 17: Micro Array Literature

Multiple Mechanisms

Page 18: Micro Array Literature

Reality: Example Data

60 Tissues

1453 Genes

Gene 510 is the “guilty” gene, the Y.

Page 19: Micro Array Literature

1st Split of Gene 510 (Guilty Gene)

Page 20: Micro Array Literature

Split Selection

14 splitters with adjusted p-value < 0.05

Page 21: Micro Array Literature

Histogram

Non-normal, hence resampling p-values make sense.

Page 22: Micro Array Literature

Resampling-based Adjusted p-value
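The deck does not say which resampling scheme is used, so the following is only a sketch of one common choice: a single-step max-statistic permutation adjustment. The response is shuffled many times, and each predictor's adjusted p-value is the share of shuffles in which the largest statistic over all predictors reaches that predictor's observed statistic. Using absolute correlation as the statistic, and the names `abs_corr` and `max_stat_adjusted_p`, are assumptions made for brevity.

```python
import random
from statistics import mean

def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def max_stat_adjusted_p(predictors, y, n_perm=999, seed=0):
    """Permutation-based adjusted p-value for each predictor.
    Comparing against the maximum over all predictors accounts for
    having searched many predictors for the best one."""
    rng = random.Random(seed)
    observed = [abs_corr(x, y) for x in predictors]
    exceed = [0] * len(predictors)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any link between predictors and response
        max_stat = max(abs_corr(x, y_perm) for x in predictors)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    # Add-one correction keeps every p-value strictly above zero.
    return [(c + 1) / (n_perm + 1) for c in exceed]

# Toy data, invented for illustration: one real signal and five noise predictors.
rng = random.Random(1)
y = [rng.gauss(0, 1) for _ in range(30)]
signal = [v + rng.gauss(0, 0.3) for v in y]
noise = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(5)]
adj_p = max_stat_adjusted_p([signal] + noise, y)
```

Because the null distribution comes from the data themselves, no normality assumption is needed, and correlation among predictors is preserved in every shuffle.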

Page 23: Micro Array Literature

Single Tree RP Drawbacks

• Data greedy.

• Only one view of the data. May miss other mechanisms.

• Highly correlated variables may be obscured.

• Higher order interactions may be masked.

• No formal mechanisms for follow-up experimental design.

• Disposition of outliers is difficult.

Page 24: Micro Array Literature

Etc.

Multiple Trees, how and why?

Page 25: Micro Array Literature

How do you get multiple trees?

1. Bootstrap the sample, one tree per sample.

2. Randomize over valid splitters.

Etc.
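Both ways of getting multiple trees can be sketched in a few lines. The helper names, the 0.05 threshold as the definition of a "valid" splitter, and the adjusted p-values in the example are assumptions for illustration; the tree-growing step itself is not shown.

```python
import random

def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def pick_splitter(adjusted_p, rng, alpha=0.05):
    """Choose at random among splitters whose adjusted p-value is below alpha,
    instead of always taking the single most significant one."""
    valid = sorted(name for name, p in adjusted_p.items() if p < alpha)
    return rng.choice(valid) if valid else None

rng = random.Random(0)

# Option 1: one bootstrap sample of the 60 tissues per tree, 1000 trees.
samples = [bootstrap_indices(60, rng) for _ in range(1000)]

# Option 2: at a node, randomize over the valid splitters (invented p-values).
adjusted_p = {"G518": 0.001, "G790": 0.02, "G12": 0.40}
choice = pick_splitter(adjusted_p, rng)
```

Each tree would then be grown on the rows listed in its bootstrap sample, or with `pick_splitter` called at every node, so that different trees can follow different mechanisms.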

Page 26: Micro Array Literature

Random Tree Browsing, 1000 Trees.

Page 27: Micro Array Literature

Example Tree

Page 28: Micro Array Literature

1st Split

Page 29: Micro Array Literature

Example Tree, 2nd Split

Page 30: Micro Array Literature

Conclusion for Gene G510

If G518 < -0.56 and G790 < -1.46

then G510 = 1.10 +/- 0.30
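A path through the tree reads directly as a rule, which is what makes the result usable for prediction. As a small sketch (the function name is hypothetical, and the slide does not say whether 0.30 is a standard deviation or a standard error, so it is returned unchanged):

```python
def g510_rule(g518, g790):
    """Return (estimate, plus_minus) for G510 when a tissue falls in this
    leaf of the example tree, or None when the rule does not apply."""
    if g518 < -0.56 and g790 < -1.46:
        return 1.10, 0.30
    return None

in_leaf = g510_rule(g518=-1.0, g790=-2.0)   # both conditions hold
outside = g510_rule(g518=0.0, g790=-2.0)    # first condition fails
```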

Page 31: Micro Array Literature

Using Multiple Trees to Understand variables

• Which variables matter?

• How to rank variables in importance.

• Correlations.

• Synergistic variables.
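One simple way to approach these questions, once many trees exist, is to count how often each variable is used as a splitter and how often two variables appear in the same tree. The sketch below assumes each tree has been reduced to a list of its splitter names; that representation, `splitter_usage`, and gene G12 are hypothetical, and co-occurrence is only a crude proxy for synergy, not necessarily how the deck's matrix is computed.

```python
from collections import Counter
from itertools import combinations

def splitter_usage(trees):
    """trees: list of lists of splitter names, one list per tree.
    Returns (usage, pairs): the number of trees using each variable,
    and the number of trees in which each pair of variables co-occurs."""
    usage = Counter()
    pairs = Counter()
    for splitters in trees:
        unique = sorted(set(splitters))   # count each variable once per tree
        usage.update(unique)
        pairs.update(combinations(unique, 2))
    return usage, pairs

# Invented example: four small trees described by their splitters.
trees = [["G518", "G790"], ["G518", "G12"], ["G790", "G518"], ["G12"]]
usage, pairs = splitter_usage(trees)
ranking = usage.most_common()   # variables ordered by number of trees using them
```

Variables that rank high are candidates for "which variables matter", while pairs that co-occur often are candidates for predictors that work together.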

Page 32: Micro Array Literature

Correlation Interaction Matrix

Red=Syn.

Page 33: Micro Array Literature

Summary

• Review recursive partitioning.

• Demonstrated multiple tree RP’s capabilities

– Find associated genes

– Group correlated predictors (genes)

– Synergistic predictors (genes that predict together)

• Used to understand a complex data set.

Page 34: Micro Array Literature

Needed research

• Real data sets with known answers.

• Benchmarking.

• Linking to gene annotations.

• Scale (1,000*10,000).

• Multiple testing in complex data sets.

• Good visualization methods.

• Outlier detection for large data sets.

• Missing values. (see NISS paper 123)

Page 35: Micro Array Literature

Teams

NC State University: Jacqueline Hughes-Oliver, Katja Rimlinger

U Waterloo: Will Welch, Hugh Chipman, Marcia Wang, Yan Yuan

U. Minnesota: Douglas Hawkins

NISS: Alan Karr (Consider post docs)

GSK: Lei Zhu, Ray Lam

Page 36: Micro Array Literature

References/Contact

1. www.goldenhelix.com.

2. www.recursive-partitioning.com.

3. www.niss.org, papers 122 and 123.

4. [email protected]

5. GSK patent.

Page 37: Micro Array Literature

Questions