
Visualization and Machine Learning for exploratory data analysis

Xiaochun Li (1,2)

1 Division of Biostatistics, Indiana University School of Medicine
2 Regenstrief Institute

May 2, 2008 / CCBB Journal Club

Outline

1 Introduction

2 Visualization
  - As Is
  - Simple Summarization
  - More Advanced Methods

3 Machine Learning
  - Supervised Learning
  - Unsupervised Learning
  - Random Forests
  - SVM

Introduction

In mining large-scale datasets, methods are needed to

- search for patterns, e.g., biologically important gene sets or samples
- present data structure succinctly

Both are essential in the analysis.

Objective: Visualization

An essential part of exploratory data analysis, and of reporting the results.

- plot data as is
- plot data after simple summarization
- plot data based on more advanced methods:
  clustering, PCA (principal component analysis), MDS (multidimensional scaling), silhouette, randomForest, ...


Plot data as is: Quality Inspection

[Figure: an Affymetrix chip image. Some images may have obvious local contaminations.]

Plot data as is: Quality Inspection

[Figure: four 16 x 24 plate images, panels "Ins+, white", "Ins-, white", "Ins+, black", "Ins-, black". An RNAi experiment with white and black plates, insulin stimulated +/-.]

Plot data as is: R tools

- image or heatmap for any chip arrays
- for cell-based assays, could also use plotPlate in the R package prada
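As a rough sketch of these tools (the plate matrix below is a hypothetical stand-in for real well readouts, and the plotPlate call is indicative only):

    ## Minimal sketch: view one 16 x 24 plate of intensities as an image.
    plate <- matrix(rnorm(16 * 24), nrow = 16, ncol = 24)  # stand-in data
    image(t(plate)[, nrow(plate):1], col = gray.colors(64),
          axes = FALSE, main = "Plate intensities")
    ## With Bioconductor's prada, a well-level plate plot looks like
    ##   library(prada)
    ##   plotPlate(as.vector(t(plate)), nrow = 16, ncol = 24)
    ## (argument details may differ across prada versions).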


Simple Summarization: Along Genomic Coordinates

[Figure: "Cumulative expression levels by genes in chromosome 21, scaling method: none"; x axis: representative genes along chromosome 21, y axis: cumulative expression levels.]

Cumulative expression profiles along chromosome 21 for samples from 10 children with trisomy 21 and a transient myeloid disorder, colored in red, and children with different subtypes of acute myeloid leukemia (M7), colored in blue.

Simple Summarization: Along Genomic Coordinates

- The previous wiggle plot was produced using alongChrom of the R package geneplotter
- Could plot just a segment of a chromosome of interest
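The underlying idea can be sketched in a few lines of base R (a hedged illustration of the cumulative-profile plot, not the alongChrom interface; expr and pos are hypothetical stand-ins):

    ## Sketch: cumulative expression along ordered genes, one curve per sample.
    expr <- matrix(rexp(50 * 20), nrow = 50)      # stand-in: 50 genes x 20 samples
    pos  <- runif(50)                             # stand-in genomic positions
    cum  <- apply(expr[order(pos), ], 2, cumsum)  # accumulate in genomic order
    matplot(cum, type = "l", lty = 1,
            col = c(rep("red", 10), rep("blue", 10)),
            xlab = "Genes along chromosome 21",
            ylab = "Cumulative expression levels")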


MASS Spec Example: "Latin Square" Design for B-F

    Group   Cytochrome c   Ubiquitin   Lysozyme   Myoglobin   Trypsinogen
    A            0             0           0          0            0
    B            0             1           2          5           10
    C            1             2           5         10            0
    D            2             5          10          0            1
    E            5            10           0          1            2
    F           10             0           1          2            5
    G           10            10          10         10           10

Design and the protein concentrations: one concentration unit is 1 fmol/uL for Ubiquitin, 10 fmol/uL for Cytochrome c, Lysozyme and Myoglobin, and 100 fmol/uL for Trypsinogen.

Mass Spec: Example

[Figure: one spectrum from group A; x axis: m/z from 0 to 1e+05, y axis: intensity from 0 to 40.]

Mass Spec: MDS

[Figure: 3-D scatter plot of the first, second and third MDS coordinates.]

Classical MDS scaling results of 39 spectra from groups A, D and G. Circles represent group A, squares group D and triangles group G. Each group has 13 spectra.
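A sketch of how such a plot can be produced (spec is a hypothetical matrix holding the 39 spectra as columns):

    ## Classical MDS of pairwise Euclidean distances, shown in 3-D.
    spec <- matrix(rnorm(200 * 39), ncol = 39)   # stand-in spectra
    grp  <- rep(c("A", "D", "G"), each = 13)
    mds  <- cmdscale(dist(t(spec)), k = 3)       # 39 x 3 coordinates
    library(scatterplot3d)
    scatterplot3d(mds, pch = c(A = 1, D = 0, G = 2)[grp],
                  xlab = "first coordinate", ylab = "second coordinate",
                  zlab = "third coordinate")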

MASS Spec: pairs plot

[Figure: scatter-plot matrix of spec 1 through spec 4. Spec 1 correlates 0.66, 0.60 and 0.59 with spec 2-4, which correlate 0.97-0.99 among themselves.]

The outlier in group A and 3 other spectra from the same group are plotted against each other. The lower left panels show the Pearson correlation coefficients of pairs of spectra.
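A sketch of the plot (adapted from the panel example in ?pairs; spec4 is a hypothetical matrix of the four spectra):

    ## Scatter-plot matrix with Pearson correlations in the lower panels.
    panel.cor <- function(x, y, ...) {
      old <- par(usr = c(0, 1, 0, 1)); on.exit(par(old))
      text(0.5, 0.5, format(cor(x, y), digits = 2))
    }
    spec4 <- matrix(rnorm(200 * 4), ncol = 4,
                    dimnames = list(NULL, paste("spec", 1:4)))
    pairs(spec4, lower.panel = panel.cor)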

MASS Spec: pairs plot

[Figure: the same scatter-plot matrix for four group-A spectra; here all pairwise Pearson correlations are 0.96-0.99.]

The outlier in group A and 3 other spectra from the same group are plotted against each other. The lower left panels show the Pearson correlation coefficients of pairs of spectra.

Mass Spec: MDS, 3-D

[Figure: 3-D scatter plot of the first, second and third MDS coordinates.]

Classical MDS scaling results of 39 spectra from groups A, D and G. Circles represent group A, squares group D and triangles group G. Each group has 13 spectra.

Silhouette plot: visualize clustering results

[Figure: cluster dendrogram, hclust(*, "complete") on d.s.nocut, heights 0-50; the 13 group-A leaves separate while the D and G leaves intermix.]

Dendrogram of clustering results of 39 spectra from groups A, D and G - before and after the low molecular range is removed.

Silhouette plot: visualize clustering results

[Figure: cluster dendrogram, hclust(*, "complete") on d.s.cut, heights 0-400; the A, D and G leaves now separate cleanly by group.]

Dendrogram of clustering results of 39 spectra from groups A, D and G - before and after the low molecular range is removed.

Silhouette plot: visualize clustering results

[Figure: silhouette plot, whole spectrum; silhouette width s_i on 0.0-1.0.]

    n = 39, 3 clusters C_j
    j : n_j | ave_{i in C_j} s_i
    1 : 17 | 0.67
    2 : 16 | 0.48
    3 :  6 | 0.56
    Average silhouette width: 0.57

Silhouette plot of clustering results of 39 spectra from groups A, D and G - before and after the low molecular range is removed.

Silhouette plot: visualize clustering results

[Figure: silhouette plot, mz < 1000 cut; silhouette width s_i on 0.0-1.0.]

    n = 39, 3 clusters C_j
    j : n_j | ave_{i in C_j} s_i
    1 : 13 | 0.82
    2 : 13 | 0.60
    3 : 13 | 0.53
    Average silhouette width: 0.65

Silhouette plot of clustering results of 39 spectra from groups A, D and G - before and after the low molecular range is removed.

Silhouette plot: silhouette width

For each observation i, the silhouette width s_i is defined as follows:

- a_i = average dissimilarity between i and all other points of the cluster to which i belongs
- for all other clusters C, put d(i, C) = average dissimilarity of i to all observations of C
- b_i = min_C d(i, C), which can be seen as the dissimilarity between i and its "neighbor" cluster, i.e., the nearest one to which it does not belong
- s_i = (b_i - a_i) / max(a_i, b_i)
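In R, the clustering and the silhouette plot of the previous slides can be sketched as follows (spec is again a hypothetical matrix of spectra):

    ## Complete-linkage clustering and silhouette widths for a 3-group cut.
    library(cluster)
    spec <- matrix(rnorm(200 * 39), ncol = 39)   # stand-in: 39 spectra
    d    <- dist(t(spec))                        # pairwise dissimilarities
    hc   <- hclust(d, method = "complete")
    plot(hc, main = "Cluster Dendrogram")
    si <- silhouette(cutree(hc, k = 3), d)       # s_i for each spectrum
    plot(si)                                     # per-cluster average widths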

Visualization: R tools

- classical MDS: cmdscale
- 2-D, 3-D scatter plot: plot and the R package scatterplot3d
- 2-D scatter plot matrix: pairs
- silhouette plot: silhouette

Machine Learning

Machine Learning: computational and statistical approaches to extract important patterns and trends hidden in large data sets.

- Supervised: predict outcome y based on X, a number of inputs (variables). E.g., predict the class labels "tumor" or "normal" based on gene expression.
- Unsupervised: no y; describe the associations and patterns among X. E.g., which subset of genes has similar expression? Which subgroup of patients has similar gene expression profiles?



Supervised Learning

- linear model
- nearest neighbor (k-nn)
- LDA (Linear Discriminant Analysis): same covariance Σ across classes
- LDA variants: QDA (class-specific Σ_k), DLDA (Σ is diagonal), RDA (regularized: uses αΣ + (1 - α)I)
- SVM
- randomForest
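Two of these in R, as a sketch on stand-in data (X and y are hypothetical):

    ## LDA (pooled covariance) and k-nearest neighbors.
    library(MASS)
    library(class)
    X <- matrix(rnorm(60 * 5), ncol = 5)                  # stand-in inputs
    y <- factor(rep(c("tumor", "normal"), each = 30))     # stand-in labels
    fit      <- lda(X, grouping = y)                      # LDA
    pred.lda <- predict(fit, X)$class
    pred.knn <- knn(train = X, test = X, cl = y, k = 3)   # k-nn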


Unsupervised Learning

- Clustering
- PCA (Principal component analysis)
- MDS (Multidimensional scaling); classical MDS using Euclidean distance = PCA
- K-means
- SOM (Self-organizing maps)
- Unsupervised as Supervised Learning

Unsupervised as Supervised Learning: through data augmentation

Let g(x) be the unknown density to be estimated, and g_0(x) be a specified reference density.

- x_1, x_2, ..., x_n ~ i.i.d. g(x); assign class label Y = 1
- x_{n+1}, x_{n+2}, ..., x_{2n} ~ i.i.d. g_0(x); assign class label Y = 0
- the combined sample x_1, x_2, ..., x_{2n} ~ i.i.d. (g(x) + g_0(x))/2
- µ(x) ≡ E(Y | x) = [g(x)/g_0(x)] / [1 + g(x)/g_0(x)] can be estimated by supervised learning on the combined sample (y_1, x_1), (y_2, x_2), ..., (y_{2n}, x_{2n})
- g(x) = g_0(x) µ(x) / (1 - µ(x))

E.g., using this technique with randomForest.
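A sketch of the trick with randomForest (all data here are stand-ins; g_0 is taken to be uniform over the range of the observed sample):

    ## Estimate mu(x) = P(Y = 1 | x) by classifying real vs. reference
    ## points, then invert to get g(x) = g0(x) * mu(x) / (1 - mu(x)).
    library(randomForest)
    x  <- matrix(rnorm(200), ncol = 2)                         # sample from g
    x0 <- apply(x, 2, function(v) runif(100, min(v), max(v)))  # sample from g0
    y  <- factor(rep(c(1, 0), each = 100))
    rf <- randomForest(rbind(x, x0), y)
    mu <- predict(rf, x, type = "prob")[, "1"]                 # estimated mu(x)
    g0 <- 1 / prod(apply(x, 2, function(v) diff(range(v))))    # uniform density
    g  <- g0 * mu / (1 - mu)     # density estimate (cap mu away from 1 in practice)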



What are Random Forests

Random forests are a combination of tree predictors, where each tree depends on one of the i.i.d. random vectors {θ_k}.

Example - Bagging (bootstrap aggregation):

- bootstrap samples are drawn from the training set, where θ_k is the vector of counts in n boxes resulting from sampling with replacement
- a tree is grown from each bootstrap sample
- classes are assigned by majority vote.
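Bagging can be sketched directly with rpart trees (df is a hypothetical training set; each bootstrap draw is one realization of θ_k):

    ## Bagging: grow one tree per bootstrap sample, classify by majority vote.
    library(rpart)
    df <- data.frame(y  = factor(rep(c("a", "b"), each = 50)),  # stand-in data
                     x1 = rnorm(100), x2 = rnorm(100))
    votes <- sapply(1:25, function(k) {
      idx <- sample(nrow(df), replace = TRUE)       # bootstrap sample (theta_k)
      fit <- rpart(y ~ ., data = df[idx, ], method = "class")
      as.character(predict(fit, df, type = "class"))
    })
    bagged <- apply(votes, 1, function(v) names(which.max(table(v))))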

Motivation:

Improve prediction

- a single tree has poor accuracy for problems with many variables each carrying very little information, e.g., genomics data sets
- combining trees grown using random features can improve accuracy

Assess performance

- training error (the error rate on the training set) does not indicate performance on new data
- overfit → small training error but poor generalization error
- need data which were not used to grow a particular tree to assess the performance of the tree.

Strength and Correlation

For a given case (X, Y) and a given ensemble of classifiers:

- margin = (proportion of votes for the right class) - max over the other classes of (proportion of votes for that class)
- generalization error PE* = P_{X,Y}(margin < 0)
- s ≡ strength = E_{X,Y}(margin)
- ρ ≡ correlation = the mean correlation between any two trees in the forest
- Thm 1.2: the generalization error converges
- Thm 2.3: the generalization error is bounded: PE* ≤ ρ(1 - s^2)/s^2.

Random Forests Converge

Theorem 1.2. As the number of trees increases, the generalization error converges almost surely over the sequences {θ_k}.

- this is why random forests do not overfit as more trees are added; instead they tend to a limiting value of the generalization error.

Strategy: Minimize Correlation While Keeping Strength

Use randomly selected inputs or combinations of inputs at each node to grow each tree:

- Random Input Selection - Forest-RI: at each node, select at random F variables to split on; grow the tree to maximum size and do not prune.
- Random Feature Selection - Forest-RC: same idea as above, but with F features, i.e., "linear combinations of randomly selected L variables" with random coefficients runif(L, -1, 1) ⇒ further reduces correlation
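Roughly, in R (Forest-RI corresponds to randomForest's mtry argument; the Forest-RC part below only mimics the idea by pre-generating random features rather than drawing them per node, and all data are stand-ins):

    library(randomForest)
    X <- matrix(rnorm(100 * 10), ncol = 10)             # stand-in inputs
    y <- factor(sample(c(0, 1), 100, replace = TRUE))   # stand-in labels
    rf.ri <- randomForest(X, y, mtry = 3)               # Forest-RI with F = 3
    L <- 3
    w <- replicate(5, {                                 # 5 random features
      coef <- numeric(10)
      coef[sample(10, L)] <- runif(L, -1, 1)            # L random coefficients
      coef
    })
    rf.rc <- randomForest(cbind(X, X %*% w), y)         # Forest-RC-style inputs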

Gauging Performance

Bagging makes it possible to estimate the generalization error without a test set.

- Why: in any bootstrap sample, about 1/3 of the cases from the original training set are left out due to sampling with replacement: (1 - 1/n)^n ≈ e^(-1) ≈ 1/3.

Out-Of-Bag Estimates of Error, Strength and Correlation

- For each (x, y), aggregate the votes over trees grown without (x, y) - the out-of-bag classifier.
- Out-of-bag estimate of generalization error = error rate of the out-of-bag classifier.
- Same idea for out-of-bag strength and correlation.
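In R, the out-of-bag error comes for free from randomForest (stand-in data again):

    library(randomForest)
    X  <- matrix(rnorm(100 * 10), ncol = 10)
    y  <- factor(sample(c(0, 1), 100, replace = TRUE))
    rf <- randomForest(X, y, ntree = 500)
    rf$err.rate[rf$ntree, "OOB"]   # OOB estimate of the generalization error
    n <- 100; (1 - 1/n)^n          # ~ exp(-1): fraction left out of a bootstrap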

Conclusions: RandomForest

- Random forests do not overfit - an effective tool in prediction.
- Fast in computation.
- Out-of-bag estimates gauge the performance of the forest.
- Forests give results competitive with boosting and adaptive bagging, without progressively changing the training set. Their accuracy indicates that they reduce bias.
- Random inputs and random features produce good results in classification but less so in regression.


RandomForest in Unsupervised Learning

RandomForest can be used in the unsupervised mode for

- variable selection
- the proximity matrix (for clustering)
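A sketch of the unsupervised mode (X is a stand-in data matrix; omitting y makes randomForest classify the data against a synthetic reference, and the proximities can then be clustered):

    library(randomForest)
    X  <- matrix(rnorm(100 * 10), ncol = 10)
    rf <- randomForest(X, proximity = TRUE)        # unsupervised mode, no y
    hc <- hclust(as.dist(1 - rf$proximity))        # cluster on 1 - proximity
    MDSplot(rf, fac = factor(cutree(hc, k = 2)))   # 2-D view of proximities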



What are SVMs

- Support vector machines (SVMs) are a set of supervised learning methods used for classification and regression
- An extension of LDA
- many hyperplanes could classify the data; we are interested in the one achieving maximum separation (margin) between the two classes
- mathematically, for (y_i, x_i), y_i = ±1, i = 1, ..., n:
  min (1/2)||w||^2 s.t. y_i(x_i'w - b) ≥ 1 (if separable)
  min (1/2)||w||^2 + λ Σ_{i=1}^{n} ξ_i s.t. ξ_i ≥ 0, y_i(x_i'w - b) ≥ 1 - ξ_i (if not separable)
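A sketch with the e1071 package (stand-in data; the cost argument plays the role of the slack penalty λ above):

    ## Soft-margin linear SVM.
    library(e1071)
    X <- matrix(rnorm(60 * 2), ncol = 2)          # stand-in inputs
    y <- factor(rep(c(-1, 1), each = 30))         # stand-in labels
    fit <- svm(X, y, kernel = "linear", cost = 1)
    table(predicted = predict(fit, X), truth = y)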

SVM: separable case

[Image: http://upload.wikimedia.org/wikipedia/commons/2/20/Svm_separating_hyperplanes.png - candidate separating hyperplanes. Separable case.]

SVM: separable case

[Image: http://upload.wikimedia.org/wikipedia/commons/2/2a/Svm_max_sep_hyperplane_with_margin.png - the maximum-separation hyperplane with its margin. Separable case.]

Predictive Models

Are we only interested in a predictive black box, or are we also interested in which features predict?

- p >> n: it is easy to find classifiers that separate the data - are they meaningful?
- if features are suspected to be sparse, most features are irrelevant; need automatic feature selection, e.g., LASSO, or SVM with an L1 penalty
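For instance, L1-penalized feature selection can be sketched with glmnet (an assumed stand-in for the LASSO idea mentioned above, not the speaker's own code):

    ## Lasso-penalized logistic regression: most coefficients shrink to zero.
    library(glmnet)
    X <- matrix(rnorm(60 * 100), ncol = 100)   # p >> n stand-in
    y <- factor(rep(c(0, 1), each = 30))
    cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1: L1
    coef(cvfit, s = "lambda.min")              # sparse coefficient vector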


Summary

- Visualization is an important aspect of EDA. "A picture is worth a thousand words."
- Supervised learning allows one to select features and to classify (prediction).
- Unsupervised learning allows the study of associations among features, feature selection, and clustering.