
Visualization and Machine Learning for exploratory data analysis

Xiaochun Li (1,2)

1 Division of Biostatistics, Indiana University School of Medicine
2 Regenstrief Institute

May 2, 2008 / CCBB Journal Club

Outline

1 Introduction

2 Visualization
  - As Is
  - Simple Summarization
  - More Advanced Methods

3 Machine Learning
  - Supervised Learning
  - Unsupervised Learning
  - Random Forests
  - SVM

Introduction

In mining large-scale datasets, methods are needed to

- search for patterns, e.g., biologically important gene sets or samples
- present data structure succinctly

Both are essential in the analysis.

Objective: Visualization

An essential part of exploratory data analysis, and of reporting the results.

- plot data as is
- plot data after simple summarization
- plot data based on more advanced methods:
  clustering, PCA (principal component analysis), MDS (multidimensional scaling), silhouette, randomForest, ...


Plot data as is: Quality Inspection

[Figure: an Affymetrix chip image. Some images may have obvious local contaminations.]

Plot data as is: Quality Inspection

[Figure: four 16 x 24 plate images, panels "Ins+, white", "Ins-, white", "Ins+, black", "Ins-, black". An RNAi experiment with white and black plates, insulin stimulated +/-.]

Plot data as is: R tools

- image or heatmap for any chip arrays
- for cell-based assays, could also use plotPlate in the R package prada
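As a rough sketch of these tools (the plate matrix below is a hypothetical stand-in for real well readouts, and the plotPlate call is indicative only):

    ## Minimal sketch: view one 16 x 24 plate of intensities as an image.
    plate <- matrix(rnorm(16 * 24), nrow = 16, ncol = 24)  # stand-in data
    image(t(plate)[, nrow(plate):1], col = gray.colors(64),
          axes = FALSE, main = "Plate intensities")
    ## With Bioconductor's prada, a well-level plate plot looks like
    ##   library(prada)
    ##   plotPlate(as.vector(t(plate)), nrow = 16, ncol = 24)
    ## (argument details may differ across prada versions).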


Simple Summarization: Along Genomic Coordinates

[Figure: "Cumulative expression levels by genes in chromosome 21, scaling method: none"; x axis: representative genes along chromosome 21, y axis: cumulative expression levels.]

Cumulative expression profiles along chromosome 21 for samples from 10 children with trisomy 21 and a transient myeloid disorder, colored in red, and children with different subtypes of acute myeloid leukemia (M7), colored in blue.

Simple Summarization: Along Genomic Coordinates

- The previous wiggle plot was produced using alongChrom of the R package geneplotter
- Could plot just a segment of a chromosome of interest
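The underlying idea can be sketched in a few lines of base R (a hedged illustration of the cumulative-profile plot, not the alongChrom interface; expr and pos are hypothetical stand-ins):

    ## Sketch: cumulative expression along ordered genes, one curve per sample.
    expr <- matrix(rexp(50 * 20), nrow = 50)      # stand-in: 50 genes x 20 samples
    pos  <- runif(50)                             # stand-in genomic positions
    cum  <- apply(expr[order(pos), ], 2, cumsum)  # accumulate in genomic order
    matplot(cum, type = "l", lty = 1,
            col = c(rep("red", 10), rep("blue", 10)),
            xlab = "Genes along chromosome 21",
            ylab = "Cumulative expression levels")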


MASS Spec Example: "Latin Square" Design for B-F

    Group   Cytochrome c   Ubiquitin   Lysozyme   Myoglobin   Trypsinogen
    A            0             0           0          0            0
    B            0             1           2          5           10
    C            1             2           5         10            0
    D            2             5          10          0            1
    E            5            10           0          1            2
    F           10             0           1          2            5
    G           10            10          10         10           10

Design and the protein concentrations: one concentration unit is 1 fmol/uL for Ubiquitin, 10 fmol/uL for Cytochrome c, Lysozyme and Myoglobin, and 100 fmol/uL for Trypsinogen.

Mass Spec: Example

[Figure: one spectrum from group A; x axis: m/z from 0 to 1e+05, y axis: intensity from 0 to 40.]

Mass Spec: MDS

[Figure: 3-D scatter plot of the first, second and third MDS coordinates.]

Classical MDS scaling results of 39 spectra from groups A, D and G. Circles represent group A, squares group D and triangles group G. Each group has 13 spectra.
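A sketch of how such a plot can be produced (spec is a hypothetical matrix holding the 39 spectra as columns):

    ## Classical MDS of pairwise Euclidean distances, shown in 3-D.
    spec <- matrix(rnorm(200 * 39), ncol = 39)   # stand-in spectra
    grp  <- rep(c("A", "D", "G"), each = 13)
    mds  <- cmdscale(dist(t(spec)), k = 3)       # 39 x 3 coordinates
    library(scatterplot3d)
    scatterplot3d(mds, pch = c(A = 1, D = 0, G = 2)[grp],
                  xlab = "first coordinate", ylab = "second coordinate",
                  zlab = "third coordinate")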

MASS Spec: pairs plot

[Figure: scatter-plot matrix of spec 1 through spec 4. Spec 1 correlates 0.66, 0.60 and 0.59 with spec 2-4, which correlate 0.97-0.99 among themselves.]

The outlier in group A and 3 other spectra from the same group are plotted against each other. The lower left panels show the Pearson correlation coefficients of pairs of spectra.
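A sketch of the plot (adapted from the panel example in ?pairs; spec4 is a hypothetical matrix of the four spectra):

    ## Scatter-plot matrix with Pearson correlations in the lower panels.
    panel.cor <- function(x, y, ...) {
      old <- par(usr = c(0, 1, 0, 1)); on.exit(par(old))
      text(0.5, 0.5, format(cor(x, y), digits = 2))
    }
    spec4 <- matrix(rnorm(200 * 4), ncol = 4,
                    dimnames = list(NULL, paste("spec", 1:4)))
    pairs(spec4, lower.panel = panel.cor)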

MASS Spec: pairs plot

[Figure: the same scatter-plot matrix for four group-A spectra; here all pairwise Pearson correlations are 0.96-0.99.]

The outlier in group A and 3 other spectra from the same group are plotted against each other. The lower left panels show the Pearson correlation coefficients of pairs of spectra.

Mass Spec: MDS, 3-D

[Figure: 3-D scatter plot of the first, second and third MDS coordinates.]

Classical MDS scaling results of 39 spectra from groups A, D and G. Circles represent group A, squares group D and triangles group G. Each group has 13 spectra.

Silhouette plot: visualize clustering results

[Figure: cluster dendrogram, hclust(*, "complete") on d.s.nocut, heights 0-50; the 13 group-A leaves separate while the D and G leaves intermix.]

Dendrogram of clustering results of 39 spectra from groups A, D and G - before and after the low molecular range is removed.

Silhouette plot: visualize clustering results

[Figure: cluster dendrogram, hclust(*, "complete") on d.s.cut, heights 0-400; the A, D and G leaves now separate cleanly by group.]

Dendrogram of clustering results of 39 spectra from groups A, D and G - before and after the low molecular range is removed.

Silhouette plot: visualize clustering results

[Figure: silhouette plot, whole spectrum; silhouette width s_i on 0.0-1.0.]

    n = 39, 3 clusters C_j
    j : n_j | ave_{i in C_j} s_i
    1 : 17 | 0.67
    2 : 16 | 0.48
    3 :  6 | 0.56
    Average silhouette width: 0.57

Silhouette plot of clustering results of 39 spectra from groups A, D and G - before and after the low molecular range is removed.

Silhouette plot: visualize clustering results

[Figure: silhouette plot, mz < 1000 cut; silhouette width s_i on 0.0-1.0.]

    n = 39, 3 clusters C_j
    j : n_j | ave_{i in C_j} s_i
    1 : 13 | 0.82
    2 : 13 | 0.60
    3 : 13 | 0.53
    Average silhouette width: 0.65

Silhouette plot of clustering results of 39 spectra from groups A, D and G - before and after the low molecular range is removed.

Silhouette plot: silhouette width

For each observation i, the silhouette width s_i is defined as follows:

- a_i = average dissimilarity between i and all other points of the cluster to which i belongs
- for all other clusters C, put d(i, C) = average dissimilarity of i to all observations of C
- b_i = min_C d(i, C), which can be seen as the dissimilarity between i and its "neighbor" cluster, i.e., the nearest one to which it does not belong
- s_i = (b_i - a_i) / max(a_i, b_i)
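In R, the clustering and the silhouette plot of the previous slides can be sketched as follows (spec is again a hypothetical matrix of spectra):

    ## Complete-linkage clustering and silhouette widths for a 3-group cut.
    library(cluster)
    spec <- matrix(rnorm(200 * 39), ncol = 39)   # stand-in: 39 spectra
    d    <- dist(t(spec))                        # pairwise dissimilarities
    hc   <- hclust(d, method = "complete")
    plot(hc, main = "Cluster Dendrogram")
    si <- silhouette(cutree(hc, k = 3), d)       # s_i for each spectrum
    plot(si)                                     # per-cluster average widths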

Visualization: R tools

- classical MDS: cmdscale
- 2-D, 3-D scatter plot: plot and the R package scatterplot3d
- 2-D scatter plot matrix: pairs
- silhouette plot: silhouette

Machine Learning

Machine Learning: computational and statistical approaches to extract important patterns and trends hidden in large data sets.

- Supervised: predict outcome y based on X, a number of inputs (variables). E.g., predict the class labels "tumor" or "normal" based on gene expression.
- Unsupervised: no y; describe the associations and patterns among X. E.g., which subset of genes has similar expression? Which subgroup of patients has similar gene expression profiles?



Supervised Learning

- linear model
- nearest neighbor (k-nn)
- LDA (Linear Discriminant Analysis): same covariance Σ across classes
- LDA variants: QDA (class-specific Σ_k), DLDA (Σ is diagonal), RDA (regularized: uses αΣ + (1 - α)I)
- SVM
- randomForest
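Two of these in R, as a sketch on stand-in data (X and y are hypothetical):

    ## LDA (pooled covariance) and k-nearest neighbors.
    library(MASS)
    library(class)
    X <- matrix(rnorm(60 * 5), ncol = 5)                  # stand-in inputs
    y <- factor(rep(c("tumor", "normal"), each = 30))     # stand-in labels
    fit      <- lda(X, grouping = y)                      # LDA
    pred.lda <- predict(fit, X)$class
    pred.knn <- knn(train = X, test = X, cl = y, k = 3)   # k-nn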


Unsupervised Learning

- Clustering
- PCA (Principal component analysis)
- MDS (Multidimensional scaling); classical MDS using Euclidean distance = PCA
- K-means
- SOM (Self-organizing maps)
- Unsupervised as Supervised Learning

Unsupervised as Supervised Learning: through data augmentation

Let g(x) be the unknown density to be estimated, and g_0(x) be a specified reference density.

- x_1, x_2, ..., x_n ~ i.i.d. g(x); assign class label Y = 1
- x_{n+1}, x_{n+2}, ..., x_{2n} ~ i.i.d. g_0(x); assign class label Y = 0
- the combined sample x_1, x_2, ..., x_{2n} ~ i.i.d. (g(x) + g_0(x))/2
- µ(x) ≡ E(Y | x) = [g(x)/g_0(x)] / [1 + g(x)/g_0(x)] can be estimated by supervised learning on the combined sample (y_1, x_1), (y_2, x_2), ..., (y_{2n}, x_{2n})
- g(x) = g_0(x) µ(x) / (1 - µ(x))

E.g., using this technique with randomForest.
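A sketch of the trick with randomForest (all data here are stand-ins; g_0 is taken to be uniform over the range of the observed sample):

    ## Estimate mu(x) = P(Y = 1 | x) by classifying real vs. reference
    ## points, then invert to get g(x) = g0(x) * mu(x) / (1 - mu(x)).
    library(randomForest)
    x  <- matrix(rnorm(200), ncol = 2)                         # sample from g
    x0 <- apply(x, 2, function(v) runif(100, min(v), max(v)))  # sample from g0
    y  <- factor(rep(c(1, 0), each = 100))
    rf <- randomForest(rbind(x, x0), y)
    mu <- predict(rf, x, type = "prob")[, "1"]                 # estimated mu(x)
    g0 <- 1 / prod(apply(x, 2, function(v) diff(range(v))))    # uniform density
    g  <- g0 * mu / (1 - mu)     # density estimate (cap mu away from 1 in practice)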



What are Random Forests

Random forests are a combination of tree predictors, where each tree depends on one of the i.i.d. random vectors {θ_k}.

Example - Bagging (bootstrap aggregation):

- bootstrap samples are drawn from the training set, where θ_k is the vector of counts in n boxes resulting from sampling with replacement
- a tree is grown from each bootstrap sample
- classes are assigned by majority vote.
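Bagging can be sketched directly with rpart trees (df is a hypothetical training set; each bootstrap draw is one realization of θ_k):

    ## Bagging: grow one tree per bootstrap sample, classify by majority vote.
    library(rpart)
    df <- data.frame(y  = factor(rep(c("a", "b"), each = 50)),  # stand-in data
                     x1 = rnorm(100), x2 = rnorm(100))
    votes <- sapply(1:25, function(k) {
      idx <- sample(nrow(df), replace = TRUE)       # bootstrap sample (theta_k)
      fit <- rpart(y ~ ., data = df[idx, ], method = "class")
      as.character(predict(fit, df, type = "class"))
    })
    bagged <- apply(votes, 1, function(v) names(which.max(table(v))))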

Motivation:

Improve prediction

- a single tree has poor accuracy for problems with many variables each carrying very little information, e.g., genomics data sets
- combining trees grown using random features can improve accuracy

Assess performance

- training error (the error rate on the training set) does not indicate performance on new data
- overfit → small training error but poor generalization error
- need data which were not used to grow a particular tree to assess the performance of the tree.

Strength and Correlation

For a given case (X, Y) and a given ensemble of classifiers:

- margin = (proportion of votes for the right class) - max over the other classes of (proportion of votes for that class)
- generalization error PE* = P_{X,Y}(margin < 0)
- s ≡ strength = E_{X,Y}(margin)
- ρ ≡ correlation = the mean correlation between any two trees in the forest
- Thm 1.2: the generalization error converges
- Thm 2.3: the generalization error is bounded: PE* ≤ ρ(1 - s^2)/s^2.

Random Forests Converge

Theorem 1.2. As the number of trees increases, the generalization error converges almost surely over the sequences {θ_k}.

- this is why random forests do not overfit as more trees are added; instead they tend to a limiting value of the generalization error.

Strategy: Minimize Correlation While Keeping Strength

Use randomly selected inputs or combinations of inputs at each node to grow each tree:

- Random Input Selection - Forest-RI: at each node, select at random F variables to split on; grow the tree to maximum size and do not prune.
- Random Feature Selection - Forest-RC: same idea as above, but with F features, i.e., "linear combinations of randomly selected L variables" with random coefficients runif(L, -1, 1) ⇒ further reduces correlation
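Roughly, in R (Forest-RI corresponds to randomForest's mtry argument; the Forest-RC part below only mimics the idea by pre-generating random features rather than drawing them per node, and all data are stand-ins):

    library(randomForest)
    X <- matrix(rnorm(100 * 10), ncol = 10)             # stand-in inputs
    y <- factor(sample(c(0, 1), 100, replace = TRUE))   # stand-in labels
    rf.ri <- randomForest(X, y, mtry = 3)               # Forest-RI with F = 3
    L <- 3
    w <- replicate(5, {                                 # 5 random features
      coef <- numeric(10)
      coef[sample(10, L)] <- runif(L, -1, 1)            # L random coefficients
      coef
    })
    rf.rc <- randomForest(cbind(X, X %*% w), y)         # Forest-RC-style inputs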

Gauging Performance

Bagging makes it possible to estimate the generalization error without a test set.

- Why: in any bootstrap sample, about 1/3 of the cases from the original training set are left out due to sampling with replacement: (1 - 1/n)^n ≈ e^(-1) ≈ 1/3.

Out-Of-Bag Estimates of Error, Strength and Correlation

- For each (x, y), aggregate the votes over trees grown without (x, y) - the out-of-bag classifier.
- Out-of-bag estimate of generalization error = error rate of the out-of-bag classifier.
- Same idea for out-of-bag strength and correlation.
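In R, the out-of-bag error comes for free from randomForest (stand-in data again):

    library(randomForest)
    X  <- matrix(rnorm(100 * 10), ncol = 10)
    y  <- factor(sample(c(0, 1), 100, replace = TRUE))
    rf <- randomForest(X, y, ntree = 500)
    rf$err.rate[rf$ntree, "OOB"]   # OOB estimate of the generalization error
    n <- 100; (1 - 1/n)^n          # ~ exp(-1): fraction left out of a bootstrap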

Conclusions: RandomForest

- Random forests do not overfit - an effective tool in prediction.
- Fast in computation.
- Out-of-bag estimates gauge the performance of the forest.
- Forests give results competitive with boosting and adaptive bagging, without progressively changing the training set. Their accuracy indicates that they reduce bias.
- Random inputs and random features produce good results in classification but less so in regression.


RandomForest in Unsupervised Learning

RandomForest can be used in the unsupervised mode for

- variable selection
- the proximity matrix (for clustering)
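A sketch of the unsupervised mode (X is a stand-in data matrix; omitting y makes randomForest classify the data against a synthetic reference, and the proximities can then be clustered):

    library(randomForest)
    X  <- matrix(rnorm(100 * 10), ncol = 10)
    rf <- randomForest(X, proximity = TRUE)        # unsupervised mode, no y
    hc <- hclust(as.dist(1 - rf$proximity))        # cluster on 1 - proximity
    MDSplot(rf, fac = factor(cutree(hc, k = 2)))   # 2-D view of proximities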



What are SVMs

- Support vector machines (SVMs) are a set of supervised learning methods used for classification and regression
- An extension of LDA
- many hyperplanes could classify the data; we are interested in the one achieving maximum separation (margin) between the two classes
- mathematically, for (y_i, x_i), y_i = ±1, i = 1, ..., n:
  min (1/2)||w||^2 s.t. y_i(x_i'w - b) ≥ 1 (if separable)
  min (1/2)||w||^2 + λ Σ_{i=1}^{n} ξ_i s.t. ξ_i ≥ 0, y_i(x_i'w - b) ≥ 1 - ξ_i (if not separable)
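A sketch with the e1071 package (stand-in data; the cost argument plays the role of the slack penalty λ above):

    ## Soft-margin linear SVM.
    library(e1071)
    X <- matrix(rnorm(60 * 2), ncol = 2)          # stand-in inputs
    y <- factor(rep(c(-1, 1), each = 30))         # stand-in labels
    fit <- svm(X, y, kernel = "linear", cost = 1)
    table(predicted = predict(fit, X), truth = y)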

SVM: separable case

[Image: http://upload.wikimedia.org/wikipedia/commons/2/20/Svm_separating_hyperplanes.png - candidate separating hyperplanes. Separable case.]

SVM: separable case

[Image: http://upload.wikimedia.org/wikipedia/commons/2/2a/Svm_max_sep_hyperplane_with_margin.png - the maximum-separation hyperplane with its margin. Separable case.]

Predictive Models

Are we only interested in a predictive black box, or are we also interested in which features predict?

- p >> n: it is easy to find classifiers that separate the data - are they meaningful?
- if features are suspected to be sparse, most features are irrelevant; need automatic feature selection, e.g., LASSO, or SVM with an L1 penalty
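For instance, L1-penalized feature selection can be sketched with glmnet (an assumed stand-in for the LASSO idea mentioned above, not the speaker's own code):

    ## Lasso-penalized logistic regression: most coefficients shrink to zero.
    library(glmnet)
    X <- matrix(rnorm(60 * 100), ncol = 100)   # p >> n stand-in
    y <- factor(rep(c(0, 1), each = 30))
    cvfit <- cv.glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1: L1
    coef(cvfit, s = "lambda.min")              # sparse coefficient vector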


Summary

- Visualization is an important aspect of EDA. "A picture is worth a thousand words."
- Supervised learning allows one to select features and to classify (prediction).
- Unsupervised learning allows the study of associations among features, feature selection, and clustering.