Visualizing and Exploring Data 1

Visualizing and Exploring Data 1. Outline 1.Introduction 2.Summarizing Data: Some Simple Examples 3.Tools for Displaying Single Variable 4.Tools for Displaying


Visualizing and Exploring Data


Outline

1. Introduction
2. Summarizing Data: Some Simple Examples
3. Tools for Displaying Single Variable
4. Tools for Displaying Relationships between Two Variables
5. Tools for Displaying More Than Two Variables
6. Principal Components Analysis
7. Multidimensional Scaling


Introduction

• Visual methods are important and ideal for sifting through data to find unexpected relationships.

• Exploratory data analysis aims to find structure that may indicate deeper relationships between cases or variables.


Summarizing Data: Some Simple Examples

Measures of location:

• Mean
• Median
• First quartile
• Third quartile
• Deciles
• Percentiles
• Mode


Summarizing Data: Some Simple Examples(Cont.)

Suppose that x(1), x(2), ..., x(n) comprise a set of n data values.

• Sample mean

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x(i)$$

μ: true mean of the population; $\hat{\mu}$: estimate of the true mean


Summarizing Data: Some Simple Examples(Cont.)

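The sample mean minimizes the sum of squared differences between itself and the data values.

Ex. data set {1,2,3,4,5}

μ = 3: sum of squared differences = 4 + 1 + 0 + 1 + 4 = 10

μ = 1: sum of squared differences = 0 + 1 + 4 + 9 + 16 = 30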
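The minimizing property of the sample mean can be checked numerically. The sketch below uses the data set {1,2,3,4,5} from the example; the helper name `sum_squared_diff` and the grid of candidate values are illustrative choices, not part of the slides.

```python
import numpy as np

def sum_squared_diff(data, a):
    """Sum of squared differences between a candidate value a and the data."""
    data = np.asarray(data, dtype=float)
    return float(np.sum((data - a) ** 2))

data = [1, 2, 3, 4, 5]
sample_mean = float(np.mean(data))

sse_at_mean = sum_squared_diff(data, sample_mean)  # 4 + 1 + 0 + 1 + 4
sse_at_one = sum_squared_diff(data, 1.0)           # 0 + 1 + 4 + 9 + 16

# Scan a grid of candidate values (step 0.1, an arbitrary choice);
# the smallest sum of squares is found at the sample mean.
candidates = np.linspace(0.0, 6.0, 61)
sse_values = [sum_squared_diff(data, a) for a in candidates]
best = float(candidates[int(np.argmin(sse_values))])

print(sample_mean, sse_at_mean, sse_at_one, best)
```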


Summarizing Data: Some Simple Examples(Cont.)

• Median: The value that has an equal number of data points above and below it.

Ex. data set {1,2,3,4,5}: Median = 3

Ex. data set {1,2,3,4,5,6}: Median = (3+4)/2 = 3.5


Summarizing Data: Some Simple Examples(Cont.)

• First quartile: The value that is greater than a quarter of data points.

• Third quartile: The value that is greater than three quarters of data points.

• Interquartile range: The difference between the third and first quartile.

• Range: The difference between the largest and smallest data point.


Summarizing Data: Some Simple Examples(Cont.)

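• Percentiles: The values of a variable below which a given percentage of observations fall.

• Deciles: The percentiles at multiples of 10% (the 10th, 20th, ..., 90th), which split the ordered data into ten equal parts.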
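As a rough illustration of quartiles, the interquartile range, the range and deciles, the sketch below uses NumPy's `np.percentile`. The data set is an arbitrary example, and NumPy's default linear interpolation between order statistics is only one of several conventions, so other software may report slightly different quartiles.

```python
import numpy as np

# Illustrative data set (an assumption for this sketch).
data = np.array([1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17], dtype=float)

# First quartile, median and third quartile are the 25th, 50th and 75th percentiles.
q1, median, q3 = np.percentile(data, [25, 50, 75])

iqr = q3 - q1                          # interquartile range
data_range = data.max() - data.min()   # range: largest minus smallest value

# Deciles are the 10th, 20th, ..., 90th percentiles.
deciles = np.percentile(data, np.arange(10, 100, 10))

print(q1, median, q3, iqr, data_range)
print(deciles)
```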


Summarizing Data: Some Simple Examples(Cont.)

• Mode: The value that occurs most frequently in a data set or a probability distribution.

Ex. data set {1,3,6,6,6,6,7,7,12,12,17}: Mode = 6

Ex. data set {1,1,2,4,4}: Mode = 1, 4
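The two mode examples can be reproduced with a small helper that returns every value attaining the highest count, so ties are reported as several modes. The function name `modes` is an illustrative choice.

```python
from collections import Counter

def modes(data):
    """Return every value that attains the highest frequency, in sorted order."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

print(modes([1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17]))  # [6]
print(modes([1, 1, 2, 4, 4]))                        # [1, 4]
```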


Summarizing Data: Some Simple Examples(Cont.)

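• Unimodal: A data set or a distribution with one mode

• Bimodal: A data set or a distribution with two modes

• Multimodal: A data set or a distribution with more than one mode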


Summarizing Data: Some Simple Examples(Cont.)

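• Variance

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \left( x(i) - \mu \right)^2$$

If μ is replaced with the sample mean $\hat{\mu}$, then the variance is estimated as

$$\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( x(i) - \hat{\mu} \right)^2$$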


Summarizing Data: Some Simple Examples(Cont.)

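• Standard deviation: The square root of the variance.

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( x(i) - \mu \right)^2}$$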
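A short sketch of the two variance conventions (dividing by n versus by n - 1) and the corresponding standard deviation. The data set is illustrative; `ddof` is NumPy's parameter for the divisor n - ddof.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5], dtype=float)  # illustrative data set
n = data.size
mu_hat = data.mean()

squared_dev = np.sum((data - mu_hat) ** 2)     # sum of squared deviations

var_n = squared_dev / n                # divide by n
var_n_minus_1 = squared_dev / (n - 1)  # divide by n - 1 when the mean is estimated

# NumPy exposes the same choice through ddof (divisor is n - ddof).
assert np.isclose(var_n, np.var(data))
assert np.isclose(var_n_minus_1, np.var(data, ddof=1))

std = np.sqrt(var_n_minus_1)           # standard deviation: square root of the variance
print(var_n, var_n_minus_1, std)
```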


Summarizing Data: Some Simple Examples(Cont.)

• Skewness: It measures whether or not a distribution has a single long tail.

• A distribution is said to be right-skewed if the long tail extends in the direction of increasing values and left-skewed if it extends in the direction of decreasing values. Symmetric distributions have zero skewness.


Tools for Displaying Single Variable

• Histogram-1


Tools for Displaying Single Variable(Cont.)

• Histogram-2


Tools for Displaying Single Variable(Cont.)

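• Kernel estimate

A single variable X has measured values {x(1), x(2), ..., x(n)}. The kernel estimate of its density at a point x is

$$\hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} K\left( x - x(i),\, h \right)$$

K(): Kernel function, commonly a Gaussian curve

h: Width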


Tools for Displaying Single Variable(Cont.)

• Gaussian curve

$$K(t, h) = C \exp\left( -\frac{1}{2} \left( \frac{t}{h} \right)^2 \right)$$

C: Normalization constant

t = x - x(i)

h: Standard deviation
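A minimal sketch of a Gaussian kernel estimate, assuming the normalization constant C = 1/(h·sqrt(2π)) so that each kernel integrates to one. The sample values, the width h = 0.5 and the function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel(t, h):
    """K(t, h) = C * exp(-0.5 * (t / h)**2), with C = 1 / (h * sqrt(2 * pi))."""
    c = 1.0 / (h * np.sqrt(2.0 * np.pi))
    return c * np.exp(-0.5 * (t / h) ** 2)

def kernel_estimate(x, samples, h):
    """Average of Gaussian kernels centred on each measured value x(i)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    samples = np.asarray(samples, dtype=float)
    t = x[:, None] - samples[None, :]   # t = x - x(i) for every (x, x(i)) pair
    return gaussian_kernel(t, h).mean(axis=1)

samples = [1.0, 1.5, 2.0, 4.0, 4.2]     # made-up measurements
grid = np.linspace(-5.0, 11.0, 1601)    # evaluation points, spacing 0.01
density = kernel_estimate(grid, samples, h=0.5)

# A Riemann sum over the grid should be close to 1 for a valid density estimate.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(area)
```

A larger h gives a smoother estimate; a smaller h follows individual data points more closely.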


Tools for Displaying Single Variable(Cont.)

• Box and whisker plot


Tools for Displaying Relationships between Two Variables

• Scatterplot


Tools for Displaying Relationships between Two Variables(Cont.)

• Contour plot


Tools for Displaying More Than Two Variables

• Scatterplot matrix


Tools for Displaying More Than Two Variables(Cont.)

• Trellis plot


Tools for Displaying More Than Two Variables(Cont.)

• Star plot


Tools for Displaying More Than Two Variables(Cont.)

• Chernoff’s face


Tools for Displaying More Than Two Variables(Cont.)

• Parallel coordinates plot


Principal Components Analysis

• Objective: To find the vectors onto which the projected data retain maximum variance.

• Advantage: This method can reduce the dimensionality of the data.

Principal Components Analysis(Cont.)

• Suppose X is an n×p data matrix in which each row is a data vector x and each column represents a variable.

• X is mean-centered (i.e., each column has had the sample mean of that variable subtracted).

Principal Components Analysis(Cont.)

• Let a be a p×1 column vector of projection weights. The projection of a data vector x along a is

$$a^T x = \sum_{j=1}^{p} a_j x_j$$

• Projecting all data vectors in X onto a gives Xa, an n×1 column vector of projected values.

Principal Components Analysis(Cont.)

• Define the variance along a as

$$\sigma_a^2 = (Xa)^T (Xa) = a^T X^T X a = a^T V a$$

• $V = X^T X$: The p×p covariance matrix of the data

Principal Components Analysis(Cont.)

• Impose the constraint $a^T a = 1$ and use a Lagrange multiplier λ to find the a that maximizes the variance along a:

$$u = a^T V a - \lambda \left( a^T a - 1 \right)$$

• Differentiating with respect to a yields

$$\frac{\partial u}{\partial a} = 2Va - 2\lambda a = 0 \quad \Rightarrow \quad Va = \lambda a$$

Principal Components Analysis(Cont.)

• The first principal component a is the eigenvector associated with the largest eigenvalue of the covariance matrix V

• The second principal component is the eigenvector associated with the second largest eigenvalue; its direction is orthogonal to the first, and so on.
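The eigenvector characterization can be sketched with NumPy. The data below are synthetic (generated only for illustration), and `np.linalg.eigh` is used because V = XᵀX is symmetric; it returns eigenvalues in ascending order, so they are re-sorted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic n x p data (n = 200, p = 3) with correlated columns; purely illustrative.
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2.0 * latent, -latent]) + 0.1 * rng.normal(size=(200, 3))
X = X - X.mean(axis=0)                 # mean-centre each column

V = X.T @ X                            # p x p matrix (covariance up to a constant factor)

eigvals, eigvecs = np.linalg.eigh(V)   # symmetric eigensolver, ascending eigenvalues
order = np.argsort(eigvals)[::-1]      # re-sort so the largest eigenvalue comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

a1 = eigvecs[:, 0]                     # first principal component
a2 = eigvecs[:, 1]                     # second principal component

variance_along_a1 = a1 @ V @ a1        # a^T V a, equals the largest eigenvalue
print(variance_along_a1, eigvals[0], a1 @ a2)
```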


Principal Components Analysis(Cont.)

• When the data are projected onto the first k eigenvectors, the variance of the projected data can be expressed as

$$\sum_{j=1}^{k} \lambda_j$$

• $\lambda_j$: The jth eigenvalue

Principal Components Analysis(Cont.)

• The loss incurred by keeping only the first k of the p eigenvectors

$$\frac{\sum_{j=k+1}^{p} \lambda_j}{\sum_{l=1}^{p} \lambda_l}$$
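The loss ratio, the sum of the discarded eigenvalues over the sum of all eigenvalues, is simple to compute once the eigenvalues are known. The eigenvalues below are hypothetical and the function name `fraction_lost` is an illustrative choice.

```python
import numpy as np

def fraction_lost(eigenvalues, k):
    """Share of total variance discarded when only the k largest eigenvalues are kept."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # descending order
    return float(lam[k:].sum() / lam.sum())

eigenvalues = [6.0, 3.0, 1.0]          # hypothetical eigenvalues, p = 3

loss_k1 = fraction_lost(eigenvalues, 1)   # (3 + 1) / 10
loss_k2 = fraction_lost(eigenvalues, 2)   # 1 / 10
print(loss_k1, loss_k2)
```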


Principal Components Analysis(Cont.)

• Scree plot


Principal Components Analysis(Cont.)

• Ex. (10 observations of 3 variables)

269.8  38.9  50.5
272.4  39.5  50.0
272.0  39.3  50.2
268.2  38.6  50.2
268.2  38.6  50.8
267.0  38.2  51.1
267.8  38.4  51.0
273.6  39.6  50.0
271.2  39.1  50.4
270.0  38.9  50.5
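One way to run the analysis on the example table is sketched below: centre the columns, form V = XᵀX, and inspect how much of the total variance each eigenvalue carries. The variable names are illustrative, and the slides do not state what the three columns measure.

```python
import numpy as np

# The 10 x 3 example data from the slide.
X = np.array([
    [269.8, 38.9, 50.5],
    [272.4, 39.5, 50.0],
    [272.0, 39.3, 50.2],
    [268.2, 38.6, 50.2],
    [268.2, 38.6, 50.8],
    [267.0, 38.2, 51.1],
    [267.8, 38.4, 51.0],
    [273.6, 39.6, 50.0],
    [271.2, 39.1, 50.4],
    [270.0, 38.9, 50.5],
])

Xc = X - X.mean(axis=0)                 # mean-centre each variable
V = Xc.T @ Xc                           # 3 x 3 matrix V = X^T X

eigvals = np.linalg.eigvalsh(V)[::-1]   # eigenvalues, largest first
explained = eigvals / eigvals.sum()     # share of variance per component
loss_if_k1 = explained[1:].sum()        # loss when only the first component is kept

print(eigvals)
print(explained, loss_if_k1)
```

Plotting `eigvals` against the component index gives a scree plot for this data set.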


Multidimensional Scaling

• Objective: To represent the data points in a lower-dimensional space while preserving, as far as possible, the distances between the data points.

Multidimensional Scaling(Cont.)

• Classical multidimensional scaling

• Metric multidimensional scaling

• Non-metric multidimensional scaling

Multidimensional Scaling(Cont.)

• Assume a 3×2 data matrix X in which the mean of each variable is zero.

$$X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix}$$

• Then compute the 3×3 matrix B:

$$B = X X^T = \begin{bmatrix} x_{11}^2 + x_{12}^2 & x_{11}x_{21} + x_{12}x_{22} & x_{11}x_{31} + x_{12}x_{32} \\ x_{21}x_{11} + x_{22}x_{12} & x_{21}^2 + x_{22}^2 & x_{21}x_{31} + x_{22}x_{32} \\ x_{31}x_{11} + x_{32}x_{12} & x_{31}x_{21} + x_{32}x_{22} & x_{31}^2 + x_{32}^2 \end{bmatrix} = \begin{bmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{bmatrix}$$

Because each column of X sums to zero,

$$\sum_i b_{ij} = \sum_j b_{ij} = 0$$

Multidimensional Scaling(Cont.)

• The squared Euclidean distance between objects 1 and 2 is

$$\begin{aligned} d_{12}^2 &= (x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 \\ &= x_{11}^2 - 2x_{11}x_{21} + x_{21}^2 + x_{12}^2 - 2x_{12}x_{22} + x_{22}^2 \\ &= b_{11} + b_{22} - 2b_{12} \end{aligned}$$

In general,

$$d_{ij}^2 = b_{ii} + b_{jj} - 2b_{ij}, \quad \text{i.e.} \quad b_{ij} = -\frac{1}{2} \left( d_{ij}^2 - b_{ii} - b_{jj} \right) \qquad (1)$$

Multidimensional Scaling(Cont.)

• Define a 3×3 matrix D of squared distances:

$$D = \begin{bmatrix} d_{11}^2 & d_{12}^2 & d_{13}^2 \\ d_{21}^2 & d_{22}^2 & d_{23}^2 \\ d_{31}^2 & d_{32}^2 & d_{33}^2 \end{bmatrix} = \begin{bmatrix} 0 & b_{11} + b_{22} - 2b_{12} & b_{11} + b_{33} - 2b_{13} \\ b_{22} + b_{11} - 2b_{21} & 0 & b_{22} + b_{33} - 2b_{23} \\ b_{33} + b_{11} - 2b_{31} & b_{33} + b_{22} - 2b_{32} & 0 \end{bmatrix}$$

Summing the first column and using $\sum_i b_{i1} = 0$:

$$\sum_i d_{i1}^2 = d_{11}^2 + d_{21}^2 + d_{31}^2 = 0 + (b_{22} + b_{11} - 2b_{21}) + (b_{33} + b_{11} - 2b_{31}) = (b_{11} + b_{22} + b_{33}) + 3b_{11}$$

In general, with n objects:

$$\sum_i d_{ij}^2 = \mathrm{tr}(B) + n\,b_{jj} \qquad (2)$$

$$\sum_j d_{ij}^2 = \mathrm{tr}(B) + n\,b_{ii} \qquad (3)$$

$$\sum_i \sum_j d_{ij}^2 = 2n\,\mathrm{tr}(B) \qquad (4)$$

Multidimensional Scaling(Cont.)

Rearranging Eq(2), Eq(3) and Eq(4):

$$b_{jj} = \frac{1}{n} \left( \sum_i d_{ij}^2 - \mathrm{tr}(B) \right) \qquad (5)$$

$$b_{ii} = \frac{1}{n} \left( \sum_j d_{ij}^2 - \mathrm{tr}(B) \right) \qquad (6)$$

$$\mathrm{tr}(B) = \frac{1}{2n} \sum_i \sum_j d_{ij}^2 \qquad (7)$$

Substituting Eq(7) for tr(B) into Eq(5) and Eq(6) gives

$$b_{ii} = \frac{1}{n} \sum_j d_{ij}^2 - \frac{1}{2n^2} \sum_i \sum_j d_{ij}^2 \qquad (8)$$

$$b_{jj} = \frac{1}{n} \sum_i d_{ij}^2 - \frac{1}{2n^2} \sum_i \sum_j d_{ij}^2 \qquad (9)$$

Multidimensional Scaling(Cont.)

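Substituting Eq(8) and Eq(9) for $b_{ii}$ and $b_{jj}$ in Eq(1) gives

$$\begin{aligned} b_{ij} &= -\frac{1}{2} \left( d_{ij}^2 - b_{ii} - b_{jj} \right) \\ &= -\frac{1}{2} \left( d_{ij}^2 - \frac{1}{n} \sum_j d_{ij}^2 + \frac{1}{2n^2} \sum_i \sum_j d_{ij}^2 - \frac{1}{n} \sum_i d_{ij}^2 + \frac{1}{2n^2} \sum_i \sum_j d_{ij}^2 \right) \\ &= -\frac{1}{2} \left( d_{ij}^2 - \frac{1}{n} \sum_i d_{ij}^2 - \frac{1}{n} \sum_j d_{ij}^2 + \frac{1}{n^2} \sum_i \sum_j d_{ij}^2 \right) \end{aligned}$$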
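The result of this derivation, that B can be recovered from the squared distances alone, can be checked numerically. The sketch below uses randomly generated zero-mean data (an assumption for illustration) and compares B = XXᵀ with the value rebuilt from the squared-distance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))            # synthetic data: n = 5 objects, 2 variables
X = X - X.mean(axis=0)                 # zero-mean columns, as the derivation assumes

B_true = X @ X.T                       # B = X X^T

# Squared Euclidean distances d_ij^2 between all pairs of rows.
diff = X[:, None, :] - X[None, :, :]
D2 = np.sum(diff ** 2, axis=2)

row_mean = D2.mean(axis=1, keepdims=True)   # (1/n) * sum over j of d_ij^2
col_mean = D2.mean(axis=0, keepdims=True)   # (1/n) * sum over i of d_ij^2
grand_mean = D2.mean()                      # (1/n^2) * sum over i and j of d_ij^2

# b_ij = -1/2 (d_ij^2 - row mean - column mean + grand mean)
B_from_D = -0.5 * (D2 - row_mean - col_mean + grand_mean)

print(np.allclose(B_from_D, B_true))
```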


Multidimensional Scaling(Cont.)

• Apply singular value decomposition to B (for the symmetric positive semidefinite matrix B this coincides with its eigendecomposition):

$$B = V \Lambda V^T$$

V: orthonormal matrix, meaning $V^T V = V V^T = I$ (so $V^{-1} = V^T$); its column vectors are the eigenvectors of B, $V = [v_1, v_2, \ldots, v_n]$

Λ: diagonal matrix whose diagonal elements are the eigenvalues of B, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$

Multidimensional Scaling(Cont.)

• Choose the r eigenvalues that are much larger than the others; r is the number of dimensions we want to map the data into.

$$B \approx V_r \Lambda_r V_r^T = \tilde{X} \tilde{X}^T, \qquad \tilde{X} = V_r \Lambda_r^{1/2}, \qquad r \le p$$

$V_r$: n×r matrix

$\Lambda_r$: r×r matrix

Multidimensional Scaling(Cont.)

• Ex.

Data:
1  2  8
3  4  5
5  6  9

Eigenvalues: 16.9641, 7.7025, 0

Distance (original data):
0       4.1231  5.7446
4.1231  0       4.8990
5.7446  4.8990  0

Transformed data:
-2.4621   1.5436
-0.7528  -2.2085
 3.2149   0.6649

Distance (transformed data):
0       4.1231  5.7446
4.1231  0       4.8990
5.7446  4.8990  0

Stress: 1.0325e-016
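The example can be reproduced with a short classical-scaling sketch: build B from the squared distances by double centring, take its eigendecomposition, and keep the two largest eigenvalues. The centring-matrix formulation and variable names are choices made for this sketch; eigenvector signs are arbitrary, so the transformed coordinates may differ from the slide by a sign flip per column.

```python
import numpy as np

X = np.array([[1, 2, 8],
              [3, 4, 5],
              [5, 6, 9]], dtype=float)   # example data from the slide
n = X.shape[0]

# Pairwise Euclidean distances in the original 3-dimensional space.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt(np.sum(diff ** 2, axis=2))

# Double centring of the squared distances gives B.
J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
B = -0.5 * J @ (D ** 2) @ J

eigvals, eigvecs = np.linalg.eigh(B)     # symmetric eigensolver, ascending order
order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

r = 2                                    # number of dimensions kept
X_tilde = eigvecs[:, :r] * np.sqrt(eigvals[:r])   # V_r * Lambda_r^(1/2)

# Distances between the transformed points.
diff_t = X_tilde[:, None, :] - X_tilde[None, :, :]
D_tilde = np.sqrt(np.sum(diff_t ** 2, axis=2))

print(np.round(eigvals, 4))
print(np.round(X_tilde, 4))
print(np.round(D_tilde, 4))
```

Because three points always fit in a plane, two dimensions reproduce the distances essentially exactly here, which is why the reported stress is at the level of floating-point rounding.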


Multidimensional Scaling(Cont.)

• Stress

$$\frac{\sum_i \sum_j (\delta_{ij} - d_{ij})^2}{\sum_i \sum_j d_{ij}^2}$$

$\delta_{ij}$: The observed distance between points i and j in the p-dimensional space.

$d_{ij}$: The distance between the points representing these objects in the two-dimensional space.

• Sstress

$$\frac{\sum_i \sum_j (\delta_{ij}^2 - d_{ij}^2)^2}{\sum_i \sum_j d_{ij}^4}$$
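Both criteria can be written as small functions of two distance matrices. In this sketch `delta` holds the observed distances and `d` the distances in the low-dimensional representation; the function names and the 2×2 toy matrices are illustrative assumptions.

```python
import numpy as np

def stress(delta, d):
    """Sum of squared differences between observed and fitted distances, over sum of d^2."""
    delta = np.asarray(delta, dtype=float)
    d = np.asarray(d, dtype=float)
    return float(np.sum((delta - d) ** 2) / np.sum(d ** 2))

def sstress(delta, d):
    """Same idea applied to squared distances, normalized by the sum of d^4."""
    delta = np.asarray(delta, dtype=float)
    d = np.asarray(d, dtype=float)
    return float(np.sum((delta ** 2 - d ** 2) ** 2) / np.sum(d ** 4))

# Toy example with two objects: observed distance 3, fitted distance 2.
delta = [[0.0, 3.0], [3.0, 0.0]]
d = [[0.0, 2.0], [2.0, 0.0]]

print(stress(delta, d))    # (1 + 1) / (4 + 4)
print(sstress(delta, d))   # (25 + 25) / (16 + 16)
print(stress(d, d))        # a perfect fit gives zero
```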