1 Introduction to exploratory data analysis Jean Paul Maalouf [email protected] Illustrated with XLSTAT www.xlstat.com October 26, 2017


Introduction to exploratory data analysis

Jean Paul Maalouf

[email protected]

Illustrated with XLSTAT

www.xlstat.com

October 26, 2017


PLAN

• XLSTAT: who are we?

• Statistics: categories

• Reminder: Variables, individuals, Descriptive Statistics

• Toward exploratory data analysis: scatter plot colored by group

• Exploratory statistics & Data Mining

• Principal Component Analysis (PCA): concept and practice

• Agglomerative Hierarchical Clustering (AHC): concept and practice

All the data in this webinar were made up unless otherwise specified


XLSTAT: Who are we?

XLSTAT is a user-friendly statistical add-on software for Microsoft Excel®


XLSTAT: a growing software and team

Milestones on the timeline (years shown: 1993, 1996, 2000, 2006, 2009, 2015, 2016, 2017):

- Thierry Fahmy develops a user-friendly solution for data analysis: XLSTAT is born
- XLSTAT makes its first sale on the Internet
- New version, VBA interface, C++ computations, 7 languages
- New products, new website, growing and dynamic team
- The company Addinsoft is created
- New offers adapted to business needs
- XLSTAT 365 (cloud version for Excel 365) and XLSTAT-Free
- R integration
- ?


XLSTAT in a few numbers

- 200+ statistical features: general or field-oriented solutions
- 100k users across the world: companies, education, research
- 22 employees, always receptive to the needs of users
- 220k visits/month on the website
- Easy tutorials available in 5 languages
- 7 languages
- 10k downloads/month


Statistics: 4 categories


Statistics: 4 categories

- Description: I want to summarize data using simple statistics or charts (mean, standard deviation, boxplots...)
- Exploration: I want to easily extract information from a large data set without necessarily having a precise question to answer. (PCA, AHC...)
- Tests: I want to accept / reject a very precise hypothesis assuming error risks. (t-tests, ANOVA, correlation tests, chi-square...)
- Modeling: I want to understand the way a phenomenon evolves according to a set of parameters. (regression, ANOVA, ANCOVA...)


Reminder: variables, individuals, descriptive statistics


Variables, individuals...

Variable

An element that can take different values

Qualitative variable

A variable that cannot be quantified. Examples: socioprofessional category, geographical origin, type of licence, blood type... The possible values it can take are called categories or modalities

Quantitative variable

A variable that can be quantified. Examples: invoice amount, number of likes on Facebook, sugar concentration, height...

Individual

Elementary statistical unit. Can be described with variables. Examples: customers, surveyed people, patients, laboratory mice...


Data set: online shoe selling platform

[Figure: data table annotated with the labels "Variables" and "Individuals"]


Descriptive statistics: commonly used tools according to the situation

- 1 qual. variable: frequency table (flat sorting), mode, pie charts
- 1 quant. variable: center (mean / median); dispersion (variance / std. deviation / quartiles); box plot
- 1 qual. variable x 1 qual. variable: cross tabulation (contingency table)
- 1 quant. variable x 1 quant. variable: scatter plot
- 1 quant. variable x 1 qual. variable: quantitative descriptive statistics per category of the qualitative variable; multiple box plot chart
- 1 quant. variable x 1 quant. variable x 1 qual. variable: scatter plot with points colored according to the categories of the qualitative variable
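
Outside XLSTAT, the first rows of this list can be sketched with Python's standard library. This is only an illustration under stated assumptions: the customers, the "payment method" variable and all values are made up for the example and do not come from the webinar data set.

```python
import statistics
from collections import Counter

# Made-up customers of a hypothetical online shop:
# (origin, payment method, invoice amount)
customers = [
    ("Earth", "card", 120.0),
    ("Earth", "transfer", 60.0),
    ("Mars", "transfer", 110.0),
    ("Mars", "card", 80.0),
    ("Pluto", "transfer", 210.0),
    ("Pluto", "card", 190.0),
    ("Pluto", "card", 230.0),
]
origins = [origin for origin, _, _ in customers]
invoices = [amount for _, _, amount in customers]

# 1 qualitative variable: frequency table and mode
frequency_table = Counter(origins)
mode_origin = statistics.mode(origins)

# 1 quantitative variable: center and dispersion
mean_invoice = statistics.mean(invoices)
median_invoice = statistics.median(invoices)
std_invoice = statistics.stdev(invoices)         # sample standard deviation
quartiles = statistics.quantiles(invoices, n=4)  # three cut points: Q1, Q2, Q3

# 1 qualitative x 1 qualitative variable: cross tabulation (contingency table)
contingency = Counter((origin, payment) for origin, payment, _ in customers)

print(frequency_table, mode_origin)
print(mean_invoice, median_invoice, std_invoice, quartiles)
print(contingency)
```

The charts in the list (pie chart, box plot, scatter plot) need a plotting library and are left out of this sketch.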


Toward exploratory data analysis: scatter plot colored by group


Toward exploratory data analysis: scatter plot colored by group

- Invoice amount decreases with time spent on the website.
- Plutonians spend more money on the website compared to others.
- Martians and humans form a relatively homogeneous group
- ...
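
What the eye reads on such a chart can also be checked numerically. The sketch below uses made-up records chosen to mimic the slide. It computes the Pearson correlation between time spent and invoice amount, then the centroid (mean point) of each group. The data, the function name `pearson` and the variable names are all assumptions for illustration.

```python
from math import sqrt
from collections import defaultdict

# Made-up records: (group, time spent on site in minutes, invoice amount)
records = [
    ("Human", 12.0, 120.0),
    ("Human", 30.0, 60.0),
    ("Martian", 15.0, 110.0),
    ("Martian", 25.0, 80.0),
    ("Plutonian", 8.0, 210.0),
    ("Plutonian", 10.0, 190.0),
    ("Plutonian", 5.0, 230.0),
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

times = [t for _, t, _ in records]
invoices = [a for _, _, a in records]
# Negative when invoice amount drops as time on site grows
r_time_invoice = pearson(times, invoices)

# Centroid of each group: (mean time on site, mean invoice amount)
by_group = defaultdict(list)
for group, time_spent, amount in records:
    by_group[group].append((time_spent, amount))
centroids = {
    group: (sum(t for t, _ in pts) / len(pts), sum(a for _, a in pts) / len(pts))
    for group, pts in by_group.items()
}

print(round(r_time_invoice, 3))
print(centroids)
```

Centroids that sit close together (here Human and Martian) correspond to groups that look homogeneous on the chart.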


Imagine applying the same kind of reasoning to a larger number of variables... Time for exploratory statistics (or Exploratory Data Analysis)


Example: Principal Component Analysis (PCA)

We want to analyze multiple variables (dimensions) at a time the same way we did with the 2D scatter plot.


Exploratory data analysis

I want to easily extract information from a large data set without necessarily having a precise question to answer.


Exploratory data analysis: a few words

Exploratory statistics

Look for information in a multivariate data set, without having very precise expectations. Exploratory tools are part of Data Mining.

First thing you can do: concentrate the information of big data sets in a few dimensions
Examples: Principal Component Analysis, Correspondence Analysis…

Second thing you can do: classification ( = clustering = segmentation)
Examples: Agglomerative Hierarchical Clustering, k-means…


Principal Component Analysis (PCA)

I’d like to summarize a big data set in a few simple charts

We’ll be able to investigate:

- Relationships among variables
- Proximity among individuals
- How individuals relate to variables


PCA: concept

[Figure: the initial data set next to the artificial data set synthesized by PCA, with a scale showing the amount of information from + to -]

The information is redistributed so that most of it is concentrated on a few dimensions.

PCA jargon:

- dimension = axis = factor
- information = variability = inertia
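
To make "concentrating the information on a few dimensions" concrete, here is a minimal pure-Python sketch of a normalized PCA: standardize the variables, build the correlation matrix, then extract its eigenvalues and eigenvectors. The data are made up. The Jacobi rotation solver is chosen only to keep the example dependency-free, so it should be read as an illustration of the concept rather than a description of how XLSTAT computes a PCA.

```python
from math import atan, cos, sin, sqrt, pi

# Made-up data: rows are individuals; columns are
# height, weight, shoe size, time on site
data = [
    [150, 50, 38, 30], [160, 58, 43, 27], [165, 62, 37, 25], [170, 70, 42, 20],
    [175, 72, 39, 18], [180, 80, 44, 12], [185, 85, 40, 10], [190, 92, 41, 6],
]

def standardize(rows):
    """Center each column and divide it by its (population) standard deviation."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    stds = [sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(p)]
    return [[(r[j] - means[j]) / stds[j] for j in range(p)] for r in rows]

def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(p)] for i in range(p)]

def jacobi_eigen(matrix, sweeps=50, tol=1e-20):
    """Eigenpairs of a symmetric matrix by cyclic Jacobi rotations, largest first."""
    n = len(matrix)
    a = [row[:] for row in matrix]
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                if a[p][p] == a[q][q]:
                    theta = pi / 4
                else:
                    theta = 0.5 * atan(2 * a[p][q] / (a[q][q] - a[p][p]))
                c, s = cos(theta), sin(theta)
                for m in (a, v):  # right-multiply by the rotation (columns p, q)
                    for k in range(n):
                        mkp, mkq = m[k][p], m[k][q]
                        m[k][p], m[k][q] = c * mkp - s * mkq, s * mkp + c * mkq
                for k in range(n):  # left-multiply a by the transpose (rows p, q)
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    pairs = [(a[j][j], [v[i][j] for i in range(n)]) for j in range(n)]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)

z = standardize(data)
corr = correlation_matrix(z)
eigenpairs = jacobi_eigen(corr)
eigenvalues = [value for value, _ in eigenpairs]
# Share of the total information (inertia) carried by each axis
explained = [value / sum(eigenvalues) for value in eigenvalues]
# Coordinates of each individual on the first two axes (F1, F2)
scores = [[sum(zi * vi for zi, vi in zip(row, vec)) for _, vec in eigenpairs[:2]]
          for row in z]

print([round(share, 3) for share in explained])
```

Each eigenvalue is the amount of variability carried by one axis. Because the variables are standardized, the eigenvalues sum to the number of variables, and the first few entries of `explained` show how much of the data set a two-axis chart retains.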


Setting up a PCA in XLSTAT

Chart 1: correlation circle

PCA tutorial link


What PCA looks like in reality

Chart 1: correlation circle

- Acute angle: positively-linked variables (e.g. weight & height)
- Right angle: uncorrelated variables (e.g. height & shoe size)
- Obtuse angle: negatively-linked variables (e.g. weight & time spent on site)

Vector length reflects representativeness in the selected plane (F1/F2 here)
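
The geometry behind these reading rules can be written down in a few lines. The relations below assume a normalized PCA (computed on the correlation matrix) with $p$ variables. The symbols $\lambda_k$ (eigenvalue of axis $F_k$), $u_{jk}$ (coefficient of variable $x_j$ in the unit eigenvector of axis $k$), $\ell_j$ and $\theta_{ij}$ are notation introduced here, not taken from the slides.

```latex
% Coordinate of variable x_j on axis F_k: its correlation with that axis
r(x_j, F_k) = \sqrt{\lambda_k}\, u_{jk}

% Over all p axes, the squared coordinates of a variable sum to 1
\sum_{k=1}^{p} r(x_j, F_k)^2 = 1

% so the vector drawn in the F1/F2 plane has length at most 1
\ell_j = \sqrt{r(x_j, F_1)^2 + r(x_j, F_2)^2} \le 1

% For two variables whose vectors are close to the circle (length near 1),
% the cosine of the angle between them approximates their correlation
\cos\theta_{ij} \approx r(x_i, x_j)
```

This is why angles are best interpreted only for long vectors: a short vector means the variable is mostly carried by axes other than F1 and F2.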


Interpreting the axes

Chart 1: correlation circle

- F1 reflects:
  - High weight & height (right)
  - Long time spent on site (left)
- F2 is strongly related to shoe size:
  - Big shoes (top)
  - Small shoes (bottom)


What PCA looks like in reality

Chart 1: correlation circle; chart 2: observations

[Figure: observations chart. Annotations at one end of the axis: Weight +, Height +, time on site -; at the other end: Weight -, Height -, time on site +]


PCA: explorations ...

- Weight increases with height
- Shoe size is unrelated to weight / height
- Time spent on site decreases with weight & height
- Derrick has big feet. Shaun has small feet.
- Looks like there are two clusters in the data
- And so on...

PCA tutorial link

PCA works only with quantitative data. Click here to check out other exploratory methods.


It was easy to detect two clusters of customers. Nice for marketing!

[Figure: observations chart. Annotations at one end of the axis: Weight +, Height +, time on site -; at the other end: Weight -, Height -, time on site +]

According to our PCA, customers can be split into two clusters characterized by height, weight and time spent on site. This may help us define tailored marketing campaigns.

But what if groups were not that easy to define visually?


Agglomerative Hierarchical Clustering (AHC)

I want to cluster ( = classify = segment) individuals in homogeneous groups ( = segments = clusters = classes)


Agglomerative Hierarchical Clustering (AHC)

How to cluster consumers into different groups?

Illustration with 2 variables

EXAMPLE: sensory analysis, chocolate consumers survey


AHC – how it works on 2 variables

[Animation: individuals plotted on 2 variables (one axis is Age) are merged step by step, from 19 groups down to 1 group]

Choosing a “cutting” level: segments are now defined

This can obviously be generalized to more than 2 variables
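
The merging process can be sketched in a few lines of Python. This is a naive illustration under stated assumptions: the consumers and their two variables are made up, and Euclidean distance with average linkage is used only because it is easy to read. The slides do not say which distance or aggregation method XLSTAT applies, so this should not be taken as its method.

```python
from math import sqrt

# Made-up consumers described by 2 variables: (age, price sensitivity score)
consumers = {
    "A": (20.0, 8.0), "B": (22.0, 9.0), "C": (24.0, 7.0),
    "D": (40.0, 5.0), "E": (42.0, 4.0), "F": (44.0, 6.0),
    "G": (65.0, 2.0), "H": (68.0, 1.0), "I": (70.0, 3.0),
}

def euclidean(p, q):
    """Straight-line distance between two points."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(group_a, group_b, points):
    """Mean pairwise distance between the members of two groups."""
    total = sum(euclidean(points[a], points[b]) for a in group_a for b in group_b)
    return total / (len(group_a) * len(group_b))

def agglomerate(points, n_groups):
    """Merge the two closest groups until only n_groups remain."""
    groups = [[name] for name in points]  # start: one group per individual
    merges = []                           # (dissimilarity, merged group) per step
    while len(groups) > n_groups:
        candidates = [
            (average_linkage(groups[i], groups[j], points), i, j)
            for i in range(len(groups))
            for j in range(i + 1, len(groups))
        ]
        level, i, j = min(candidates)
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
        merges.append((level, merged))
    return groups, merges

# Stopping at 3 groups plays the role of the "cutting" level
segments, history = agglomerate(consumers, n_groups=3)
print(segments)
for level, merged in history:
    print(round(level, 2), merged)
```

The levels stored in `history` are the heights at which branches would join in a dendrogram. Because the function takes points of any dimension, the same code runs unchanged on more than 2 variables, although variables on very different scales would normally be standardized first.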


Agglomerative Hierarchical Clustering (AHC)

Setting things up in XLSTAT

AHC tutorial link


Agglomerative Hierarchical Clustering (AHC)

What it looks like in XLSTAT:

The higher the “vertical distance” between two individuals (or groups), the more different the individuals.

Here we could split the individuals into 3 or 4 homogeneous groups

[Figure: dendrogram of the individuals. Vertical axis: dissimilarity, from 0 to 250; one leaf per individual, labeled by first name]


Agglomerative Hierarchical Clustering (AHC)

3-cluster split:

Okay. And now what?

Let’s describe the 3 groups to see how we could take action on a marketing scale

[Figure: dendrogram with the 3-cluster split. Vertical axis: dissimilarity, from 0 to 250; one leaf per individual, labeled by first name]


How can I describe segments?

Things become quite straightforward when you extract class membership in the AHC results


Describing the segments

Things you can do

- Split individuals into classes and run descriptive statistics on each segment
- Use class membership as a supplementary variable in a PCA
- Use parallel coordinates plots


Describing clusters: descriptive statistics

Consumers from clusters 1 & 3 are more loyal to brands than those from cluster 2

Consumers from cluster 2 are younger


Describing clusters: parallel coordinates plot

Tutorial link


Describing clusters: parallel coordinates plot

Cluster 3: older consumers, loyal to brands, who prefer bitter chocolate and are not online buyers...

Cluster 2: younger consumers, prefer frozen chocolate, are sensitive to prices and care less about brands

Consequences:

- Promote branded bitter chocolate to older consumers
- Promote cheaper chocolates to younger consumers
- …

[Figure: parallel coordinates plot for clusters 1, 2 and 3. Variables: brand loyalty, price sensitivity, online buyer, bitter, frozen, crunchy, age. Vertical axis from 0 to 1]
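
The 0 to 1 vertical axis suggests that each variable is rescaled before plotting. The sketch below assumes min-max scaling and one line per cluster showing its mean profile. Both are assumptions about how such a chart can be built, not a statement of what XLSTAT does, and the data are made up.

```python
from collections import defaultdict

variables = ["brand loyalty", "price sensitivity", "age"]
# Made-up rows: (cluster label from the AHC, one value per variable)
rows = [
    (1, [7.0, 5.0, 45.0]),
    (1, [8.0, 4.0, 50.0]),
    (2, [3.0, 9.0, 22.0]),
    (2, [2.0, 8.0, 25.0]),
    (3, [9.0, 2.0, 63.0]),
    (3, [8.0, 3.0, 70.0]),
]

def min_max_scale(values):
    """Rescale values to [0, 1]; assumes they are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Scale each variable (column) separately so all axes share the 0-1 range
columns = list(zip(*(values for _, values in rows)))
scaled_columns = [min_max_scale(column) for column in columns]
scaled_rows = list(zip(*scaled_columns))

# Mean scaled profile of each cluster: one line per cluster on the plot
grouped = defaultdict(list)
for (cluster, _), scaled in zip(rows, scaled_rows):
    grouped[cluster].append(scaled)
profiles = {
    cluster: [sum(column) / len(column) for column in zip(*members)]
    for cluster, members in grouped.items()
}

for cluster, profile in sorted(profiles.items()):
    print(cluster, dict(zip(variables, (round(x, 2) for x in profile))))
```

Reading the printed profiles side by side gives the same kind of description as the chart: a cluster that is low on age and high on price sensitivity, another that is high on age and brand loyalty, and so on.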


In summary...

- Description: I want to summarize small data sets (1-3 variables) using simple statistics or charts. Leads to hypotheses.
- Exploration: I want to easily extract information from a large data set without necessarily having a precise question to answer. Leads to hypotheses.
- Tests: I want to validate / reject a very precise hypothesis assuming error risks. (t tests, ANOVA, correlation tests, chi-square...)
- Modeling: I want to understand the way a phenomenon evolves according to a set of parameters. (regression, ANOVA, ANCOVA...)


Exploratory statistics: Take Home Message

Exploratory statistics help you gain insight into large data sets

- They give a synthetic view of large data sets. Examples: Principal Component Analysis, Correspondence Analysis, MDS…
- They allow clustering data sets. Examples: Agglomerative Hierarchical Clustering, k-means

Click here to choose an appropriate exploratory data analysis tool according to your situation


Data exploration suggested many hypotheses. Are they valid?

Statistical tests

See you on November 16!

Subscribe


Thanks for attending!

All the tools we saw are available in all XLSTAT solutions

Download 30-day Free Trial

Discover our products