Processing Big Data: Exploratory Data Analysis

Exploratory Data Analysis

What it is

Any method of looking at data that does not include formal statistical modeling and inference

NYC OD: Emergency Response Incidents

Why it matters

Confirmatory statistical analyses are based on models.

$y = Ax + \mathcal{N}(0, \sigma^2)$

Structural component (signal): $Ax$

Random component (noise): $\mathcal{N}(0, \sigma^2)$
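
To make the split between the structural and the random component concrete, here is a minimal sketch that simulates data from the model above in the scalar case and then separates the two parts again. The scalar coefficient `a`, the noise level `sigma`, the sample size, the range of `x`, and the function names are all illustrative assumptions, not part of the slides.

```python
import random
import statistics


def simulate(a, sigma, n, seed=0):
    """Draw n points from y = a*x + N(0, sigma^2), with x uniform on [0, 10]."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [a * x + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def fit_through_origin(xs, ys):
    """Least-squares estimate of a in y = a*x (no intercept term)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Assumed "true" values, chosen only for the demonstration.
xs, ys = simulate(a=2.0, sigma=1.0, n=1000)

a_hat = fit_through_origin(xs, ys)                   # estimated structural part
residuals = [y - a_hat * x for x, y in zip(xs, ys)]  # estimated random part
noise_sd = statistics.pstdev(residuals)

print(f"a_hat = {a_hat:.3f}, residual sd = {noise_sd:.3f}")
```

If the model is a reasonable description of the data, the residuals should look like the assumed noise; checking that is one of the jobs of EDA.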

Why it matters

Models are not perfect representations of the real world.

But some are close enough to be useful!

https://commons.wikimedia.org/w/index.php?curid=521370

Why it matters

What is close enough to reality?

Statistical inference always depends on model assumptions about the data.

Use EDA for:

• Detecting data noise

• Checking assumptions

• Selecting data models

• Determining relationships between the explanatory variables

7

Use EDA for:

• Detecting data noise

• Checking assumptions

• Selecting data models

• Determining relationships between the explanatory variables

• Determining relationships between explanatory and outcome variables

Techniques

Look at the raw data

• What values do the variables take?

• How often do the variables take those values?

Compute summary statistics

• Look at the top and bottom of your data.

• How much data is missing?

• How noisy is the data?

Visualize

• Show comparisons

• Show structure

• Show multivariate data
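
The first two groups of questions above (which values occur, how often, and how much is missing) can be answered with a single pass over a CSV file. The following is a minimal sketch using only the Python standard library; the sample rows, the column names, and the `profile` helper are hypothetical and stand in for a real data set.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a real file; column names are made up.
SAMPLE = """borough,complaint,status
BROOKLYN,Noise,Closed
QUEENS,Noise,Open
BROOKLYN,,Closed
,Heating,Closed
BROOKLYN,Noise,
"""


def profile(lines, top=3):
    """Per column: number of missing (empty) cells and the most frequent values."""
    reader = csv.DictReader(lines)
    counts = {name: Counter() for name in reader.fieldnames}
    missing = Counter()
    for row in reader:
        for name in counts:
            value = (row[name] or "").strip()
            if value:
                counts[name][value] += 1
            else:
                missing[name] += 1
    return {
        name: {"missing": missing[name], "top": counts[name].most_common(top)}
        for name in counts
    }


report = profile(io.StringIO(SAMPLE))
for name, info in report.items():
    print(name, info)
```

Because `profile` accepts any iterable of lines, an open file handle can be passed in place of the in-memory sample, so the file is never loaded whole.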

311 Service Requests from 2010 to Present

Non-emergency City services and information

How to: Look at the raw data

Data sets are in general huge

Do not load a huge file into memory

Example: ~500M tweets/day × 140 B/tweet × 7 days ≈ 500 GB of RAM
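
The estimate above is 500 × 10^6 × 140 × 7 ≈ 4.9 × 10^11 bytes, which is far more than a typical machine can hold in memory. A common alternative is to stream the file line by line. The sketch below returns the first and last few lines and the total line count in one pass; the `peek` helper and the small temporary file are assumptions made so the example runs on its own.

```python
import os
import tempfile
from collections import deque


def peek(path, n=5):
    """Return (first n lines, last n lines, total line count) in one pass.

    The file is read line by line, so memory use is bounded by about 2*n
    lines rather than by the size of the file.
    """
    head, tail, total = [], deque(maxlen=n), 0
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            total += 1
            line = line.rstrip("\n")
            if len(head) < n:
                head.append(line)
            tail.append(line)  # deque silently drops the oldest entry
    return head, list(tail), total


# Small stand-in file so the sketch runs; a real data set would be far larger.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("id,value\n")
    tmp.writelines(f"{i},{i * i}\n" for i in range(1000))
    demo_path = tmp.name

head, tail, total = peek(demo_path, n=3)
os.remove(demo_path)
print(head, tail, total)
```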

How to: Look at the raw data

Old-school shell tools for raw data inspection

See more at https://pixorblog.wordpress.com/2016/06/24/csv-files-and-bash/
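
The slide does not list specific commands. As one illustration of the same streaming idea in Python, the sketch below draws a uniform random sample of lines in a single pass, roughly what `shuf -n` does in the shell. It uses reservoir sampling, which is a choice made here and not a technique taken from the slides; the `sample_lines` name and the generated rows are hypothetical.

```python
import io
import random


def sample_lines(lines, k, seed=None):
    """Uniform random sample of up to k lines from an iterable, in one pass.

    Reservoir sampling (Algorithm R): keep the first k lines, then let
    line i (0-based) replace a random slot with probability k / (i + 1).
    """
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir


# Stand-in for an open file handle; any iterable of lines works.
stream = io.StringIO("".join(f"row {i}\n" for i in range(10_000)))
sample = sample_lines(stream, k=5, seed=42)
print([s.strip() for s in sample])
```

A random sample complements the top and bottom of the file, which are often unrepresentative (for example when rows are sorted by date).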

Summary stats

Mean, mode, median

Variance

Range, max, min

Correlation
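
The statistics listed above can be computed with the Python standard library. This is a minimal sketch; the `summarize` and `pearson` helpers and the two small variables (`hours`, `score`) are made up for illustration.

```python
import math
import statistics


def summarize(values):
    """Location and spread summaries for one numeric variable."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up numbers, for illustration only.
hours = [1, 2, 2, 3, 4, 5, 5, 5, 6, 7]
score = [52, 55, 60, 61, 68, 70, 74, 71, 80, 85]

stats = summarize(hours)
r = pearson(hours, score)
print(stats)
print(f"correlation = {r:.3f}")
```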

Beware of summary stats

Anscombe, F. J. (1973). "Graphs in Statistical Analysis". American Statistician. 27 (1): 17–21. From Wikipedia.
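
The warning can be checked numerically. The sketch below computes the usual summaries for the four data sets of Anscombe's quartet, with values as published in the paper cited above; the helper name `line_and_r` is an assumption. The four sets agree to about two decimal places on every summary, yet their scatter plots look completely different.

```python
import math
import statistics

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def line_and_r(xs, ys):
    """Least-squares slope and intercept, plus the Pearson correlation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)


summary = {}
for name, (xs, ys) in QUARTET.items():
    slope, intercept, r = line_and_r(xs, ys)
    summary[name] = {
        "mean_x": statistics.fmean(xs),
        "mean_y": statistics.fmean(ys),
        "var_y": statistics.variance(ys),
        "slope": slope,
        "intercept": intercept,
        "r": r,
    }
    print(name, {key: round(val, 2) for key, val in summary[name].items()})
```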

Beware of summary stats

https://www.autodeskresearch.com/publications/samestats

Data Visualization

Data points across some features

Features across all data points

Histograms
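
A histogram needs nothing more than binning and counting, so a rough one can be printed in a terminal before reaching for a plotting library. This is a minimal sketch with equal-width bins; the `histogram` and `render` helpers and the sample values are made up for illustration.

```python
def histogram(values, bins=5):
    """Count values in equal-width bins spanning [min, max].

    Every bin is half-open [left, right) except the last one, which also
    includes the maximum value.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)  # clamp max into last bin
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, counts


def render(edges, counts, bar="#"):
    """One text row per bin: left edge, right edge, and a bar of length count."""
    return "\n".join(
        f"{edges[i]:6.1f} to {edges[i + 1]:6.1f}  {bar * c}"
        for i, c in enumerate(counts)
    )


# Made-up sample, for illustration only.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram(data, bins=4)
print(render(edges, counts))
```

Even this crude view shows the isolated value at 9, which a mean or variance alone would hide.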

Followup

Do you have the right question?

Refine your question

Do you have the right data?

Do you need other data? More data?

Collect data according to needs
