Exploiting Your Data. Quantified Self, 22 March 2012, Akram Najjar. This talk is an “eye opener”. We will not discuss techniques or “how” data is analyzed! We will only talk about “what” such methods can give us.

Akram najjar exploiting your data (in color)


DESCRIPTION

March 22nd, Beirut Quantified Self


Page 1: Akram najjar   exploiting your data (in color)

Exploiting Your Data

Quantified Self

22 March 2012

Akram Najjar

This talk is an “eye opener”

We will Not discuss Techniques or “How” Data is Analyzed!

We will Only talk about “What” such methods can give us

Page 2: Akram najjar   exploiting your data (in color)

3 / 25

What Methods can you Apply to Your Data?

A. The Bell Shaped Curve (Normal Distribution)

B. Correlation of two variables

C. Forecasting using Simple Linear Regression (Best Line of Fit)

D. Statistical Process Control

4 / 25

Other Tools that work directly on Data . . . .

Goodness of Fit testing

Independence Testing

Moving Averages and Exponential Smoothing

Non-Linear Regression (polynomial, exponential, logarithmic)

Weighted Index Scoring

Excel: The Pivot Table

Excel: Conditional Formatting

Page 3: Akram najjar   exploiting your data (in color)

5 / 25

A. The Bell Shaped Curve (The Gaussian or Normal Distribution)

Useful when you have a lot of data

Prepare a Bar Chart or a Frequency Table

Most likely, the values will plot as a Bell Shaped Curve (Normal/Gauss Curve)

Example: Measurements of most natural variables

Example: Measurements of most manufactured items

Prepare a frequency table of your data

How many times did you get a specific value?

Out of 200 measurements, how many times was your Systolic Blood Pressure = 110, 115, 120, 125, 130, 135, 140 ...?

[Chart: frequency of each reading; vertical axis "How many times?"]

Here are 24 Systolic Blood Pressure Measurements – They Look like a Bell Curve

Probability of Pressure > 125 = (4 + 2) / 24 = 1/4 = 25%

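The counting behind that 25% can be sketched in a few lines of Python. The 24 readings below are invented for illustration; they were chosen only so that six of them (4 + 2) lie above 125, matching the arithmetic on the slide.

```python
from collections import Counter

# Hypothetical readings: 24 made-up systolic values, chosen only so that
# six of them (4 + 2) lie above 125, as in the slide's arithmetic.
readings = (
    [105] * 1 + [110] * 3 + [115] * 5 + [120] * 6
    + [125] * 3 + [130] * 4 + [135] * 2
)

# Frequency table: how many times did each value occur?
frequency = Counter(readings)
for value in sorted(frequency):
    print(f"{value:>4}  {'#' * frequency[value]}  ({frequency[value]})")

# Empirical probability = count of readings above the threshold / total count.
threshold = 125
above = sum(count for value, count in frequency.items() if value > threshold)
probability = above / len(readings)
print(f"P(pressure > {threshold}) = {above}/{len(readings)} = {probability:.0%}")
```

The printed rows of `#` marks are a rough text version of the bar chart: the bars to the right of 125 are the ones being counted.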

Page 4: Akram najjar   exploiting your data (in color)

If we had 201 measurements . . . .

Total Count in Bars = Area of Bars = Probability of Pressure > 122 = 15.83%

The Bell Shaped Curve is completely defined by:

a) Average (115) of the data

b) Standard deviation (7) of the data. It indicates how spread out our data is around the average.

(Approximately 68% of observations fall between 115 - 7 = 108 and 115 + 7 = 122)
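As a rough check of these figures, Python's standard library can evaluate the idealized bell curve directly. This is a sketch that assumes the data follows a Normal Distribution exactly, with the average of 115 and standard deviation of 7 quoted above; the slide's 15.83% presumably comes from counting the actual bars, so a small difference from the idealized value is to be expected.

```python
from statistics import NormalDist

# Values taken from the slide: average 115, standard deviation 7.
pressure = NormalDist(mu=115, sigma=7)

# Area under the curve to the right of 122 (one standard deviation above average).
p_above_122 = 1 - pressure.cdf(122)

# Share of observations within one standard deviation of the average.
p_within_one_sd = pressure.cdf(115 + 7) - pressure.cdf(115 - 7)

print(f"P(X > 122)       = {p_above_122:.2%}")
print(f"P(108 < X < 122) = {p_within_one_sd:.2%}")
```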

Page 5: Akram najjar   exploiting your data (in color)

9 / 25

What do we get if we use the Bell Shaped Curve (Normal Distribution)?

Benefit 1: measuring the spread of our data

Benefit 2: we can now compare specific scores in two different populations (next slide)

Benefit 3: if we know the measure, we can compute the probability of it happening

Benefit 4: if we know the probability, we can work out the cut off measure that will give it

[Figure: two distributions; values shown: 72, 88 and 78]

If I have the same score 78 in Courses A and B, can I say I am doing the same in both?
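One common way to answer this question is to convert each score into a "z-score": how many standard deviations it sits above or below its own course average. The sketch below assumes that 72 and 88 are the two course averages and that both courses have a standard deviation of 6; neither assumption is stated in the talk, so the numbers are illustrative only.

```python
def z_score(score: float, average: float, std_dev: float) -> float:
    """Number of standard deviations a score lies above (+) or below (-) the average."""
    return (score - average) / std_dev

# Assumed for illustration only: course averages of 72 and 88, and a
# standard deviation of 6 in both courses. None of these are stated in the talk.
z_a = z_score(78, average=72, std_dev=6)
z_b = z_score(78, average=88, std_dev=6)

print(f"Course A: z = {z_a:+.2f}")
print(f"Course B: z = {z_b:+.2f}")
```

Under these assumptions the same raw score of 78 is above average in one course and below average in the other, so the answer to the question would be "no".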

Page 6: Akram najjar   exploiting your data (in color)

11 / 25

Benefits 3 and 4

Given a specific measurement or range, what is the probability of its occurrence?

Probability I will get a fever of more than 38 degrees?

Probability flights will be more than 30 minutes late?

Probability my systolic is > 122?

Given the probability, what is the cutoff measurement?

I want to remain at a sugar level representing the top 15% allowed; what is the level related to that?

If Human Resources want the top 15% of results, what is the passing grade?
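Benefit 4 runs the bell curve "backwards": from a probability to a cutoff. A minimal sketch, assuming the systolic figures quoted earlier in the talk (average 115, standard deviation 7) and an exactly Normal distribution:

```python
from statistics import NormalDist

# Assumption: the systolic figures quoted earlier (average 115, standard deviation 7).
pressure = NormalDist(mu=115, sigma=7)

# The top 15% starts where 85% of the area lies to the left.
cutoff = pressure.inv_cdf(0.85)
print(f"Top 15% starts at about {cutoff:.1f}")

# Round trip: the probability of exceeding the cutoff should come back as 15%.
print(f"P(X > cutoff) = {1 - pressure.cdf(cutoff):.2%}")
```

The same two calls would answer the Human Resources question once the average and standard deviation of the grades are known.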

12 / 25

B. Correlation

If we have two sets of data, how are they related?

Example: Blood Pressure vs Intake of Salt

Example: Advertising Expenditure vs Sales Revenue

Example: Hours walked per day vs Weight in Kilograms

What is the direction of the relationship? Direct or inverse?

What is the strength of the relationship? Correlation

We use the Correlation Function (Demonstrate in Excel)
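For readers without Excel at hand, the same calculation can be sketched in Python. The function below computes the Pearson correlation coefficient, which is what Excel's CORREL returns; the walking and weight figures are hypothetical and serve only to show a strong inverse relationship.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient, the quantity Excel's CORREL returns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: hours walked per day and weight in kilograms.
hours_walked = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
weight_kg = [86, 84, 83, 80, 79, 76]

r = correlation(hours_walked, weight_kg)
direction = "direct" if r > 0 else "inverse"
print(f"r = {r:.2f} ({direction} relationship)")
```

The sign of `r` gives the direction (direct or inverse) and its distance from zero gives the strength, with values near +1 or -1 indicating a strong relationship.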

Page 7: Akram najjar   exploiting your data (in color)

13 / 25

C. Forecasting using Simple Linear Regression (Best Line of Fit)

If we have an independent variable (X): Sugar Intake

And a dependent variable (Y): Weight

What is the relationship that allows us to forecast Weight for different Sugar Intakes?

We need two columns: X and Y

Simple Linear Regression allows us to find the Best Line to fit our data
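A minimal sketch of how such a line can be found and used for forecasting, using the ordinary least-squares formulas. The sugar intake and weight values are invented for the example, and the names `fit_line`, `sugar_intake` and `weight` are illustrative only.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations (X = sugar intake, Y = weight), invented for the sketch.
sugar_intake = [20, 30, 40, 50, 60, 70]
weight = [68, 70, 73, 74, 77, 79]

slope, intercept = fit_line(sugar_intake, weight)
forecast = slope * 80 + intercept
print(f"Weight = {slope:.3f} * Sugar + {intercept:.2f}")
print(f"Forecast weight at intake 80: {forecast:.1f}")
```

Once the slope and intercept are known, a forecast is just a matter of plugging a new X into the equation of the line.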

Regression finds the Best Line that Fits our Observations

[Chart: the observations; horizontal axis from 1 to 8, Y axis from 1 to 5, origin at 0,0]

Page 8: Akram najjar   exploiting your data (in color)

Which Straight Line Best Fits our Observations?

[Chart: the observations; horizontal axis from 1 to 8, Y axis from 1 to 5, origin at 0,0]

16 / 25

Multiple Regression: allows us to find the Equation Y = aX1 + bX2 + cX3 + d

[Table columns: Y, X1, X2, X3]
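The coefficients a, b, c and d can be found with the same least-squares idea as the single-variable case. The sketch below solves the so-called normal equations in plain Python; this is one possible approach among several, and the helper names `solve` and `fit_multiple` as well as the data are hypothetical. The Y column is generated from a known equation so that the result can be checked by eye.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_multiple(rows, ys):
    """Least-squares coefficients [a, b, c, d] for Y = a*X1 + b*X2 + c*X3 + d."""
    design = [list(row) + [1.0] for row in rows]  # the trailing 1 carries d
    k = len(design[0])
    # Normal equations: (A^T A) coefficients = A^T y
    ata = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve(ata, aty)

# Hypothetical table, generated from Y = 2*X1 - 1*X2 + 0.5*X3 + 10 so that
# the fit can be checked by eye.
rows = [(1, 2, 4), (2, 1, 6), (3, 5, 2), (4, 3, 8), (5, 4, 1), (6, 2, 3)]
ys = [2 * x1 - x2 + 0.5 * x3 + 10 for x1, x2, x3 in rows]

a, b, c, d = fit_multiple(rows, ys)
print(f"Y = {a:.2f}*X1 + {b:.2f}*X2 + {c:.2f}*X3 + {d:.2f}")
```

With real, noisy data the recovered coefficients would not match any exact equation; they would simply be the values that make the squared forecasting errors as small as possible.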

Page 9: Akram najjar   exploiting your data (in color)

17 / 25

D. Statistical Process Control (SPC)

The Purpose of SPC is to Monitor a Process

SPC allows us to Check if a variable is behaving properly:

Over time

Over different locations/departments

Over different events

Over different samples

Control Charts were first used in Bell Labs (1924)

Although mostly used in industry, SPC can be used in any sector

The General Form of a Control Chart: 4 Components

1) UCL: Upper Control Limit

2) AL: Average Line

3) LCL: Lower Control Limit

4) Process Data

[Chart: vertical axis "Our Variable" (45 to 52); horizontal axis "The IDs of the Samples, OR The Time Series" (1 to 16)]
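The four components can be put together in a short sketch. The talk does not say how the limits are set; a widely used convention, assumed here, places them three standard deviations above and below the average of a baseline period. The data and the function name `control_limits` are hypothetical.

```python
from statistics import mean, stdev

def control_limits(baseline, sigmas=3.0):
    """Return (LCL, AL, UCL), placing the limits `sigmas` standard deviations from the average."""
    average = mean(baseline)
    spread = stdev(baseline)
    return average - sigmas * spread, average, average + sigmas * spread

# Hypothetical baseline of 16 samples taken while the process was behaving.
baseline = [48.2, 48.9, 49.1, 48.5, 48.7, 49.0, 48.4, 48.8,
            48.6, 49.2, 48.3, 48.9, 48.7, 48.5, 49.0, 48.6]
lcl, al, ucl = control_limits(baseline)
print(f"LCL = {lcl:.2f}, AL = {al:.2f}, UCL = {ucl:.2f}")

# New process data is then compared against the limits.
new_points = [48.7, 49.3, 51.9, 48.4]
out_of_control = [x for x in new_points if not lcl <= x <= ucl]
print("Out of control:", out_of_control)
```

Computing the limits from a separate baseline, rather than from the points being judged, keeps a single extreme reading from widening the limits that are supposed to catch it.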

Page 10: Akram najjar   exploiting your data (in color)

This Process is “In control”

[Control charts: vertical axis from 0 to 50, with Upper Limit and Lower Limit lines]

This Process is Regularly “Out of Control”

Look for an explanation INSIDE the system

Page 11: Akram najjar   exploiting your data (in color)

[Control chart: vertical axis from 0 to 45]

Look for an explanation OUTSIDE the system

This Process is Irregularly “Out of Control”

[Control chart: vertical axis from 0 to 45]

Look for an explanation OUTSIDE the system

This Process is Irregularly “Out of Control”.

Trends in either direction of 5 or more points

Page 12: Akram najjar   exploiting your data (in color)

[Control chart: vertical axis from 0 to 30]

Look for an explanation OUTSIDE the system

The 7 Point Rule: there is a problem if 7 or more points in a row are all above the average or all below it
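Both pattern checks mentioned in the talk, the 7 Point Rule and trends of 5 or more points, can be sketched as simple scans over the data. The function names and the sample data are hypothetical, and "trend" is interpreted here as a stretch of points that keep rising or keep falling.

```python
def longest_run_on_one_side(values, average):
    """Length of the longest streak of consecutive points all above, or all below, the average."""
    longest = current = 0
    previous_side = 0
    for value in values:
        side = (value > average) - (value < average)  # +1 above, -1 below, 0 on the line
        if side != 0 and side == previous_side:
            current += 1
        else:
            current = 1 if side != 0 else 0
        previous_side = side
        longest = max(longest, current)
    return longest

def longest_trend(values):
    """Number of points in the longest steadily rising or steadily falling stretch."""
    if not values:
        return 0
    longest = current = 1
    previous_direction = 0
    for before, after in zip(values, values[1:]):
        direction = (after > before) - (after < before)
        if direction != 0 and direction == previous_direction:
            current += 1
        else:
            current = 2 if direction != 0 else 1
        previous_direction = direction
        longest = max(longest, current)
    return longest

# Hypothetical process data with an average line at 25.
data = [24, 26, 23, 26, 27, 26, 28, 27, 29, 26, 24, 25]
average_line = 25

print("7 point rule broken:", longest_run_on_one_side(data, average_line) >= 7)
print("Trend of 5 or more:", longest_trend(data) >= 5)
```

Note that neither check looks at the control limits: a process can stay between UCL and LCL and still show one of these patterns.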

Types of Control Charts

Type | The Measurement | The Series | Example
X-Chart | Single Variable | Several Measurements | Blood Pressure, Sugar Levels, Cholesterol, Time I wake up
X-bar Chart | Average of a Sample | Several Samples | We plot the average age in each of 20 families
R-Chart | A range of values: Hi measurement - Lo measurement | Several Samples | Hi/Lo fever for 20 days: we plot the range. OR we measure the Hi/Lo level of contaminants in 20 rivers.
p-Chart | Proportion in a sample | Several Samples | Football teams: how many in each are foreign?
c-Chart | Count in a sample | Several Samples | How many errors in each report?

Page 13: Akram najjar   exploiting your data (in color)

Thank you for your kind attention