Introduction to Data Analytics Techniques and their Implementation in R
Dr Bruno Voisin, Irish Centre for High End Computing (ICHEC)
November 14, 2013


Bruno Voisin from the Irish Centre for High End Computing presented this Introduction to Data Analytics Techniques and their Implementation in R at the Big Data Workshop hosted by the Social Sciences Computing Hub at the Whitaker Institute on 14 November 2013.


Outline

Preparing vs Processing

Preparing the Data
◮ Outliers
◮ Missing values
◮ R data types: numerical vs factors
◮ Reshaping data

Forecasting, Predicting, Classifying...
◮ Linear Regression
◮ K nearest neighbours
◮ Decision Trees
◮ Time Series

Going Further
◮ Ensembles of models
◮ Rattle


Preparing vs Processing

Before considering what mathematical models could fit your data, ask yourself: "is my data ready for this?"

Pro-tip: the answer is no. Sorry. Chances are...

It's "noisy".

It's wrong.

It's incomplete.

It's not in shape.

Spending 90% of your time preparing data and 10% fitting models isn't necessarily a bad ratio!


Data preparation

Outliers

Missing values

R data types: numerical vs factors

Reshaping data


Outliers

Outliers are records with unusual values for an attribute or combination of attributes. As a rule, we need to:

◮ detect them
◮ understand them (typo vs genuine but unusual value)
◮ decide what to do with them (remove them or not, correct them)


Detecting outliers: mean vs median

Both the mean and the median provide an expected 'typical' value useful to detect outliers.

The mean has some nice, useful properties (e.g. the standard deviation).

The median is more tolerant of outliers and asymmetrical data.

Rule of thumb:
◮ nicely symmetrical data with mean ≈ median: safe to use the mean.
◮ noisy, asymmetrical data where mean ≠ median: use the median.
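A quick sketch of the difference (the numbers below are made up for illustration): a single bad value drags the mean far more than the median.
> x <- c(4.1, 4.3, 3.9, 4.0, 4.2)
> mean(x); median(x)   # both 4.1
> x <- c(x, 40)        # one wildly wrong value sneaks in
> mean(x); median(x)   # the mean jumps to ~10.1, the median only moves to 4.15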


Detecting outliers: 2 standard deviations

> x <- iris$Sepal.Width

> sdx <- sd(x)

> m <- mean(x)

> iris[(m-2*sdx)>x | x>(m+2*sdx),]

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

15 5.8 4.0 1.2 0.2 setosa

16 5.7 4.4 1.5 0.4 setosa

33 5.2 4.1 1.5 0.1 setosa

34 5.5 4.2 1.4 0.2 setosa

61 5.0 2.0 3.5 1.0 versicolor

Introduction to analytics techniques 7

Page 8: 2013.11.14 Big Data Workshop Bruno Voisin

Detecting outliers: the boxplot

Graphical representation of the median, the quartiles, and the last observations not considered as outliers.
> data(iris)

> boxplot(iris[,c(1,2,3,4)], col=rainbow(4), notch=TRUE)


Detecting outliers: the boxplot

Use identify to turn outliers into clickable dots and have R return their indices:
> boxplot(iris[,c(1,2,3,4)], col=rainbow(4), notch=TRUE)

> identify(array(2,length(iris[[2]])),iris$Sepal.Width)

[1] 16 33 34 61

> outliers <- identify(array(2,length(iris[[1]])),

iris$Sepal.Width)

> iris[outliers,]

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

16 5.7 4.4 1.5 0.4 setosa

33 5.2 4.1 1.5 0.1 setosa

34 5.5 4.2 1.4 0.2 setosa

61 5.0 2.0 3.5 1.0 versicolor


Detecting outliers: the boxplot

For automated tasks, use the boxplot object itself:
> x <- iris$Sepal.Width

> bp <- boxplot( iris$Sepal.Width )

> iris[x %in% bp$out,]

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

16 5.7 4.4 1.5 0.4 setosa

33 5.2 4.1 1.5 0.1 setosa

34 5.5 4.2 1.4 0.2 setosa

61 5.0 2.0 3.5 1.0 versicolor


Detecting outliers: Mk1 Eyeballs

Some weird cases may always show up which quick stats won't pick up.

Visual approach: plot the data and identify weird cases visually, like:

*******.........*...........*********

^

outlier
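As a minimal sketch of this approach in R, a jittered one-dimensional scatter of the raw values often makes isolated cases stand out at a glance:
> stripchart(iris$Sepal.Width, method="jitter", pch=1)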


Understanding Outliers

No general rule, pretty much a domain-dependent task.

Data analysts and domain experts work together to distinguish genuine records from obvious errors (a 127-year-old driver renting a car).

Class information is at the centre of automated classification. Consider outliers with regard to their own class if available.


Understanding Outliers
Iris example: for Setosa, of the three extreme Sepal.Width values, only one is genuinely out of range. For Versicolor, the odd one out disappears. New outliers appear on other variables:
> par(mfrow = c(1,2))

> boxplot(iris[iris$Species=="setosa",c(1,2,3,4)], main="Setosa")

> boxplot(iris[iris$Species=="versicolor",c(1,2,3,4)], main="Versicolor")


Managing outliers

Incorrect data should be treated as missing (to be ignored or simulated, see below).

Genuine but unusual data is processed according to context:
◮ it should generally be kept (sometimes the exceptions are even of particular interest, e.g. fraud detection)
◮ it may occasionally be removed (bad practice, but sometimes there is interest in modelling only the mainstream data)


Missing values

Missing values are represented in R by the special NA value.

Amusingly, 'NA' in some data sets may mean 'North America', 'North American Airlines', etc. Keep it in mind while importing/exporting data.

Finding/counting them from a variable or data frame (BreastCancer comes from the mlbench package):
> library(mlbench)
> data(BreastCancer)
> sum(is.na(BreastCancer[,7]))

[1] 16

> incomplete <- BreastCancer[!complete.cases(BreastCancer),]

> nrow(incomplete)

[1] 16


Strategies for missing values

removing NAs:
> nona <- BreastCancer[complete.cases(BreastCancer),]

replacing NAs:
◮ mean

> x <- iris$Sepal.Width

> x[sample(length(x),5)] <- NA

> x[is.na(x)] <- mean(x, na.rm=TRUE)

◮ median (can be grabbed from the boxplot), or the most common value for a nominal variable (see the sketch below)
◮ value of the closest case in other dimensions
◮ domain-expert-decided value (caveat: data mining aims at finding things the domain experts don't already know)
◮ etc.
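A sketch of the median strategy, mirroring the mean example above (the NAs are introduced artificially):
> x <- iris$Sepal.Width
> x[sample(length(x), 5)] <- NA          # simulate missing values
> x[is.na(x)] <- median(x, na.rm=TRUE)   # replace them with the median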


R data types: numerical vs factors

Most analytics algorithms are mainly number crunching and expect numerical variables.

However, discrete variables can be managed by some techniques. R modules generally require those to be stored as factors.

Discrete variables are a better fit for some techniques (decision trees):
◮ consider converting numerical variables to meaningful ranges (e.g. customer age ranges)
◮ integer variables can be used as either numerical or factor


Factor to numerical

as.numeric isn't sufficient since it would simply return the internal factor level codes. We need to 'translate' each level into its value.
> library(mlbench)

> data(BreastCancer)

> f <- BreastCancer$Cell.shape[1:10]

> as.numeric(levels(f))[f]

[1] 1 4 1 8 1 10 1 2 1 1


Numerical to factor

Converting numerical to factor "as is" with as.factor:
> s <- c(21, 43, 55, 18, 21, 50, 20, 67, 36, 33, 36)

> as.factor(s)

[1] 21 43 55 18 21 50 20 67 36 33 36

Levels: 18 20 21 33 36 43 50 55 67

Converting numerical ranges to a factor with cut:
> cut(s, c(-Inf, 21, 26, 30, 34, 44, 54, 64, Inf), labels=

c("21 and Under", "22 to 26", "27 to 30", "31 to 34",

"35 to 44", "45 to 54", "55 to 64", "65 and Over"))

[1] 21 and Under 35 to 44     55 to 64     21 and Under 21 and Under
[6] 45 to 54     21 and Under 65 and Over  35 to 44     31 to 34
[11] 35 to 44

8 Levels: 21 and Under 22 to 26 27 to 30 31 to 34 35 to 44 ...


Reshaping
More often than not, the 'shape' of the data as it comes won't be convenient. Look at the following example:
> pop <- read.csv("http://2010.census.gov/2010census/data/pop

> pop <- pop[,1:12]

> colnames(pop)

[1] "STATE_OR_REGION" "X1910_POPULATION" "X1920_POPULATION"

[5] "X1940_POPULATION" "X1950_POPULATION" "X1960_POPULATION"

[9] "X1980_POPULATION" "X1990_POPULATION" "X2000_POPULATION"

> pop[1:10,]

STATE_OR_REGION X1910_POPULATION X1920_POPULATION X1930_POPULATION

1 United States 92228531 106021568

2 Alabama 2138093 2348174

3 Alaska 64356 55036

4 Arizona 204354 334162

5 Arkansas 1574449 1752204

6 California 2377549 3426861

7 Colorado 799024 939629

8 Connecticut 1114756 1380631

9 Delaware 202322 223003

10 District of Columbia 331069 437571

How do we turn this into a plotting-friendly 3-column state/year/population layout?


Reshaping: melt

The reshape2 package provides convenient functions for reshaping data:
> library(reshape2)

> colnames(pop) <- c("state", seq(1910, 2010, 10))

> mpop <- melt(pop, id.vars="state", variable.name="year",

value.name="population")

> mpop[1:10,]

state year population

1 United States 1910 92228531

2 Alabama 1910 2138093

3 Alaska 1910 64356

4 Arizona 1910 204354

5 Arkansas 1910 1574449

6 California 1910 2377549

7 Colorado 1910 799024

8 Connecticut 1910 1114756

9 Delaware 1910 202322

10 District of Columbia 1910 331069

This long format is more friendly to a relational database table too.


Reshaping: cast

acast and dcast reverse the melt and produce respectively an array/matrix or a data frame:
> dcast(mpop, state~year, value_var="population")[1:10,]

Using population as value column: use value.var to override.

state 1910 1920 1930 1940 1950

1 Alabama 2138093 2348174 2646248 2832961 3061743

2 Alaska 64356 55036 59278 72524 128643

3 Arizona 204354 334162 435573 499261 749587

4 Arkansas 1574449 1752204 1854482 1949387 1909511

5 California 2377549 3426861 5677251 6907387 10586223

6 Colorado 799024 939629 1035791 1123296 1325089

7 Connecticut 1114756 1380631 1606903 1709242 2007280

8 Delaware 202322 223003 238380 266505 318085

9 District of Columbia 331069 437571 486869 663091 802178

10 Florida 752619 968470 1468211 1897414 2771305


Forecasting, Predicting, Classifying...

Ultimately, we’re trying to understand a behaviour from our data.

To this end, various mathematical models have been developed, matching various known behaviours.

Each model will come with its own sweet/blind spots and its own scaling issues when moving towards Big Data.

Today's overview will cover Linear Regression, kNN, Decision Trees and basic Time Series, but there are many more models around...


Linear Regression

One of the simplest models.

Establish a linear relationship between variables, predicting one variable's value (the response) from the others (the predictors).

Intuitively, it’s all about drawing a line. But the right line.


Simple Linear Regression

> data(trees)

> plot(trees$Girth, trees$Volume)


Simple Linear Regression
> lm(formula=Volume~Girth, data=trees)

Call:

lm(formula = Volume ~ Girth, data = trees)

Coefficients:

(Intercept) Girth

-36.943 5.066

> abline(-36.943, 5.066)


Simple Linear Regression

For a response variable r and predictor variables p1, p2, ..., pn, the lm() function generates a simple linear model based on a formula object of the form: r ~ p1 + p2 + ... + pn

Example: building a linear model using both Girth and Height as predictors for a tree's Volume:
> lm(formula=Volume~Girth+Height, data=trees)

Call:

lm(formula = Volume ~ Girth + Height, data = trees)

Coefficients:

(Intercept) Girth Height

-57.9877 4.7082 0.3393

By default, lm() fits the model that minimizes the sum of squared errors.
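The fitted model can then be applied to new data with predict(); a minimal sketch (the Girth and Height values below are invented for illustration):
> fit <- lm(formula=Volume~Girth+Height, data=trees)
> predict(fit, newdata=data.frame(Girth=c(10, 15), Height=c(70, 80)))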


Linear Model Evaluation
> fit <- lm(formula=Volume~Girth+Height, data=trees)

> summary(fit)

Call:

lm(formula = Volume ~ Girth + Height, data = trees)

Residuals:

Min 1Q Median 3Q Max

-6.4065 -2.6493 -0.2876 2.2003 8.4847

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *
---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 3.882 on 28 degrees of freedom

Multiple R-squared: 0.948, Adjusted R-squared: 0.9442

F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16


Refining the model

Low-significance predictor attributes complicate a model for a small gain.

The anova() function helps evaluate predictors.

The update() function allows us to remove predictors from the model.

The step() function can repeat such a task using a different criterion.


Refining the model: anova()

> data(airquality)

> fit <- lm(formula = Ozone ~ . , data=airquality)

> anova(fit)

Analysis of Variance Table

Response: Ozone

Df Sum Sq Mean Sq F value Pr(>F)

Solar.R     1  14780   14780 33.9704 6.216e-08 ***
Wind        1  39969   39969 91.8680 5.243e-16 ***
Temp        1  19050   19050 43.7854 1.584e-09 ***
Month       1   1701    1701  3.9101   0.05062 .

Day 1 619 619 1.4220 0.23576

Residuals 105 45683 435

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Sum Sq shows the reduction in the residual sum of squares as each predictor is added. Small values contribute less.


Refining the model: update()
> fit2 <- update(fit, . ~ . - Day)

> summary(fit2)

Call:

lm(formula = Ozone ~ Solar.R + Wind + Temp + Month, data = airquality)

[...]

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -58.05384   22.97114  -2.527   0.0130 *
Solar.R       0.04960    0.02346   2.114   0.0368 *
Wind         -3.31651    0.64579  -5.136 1.29e-06 ***
Temp          1.87087    0.27363   6.837 5.34e-10 ***
Month        -2.99163    1.51592  -1.973   0.0510 .

---

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 20.9 on 106 degrees of freedom

(42 observations deleted due to missingness)

Multiple R-squared: 0.6199, Adjusted R-squared: 0.6055

F-statistic: 43.21 on 4 and 106 DF, p-value: < 2.2e-16

We removed the Day predictor from the model.

New model is slightly worse... but simpler...


Refining the model: step()

The step() function can automatically reduce the model:

> final <- step(fit)

Start: AIC=680.21

Ozone ~ Solar.R + Wind + Temp + Month + Day

Df Sum of Sq RSS AIC

- Day 1 618.7 46302 679.71

<none> 45683 680.21

- Month 1 1755.3 47438 682.40

- Solar.R 1 2005.1 47688 682.98

- Wind 1 11533.9 57217 703.20

- Temp 1 20845.0 66528 719.94

Step: AIC=679.71

Ozone ~ Solar.R + Wind + Temp + Month

Df Sum of Sq RSS AIC

<none> 46302 679.71

- Month 1 1701.2 48003 681.71

- Solar.R 1 1952.6 48254 682.29

- Wind 1 11520.5 57822 702.37

- Temp 1 20419.5 66721 718.26


K-nearest neighbours (KNN)

K-nearest neighbour classification is amongst the simplest classification algorithms.

It consists in classifying an element as the majority class of the k elements of the learning set closest to it in the multidimensional feature space.

no training needed.

classification can be compute-intensive (many distances to evaluate) and requires access to the learning data set.

very intuitive for the end user, but does not provide any insight into the data.


An example

With k = 5, the central dot would be classified as red.


What value for k?

smaller k values are faster to process.

higher k values are more robust to noise.

n-fold cross validation can be used on incremental values of k to select a k value that minimises error.


KNN with R

The knn function (package class) provides KNN classification for R.
knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)

Arguments:

train: matrix or data frame of training set cases.

test: matrix or data frame of test set cases. A vector will be
      interpreted as a row vector for a single case.

cl: factor of true classifications of training set

k: number of neighbours considered.

l: minimum vote for definite decision, otherwise 'doubt'. (More
   precisely, less than 'k-l' dissenting votes are allowed, even
   if 'k' is increased by ties.)

prob: If this is true, the proportion of the votes for the winning
      class are returned as attribute 'prob'.

use.all: controls handling of ties. If true, all distances equal
         to the 'k'th largest are included. If false, a random
         selection of distances equal to the 'k'th is chosen to
         use exactly k neighbours.


Using knn()
> library(class)

> train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])

> test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])

> cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))

> knn(train, test, cl, k = 3, prob=TRUE)

[1] s s s s s s s s s s s s s s s s s s s s s s s s s c c v c c

[39] c c c c c c c c c c c c v c c v v v v v c v v v v c v v v v

attr(,"prob")

[1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000

[8] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000

[15] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000

[22] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000

[29] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667

[36] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000

[43] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000

[50] 1.0000000 1.0000000 0.6666667 0.7500000 1.0000000 1.0000000

[57] 1.0000000 1.0000000 0.5000000 1.0000000 1.0000000 1.0000000

[64] 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000

[71] 1.0000000 0.6666667 1.0000000 1.0000000 0.6666667

Levels: c s v


Counting errors with cross-validation for different k values

knn.cv does leave-one-out cross-validation: it classifies each item while leaving it out of the learning set.
> train <- rbind(iris3[,,1], iris3[,,2], iris3[,,3])

> cl <- factor(c(rep("s",50), rep("c",50), rep("v",50)))

> sum( (knn.cv(train, cl, k = 1) == cl) == FALSE )

[1] 6

> sum( (knn.cv(train, cl, k = 5) == cl) == FALSE )

[1] 5

> sum( (knn.cv(train, cl, k = 15) == cl) == FALSE )

[1] 4

> sum( (knn.cv(train, cl, k = 30) == cl) == FALSE )

[1] 8


Decision trees

A decision tree is a tree-structured representation of a dataset and its class-relevant partitioning.

The root node 'contains' the entire learning dataset.

Each non-terminal node is split according to a particular attribute/value combination.

The class distribution in terminal nodes is used to assign a class probability to further unclassified data.

Human readable!


An example


Building the tree

A tree is built by successive partitioning.

Starting from the root, every attribute is considered for a potential split of the data set.

For each attribute, every possible split is considered.

The "best split" is picked by comparing the resulting distribution of classes in the generated child nodes.

Each child node is then considered for further partitioning, and so on until:

◮ partitioning a node doesn't improve the class distribution (e.g. only 1 class represented in a node),
◮ a node's "population" is too small (min split),
◮ a node's potential partitioning would generate a child node with too small a population (min bucket); see the sketch below.
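These stopping rules map onto the control options of rpart (introduced on the next slides); a minimal sketch, with arbitrary minsplit/minbucket values chosen purely for illustration:
> library(rpart)
> fit <- rpart(Species ~ ., data=iris,
               control=rpart.control(minsplit=10, minbucket=5))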


Decision trees with R: the rpart module

rpart is an R module providing functions for generating decision trees (among other things).
rpart(formula, data, weights, subset, na.action = na.rpart,

method, model = FALSE, x = FALSE, y = TRUE,

parms, control, cost, ...)

formula: class ~ att1 + att2 + ... + attn

data: name of the data frame whose columns include the attributes used in the formula.

weights: optional case weights.

subset: optional subsetting of the data set for use in the fit.

na.action: strategies for missing values.

method: defaults to "class", which applies to class-based decision trees.

control: rpart control options (like min split/bucket; refer to ?rpart.control for details).


Using rpart

> library(rpart)

> model <- rpart(Species ~ Sepal.Length + Sepal.Width +

Petal.Length + Petal.Width, data=iris)

> # textual representation of tree:

> model

n= 150

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)

  2) Petal.Length< 2.45 50   0 setosa (1.00000000 0.00000000 0.00000000) *
  3) Petal.Length>=2.45 100 50 versicolor (0.00000000 0.50000000 0.50000000)
    6) Petal.Width< 1.75 54   5 versicolor (0.00000000 0.90740741 0.09259259) *
    7) Petal.Width>=1.75 46   1 virginica (0.00000000 0.02173913 0.97826087) *


Plotting the tree model

> # basic R graphics plot of tree:

> plot(model)

> text(model)

> # fancier postscript plot of tree:

> post(model, file="mytree.ps", title="Iris Classification")


rpart model classification

Use predict to apply a model to a data frame:
> unclassified <- iris[c(13,54,76,104,32,114,56),c(

"Sepal.Length","Sepal.Width", "Petal.Length", "Petal.Width")]

> predict(model, newdata=unclassified, type="prob")

setosa versicolor virginica

13 1 0.00000000 0.0000000

54 0 0.90740741 0.0925926

76 0 0.90740741 0.0925926

104 0 0.02173913 0.9782609

32 1 0.00000000 0.0000000

114 0 0.02173913 0.9782609

56 0 0.90740741 0.0925926

> predict(model, newdata=unclassified, type="vector")

13 54 76 104 32 114 56

1 2 2 3 1 3 2

> predict(model, newdata=unclassified, type="class")

13         54         76         104        32         114        56
setosa     versicolor versicolor virginica  setosa     virginica  versicolor

Levels: setosa versicolor virginica


rpart model evaluation

Use a confusion matrix to measure the accuracy of predictions:
> pred <- predict(model, iris[,c(1,2,3,4)], type="class")

> conf <- table(pred, iris$Species)

> sum(diag(conf)) / sum(conf)

[1] 0.96
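For a closer look at which classes get mixed up, print the table itself; with this tree the handful of errors should all be versicolor/virginica confusions:
> conf   # rows = predicted class, columns = actual species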


Time Series

Another type of model, applying to time-related data.

Additive or multiplicative decomposition of signal into components.

Many models and parameters can be used to fit the series, which is then used for forecasting.

Some automated fitting is available in R.


Time Series

R manages time series objects by default:

> data(AirPassengers)

> AirPassengers

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

1949 112 118 132 129 121 135 148 148 136 119 104 118

1950 115 126 141 135 125 149 170 170 158 133 114 140

1951 145 150 178 163 172 178 199 199 184 162 146 166

1952 171 180 193 181 183 218 230 242 209 191 172 194

1953 196 196 236 235 229 243 264 272 237 211 180 201

1954 204 188 235 227 234 264 302 293 259 229 203 229

1955 242 233 267 269 270 315 364 347 312 274 237 278

1956 284 277 317 313 318 374 413 405 355 306 271 306

1957 315 301 356 348 355 422 465 467 404 347 305 336

1958 340 318 362 348 363 435 491 505 404 359 310 337

1959 360 342 406 396 420 472 548 559 463 407 362 405

1960 417 391 419 461 472 535 622 606 508 461 390 432


Time Series

Simple things like changing the series frequency are handled natively too:

> ts(AirPassengers, frequency=4, start=c(1949,1),

end=c(1960,4))

Qtr1 Qtr2 Qtr3 Qtr4

1949 112 118 132 129

1950 121 135 148 148

1951 136 119 104 118

1952 115 126 141 135

1953 125 149 170 170

1954 158 133 114 140

1955 145 150 178 163

1956 172 178 199 199

1957 184 162 146 166

1958 171 180 193 181

1959 183 218 230 242

1960 209 191 172 194


Time Series

As are plotting and decomposition:

> plot(AirPassengers)

> plot(decompose(AirPassengers))


Time Series Decomposition

In a simple seasonal time series, the signal can be decomposed into three components that can then be analysed separately:

◮ the Trend component, which shows the progression of the series.
◮ the Seasonal component, which shows the periodic variation.
◮ the Irregular component, which shows the rest of the variations.

In an additive decomposition, our signal is Trend + Seasonal + Irregular.

In a multiplicative decomposition, our signal is Trend * Seasonal * Irregular.

Multiplicative decomposition makes sense when absolute differences in values are of less interest than percentage changes.

A multiplicative signal can also be decomposed in additive fashion by working on log(data).


Additive/Multiplicative Decomposition

Our example shows typical multiplicative behaviour.

> plot(decompose(AirPassengers))

> plot(decompose(AirPassengers, type="multiplicative"))


Log of a multiplicative series

Using log() to decompose our series in additive fashion:

> plot(log(AirPassengers))

> plot(decompose(log(AirPassengers)))


The ARIMA model

ARIMA stands for AutoRegressive Integrated Moving Average.

ARIMA is one of the most general classes of models for time series forecasting.

An ARIMA model is characterized by three non-negative integer parameters commonly called (p, d, q):

◮ p is the autoregressive order (AR).
◮ d is the integrated order (I).
◮ q is the moving average order (MA).

An ARIMA model with zero for some of those values is in fact a simpler model, be it AR, MA or ARMA...

As for linear regression, an information criterion can be used to evaluate which values of (p, d, q) provide a better fit.

A seasonal ARIMA model (p, d, q) x (P, D, Q) has three additional parameters modelling the seasonal behaviour of the series in the same fashion.
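As a sketch, a single candidate model can also be fitted by hand with base R's arima(); the (1,1,1)x(0,1,1)[12] order below is an arbitrary choice for illustration, and AIC() gives the information criterion used to compare candidates:
> fit <- arima(AirPassengers, order=c(1,1,1),
               seasonal=list(order=c(0,1,1), period=12))
> AIC(fit)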


Automated ARIMA fitting and forecasting

The auto.arima() function will explore a range of values for (p, d, q) x (P, D, Q) and return the best-fitting model, which can then be used for forecasting:

> library(forecast)

> fit <- auto.arima(AirPassengers)

> plot(forecast(fit, h=20))


Going further

Ensembles of models

Rattle


Ensembles of models

Models built with a specific set of parameters have a limit to the data relationships they can express.

The choice of model or initial parameters will create specific recurring misclassifications.

Solution: build several competing models and average their classifications.

Some techniques are built around this idea, like random forests (see e.g. the randomForest package in R; a sketch follows below).
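A minimal random forest sketch (assuming the randomForest package is installed; the parameter values are illustrative only):
> library(randomForest)
> set.seed(42)
> rf <- randomForest(Species ~ ., data=iris, ntree=500)
> rf$confusion   # per-class error estimated from the out-of-bag votes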


Rattle

Rattle is a data mining framework for R. Installable as a CRAN module, it features:

◮ Graphical user interface to common mining modules
◮ Full mining framework: data preprocessing, analysis, mining, validating
◮ Automatic generation of R code

In addition to fast hands-on data mining, the rattle log is a great R learning resource.

Introduction paper at: http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf

> install.packages("RGtk2")

> install.packages("rattle")

> library(rattle)

Rattle: Graphical interface for data mining using R.

Version 2.5.40 Copyright (c) 2006-2010 Togaware Pty Ltd.

Type ’rattle()’ to shake, rattle, and roll your data.

> rattle()


Rattle


The End

Thank you. :)
