Data munging Machine Learning Supervised learning in R Probability in R Big data

Advanced R Programming - Lecture 7

Krzysztof Bartoszek (slides by Leif Jonsson and Måns Magnusson)

Linköping University

[email protected]

29 September 2017

K. Bartoszek (STIMA LiU) STIMA LiU

Lecture 7


Today

Data munging

Machine Learning

Supervised learning in R

Probability in R

Big data


Questions since last time?


Tidy data

Tidy data and messy data


Tidy data

1. Each variable forms a column

2. Each observation forms a row

3. Each type of observational unit forms a table

Examples: iris and faithful


Why tidy?

80 % of Big Data work is data munging

Analysis and visualization are based on tidy data

Performant code


Data analysis pipeline

Messy data → Tidy data → Analysis


Messy data

1. Column headers are values, not variable names. (AirPassengers)

2. Multiple variables are stored in one column. (mtcars)

3. Variables are stored in both rows and columns. (crimtab)

4. Multiple types of observational units are stored in the same table.

5. A single observational unit is stored in multiple tables.
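
As a rough illustration of the first kind of messiness, the sketch below reshapes the built-in AirPassengers series, which prints as a year-by-month grid, into one row per observation. It uses base R only. The column names year, month and passengers are chosen here for illustration and do not come from the slides.

```r
# AirPassengers is a monthly time series; when printed, the month names
# act as column headers, i.e. the headers are values, not variable names.
ap <- datasets::AirPassengers

# One row per observation, one column per variable.
tidy_ap <- data.frame(
  year       = as.integer(floor(time(ap) + 1e-8)),  # calendar year; the small offset guards against rounding
  month      = month.abb[as.integer(cycle(ap))],    # position within the year -> month label
  passengers = as.numeric(ap)                       # the measured value
)

head(tidy_ap, 3)
```

In this long layout each of the three variables can be filtered, grouped or plotted directly, which is the point of the tidy data rules.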


dplyr

Verbs for handling data

Highly optimized C++ code (backend)

Handling larger datasets in R (no copy-on-modify)
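
A minimal sketch of what "verbs" means in practice, assuming the dplyr package is installed. The choice of the mtcars data, the hp > 100 threshold and the derived column kpl are illustrative assumptions, not part of the lecture.

```r
library(dplyr)

# Each verb takes a data frame and returns a new one, so verbs chain with %>%.
mpg_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                           # keep a subset of rows
  mutate(kpl = mpg * 0.425144) %>%               # add a column (miles per US gallon -> km per litre)
  group_by(cyl) %>%                              # define groups
  summarise(n = n(), mean_kpl = mean(kpl)) %>%   # collapse each group to one row
  arrange(desc(mean_kpl))                        # order the result

mpg_by_cyl
```

The pipeline reads top to bottom as a sequence of small steps, each of which can be tested on its own.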


dplyr+tidyr


The cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf


Regular Expressions

Language for manipulating strings

Find strings that match a pattern

Extract patterns from strings

Replace patterns in strings

Component in many functions (grep, gsub, stringr::, tidyr::)


Regular Expressions - Syntax

fruit <- c("apple", "banana", "pear", "pineapple")

Symbol  Description                                                        Example
?       The preceding item is optional and will be matched at most once    grep("pi?", fruit)
*       The preceding item will be matched zero or more times              grep("pi*", fruit)
+       The preceding item will be matched one or more times               grep("pi+", fruit)
{n}     The preceding item is matched exactly n times                      grep("p{2}", fruit)
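
The quantifiers in the table can be tried directly in base R. The sketch below runs the four patterns on the same fruit vector; the variable names on the left are illustrative only.

```r
fruit <- c("apple", "banana", "pear", "pineapple")

optional    <- grep("pi?", fruit)    # "p", optionally followed by one "i"
zero_plus   <- grep("pi*", fruit)    # "p" followed by zero or more "i"
one_plus    <- grep("pi+", fruit)    # "p" followed by one or more "i"
exactly_two <- grep("p{2}", fruit)   # two consecutive "p"

# grep() returns positions by default; value = TRUE returns the strings.
two_p_words <- grep("p{2}", fruit, value = TRUE)
```

Because ? and * both allow zero occurrences of "i", the first two patterns match every string that contains a "p" (positions 1, 3 and 4), while "pi+" matches only "pineapple" and "p{2}" matches "apple" and "pineapple".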


Regex Examples - Finding matches

> library(gapminder)
> grep("we", gapminder$country)
 [1] 1465 1466 1467 1468 1469 1470 1471 1472 1473 1474 1475 1476 1693 1694 1695 1696 1697
[18] 1698 1699 1700 1701 1702 1703 1704
> grep("we", gapminder$country, value = TRUE)
 [1] "Sweden"   "Sweden"   "Sweden"   "Sweden"   "Sweden"   "Sweden"   "Sweden"   "Sweden"
 [9] "Sweden"   "Sweden"   "Sweden"   "Sweden"   "Zimbabwe" "Zimbabwe" "Zimbabwe" "Zimbabwe"
[17] "Zimbabwe" "Zimbabwe" "Zimbabwe" "Zimbabwe" "Zimbabwe" "Zimbabwe" "Zimbabwe" "Zimbabwe"


Regex Examples - Extraction

> library(stringr)
> strs <- c("The 13 Cats in the Hats are 17 years", "4 scores and 7 beers ago")
> str_extract_all(strs, "[a-z]+")
[[1]]
[1] "he"    "ats"   "in"    "the"   "ats"   "are"   "years"

[[2]]
[1] "scores" "and"    "beers"  "ago"

> str_extract(strs, "[a-z]+")
[1] "he"     "scores"
> str_extract(strs, "[0-9]+")
[1] "13" "4"
> str_extract_all(strs, "[0-9]+")
[[1]]
[1] "13" "17"

[[2]]
[1] "4" "7"


Regex Examples - Tidyr separate

> library(gapminder)
> library(tidyr)  # provides separate() and re-exports %>%
> # Create artificial column with numeric data in text
> rnds <- ceiling(runif(nrow(gapminder), 80, 200))
> gapminder$country <- paste(gapminder$country, rnds, " population")
> # Split the column at the run of digits
> tdy <- gapminder %>% separate(country, into = c("Count", "CPop"), sep = "\\d+")
> head(tdy)
# A tibble: 6 × 7
  Count        CPop        continent  year lifeExp      pop gdpPercap
  <chr>        <chr>       <fctr>    <int>   <dbl>    <int>     <dbl>
1 Afghanistan  population  Asia       1952  28.801  8425333  779.4453
2 Afghanistan  population  Asia       1957  30.332  9240934  820.8530
3 Afghanistan  population  Asia       1962  31.997 10267083  853.1007
4 Afghanistan  population  Asia       1967  34.020 11537966  836.1971


Machine learning?

Automatically detect patterns in data

Predict future observations

Decision making under uncertainty


Types of Machine learning

Supervised learning

Unsupervised learning

Reinforcement learning


Supervised learning

(also called predictive learning)

response variable

covariates/features

training set

$D = \{(x_i, y_i)\}_{i=1}^{N}$


Supervised learning examples

If $y_i$ is categorical: classification

If $y_i$ is real: regression
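
The two cases can be sketched with base R on the built-in mtcars data. The models used here (a linear model and a logistic regression) and the chosen covariates are illustrative assumptions; the slides do not prescribe a particular method.

```r
# Regression: the response mpg is a real number.
reg_fit  <- lm(mpg ~ wt + hp, data = mtcars)
reg_pred <- predict(reg_fit, newdata = data.frame(wt = 3, hp = 110))

# Classification: the response am (0 = automatic, 1 = manual) is categorical.
# Logistic regression is used here only as one possible classifier.
cls_fit   <- glm(am ~ wt, data = mtcars, family = binomial)
cls_prob  <- predict(cls_fit, newdata = data.frame(wt = 3), type = "response")
cls_label <- ifelse(cls_prob > 0.5, "manual", "automatic")
```

In both cases the training set is a collection of pairs (covariates, response); only the type of the response changes what kind of prediction comes out.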


Unsupervised learning

(also called knowledge discovery)

dimensionality reduction

latent variable modeling

$D = \{x_i\}_{i=1}^{N}$

clustering, PCA, discovery of graph structures

data visualization
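
A small sketch of two of the techniques named above, using base R on the iris measurements. The choice of three clusters and of the first two principal components is an assumption made for the example.

```r
x <- iris[, 1:4]                      # features only; Species is left out

# Dimensionality reduction: principal component analysis on scaled features.
pca <- prcomp(x, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Clustering: k-means with k = 3 on the first two principal components.
set.seed(1)
km <- kmeans(pca$x[, 1:2], centers = 3, nstart = 20)

# Species was never used for fitting; it only helps to read the clusters.
table(cluster = km$cluster, species = iris$Species)
```

No response variable enters the fit, which is what distinguishes this from the supervised setting.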


Curse of dimensionality

The more variables, the larger the distance between data points

Euclidean metric

$d^2(\vec{x}, \vec{y}) = (x_1 - y_1)^2 + (x_2 - y_2)^2 + (x_3 - y_3)^2 + (x_4 - y_4)^2 + \dots$


http://www.newsnshit.com/curse-of-dimensionality-interactive-demo/
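
The effect can also be checked with a short simulation in base R. The sketch below assumes points drawn uniformly from the unit cube; the sample size and the list of dimensions are arbitrary choices.

```r
set.seed(1)
n    <- 200                                # points per experiment
dims <- c(1, 2, 10, 100, 1000)

mean_dist <- sapply(dims, function(p) {
  x <- matrix(runif(n * p), nrow = n)      # n points, uniform in [0, 1]^p
  mean(dist(x))                            # mean pairwise Euclidean distance
})

data.frame(dimensions = dims, mean_distance = round(mean_dist, 2))
```

Each extra coordinate adds a non-negative term to the squared distance (on average 1/6 for independent uniform coordinates), so the typical distance grows roughly like the square root of the number of variables.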


Fit (bias) and variance in ML

Underfit = bad fit, low variance

Overfit = good fit, high variance

NetMaker (Neural networks simulator and designer by Robert Sulej, Warsaw Univ. Tech.):

http://www.ire.pw.edu.pl/~rsulej/NetMaker/index.php?pg=e06


Model selection

fit and variance - tradeoff

hyperparameters

generalization error

validation set/cross validation

information criteria: model fit penalized for model dimensionality


Predictive modeling pipeline

1. Set aside data for test (estimate generalization error)

2. Set aside data for validation (if hyperparams)

3. Run algorithms

4. Find best/optimal hyperparameters (on validation set)

5. Choose final model

6. Estimate generalization error on test set
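
The six steps can be sketched in base R as follows. Everything specific here is an assumption made for illustration: the data are simulated, the split is 60/20/20, the model family is polynomial regression and the hyperparameter is the polynomial degree.

```r
set.seed(42)
n   <- 300
x   <- runif(n, -2, 2)
y   <- sin(2 * x) + rnorm(n, sd = 0.3)     # simulated data, for illustration only
dat <- data.frame(x = x, y = y)

# Steps 1-2: set aside test and validation data.
idx   <- sample(rep(c("train", "valid", "test"), times = c(180, 60, 60)))
train <- dat[idx == "train", ]
valid <- dat[idx == "valid", ]
test  <- dat[idx == "test", ]

rmse <- function(fit, newdata) sqrt(mean((newdata$y - predict(fit, newdata))^2))

# Steps 3-4: fit one model per hyperparameter value, score each on the validation set.
degrees    <- 1:10
fits       <- lapply(degrees, function(d) lm(y ~ poly(x, d), data = train))
valid_rmse <- sapply(fits, rmse, newdata = valid)

# Step 5: choose the final model.
best <- which.min(valid_rmse)

# Step 6: estimate the generalization error once, on the untouched test set.
test_rmse <- rmse(fits[[best]], test)
```

The test set is used exactly once, after the degree has been fixed; reusing it during selection would make the final error estimate optimistic.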


No free lunch theorem

different models work in different domains

accuracy-complexity-interpretability tradeoff

...but more data always wins


the caret package

package for supervised learning

does not contain methods - a framework

compare methods on hold-out-data

http://topepo.github.io/caret/

specific algorithms are part of other courses
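
A minimal sketch of comparing two methods on hold-out data with caret, assuming the caret package (and MASS, which supplies the "lda" method) is installed. The iris data, the 80/20 split, 5-fold cross-validation and the two method codes are choices made for this example only.

```r
library(caret)

set.seed(1)
# Hold out 20% of the rows for the final comparison.
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
training <- iris[in_train, ]
holdout  <- iris[-in_train, ]

# The same resampling scheme for every method: 5-fold cross-validation.
ctrl <- trainControl(method = "cv", number = 5)

# "knn" and "lda" are two of the method codes that train() accepts.
fit_knn <- train(Species ~ ., data = training, method = "knn", trControl = ctrl)
fit_lda <- train(Species ~ ., data = training, method = "lda", trControl = ctrl)

# Compare the methods on the hold-out data.
acc <- c(
  knn = mean(predict(fit_knn, newdata = holdout) == holdout$Species),
  lda = mean(predict(fit_lda, newdata = holdout) == holdout$Species)
)
acc
```

Because both fits go through the same train() interface, swapping in another method is a one-word change.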


Probability Functions

Prefix  Description        Example
r       Random draw        rnorm
d       Density function   dbinom
q       Quantile function  qbeta
p       CDF                pgamma
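
The four functions from the table can be called as below. The parameter values are arbitrary and chosen only so that the results are easy to check by hand.

```r
set.seed(1)
draws <- rnorm(5, mean = 0, sd = 1)            # r: five random draws from N(0, 1)
dens  <- dbinom(2, size = 4, prob = 0.5)       # d: P(X = 2) for X ~ Bin(4, 0.5), i.e. 6/16
quant <- qbeta(0.5, shape1 = 2, shape2 = 2)    # q: median of the symmetric Beta(2, 2)
cdf   <- pgamma(1, shape = 1, rate = 1)        # p: P(X <= 1) for X ~ Gamma(1, 1), i.e. 1 - exp(-1)

# For a continuous distribution the p and q functions are inverses of each other.
round_trip <- pnorm(qnorm(0.975))
```

The same four prefixes combine with every distribution suffix in base R (norm, binom, beta, gamma, unif, pois and so on).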


Big data

Today’s trend:

whole genome

surveillance cameras (CCTV)

Internet traffic

credit card transactions

everything communicating with everything


Big data is relative...

... to computational complexity

$O(N)$    $10^{12}$

$O(N^2)$    $10^{6}$

$O(N^3)$    $10^{4}$

$O(2^N)$    $50$

We need algorithms that scale!
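
One way to read the figures above is as the largest N that fits a fixed budget of elementary operations. The sketch below assumes a budget of 10^12 operations, which is an assumption and not stated on the slide.

```r
budget <- 1e12   # assumed number of elementary operations we can afford

largest_n <- c(
  "O(N)"   = budget,            # N     <= budget
  "O(N^2)" = budget^(1 / 2),    # N^2   <= budget
  "O(N^3)" = budget^(1 / 3),    # N^3   <= budget
  "O(2^N)" = log2(budget)       # 2^N   <= budget
)
signif(largest_n, 2)
```

Under this assumption the polynomial cases reproduce the first three figures, and the exponential case comes out at about 40, the same order of magnitude as the value quoted on the slide.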


Big data is relative...

... to computational complexity

$O(P^2 \cdot N)$    Linear regression

$O(N^3)$    Gaussian processes

$O(N^2)$/$O(N^3)$    Support vector machines

$O(T \cdot P \cdot N \cdot \log N)$    Random forests

$O(I \cdot N)$    Topic models


Big data in R

R stores data in RAM

integers: 4 bytes

numerics: 8 bytes

A matrix with 100M rows and 5 columns of numerics: $100000000 \cdot 5 \cdot 8 / 1024^3 \approx 3.7$ GB

help(Memory); help("Memory-limits")

Genome storage ... ?
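
The memory arithmetic can be reproduced, and checked on a smaller object, with base R. The smaller matrix size below is an arbitrary choice so that the example allocates only a few megabytes.

```r
# Size of the matrix from the slide, computed without allocating it.
rows <- 1e8
cols <- 5
gib  <- rows * cols * 8 / 1024^3           # 8 bytes per numeric, result in units of 1024^3 bytes

# The same arithmetic checked on a matrix small enough to allocate.
small       <- matrix(0, nrow = 1e5, ncol = 5)
small_bytes <- as.numeric(object.size(small))   # about 1e5 * 5 * 8 bytes plus a small header

# Integers take about half the space of numerics.
ratio <- as.numeric(object.size(integer(1e6))) / as.numeric(object.size(numeric(1e6)))
```

object.size() reports the payload plus a fixed per-object header, so the measured size is slightly above rows * cols * 8.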


How to deal with large data sets

Handle chunkwise

Subsampling

More hardware

C++/Java backend (dplyr)

Reduce data in memory

Database backend
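
Chunkwise handling, the first strategy in the list, can be sketched in base R with an open file connection. The temporary file, the chunk size and the running mean are illustrative; a real data set would be a file too large to load at once.

```r
# Write a file to stand in for data that would not fit in memory.
path <- tempfile(fileext = ".txt")
writeLines(as.character(1:100000), path)

con        <- file(path, open = "r")
chunk_size <- 10000
total      <- 0
count      <- 0
repeat {
  lines <- readLines(con, n = chunk_size)   # only one chunk is held in memory
  if (length(lines) == 0) break             # end of file reached
  values <- as.numeric(lines)
  total  <- total + sum(values)             # update the running summaries
  count  <- count + length(values)
}
close(con)
unlink(path)

chunk_mean <- total / count
```

This works for any statistic that can be updated from running summaries (sums, counts, minima, maxima); statistics such as the median need a different approach, for example subsampling.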


If not enough

Spark and SparkR

Fast cluster computations for ML/STATS

Introduction to Spark: https://www.youtube.com/watch?v=_Ss1Cm6WO-I


The End... for today.

Questions?

See you next time!
