Oracle Advanced Analytics
Oracle R Enterprise 1.5 – Hot New Features

Mark Hornick, January 2016

Copyright © 2016 Oracle and/or its affiliates. All rights reserved.



Safe Harbor Statement

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.


Agenda

• Oracle’s Advanced Analytics and R Technologies

• Overview of Oracle R Enterprise

• ORE 1.5 Features


Oracle’s Advanced Analytics
Multiple interfaces across platforms: SQL, R, GUI, Dashboards, Apps

• Users: R programmers, Data / Business Analysts, Business Analysts/Mgrs, Domain End Users

• Interfaces: R Client, SQL Developer / Oracle Data Miner, OBIEE, Applications

• Platform: Oracle Database Enterprise Edition (Oracle Database 12c), Oracle Cloud, Hadoop (ORAAH: parallel, distributed algorithms)

• Oracle Advanced Analytics - Database Option: SQL Data Mining & Analytic Functions + R Integration for Scalable, Distributed, Parallel in-Database ML Execution


Oracle’s R Technologies
Supporting R, Oracle Database, and Big Data Appliance/Hadoop

• Oracle R Distribution

• ROracle

• Oracle R Enterprise: component of the Oracle Advanced Analytics Option to Oracle Database

• Oracle R Advanced Analytics for Hadoop: component of the Big Data Connectors Software Suite

Software available to R Community for free


Oracle R Enterprise
Component of Oracle Advanced Analytics option

• Scale R to Big Data

• Use Oracle Database as HPC environment

• Use in-database parallel and distributed machine learning algorithms

• Manage R scripts and R objects in Oracle Database

• Integrate R results into applications and dashboards via SQL

[Diagram: a Client R Engine with the ORE packages connects to Oracle Database (user tables, in-db stats) on the Database Server Machine. SQL interfaces: SQL*Plus, SQLDeveloper, …]


IoT Use Case: Energy Demand

• Model each customer’s usage to understand behavior and predict individual usage and overall aggregate demand

• 200 thousand households, each with a utility “smart meter”

• 1 reading / meter / hr

• 200K x 8760 hrs / yr → 1.752B readings

• 3 years’ worth of data → 5.256B readings

• Each customer has 26280 readings

• If each model takes 10 seconds to build: 555.6 hrs (about 23.1 days) when run serially; with 128 DOP (degree of parallelism), about 4.3 hrs
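The figures on this slide follow from simple arithmetic. The sketch below re-derives them in plain R; the variable names are illustrative, and the parallel estimate assumes perfect scaling across the 128 parallel processes, which the slide implies but does not state.

```r
# Back-of-envelope check of the smart meter figures from the slide.
households        <- 200e3      # 200 thousand smart meters
readings_per_year <- 24 * 365   # one reading per meter per hour = 8760
years             <- 3
secs_per_model    <- 10         # build time per customer model, as on the slide
dop               <- 128        # degree of parallelism

readings_year  <- households * readings_per_year   # 1.752e9 readings per year
readings_total <- readings_year * years            # 5.256e9 readings in 3 years
per_customer   <- readings_per_year * years        # 26280 readings per customer

serial_hours   <- households * secs_per_model / 3600   # about 555.6 hours
serial_days    <- serial_hours / 24                    # about 23.1 days
parallel_hours <- serial_hours / dop                   # about 4.3 hours (assumes perfect scaling)
```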


Scalable Analysis – Model Building
Smart meter scenario

[Diagram: in Oracle Database + ORE, the data is partitioned by customer (c1, c2, …, ci, …, cn). An R script, f(dat,args,…) { }, builds a model on each partition, producing Model c1, Model c2, …, Model ci, …, Model cn. The diagram also shows the R Datastore and the R Script Repository.]


Build models, partitioned on CUST_ID, and store in database

    ore.groupApply(CUST_USAGE_DATA,
                   CUST_USAGE_DATA$CUST_ID,
                   function(dat, ds.name) {
                     # Each invocation receives the rows for a single customer
                     cust_id <- dat$CUST_ID[1]
                     mod <- lm(Consumption ~ . -CUST_ID, dat)
                     # Drop bulky components to reduce the size of the stored model
                     mod$effects <- mod$residuals <- mod$fitted.values <- NULL
                     name <- paste("mod", cust_id, sep="")
                     assign(name, mod)
                     # Save the model in a per-customer datastore
                     ds.name1 <- paste(ds.name, ".", cust_id, sep="")
                     ore.save(list=name, name=ds.name1, overwrite=TRUE)
                     TRUE
                   },
                   ds.name="myDatastore", ore.connect=TRUE, parallel=128)
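For readers without a database connection, the same partition-then-model pattern can be tried locally in base R. This is only a single-process sketch of the idea, not ORE's implementation: it uses the built-in mtcars data with cyl standing in for CUST_ID, and build_one is a hypothetical helper name.

```r
# Local, single-process analogue of the partitioned model build,
# using the built-in mtcars data with cyl standing in for CUST_ID.
build_one <- function(dat) {
  mod <- lm(mpg ~ wt + hp, data = dat)
  # Drop bulky components, as the slide does, to keep stored models small.
  mod$effects <- mod$residuals <- mod$fitted.values <- NULL
  mod
}

# split() partitions the rows by group; lapply() fits one model per partition.
models <- lapply(split(mtcars, mtcars$cyl), build_one)
names(models) <- paste0("mod", names(models))
names(models)
```

ore.groupApply applies the same split-and-apply idea inside the database, with the partitions processed in parallel rather than one after another.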


So, what’s new in ORE 1.5?


Oracle R Enterprise 1.5 – New Features

• Upgraded R version compatibility: R 3.2.0

• Parallel distributed algorithms

– ore.randomForest

– svd

– prcomp

• ore.summary performance enhancement

• ore.grant and ore.revoke on R scripts and datastores

• Datatypes CLOB and BLOB supported for embedded R execution input and output, as well as for ore.create, ore.pull, and ore.push

• ore.groupApply supports partitioning on multiple columns


Upgraded R 3.2.x version compatibility

• ORE 1.5 is certified with R-3.2.0 and later
  – Open source R
  – Oracle R Distribution

• R-3.2.0
  – Performance improvements
  – Big in-memory data objects
  – Compatibility with more than 7000 community-contributed R packages

• Supporting packages for ORE
  – New package: randomForest 4.6-10
  – Updates to other packages
    • arules 1.1-9
    • Cairo 1.5-8
    • DBI 0.3-1
    • png 0.1-7
    • ROracle 1.2-1
    • statmod 1.4.21


Oracle R Enterprise Predictive Analytics algorithms in-Database

Classification
  • Decision Tree
  • Logistic Regression
  • Naïve Bayes
  • RandomForest
  • Support Vector Machine

Regression
  • Linear Model
  • Generalized Linear Model
  • Multi-Layer Neural Networks
  • Stepwise Linear Regression
  • Support Vector Machine

Attribute Importance
  • Minimum Description Length

Clustering
  • Hierarchical k-Means
  • Orthogonal Partitioning Clustering

Feature Extraction
  • Nonnegative Matrix Factorization
  • Principal Component Analysis
  • Singular Value Decomposition

Market Basket Analysis
  • Apriori – Association Rules

Anomaly Detection
  • 1 Class Support Vector Machine

Time Series
  • Single Exponential Smoothing
  • Double Exponential Smoothing

(On the slide, a “New in ORE 1.5” marker highlights the newly added algorithms.)

…plus open source R packages for algorithms in combination with embedded R data- and task-parallel execution


Random Forest Algorithm

• Ensemble learning technique for classification and regression

• Known for high accuracy models

• Constructs many “small” decision trees

• For classification, predicts mode of classes predicted by individual trees

• For regression, predicts mean prediction of individual trees

• Reduces the overfitting that is common for individual decision trees

• Developed by Leo Breiman and Adele Cutler, combining the ideas of “bagging” and random selection of variables, resulting in a collection of decision trees with controlled variance
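The two aggregation rules above (mode of the predicted classes for classification, mean of the predictions for regression) can be illustrated with a toy example in base R. The per-tree predictions below are invented for illustration and do not come from any fitted model.

```r
# Toy illustration of how a forest combines per-tree predictions.
# Rows are observations, columns are trees; all values are invented.
class_votes <- matrix(c("a", "a", "b",
                        "b", "b", "b",
                        "a", "b", "a"),
                      nrow = 3, byrow = TRUE)

# Classification: majority vote (mode) across trees for each observation.
majority <- apply(class_votes, 1, function(v) names(which.max(table(v))))

reg_preds <- matrix(c(1.0, 2.0, 3.0,
                      4.0, 4.0, 7.0),
                    nrow = 2, byrow = TRUE)

# Regression: mean of the individual tree predictions for each observation.
averaged <- rowMeans(reg_preds)
```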


ore.randomForest supports classification

• Enables performance and scalability for larger data sets

• Executes in parallel for model building and scoring

– ore.parallel global option used for preferred DOP

• Oracle R Distribution new randomForest function

– Reduces memory requirements over standard R (~7X)

– As a result, reduces memory requirements for ore.randomForest

– ORD randomForest supports classification only

• Can use Oracle R Distribution’s or R’s randomForest package


ore.randomForest – parallel distributed implementation
Exadata 5-2 – half rack, ORE DOP = 40

[Chart: R vs. ORE Random Forest Build Time. x-axis: # rows (ntree = 500) at 10K, 100K, 1M; y-axis: Time (seconds), log scale from 1 to 10000; series: R and ORE. Data labels: 273, 1389, 6366 and 28, 70, 360. Annotations: “Order of magnitude faster”, “~2.8 hours”, “~17 minutes”.]


ore.randomForest

• ore.randomForest() builds a random forest model by growing trees in parallel

• Scoring method 'predict' runs in parallel

    options(ore.parallel=4)
    IRIS <- ore.push(iris)
    mod <- ore.randomForest(Species~., IRIS)
    tree10 <- grabTree(mod, k = 10, labelVar = TRUE)
    ans <- predict(mod, IRIS, type="all", supplemental.cols="Species", cache.model=FALSE)
    table(ans$Species, ans$prediction)


ore.randomForest Results


Singular Value Decomposition and Principal Component Analysis

• The functions svd and prcomp are overloaded
  – Execute in parallel
  – Accept ore.frame objects

• In-database execution to improve scalability and performance

• No data movement


SVD example using ore.frame

    # Set up the data
    dat <- iris[,-5]; dat$IDX <- seq_len(nrow(dat))
    ore.create(dat, table="DAT")
    ore.exec("alter table DAT add constraint DAT primary key (\"IDX\")")
    ore.sync(table = "DAT", use.keys = TRUE)

    # Compute svd on ore.frame (column 5 of DAT is IDX, so it is dropped)
    sol <- svd(DAT[,-5])
    plot(cumsum(sol$d^2/sum(sol$d^2)))   # % explained variance

    # Derive the U matrix since not provided with model
    sol.U <- as.matrix(DAT[,-5]) %*% (sol$v) %*% diag(1./sol$d)
    class(sol.U)    # ore.tblmatrix

    k <- 1          # use one singular vector
    recon1 <- (sol.U)[,1:k,drop=FALSE] %*%
              diag((sol$d)[1:k,drop=FALSE],nrow=k,ncol=k) %*%
              t((sol$v)[,1:k,drop=FALSE])
    class(recon1)   # ore.tblmatrix

    # myviz() and mat are not defined on this slide
    myviz(mat,recon1,lab1="Iris data", lab2="Recon 1")

Example inspiration: StackExchange Cross Validated
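The same steps can be followed in memory with base R's svd(), which helps when checking the algebra before moving to an ore.frame. This is a local sketch on the iris measurements, not the in-database overload. Unlike the ORE version, base svd() returns U directly, so deriving it here only confirms the identity X = U D V'.

```r
# In-memory analogue using base R's svd() on the four iris measurements.
mat <- as.matrix(iris[, -5])
sol <- svd(mat)

# Cumulative share of the squared singular values (the data is not centered,
# so this is a share of total sum of squares rather than of variance).
explained <- cumsum(sol$d^2 / sum(sol$d^2))

# U can be recovered from X, V and d because X = U D V'.
U <- mat %*% sol$v %*% diag(1 / sol$d)

# Rank-k reconstruction using the first k singular vectors.
k <- 1
recon1 <- U[, 1:k, drop = FALSE] %*%
  diag(sol$d[1:k], nrow = k, ncol = k) %*%
  t(sol$v[, 1:k, drop = FALSE])
dim(recon1)   # same shape as mat
```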


Performance Enhancement – ore.summary

• ore.summary(data, var, stats = c("n", "mean", "min", "max"), class = NULL, types = NULL, ways = NULL, weight = NULL, order = NULL, maxid = NULL, minid = NULL, mu = 0, no.type = FALSE, no.freq = FALSE)

• More than an order of magnitude performance improvement

    > options("ore.parallel")
    $ore.parallel
    [1] 40

    > system.time(res <- ore.summary(ONTIME_10M, var=c("ARRDELAY","DEPDELAY","DISTANCE"),
    +     class=c("YEAR", "MONTH", "DAYOFWEEK", "UNIQUECARRIER", "CANCELLED"), order="-type"))
       user  system elapsed
      0.018   0.000  17.248

    > system.time(res <- ore.summary(ONTIME_1B, var=c("ARRDELAY","DEPDELAY","DISTANCE"),
    +     class=c("YEAR", "MONTH", "DAYOFWEEK", "UNIQUECARRIER", "CANCELLED"), order="-type"))
       user  system elapsed
      0.016   0.000  55.141


ore.summary

• Wide range of statistical functions available for stats argument
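Several of the statistics accepted by the stats argument have direct base R counterparts, which can help when interpreting the names. The sketch below computes a few of them for a small invented vector. It runs locally and does not call ore.summary; whether ore.summary uses the same quantile definition as R's default is an assumption.

```r
# Base R counterparts of a few of the statistics, for a small invented vector.
x  <- c(2, 4, 4, 5, 7, 9, NA)
mu <- 4                                   # comparison value for the "loccount" family

n      <- sum(!is.na(x))                  # "n": count of non-missing values
nmiss  <- sum(is.na(x))                   # "nmiss": count of missing values
xs     <- x[!is.na(x)]
css    <- sum((xs - mean(xs))^2)          # "css": corrected sum of squares
uss    <- sum(xs^2)                       # "uss": uncorrected sum of squares
stderr_mean <- sd(xs) / sqrt(n)           # "stderr": standard error of the mean
loc_gt <- sum(xs > mu)                    # "loccount>": values greater than mu
iqr    <- IQR(xs)                         # "qrange": Q3 - Q1 (R's default quantile type)
```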

Statistic                     Description
"n" or "freq"                 Count of non-missing values
"count" or "cnt"              Count of all observations
"nmiss"                       Count of missing values
"mean" or "avg"               Average of values
"min"                         Minimum of values
"max"                         Maximum of values
"css"                         Corrected sum of squares
"uss"                         Uncorrected sum of squares
"cv"                          Coefficient of variation
"sum"                         Sum of values
"sumwgt"                      Weighted sum of values
"range"                       Range of values
"stddev" or "std"             Standard deviation of values
"stderr" or "stdmean"         Standard error for the mean
"variance" or "var"           Variance of values
"kurtosis" or "kurt"          Kurtosis
"skewness" or "skew"          Skewness
"loccount<" or "loc<"         # observations whose values < supplied mu
"loccount>" or "loc>"         # observations whose values > supplied mu
"loccount!" or "loc!"         # observations whose values != supplied mu
"loccount" or "loc"           # observations whose values == supplied mu
"qrange" or "iqr"             Interquartile range, Q3-Q1
"mode"                        Most frequently occurring value
"lclm"                        2-sided left confidence limit with confidence level of interval = 0.95
"rclm"                        2-sided right confidence limit with confidence level of interval = 0.95
"clm"                         2-sided confidence interval with confidence level of interval = 0.95
"t"                           Student's t-test statistic
"probt" or "prt"              Two-tailed p-value for Student's t-test

Percentile or quantile types: "p0", "p1", "p5", "p10", "p25" or "q1", "p50" or "q2" or "median", "p75" or "q3", "p90", "p95", "p99", "p100"


ORE 1.5 Datastore – grant and revoke

• Save and load R objects using Oracle Database for persistence

• In ORE 1.4.1, each schema has a single datastore table that stores all named datastores

• In ORE 1.5, users can provide read-only access to datastores created “grantable”

– “Grantable” datastores created as individual tables in the user’s schema

– “Private” datastores still reside in a common table in the user’s schema

• Functions

– ore.save, ore.load

– ore.datastore, ore.datastoreSummary

– ore.delete

– ore.grant, ore.revoke


Datastore – granting access

[Diagram: Client 1 and Client 2 connect to Oracle Database, which holds a User1 schema and a User2 schema. Private datastores are schema local; shared datastores are stored as one table per datastore. Labels on the diagram: rq$datastoreinventory, ore$ds2_20, “Grant read access to User2”, User1.ore$ds2_20, and the data frames iris (ds_1) and mtcars.]

    ore.save(iris, name="ds_1")
    ore.save(mtcars, name="ds_2", grantable=TRUE)
    ore.grant(name="ds_2", type="datastore", user="User2")
    ore.datastore(type="all")


R Script Repository – granting access

• To create/drop scripts, user must have RQADMIN role

• Ordinary ORE users are not allowed to create R scripts at database server

• RQADMIN user can drop any global R script

– Only the script creator can drop user scripts, i.e., those created with ore.scriptCreate(global = FALSE)

– A user can grant access to a script to individuals or to all (public)

• Any ORE user can execute (global or granted) repository R scripts

• Determining script visibility – ore.scriptList(type = …)

– "global" shows global scripts

– "user" shows all scripts created by the current user with global=FALSE

– "all" shows global and user scripts

– "grant" shows scripts to which the user has granted others access

– "granted" shows scripts to which the user has been granted access


User-defined R functions – shared functions

    # create an R script for the current user
    ore.scriptCreate("privateFunction",
                     function(data, formula, ...) lm(formula, data, ...))

    # create a global R script available to any user
    ore.scriptCreate("globalFunction",
                     function(data, formula, ...) glm(formula=formula, data=data, ...),
                     global = TRUE)

    # list R scripts
    ore.scriptList()$NAME    # type = "user" default
    ore.scriptList(pattern="Function", type="all")$NAME


User-defined R Functions – ore.grant and ore.revoke

    # load an R script by name to an R function object
    ore.scriptLoad(name="privateFunction")
    ore.scriptLoad(name="globalFunction", newname="privateFunction2")

    # grant and revoke R script read privilege to and from public
    ore.grant(name = "privateFunction", type = "rqscript")
    ore.scriptList(pattern="Funct", type="grant")$NAME
    ore.revoke(name = "privateFunction", type = "rqscript")
    ore.scriptList(pattern="Funct", type="grant")$NAME

    # drop an R script
    ore.scriptDrop("privateFunction")
    ore.scriptDrop("globalFunction", global=TRUE)
    ore.scriptList(pattern="Funct", type="all")$NAME


ore.groupApply – multi-column INDEX

• INDEX: An ore.vector or ore.frame object containing ore.factor objects or columns, each of which is the same length as argument 'X'. It is used to partition the data in 'X' before sending it to function 'FUN'

    res <- ore.groupApply(ONTIME_S[c(7,12,18,19,22)],
               INDEX = ONTIME_S[,c(7,12)],   # day of week, unique carrier
               function(df) {
                 # Skip empty partitions; otherwise return the two keys and the model summary
                 if (nrow(df) == 0)
                   NULL
                 else
                   list(df[1,1], df[1,2],
                        summary(lm(ARRDELAY ~ DEPDELAY+DISTANCE, data=df)))
               },
               parallel = 4)
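Partitioning on more than one column can be tried locally with base R's split(), which accepts a list of grouping vectors. The sketch below is a single-process analogue on the built-in mtcars data, with cyl and am standing in for day of week and carrier. It illustrates the partitioning idea only and does not use ORE.

```r
# Local analogue: partition mtcars on two columns (cyl, am) and fit per group.
# drop = TRUE omits combinations of the two keys that have no rows.
groups <- split(mtcars, list(mtcars$cyl, mtcars$am), drop = TRUE)

res <- lapply(groups, function(df) {
  if (nrow(df) == 0) return(NULL)
  # Return the two partition keys along with the model summary, as on the slide.
  list(cyl = df$cyl[1], am = df$am[1],
       fit = summary(lm(mpg ~ wt, data = df)))
})
names(res)   # one entry per (cyl, am) combination present in the data
```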


Oracle R Enterprise 1.5 – Summary

• Upgraded R version compatibility: R 3.2.0

• Parallel distributed algorithms

– ore.randomForest

– svd

– prcomp

• ore.summary performance enhancement

• ore.grant and ore.revoke on R scripts and datastores

• Datatypes CLOB and BLOB supported for embedded R execution input and output, as well as for ore.create, ore.pull, and ore.push

• ore.groupApply supports partitioning on multiple columns


To Learn More about Oracle’s R Technologies…

http://oracle.com/goto/R
