Copyright © 2016 Oracle and/or its affiliates. All rights reserved. |
Mark Hornick January 2016
Oracle Advanced Analytics Oracle R Enterprise 1.5 – Hot New Features
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Agenda
• Oracle’s Advanced Analytics and R Technologies
• Overview of Oracle R Enterprise
• ORE 1.5 Features
Oracle’s Advanced Analytics
Multiple interfaces across platforms: SQL, R, GUI, Dashboards, Apps
• Users: Data / Business Analysts, R programmers, Business Analysts/Mgrs, Domain End Users
• Interfaces: SQL Developer / Oracle Data Miner, R Client, OBIEE, Applications
• Oracle Advanced Analytics (Database Option): SQL Data Mining & Analytic Functions + R Integration for Scalable, Distributed, Parallel in-Database ML Execution
• Platform: Oracle Database 12c (Oracle Database Enterprise Edition), Oracle Cloud, Hadoop (ORAAH parallel, distributed algorithms)
Oracle’s R Technologies
Supporting R, Oracle Database, and Big Data Appliance/Hadoop
• Oracle R Distribution
• ROracle
• Oracle R Enterprise: component of the Oracle Advanced Analytics Option to Oracle Database
• Oracle R Advanced Analytics for Hadoop: component of the Big Data Connectors Software Suite
Software available to R Community for free
Oracle R Enterprise
Component of Oracle Advanced Analytics option
• Scale R to Big Data
• Use Oracle Database as HPC environment
• Use in-database parallel and distributed machine learning algorithms
• Manage R scripts and R objects in Oracle Database
• Integrate R results into applications and dashboards via SQL
[Architecture diagram. Labels: Client R Engine with ORE packages; Database Server Machine hosting Oracle Database (user tables, in-db stats); SQL Interfaces (SQL*Plus, SQL Developer, …)]
IoT Use Case: Energy Demand
• Model each customer’s usage to understand behavior and predict individual usage and overall aggregate demand
• 200 thousand households, each with a utility “smart meter”
• 1 reading / meter / hr
• 200K x 8760 hrs / yr = 1.752B readings per year
• 3 years’ worth of data = 5.256B readings
• Each customer has 26,280 readings
• If each model takes 10 seconds to build, building all 200K models serially takes 555.6 hrs (23.1 days); with 128 DOP, about 4.3 hrs
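The arithmetic behind these figures can be checked with a few lines of base R. This is only a sketch: the variable names are illustrative, and it assumes the 10-second build time quoted above and ideal scaling across 128 parallel processes.

```r
# Back-of-the-envelope check of the smart meter figures (base R only).
households        <- 200e3      # smart meters
readings_per_year <- 24 * 365   # one reading per meter per hour = 8760
years             <- 3
secs_per_model    <- 10         # assumed build time per customer model
dop               <- 128        # degree of parallelism

readings_1yr      <- households * readings_per_year   # 1.752e9
readings_3yr      <- readings_1yr * years             # 5.256e9
readings_per_cust <- readings_per_year * years        # 26280

serial_hours   <- households * secs_per_model / 3600  # about 555.6
serial_days    <- serial_hours / 24                   # about 23.1
parallel_hours <- serial_hours / dop                  # about 4.3 with ideal scaling

print(c(serial_hours = serial_hours, serial_days = serial_days,
        parallel_hours = parallel_hours))
```

Real speedup depends on data distribution and server resources, so the parallel figure is best read as a lower bound on elapsed time.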
Scalable Analysis – Model Building
Smart meter scenario
[Diagram labels: Oracle Database + ORE; Data partitions c1, c2, …, ci, …, cn; R Script “build model” f(dat,args,…) applied to each partition; output Model c1, Model c2, …, Model ci, …, Model cn; R Datastore; R Script Repository]
Build models, partitioned on CUST_ID, and store in database
ore.groupApply(CUST_USAGE_DATA,
               CUST_USAGE_DATA$CUST_ID,
  function(dat, ds.name) {
    cust_id <- dat$CUST_ID[1]
    # Fit a linear model on every column except the partitioning key
    mod <- lm(Consumption ~ . -CUST_ID, dat)
    # Drop per-row components to shrink the stored model
    mod$effects <- mod$residuals <- mod$fitted.values <- NULL
    name <- paste("mod", cust_id, sep="")
    assign(name, mod)
    # Save each model in its own datastore, named <ds.name>.<cust_id>
    ds.name1 <- paste(ds.name, ".", cust_id, sep="")
    ore.save(list=name, name=ds.name1, overwrite=TRUE)
    TRUE
  },
  ds.name="myDatastore", ore.connect=TRUE, parallel=128
)
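For readers without a database connection, the same partition-then-model pattern can be sketched in memory with base R. This is an analog, not ORE itself: the `usage` data frame, its `Temp` and `Hour` columns, and the `build_model` helper are invented for illustration, and `split()` plus `lapply()` stand in for the in-database partitioning and parallel execution.

```r
# In-memory analog of the per-customer model build (base R only; no database).
set.seed(1)
usage <- data.frame(
  CUST_ID = rep(1:3, each = 50),
  Temp    = rnorm(150, mean = 15, sd = 8),
  Hour    = rep(0:23, length.out = 150)
)
usage$Consumption <- 2 + 0.1 * usage$Temp + 0.05 * usage$Hour +
  rnorm(150, sd = 0.2)

build_model <- function(dat) {
  mod <- lm(Consumption ~ . - CUST_ID, data = dat)
  # Drop large per-row components to keep the stored model small
  mod$effects <- mod$residuals <- mod$fitted.values <- NULL
  mod
}

# split() partitions the rows by customer; lapply() builds one model per partition
models <- lapply(split(usage, usage$CUST_ID), build_model)
names(models) <- paste0("mod", names(models))

# Each model keeps its coefficients even after the per-row parts are removed
sapply(models, function(m) length(coef(m)))
```

The trimmed models still carry their coefficients, which is what makes them cheap to persist one per customer.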
So, what’s new in ORE 1.5?
Oracle R Enterprise 1.5 – New Features
• Upgraded R version compatibility: R 3.2.0
• Parallel distributed algorithms
– ore.randomForest
– svd
– prcomp
• ore.summary performance enhancement
• ore.grant and ore.revoke on R scripts and datastores
• Datatypes CLOB and BLOB supported for embedded R execution input and output, as well as for ore.create, ore.pull, and ore.push
• ore.groupApply supports partitioning on multiple columns
Upgraded R 3.2.x version compatibility
• ORE 1.5 is certified for R >= 3.2.0
– Open source R
– Oracle R Distribution
• R-3.2.0
– Performance improvements
– Big in-memory data objects
– Compatibility with more than 7000 community-contributed R packages
• Supporting packages for ORE
– New package: randomForest 4.6-10
– Updates to other packages
• arules 1.1-9
• Cairo 1.5-8
• DBI 0.3-1
• png 0.1-7
• ROracle 1.2-1
• statmod 1.4.21
Oracle R Enterprise Predictive Analytics algorithms in-Database
• Classification: Decision Tree, Logistic Regression, Naïve Bayes, RandomForest, Support Vector Machine
• Regression: Linear Model, Generalized Linear Model, Multi-Layer Neural Networks, Stepwise Linear Regression, Support Vector Machine
• Attribute Importance: Minimum Description Length
• Clustering: Hierarchical k-Means, Orthogonal Partitioning Clustering
• Feature Extraction: Nonnegative Matrix Factorization, Principal Component Analysis, Singular Value Decomposition
• Market Basket Analysis: Apriori – Association Rules
• Anomaly Detection: 1 Class Support Vector Machine
• Time Series: Single Exponential Smoothing, Double Exponential Smoothing
Legend: “New in ORE 1.5” marks the newly added algorithms
…plus open source R packages for algorithms in combination with embedded R data- and task-parallel execution
Random Forest Algorithm
• Ensemble learning technique for classification and regression
• Known for high accuracy models
• Constructs many “small” decision trees
• For classification, predicts mode of classes predicted by individual trees
• For regression, predicts mean prediction of individual trees
• Reduces overfitting, which is common for individual decision trees
• Developed by Leo Breiman and Adele Cutler combining the ideas of “bagging” and random selection of variables resulting in a collection of decision trees with controlled variance
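The two aggregation rules in the list above (mode for classification, mean for regression) can be sketched in base R. The per-tree predictions below are made-up values, not output from a fitted forest, and `majority_vote` is a hypothetical helper name.

```r
# Combining per-tree predictions (illustrative values, not from a fitted forest).
# Rows are observations, columns are individual trees.
class_votes <- matrix(c("setosa",     "setosa",     "versicolor",
                        "virginica",  "versicolor", "virginica",
                        "versicolor", "versicolor", "versicolor"),
                      nrow = 3, byrow = TRUE)
reg_preds <- matrix(c(1.0, 1.2, 0.8,
                      3.1, 2.9, 3.0), nrow = 2, byrow = TRUE)

# Classification: the most frequent class across trees (the mode).
# Ties resolve to the alphabetically first class in this simple version.
majority_vote <- function(votes) names(which.max(table(votes)))
forest_class <- apply(class_votes, 1, majority_vote)

# Regression: the mean of the per-tree predictions
forest_reg <- rowMeans(reg_preds)

print(forest_class)
print(forest_reg)
```

Averaging over many decorrelated trees is what keeps the variance of the ensemble lower than that of any single tree.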
ore.randomForest supports classification
• Enables performance and scalability for larger data sets
• Executes in parallel for model building and scoring
– ore.parallel global option used for preferred DOP
• Oracle R Distribution new randomForest function
– Reduces memory requirements over standard R (~7X)
– As a result, reduces memory requirements for ore.randomForest
– ORD randomForest supports classification only
• Can use Oracle R Distribution’s or R’s randomForest package
ore.randomForest – parallel distributed implementation
Exadata 5-2 – half rack, ORE DOP = 40
[Chart: “R vs. ORE Random Forest Build Time”. x-axis: # rows (10K, 100K, 1M), ntree = 500; y-axis: Time (seconds), ticks from 1 to 10000. Series: R and ORE. Data labels: 273, 1389, 6366 and 28, 70, 360. Annotations: “Order of magnitude faster”, “~2.8 hours”, “~17 minutes”.]
ore.randomForest
• ore.randomForest() builds a random forest model by growing trees in parallel
• Scoring method 'predict' runs in parallel
options(ore.parallel=4)
IRIS <- ore.push(iris)
mod <- ore.randomForest(Species~., IRIS)
tree10 <- grabTree(mod, k = 10, labelVar = TRUE)
ans <- predict(mod,IRIS,type="all",supplemental.cols="Species", cache.model=FALSE)
table(ans$Species, ans$prediction)
ore.randomForest Results
Singular Value Decomposition and Principal Component Analysis
• The functions svd and prcomp are overloaded
– Execute in parallel
– Accept ore.frame objects
• In-database execution to improve scalability and performance
• No data movement
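As a reminder of how the two functions relate, here is a small base R sketch on an in-memory matrix. It does not use ORE; per the bullets above, the same `svd` and `prcomp` calls would be applied to an ore.frame when connected to the database.

```r
# Base R illustration on an in-memory matrix (no database connection needed).
x <- as.matrix(iris[, 1:4])

# PCA via prcomp (centers the columns by default)
pca <- prcomp(x)

# The same quantities from the SVD of the centered data
xc  <- scale(x, center = TRUE, scale = FALSE)
dec <- svd(xc)
sdev_from_svd <- dec$d / sqrt(nrow(x) - 1)

# Standard deviations of the principal components agree
print(all.equal(pca$sdev, sdev_from_svd))

# Proportion of variance explained by each component
print(round(dec$d^2 / sum(dec$d^2), 3))
```

PCA is therefore an SVD of the centered data with the singular values rescaled by sqrt(n - 1).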
SVD example using ore.frame
# Set up the data
dat <- iris[,-5]; dat$IDX <- seq_len(nrow(dat))
ore.create(dat, table="DAT")
ore.exec("alter table DAT add constraint DAT primary key (\"IDX\")")
ore.sync(table = "DAT", use.keys = TRUE)
# Compute svd on ore.frame (column 5 is the IDX key)
sol <- svd(DAT[,-5])
plot(cumsum(sol$d^2/sum(sol$d^2))) # cumulative proportion of explained variance
# Derive the U matrix since not provided with model
sol.U <- as.matrix(DAT[,-5]) %*% (sol$v) %*% diag(1/sol$d)
class(sol.U) # ore.tblmatrix
k <- 1 # use one singular vector
recon1 <- (sol.U)[,1:k,drop=FALSE] %*%
  diag((sol$d)[1:k,drop=FALSE],nrow=k,ncol=k) %*%
  t((sol$v)[,1:k,drop=FALSE])
class(recon1) # ore.tblmatrix
# myviz() and mat are not defined on this slide: myviz is a plotting helper,
# and mat is presumably the original data as a matrix
myviz(mat,recon1,lab1="Iris data", lab2="Recon 1")
Example inspiration: StackExchange Cross Validated
Performance Enhancement – ore.summary
• ore.summary(data, var, stats = c("n", "mean", "min", "max"), class = NULL, types = NULL, ways = NULL, weight = NULL, order = NULL, maxid = NULL, minid = NULL, mu = 0, no.type = FALSE, no.freq = FALSE)
• More than an order of magnitude performance improvement
> options("ore.parallel")
$ore.parallel
[1] 40
> system.time(res <- ore.summary(ONTIME_10M, var=c("ARRDELAY","DEPDELAY","DISTANCE"),
+ class=c("YEAR", "MONTH", "DAYOFWEEK", "UNIQUECARRIER", "CANCELLED"), order="-type"))
user system elapsed
0.018 0.000 17.248
> system.time(res <- ore.summary(ONTIME_1B, var=c("ARRDELAY","DEPDELAY","DISTANCE"),
+ class=c("YEAR", "MONTH", "DAYOFWEEK", "UNIQUECARRIER", "CANCELLED"), order="-type"))
user system elapsed
0.016 0.000 55.141
ore.summary
• Wide range of statistical functions available for stats argument
"n" or "freq": Count of non-missing values
"count" or "cnt": Count of all observations
"nmiss": Count of missing values
"mean" or "avg": Average of values
"min": Minimum of values
"max": Maximum of values
"css": Corrected sum of squares
"uss": Uncorrected sum of squares
"cv": Coefficient of variation
"sum": Sum of values
"sumwgt": Weighted sum of values
"range": Range of values
"stddev" or "std": Standard deviation of values
"stderr" or "stdmean": Standard error for the mean
"variance" or "var": Variance of values
"kurtosis" or "kurt": Kurtosis
"skewness" or "skew": Skewness
"loccount<" or "loc<": # observations whose values < supplied mu
"loccount>" or "loc>": # observations whose values > supplied mu
"loccount!" or "loc!": # observations whose values != supplied mu
"loccount" or "loc": # observations whose values == supplied mu
Percentiles: "p0", "p1", "p5", "p10", "p25" or "q1", "p50" or "q2" or "median", "p75" or "q3", "p90", "p95", "p99", "p100": Percentile or quantile
"qrange" or "iqr": Interquartile range, Q3-Q1
"mode": Most frequently occurring value
"lclm": 2-sided left confidence limit with confidence level of interval = 0.95
"rclm": 2-sided right confidence limit with confidence level of interval = 0.95
"clm": 2-sided confidence interval with confidence level of interval = 0.95
"t": Student's t-test statistic
"probt" or "prt": Two-tailed p-value for student's t-test
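Several of the less familiar ore.summary statistic names correspond to standard textbook quantities. The base R sketch below computes a few of them on a small in-memory vector; it does not call ore.summary, and the exact conventions ORE uses (for example, whether "cv" is reported as a percentage) are assumptions here.

```r
# Base R versions of a few statistics, using standard textbook definitions.
x  <- c(2, 4, 4, 4, 5, 5, 7, 9)
mu <- 4

n           <- sum(!is.na(x))          # "n": count of non-missing values
uss         <- sum(x^2)                # "uss": uncorrected sum of squares
css         <- sum((x - mean(x))^2)    # "css": corrected sum of squares
stddev      <- sd(x)                   # "stddev": sample standard deviation
stderr_mean <- stddev / sqrt(n)        # "stderr": standard error of the mean
cv          <- 100 * stddev / mean(x)  # "cv": as a percentage (assumed convention)
qrange      <- IQR(x)                  # "qrange": Q3 - Q1 (R's default quantile type)
loc_lt      <- sum(x < mu)             # "loccount<": values below mu

print(c(n = n, uss = uss, css = css, loc_lt = loc_lt))
```

The corrected and uncorrected sums of squares differ by n times the squared mean, which is a quick consistency check on any summary output.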
ORE 1.5 Datastore – grant and revoke
• Save and load R objects using Oracle Database for persistence
• In ORE 1.4.1, each schema has a single datastore table that stores all named datastores
• In ORE 1.5, users can provide read-only access to datastores created “grantable”
– “Grantable” datastores created as individual tables in the user’s schema
– “Private” datastores still reside in a common table in the user’s schema
• Functions
– ore.save, ore.load
– ore.datastore, ore.datastoreSummary
– ore.delete
– ore.grant, ore.revoke
Datastore – granting access
[Diagram labels: Client 1 and Client 2 connect to Oracle Database, which contains User1 Schema and User2 Schema. Private Datastores (schema local): rq$datastoreinventory, ds_1 (iris df). Shared Datastores, one table per datastore: ore$ds2_20 (mtcars df). Grant read access to User2: User1.ore$ds2_20]
ore.save(iris, name="ds_1")
ore.save(mtcars, name="ds_2", grantable=TRUE)
ore.grant(name="ds_2", type="datastore", user="User2")
ore.datastore(type="all")
R Script Repository – granting access
• To create/drop scripts, user must have RQADMIN role
• Ordinary ORE users are not allowed to create R scripts at database server
• RQADMIN user can drop any global R script
– Only the script creator can drop user scripts, i.e., those created with ore.scriptCreate(global = FALSE)
– A user can grant access to a script to individuals or to all (public)
• Any ORE user can execute (global or granted) repository R scripts
• Determining script visibility – ore.scriptList(type = …)
– "global" shows global scripts
– "user" shows all scripts created by the current user with global=FALSE
– "all" shows global and user scripts
– "grant" shows scripts to which the user has granted others access
– "granted" shows scripts to which the user has been granted access
User-defined R functions – shared functions
# create an R script for the current user
ore.scriptCreate("privateFunction",
  function(data, formula, ...) lm(formula, data, ...))
# create a global R script available to any user
ore.scriptCreate("globalFunction",
  function(data, formula, ...) glm(formula=formula, data=data, ...),
  global = TRUE)
# list R scripts
ore.scriptList()$NAME # default type = "user"
ore.scriptList(pattern="Function", type="all")$NAME
User-defined R Functions – ore.grant and ore.revoke
# load an R script by name to an R function object
ore.scriptLoad(name="privateFunction")
ore.scriptLoad(name="globalFunction", newname="privateFunction2")
# grant and revoke R script read privilege to and from public
ore.grant(name = "privateFunction", type = "rqscript")
ore.scriptList(pattern="Funct", type="grant")$NAME
ore.revoke(name = "privateFunction", type = "rqscript")
ore.scriptList(pattern="Funct", type="grant")$NAME
# drop an R script
ore.scriptDrop("privateFunction")
ore.scriptDrop("globalFunction", global=TRUE)
ore.scriptList(pattern="Funct", type="all")$NAME
ore.groupApply – multi-column INDEX
• INDEX: An ore.vector or ore.frame object containing ore.factor objects or columns, each of which is the same length as argument 'X'. It is used to partition the data in 'X' before sending it to function 'FUN'.
res <- ore.groupApply(ONTIME_S[c(7,12,18,19,22)],
  INDEX = ONTIME_S[,c(7,12)], # day of week, unique carrier
  function(df) {
    # Return the two partition key values and the fitted model summary
    if (nrow(df) == 0)
      NULL
    else
      list(df[1,1], df[1,2],
           summary(lm(ARRDELAY ~ DEPDELAY+DISTANCE, data=df)))
  },
  parallel = 4)
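The effect of a multi-column INDEX can be pictured with a base R analog that partitions an in-memory data frame on two columns. This sketch uses synthetic data: the `flights` data frame is invented, and `split()` with a list of two columns stands in for the in-database partitioning that ore.groupApply performs.

```r
# In-memory analog of partitioning on two columns (base R only; synthetic data).
set.seed(42)
flights <- data.frame(
  DAYOFWEEK     = rep(1:2, each = 100),
  UNIQUECARRIER = rep(c("AA", "DL"), times = 100),
  DEPDELAY      = rnorm(200, mean = 10, sd = 5),
  DISTANCE      = runif(200, 100, 2000)
)
flights$ARRDELAY <- flights$DEPDELAY + 0.001 * flights$DISTANCE + rnorm(200)

# One partition per (day of week, carrier) combination; drop = TRUE skips empty ones
parts <- split(flights, list(flights$DAYOFWEEK, flights$UNIQUECARRIER),
               drop = TRUE)

# Apply the same function to every partition
res <- lapply(parts, function(df) {
  list(df$DAYOFWEEK[1], df$UNIQUECARRIER[1],
       summary(lm(ARRDELAY ~ DEPDELAY + DISTANCE, data = df)))
})

print(names(res))
```

Each element of `res` holds the two key values plus the model summary for one combination, mirroring the list returned per partition in the ORE call.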
Oracle R Enterprise 1.5 – Summary
• Upgraded R version compatibility: R 3.2.0
• Parallel distributed algorithms
– ore.randomForest
– svd
– prcomp
• ore.summary performance enhancement
• ore.grant and ore.revoke on R scripts and datastores
• Datatypes CLOB and BLOB supported for embedded R execution input and output, as well as for ore.create, ore.pull, and ore.push
• ore.groupApply supports partitioning on multiple columns
To Learn More about Oracle’s R Technologies…
http://oracle.com/goto/R