#TDPARTNERS16 GEORGIA WORLD CONGRESS CENTER
TeradataAsterR: Scaling R for In-Database Analytics
Praveenkumar Kondikoppa Senior Software Engineer
Teradata Aster
Agenda
• Key Motivations
• Overview of TeradataAsterR Package
• Table Access, Data Exploration
• R functions for Aster Analytical Foundation
• R Map Reduce Runners
Motivation
Database access from R
[Diagram: data moves between R (import/export) and the database (export/load)]
• Single threaded
• Network latency
• Memory limitation
TeradataAsterR Package
• Avoid data movement by performing analytics where the data resides and is managed
• Scale R operations via in-database SQL-MR execution (data parallelism)
• Seamlessly push down R operations into SQL, SQL-MR, SQL-GR
Overview of TeradataAsterR
TeradataAsterR Packages
R interfaces for Analytic Foundation
Push-down R data access and exploration function
Bulk load/export
In-DB R script runners
In-DB R stat function
High Level Description Contd.
[Architecture diagram. Labels: R Client, Table access functions, R function push-down, R Map-reduce runners, Bulk data import/export, Aster Analytic Foundation Wrappers, RODBC, ODBC, Mule Copy, Aster Cluster]
TeradataAsterR overview
• Wrap Aster database objects as R objects
• Introduction of new Aster objects
• Virtual Data Frame
• Extension of the regular R data frame
• Maps to a table/view/query in the Aster database
• Class attribute is ta.data.frame
Virtual Data Frame
vdf <- ta.data.frame("Employee")
[Diagram: vdf in the TeradataAsterR package points to the Employee table in the Aster DB]
• vdf is an in-memory reference to the Employee table; the rows stay in the database
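The idea of an object that points at a table without holding its rows can be sketched in plain R. This is only an illustration of the concept, not the TeradataAsterR implementation: the names virtual_table and sql_for_nrow are hypothetical, and the generated SQL is an assumption about what a pushed-down row count could look like.

```r
# Hypothetical sketch: none of these names come from TeradataAsterR.
# A "virtual" table object holds only metadata, never the rows themselves.
virtual_table <- function(table_name, schema = "public") {
  structure(list(table = table_name, schema = schema),
            class = "virtual_table")
}

# Build the SQL text that a row-count request could be pushed down as.
sql_for_nrow <- function(vt) {
  sprintf("SELECT COUNT(*) FROM %s.%s", vt$schema, vt$table)
}

# Printing shows the reference, not the data.
print.virtual_table <- function(x, ...) {
  cat("<virtual_table ", x$schema, ".", x$table, ">\n", sep = "")
  invisible(x)
}

vt <- virtual_table("Employee")
print(vt)
sql_for_nrow(vt)
```

Because the object carries only a schema and table name, creating it costs nothing regardless of table size; work happens only when an operation is translated and sent to the database.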
Table Access and Data Exploration
• Operators provide similar functionality as R data frame
• Create Aster R object
• ta.data.frame
• Conversion to R objects
• as.data.frame
• as.matrix
• as.vector
• factor(as.vector(tadf$column)) – For factor conversion
• Conversion to Aster R objects
• as.ta.data.frame
• as.ta.factor
• as.ta.vector
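The factor conversion bullet relies on ordinary R behaviour once a column is on the client. A minimal local sketch, assuming as.vector has already returned a plain R vector; the built-in mtcars column stands in for tadf$column, so no Aster connection is involved:

```r
# Stand-in for a column already pulled to the client; mtcars ships with R.
cyl_values <- as.vector(mtcars$cyl)

# factor() turns the plain vector into a categorical variable.
cyl_factor <- factor(cyl_values)

levels(cyl_factor)   # distinct categories
table(cyl_factor)    # count of rows per category
```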
Functions to Explore Data
• print
• print.ta.data.frame
• print.ta.data.frame.ordered
• print.ta.factor
• print.ta.list
• Print Rows
• ta.show
• ta.head
• Explore Structure
• ta.dim
• ta.colnames
• ta.dimnames
• ta.length
• ta.nrow
• ta.ncol
Data Exploration : Example
> conn <- ta.connect("AsterDataDSN", database="retail")
>
> ta.dbObject.ls("public")
[1] "mtcarstable" "product" "product_dim" "sales2014" "sales_fact"
>
> sales <- ta.data.frame("sales2014")
> ta.dim(sales)
[1] 1398104 6
>
> tadf1 <- ta.data.frame("mtcarstable", schemaName="public")
> ta.names(tadf1)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
R functions for Aster Analytics
• Aster Analytics Foundation (AF) has highly scalable machine learning algorithms
• Provide R functions as wrappers to AF
• R users are shielded from SQL-MR syntax
R functions for Aster Analytics
• Function output is transparently mapped to an R object
• Function signature is similar to the native R call
Native R      TeradataAsterR
glm           ta.glm
kmeans        ta.kmeans
• Internally, a SQL-MR call is generated and executed on the server through RODBC
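For reference, the native R side of the mapping above looks like this. The sketch runs locally on the built-in mtcars data set and uses only base R (stats::glm and stats::kmeans); the choice of columns and of three clusters is arbitrary and purely illustrative.

```r
# Native R calls on the built-in mtcars data set (local, in-memory).
# Logistic regression: transmission type (am) as a function of weight (wt).
fit <- glm(am ~ wt, family = binomial(), data = mtcars)
coef(fit)

# k-means on two numeric columns; kmeans starts from random centers,
# so fix the seed to make the run repeatable.
set.seed(42)
clusters <- kmeans(mtcars[, c("mpg", "hp")], centers = 3)
clusters$size
```

The ta.* wrappers keep a similar formula/family/data shape, which is why existing R code needs few changes when the data stays in the database.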
R for Aster Analytics : Example
> knn_train <- ta.data.frame("knn_test1")
> knn_test <- ta.data.frame("knn_test2")
> knn_out <- ta.knn(train=knn_train, test=knn_test, k=3, distance.column=c("sepl","sepw","petl","petw"),
+ response.column=c("class"), test.key="id")
>
> ta.head(knn_out[[1]])
  id class
1  2     s
2  4     s
3  6     s
4  8     s
5 10     s
6 12     s
R for Aster Analytics : Example
> hist <- ta.data.frame("am_histogram_data")
> ta.head(hist)
  memberid            name weight          time_stamp graduate
1      300   Johann August     19 1989-02-22 09:50:22        0
2      400 Martin Heinrich     20 1989-02-24 09:50:22        0
3      500    Ralph Arthur     25 1989-01-20 09:50:22        1
4      800    Joseph Louis     28 1999-02-18 09:10:22        0
5      900     Marie Curie     12 1989-02-18 08:10:22        1
6      100 Henry Cavendish     12 1989-02-20 09:50:22        0
R for Aster Analytics : Example
> test1 <- ta.data.frame("lm_test1")
> glm_out <- ta.glm(formula=(damage ~ temp), family=binomial(), data=test1,
+ threshold=0.01, weight=6, max.iterations=10)
> glm_out$coefficient
  attribute   predictor category   estimate    std_err  z_score     p_value
1        -1      Loglik       NA -14.837400 23.0000000  1.00000 0.000000000
2         0 (Intercept)       NA  11.663000  3.2962699  3.53823 0.000402812
3         1        temp       NA  -0.216234  0.0531772 -4.06628 0.000047769
  significance
1         <NA>
2          ***
3          ***
R Map Reduce Runners
• Extend R by adding Aster specific constructs
• Get translated to SQLMR stream API
• Provides functionality to run R scripts in parallel
• R needs to be installed on each worker node
R Map Reduce Runners
Key Features
• Simple way to deploy R scripts on cluster
• ta.install.scripts(); ta.remove.scripts()
• ta.install.files(); ta.remove.files()
• Supports R scripts that operate on row, column, partition or table
R Map Reduce Runners
• Support the various output formats of R functions
• R object: output as native R objects
• Table: insert into an Aster table
• JSON/XML/CSV: output as JSON, XML or CSV format
• Graphic: output as PNG graphics (plot, histogram, pie, etc.)
R Map Reduce Runners
Key Features
• Support user defined R functions
• Support running R functions that do not take a table input
• Control the amount of memory available for the in-database R engine
R Map Reduce Runners
Example functions
• ta.eval(FUN, ...)
• Executes a function and generates the result in Aster
• No input Aster data frame
• Executed on a single vworker
• ta.apply(tadf, MARGIN, FUN, COMBINER.FUN, ...)
• Applies a given R function on given margins of an Aster data frame
• Optionally applies another combiner function on the results of the first function
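The FUN plus COMBINER.FUN pattern can be emulated locally to show the flow of data. This is a rough sketch under stated assumptions, not how TeradataAsterR executes ta.apply: the helper apply_with_combiner is hypothetical, the chunks stand in for data held by separate workers, and the combiner is assumed to merge partial results pairwise.

```r
# Local emulation only: this is not how TeradataAsterR executes ta.apply.
# Hypothetical helper: split rows into chunks, apply FUN to each chunk,
# then merge the partial results pairwise with COMBINER.FUN.
apply_with_combiner <- function(df, n_chunks, FUN, COMBINER.FUN) {
  chunk_id <- rep_len(seq_len(n_chunks), nrow(df))  # round-robin assignment
  chunks   <- split(df, chunk_id)                   # one data frame per chunk
  partials <- lapply(chunks, FUN)                   # "map" step
  Reduce(COMBINER.FUN, partials)                    # "combine" step
}

# Sum one column of the built-in mtcars data in 4 chunks.
total_mpg <- apply_with_combiner(
  mtcars, n_chunks = 4,
  FUN = function(chunk) sum(chunk$mpg),
  COMBINER.FUN = `+`
)
total_mpg
```

The pattern only gives the same answer as a single pass when the combiner is compatible with FUN, as with sums of partial sums; a mean of partial means, for example, would not be correct for unequal chunk sizes.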
R Map Reduce Runners
Example functions for different margins
# Table MARGIN
> ta.apply(tadf, c(), FUN = sum)
[1] 539240
# Row MARGIN
> ta.apply(tadf, 1, FUN = sum)
[1] 100027 35021 29019 24021 60029 125028 40021 41024 35022 50028
# Column MARGIN
> ta.apply(tadf, 2, FUN = sum)
 id  val_1 val_2 val_3
 55 539000   160    25
# Example data
> ta.data.frame("multi_col") -> tadf
> ta.pull(tadf)
   id  val_1 val_2 val_3
1   1 125000    19     8
2   3  40000    16     2
3   5  41000    18     1
4   7  35000    14     1
5   9  50000    16     3
6   2 100000    20     5
7   4  35000    16     1
8   6  29000    12     1
9   8  24000    12     1
10 10  60000    17     2
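The margin semantics match base R's apply, so the results can be checked locally. The sketch below copies the ten example rows into an ordinary data frame (the name multi_col is reused only for readability) and computes the table, row and column sums with base R; it does not call TeradataAsterR.

```r
# Local copy of the "multi_col" example rows.
multi_col <- data.frame(
  id    = c(1, 3, 5, 7, 9, 2, 4, 6, 8, 10),
  val_1 = c(125000, 40000, 41000, 35000, 50000,
            100000, 35000, 29000, 24000, 60000),
  val_2 = c(19, 16, 18, 14, 16, 20, 16, 12, 12, 17),
  val_3 = c(8, 2, 1, 1, 3, 5, 1, 1, 1, 2)
)

table_sum <- sum(multi_col)            # whole table: a single value
row_sums  <- apply(multi_col, 1, sum)  # MARGIN = 1: one value per row
col_sums  <- apply(multi_col, 2, sum)  # MARGIN = 2: one value per column

table_sum
row_sums
col_sums
```

The local row sums come back in data frame order, while the in-database row results are listed in a different order in the example output.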
Scale R for In-database Analytics
• Scalability
• Use parallelized machine learning algorithms
• Data movement
• Computations are done in the database, avoiding the transfer of large data sets
• Latency
• Minimal data is exchanged between the client and server, typically the aggregated results
Scale R for In-database Analytics
• Access to database
• R users don’t have to learn Aster SQL
• Ease of deployment
• TeradataAsterR package is installed just like other R packages
• R scripts are installed on the server with a single function
Thank You
Questions/Comments
Twitter: @MaverickPraveen