
#TDPARTNERS16 GEORGIA WORLD CONGRESS CENTER

TeradataAsterR: Scaling R for In-Database Analytics

Praveenkumar Kondikoppa Senior Software Engineer

Teradata Aster

Agenda

• Key Motivations

• Overview of TeradataAsterR Package

• Table Access, Data Exploration

• R functions for Aster Analytical Foundation

• R Map Reduce Runners

Motivation

Database access from R: export from the database, then load into R

• Single threaded

• Network latency

• Memory limitation

TeradataAsterR Package

• Avoid data movement by performing analytics where the data resides and is managed

• Scale R operations via in-database SQL-MR execution (data parallelism)

• Seamlessly push down R operations into SQL, SQL-MR, and SQL-GR

Overview of TeradataAsterR

TeradataAsterR package components:

• R interfaces for Analytic Foundation

• Push-down R data access and exploration functions

• Bulk load/export

• In-DB R script runners

• In-DB R stat functions

High Level Description Contd.

[Architecture diagram: R Client and Aster Cluster, connected through RODBC/ODBC and Mule Copy. Component labels: Table access functions; R function push-down; R Map-reduce runners; Bulk data import/export; Aster Analytic Foundation Wrappers.]

TeradataAsterR overview

• Wrap Aster database objects as R objects

• Introduction of new Aster objects

• Virtual Data Frame

• Extension of the regular R data frame

• Maps to a table/view/query in the Aster database

• Class attribute is ta.data.frame

Virtual Data Frame

vdf <- ta.data.frame("Employee")

• vdf is an in-memory pointer, held by the TeradataAsterR package, to the Employee table in the Aster DB
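To make the "pointer, not a copy" idea concrete, here is a minimal base R sketch. It is not the TeradataAsterR implementation: `virtual_frame`, `count_sql` and `head_sql` are hypothetical names invented for illustration, and the SQL strings are only an assumption about what a push-down might look like.

```r
# Hypothetical sketch, NOT the TeradataAsterR implementation.
# A "virtual" frame stores only metadata about a remote table, never its rows.
virtual_frame <- function(table, schema = "public") {
  structure(list(table = table, schema = schema), class = "virtual_frame")
}

# SQL that a row-count request could push down to the database.
count_sql <- function(vf) {
  sprintf("SELECT COUNT(*) FROM %s.%s", vf$schema, vf$table)
}

# SQL that a head()-style request could push down to the database.
head_sql <- function(vf, n = 6L) {
  sprintf("SELECT * FROM %s.%s LIMIT %d", vf$schema, vf$table, as.integer(n))
}

vf <- virtual_frame("Employee")
print(count_sql(vf))
print(head_sql(vf, 5))
```

The object stays a few bytes in size regardless of how large the table is, because only the generated query travels to the server.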

Table Access and Data Exploration

• Operators provide functionality similar to that of an R data frame

• Create Aster R object

• ta.data.frame

• Conversion to R objects

• as.data.frame

• as.matrix

• as.vector

• factor(as.vector(tadf$column)) – For factor conversion

• Conversion to Aster R objects

• as.ta.data.frame

• as.ta.factor

• as.ta.vector
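The factor-conversion idiom listed above, `factor(as.vector(tadf$column))`, can be tried on a local data frame. This sketch uses base R only. The small `df` is made-up stand-in data, and with a real `ta.data.frame` the `as.vector` step would presumably pull the column to the client first.

```r
# Local stand-in for an Aster column: base R only, no database needed.
df <- data.frame(cyl = c(6, 4, 8, 4, 6))

# Step 1: convert the column to a plain R vector.
cyl_vec <- as.vector(df$cyl)

# Step 2: build the factor from that vector.
cyl_factor <- factor(cyl_vec)

print(levels(cyl_factor))
print(table(cyl_factor))
```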

Functions to Explore Data

• Print: print, print.ta.data.frame, print.ta.data.frame.ordered, print.ta.factor, print.ta.list

• Print Rows: ta.show, ta.head

• Explore Structure: ta.dim, ta.colnames, ta.dimnames, ta.length, ta.nrow, ta.ncol

Data Exploration : Example

> conn <- ta.connect("AsterDataDSN", database="retail")

> ta.dbObject.ls("public")

[1] "mtcarstable" "product" "product_dim" "sales2014" "sales_fact"

> sales <- ta.data.frame("sales2014")

> ta.dim(sales)

[1] 1398104 6

> tadf1 <- ta.data.frame("mtcarstable", schemaName="public")

> ta.names(tadf1)

[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"

[11] "carb"

R functions for Aster Analytics

• Aster Analytics Foundation (AF) has highly scalable machine learning algorithms

• Provide R functions as wrappers to AF

• R users are shielded from SQL-MR syntax

R functions for Aster Analytics

• Function output is transparently mapped to an R object

• Function signature is similar to the native R call

Native R    TeradataAsterR
glm         ta.glm
kmeans      ta.kmeans

• Internally, a SQL-MR call is generated and executed on the server using RODBC
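As a rough illustration of what "a SQL-MR call is generated" can mean, the sketch below turns R arguments into a query string of the general `SELECT * FROM function(ON table Clause('value'))` shape. Everything here is an assumption for illustration: `build_sqlmr`, `some_function`, and the clause names `TargetColumns` and `K` are placeholders, not the real Aster Analytics Foundation signatures or the package's internal code.

```r
# Hypothetical helper: builds a SQL-MR-style query string from R arguments.
# Function and argument-clause names are placeholders, not real AF signatures.
build_sqlmr <- function(fn, on, args = list()) {
  # Render each named argument as Clause('v1', 'v2', ...).
  clauses <- vapply(names(args), function(nm) {
    vals <- paste0("'", args[[nm]], "'", collapse = ", ")
    sprintf("%s(%s)", nm, vals)
  }, character(1))
  tail <- if (length(clauses)) paste0(" ", paste(clauses, collapse = " ")) else ""
  paste0("SELECT * FROM ", fn, "(ON ", on, tail, ")")
}

query <- build_sqlmr("some_function", "public.sales2014",
                     list(TargetColumns = c("a", "b"), K = 3))
print(query)
```

A wrapper built this way would send the string to the server over the existing connection and map the returned rows back into an R object, so the user never writes the query by hand.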

R for Aster Analytics : Example

> knn_train <- ta.data.frame("knn_test1")

> knn_test <- ta.data.frame("knn_test2")

> knn_out <- ta.knn(train=knn_train, test=knn_test, k=3, distance.column=c("sepl","sepw","petl","petw"),

+ response.column=c("class"), test.key="id")

> ta.head(knn_out[[1]])

  id class
1  2     s
2  4     s
3  6     s
4  8     s
5 10     s
6 12     s


> hist <- ta.data.frame("am_histogram_data")

> ta.head(hist)

  memberid            name weight          time_stamp graduate
1      300   Johann August     19 1989-02-22 09:50:22        0
2      400 Martin Heinrich     20 1989-02-24 09:50:22        0
3      500    Ralph Arthur     25 1989-01-20 09:50:22        1
4      800    Joseph Louis     28 1999-02-18 09:10:22        0
5      900     Marie Curie     12 1989-02-18 08:10:22        1
6      100 Henry Cavendish     12 1989-02-20 09:50:22        0


> test1 <- ta.data.frame("lm_test1")

> glm_out <- ta.glm(formula=(damage ~ temp), family=binomial(), data=test1, threshold=0.01, weight=6, max.iterations=10)

> glm_out$coefficient

  attribute   predictor category   estimate    std_err  z_score     p_value significance
1        -1      Loglik       NA -14.837400 23.0000000  1.00000 0.000000000         <NA>
2         0 (Intercept)       NA  11.663000  3.2962699  3.53823 0.000402812          ***
3         1        temp       NA  -0.216234  0.0531772 -4.06628 0.000047769          ***
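For comparison with the `ta.glm` call, the native R counterpart takes the same `formula`, `family` and `data` arguments. The sketch below fits a logistic regression with base R's `glm` on a small made-up data set. The values are synthetic and are not the contents of `lm_test1`, so the coefficients will not match the slide.

```r
# Synthetic stand-in data (NOT the lm_test1 table from the slide).
local_test <- data.frame(
  temp   = c(53, 57, 58, 63, 66, 67, 67, 68, 69, 70, 70, 72, 73, 75, 76, 78, 79, 81),
  damage = c( 1,  1,  1,  1,  0,  0,  0,  0,  0,  1,  0,  0,  0,  1,  0,  0,  0,  0)
)

# Native R call: same formula/family/data arguments as the ta.glm example.
fit <- glm(damage ~ temp, family = binomial(), data = local_test)

# Coefficient table: estimate, standard error, z value, p value.
print(summary(fit)$coefficients)
```

The difference is where the work happens: `glm` needs the whole data set in client memory, while `ta.glm` receives a `ta.data.frame` and runs on the cluster.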

R Map Reduce Runners

• Extend R by adding Aster-specific constructs

• These constructs are translated to the SQL-MR stream API

• Provide functionality to run R scripts in parallel

• R needs to be installed on each worker node


Key Features

• Simple way to deploy R scripts on cluster

• ta.install.scripts(); ta.remove.scripts()

• ta.install.files(); ta.remove.files()

• Supports R scripts that operate on row, column, partition or table


• Support the various output formats of R functions

• R object: output as native R objects

• Table: insert into an Aster table

• JSON/XML/CSV: output as JSON, XML or CSV format

• Graphic: output as PNG graphics (plot, histogram, pie, etc.)


Key Features

• Support user defined R functions

• Support running R functions that do not take a table input

• Control the amount of memory available for the in-database R engine


Example functions

• ta.eval(FUN, ...)

• Executes a function and generates the result in Aster

• No input Aster data frame

• Executed on a single vworker

• ta.apply(tadf, MARGIN, FUN, COMBINER.FUN, ...)

• Applies a given R function on the given margins of an Aster data frame

• Optionally applies another combiner function on the results of the first function
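The FUN plus COMBINER.FUN pattern can be simulated locally in base R. This is only a sketch of the idea: the split into three "workers" is an arbitrary assumption, and how the cluster actually distributes rows and invokes the combiner is not described in these slides.

```r
# Hypothetical local simulation of FUN + COMBINER.FUN; no Aster cluster involved.
values <- c(125000, 40000, 41000, 35000, 50000,
            100000, 35000, 29000, 24000, 60000)

# Assumption: rows are spread round-robin over three workers.
worker_id <- rep(1:3, length.out = length(values))
partitions <- split(values, worker_id)

# FUN runs independently on each partition (the parallel step).
partial <- vapply(partitions, sum, numeric(1))

# COMBINER.FUN merges the per-partition results into one answer.
combined <- sum(partial)

print(partial)
print(combined)
```

Only the three partial sums would need to leave the workers, which is why the combined result matches a full-table sum without moving the rows themselves.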


Example functions for different margins

# Example data

> ta.data.frame("multi_col") -> tadf

> ta.pull(tadf)

   id  val_1 val_2 val_3
1   1 125000    19     8
2   3  40000    16     2
3   5  41000    18     1
4   7  35000    14     1
5   9  50000    16     3
6   2 100000    20     5
7   4  35000    16     1
8   6  29000    12     1
9   8  24000    12     1
10 10  60000    17     2

# Table MARGIN

> ta.apply(tadf, c(), FUN = sum)

[1] 539240

# Row MARGIN

> ta.apply(tadf, 1, FUN = sum)

[1] 100027 35021 29019 24021 60029 125028 40021 41024 35022 50028

# Column MARGIN

> ta.apply(tadf, 2, FUN = sum)

id val_1 val_2 val_3

55 539000 160 25

# ta.tapply case: calculate the mean of each partition in tadf, partitioned by column "val_3"
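The margin results for the `multi_col` example data can be cross-checked locally with base R's `apply` and `tapply`. The data frame below copies the ten example rows. The last line is only an analogue of the `ta.tapply` case, and averaging `val_1` is an assumption, since the slide does not say which column is averaged.

```r
# Local copy of the ten example rows; base R only.
multi_col <- data.frame(
  id    = c(1, 3, 5, 7, 9, 2, 4, 6, 8, 10),
  val_1 = c(125000, 40000, 41000, 35000, 50000,
            100000, 35000, 29000, 24000, 60000),
  val_2 = c(19, 16, 18, 14, 16, 20, 16, 12, 12, 17),
  val_3 = c(8, 2, 1, 1, 3, 5, 1, 1, 1, 2)
)
m <- as.matrix(multi_col)

table_sum <- sum(m)             # whole-table margin
row_sums  <- apply(m, 1, sum)   # MARGIN = 1: one value per row
col_sums  <- apply(m, 2, sum)   # MARGIN = 2: one value per column

# Partition analogue (assumed column): mean of val_1 within each val_3 group.
part_means <- tapply(multi_col$val_1, multi_col$val_3, mean)

print(table_sum)
print(col_sums)
print(part_means)
```

The local row sums come back in table order, whereas the in-database result is listed in a different order, presumably because the rows are processed on different workers.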

Scale R for In-database Analytics

• Scalability

• Use parallelized machine learning algorithms

• Data movement

• Computations are done in the database, avoiding the transfer of huge data

• Latency

• Minimal data is exchanged between the client and the server, typically the aggregated results

Scale R for In-database Analytics

• Access to database

• R users don’t have to learn about Aster SQL

• Ease of deployment

• TeradataAsterR package is installed just like other R packages

• R scripts are installed on the server with a single function call

Thank You

Questions/Comments

Follow Me

Twitter @MaverickPraveen
