57
Data and AI LATAM 2018

Data and AI LATAM 2018 - microsoftsessions.com Data_Science_wi… · T-SQL. Production Apps. Models. SQL Server 2017. Real time scoring engine. Stored Proc’s and Triggers. Events

Embed Size (px)

Citation preview

Data and AI LATAM 2018

Streamline Productivity and Simplify DeploymentLa parte de imagen con el identificador de relación rId3 no se encontró en el archivo.

Applications

Applications

La parte de imagen con el identificador de relación rId5 no se encontró en el archivo.

MODEL

Scoring

Separate service or embedded logic

La parte de imagen con el identificador de relación rId5 no se encontró en el archivo.

MODELData Transformations

Model Training

Model

Model Training

Data Transformations

Scoring

SQL Server

Analytics server

SQL Server 2017

http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster

Eliminate data movement

Operationalize scripts and models

Enterprise grade performance and scaleExtensibility

Demo – Hello World

EXEC sp_execute_external_script@language =N'R',@script =N'print(paste("Hello World from:", Revo.version$version.string));'

EXEC sp_execute_external_script@language =N'Python',@script =N'import sysprint ("Hellow World from:", sys.version)'

Using SQL Server 2017 Machine Learning Services

Augments R & Python with parallelized, distributed algorithmsProvides in-database execution of scripts and algorithms usedParallel algorithms overcome Python and R memory limitations

Reduces security risks by keeping data in-databaseCreates and consumes portable models

One call, one answerArbitrarily large data setsArbitrarily large worker task setMathematically the same as single-threadedPlatform independentMost are written in C++ for speed

1. Algorithm begins initiator process2. Initiator distributes work to nodes3. Finalizer collects results4. Finalizer iterates or continues5. Finalizer evaluates final model6. Returns single model to calling script

Load a large dataset

Run a RevoScaleRor RevoScalePy

algorithm

RevoScaleR & RevoScalePy Algorithms and Functions

Data Larger than RAM

One callOne or many models returnedArbitrarily large data setsArbitrarily large worker task set

Load a large dataset

Run a RevoScaleRor RevoScalePy

algorithm

RevoScaleR & RevoScalePy Algorithms and Functions

Data Larger than RAM

Augments RevoScaleRFast learnersDeep learning algorithmsEnsemble results using rxEnsemble

Executes RevoScaleR algos on remote data & CPUs“rxSetComputeContext” redirects to remoteAlgorithms in RevoScaleR library redirect as setResults are returned to script as though local

1. Algorithm on local checks compute context2. If set remote, packages and ships request3. Local script blocks (by default) awaiting response4. Remote unpacks and executes in parallel5. Remote returns results to local interface6. Local interface returns results to script

Load a large dataset

Run a RevoScaleRor RevoScalePy

algorithm

SQL Server, Teradata1, Hadoop1, Hadoop MapReduce1, HDInsight1

1R only

Supports custom, multi-layer network topology with filtered, convolutional, and pooling bundles

Binary classificationMulti-class classificationRegression

Bing Ads Click Prediction ($50M per year revenue gain);Image Classification

L1, L2 regularization Binary classificationMulti-class classification

Classifying user feedback

Easy to train learner for anomaly detection

Anomaly Detection Fraud detection

Boosted decision tree. Similar to XGBoost. Supports up to ~100K features

Binary classificationRegression

One of the most popular and best performing learners inside Microsoft

state-of-the-art tree ensembles (Random Forest) Supports up to ~100K features.

Binary classificationRegression

Churn Prediction

Speed, scalability and supports L1,L2 regularization. Supports up to 1B features!

Binary classification, Regression

Outlook used for email spam filtering

Battle tested, large language support, performant (Bing, Office)

Performs natural language processing of free text into numerical representation

Support ticket classification,Sentiment analysis

Ease of use; 1 line of code to set Converts categories into numerical data

Ad Click Prediction

Ease of use Selects a subset of features to speed up training time

Sentiment analysis,Ad Click Prediction

Canonical deployment patterns

SQL Server 2017

SQL Server 2017

Key Points:• Classical pattern of pulling data out

of database to a separate modeling environment

• Data scientists will already be familiar with this approach, so it's something to build on

RemoteExecutionContext

Results

ParallelWorker Tasks

RevoScaleR& RevoScalePy

Parallel Algorithms

Iterate/ Sequence

SQL Server 2016/17

RemoteExecutionContext

Results

ParallelWorker Tasks

RevoScaleR& RevoScalePy

Parallel Algorithms

Iterate/ Sequence

SQL Server 2016/17

Key Points:• Fast and friendly for existing

R/Python users• Limited to what can be done through

a SQL Compute Context• Requires external script execute

permissions

SQL Server 2016/17 Run Python & R

From within the Query

Processor

Run R and Python from SQL environments

T-SQL Apps

T-SQL Script

SQL Server 2016/17 Run Python & R

From within the Query

Processor

Run R and Python from SQL environments

T-SQL Apps

T-SQL Script

Key Points:• Works with all of R/Python functions

for maximum flexibility• Most natural for users with some SQL

familiarity, but doable for all

T-SQL Stored ProcedureSQL

Server 2016/17

Enable smart non-R apps

BI & Reporting; Web apps

T-SQL Script

T-SQL Stored ProcedureSQL

Server 2016/17

Enable smart non-R apps

BI & Reporting; Web apps

T-SQL Script

Key Points:• Helper functions are available so you

don't have to manually recode and R/Python script

• Allows firing via traditional triggers

Events

T-SQLProduction

AppsModels

SQL Server 2017

Real time scoring engine

Stored Proc’sand Triggers

Events

Events

T-SQLProduction

AppsModels

SQL Server 2017

Real time scoring engine

Stored Proc’sand Triggers

Events

Key Points:• R/Python need not be installed• Works with many of the RevoScale*

and MicrosoftML models• Works on SQL 2017 and Azure SQL

DB• single millisecond response times

Other Hints and Deployment Considerations

Integration with SQL query execution• Parallel query pushing data to multiple external processes / threads• Use in-memory technology and Columnstore Indexes alongside your ML

scripts

Streaming mode execution• Stream data in batches to the R/Python process to scale beyond available memory

Train and Predict using parallelism• Leverage RevoScaleR/revoscalepy and scale your R and Python scripts using

multi-threading and parallel processing

Native scoring for faster real-time predictions (New in 2017)

No dependency between rows (ex: scoring) – Trivial Parallelism

sp_execute_external_script@script =

N’Predict…’, @parallel = 1

(MAXDOP = 2)

exec sp_execute_external_script@language = N'R’, @script = N'

# unserialize modellogitObj <- unserialize(modelbin);

# build classification model to predict tipped or notsystem.time(OutputDataSet <- data.frame(predict(logitObj, newdata = InputDataSet,

type = "response")))[3];‘

, @input_data_1 = N’SELECT tipped, passenger_count, trip_time_in_secs, trip_distance, d.direct_distanceFROM dbo.nyctaxi_sample TABLESAMPLE (50 PERCENT) REPEATABLE (98074)CROSS APPLY [CalculateDistance](pickup_latitude, pickup_longitude,

dropoff_latitude, dropoff_longitude) as dOPTION(MAXDOP 2) -- Needed only to control DOP’

, @parallel = 1, @params = N'@modelbin varbinary(max), @r_rowsPerRead int’, @modelbin = @model, @r_rowsPerRead = 5000;

Requirements:No dependency between rows (ex: scoring)

Key Benefits:Execute script over chunks of dataProcess data that doesn’t fit in memoryCan be used from client (rx* function) or server

5000

5000

5000

Dataset = 15000 RowsSp_execute_external_script

@r_rowsPerRead = 5000

exec sp_execute_external_script@language = N'R’, @script = N'

# unserialize modellogitObj <- unserialize(modelbin);

# build classification model to predict tipped or notsystem.time(OutputDataSet <- data.frame(predict(logitObj, newdata = InputDataSet,

type = "response")))[3];‘

, @input_data_1 = N’SELECT tipped, passenger_count, trip_time_in_secs, trip_distance, d.direct_distanceFROM dbo.nyctaxi_sample TABLESAMPLE (50 PERCENT) REPEATABLE (98074)CROSS APPLY [CalculateDistance](pickup_latitude, pickup_longitude,

dropoff_latitude, dropoff_longitude) as d’

, @params = N'@modelbin varbinary(max), @r_rowsPerRead int’, @modelbin = @model, @r_rowsPerRead = 5000;

exec sp_execute_external_script@language = N'R’, @script = N'

# Define the connection stringconnStr <- paste("Driver=SQL Server;Server=", instance_name, ";Database=", database_name, ";Trusted_Connection=true;", sep="");

# Set ComputeContextcc <- RxInSqlServer(connectionString = connStr, numTasks = 4);

# Pull data from queryfeatureDataSource = RxSqlServerData(sqlQuery = input_query, connectionString = connStr, computeContext = cc);

# Table to write data to, using compute contexttipPredictions = RxSqlServerData(table = "nyc_taxi_tip_predictions", connectionString = connStr);

# Unserialize modellogitObj <- unserialize(modelbin);

# Predict tipped or not based on modelPredictions -> rxPredict(logitObj, data = featureDataSource, outData = tipPredictions, overwrite = TRUE);’

, @params = N'@input_query nvarchar(max)’, @input_query = N'SELECT * FROM nyctaxi_training_sample'

rxCall…

rxCall…sp_execute_ext

ernal_script@script =

N’rxLogit…’, @input_data_1

= N’SELECT….’

(MAXDOP = 2)

+BxlServer

+BxlServer

m1+ m2

<Model Object>

@modelSELECT native_modelFROM models

WHERE model_name = 'Fraud Detection Model’

PREDICT MODEL = @model DATA = new_transaction

-- Check/set External Resource Pool configSELECT * FROM sys.resource_governor_resource_pools WHERE name = 'default'SELECT * FROM sys.resource_governor_external_resource_pools WHERE name = 'default'ALTER RESOURCE POOL "default" WITH (max_memory_percent = 60);ALTER EXTERNAL RESOURCE POOL "default" WITH (max_memory_percent = 80);ALTER RESOURCE GOVERNOR RECONFIGURE; -- enforce changes

DMV Description

sys.dm_exec_requests New column: external_script_request_id

sys.dm_external_script_requests Returns running external scripts, DOP & assigned user account

sys.dm_external_script_execution_stats Number of executions for rx* functions in RevoScaleR package

sys.dm_os_performance_counters New “External Scripts” performance counters

https://docs.microsoft.com/en-us/sql/advanced-analytics/r/new-components-in-sql-server-to-support-r

Reduced surface area and isolation

‘external scripts enabled’ required

R/Python script execution outside of SQL Server process

space

Script execution requires explicit

permission

sp_execute_external_script requires EXECUTE ANY

EXTERNAL SCRIPT for non-admins

SQL Server login/user required and db/table access

R/Python processes have limited privileges

R/Python processes run under local user accounts in the

SQLRUserGroup

Each execution is isolated. Different users with different

accounts

Windows firewall rules to block outbound traffic

''

Examples using R:

sqlPackages <- rxInstalledPackages(fields = c("Package","Version", "Built"), computeContext = sqlServer)

pkgs <- c("ggplot2")rxInstallPackages(pkgs = pkgs, verbose = TRUE, scope = "private", computeContext = sqlServer)

pkgs <- c("ggplot2")rxRemovePackages(pkgs = pkgs, verbose = TRUE, scope = "private", computeContext = sqlServer)

Example using T-SQL:

EXEC sp_execute_external_script @language=N'R', @script=N'myPackages <- rxInstalledPackages();OutputDataSet <- as.data.frame(myPackages);'

• Azure SQL Database• R support• Python support

• Machine Learning Services in SQL Server on Linux• Additional algorithms and pre-trained models• Native Scoring for more models

Learning and Scoring Process

ImagesFeaturization (using pre-trained

ResNet18 neural network model)

FeaturesClassification

Algorithm(Boosted Tree)

Classifier Model

LearningLabels

Images Features

Scoring

PredictionsFeaturization (using pre-trained

ResNet18 neural network model)

Classification

Distributed Featurization and Training On HD-InsightSQL Server

Edge

Distributed Featurization

CT Scan Images

Azure Blob StorageClassifier Training

Featurization

ModelsTable

HDInsight-MRS

FeaturizationScoring

with the classifier model

Web App

Diagnosis: 35% certainty

Stored Procedures with R Code

Scoring with Deep Learning Model in SQL

Stored Procedure

call

Model table,Features table,

New Images table

SQL Server

Image Featurization

Parallel Featurization (30x speedup)

Training on Spark and storing in SQL

Scoring in SQL