ł PRETZEL: Opening the Black Box of ML Prediction Serving ......•Windows 10, .Netcore 2.0 25. Evaluation: Latency •Micro-benchmark (No server-client communication) •Score 250

PRETZEL: Opening the Black Box of ML Prediction Serving Systems

Yunseong Lees, Alberto Scolarip, Byung-Gon Chuns,Marco Domenico Santambrogiop, Markus Weimerm, Matteo Interlandim

–„„fi ¿� ¶ �

–„¿�„fi ¿� ¶ �

¿�„fi ¿� ¶ �A

¿�„fi ¿� ¶ �B

BS06‰ –�·ˇ ‡�-� ¿� ¶ �

‰ –�·ˇ ‡· �…›¿�·º —–‡� �‰�”…‚¶ '¿ �• �� ‚�� `¶ ��»�‚» · �� ‚• ,��· ��ß¿º‚¯ …� �»� †¿¡��ß¶��ł�£,�•„�� ¿��»��•` ¿'�

�ß�� »�…– ˆ ¿'�»�¿º �·�.�–�·��œ�¿º��¿¡�‚ ��• � ¿� ¶ �,�»� ˇ ¶ �� ‰ –�·ˇ ‡� � ‚ƒ��‡„� ¿·�� ¿� ¶ ��¶ · �»� ˇ ¶ ��

¿�…–��ß¿º–� ��”�”��• ��˛ �� ‚‚�,�»�¿º‰ �”æ• ‡“��£��,�¯'–�‚ƒ�� • �”fl�� …��ł·�.�‰ –�·ˇ ‡� �»�»��”�Full�Color��¿º»�»��»�

¿�…–�ß�‚• �»�¿º ‚��„�� »�»�� ¿º¿¡�·º ��»� ��”�”»��¡��¶�� »�»� �¿º�»� �� ·�.��»��”�CD-Rom ¿¡�…�• � ��¥�� ‚�»�¿º�»�

¿ł ¢�‚• � �·�

Machine Learning Prediction Serving1. Models are learned from data2. Models are deployed and served together

Prediction serving

UsersServerData

Training

ModelLearn Deploy

Performance goal:1) Low latency2) High throughput3) Minimal resource usage

2

ML.Net

ML Prediction Serving Systems: State-of-the-art

• Assumption: models are black box• Re-use the same code in training phase

• Encapsulate all operationsinto a function call (e.g., predict())

• Apply external optimizations

TF ServingClipper

Prediction ServingSystem

“Pretzel is tasty”

cat

car

Text Analysis

ImageRecognition

Resultcaching

Replication

ensembleRequestBatching

3

How do Models Look inside Boxes?

<Example: Sentiment Analysis>

Pretzel is tasty Model J vs. L(text) (positive vs. negative)

4

Tokenizer



Pretzel is tasty

CharNgram

WordNgram

Concat LogisticRegression

DAG of Operators

J vs. L

FeaturizersPredictor

5

Tokenizer



Pretzel is tasty

CharNgram

WordNgram

Concat LogisticRegression

DAG of Operators

J vs. L

Split text into tokens

ExtractN-grams

Merge two vectors

Compute final score

6

Many Models Have Similar Structures

• Many part of a model can be re-used in other models

• Customer personalization, Templates, Transfer Learning

• Identical set of operators with different parameters

7

Outline

•Prediction Serving Systems• Limitations of Black Box Approaches•PRETZEL: White-box Prediction Serving System• Evaluation•Conclusion

8

Limitation 1: Resource Waste

•Resources are isolated across Black boxes

1. Unable to share memory spaceè Waste memory to maintain duplicate objects

(despite similarities between models)

2. No coordination for CPU resources between boxesè Serving many models can use too many threads

machine9

Tokenizer

CharNgram

WordNgram

Concat LogReg

Limitation 2: Inconsideration for Ops’ Characteristics

1. Operators have different performance characteristics• Concat materializes a vector• LogReg takes only 0.3% (contrary to the training phase)

2. There can be a better plan if such characteristics are considered• Re-use the existing vectors• Apply in-place update in LogReg

0% 20% 40% 60% 80% 100%Latency breakdown

23.1 34.2 32.7

0.3

9.6

CharNgram WordNgram Concat LogReg Others

10

Limitation 3: Lazy Initialization• ML.Net initializes code and memory lazily (efficient in training phase)• Run 250 Sentiment Analysis models 100 timesè cold: first execution / hot: average of the rest 99• Long-tail latency in the cold case• Code analysis, Just–in-time (JIT) compilation, memory allocation, etc• Difficult to provide strong Service-Level-Agreement (SLA)

Tokenizer

CharNgram

WordNgram

Concat LogReg 13x

444x

11

Outline

• (Black-box) Prediction Serving Systems• Limitations of Black Box Approaches•PRETZEL: White-box Prediction Serving System• Evaluation•Conclusion

12

PRETZEL: White-box Prediction Serving•We analyze models to optimize the internal execution

•We let models co-exist on the same runtime, sharing computation and memory resources

•We optimize models in two directions:1. End-to-end optimizations2. Multi-model optimizations

13

End-to-End Optimizations

Optimize the execution of individual models from start to end1. [Ahead-of-time Compilation]

Compile operators’ code in advanceà No JIT overhead

2. [Vector pooling] Pre-allocate data structuresà No memory allocation on the data path

14

Multi-model Optimizations

Share computation and memory across models1. [Object Store]

Share Operators parameters/weightsà Maintain only one copy

2. [Sub-plan Materialization] Reuse intermediate results computed by other modelsà Save computation

15

System Components

1. Flour: Intermediate Representation

2. Oven: Compiler/Optimizer

var fContext = ...;var Tokenizer = ...;return fPrgm.Plan();

Logical StagesS1 S2 S3 1: [x]

2: [y,z]3: …

int[100]float[200]…

Params Stats

Physical StagesS1 S2 S3

ML.NETModel

Stats

ParamsLogical StagesPhysicalStages

Model Plan

3. Runtime: Execute inference queries

Runtime

ObjectStore

Scheduler

…

4. FrontEnd: Handle user requests

FrontEnd

16

Prediction Serving with PRETZEL

1. Offline• Analyze structural information of models• Build ModelPlan for optimal execution• Register ModelPlan to Runtime

2. Online• Handle prediction requests• Coordinate CPU & memory resources

RuntimeFrontEnd

Model

ModelPlan RuntimeRegister

Analyze

17

1. Translate Model into Flour Program

System Design: Offline Phase

Tokenizer

CharNgram

WordNgram

Concat LogReg

<Model>var fContext = new FlourContext(...)var tTokenizer = fContext.CSV

.FromText(fields, fieldsType, sep)

.Tokenize();

var tCNgram = tTokenizer.CharNgram(numCNgrms, ...);var tWNgram = tTokenizer.WordNgram(numWNgrms, ...);var fPrgrm = tCNgram

.Concat(tWNgram)

.ClassifierBinaryLinear(cParams);

return fPrgrm.Plan();

<Flour Program>

18

2. Oven optimizer/compiler build Model Plan


var fContext = new FlourContext(...)var tTokenizer = fContext.CSV


.Tokenize();


.Concat(tWNgram)



<Flour Program>

Push linear predictor & Remove Concat

Stage 1

Stage 2

Group ops into stages

Rule-based optimizer

S1

S2

<Model Plan>

Logical DAG

19

2. Oven optimizer/compiler build Model Plan




.Tokenize();


.Concat(tWNgram)



<Flour Program>

Push linear predictor & Remove Concat

Stage 1

Stage 2

Group ops into stages

Rule-based optimizer

S1

S2

<Model Plan>

Logical DAG



.Tokenize();


.Concat(tWNgram)





.Tokenize();


.Concat(tWNgram)


return fPrgrm.Plan(); Parameters

e.g., Dictionary, N-gram Length

Statisticse.g., dense vs. sparse,maximum vector size 20

3. Model Plan is registered to Runtime


<Model Plan>

Logical DAG

Parameters

Statistics

Physical Stages

2. Find the most efficient physical impl. using params & stats

S1 S2

Logical StagesModel1

S1

S2

Object Store

1. Store parameters & mapping between

logical stages 21

3. Model Plan is registered to Runtime


<Model Plan>

Logical DAG

Parameters

Statistics

Physical Stages

2. Find the most efficient physical impl. using params & stats

S1 S2Catalog

3. Register selected physical stages to

Catalog

Logical StagesModel1

S1

S2

Object Store

1. Store parameters & mapping between

logical stages

N-gram length1 vs. 3 Sparse vs. Dense

22

System Design: Online Phase

Runtime

Logical StagesModel1 Model2

S1

S2

S1’

S2’

4. Send result back to Client

1. When a prediction request arrives

<Model1, “Pretzel is tasty”>

3. Execute stages using thread-pools,

managed by Scheduler

Physical Stages

2. Instantiate physical stages

along with parameters

Object Store

23

Outline

• (Black-box) Prediction Serving Systems• Limitations of Black box approaches•PRETZEL: White-box Prediction Serving System• Evaluation•Conclusion

24

Evaluation

• Q. How PRETZEL improves performance over black-box approaches?• in terms of latency, memory and throughput

• 500 Models from Microsoft Machine Learning Team• 250 Sentiment Analysis (Memory-bound)• 250 Attendee Count (Compute-bound)

• System configuration• 16 Cores CPU, 32GB RAM • Windows 10, .Net core 2.0

25

Evaluation: Latency

• Micro-benchmark (No server-client communication)• Score 250 Sentiment Analysis models 100 times for each• Compare ML.Net vs. PRETZEL

10 2 10 1 100 101Latency (ms, log-scaled)

0

20

40

60

80

100

CDF

(%)

ML.Net(hot)

ML.Net(cold)

0.6 8.1ML.Net PRETZEL

P99 (hot)P99 (cold)Worst (cold)

ML.Net PRETZELP99 (hot) 0.6P99 (cold) 8.1Worst (cold)

ML.Net PRETZELP99 (hot) 0.6 0.2P99 (cold) 8.1 0.8Worst (cold)

ML.Net PRETZELP99 (hot) 0.6 0.2P99 (cold) 8.1 0.8Worst (cold) 280.2 6.2

10 2 10 1 100 101Latency (ms, log-scaled)

0

20

40

60

80

100

CDF

(%)

PRET

ZEL

(hot

)

PRET

ZEL

(col

d) 0.6 8.10.2 0.8

3x

10x

45xbetter

26

• Measure Cumulative Memory Usage after loading 250 models• Attendee Count models (smaller size than Sentiment Analysis)• 4 settings for Comparison

Evaluation: Memory

better0 50 100 150 200 250

Number of pipelines10MB

0.1GB

1GB

10GB

Cum

ulat

ive

Mem

ory

Usag

e (

log-

scal

ed)

32GBSettings Shared

ObjectsShared

RuntimeML.Net + Clipper

ML.Net ✓PRETZEL without ObjectStore ✓PRETZEL ✓ ✓

0 50 100 150 200 250Number of pipelines

10MB

0.1GB

1GB

10GB

Cum

ulat

ive

Mem

ory

Usag

e (

log-

scal

ed)

32GB

0 50 100 150 200 250Number of pipelines

10MB

0.1GB

1GB

10GB

Cum

ulat

ive

Mem

ory

Usag

e (

log-

scal

ed)

32GB

164MB

2.9GB3.7GB9.7GB

25x 62x

PRETZEL

(w.o. ObjStore)ML.Net

ML.Net + Clipper

27

Evaluation: Throughput

1 2 4 8 13Num. CPU Cores

(SA)

0

50

100

150

Thro

ughp

ut (K

QPS

)

(ideal)

(ideal)

PretzelML.Net


(AC)

0

5

10

15(ideal)

(ideal)


(SA)

0

50

100

150

Thro

ughp

ut (K

QPS

)

(ideal)

(ideal)

PretzelML.Net


(AC)

0

5

10

15(ideal)

(ideal)

better

• Micro-benchmark• Score 250 Attendee Count models 1000 times for each• Request 1000 queries in a batch• Compare ML.Net vs. PRETZEL

10x

Close to ideal scalability

More resultsin the paper!

28

Conclusion• PRETZEL is the first white-box prediction serving system for ML pipelines

• By using models’ structural info, we enable two types of optimizations:• End-to-end optimizations generate efficient execution plans for a model

• Multi-model optimizations let models share computation and memory resources

• Our evaluation shows that PRETZEL can improve performance compared to Black-box systems (e.g., ML.Net)• Decrease latency and memory footprint

• Increase resource utilization and throughput

29

PRETZEL: a White-Box ML Prediction Serving System

Thank you! Questions?

–„„fi ¿� ¶ �

–„¿�„fi ¿� ¶ �

¿�„fi ¿� ¶ �A

¿�„fi ¿� ¶ �B

BS06‰ –�·ˇ ‡�-� ¿� ¶ �

‰ –�·ˇ ‡· �…›¿�·º —–‡� �‰�”…‚¶ '¿ �• �� ‚�� `¶ ��»�‚» · �� ‚• ,��· ��ß¿º‚¯ …� �»� †¿¡��ß¶��ł�£,�•„�� ¿��»��•` ¿'�

�ß�� »�…– ˆ ¿'�»�¿º �·�.�–�·��œ�¿º��¿¡�‚ ��• � ¿� ¶ �,�»� ˇ ¶ �� ‰ –�·ˇ ‡� � ‚ƒ��‡„� ¿·�� ¿� ¶ ��¶ · �»� ˇ ¶ ��

¿�…–��ß¿º–� ��”�”��• ��˛ �� ‚‚�,�»�¿º‰ �”æ• ‡“��£��,�¯'–�‚ƒ�� • �”fl�� …��ł·�.�‰ –�·ˇ ‡� �»�»��”�Full�Color��¿º»�»��»�

¿�…–�ß�‚• �»�¿º ‚��„�� »�»�� ¿º¿¡�·º ��»� ��”�”»��¡��¶�� »�»� �¿º�»� �� ·�.��»��”�CD-Rom ¿¡�…�• � ��¥�� ‚�»�¿º�»�

¿ł ¢�‚• � �·�

30

Documents

ł PRETZEL: Opening the Black Box of ML Prediction Serving ......•Windows 10, .Netcore 2.0 25. Evaluation: Latency •Micro-benchmark (No server-client communication) •Score 250