
Page 1

Systems for ML, ML for Systems
Marco Serafini

COMPSCI 692S

Page 2

Course Structure
• Classes:
  • 1 tutorial per class
  • Each presented by a group of 3 students
  • Must cover an area: at least 3 papers
• Paper presentations:
  • Give context, present the problem (not solutions!): 15 minutes
  • Present papers: 40 minutes
  • Discussion: comparison, strengths, weaknesses, lessons: 10 minutes
  • Questions: 10 minutes
• Before the class:
  • Everyone reads the papers
  • Enter reviews in a Google Form (provided on Piazza)

Page 3

Projects
• Groups of 3 people
• Timeline:
  1. Look for a Sys4ML or ML4Sys problem (Feb 10)
     • Can ask me if you cannot find one for a project
  2. Find experimental evidence that the problem exists (March 1)
  3. Evaluate existing solutions (March 31)
  4. Propose and evaluate improvements (April 21)
  5. Final presentations (April 22-29)

Page 4

Logistics
• All communications are through Piazza
  • Signup: https://piazza.com/umass/spring2020/cs692s
• Forming groups:
  • Groups due this Friday
  • I will match people without a group over the weekend
  • Everyone is in a tutorial group
  • People in the 3-credit section are also in a project group
  • Can use Piazza to find group members
• Groups and papers out on Saturday
• No electronic devices during classes (not even in airplane mode)

Page 5

Systems for ML

Page 6

Machine Learning
• Wide array of problems and algorithms:
  • Classification: given labeled data points, predict the label of a new data point
  • Regression: learn a function from some (x, y) pairs
  • Clustering: group data points into "similar" clusters
  • Segmentation: partition an image into meaningful segments
  • Outlier detection

Page 7

More Dimensions
• Supervision:
  • Supervised ML: labeled ground truth is available
  • Unsupervised ML: no ground truth
• Training vs. inference:
  • Training: obtain a model from training data
  • Inference: actually run the prediction

Page 8

Example: Ad Click Predictor
• Ad prediction problem:
  • A user is browsing the web
  • Choose the ad that maximizes the likelihood of a click
• Training data:
  • Trillions of ad-click log entries
  • Trillions of features per ad and user
• Important to reduce the running time of training:
  • Want to retrain frequently
  • Reduce energy and resource utilization costs

Page 9

Abstracting ML Algorithms
• Can we find commonalities among ML algorithms?
• This would allow finding:
  • Common abstractions
  • Systems solutions to efficiently implement these abstractions
• Some common aspects:
  • We have a prediction model A
  • A should optimize some complex objective function L
    • E.g.: likelihood of correctly labeling a new ad as "click" or "no-click"
  • The ML algorithm does this by iteratively refining A

Page 10

High-Level View
• Notation:
  • D: data
  • A: model parameters
  • L: function to optimize (e.g., minimize loss)
• Goal: update A based on D to optimize L
• Typical approach: iterative convergence

$A_t = F(A_{t-1}, \Delta_L(A_{t-1}, D))$

At iteration t, $\Delta_L$ computes updates that minimize L, and F merges the updates into the parameters.
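A minimal Python sketch of this template, instantiated with plain gradient descent on a toy one-parameter objective (the names `delta_L` and `F` are illustrative, mirroring the notation above):

```python
def iterative_convergence(A0, D, delta_L, F, steps=100):
    """Generic template from above: A_t = F(A_{t-1}, Delta_L(A_{t-1}, D))."""
    A = A0
    for _ in range(steps):
        A = F(A, delta_L(A, D))
    return A

# Toy instantiation: gradient descent on L(A) = (A - 3)^2; the update that
# minimizes L is a step along the negative gradient.
delta_L = lambda A, D: -0.1 * 2 * (A - 3)
F = lambda A, upd: A + upd                           # merge update into parameters
print(iterative_convergence(0.0, None, delta_L, F))  # converges to ~3.0
```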

Page 11

How to Parallelize?
• How to execute the algorithm over a set of workers?
• Data-parallel approach:
  • Partition data D
  • All workers share the model parameters A
• Model-parallel approach:
  • Partition model parameters A
  • All workers process the same data D

Page 12

Data-Parallel Approach
• Process for each worker:
  • Update parameters based on data
  • Push updates to parameter servers
  • Servers aggregate & apply updates
  • Pull parameters
• Requirements:
  • Updates associative and commutative!
  • Example: Stochastic Gradient Descent

$A_t = A_{t-1} + \sum_{p=1}^{P} \Delta(A_{t-1}, D_p)$

where $D_p$ is the data partition at worker p.

Page 13

Example
• Each worker:
  • Loads a partition of the data
  • At every iteration, computes gradients
• Server:
  • Aggregates gradients
  • Updates parameters
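A self-contained numpy sketch of this pattern on a toy linear-regression model, directly instantiating the summed-update formula above (the sharding and learning rate are illustrative; real systems run workers as separate processes):

```python
import numpy as np

rng = np.random.default_rng(0)
P, n, d = 4, 1000, 5                       # workers, data points, parameters
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)
shards = np.array_split(np.arange(n), P)   # partition the data D across workers

A = np.zeros(d)                            # shared parameters on the server
for t in range(100):
    # Each worker computes a gradient update Delta(A_{t-1}, D_p) on its shard.
    updates = [-0.1 * X[s].T @ (X[s] @ A - y[s]) / len(s) for s in shards]
    # The server sums the updates (associative and commutative) and applies them.
    A = A + sum(updates)

print(np.round(A - w_true, 3))             # residual error near zero
```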

Page 14

Parameter Server
• Stores model parameters
• Advantages:
  • No need for message passing
  • Distributed shared memory abstraction
• Very first implementation: key-value store
• Subsequent improvements:
  • Server-side UDFs
  • Worker scheduling
  • Bandwidth optimizations
  • …

Page 15

Architecture
• Single parameters stored as <key, value> pairs
• Server-side linear algebra operations:
  • Sum
  • Multiplication
  • 2-norm
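A minimal in-process sketch of this key-value interface, with aggregation done server-side (the `push`/`pull` names follow the usual parameter-server vocabulary; a real system shards keys across many server nodes):

```python
import numpy as np

class ParameterServer:
    """Toy key-value parameter server: parameters are <key, value> pairs,
    and updates are aggregated on the server (a server-side 'sum')."""
    def __init__(self):
        self.store = {}

    def pull(self, keys):
        # Workers read current parameter values.
        return {k: self.store[k] for k in keys}

    def push(self, updates):
        # Workers send deltas; the server adds them into the stored values.
        for k, delta in updates.items():
            self.store[k] = self.store.get(k, np.zeros_like(delta)) + delta

ps = ParameterServer()
ps.push({"w": np.ones(3)})       # update from worker 1
ps.push({"w": 2 * np.ones(3)})   # update from worker 2 is summed in
print(ps.pull(["w"]))            # {'w': array([3., 3., 3.])}
```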

Page 16

Does This Scale?
• We said that a model can have trillions of parameters
• Q: Does this scale?
• A: Yes
  • Each data point (worker) only updates a few parameters
  • Example: Sparse Logistic Regression
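A sketch of why sparsity makes this scale: one SGD step of sparse logistic regression touches only the parameters whose features are non-zero in the current data point (the dict-of-floats model here stands in for pulls/pushes against a sharded server):

```python
import math
from collections import defaultdict

A = defaultdict(float)  # conceptually huge parameter vector, stored sparsely

def sgd_step(features, label, lr=0.1):
    """One SGD step. `features` maps the few non-zero feature ids of this
    data point to their values, so only those parameters are read (pulled)
    and updated (pushed)."""
    z = sum(A[j] * v for j, v in features.items())
    p = 1.0 / (1.0 + math.exp(-z))          # predicted click probability
    g = p - label                           # gradient of the logistic loss
    for j, v in features.items():
        A[j] -= lr * g * v

# A data point with 2 non-zero features touches 2 parameters, not ~10^12.
sgd_step({7: 0.5, 10**11: 1.0}, label=1)
print(dict(A))
```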

Page 17

Model-Parallel Scheduler
• Some systems (e.g., Petuum) support a global scheduler
• Scheduler runs application-specific logic
• Two main goals:
  • Partition parameters
  • Prioritized scheduling: give precedence to parameters that converge more slowly

Page 18

Model-Parallel Approach
• Process for each worker:
  • Receive the ids of the parameters $S_p^{t-1}$ to update (from the scheduler)
    • This is a partition of the entire space of parameters
  • Compute updates on those parameters
  • Send updates to a parameter server that:
    • Concatenates the updates (which are disjoint)
    • Applies the updates to the parameters
• Requirements:
  • There should be no/weak correlation among parameters
  • Example: matrix factorization
• Q: Advantage?

$A_t = A_{t-1} + \mathrm{Con}\big(\{\Delta_p(A_{t-1}, S_p^{t-1}(A_{t-1}, D))\}_{p=1}^{P}\big)$

where $S_p^{t-1}$ selects worker p's parameter partition and Con concatenates the disjoint updates.

Page 19

TensorFlow vs. Parameter Server
• Parameter server:
  • Separate worker nodes and parameter nodes
  • Different interfaces
• TensorFlow: only tasks
  • Shared parameters are held by stateful operators: variables and queues
  • Tasks managing them are called PS tasks
  • PS tasks are regular tasks: they can run arbitrary operators
  • Uniform programming interface

Page 20

TensorFlow
• Dataflow graph of operators
• Deferred (lazy) execution
• Composable basic operators
• Concept of devices:
  • CPUs, GPUs, mobile devices
  • Different implementations of the operators
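A minimal example in the TF 1.x-style graph API the paper describes (reachable as `tf.compat.v1` in modern installs): building the graph only records operators; nothing runs until a session executes it.

```python
import tensorflow.compat.v1 as tf  # TF 1.x-style API, as in the paper
tf.disable_eager_execution()

# These lines only build a dataflow graph of composable operators;
# nothing is computed yet (deferred/lazy execution).
x = tf.constant([[1.0, 2.0]])
w = tf.Variable([[0.5], [0.5]])
y = tf.matmul(x, w)

with tf.Session() as sess:                       # execution happens here
    sess.run(tf.global_variables_initializer())
    print(sess.run(y))                           # [[1.5]]
```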

Page 21

Dataflow Graph
• Vertex: unit of local computation
  • Called an operation in TensorFlow
• Edges: inputs and outputs of computations
• Values along edges are called tensors:
  • n-dimensional arrays
  • Elements have primitive types (including byte arrays)

Page 22

Execution Model
• Step: client executes a subgraph by indicating:
  • Edges to feed the subgraph with input tensors
  • Edges to fetch the output tensors
  • The runtime prunes the subgraph to remove unnecessary operations
• Subgraphs are run asynchronously by default
• Can execute multiple partial, concurrent subgraphs
  • Example: concurrent batches for data-parallel training
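A small example of a step with feeds and fetches (same TF 1.x-style API as above); because only `c` is fetched, the runtime prunes the unrelated branch `d`:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

a = tf.placeholder(tf.float32, name="a")
b = a * 2.0
c = b + 1.0
d = tf.exp(a)  # separate branch of the graph, unrelated to c

with tf.Session() as sess:
    # One step: feed tensor a, fetch tensor c. The runtime executes only
    # the pruned subgraph {a, b, c}; d is never computed.
    print(sess.run(c, feed_dict={a: 3.0}))  # 7.0
```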

Page 23

Distributed Execution
• Tasks: named processes
  • Run computations, send data
• Operations: consist of multiple kernels
  • The same operation (e.g., matrix multiplication) has different kernels for different devices
• Devices: CPU, GPU, TPU, mobile, …
  • CPU is the host device
  • A device executes a kernel for each operation assigned to it

Page 24

Distributed Scheduling
• The TensorFlow runtime places operations on devices:
  • Implicit constraints: a stateful operation goes on the same device as its state
  • Explicit constraints: dictated by the user
  • Optimal placement is still an open question
• This yields per-device subgraphs:
  • All operations assigned to the device
  • Send and Receive operations replace cross-device edges
• Specialized per-device implementations of Send/Receive:
  • CPU – GPU: CUDA memory copy
  • Across tasks: TCP or RDMA
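A sketch of explicit placement constraints (the `/job:ps` and `/job:worker` device names assume a cluster configuration, so this builds a graph but will not run standalone); the runtime splits the graph per device and inserts Send/Receive pairs on the cut edges:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Explicit constraint: pin the stateful variable to a PS task.
with tf.device("/job:ps/task:0/device:CPU:0"):
    w = tf.Variable(tf.zeros([100, 10]))

# Explicit constraint: run the compute on a worker's GPU. The edge from w
# to the matmul crosses devices, so the runtime replaces it with a
# Send/Receive pair (CUDA memcpy between CPU and GPU, TCP/RDMA across tasks).
with tf.device("/job:worker/task:0/device:GPU:0"):
    x = tf.placeholder(tf.float32, [None, 100])
    logits = tf.matmul(x, w)
```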

Page 25

Training vs. Inference
• Training: data → model
  • Computationally expensive
  • No hard real-time requirements (typically)
• Inference: data + model → prediction
  • Computationally cheaper
  • Real-time requirements (sometimes sub-millisecond)

Page 26

Example: Clipper

Page 27

Challenge: Different Frameworks
• Many training frameworks, each with its strengths
  • E.g.: Caffe for computer vision, HTK for speech recognition
• Each uses different formats → tailored deployment
• The best tool may change over time
• Solution: model abstraction

Page 28

Challenge: Prediction Latency
• Many ML models have high prediction latency
  • Some are too slow to use online, e.g., when choosing an ad
  • Combining model outputs makes it worse
• Trade-off between accuracy and latency
• Solutions:
  • Adaptive batching (sketch below)
  • Enable mixing models with different complexity
  • Straggler mitigation when using multiple models
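A simplified, self-contained sketch of adaptive batching in the spirit of Clipper, which searches for the largest batch size that still meets the latency SLO using an additive-increase/multiplicative-decrease rule (the request list and dummy model are stand-ins for a real queue and model container):

```python
import time

def serve(requests, predict_batch, slo_s=0.05):
    """Adaptive batching sketch: grow the batch additively while latency
    stays under the SLO, back off multiplicatively on a miss (AIMD)."""
    batch_size, i = 1, 0
    while i < len(requests):
        batch = requests[i:i + batch_size]
        i += len(batch)
        start = time.perf_counter()
        predict_batch(batch)                      # batched inference call
        latency = time.perf_counter() - start
        if latency < slo_s:
            batch_size += 1                       # additive increase
        else:
            batch_size = max(1, batch_size // 2)  # multiplicative decrease
    return batch_size

# Dummy model whose per-batch latency grows with batch size.
print(serve(list(range(500)), lambda b: time.sleep(0.001 * len(b))))
```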

Page 29

Challenge: Model Selection
• How to decide which models to deploy?
• Selecting the best model offline is expensive
• The best model changes over time:
  • Concept drift: relationships in the data change over time
  • Feature corruption
• Combining multiple models can increase accuracy
• Solution: automatically select among multiple models online (sketch below)
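A simplified sketch of bandit-style online model selection; Clipper's selection layer uses Exp3/Exp4-style algorithms, and this toy version keeps a weight per model, samples proportionally, and reinforces models whose predictions earn reward (the `feedback` function is a stand-in for user feedback):

```python
import math
import random

def select_online(models, reward, rounds=500, gamma=0.1):
    """Exp3-style selection: sample a model from a weighted distribution,
    observe a reward in [0, 1], and exponentially boost its weight."""
    K = len(models)
    w = [1.0] * K
    for _ in range(rounds):
        total = sum(w)
        probs = [(1 - gamma) * wi / total + gamma / K for wi in w]
        i = random.choices(range(K), weights=probs)[0]
        r = reward(models[i])                 # e.g., "was the prediction right?"
        w[i] *= math.exp(gamma * r / (probs[i] * K))  # importance-weighted boost
    return models[max(range(K), key=lambda i: w[i])]

# Toy feedback: model "b" is correct 90% of the time, model "a" 60%.
feedback = lambda m: float(random.random() < (0.9 if m == "b" else 0.6))
print(select_online(["a", "b"], feedback))    # usually prints "b"
```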

Page 30

Clipper Overview
• Requests flow top to bottom and back
• We start by reviewing the Model Abstraction Layer

Page 31

Other Topics
• Can we compile ML programs to speed up learning?
• How to manage resources for ML workloads?
• How to test and debug ML programs?
• Can we run ML directly inside a DBMS?
• How to build end-to-end ML pipelines?

Page 32

ML for Systems

Page 33

Why ML for Systems
• Systems use heuristics to decide on:
  • Data structures
  • Data representation (e.g., row vs. columnar stores)
  • Indices
  • Data caching, replication, partitioning
  • Execution policies (e.g., relational query execution plans)
  • …
• Heuristics can fall short:
  • They are based on assumptions about the workload, which might be wrong
  • They cannot consider complex interactions

Page 34

ML for Systems
• Replace these heuristics with learned components
• Becoming popular in data management:
  • Learned indices (sketch below)
  • Learned query optimizers
• Energy efficiency
• Resource management, cloud
• Parameter tuning
• ML to index and query different types of data:
  • Video, speech, …
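To make "learned indices" concrete, here is a toy sketch in the spirit of Kraska et al.'s learned index: a linear model predicts a key's position in a sorted array, and a bounded local search around the prediction corrects the model's error (real designs use a hierarchy of models rather than a single linear fit):

```python
import numpy as np

class LearnedIndex:
    """Toy learned index: fit position ~ a*key + b over a sorted array,
    remember the worst-case prediction error, and search only within it."""
    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        pos = np.arange(len(self.keys))
        self.a, self.b = np.polyfit(self.keys, pos, 1)   # least-squares fit
        pred = np.rint(self.a * self.keys + self.b).astype(int)
        self.err = int(np.abs(pred - pos).max())         # error bound

    def lookup(self, key):
        guess = int(round(self.a * key + self.b))
        lo = max(0, guess - self.err)                    # bounded local search
        hi = min(len(self.keys), guess + self.err + 2)
        i = lo + int(np.searchsorted(self.keys[lo:hi], key))
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None

idx = LearnedIndex(range(0, 3000, 3))   # keys 0, 3, 6, ..., 2997
print(idx.lookup(2997))                 # position 999
```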