
Page 1

Systems for ML, ML for Systems
Marco Serafini

COMPSCI 692S

Page 2

Course Structure
• Classes:
  • 1 tutorial per class
  • Each presented by a group of 3 students
  • Must cover an area: at least 3 papers
• Paper presentations:
  • Give context, present the problem (not solutions!): 15 minutes
  • Present papers: 40 minutes
  • Discussion: comparison, strengths, weaknesses, lessons: 10 minutes
  • Questions: 10 minutes
• Before the class:
  • Everyone reads the papers
  • Enter reviews in a Google Form (provided on Piazza)

Page 3

Projects
• Groups of 3 people
• Timeline:
  1. Look for a Sys4ML or ML4Sys problem (Feb 10)
     • Can ask me if you cannot find one for a project
  2. Find experimental evidence that the problem exists (March 1)
  3. Evaluate existing solutions (March 31)
  4. Propose and evaluate improvements (April 21)
  5. Final presentations (April 22-29)

Page 4

Logistics
• All communications are through Piazza
  • Signup: https://piazza.com/umass/spring2020/cs692s
• Forming groups:
  • Groups due this Friday
  • I will match people without a group over the weekend
  • Everyone is in a tutorial group
  • People in the 3-credit section are also in a project group
  • Can use Piazza to find group members
• Groups and papers out on Saturday
• No electronic devices during classes (not even in airplane mode)

Page 5

Systems for ML

Page 6

Machine Learning
• Wide array of problems and algorithms:
  • Classification: given labeled data points, predict the label of a new data point
  • Regression: learn a function from some (x, y) pairs
  • Clustering: group data points into "similar" clusters
  • Segmentation: partition an image into meaningful segments
  • Outlier detection

Page 7

More Dimensions
• Supervision:
  • Supervised ML: labeled ground truth is available
  • Unsupervised ML: no ground truth
• Training vs. inference:
  • Training: obtain a model from training data
  • Inference: actually run the prediction

Page 8

Example: Ad Click Predictor
• Ad prediction problem:
  • A user is browsing the web
  • Choose the ad that maximizes the likelihood of a click
• Training data:
  • Trillions of ad-click log entries
  • Trillions of features per ad and user
• Important to reduce the running time of training:
  • Want to retrain frequently
  • Reduce energy and resource utilization costs

Page 9

Abstracting ML Algorithms
• Can we find commonalities among ML algorithms?
• This would allow finding:
  • Common abstractions
  • Systems solutions to efficiently implement these abstractions
• Some common aspects:
  • We have a prediction model A
  • A should optimize some complex objective function L
    • E.g.: likelihood of correctly labeling a new ad as "click" or "no-click"
  • The ML algorithm does this by iteratively refining A

Page 10

High-Level View
• Notation:
  • D: data
  • A: model parameters
  • L: function to optimize (e.g., minimize loss)
• Goal: update A based on D to optimize L
• Typical approach: iterative convergence

$A_t = F(A_{t-1}, \Delta_L(A_{t-1}, D))$

At iteration t, $\Delta_L$ computes updates that minimize L, and F merges the updates into the parameters.
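A minimal Python sketch of this template, instantiated with plain gradient descent on a toy one-parameter objective (the names `delta_L` and `F` are illustrative, mirroring the notation above):

```python
def iterative_convergence(A0, D, delta_L, F, steps=100):
    """Generic template from above: A_t = F(A_{t-1}, Delta_L(A_{t-1}, D))."""
    A = A0
    for _ in range(steps):
        A = F(A, delta_L(A, D))
    return A

# Toy instantiation: gradient descent on L(A) = (A - 3)^2; the update that
# minimizes L is a step along the negative gradient.
delta_L = lambda A, D: -0.1 * 2 * (A - 3)
F = lambda A, upd: A + upd                           # merge update into parameters
print(iterative_convergence(0.0, None, delta_L, F))  # converges to ~3.0
```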

Page 11

How to Parallelize?
• How to execute the algorithm over a set of workers?
• Data-parallel approach:
  • Partition data D
  • All workers share the model parameters A
• Model-parallel approach:
  • Partition model parameters A
  • All workers process the same data D

Page 12

Data-Parallel Approach
• Process for each worker:
  • Update parameters based on data
  • Push updates to parameter servers
  • Servers aggregate & apply updates
  • Pull parameters
• Requirements:
  • Updates associative and commutative!
  • Example: Stochastic Gradient Descent

$A_t = A_{t-1} + \sum_{p=1}^{P} \Delta(A_{t-1}, D_p)$

where $D_p$ is the data partition at worker p.

Page 13

Example
• Each worker:
  • Loads a partition of the data
  • At every iteration, computes gradients
• Server:
  • Aggregates gradients
  • Updates parameters
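A self-contained numpy sketch of this pattern on a toy linear-regression model, directly instantiating the summed-update formula above (the sharding and learning rate are illustrative; real systems run workers as separate processes):

```python
import numpy as np

rng = np.random.default_rng(0)
P, n, d = 4, 1000, 5                       # workers, data points, parameters
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)
shards = np.array_split(np.arange(n), P)   # partition the data D across workers

A = np.zeros(d)                            # shared parameters on the server
for t in range(100):
    # Each worker computes a gradient update Delta(A_{t-1}, D_p) on its shard.
    updates = [-0.1 * X[s].T @ (X[s] @ A - y[s]) / len(s) for s in shards]
    # The server sums the updates (associative and commutative) and applies them.
    A = A + sum(updates)

print(np.round(A - w_true, 3))             # residual error near zero
```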

Page 14

Parameter Server
• Stores model parameters
• Advantages:
  • No need for message passing
  • Distributed shared memory abstraction
• Very first implementation: key-value store
• Subsequent improvements:
  • Server-side UDFs
  • Worker scheduling
  • Bandwidth optimizations
  • …

Page 15

Architecture
• Single parameters stored as <key, value> pairs
• Server-side linear algebra operations:
  • Sum
  • Multiplication
  • 2-norm
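A minimal in-process sketch of this key-value interface, with aggregation done server-side (the `push`/`pull` names follow the usual parameter-server vocabulary; a real system shards keys across many server nodes):

```python
import numpy as np

class ParameterServer:
    """Toy key-value parameter server: parameters are <key, value> pairs,
    and updates are aggregated on the server (a server-side 'sum')."""
    def __init__(self):
        self.store = {}

    def pull(self, keys):
        # Workers read current parameter values.
        return {k: self.store[k] for k in keys}

    def push(self, updates):
        # Workers send deltas; the server adds them into the stored values.
        for k, delta in updates.items():
            self.store[k] = self.store.get(k, np.zeros_like(delta)) + delta

ps = ParameterServer()
ps.push({"w": np.ones(3)})       # update from worker 1
ps.push({"w": 2 * np.ones(3)})   # update from worker 2 is summed in
print(ps.pull(["w"]))            # {'w': array([3., 3., 3.])}
```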

Page 16

Does This Scale?
• We said that a model can have trillions of parameters
• Q: Does this scale?
• A: Yes
  • Each data point (worker) only updates a few parameters
  • Example: Sparse Logistic Regression
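A sketch of why sparsity makes this scale: one SGD step of sparse logistic regression touches only the parameters whose features are non-zero in the current data point (the dict-of-floats model here stands in for pulls/pushes against a sharded server):

```python
import math
from collections import defaultdict

A = defaultdict(float)  # conceptually huge parameter vector, stored sparsely

def sgd_step(features, label, lr=0.1):
    """One SGD step. `features` maps the few non-zero feature ids of this
    data point to their values, so only those parameters are read (pulled)
    and updated (pushed)."""
    z = sum(A[j] * v for j, v in features.items())
    p = 1.0 / (1.0 + math.exp(-z))          # predicted click probability
    g = p - label                           # gradient of the logistic loss
    for j, v in features.items():
        A[j] -= lr * g * v

# A data point with 2 non-zero features touches 2 parameters, not ~10^12.
sgd_step({7: 0.5, 10**11: 1.0}, label=1)
print(dict(A))
```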

Page 17

Model-Parallel Scheduler
• Some systems (e.g., Petuum) support a global scheduler
• Scheduler runs application-specific logic
• Two main goals:
  • Partition parameters
  • Prioritized scheduling: give precedence to parameters that converge more slowly

Page 18

Model-Parallel Approach
• Process for each worker:
  • Receive the ids of the parameters $S_p^{t-1}$ to update (from the scheduler)
    • This is a partition of the entire space of parameters
  • Compute updates on those parameters
  • Send updates to a parameter server that:
    • Concatenates the updates (which are disjoint)
    • Applies the updates to the parameters
• Requirements:
  • There should be no/weak correlation among parameters
  • Example: matrix factorization
• Q: Advantage?

$A_t = A_{t-1} + \mathrm{Con}\big(\{\Delta_p(A_{t-1}, S_p^{t-1}(A_{t-1}, D))\}_{p=1}^{P}\big)$

where $S_p^{t-1}$ selects worker p's parameter partition and Con concatenates the disjoint updates.

Page 19

TensorFlow vs. Parameter Server
• Parameter server:
  • Separate worker nodes and parameter nodes
  • Different interfaces
• TensorFlow: only tasks
  • Shared parameters are held by stateful operators: variables and queues
  • Tasks managing them are called PS tasks
  • PS tasks are regular tasks: they can run arbitrary operators
  • Uniform programming interface

Page 20

TensorFlow
• Dataflow graph of operators
• Deferred (lazy) execution
• Composable basic operators
• Concept of devices:
  • CPUs, GPUs, mobile devices
  • Different implementations of the operators
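A minimal example in the TF 1.x-style graph API the paper describes (reachable as `tf.compat.v1` in modern installs): building the graph only records operators; nothing runs until a session executes it.

```python
import tensorflow.compat.v1 as tf  # TF 1.x-style API, as in the paper
tf.disable_eager_execution()

# These lines only build a dataflow graph of composable operators;
# nothing is computed yet (deferred/lazy execution).
x = tf.constant([[1.0, 2.0]])
w = tf.Variable([[0.5], [0.5]])
y = tf.matmul(x, w)

with tf.Session() as sess:                       # execution happens here
    sess.run(tf.global_variables_initializer())
    print(sess.run(y))                           # [[1.5]]
```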

Page 21

Dataflow Graph
• Vertex: unit of local computation
  • Called an operation in TensorFlow
• Edges: inputs and outputs of computations
• Values along edges are called tensors:
  • n-dimensional arrays
  • Elements have primitive types (including byte arrays)

Page 22

Execution Model
• Step: client executes a subgraph by indicating:
  • Edges to feed the subgraph with input tensors
  • Edges to fetch the output tensors
  • The runtime prunes the subgraph to remove unnecessary operations
• Subgraphs are run asynchronously by default
• Can execute multiple partial, concurrent subgraphs
  • Example: concurrent batches for data-parallel training
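A small example of a step with feeds and fetches (same TF 1.x-style API as above); because only `c` is fetched, the runtime prunes the unrelated branch `d`:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

a = tf.placeholder(tf.float32, name="a")
b = a * 2.0
c = b + 1.0
d = tf.exp(a)  # separate branch of the graph, unrelated to c

with tf.Session() as sess:
    # One step: feed tensor a, fetch tensor c. The runtime executes only
    # the pruned subgraph {a, b, c}; d is never computed.
    print(sess.run(c, feed_dict={a: 3.0}))  # 7.0
```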

Page 23

Distributed Execution
• Tasks: named processes
  • Run computations, send data
• Operations: consist of multiple kernels
  • The same operation (e.g., matrix multiplication) has different kernels for different devices
• Devices: CPU, GPU, TPU, mobile, …
  • CPU is the host device
  • A device executes a kernel for each operation assigned to it

Page 24

Distributed Scheduling
• The TensorFlow runtime places operations on devices:
  • Implicit constraints: a stateful operation goes on the same device as its state
  • Explicit constraints: dictated by the user
  • Optimal placement is still an open question
• This yields per-device subgraphs:
  • All operations assigned to the device
  • Send and Receive operations replace cross-device edges
• Specialized per-device implementations of Send/Receive:
  • CPU – GPU: CUDA memory copy
  • Across tasks: TCP or RDMA
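A sketch of explicit placement constraints (the `/job:ps` and `/job:worker` device names assume a cluster configuration, so this builds a graph but will not run standalone); the runtime splits the graph per device and inserts Send/Receive pairs on the cut edges:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

# Explicit constraint: pin the stateful variable to a PS task.
with tf.device("/job:ps/task:0/device:CPU:0"):
    w = tf.Variable(tf.zeros([100, 10]))

# Explicit constraint: run the compute on a worker's GPU. The edge from w
# to the matmul crosses devices, so the runtime replaces it with a
# Send/Receive pair (CUDA memcpy between CPU and GPU, TCP/RDMA across tasks).
with tf.device("/job:worker/task:0/device:GPU:0"):
    x = tf.placeholder(tf.float32, [None, 100])
    logits = tf.matmul(x, w)
```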

Page 25

Training vs. Inference
• Training: data → model
  • Computationally expensive
  • No hard real-time requirements (typically)
• Inference: data + model → prediction
  • Computationally cheaper
  • Real-time requirements (sometimes sub-millisecond)

Page 26

Example: Clipper

Page 27

Challenge: Different Frameworks
• Many training frameworks, each with its strengths
  • E.g.: Caffe for computer vision, HTK for speech recognition
• Each uses different formats → tailored deployment
• The best tool may change over time
• Solution: model abstraction

Page 28

Challenge: Prediction Latency
• Many ML models have high prediction latency
  • Some are too slow to use online, e.g., when choosing an ad
  • Combining model outputs makes it worse
• Trade-off between accuracy and latency
• Solutions:
  • Adaptive batching (sketch below)
  • Enable mixing models with different complexity
  • Straggler mitigation when using multiple models
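A simplified, self-contained sketch of adaptive batching in the spirit of Clipper, which searches for the largest batch size that still meets the latency SLO using an additive-increase/multiplicative-decrease rule (the request list and dummy model are stand-ins for a real queue and model container):

```python
import time

def serve(requests, predict_batch, slo_s=0.05):
    """Adaptive batching sketch: grow the batch additively while latency
    stays under the SLO, back off multiplicatively on a miss (AIMD)."""
    batch_size, i = 1, 0
    while i < len(requests):
        batch = requests[i:i + batch_size]
        i += len(batch)
        start = time.perf_counter()
        predict_batch(batch)                      # batched inference call
        latency = time.perf_counter() - start
        if latency < slo_s:
            batch_size += 1                       # additive increase
        else:
            batch_size = max(1, batch_size // 2)  # multiplicative decrease
    return batch_size

# Dummy model whose per-batch latency grows with batch size.
print(serve(list(range(500)), lambda b: time.sleep(0.001 * len(b))))
```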

Page 29

Challenge: Model Selection
• How to decide which models to deploy?
• Selecting the best model offline is expensive
• The best model changes over time:
  • Concept drift: relationships in the data change over time
  • Feature corruption
• Combining multiple models can increase accuracy
• Solution: automatically select among multiple models online (sketch below)
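A simplified sketch of bandit-style online model selection; Clipper's selection layer uses Exp3/Exp4-style algorithms, and this toy version keeps a weight per model, samples proportionally, and reinforces models whose predictions earn reward (the `feedback` function is a stand-in for user feedback):

```python
import math
import random

def select_online(models, reward, rounds=500, gamma=0.1):
    """Exp3-style selection: sample a model from a weighted distribution,
    observe a reward in [0, 1], and exponentially boost its weight."""
    K = len(models)
    w = [1.0] * K
    for _ in range(rounds):
        total = sum(w)
        probs = [(1 - gamma) * wi / total + gamma / K for wi in w]
        i = random.choices(range(K), weights=probs)[0]
        r = reward(models[i])                 # e.g., "was the prediction right?"
        w[i] *= math.exp(gamma * r / (probs[i] * K))  # importance-weighted boost
    return models[max(range(K), key=lambda i: w[i])]

# Toy feedback: model "b" is correct 90% of the time, model "a" 60%.
feedback = lambda m: float(random.random() < (0.9 if m == "b" else 0.6))
print(select_online(["a", "b"], feedback))    # usually prints "b"
```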

Page 30

Clipper Overview
• Requests flow top to bottom and back
• We start by reviewing the Model Abstraction Layer

Page 31

Other Topics
• Can we compile ML programs to speed up learning?
• How to manage resources for ML workloads?
• How to test and debug ML programs?
• Can we run ML directly inside a DBMS?
• How to build end-to-end ML pipelines?

Page 32

ML for Systems

Page 33

Why ML for Systems
• Systems use heuristics to decide on:
  • Data structures
  • Data representation (e.g., row vs. columnar stores)
  • Indices
  • Data caching, replication, partitioning
  • Execution policies (e.g., relational query execution plans)
  • …
• Heuristics can fall short:
  • They are based on assumptions about the workload, which might be wrong
  • They cannot consider complex interactions

Page 34

ML for Systems
• Replace these heuristics with learned components
• Becoming popular in data management:
  • Learned indices (sketch below)
  • Learned query optimizers
• Energy efficiency
• Resource management, cloud
• Parameter tuning
• ML to index and query different types of data:
  • Video, speech, …
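To make "learned indices" concrete, here is a toy sketch in the spirit of Kraska et al.'s learned index: a linear model predicts a key's position in a sorted array, and a bounded local search around the prediction corrects the model's error (real designs use a hierarchy of models rather than a single linear fit):

```python
import numpy as np

class LearnedIndex:
    """Toy learned index: fit position ~ a*key + b over a sorted array,
    remember the worst-case prediction error, and search only within it."""
    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys, dtype=float))
        pos = np.arange(len(self.keys))
        self.a, self.b = np.polyfit(self.keys, pos, 1)   # least-squares fit
        pred = np.rint(self.a * self.keys + self.b).astype(int)
        self.err = int(np.abs(pred - pos).max())         # error bound

    def lookup(self, key):
        guess = int(round(self.a * key + self.b))
        lo = max(0, guess - self.err)                    # bounded local search
        hi = min(len(self.keys), guess + self.err + 2)
        i = lo + int(np.searchsorted(self.keys[lo:hi], key))
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None

idx = LearnedIndex(range(0, 3000, 3))   # keys 0, 3, 6, ..., 2997
print(idx.lookup(2997))                 # position 999
```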