CSE 291D/234 Data Systems for Machine Learning
Topic 1: Classical ML Training at Scale
Chapters 2, 5, and 6 of MLSys book
Arun Kumar
Academic ML 101
Generalized Linear Models (GLMs); from statistics
Bayesian Networks; inspired by causal reasoning
Decision Tree-based: CART, Random Forest, Gradient-Boosted Trees (GBT), etc.; inspired by symbolic logic
Support Vector Machines (SVMs); inspired by psychology
Artificial Neural Networks (ANNs): Multi-Layer Perceptrons (MLPs), Convolutional NNs (CNNs), Recurrent NNs (RNNs), Transformers, etc.; inspired by brain neuroscience
“Classical” ML
Real-World ML 101
https://www.kaggle.com/c/kaggle-survey-2019
(Survey chart of method usage among practitioners: tree learners, GLMs, and deep learning are among the categories shown.)
Scalable ML Training in the Lifecycle
(Lifecycle diagram: Data acquisition; Data preparation; Feature Engineering; Model Selection; Training & Inference; Serving; Monitoring.)
Scalable ML Training in the Big Picture
ML Systems
Q: What is a Machine Learning (ML) System?
❖ A data processing system (aka data system) for mathematically advanced data analysis operations (inferential or predictive): statistical analysis; ML, deep learning (DL); data mining (domain-specific applied ML + feature eng.)
❖ High-level APIs to express ML computations over (large) datasets
❖ Execution engine to run ML computations efficiently and in a scalable manner
But what exactly does it mean for an ML system to be “scalable”?
Outline
❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability
Background: Memory Hierarchy
(Figure: the memory hierarchy below the CPU, with approximate capacity, price, access speed, and access latency per level.)

Level                       Capacity   Price      Access Speed   Access Cycles
CPU Caches                  ~MBs       ~$2/MB     ~100GB/s       -
Main Memory (DRAM)          ~10GBs     ~$5/GB     ~10GB/s        100s
Flash Storage               ~TBs       ~$200/TB   ~GB/s          10^5 to 10^6
Magnetic Hard Disk (HDD)    ~10TBs     ~$40/TB    ~200MB/s       10^7 to 10^8
Memory Hierarchy in Action
Q: What does this program do when run with 'python'? (Assume tmp.csv is in the current working directory.)

tmp.py:

import pandas as pd
m = pd.read_csv('tmp.csv', header=None)
s = m.sum().sum()
print(s)

tmp.csv:

1,2,3
4,5,6
Memory Hierarchy in Action

Rough sequence of events when the program is executed:
❖ I/O for code: tmp.py is read from disk and its commands are interpreted
❖ I/O for data: tmp.csv is retrieved from disk and staged up the hierarchy (Disk, DRAM, Caches, Registers); each level stores and retrieves data for the level above
❖ Computations are done by the Processor (CU and ALU) on the registers
❖ I/O for display: the result '21' is shown on the monitor

Q: What if this does not fit in DRAM?
Scalable ML Systems
❖ ML systems that do not require the (training) dataset to fit entirely in main memory (DRAM) of one node
❖ Conversely, if the system thrashes when the data file does not fit in RAM, it is not scalable
Basic Idea: Split data file (virtually or physically) and stage reads (and writes) of pages to DRAM and processor
Scalable ML Systems
4 main approaches to scale ML to large data:
❖ Single-node disk: Paged access from file on local disk
❖ Remote read: Paged access from disk(s) over network
❖ Distributed memory: Fits on a cluster's total DRAM
❖ Distributed disk: Fits on a cluster's full set of disks
Evolution of Scalable ML Systems
(Timeline figure, from the 1980s to the mid 2010s onward, charting the evolution of scalable ML systems along three axes: Scalability, Manageability, and Developability. Eras shown: 1980s; mid 1990s; late 1990s to mid 2000s; late 2000s to early 2010s; mid 2010s onward. Systems shown: In-RDBMS ML Systems, ML System Abstractions, ML on Dataflow Systems, Parameter Server, Cloud ML, and Deep Learning Systems.)
Major Existing ML Systems
❖ General ML libraries: disk-based files; in-memory; layered on RDBMS/Spark; cloud-native
❖ "AutoML" platforms
❖ Decision tree-oriented
❖ Deep learning-oriented
(The slide shows logos of example systems in each category.)
Outline
❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability
ML Algorithm = Program Over Data
Basic Idea: Split data file (virtually or physically) and stage reads (and writes) of pages to DRAM and processor
❖ To scale an ML program's computations, split them up to operate over "chunks" of data at a time
❖ How to split up an ML program this way can be non-trivial!
  ❖ Depends on the data access pattern of the algorithm
  ❖ A large class of ML algorithms do just sequential scans for iterative numerical optimization
Numerical Optimization in ML
❖ Many regression and classification models in ML are formulated as a (constrained) minimization problem
  ❖ E.g., logistic and linear regression, linear SVM, etc.
  ❖ Aka "Empirical Risk Minimization" (ERM):

w^* = \arg\min_w \sum_{i=1}^{n} l(y_i, f(w, x_i))

❖ GLMs define hyperplanes and use an f() that is a scalar function of the distance w^T x_i

Example dataset D:
Y  X1  X2  X3
0  1b  1c  1d
1  2b  2c  2d
1  3b  3c  3d
0  4b  4c  4d
…  …   …   …
Batch Gradient Descent for ML

❖ For many ML models, the loss function l() is convex; so is L():

L(w) = \sum_{i=1}^{n} l(y_i, f(w, x_i))

❖ But closed-form minimization is typically infeasible
❖ Batch Gradient Descent (BGD): an iterative numerical procedure to find an optimal w
  ❖ Initialize w to some value w^{(0)}
  ❖ Compute the gradient: \nabla L(w^{(k)}) = \sum_{i=1}^{n} \nabla l(y_i, f(w^{(k)}, x_i))
  ❖ Descend along the gradient (aka Update Rule): w^{(k+1)} \leftarrow w^{(k)} - \eta \nabla L(w^{(k)})
  ❖ Repeat until we get close to w^*, aka convergence
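The procedure above can be sketched in a few lines of plain Python for logistic regression (a toy illustration with made-up data and hyper-parameters, not a scalable implementation):

```python
import math

def bgd(X, y, lr=0.1, epochs=100):
    """Batch gradient descent for logistic regression (bias term omitted)."""
    w = [0.0] * len(X[0])                              # initialize w(0)
    for _ in range(epochs):
        grad = [0.0] * len(w)                          # gradient = SUM over all of D
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj               # per-example gradient of log loss
        w = [wj - lr * gj for wj, gj in zip(w, grad)]  # descend along the gradient
    return w

# Toy 1-D data: the label is 1 exactly when the feature is positive,
# so the learned weight should come out positive
w = bgd([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])
```

Note that the whole dataset is touched once per update, which is exactly why the data access pattern matters at scale.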
Batch Gradient Descent for ML

(Figure: a convex loss curve L(w) with iterates w^{(0)}, w^{(1)}, w^{(2)}, … descending along the gradient toward the optimum w^*:
w^{(1)} \leftarrow w^{(0)} - \eta \nabla L(w^{(0)})
w^{(2)} \leftarrow w^{(1)} - \eta \nabla L(w^{(1)})
…)

❖ The learning rate \eta is a hyper-parameter selected by the user or by "AutoML" tuning procedures
❖ The number of iterations/epochs of BGD is also a hyper-parameter
Data Access Pattern of BGD at Scale

❖ The data-intensive computation in BGD is the gradient:

\nabla L(w^{(k)}) = \sum_{i=1}^{n} \nabla l(y_i, f(w^{(k)}, x_i))

❖ In scalable ML, dataset D may not fit in DRAM
❖ Model w is typically small and DRAM-resident
❖ The gradient is like a SQL SUM over vectors (one per example)
  ❖ At each epoch, 1 filescan over D suffices to get the gradient
  ❖ The update of w happens normally in DRAM
  ❖ Monitoring across epochs for convergence is needed
❖ The loss function L() is also just a SUM in a similar manner

Q: What SQL op is this reminiscent of?
Scaling BGD to Disk

Basic Idea: Split data file (virtually or physically) and stage reads (and writes) of pages to DRAM and processor

(Figure: the process reads the file's pages P1, P2, P3, … from disk into the OS cache in DRAM one by one and then discards them: the "filescan" access pattern. Suppose the OS cache can hold only 4 pages of the file; once the cache is full, old pages are replaced, so reading P5 evicts P1, reading P6 evicts P2, and so on.)
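The eviction behavior in that figure can be mimicked with a toy FIFO page cache (an illustration only; a real OS page cache is far more sophisticated):

```python
from collections import deque

def filescan(pages, cache_size=4):
    """Simulate a filescan through a tiny FIFO page cache."""
    cache, evictions = deque(), []
    for p in pages:
        if len(cache) == cache_size:      # cache full: replace the oldest page
            evictions.append(cache.popleft())
        cache.append(p)                   # stage the next page into the cache
    return list(cache), evictions

cache, evicted = filescan(["P1", "P2", "P3", "P4", "P5", "P6"])
```

With a 4-page cache, reading P5 and P6 evicts P1 and P2, matching the slide.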
Scaling BGD to Disk

❖ Sequential scan to read pages from disk to DRAM
❖ Modern DRAM sizes can be 10s of GBs; so we read a "chunk" of the file at a time (say, 1000s of pages)
❖ Compute a partial gradient on each chunk and add them all up:

\nabla L(w^{(k)}) = \nabla L_1(w^{(k)}) + \nabla L_2(w^{(k)}) + \dots

where \nabla L_j(w^{(k)}) sums \nabla l(y_i, f(w^{(k)}, x_i)) over the examples of D laid out on data chunk j (e.g., the first data page holds rows (0,1b,1c,1d) and (1,2b,2c,2d), the next holds (1,3b,3c,3d), and so on).
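The chunk-at-a-time summation can be sketched as follows (a toy illustration using a least-squares loss and an in-memory stand-in for the on-disk file; the helper names are invented here):

```python
import csv
import io

# A toy "on-disk" file: label followed by one feature per row
data = "0,-2.0\n0,-1.0\n1,1.0\n1,2.0\n"

def partial_gradient(rows, w):
    """Partial gradient of 0.5*(w*x - y)^2 over one chunk of examples."""
    g = 0.0
    for y, x in rows:
        g += (w * x - y) * x
    return g

def full_gradient(f, w, chunk_size=2):
    """Stream the file chunk by chunk and add up the partial gradients."""
    grad, chunk = 0.0, []
    for row in csv.reader(f):
        chunk.append((float(row[0]), float(row[1])))
        if len(chunk) == chunk_size:
            grad += partial_gradient(chunk, w)   # grad_L1, grad_L2, ...
            chunk = []
    if chunk:                                    # last, possibly partial, chunk
        grad += partial_gradient(chunk, w)
    return grad

g = full_gradient(io.StringIO(data), w=0.0)
```

The result is identical to a single pass over the whole dataset, since the gradient is just a SUM.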
DaskML’s Scalable DataFrame
❖ A Dask DF scales to on-disk files by splitting the data into a bunch of pandas DFs under the covers
❖ The Dask API is a "wrapper" around the pandas API that scales ops to the splits and puts all the results together
https://docs.dask.org/en/latest/dataframe.html
Basic Idea: Split data file (virtually or physically) and stage reads (and writes) of pages to DRAM and processor
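A toy stand-in for this wrapper idea in plain Python (the helper names here are invented for illustration and are not Dask's API):

```python
def split(rows, n_chunks):
    """Virtually split a 'file' (here, a list of values) into chunks."""
    k = max(1, len(rows) // n_chunks)
    return [rows[i:i + k] for i in range(0, len(rows), k)]

def chunked_sum(rows, n_chunks=3):
    """What a Dask-style wrapper does for .sum(): per-chunk partials, then combine."""
    partials = [sum(chunk) for chunk in split(rows, n_chunks)]  # one per pandas DF
    return sum(partials)                                        # combine results

total = chunked_sum([1, 2, 3, 4, 5, 6])   # same values as tmp.csv earlier
```

As with the earlier tmp.csv example, the total is 21, but no single chunk ever needs to hold the whole dataset.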
Scaling with Remote Reads
❖ Similar to scaling to disk, but instead read pages/chunks over the network from remote disk(s) (e.g., from S3)
  ❖ Good in practice for a one-shot filescan access pattern
  ❖ For iterative ML, repeated reads over the network; can combine with caching on local disk / DRAM
❖ Increasingly popular for cloud-native ML workloads
Basic Idea: Split data file (virtually or physically) and stage reads (and writes) of pages to DRAM and processor
Stochastic Gradient Descent for ML
❖ Two key cons of BGD:
  ❖ Slow to converge to the optimal (too many epochs)
  ❖ Costly full scan of D for each update of w
❖ Stochastic GD (SGD) mitigates both issues
  ❖ Basic Idea: Use a sample (called a mini-batch) of D to approximate the gradient instead of the "full batch" gradient
  ❖ Without-replacement sampling
  ❖ Randomly shuffle D before each epoch
  ❖ One pass = a sequence of mini-batches
❖ SGD works well for non-convex loss functions too, unlike BGD; the "workhorse" of scalable ML
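A minimal sketch of mini-batch SGD with per-epoch shuffling (toy logistic-regression setup; the data, hyper-parameters, and seed are made up):

```python
import math
import random

def sgd(X, y, lr=0.1, epochs=20, batch_size=2, seed=0):
    """Mini-batch SGD for logistic regression (toy sketch, bias omitted)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)                          # random shuffle before each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]   # one mini-batch = one update
            grad = [0.0] * len(w)
            for i in batch:
                p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, X[i]))))
                for j, xj in enumerate(X[i]):
                    grad[j] += (p - y[i]) * xj
            w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

w = sgd([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])
```

Compared to the earlier BGD sketch, each epoch here performs several updates of w rather than one.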
Data Access Pattern of Scalable SGD

Update rule, using a mini-batch B sampled from the dataset without replacement:

W^{(t+1)} \leftarrow W^{(t)} - \eta \nabla \tilde{L}(W^{(t)}), where \nabla \tilde{L}(W^{(t)}) = \sum_{(y_i, x_i) \in B \subset D} \nabla l(y_i, f(W^{(t)}, x_i))

(Figure: the original dataset is first randomly "shuffled", e.g., via ORDER BY RAND(). Epoch 1 then does a sequential scan over the randomized dataset, applying one update per mini-batch: W^{(0)} → W^{(1)} → W^{(2)} → W^{(3)}. Epoch 2 optionally does a new random shuffle and scans again: W^{(3)} → W^{(4)} → ….)
Data Access Pattern of Scalable SGD

❖ Mini-batch gradient computations: 1 filescan per epoch
  ❖ Update of w happens in DRAM
  ❖ During the filescan, count the number of examples seen and update per mini-batch
  ❖ Typical mini-batch sizes: 10s to 1000s
  ❖ Orders of magnitude more updates than BGD!
❖ A random shuffle is not trivial to scale; it requires an "external merge sort" (roughly 2 scans of the file)
  ❖ ML practitioners often shuffle the dataset only once up front; good enough in most cases in practice
Handling pages directly is so low-level! Is there a higher-level way to scale ML?
Outline
❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability
Outline
❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
  ❖ In-RDBMS ML
  ❖ ML on Dataflow Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability
RDBMS Scales Queries over Data
❖ Mature software systems to scale to larger-than-RAM data
RDBMS User-Defined Aggregates
❖ Most industrial and open source RDBMSs allow extensibility to their SQL dialect with User-Defined Functions (UDFs)
❖ User-Defined Aggregate (UDA) is a UDF API to specify custom aggregates over the whole dataset
Example with SQL AVG; the agg. state is (S, C) = (partial sum, partial count):

Initialize: Start by setting up the agg. state in DRAM
Transition: RDBMS gives a tuple from the table; update the agg. state: (S, C) ← (S, C) + (v_i, 1)
Merge: (Optional: in a parallel RDBMS, combine the agg. states of the workers) (S', C') ← Σ_{worker j} (S_j, C_j)
Finalize: Post-process the agg. state and return the result: return S'/C'
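The four-part API can be sketched in Python for AVG (a toy emulation of the UDA contract, not any RDBMS's actual UDF interface):

```python
class AvgUDA:
    """SQL AVG written against the four-part UDA template (sketch)."""
    def initialize(self):
        self.s, self.c = 0.0, 0          # agg. state: (partial sum, partial count)
    def transition(self, v):
        self.s += v                      # RDBMS feeds one tuple at a time
        self.c += 1
    def merge(self, other):
        self.s += other.s                # combine agg. states of parallel workers
        self.c += other.c
    def finalize(self):
        return self.s / self.c           # post-process and return the result

# Two "workers" each scan their shard; the master then merges and finalizes
w1, w2 = AvgUDA(), AvgUDA()
w1.initialize()
w2.initialize()
for v in [1, 2, 3]:
    w1.transition(v)
for v in [4, 5]:
    w2.transition(v)
w1.merge(w2)
avg = w1.finalize()
```

AVG works here because sum and count merge exactly across workers, which is precisely what an SGD epoch lacks (see below on model averaging).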
Bismarck: SGD as RDBMS UDA

❖ SGD epoch implemented as a UDA for in-RDBMS execution
❖ Data-intensive computation scaled automatically by the RDBMS
❖ Commands for shuffling, running multiple epochs, checking convergence, and validation/test error measurements are issued from an external controller written in Python

Initialize: Allocate memory for the model W^{(t)} and the mini-batch gradient stats
Transition: Given a tuple with (y, x), compute the gradient and update the stats; if the mini-batch limit is hit, update the model and reset the stats
Merge: (Optional: applies only to parallel RDBMSs) "Combine" model parameters from independent workers
Finalize: Return the model parameters
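This template can be sketched in Python (a toy emulation with a 1-D least-squares model; this is illustrative only, not Bismarck's actual code):

```python
class SGDUDA:
    """An SGD epoch cast into the UDA template (toy 1-D linear regression)."""
    def initialize(self, lr=0.05, batch_size=2):
        self.w, self.lr, self.batch = 0.0, lr, batch_size
        self.g, self.seen = 0.0, 0        # mini-batch gradient stats
    def transition(self, y, x):
        self.g += (self.w * x - y) * x    # the bulk of the ML logic lives here
        self.seen += 1
        if self.seen == self.batch:       # mini-batch limit hit: update, reset
            self.w -= self.lr * self.g
            self.g, self.seen = 0.0, 0
    def finalize(self):
        return self.w

uda = SGDUDA()
uda.initialize()
for _ in range(50):                       # epochs driven by an external controller
    for y, x in [(2.0, 1.0), (4.0, 2.0), (6.0, 3.0), (8.0, 4.0)]:
        uda.transition(y, x)              # RDBMS feeds one tuple at a time
w = uda.finalize()
```

On this toy data (y = 2x), the learned weight converges to 2.0.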
Programming ML as RDBMS UDAs

❖ Many ML algorithms fit within Bismarck's template
❖ The Transition function is where the bulk of the ML logic lives; just a few lines of code for SGD updates
Distributed SGD via RDBMS UDA

❖ Data is pre-sharded across workers in parallel RDBMSs
❖ Initialize and Transition run independently on each shard
❖ Merge "combines" model params from the workers; tricky, since an SGD epoch is not an algebraic agg. like SUM/AVG
❖ Common heuristic: "Model Averaging". Workers 1, 2, …, n produce W_1^{(t)}, W_2^{(t)}, …, W_n^{(t)}; the Master computes:

W^{(t)} = \frac{1}{n} \sum_{i=1}^{n} W_i^{(t)}

  ❖ Affects convergence
  ❖ Works OK for GLMs

Q: How does the RDBMS parallelize the (SGD) UDA?
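Model averaging itself is a one-liner (a sketch; the per-worker models below are made-up stand-ins for parameters trained independently on each shard):

```python
def model_average(worker_models):
    """Heuristic Merge for distributed SGD: average the per-worker parameters."""
    n = len(worker_models)
    return [sum(ws) / n for ws in zip(*worker_models)]

# Each of 3 workers ran SGD on its own shard and produced its own parameter vector
w_avg = model_average([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Unlike the SUM/COUNT merge of AVG, this average is not mathematically equivalent to running SGD on the full data, which is why it only approximates the sequential result.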
Bottlenecks of RDBMS SGD UDA

❖ Model Averaging for distributed SGD has poor convergence for non-convex/ANN models
  ❖ Too many epochs, typically poor ML accuracy
❖ Model sizes can be too large (even 10s of GBs) for the UDA's aggregation state
❖ The UDA's Merge step is a choke point at scale (100s of workers) under the Bulk Synchronous Parallelism (BSP) of parallel RDBMSs
The MADlib Library
❖ A decade-old library of scalable statistical and ML procedures on PostgreSQL and Greenplum (a parallel RDBMS)
❖ Many procedures are UDAs; some are written in pure SQL
❖ All can be invoked from the SQL console
❖ The RDBMS can be used for ETL (extract, transform, load) of features
https://madlib.apache.org/docs/latest/index.html
Tradeoffs of In-RDBMS ML

❖ Usability. Pro: Some data analysts like ML from the SQL console. Con: Most ML users want the full flexibility of the Python console.
❖ Manageability. Pro: In-situ data governance and security/auth. of RDBMSs. Con: Many ML users want more hands-on data access in Jupyter notebooks.
❖ Efficiency. Pro: Faster in some cases. Con: Typically slower due to API restrictions; custom ML systems are typically faster; interference with OLTP workloads.
❖ Scalability. Pro: Massively parallel processing of RDBMSs like Greenplum. Con: BSP is a bottleneck for 100+ nodes (asynchrony needed).
❖ Developability. Pro: SQL-based ETL. Con: Unnatural APIs to write ML.
Outline
❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
  ❖ In-RDBMS ML
  ❖ ML on Dataflow Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability
The MapReduce Abstraction

❖ Programming model for writing programs on sharded data + distributed system architecture for processing large data
❖ Map and Reduce are terms/ideas from functional PL
❖ The developer only implements the logic of Map and Reduce
❖ The system implementation handles orchestration of data distribution, parallelization, etc. under the covers
❖ Standard example: count word occurrences in a doc corpus ❖ Input: A set of text documents (say, webpages) ❖ Output: A dictionary of unique words and their counts
Hmmm, sounds suspiciously familiar … :)function map (String docname, String doctext) : for each word w in doctext : emit (w, 1)function reduce (String word, Iterator partialCounts) : sum = 0 for each pc in partialCounts : sum += pc emit (word, sum)
Part of MapReduce API
The MapReduce Abstraction
43
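To make the contract above concrete, here is a minimal pure-Python emulation of MapReduce word counting. The `run_mapreduce` driver is hypothetical: in a real framework, the grouping ("shuffle") and the parallel invocation of mappers and reducers happen across machines, under the covers.

```python
from collections import defaultdict

def map_fn(docname, doctext):
    # Map: emit (word, 1) for every word; operates on one record independently
    for word in doctext.split():
        yield (word, 1)

def reduce_fn(word, partial_counts):
    # Reduce: aggregate all partial counts gathered for one key
    return (word, sum(partial_counts))

def run_mapreduce(docs, map_fn, reduce_fn):
    # Shuffle phase: group Map outputs by key (the framework does this)
    groups = defaultdict(list)
    for name, text in docs.items():
        for k, v in map_fn(name, text):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = {"d1": "the cat sat", "d2": "the cat ran"}
counts = run_mapreduce(docs, map_fn, reduce_fn)
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```

Note that `map_fn` sees one document at a time and `reduce_fn` sees all values for one key, exactly mirroring the dependency restrictions discussed below.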
How MapReduce Works
Parallel flow of control and data during MapReduce execution:
Under the covers, each Mapper and Reducer is a separate process; Reducers face barrier synchronization (BSP)
Fault tolerance achieved using data replication
44
What is Hadoop then?
❖ Open-source system implementation with MapReduce as prog. model and HDFS as distr. filesystem
❖ Map() and Reduce() functions in API; input splitting, data distribution, shuffling, fault tolerance, etc. all handled by the Hadoop library under the covers
❖ Mostly superseded by the Spark ecosystem these days although HDFS is still the base
45
Abstract Semantics of MapReduce
❖ Map(): Operates independently on one “record” at a time ❖ Can batch multiple data examples on to one record ❖ Dependencies across Mappers not allowed ❖ Can emit 1 or more key-value pairs as output ❖ Data types of inputs and outputs can be different!
❖ Reduce(): Gathers all Map output pairs across machines with the same key into an Iterator (list) ❖ Aggregation function applied on the Iterator to produce the final output
❖ Input Split: ❖ Physical-level split/shard of dataset that batches multiple
examples into one file “block” (~128MB default on HDFS) ❖ Custom Input Splits can be written by the application user
46
Benefits of MapReduce
❖ Goal: Higher level abstraction for parallel data processing ❖ Key Benefits:
❖ Out-of-the-box scalability and fault tolerance ❖ Map() and Reduce() can be highly general; no restrictions
on data types; easier ETL ❖ Free and OSS (Hadoop)
❖ New burden on users: Cast data-intensive computation to the Map() + Reduce() API ❖ But libraries exist in many PLs to mitigate coding pains:
Java, C++, Python, R, Scala, etc.
47
Emulate MapReduce in SQL?
Q: How would you do the word counting in an RDBMS / in SQL?
❖ First step: transform the text docs into relations and load them; part of the Extract-Transform-Load (ETL) stage. Suppose we pre-divide each document into words and use the schema: DocWords (DocName, Word)
❖ Second step: a single, simple SQL query!
SELECT Word, COUNT(*)
FROM DocWords
GROUP BY Word
[ORDER BY Word]
Parallelism, scaling, etc. done by RDBMS under the covers
48
RDBMS UDA vs MapReduce
❖ Aggregation state: data structure computed (independently) by workers and unified by master
❖ Initialize: Set up info./initialize RAM for agg. state; runs independently on each worker
❖ Transition: Per-tuple function run by worker to update its agg. state; analogous to Map() in MapReduce
❖ Merge: Function that combines agg. states from workers; run by master after workers done; analogous to Reduce()
❖ Finalize: Run once at end by master to return final result
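A minimal Python sketch of the four UDA callbacks, using AVG as the aggregate and a (sum, count) pair as the aggregation state. The class and method names are illustrative, not any RDBMS's actual UDA registration API.

```python
class AvgUDA:
    """Toy UDA computing AVG(x); aggregation state = (running sum, count)."""
    def initialize(self):                # per-worker setup of agg. state
        return (0.0, 0)
    def transition(self, state, x):      # per-tuple update, like Map()
        s, n = state
        return (s + x, n + 1)
    def merge(self, a, b):               # combine worker states, like Reduce()
        return (a[0] + b[0], a[1] + b[1])
    def finalize(self, state):           # run once by master at the end
        s, n = state
        return s / n if n else None

uda = AvgUDA()
shards = [[1.0, 2.0], [3.0, 4.0, 5.0]]   # data sharded across 2 workers
states = []
for shard in shards:                      # each worker scans its shard
    st = uda.initialize()
    for x in shard:
        st = uda.transition(st, x)
    states.append(st)
merged = uda.merge(states[0], states[1])  # master merges worker states
result = uda.finalize(merged)             # 3.0
```

The key property: `merge` only touches the small aggregation states, never the raw tuples, which is what makes the pattern shard-parallel.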
49
BGD via MapReduce
function map(String datafile):
    Read W(t) from known file on HDFS
    Initialize G = 0
    for each tuple in datafile:
        G += per-example gradient on (tuple, W(t))
    emit(G)

function reduce(Iterator partialGradients):
    FullG = 0
    for each G in partialGradients:
        FullG += G
    emit(FullG)
❖ Assume data is sharded; a map partition has many examples ❖ Initial model W(t) read from file on HDFS by each mapper
50
SGD (Averaging) via MapReduce
function map(String datafile):
    Read W(t) from known file on HDFS
    Initialize G = 0
    for each tuple in datafile:
        G += per-example gradient on (tuple, W(t))
        Wj = Update W(t) with G
    emit(Wj)

function reduce(Iterator workerWeights):
    AvgW = 0
    for each Wj in workerWeights:
        AvgW += Wj
    AvgW /= workerWeights.length()
    emit(AvgW)
❖ Similar to BGD but Model Averaging across map partitions
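Model averaging can likewise be sketched in plain Python: each "mapper" runs sequential SGD over only its own shard, and the "reducer" averages the resulting local models. The data, learning rate, and number of rounds below are illustrative assumptions.

```python
import numpy as np

def worker_sgd(X, y, w0, lr=0.05):
    # Mapper: sequential SGD over this shard only, starting from w0
    w = w0.copy()
    for xi, yi in zip(X, y):
        w -= lr * (xi @ w - yi) * xi      # per-example gradient step
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true                             # noiseless synthetic data
w = np.zeros(3)
for _ in range(3):                         # each round = one MapReduce job
    shards = [(X[:1000], y[:1000]), (X[1000:], y[1000:])]
    local_models = [worker_sgd(Xs, ys, w) for Xs, ys in shards]
    w = np.mean(local_models, axis=0)      # reducer: model averaging
# w converges to w_true on this easy convex problem
```

On harder (especially non-convex) problems, averaging independently drifted local models converges poorly, which motivates the Parameter Server approach later in this deck.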
51
Apache Spark
❖ Extended dataflow programming model to subsume MapReduce and most relational operators ❖ Inspired by Python Pandas style of function calls for ops
❖ Unified system to handle relations, text, etc.; support more general distributed data processing
❖ Uses distributed memory for caching; faster than Hadoop ❖ New fault tolerance mechanism using lineage, not replication ❖ From UC Berkeley AMPLab; commercialized as Databricks
52
Spark’s Dataflow Programming Model
Transformations are relational ops, MR, etc., expressed as functions; they are lazily evaluated. Actions are what force computation (aka lazy evaluation)
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
53
Word Count Example in Spark
Spark RDD API available in Python, Scala, Java, and R
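The slide's code screenshot is not reproduced here. Below is a hedged sketch: the PySpark RDD chain is shown in comments (it assumes a `SparkContext` named `sc` and an input file path), followed by a pure-Python emulation of the same dataflow that runs without Spark.

```python
from itertools import chain, groupby

lines = ["hello spark", "hello world"]

# PySpark RDD chain (for reference; assumes a SparkContext `sc`):
#   from operator import add
#   counts = (sc.textFile("docs.txt")
#               .flatMap(lambda line: line.split())
#               .map(lambda w: (w, 1))
#               .reduceByKey(add)
#               .collect())                     # action forces computation
#
# Pure-Python emulation of the same transformations:
pairs = ((w, 1) for w in chain.from_iterable(l.split() for l in lines))
by_key = groupby(sorted(pairs), key=lambda kv: kv[0])   # the "shuffle"
counts = {k: sum(v for _, v in grp) for k, grp in by_key}
# counts == {"hello": 2, "spark": 1, "world": 1}
```

Until `collect()` (an action) is called, the Spark chain only builds a lineage graph; nothing is computed, which is the lazy evaluation mentioned above.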
54
Programming ML in MapReduce/Spark
❖ All ML procedures that can be cast as RDBMS UDAs can be cast to MapReduce API of Hadoop or Spark RDD APIs ❖ MapReduce is just easier to use for most developers ❖ Spark integrates better with PyData ecosystem
https://papers.nips.cc/paper/3150-map-reduce-for-machine-learning-on-multicore.pdf
❖ Apache Mahout is a library of classical ML algorithms written as MapReduce programs for Hadoop; expanded later to DSL
❖ SparkML is a library of classical ML algorithms written using Spark RDD API; common in enterprises for scalable ML
55
Tradeoffs of ML on Dataflow Systems
❖ Usability. Pro: SparkML integrates well with Python stacks. Con: not all ML algorithms are scaled.
❖ Manageability. Pro: in-situ data governance and security/auth. of “data lakes”. Con: many ML users may not be familiar with Spark data ETL.
❖ Efficiency. Pro: comparable to in-RDBMS ML; no OLTP interference. Con: custom ML systems typically faster still.
❖ Scalability. Pro: massively parallel processing. Con: BSP is a bottleneck for 100+ nodes (asynchrony needed).
❖ Developability. Pro: SQL- & MR-based ETL; less code to write ML. Con: still somewhat unnatural APIs to write ML.
56
Outline
❖ Basics of Scaling ML Computations ❖ Scaling ML to On-Disk Files ❖ Layering ML on Scalable Data Systems ❖ Custom Scalable ML Systems ❖ Advanced Issues in ML Scalability
57
Outline
❖ Basics of Scaling ML Computations ❖ Scaling ML to On-Disk Files ❖ Layering ML on Scalable Data Systems ❖ Custom Scalable ML Systems
❖ Parameter Server ❖ GraphLab ❖ XGBoost
❖ Advanced Issues in ML Scalability
58
Parameter Server for Distributed SGD
❖ Recall bottlenecks of Model Averaging-based SGD in RDBMS UDA or with MapReduce: ❖ BSP becomes a choke point (Merge / Reduce stage) ❖ Often poor convergence, especially for non-convex ❖ Hard to handle large models
❖ Parameter Server (PS) mitigates all these issues: ❖ Breaks the synchronization barrier for merging: allows
asynchronous updates from workers to master ❖ Flexible communication frequency: can send updates at
every mini-batch or a set of few mini-batches
59
Parameter Server for Distributed SGD
[Figure: Worker 1, Worker 2, …, Worker n exchange updates with parameter-server shards PS 1, PS 2, …, PS k]
❖ Multi-server “master”; each server manages a part of W(t)
❖ Push / Pull when ready/needed
❖ No sync. for workers or servers
❖ Workers send gradients, e.g., ∇L̃(W1(t)), ∇L̃(W2(t-1)), ∇L̃(Wn(t+1)) (note the differing staleness), to the master for updates at each mini-batch (or lower frequency)
❖ Model params may get out-of-sync or stale, but SGD turns out to be remarkably robust; multiple updates per epoch really help
❖ Communication cost per epoch is higher (once per mini-batch)
60
Programming ML using PS
❖ Designed mainly for sparse feature vectors/updates ❖ Easy to parallelize updates to model params
❖ ML developer recasts the ML procedure into two parts: worker-side updates and server-side aggregation ❖ Loosely analogous to Map and Reduce, respectively ❖ But more complex due to flexible update schedules
❖ Supports 3 consistency models for staleness of updates, with different tradeoffs on efficiency vs accuracy
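A toy sketch of the asynchronous push/pull pattern: threads stand in for worker nodes, and a lock-protected array stands in for one server shard. The `ParamServer` class and its `pull`/`push` methods are illustrative names, not any real PS system's API, and the squared-loss task is a made-up example.

```python
import threading
import numpy as np

class ParamServer:
    """Toy single-shard parameter server: workers pull the latest
    weights and push gradients asynchronously (no barrier)."""
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        # Applied as soon as it arrives; workers never wait for each other
        with self.lock:
            self.w -= self.lr * grad

def worker(ps, X, y, steps):
    n = len(X)
    for i in range(steps):
        w = ps.pull()                     # may already be stale when we push
        xi, yi = X[i % n], y[i % n]
        ps.push((xi @ w - yi) * xi)       # per-example squared-loss gradient

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, 2.0, -1.0])
y = X @ w_true                            # noiseless synthetic data
ps = ParamServer(3)
threads = [threading.Thread(target=worker, args=(ps, X[s], y[s], 2000))
           for s in (slice(0, 250), slice(250, 500))]
for t in threads: t.start()
for t in threads: t.join()
# ps.w lands near w_true despite asynchronous, possibly stale updates
```

Even though each worker may compute its gradient on a stale copy of the weights, SGD still converges here, which is the robustness the previous slide points to.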
61
Systems-level Advances in PS
❖ Workers and Servers can both be multi-nodes/multi-process; fine-grained task scheduling on nodes
❖ Range-based partitioning of params for servers
❖ Timestamped and compressed updates exchanged
❖ Replication and fault tolerance
62
Tradeoffs of Parameter Server
❖ Usability. Pro: supports billions of features and params. Con: not reproducible; not well integrated with ETL stacks.
❖ Manageability. Pro: in-built fault tolerance. Con: TensorFlow offers it natively; otherwise, hard to operate/govern.
❖ Efficiency. Pro: faster than in-RDBMS ML and ML-on-dataflow. Con: not suitable for dense updates; high comm. cost.
❖ Scalability. Pro: highest; can work with 1000s of nodes. Con: not suitable for smaller scales due to overheads.
❖ Developability. Pro: abstracts away many systems scaling issues. Con: reasoning about ML (in)consistency is hard.
63
Sample of Your Strong Points on PS
❖ “Abstracts away most system complexity such as asynchronous communication between nodes, fault tolerance, and data replication”
❖ “Allows addition and deletion of nodes to the framework without any restarts by means of a consistent hash ring with replication of data”
❖ “Relaxing consistency constraints gives the algorithm designer flexibility in defining different consistency models (going from sequential to eventual) by balancing system efficiency and algorithm convergence”
❖ “Outperforms other open source systems in Sparse Logistic Regression”
❖ “Range construct, while simple, seems fundamental to efficient communication and thus fault tolerance”
❖ “Generality. Both supervised learning and unsupervised learning”
❖ “Fault tolerance is crucial … in an unreliable environment like the cloud where the machines can fail or jobs can be preempted”
64
Sample of Your Weak Points on PS
❖ “Does not show where or when the bottlenecks of synchronization appear”
❖ “Could lead to a longer amount of time for an ML algorithm model to train if the model is sensitive to inconsistent data”
❖ “Does not specify how to set a reasonable threshold to achieve best trade-off between system efficiency and algorithm convergence rate”
❖ “Flexibility in choosing consistency models also adds in complexity”
❖ “Non-convex models, such as complex neural networks, may not be suitable for this pipeline”
❖ “Didn’t show how the system performed in training a deep neural net”
❖ “Crucial and widely used non-parametric models like Decision Trees and K nearest neighbors do not fit the architecture”
❖ “Google is hardly representative of the broader industry”
❖ “There is no eval at all for … fault tolerance”
65
Innovativeness and Depth Ratings
66
Outline
❖ Basics of Scaling ML Computations ❖ Scaling ML to On-Disk Files ❖ Layering ML on Scalable Data Systems ❖ Custom Scalable ML Systems
❖ Parameter Server ❖ GraphLab ❖ XGBoost
❖ Advanced Issues in ML Scalability
67
Graph-Parallel Algorithms
❖ Some data analytics algorithms (not just ML) operate on graph data and have complex update dependencies
Example: PageRank
http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
❖ Not a simple sequential access pattern like SGD ❖ If viewed as a table: reads and writes to tuples, with each
write depending on all neighboring tuples ❖ Does not scale well with RDBMS/UDA or MapReduce/Spark
68
The GraphLab Abstraction
❖ “Think like a vertex” paradigm over data graph (V, E, D) ❖ Arbitrary data state associated with vertices and edges ❖ 3-function API: Gather-Apply-Scatter
❖ Gather: Collect latest states from neighbors ❖ Apply: Vertex-local state update function ❖ Scatter: Send local state to neighbors
http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
❖ Original single-node GraphLab assumed shared-memory
❖ Scaled to single-node disk with careful sharding & caching of graph
69
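The GAS paradigm can be sketched in a few lines of Python with PageRank as the vertex program. This is a synchronous, single-node toy (real GraphLab allows asynchronous, distributed execution), and it assumes every vertex has at least one out-neighbor (no dangling vertices).

```python
def pagerank_gas(adj, d=0.85, iters=50):
    # Synchronous Gather-Apply-Scatter sketch of PageRank.
    # adj maps each vertex to its list of out-neighbors.
    verts = list(adj)
    rank = {v: 1.0 / len(verts) for v in verts}
    in_nbrs = {v: [u for u in verts if v in adj[u]] for v in verts}
    for _ in range(iters):
        new = {}
        for v in verts:
            # Gather: pull latest states from in-neighbors
            gathered = sum(rank[u] / len(adj[u]) for u in in_nbrs[v])
            # Apply: vertex-local state update
            new[v] = (1 - d) / len(verts) + d * gathered
        rank = new   # Scatter: publish new states to neighbors
    return rank

adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = pagerank_gas(adj)
# ranks sum to 1; "c" (with two in-links) ranks highest
```

The data dependencies are visible here: each vertex reads all of its in-neighbors' states per step, which is exactly the access pattern that UDAs and MapReduce handle poorly.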
Distributed GraphLab Execution
http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
❖ Enabled asynchronous updates of vertex and edge states ❖ Consistency-parallelism tradeoff ❖ Some algorithms seem robust to such inconsistency ❖ Sophisticated distributed locking protocols
70
Outline
❖ Basics of Scaling ML Computations ❖ Scaling ML to On-Disk Files ❖ Layering ML on Scalable Data Systems ❖ Custom Scalable ML Systems
❖ Parameter Server ❖ GraphLab ❖ XGBoost
❖ Advanced Issues in ML Scalability
71
Decision Tree Data Access Pattern
❖ CART has complex non-sequential data access patterns ❖ Compare candidate splits on each feature at each node ❖ Class-conditional aggregates needed per candidate ❖ Repartition data for sub-tree growth
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html
❖ Does not scale well with RDBMS/UDA or MapReduce/Spark
72
Decision Tree Ensembles
❖ RandomForest is very popular ❖ Just a bunch of independent trees on column subsets
❖ Tree Boosting is a popular adaptive ensemble ❖ Construct trees sequentially and weigh them ❖ “Weak” learners aggregated to a strong learner
❖ Gradient-Boosted Decision Tree (GBDT) is very popular ❖ Also adds trees to ensemble sequentially ❖ Real-valued prediction; convex differentiable loss
73
GBDT Data Access Pattern
❖ More complex non-sequential access pattern than single tree! ❖ An “iteration” adds a tree by exploring candidate splits ❖ Still needs recursive data re-partitioning ❖ Key difference: scoring function has more statistics (1st
and 2nd deriv.); this needs read-write access per example
❖ Access pattern over per-example stats is random, depends on split location
❖ Ideal if whole stats and a whole column (at a time) can fit in RAM
Dataset:                      Stats:
Y   X1   X2   X3              Gi   Hi
0   1b   1c   1d              …    …
1   2b   2c   2d              …    …
1   3b   3c   3d              …    …
0   4b   4c   4d              …    …
…   …    …    …               …    …
74
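A sketch of how the per-example first- and second-order stats (g_i, h_i) feed split scoring, in the style of XGBoost's regularized gain. The `lam` regularizer, the squared-loss stats, and the tiny dataset are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def split_gain(g, h, mask, lam=1.0):
    # Second-order "structure score" gain of splitting a node into L/R
    # (lam is the L2 regularizer on leaf weights)
    GL, HL = g[mask].sum(), h[mask].sum()
    GR, HR = g[~mask].sum(), h[~mask].sum()
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR))

def best_split(x, g, h):
    # Enumerate candidate thresholds over one (pre-sorted) feature column
    best_gain, best_t = 0.0, None
    for t in np.unique(x)[1:]:
        gain = split_gain(g, h, x < t)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

# Per-example stats for squared loss: g_i = pred_i - y_i, h_i = 1
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
pred = np.zeros(6)                        # current ensemble predictions
g, h = pred - y, np.ones(6)
gain, thresh = best_split(x, g, h)        # picks the gap between 3 and 10
```

Note the access pattern the slide describes: every candidate threshold needs sums of g and h over the examples on each side, so the per-example stats are read over and over in split-dependent order.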
XGBoost
❖ Custom ML system to scale GBDT to larger-than-RAM data, both single-node disk and on a cluster
❖ Very popular on tabular data; won many Kaggle contests ❖ Key philosophy: Algorithm-system “co-design”:
❖ Make system implementation memory hierarchy-aware based on algorithm’s data access patterns
❖ Modify ML algorithmics to better suit system scale
75
XGBoost: Algorithm-level Ideas
❖ 2 key changes to make GBDT more scalable ❖ Bottleneck: Computing candidate split stats at scale
❖ Idea: Approximate stats with weighted quantile sketch ❖ Bottleneck: Sparse feature vectors and missing data
❖ Idea: Bake in “default direction” for child during learning
76
XGBoost: Systems-level Ideas
❖ 4 key choices to ensure efficiency at scale ❖ Goal: Reduce overhead of evaluating candidates
❖ Idea: Pre-sort all columns independently ❖ Goal: Exploit parallelism to raise throughput
❖ Idea: Shard data into a column “block” per worker; workers compute local columns’ stats independently
77
XGBoost: Systems-level Ideas
❖ 4 key choices to ensure efficiency at scale ❖ Goal: Mitigate read-write randomness to stats in RAM
❖ Idea: CPU cache-aware staging of stats to reduce stalls
❖ Goal: Scale to on-disk and cluster data ❖ Idea: Shard blocks further on disk; stage reads to RAM;
block-level compression to reduce I/O latency
78
XGBoost: Scalability Gains
❖ Gains from both algorithmic changes and systems ideas
79
Tradeoffs of Custom Distr. ML Sys.
❖ Usability. Pro: suitable for hyper-specialized ML use case. Con: need to (re)learn new system APIs again.
❖ Manageability. Pro: more feasible in cloud-native / managed env. Con: extra overhead to add and maintain in tools ecosystem.
❖ Efficiency. Pro: often much lower runtimes and costs. Con: debugging runtime issues needs specialized knowhow.
❖ Scalability. Pro: often more scalable to larger datasets. Con: arcane scalability issues may arise, e.g., inconsistency.
❖ Developability. Pro: more amenable if ML algorithm is familiar; new open source/startup communities. Con: need to (re)learn new implementation APIs and consistency tradeoffs; risk of lower technical support.
80
Your Strong Points on XGBoost
❖ “The scalability of XGBoost algorithm enables much faster learning speed in model exploration by using parallel and distributed computing”
❖ “Versatile: it could be applied to a wide range of problems, including classification, ranking, rate prediction, and categorization”
❖ “Flexibility. The system supports the exact greedy algorithm and the approximate algorithm”
❖ “Handles the issue of sparsity in data”
❖ “Theoretically justified weighted quantile sketch”
❖ “Language Support and portability”
❖ “By making XGBoost open-source, they put their promises into action”
❖ “Make a point of cleaving to real-world issues faced by industry users”
❖ “This end-to-end system is widely used by data scientists”
81
Your Weak Points on XGBoost
❖ “Specificity: The system only works for gradient boosted tree algorithms”
❖ “In recent years deep neural networks can beat boosting tree algorithms”
❖ “Accuracy of XGBoost algorithms is generally lower than LightGBM”
❖ “Kaggle competitions and industry ML roles have quite a gap”
❖ “System introduces a lot of hyperparameters”
❖ “XGBoost tuning is difficult”
❖ “Sparsity-aware approach to split-finding introduces another layer of learned uncertainty … no data exploring effect on accuracy”
❖ “No evaluation on their weighted quantile sketch algorithm”
❖ “Finding correct block size for approximate algorithm … need to tune”
❖ “Not much detail on how the system parallelizes its workload”
❖ “Assumes readers have deep understanding about Boosting algorithm”
82
Innovativeness and Depth Ratings
83
Outline
❖ Basics of Scaling ML Computations ❖ Scaling ML to On-Disk Files ❖ Layering ML on Scalable Data Systems ❖ Custom Scalable ML Systems ❖ Advanced Issues in ML Scalability
84
Advanced Issues in ML Scalability
❖ Streaming/Incremental ML at scale ❖ Scaling Massive Task Parallelism in ML ❖ Pushing ML Through Joins ❖ Larger-than-Memory Models ❖ Models with More Complex Data Access Patterns ❖ Delay-Tolerant / Geo-Distributed / Federated ML ❖ Scaling end-to-end ML Pipelines
85
Scalable Incremental ML
❖ Datasets keep growing in many real-world ML applications ❖ Incremental ML: update a learned model using only new data
❖ Streaming ML is one variant (near real-time) ❖ Non-trivial to make all ML algorithms incremental
❖ SGD-based procedures are “online” by default; just resume gradient descent on new data
❖ ML/data mining folks have studied how to make other ML algorithms incremental; accuracy-runtime tradeoffs
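The "online by default" point can be shown in a few lines: one SGD routine is simply called again on the new batch, resuming from the current weights. The data and learning rate are illustrative; real incremental ML must also worry about drift and learning-rate schedules.

```python
import numpy as np

def sgd_pass(w, X, y, lr=0.05):
    # One sequential SGD pass; callable again whenever new data arrives
    w = w.copy()
    for xi, yi in zip(X, y):
        w -= lr * (xi @ w - yi) * xi
    return w

rng = np.random.default_rng(3)
w_true = np.array([1.0, -1.0])
w = np.zeros(2)

X1 = rng.normal(size=(500, 2))
w = sgd_pass(w, X1, X1 @ w_true)   # train on the initial batch
X2 = rng.normal(size=(200, 2))
w = sgd_pass(w, X2, X2 @ w_true)   # new data arrives: just resume from w
# no retraining from scratch was needed
```

Contrast this with, say, an exact decision tree, where new examples can change early split decisions and there is no comparably cheap resume.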
86
Scalable Incremental ML: SageMaker
❖ Industrial cloud-native ML requirements: ❖ Incremental training and model freshness ❖ Predictable training runtime ❖ Elasticity and pause-resume ❖ Trainable on ephemeral (non-archived) data ❖ Automatic model/hyper-parameter tuning
❖ Design: Streaming ML algorithms that fit into a 3-function API of Initialize-Update-Finalize (akin to UDA) ❖ All SGD-based procedures (GLMs, factorization machines) ❖ Variants of K-Means, PCA, topic models, forecasting
87
Scalable Incremental ML: SageMaker
❖ Parameter Server-based architecture with streaming workers
88
Massive Task Parallelism: Ray
❖ Advanced ML applications that use reinforcement learning produce large numbers of short-running tasks ❖ Training robots, self-driving cars, etc. ❖ ML-based cluster resource management
https://www.usenix.org/system/files/osdi18-moritz.pdf
❖ Ray is an ML system that automatically scales such tasks from single-node to large clusters
❖ Tasks can have shared state for control
❖ Data is replicated/broadcast
89
Pushing ML Through Joins
❖ Most real-world tabular/relational datasets are multi-table ❖ A central fact table with target; many dimension tables
❖ Key-foreign key joins are ubiquitous to denormalize such data into single table before ML training
❖ Single table has a lot of data redundancy ❖ In turn, many ML algorithms end up
having a lot of computational redundancy
❖ Pushing ML computations through joins enables computations directly over the normalized database
90
Pushing ML Through Joins
❖ Orion first showed how to rewrite ML computations to operate on join input rather than output, aka “factorized ML” ❖ GLMs with BGD, L-BFGS, Newton methods ❖ Rewritten impl. fits within UDA / MapReduce abstractions
https://adalabucsd.github.io/papers/2015_Orion_SIGMOD.pdf https://adalabucsd.github.io/papers/2017_Morpheus_VLDB.pdf
❖ Morpheus offered a unified abstraction based on linear algebra to factorize many ML algorithms in one formalism ❖ K-Means clustering, matrix
factorization, etc.
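A toy illustration of the redundancy argument: scoring a linear model over the denormalized join versus "factorized" directly over the normalized tables. The tables, foreign key, and weights below are made up; the point is that each dimension-table row's partial result is computed once and reused via the FK, instead of once per matching fact row.

```python
import numpy as np

# Fact table S (5 rows, 2 features) with a foreign key into dim table R
S = np.array([[1., 0.], [0., 1.], [1., 1.], [2., 0.], [0., 2.]])
fk = np.array([0, 1, 0, 1, 1])            # which R row each S row joins to
R = np.array([[1., 2., 3.], [4., 5., 6.]])
w_S = np.array([0.5, -0.5])               # weights over S's features
w_R = np.array([1., 0., -1.])             # weights over R's features

# Materialized: denormalize first, then multiply (redundant copies of R rows)
T = np.hstack([S, R[fk]])
scores_full = T @ np.concatenate([w_S, w_R])

# Factorized: push the multiply through the join (each R row computed once)
partial_R = R @ w_R
scores_fact = S @ w_S + partial_R[fk]
# the two score vectors are identical
```

With many fact rows per dimension row, the factorized form does proportionally less arithmetic, which is the computational redundancy the previous slide identifies.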
91
Larger-than-Memory Models
❖ Some ML algorithms may have state that does not fit in RAM ❖ Need to shard model too (not just data) to stage updates ❖ Specific to the ML algorithm’s data access patterns
http://www.cs.cmu.edu/~kijungs/etc/10-405.pdf
Example: Matrix factorization trained with SGD
92
Outline
❖ Basics of Scaling ML Computations ❖ Scaling ML to On-Disk Files ❖ Layering ML on Scalable Data Systems ❖ Custom Scalable ML Systems ❖ Advanced Issues in ML Scalability
93
Evaluating “Scalability”
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf
❖ Message: Not just scalability but efficiency matters too; report not just the speedup curve but time vs strong (single-node) baselines