CSE 291D/234 Data Systems for Machine Learning
Topic 1: Classical ML Training at Scale
Chapters 2, 5, and 6 of MLSys book
Arun Kumar
Academic ML 101
Generalized Linear Models (GLMs); from statistics
Bayesian Networks; inspired by causal reasoning
Decision Tree-based: CART, Random Forest, Gradient-Boosted Trees (GBT), etc.; inspired by symbolic logic
Support Vector Machines (SVMs); inspired by psychology
Artificial Neural Networks (ANNs): Multi-Layer Perceptrons (MLPs), Convolutional NNs (CNNs), Recurrent NNs (RNNs), Transformers, etc.; inspired by brain neuroscience
“Classical” ML
Real-World ML 101
https://www.kaggle.com/c/kaggle-survey-2019
(Survey chart of method usage among practitioners: tree learners, GLMs, and deep learning are among the categories shown.)
Scalable ML Training in the Lifecycle
(Lifecycle diagram: Data acquisition; Data preparation; Feature Engineering; Model Selection; Training & Inference; Serving; Monitoring.)
Scalable ML Training in the Big Picture
ML Systems
Q: What is a Machine Learning (ML) System?
❖ A data processing system (aka data system) for mathematically advanced data analysis operations (inferential or predictive): statistical analysis; ML, deep learning (DL); data mining (domain-specific applied ML + feature eng.)
❖ High-level APIs to express ML computations over (large) datasets
❖ Execution engine to run ML computations efficiently and in a scalable manner
But what exactly does it mean for an ML system to be “scalable”?
Outline
❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability
Background: Memory Hierarchy
(Figure: the memory hierarchy below the CPU, with approximate capacity, price, access speed, and access latency per level.)

Level                       Capacity   Price      Access Speed   Access Cycles
CPU Caches                  ~MBs       ~$2/MB     ~100GB/s       -
Main Memory (DRAM)          ~10GBs     ~$5/GB     ~10GB/s        100s
Flash Storage               ~TBs       ~$200/TB   ~GB/s          10^5 to 10^6
Magnetic Hard Disk (HDD)    ~10TBs     ~$40/TB    ~200MB/s       10^7 to 10^8
Memory Hierarchy in Action
Q: What does this program do when run with 'python'? (Assume tmp.csv is in the current working directory.)

tmp.py:

import pandas as pd
m = pd.read_csv('tmp.csv', header=None)
s = m.sum().sum()
print(s)

tmp.csv:

1,2,3
4,5,6
Memory Hierarchy in Action

Rough sequence of events when the program is executed:
❖ I/O for code: tmp.py is read from disk and its commands are interpreted
❖ I/O for data: tmp.csv is retrieved from disk and staged up the hierarchy (Disk, DRAM, Caches, Registers); each level stores and retrieves data for the level above
❖ Computations are done by the Processor (CU and ALU) on the registers
❖ I/O for display: the result '21' is shown on the monitor

Q: What if this does not fit in DRAM?
Scalable ML Systems
❖ ML systems that do not require the (training) dataset to fit entirely in main memory (DRAM) of one node
❖ Conversely, if the system thrashes when the data file does not fit in RAM, it is not scalable
Basic Idea: Split data file (virtually or physically) and stage reads (and writes) of pages to DRAM and processor
Scalable ML Systems
4 main approaches to scale ML to large data:
❖ Single-node disk: Paged access from file on local disk
❖ Remote read: Paged access from disk(s) over network
❖ Distributed memory: Fits on a cluster's total DRAM
❖ Distributed disk: Fits on a cluster's full set of disks
Evolution of Scalable ML Systems
(Timeline figure, from the 1980s to the mid 2010s onward, charting the evolution of scalable ML systems along three axes: Scalability, Manageability, and Developability. Eras shown: 1980s; mid 1990s; late 1990s to mid 2000s; late 2000s to early 2010s; mid 2010s onward. Systems shown: In-RDBMS ML Systems, ML System Abstractions, ML on Dataflow Systems, Parameter Server, Cloud ML, and Deep Learning Systems.)
Major Existing ML Systems
❖ General ML libraries: disk-based files; in-memory; layered on RDBMS/Spark; cloud-native
❖ "AutoML" platforms
❖ Decision tree-oriented
❖ Deep learning-oriented
(The slide shows logos of example systems in each category.)
Outline
❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability
ML Algorithm = Program Over Data
Basic Idea: Split data file (virtually or physically) and stage reads (and writes) of pages to DRAM and processor
❖ To scale an ML program's computations, split them up to operate over "chunks" of data at a time
❖ How to split up an ML program this way can be non-trivial!
  ❖ Depends on the data access pattern of the algorithm
  ❖ A large class of ML algorithms do just sequential scans for iterative numerical optimization
Numerical Optimization in ML
❖ Many regression and classification models in ML are formulated as a (constrained) minimization problem
  ❖ E.g., logistic and linear regression, linear SVM, etc.
  ❖ Aka "Empirical Risk Minimization" (ERM):

w^* = \arg\min_w \sum_{i=1}^{n} l(y_i, f(w, x_i))

❖ GLMs define hyperplanes and use an f() that is a scalar function of the distance w^T x_i

Example dataset D:
Y  X1  X2  X3
0  1b  1c  1d
1  2b  2c  2d
1  3b  3c  3d
0  4b  4c  4d
…  …   …   …
Batch Gradient Descent for ML

❖ For many ML models, the loss function l() is convex; so is L():

L(w) = \sum_{i=1}^{n} l(y_i, f(w, x_i))

❖ But closed-form minimization is typically infeasible
❖ Batch Gradient Descent (BGD): an iterative numerical procedure to find an optimal w
  ❖ Initialize w to some value w^{(0)}
  ❖ Compute the gradient: \nabla L(w^{(k)}) = \sum_{i=1}^{n} \nabla l(y_i, f(w^{(k)}, x_i))
  ❖ Descend along the gradient (aka Update Rule): w^{(k+1)} \leftarrow w^{(k)} - \eta \nabla L(w^{(k)})
  ❖ Repeat until we get close to w^*, aka convergence
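The procedure above can be sketched in a few lines of plain Python for logistic regression (a toy illustration with made-up data and hyper-parameters, not a scalable implementation):

```python
import math

def bgd(X, y, lr=0.1, epochs=100):
    """Batch gradient descent for logistic regression (bias term omitted)."""
    w = [0.0] * len(X[0])                              # initialize w(0)
    for _ in range(epochs):
        grad = [0.0] * len(w)                          # gradient = SUM over all of D
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj               # per-example gradient of log loss
        w = [wj - lr * gj for wj, gj in zip(w, grad)]  # descend along the gradient
    return w

# Toy 1-D data: the label is 1 exactly when the feature is positive,
# so the learned weight should come out positive
w = bgd([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])
```

Note that the whole dataset is touched once per update, which is exactly why the data access pattern matters at scale.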
Batch Gradient Descent for ML

(Figure: a convex loss curve L(w) with iterates w^{(0)}, w^{(1)}, w^{(2)}, … descending along the gradient toward the optimum w^*:
w^{(1)} \leftarrow w^{(0)} - \eta \nabla L(w^{(0)})
w^{(2)} \leftarrow w^{(1)} - \eta \nabla L(w^{(1)})
…)

❖ The learning rate \eta is a hyper-parameter selected by the user or by "AutoML" tuning procedures
❖ The number of iterations/epochs of BGD is also a hyper-parameter
Data Access Pattern of BGD at Scale

❖ The data-intensive computation in BGD is the gradient:

\nabla L(w^{(k)}) = \sum_{i=1}^{n} \nabla l(y_i, f(w^{(k)}, x_i))

❖ In scalable ML, dataset D may not fit in DRAM
❖ Model w is typically small and DRAM-resident
❖ The gradient is like a SQL SUM over vectors (one per example)
  ❖ At each epoch, 1 filescan over D suffices to get the gradient
  ❖ The update of w happens normally in DRAM
  ❖ Monitoring across epochs for convergence is needed
❖ The loss function L() is also just a SUM in a similar manner

Q: What SQL op is this reminiscent of?
Scaling BGD to Disk

Basic Idea: Split data file (virtually or physically) and stage reads (and writes) of pages to DRAM and processor

(Figure: the process reads the file's pages P1, P2, P3, … from disk into the OS cache in DRAM one by one and then discards them: the "filescan" access pattern. Suppose the OS cache can hold only 4 pages of the file; once the cache is full, old pages are replaced, so reading P5 evicts P1, reading P6 evicts P2, and so on.)
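The eviction behavior in that figure can be mimicked with a toy FIFO page cache (an illustration only; a real OS page cache is far more sophisticated):

```python
from collections import deque

def filescan(pages, cache_size=4):
    """Simulate a filescan through a tiny FIFO page cache."""
    cache, evictions = deque(), []
    for p in pages:
        if len(cache) == cache_size:      # cache full: replace the oldest page
            evictions.append(cache.popleft())
        cache.append(p)                   # stage the next page into the cache
    return list(cache), evictions

cache, evicted = filescan(["P1", "P2", "P3", "P4", "P5", "P6"])
```

With a 4-page cache, reading P5 and P6 evicts P1 and P2, matching the slide.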
Scaling BGD to Disk

❖ Sequential scan to read pages from disk to DRAM
❖ Modern DRAM sizes can be 10s of GBs; so we read a "chunk" of the file at a time (say, 1000s of pages)
❖ Compute a partial gradient on each chunk and add them all up:

\nabla L(w^{(k)}) = \nabla L_1(w^{(k)}) + \nabla L_2(w^{(k)}) + \dots

where \nabla L_j(w^{(k)}) sums \nabla l(y_i, f(w^{(k)}, x_i)) over the examples of D laid out on data chunk j (e.g., the first data page holds rows (0,1b,1c,1d) and (1,2b,2c,2d), the next holds (1,3b,3c,3d), and so on).
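The chunk-at-a-time summation can be sketched as follows (a toy illustration using a least-squares loss and an in-memory stand-in for the on-disk file; the helper names are invented here):

```python
import csv
import io

# A toy "on-disk" file: label followed by one feature per row
data = "0,-2.0\n0,-1.0\n1,1.0\n1,2.0\n"

def partial_gradient(rows, w):
    """Partial gradient of 0.5*(w*x - y)^2 over one chunk of examples."""
    g = 0.0
    for y, x in rows:
        g += (w * x - y) * x
    return g

def full_gradient(f, w, chunk_size=2):
    """Stream the file chunk by chunk and add up the partial gradients."""
    grad, chunk = 0.0, []
    for row in csv.reader(f):
        chunk.append((float(row[0]), float(row[1])))
        if len(chunk) == chunk_size:
            grad += partial_gradient(chunk, w)   # grad_L1, grad_L2, ...
            chunk = []
    if chunk:                                    # last, possibly partial, chunk
        grad += partial_gradient(chunk, w)
    return grad

g = full_gradient(io.StringIO(data), w=0.0)
```

The result is identical to a single pass over the whole dataset, since the gradient is just a SUM.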
DaskML’s Scalable DataFrame
❖ A Dask DF scales to on-disk files by splitting the data into a bunch of pandas DFs under the covers
❖ The Dask API is a "wrapper" around the pandas API that scales ops to the splits and puts all the results together
https://docs.dask.org/en/latest/dataframe.html
Basic Idea: Split data file (virtually or physically) and stage reads (and writes) of pages to DRAM and processor
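A toy stand-in for this wrapper idea in plain Python (the helper names here are invented for illustration and are not Dask's API):

```python
def split(rows, n_chunks):
    """Virtually split a 'file' (here, a list of values) into chunks."""
    k = max(1, len(rows) // n_chunks)
    return [rows[i:i + k] for i in range(0, len(rows), k)]

def chunked_sum(rows, n_chunks=3):
    """What a Dask-style wrapper does for .sum(): per-chunk partials, then combine."""
    partials = [sum(chunk) for chunk in split(rows, n_chunks)]  # one per pandas DF
    return sum(partials)                                        # combine results

total = chunked_sum([1, 2, 3, 4, 5, 6])   # same values as tmp.csv earlier
```

As with the earlier tmp.csv example, the total is 21, but no single chunk ever needs to hold the whole dataset.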
Scaling with Remote Reads
❖ Similar to scaling to disk, but instead read pages/chunks over the network from remote disk(s) (e.g., from S3)
  ❖ Good in practice for a one-shot filescan access pattern
  ❖ For iterative ML, repeated reads over the network; can combine with caching on local disk / DRAM
❖ Increasingly popular for cloud-native ML workloads
Basic Idea: Split data file (virtually or physically) and stage reads (and writes) of pages to DRAM and processor
Stochastic Gradient Descent for ML
❖ Two key cons of BGD:
  ❖ Slow to converge to the optimal (too many epochs)
  ❖ Costly full scan of D for each update of w
❖ Stochastic GD (SGD) mitigates both issues
  ❖ Basic Idea: Use a sample (called a mini-batch) of D to approximate the gradient instead of the "full batch" gradient
  ❖ Without-replacement sampling
  ❖ Randomly shuffle D before each epoch
  ❖ One pass = a sequence of mini-batches
❖ SGD works well for non-convex loss functions too, unlike BGD; the "workhorse" of scalable ML
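A minimal sketch of mini-batch SGD with per-epoch shuffling (toy logistic-regression setup; the data, hyper-parameters, and seed are made up):

```python
import math
import random

def sgd(X, y, lr=0.1, epochs=20, batch_size=2, seed=0):
    """Mini-batch SGD for logistic regression (toy sketch, bias omitted)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    idx = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(idx)                          # random shuffle before each epoch
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]   # one mini-batch = one update
            grad = [0.0] * len(w)
            for i in batch:
                p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, X[i]))))
                for j, xj in enumerate(X[i]):
                    grad[j] += (p - y[i]) * xj
            w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

w = sgd([[-2.0], [-1.0], [1.0], [2.0]], [0, 0, 1, 1])
```

Compared to the earlier BGD sketch, each epoch here performs several updates of w rather than one.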
Data Access Pattern of Scalable SGD

Update rule, using a mini-batch B sampled from the dataset without replacement:

W^{(t+1)} \leftarrow W^{(t)} - \eta \nabla \tilde{L}(W^{(t)}), where \nabla \tilde{L}(W^{(t)}) = \sum_{(y_i, x_i) \in B \subset D} \nabla l(y_i, f(W^{(t)}, x_i))

(Figure: the original dataset is first randomly "shuffled", e.g., via ORDER BY RAND(). Epoch 1 then does a sequential scan over the randomized dataset, applying one update per mini-batch: W^{(0)} → W^{(1)} → W^{(2)} → W^{(3)}. Epoch 2 optionally does a new random shuffle and scans again: W^{(3)} → W^{(4)} → ….)
Data Access Pattern of Scalable SGD

❖ Mini-batch gradient computations: 1 filescan per epoch
  ❖ Update of w happens in DRAM
  ❖ During the filescan, count the number of examples seen and update per mini-batch
  ❖ Typical mini-batch sizes: 10s to 1000s
  ❖ Orders of magnitude more updates than BGD!
❖ A random shuffle is not trivial to scale; it requires an "external merge sort" (roughly 2 scans of the file)
  ❖ ML practitioners often shuffle the dataset only once up front; good enough in most cases in practice
Handling pages directly is so low-level! Is there a higher-level way to scale ML?
Outline
❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability
Outline
❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
  ❖ In-RDBMS ML
  ❖ ML on Dataflow Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability
RDBMS Scales Queries over Data
❖ Mature software systems to scale to larger-than-RAM data
RDBMS User-Defined Aggregates
❖ Most industrial and open source RDBMSs allow extensibility to their SQL dialect with User-Defined Functions (UDFs)
❖ User-Defined Aggregate (UDA) is a UDF API to specify custom aggregates over the whole dataset
Example with SQL AVG; the agg. state is (S, C) = (partial sum, partial count):

Initialize: Start by setting up the agg. state in DRAM
Transition: RDBMS gives a tuple from the table; update the agg. state: (S, C) ← (S, C) + (v_i, 1)
Merge: (Optional: in a parallel RDBMS, combine the agg. states of the workers) (S', C') ← Σ_{worker j} (S_j, C_j)
Finalize: Post-process the agg. state and return the result: return S'/C'
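The four-part API can be sketched in Python for AVG (a toy emulation of the UDA contract, not any RDBMS's actual UDF interface):

```python
class AvgUDA:
    """SQL AVG written against the four-part UDA template (sketch)."""
    def initialize(self):
        self.s, self.c = 0.0, 0          # agg. state: (partial sum, partial count)
    def transition(self, v):
        self.s += v                      # RDBMS feeds one tuple at a time
        self.c += 1
    def merge(self, other):
        self.s += other.s                # combine agg. states of parallel workers
        self.c += other.c
    def finalize(self):
        return self.s / self.c           # post-process and return the result

# Two "workers" each scan their shard; the master then merges and finalizes
w1, w2 = AvgUDA(), AvgUDA()
w1.initialize()
w2.initialize()
for v in [1, 2, 3]:
    w1.transition(v)
for v in [4, 5]:
    w2.transition(v)
w1.merge(w2)
avg = w1.finalize()
```

AVG works here because sum and count merge exactly across workers, which is precisely what an SGD epoch lacks (see below on model averaging).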
Bismarck: SGD as RDBMS UDA

❖ SGD epoch implemented as a UDA for in-RDBMS execution
❖ Data-intensive computation scaled automatically by the RDBMS
❖ Commands for shuffling, running multiple epochs, checking convergence, and validation/test error measurements are issued from an external controller written in Python

Initialize: Allocate memory for the model W^{(t)} and the mini-batch gradient stats
Transition: Given a tuple with (y, x), compute the gradient and update the stats; if the mini-batch limit is hit, update the model and reset the stats
Merge: (Optional: applies only to parallel RDBMSs) "Combine" model parameters from independent workers
Finalize: Return the model parameters
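This template can be sketched in Python (a toy emulation with a 1-D least-squares model; this is illustrative only, not Bismarck's actual code):

```python
class SGDUDA:
    """An SGD epoch cast into the UDA template (toy 1-D linear regression)."""
    def initialize(self, lr=0.05, batch_size=2):
        self.w, self.lr, self.batch = 0.0, lr, batch_size
        self.g, self.seen = 0.0, 0        # mini-batch gradient stats
    def transition(self, y, x):
        self.g += (self.w * x - y) * x    # the bulk of the ML logic lives here
        self.seen += 1
        if self.seen == self.batch:       # mini-batch limit hit: update, reset
            self.w -= self.lr * self.g
            self.g, self.seen = 0.0, 0
    def finalize(self):
        return self.w

uda = SGDUDA()
uda.initialize()
for _ in range(50):                       # epochs driven by an external controller
    for y, x in [(2.0, 1.0), (4.0, 2.0), (6.0, 3.0), (8.0, 4.0)]:
        uda.transition(y, x)              # RDBMS feeds one tuple at a time
w = uda.finalize()
```

On this toy data (y = 2x), the learned weight converges to 2.0.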
Programming ML as RDBMS UDAs

❖ Many ML algorithms fit within Bismarck's template
❖ The Transition function is where the bulk of the ML logic lives; just a few lines of code for SGD updates
Distributed SGD via RDBMS UDA

❖ Data is pre-sharded across workers in parallel RDBMSs
❖ Initialize and Transition run independently on each shard
❖ Merge "combines" model params from the workers; tricky, since an SGD epoch is not an algebraic agg. like SUM/AVG
❖ Common heuristic: "Model Averaging". Workers 1, 2, …, n produce W_1^{(t)}, W_2^{(t)}, …, W_n^{(t)}; the Master computes:

W^{(t)} = \frac{1}{n} \sum_{i=1}^{n} W_i^{(t)}

  ❖ Affects convergence
  ❖ Works OK for GLMs

Q: How does the RDBMS parallelize the (SGD) UDA?
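Model averaging itself is a one-liner (a sketch; the per-worker models below are made-up stand-ins for parameters trained independently on each shard):

```python
def model_average(worker_models):
    """Heuristic Merge for distributed SGD: average the per-worker parameters."""
    n = len(worker_models)
    return [sum(ws) / n for ws in zip(*worker_models)]

# Each of 3 workers ran SGD on its own shard and produced its own parameter vector
w_avg = model_average([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Unlike the SUM/COUNT merge of AVG, this average is not mathematically equivalent to running SGD on the full data, which is why it only approximates the sequential result.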
Bottlenecks of RDBMS SGD UDA

❖ Model Averaging for distributed SGD has poor convergence for non-convex/ANN models
  ❖ Too many epochs, typically poor ML accuracy
❖ Model sizes can be too large (even 10s of GBs) for the UDA's aggregation state
❖ The UDA's Merge step is a choke point at scale (100s of workers) under the Bulk Synchronous Parallelism (BSP) of parallel RDBMSs
The MADlib Library
❖ A decade-old library of scalable statistical and ML procedures on PostgreSQL and Greenplum (a parallel RDBMS)
❖ Many procedures are UDAs; some are written in pure SQL
❖ All can be invoked from the SQL console
❖ The RDBMS can be used for ETL (extract, transform, load) of features
https://madlib.apache.org/docs/latest/index.html
Tradeoffs of In-RDBMS ML

❖ Usability. Pro: Some data analysts like ML from the SQL console. Con: Most ML users want the full flexibility of the Python console.
❖ Manageability. Pro: In-situ data governance and security/auth. of RDBMSs. Con: Many ML users want more hands-on data access in Jupyter notebooks.
❖ Efficiency. Pro: Faster in some cases. Con: Typically slower due to API restrictions; custom ML systems are typically faster; interference with OLTP workloads.
❖ Scalability. Pro: Massively parallel processing of RDBMSs like Greenplum. Con: BSP is a bottleneck for 100+ nodes (asynchrony needed).
❖ Developability. Pro: SQL-based ETL. Con: Unnatural APIs to write ML.
Outline
❖ Basics of Scaling ML Computations
❖ Scaling ML to On-Disk Files
❖ Layering ML on Scalable Data Systems
  ❖ In-RDBMS ML
  ❖ ML on Dataflow Systems
❖ Custom Scalable ML Systems
❖ Advanced Issues in ML Scalability
The MapReduce Abstraction

❖ Programming model for writing programs on sharded data + distributed system architecture for processing large data
❖ Map and Reduce are terms/ideas from functional PL
❖ The developer only implements the logic of Map and Reduce
❖ The system implementation handles orchestration of data distribution, parallelization, etc. under the covers
❖ Standard example: count word occurrences in a doc corpus ❖ Input: A set of text documents (say, webpages) ❖ Output: A dictionary of unique words and their counts
Hmmm, sounds suspiciously familiar … :)function map (String docname, String doctext) : for each word w in doctext : emit (w, 1)function reduce (String word, Iterator partialCounts) : sum = 0 for each pc in partialCounts : sum += pc emit (word, sum)
Part of MapReduce API
The MapReduce Abstraction
43
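To make the contract above concrete, here is a minimal pure-Python emulation of MapReduce word counting. The `run_mapreduce` driver is hypothetical: in a real framework, the grouping ("shuffle") and the parallel invocation of mappers and reducers happen across machines, under the covers.

```python
from collections import defaultdict

def map_fn(docname, doctext):
    # Map: emit (word, 1) for every word; operates on one record independently
    for word in doctext.split():
        yield (word, 1)

def reduce_fn(word, partial_counts):
    # Reduce: aggregate all partial counts gathered for one key
    return (word, sum(partial_counts))

def run_mapreduce(docs, map_fn, reduce_fn):
    # Shuffle phase: group Map outputs by key (the framework does this)
    groups = defaultdict(list)
    for name, text in docs.items():
        for k, v in map_fn(name, text):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = {"d1": "the cat sat", "d2": "the cat ran"}
counts = run_mapreduce(docs, map_fn, reduce_fn)
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```

Note that `map_fn` sees one document at a time and `reduce_fn` sees all values for one key, exactly mirroring the dependency restrictions discussed below.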
How MapReduce Works
Parallel flow of control and data during MapReduce execution:
Under the covers, each Mapper and Reducer is a separate process; Reducers face barrier synchronization (BSP)
Fault tolerance achieved using data replication
44
What is Hadoop then?
❖ Open-source system implementation with MapReduce as prog. model and HDFS as distr. filesystem
❖ Map() and Reduce() functions in API; input splitting, data distribution, shuffling, fault tolerance, etc. all handled by the Hadoop library under the covers
❖ Mostly superseded by the Spark ecosystem these days although HDFS is still the base
45
Abstract Semantics of MapReduce
❖ Map(): Operates independently on one “record” at a time ❖ Can batch multiple data examples on to one record ❖ Dependencies across Mappers not allowed ❖ Can emit 1 or more key-value pairs as output ❖ Data types of inputs and outputs can be different!
❖ Reduce(): Gathers all Map output pairs across machines with the same key into an Iterator (list) ❖ Aggregation function applied on the Iterator to produce the final output
❖ Input Split: ❖ Physical-level split/shard of dataset that batches multiple
examples into one file “block” (~128MB default on HDFS) ❖ Custom Input Splits can be written by the application user
46
Benefits of MapReduce
❖ Goal: Higher level abstraction for parallel data processing ❖ Key Benefits:
❖ Out-of-the-box scalability and fault tolerance ❖ Map() and Reduce() can be highly general; no restrictions
on data types; easier ETL ❖ Free and OSS (Hadoop)
❖ New burden on users: Cast data-intensive computation to the Map() + Reduce() API ❖ But libraries exist in many PLs to mitigate coding pains:
Java, C++, Python, R, Scala, etc.
47
Emulate MapReduce in SQL?
Q: How would you do the word counting in an RDBMS / in SQL?
❖ First step: transform the text docs into relations and load them; part of the Extract-Transform-Load (ETL) stage. Suppose we pre-divide each document into words and use the schema: DocWords (DocName, Word)
❖ Second step: a single, simple SQL query!
SELECT Word, COUNT(*)
FROM DocWords
GROUP BY Word
[ORDER BY Word]
Parallelism, scaling, etc. done by RDBMS under the covers
48
RDBMS UDA vs MapReduce
❖ Aggregation state: data structure computed (independently) by workers and unified by master
❖ Initialize: Set up info./initialize RAM for agg. state; runs independently on each worker
❖ Transition: Per-tuple function run by worker to update its agg. state; analogous to Map() in MapReduce
❖ Merge: Function that combines agg. states from workers; run by master after workers done; analogous to Reduce()
❖ Finalize: Run once at end by master to return final result
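A minimal Python sketch of the four UDA callbacks, using AVG as the aggregate and a (sum, count) pair as the aggregation state. The class and method names are illustrative, not any RDBMS's actual UDA registration API.

```python
class AvgUDA:
    """Toy UDA computing AVG(x); aggregation state = (running sum, count)."""
    def initialize(self):                # per-worker setup of agg. state
        return (0.0, 0)
    def transition(self, state, x):      # per-tuple update, like Map()
        s, n = state
        return (s + x, n + 1)
    def merge(self, a, b):               # combine worker states, like Reduce()
        return (a[0] + b[0], a[1] + b[1])
    def finalize(self, state):           # run once by master at the end
        s, n = state
        return s / n if n else None

uda = AvgUDA()
shards = [[1.0, 2.0], [3.0, 4.0, 5.0]]   # data sharded across 2 workers
states = []
for shard in shards:                      # each worker scans its shard
    st = uda.initialize()
    for x in shard:
        st = uda.transition(st, x)
    states.append(st)
merged = uda.merge(states[0], states[1])  # master merges worker states
result = uda.finalize(merged)             # 3.0
```

The key property: `merge` only touches the small aggregation states, never the raw tuples, which is what makes the pattern shard-parallel.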
49
BGD via MapReduce
function map(String datafile):
    Read W(t) from known file on HDFS
    Initialize G = 0
    for each tuple in datafile:
        G += per-example gradient on (tuple, W(t))
    emit(G)

function reduce(Iterator partialGradients):
    FullG = 0
    for each G in partialGradients:
        FullG += G
    emit(FullG)
❖ Assume data is sharded; a map partition has many examples ❖ Initial model W(t) read from file on HDFS by each mapper
50
SGD (Averaging) via MapReduce
function map(String datafile):
    Read W(t) from known file on HDFS
    Initialize G = 0
    for each tuple in datafile:
        G += per-example gradient on (tuple, W(t))
        Wj = Update W(t) with G
    emit(Wj)

function reduce(Iterator workerWeights):
    AvgW = 0
    for each Wj in workerWeights:
        AvgW += Wj
    AvgW /= workerWeights.length()
    emit(AvgW)
❖ Similar to BGD but Model Averaging across map partitions
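Model averaging can likewise be sketched in plain Python: each "mapper" runs sequential SGD over only its own shard, and the "reducer" averages the resulting local models. The data, learning rate, and number of rounds below are illustrative assumptions.

```python
import numpy as np

def worker_sgd(X, y, w0, lr=0.05):
    # Mapper: sequential SGD over this shard only, starting from w0
    w = w0.copy()
    for xi, yi in zip(X, y):
        w -= lr * (xi @ w - yi) * xi      # per-example gradient step
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true                             # noiseless synthetic data
w = np.zeros(3)
for _ in range(3):                         # each round = one MapReduce job
    shards = [(X[:1000], y[:1000]), (X[1000:], y[1000:])]
    local_models = [worker_sgd(Xs, ys, w) for Xs, ys in shards]
    w = np.mean(local_models, axis=0)      # reducer: model averaging
# w converges to w_true on this easy convex problem
```

On harder (especially non-convex) problems, averaging independently drifted local models converges poorly, which motivates the Parameter Server approach later in this deck.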
51
Apache Spark
❖ Extended dataflow programming model to subsume MapReduce and most relational operators ❖ Inspired by Python Pandas style of function calls for ops
❖ Unified system to handle relations, text, etc.; support more general distributed data processing
❖ Uses distributed memory for caching; faster than Hadoop ❖ New fault tolerance mechanism using lineage, not replication ❖ From UC Berkeley AMPLab; commercialized as Databricks
52
Spark’s Dataflow Programming Model
Transformations are relational ops, MR, etc., expressed as functions; they are lazily evaluated. Actions are what force computation (aka lazy evaluation)
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
53
Word Count Example in Spark
Spark RDD API available in Python, Scala, Java, and R
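The slide's code screenshot is not reproduced here. Below is a hedged sketch: the PySpark RDD chain is shown in comments (it assumes a `SparkContext` named `sc` and an input file path), followed by a pure-Python emulation of the same dataflow that runs without Spark.

```python
from itertools import chain, groupby

lines = ["hello spark", "hello world"]

# PySpark RDD chain (for reference; assumes a SparkContext `sc`):
#   from operator import add
#   counts = (sc.textFile("docs.txt")
#               .flatMap(lambda line: line.split())
#               .map(lambda w: (w, 1))
#               .reduceByKey(add)
#               .collect())                     # action forces computation
#
# Pure-Python emulation of the same transformations:
pairs = ((w, 1) for w in chain.from_iterable(l.split() for l in lines))
by_key = groupby(sorted(pairs), key=lambda kv: kv[0])   # the "shuffle"
counts = {k: sum(v for _, v in grp) for k, grp in by_key}
# counts == {"hello": 2, "spark": 1, "world": 1}
```

Until `collect()` (an action) is called, the Spark chain only builds a lineage graph; nothing is computed, which is the lazy evaluation mentioned above.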
54
Programming ML in MapReduce/Spark
❖ All ML procedures that can be cast as RDBMS UDAs can be cast to MapReduce API of Hadoop or Spark RDD APIs ❖ MapReduce is just easier to use for most developers ❖ Spark integrates better with PyData ecosystem
https://papers.nips.cc/paper/3150-map-reduce-for-machine-learning-on-multicore.pdf
❖ Apache Mahout is a library of classical ML algorithms written as MapReduce programs for Hadoop; expanded later to DSL
❖ SparkML is a library of classical ML algorithms written using Spark RDD API; common in enterprises for scalable ML
55
Tradeoffs of ML on Dataflow Systems
❖ Usability. Pro: SparkML integrates well with Python stacks. Con: not all ML algorithms are scaled.
❖ Manageability. Pro: in-situ data governance and security/auth. of “data lakes”. Con: many ML users may not be familiar with Spark data ETL.
❖ Efficiency. Pro: comparable to in-RDBMS ML; no OLTP interference. Con: custom ML systems typically faster still.
❖ Scalability. Pro: massively parallel processing. Con: BSP is a bottleneck for 100+ nodes (asynchrony needed).
❖ Developability. Pro: SQL- & MR-based ETL; less code to write ML. Con: still somewhat unnatural APIs to write ML.
56
Outline
❖ Basics of Scaling ML Computations ❖ Scaling ML to On-Disk Files ❖ Layering ML on Scalable Data Systems ❖ Custom Scalable ML Systems ❖ Advanced Issues in ML Scalability
57
Outline
❖ Basics of Scaling ML Computations ❖ Scaling ML to On-Disk Files ❖ Layering ML on Scalable Data Systems ❖ Custom Scalable ML Systems
❖ Parameter Server ❖ GraphLab ❖ XGBoost
❖ Advanced Issues in ML Scalability
58
Parameter Server for Distributed SGD
❖ Recall bottlenecks of Model Averaging-based SGD in RDBMS UDA or with MapReduce: ❖ BSP becomes a choke point (Merge / Reduce stage) ❖ Often poor convergence, especially for non-convex ❖ Hard to handle large models
❖ Parameter Server (PS) mitigates all these issues: ❖ Breaks the synchronization barrier for merging: allows
asynchronous updates from workers to master ❖ Flexible communication frequency: can send updates at
every mini-batch or a set of few mini-batches
59
Parameter Server for Distributed SGD
[Figure: Worker 1, Worker 2, …, Worker n exchange updates with parameter-server shards PS 1, PS 2, …, PS k]
❖ Multi-server “master”; each server manages a part of W(t)
❖ Push / Pull when ready/needed
❖ No sync. for workers or servers
❖ Workers send gradients, e.g., ∇L̃(W1(t)), ∇L̃(W2(t-1)), ∇L̃(Wn(t+1)) (note the differing staleness), to the master for updates at each mini-batch (or lower frequency)
❖ Model params may get out-of-sync or stale, but SGD turns out to be remarkably robust; multiple updates per epoch really help
❖ Communication cost per epoch is higher (once per mini-batch)
60
Programming ML using PS
❖ Designed mainly for sparse feature vectors/updates ❖ Easy to parallelize updates to model params
❖ ML developer recasts the ML procedure into two parts: worker-side updates and server-side aggregation ❖ Loosely analogous to Map and Reduce, respectively ❖ But more complex due to flexible update schedules
❖ Supports 3 consistency models for staleness of updates, with different tradeoffs on efficiency vs accuracy
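A toy sketch of the asynchronous push/pull pattern: threads stand in for worker nodes, and a lock-protected array stands in for one server shard. The `ParamServer` class and its `pull`/`push` methods are illustrative names, not any real PS system's API, and the squared-loss task is a made-up example.

```python
import threading
import numpy as np

class ParamServer:
    """Toy single-shard parameter server: workers pull the latest
    weights and push gradients asynchronously (no barrier)."""
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.lock = threading.Lock()

    def pull(self):
        with self.lock:
            return self.w.copy()

    def push(self, grad):
        # Applied as soon as it arrives; workers never wait for each other
        with self.lock:
            self.w -= self.lr * grad

def worker(ps, X, y, steps):
    n = len(X)
    for i in range(steps):
        w = ps.pull()                     # may already be stale when we push
        xi, yi = X[i % n], y[i % n]
        ps.push((xi @ w - yi) * xi)       # per-example squared-loss gradient

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, 2.0, -1.0])
y = X @ w_true                            # noiseless synthetic data
ps = ParamServer(3)
threads = [threading.Thread(target=worker, args=(ps, X[s], y[s], 2000))
           for s in (slice(0, 250), slice(250, 500))]
for t in threads: t.start()
for t in threads: t.join()
# ps.w lands near w_true despite asynchronous, possibly stale updates
```

Even though each worker may compute its gradient on a stale copy of the weights, SGD still converges here, which is the robustness the previous slide points to.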
61
Systems-level Advances in PS
❖ Workers and Servers can both be multi-nodes/multi-process; fine-grained task scheduling on nodes
❖ Range-based partitioning of params for servers
❖ Timestamped and compressed updates exchanged
❖ Replication and fault tolerance
62
Tradeoffs of Parameter Server
❖ Usability. Pro: supports billions of features and params. Con: not reproducible; not well integrated with ETL stacks.
❖ Manageability. Pro: in-built fault tolerance. Con: TensorFlow offers it natively; otherwise, hard to operate/govern.
❖ Efficiency. Pro: faster than in-RDBMS ML and ML-on-dataflow. Con: not suitable for dense updates; high comm. cost.
❖ Scalability. Pro: highest; can work with 1000s of nodes. Con: not suitable for smaller scales due to overheads.
❖ Developability. Pro: abstracts away many systems scaling issues. Con: reasoning about ML (in)consistency is hard.
63
Sample of Your Strong Points on PS
❖ “Abstracts away most system complexity such as asynchronous communication between nodes, fault tolerance, and data replication”
❖ “Allows addition and deletion of nodes to the framework without any restarts by means of a consistent hash ring with replication of data”
❖ “Relaxing consistency constraints gives the algorithm designer flexibility in defining different consistency models (going from sequential to eventual) by balancing system efficiency and algorithm convergence”
❖ “Outperforms other open source systems in Sparse Logistic Regression”
❖ “Range construct, while simple, seems fundamental to efficient communication and thus fault tolerance”
❖ “Generality. Both supervised learning and unsupervised learning”
❖ “Fault tolerance is crucial … in an unreliable environment like the cloud where the machines can fail or jobs can be preempted”
64
Sample of Your Weak Points on PS
❖ “Does not show where or when the bottlenecks of synchronization appear”
❖ “Could lead to a longer amount of time for an ML algorithm model to train if the model is sensitive to inconsistent data”
❖ “Does not specify how to set a reasonable threshold to achieve best trade-off between system efficiency and algorithm convergence rate”
❖ “Flexibility in choosing consistency models also adds in complexity”
❖ “Non-convex models, such as complex neural networks, may not be suitable for this pipeline”
❖ “Didn’t show how the system performed in training a deep neural net”
❖ “Crucial and widely used non-parametric models like Decision Trees and K nearest neighbors do not fit the architecture”
❖ “Google is hardly representative of the broader industry”
❖ “There is no eval at all for … fault tolerance”
65
Innovativeness and Depth Ratings
66
Outline
❖ Basics of Scaling ML Computations ❖ Scaling ML to On-Disk Files ❖ Layering ML on Scalable Data Systems ❖ Custom Scalable ML Systems
❖ Parameter Server ❖ GraphLab ❖ XGBoost
❖ Advanced Issues in ML Scalability
67
Graph-Parallel Algorithms
❖ Some data analytics algorithms (not just ML) operate on graph data and have complex update dependencies
Example: PageRank
http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
❖ Not a simple sequential access pattern like SGD ❖ If viewed as a table: reads and writes to tuples, with each
write depending on all neighboring tuples ❖ Does not scale well with RDBMS/UDA or MapReduce/Spark
68
The GraphLab Abstraction
❖ “Think like a vertex” paradigm over data graph (V, E, D) ❖ Arbitrary data state associated with vertices and edges ❖ 3-function API: Gather-Apply-Scatter
❖ Gather: Collect latest states from neighbors ❖ Apply: Vertex-local state update function ❖ Scatter: Send local state to neighbors
http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
❖ Original single-node GraphLab assumed shared-memory
❖ Scaled to single-node disk with careful sharding & caching of graph
69
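The GAS paradigm can be sketched in a few lines of Python with PageRank as the vertex program. This is a synchronous, single-node toy (real GraphLab allows asynchronous, distributed execution), and it assumes every vertex has at least one out-neighbor (no dangling vertices).

```python
def pagerank_gas(adj, d=0.85, iters=50):
    # Synchronous Gather-Apply-Scatter sketch of PageRank.
    # adj maps each vertex to its list of out-neighbors.
    verts = list(adj)
    rank = {v: 1.0 / len(verts) for v in verts}
    in_nbrs = {v: [u for u in verts if v in adj[u]] for v in verts}
    for _ in range(iters):
        new = {}
        for v in verts:
            # Gather: pull latest states from in-neighbors
            gathered = sum(rank[u] / len(adj[u]) for u in in_nbrs[v])
            # Apply: vertex-local state update
            new[v] = (1 - d) / len(verts) + d * gathered
        rank = new   # Scatter: publish new states to neighbors
    return rank

adj = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = pagerank_gas(adj)
# ranks sum to 1; "c" (with two in-links) ranks highest
```

The data dependencies are visible here: each vertex reads all of its in-neighbors' states per step, which is exactly the access pattern that UDAs and MapReduce handle poorly.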
Distributed GraphLab Execution
http://vldb.org/pvldb/vol5/p716_yuchenglow_vldb2012.pdf
❖ Enabled asynchronous updates of vertex and edge states ❖ Consistency-parallelism tradeoff ❖ Some algorithms seem robust to such inconsistency ❖ Sophisticated distributed locking protocols
70
Outline
❖ Basics of Scaling ML Computations ❖ Scaling ML to On-Disk Files ❖ Layering ML on Scalable Data Systems ❖ Custom Scalable ML Systems
❖ Parameter Server ❖ GraphLab ❖ XGBoost
❖ Advanced Issues in ML Scalability
71
Decision Tree Data Access Pattern
❖ CART has complex non-sequential data access patterns ❖ Compare candidate splits on each feature at each node ❖ Class-conditional aggregates needed per candidate ❖ Repartition data for sub-tree growth
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/variable-importance.html
❖ Does not scale well with RDBMS/UDA or MapReduce/Spark
72
Decision Tree Ensembles
❖ RandomForest is very popular ❖ Just a bunch of independent trees on column subsets
❖ Tree Boosting is a popular adaptive ensemble ❖ Construct trees sequentially and weigh them ❖ “Weak” learners aggregated to a strong learner
❖ Gradient-Boosted Decision Tree (GBDT) is very popular ❖ Also adds trees to ensemble sequentially ❖ Real-valued prediction; convex differentiable loss
73
GBDT Data Access Pattern
❖ More complex non-sequential access pattern than single tree! ❖ An “iteration” adds a tree by exploring candidate splits ❖ Still needs recursive data re-partitioning ❖ Key difference: scoring function has more statistics (1st
and 2nd deriv.); this needs read-write access per example
❖ Access pattern over per-example stats is random, depends on split location
❖ Ideal if whole stats and a whole column (at a time) can fit in RAM
Dataset:                      Stats:
Y   X1   X2   X3              Gi   Hi
0   1b   1c   1d              …    …
1   2b   2c   2d              …    …
1   3b   3c   3d              …    …
0   4b   4c   4d              …    …
…   …    …    …               …    …
74
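A sketch of how the per-example first- and second-order stats (g_i, h_i) feed split scoring, in the style of XGBoost's regularized gain. The `lam` regularizer, the squared-loss stats, and the tiny dataset are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def split_gain(g, h, mask, lam=1.0):
    # Second-order "structure score" gain of splitting a node into L/R
    # (lam is the L2 regularizer on leaf weights)
    GL, HL = g[mask].sum(), h[mask].sum()
    GR, HR = g[~mask].sum(), h[~mask].sum()
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR))

def best_split(x, g, h):
    # Enumerate candidate thresholds over one (pre-sorted) feature column
    best_gain, best_t = 0.0, None
    for t in np.unique(x)[1:]:
        gain = split_gain(g, h, x < t)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

# Per-example stats for squared loss: g_i = pred_i - y_i, h_i = 1
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
pred = np.zeros(6)                        # current ensemble predictions
g, h = pred - y, np.ones(6)
gain, thresh = best_split(x, g, h)        # picks the gap between 3 and 10
```

Note the access pattern the slide describes: every candidate threshold needs sums of g and h over the examples on each side, so the per-example stats are read over and over in split-dependent order.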
XGBoost
❖ Custom ML system to scale GBDT to larger-than-RAM data, both single-node disk and on a cluster
❖ Very popular on tabular data; won many Kaggle contests ❖ Key philosophy: Algorithm-system “co-design”:
❖ Make system implementation memory hierarchy-aware based on algorithm’s data access patterns
❖ Modify ML algorithmics to better suit system scale
75
XGBoost: Algorithm-level Ideas
❖ 2 key changes to make GBDT more scalable ❖ Bottleneck: Computing candidate split stats at scale
❖ Idea: Approximate stats with weighted quantile sketch ❖ Bottleneck: Sparse feature vectors and missing data
❖ Idea: Bake in “default direction” for child during learning
76
XGBoost: Systems-level Ideas
❖ 4 key choices to ensure efficiency at scale ❖ Goal: Reduce overhead of evaluating candidates
❖ Idea: Pre-sort all columns independently ❖ Goal: Exploit parallelism to raise throughput
❖ Idea: Shard data into a column “block” per worker; workers compute local columns’ stats independently
77
XGBoost: Systems-level Ideas
❖ 4 key choices to ensure efficiency at scale ❖ Goal: Mitigate read-write randomness to stats in RAM
❖ Idea: CPU cache-aware staging of stats to reduce stalls
❖ Goal: Scale to on-disk and cluster data ❖ Idea: Shard blocks further on disk; stage reads to RAM;
block-level compression to reduce I/O latency
78
XGBoost: Scalability Gains
❖ Gains from both algorithmic changes and systems ideas
79
Tradeoffs of Custom Distr. ML Sys.
❖ Usability. Pro: suitable for hyper-specialized ML use case. Con: need to (re)learn new system APIs again.
❖ Manageability. Pro: more feasible in cloud-native / managed env. Con: extra overhead to add and maintain in tools ecosystem.
❖ Efficiency. Pro: often much lower runtimes and costs. Con: debugging runtime issues needs specialized knowhow.
❖ Scalability. Pro: often more scalable to larger datasets. Con: arcane scalability issues may arise, e.g., inconsistency.
❖ Developability. Pro: more amenable if ML algorithm is familiar; new open source/startup communities. Con: need to (re)learn new implementation APIs and consistency tradeoffs; risk of lower technical support.
80
Your Strong Points on XGBoost
❖ “The scalability of XGBoost algorithm enables much faster learning speed in model exploration by using parallel and distributed computing”
❖ “Versatile: it could be applied to a wide range of problems, including classification, ranking, rate prediction, and categorization”
❖ “Flexibility. The system supports the exact greedy algorithm and the approximate algorithm”
❖ “Handles the issue of sparsity in data”
❖ “Theoretically justified weighted quantile sketch”
❖ “Language Support and portability”
❖ “By making XGBoost open-source, they put their promises into action”
❖ “Make a point of cleaving to real-world issues faced by industry users”
❖ “This end-to-end system is widely used by data scientists”
81
Your Weak Points on XGBoost
❖ “Specificity: The system only works for gradient boosted tree algorithms”
❖ “In recent years deep neural networks can beat boosting tree algorithms”
❖ “Accuracy of XGBoost algorithms is generally lower than LightGBM”
❖ “Kaggle competitions and industry ML roles have quite a gap”
❖ “System introduces a lot of hyperparameters”
❖ “XGBoost tuning is difficult”
❖ “Sparsity-aware approach to split-finding introduces another layer of learned uncertainty … no data exploring effect on accuracy”
❖ “No evaluation on their weighted quantile sketch algorithm”
❖ “Finding correct block size for approximate algorithm … need to tune”
❖ “Not much detail on how the system parallelizes its workload”
❖ “Assumes readers have deep understanding about Boosting algorithm”
82
Innovativeness and Depth Ratings
83
Outline
❖ Basics of Scaling ML Computations ❖ Scaling ML to On-Disk Files ❖ Layering ML on Scalable Data Systems ❖ Custom Scalable ML Systems ❖ Advanced Issues in ML Scalability
84
Advanced Issues in ML Scalability
❖ Streaming/Incremental ML at scale ❖ Scaling Massive Task Parallelism in ML ❖ Pushing ML Through Joins ❖ Larger-than-Memory Models ❖ Models with More Complex Data Access Patterns ❖ Delay-Tolerant / Geo-Distributed / Federated ML ❖ Scaling end-to-end ML Pipelines
85
Scalable Incremental ML
❖ Datasets keep growing in many real-world ML applications ❖ Incremental ML: update a learned model using only new data
❖ Streaming ML is one variant (near real-time) ❖ Non-trivial to make all ML algorithms incremental
❖ SGD-based procedures are “online” by default; just resume gradient descent on new data
❖ ML/data mining folks have studied how to make other ML algorithms incremental; accuracy-runtime tradeoffs
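The "online by default" point can be shown in a few lines: one SGD routine is simply called again on the new batch, resuming from the current weights. The data and learning rate are illustrative; real incremental ML must also worry about drift and learning-rate schedules.

```python
import numpy as np

def sgd_pass(w, X, y, lr=0.05):
    # One sequential SGD pass; callable again whenever new data arrives
    w = w.copy()
    for xi, yi in zip(X, y):
        w -= lr * (xi @ w - yi) * xi
    return w

rng = np.random.default_rng(3)
w_true = np.array([1.0, -1.0])
w = np.zeros(2)

X1 = rng.normal(size=(500, 2))
w = sgd_pass(w, X1, X1 @ w_true)   # train on the initial batch
X2 = rng.normal(size=(200, 2))
w = sgd_pass(w, X2, X2 @ w_true)   # new data arrives: just resume from w
# no retraining from scratch was needed
```

Contrast this with, say, an exact decision tree, where new examples can change early split decisions and there is no comparably cheap resume.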
86
Scalable Incremental ML: SageMaker
❖ Industrial cloud-native ML requirements: ❖ Incremental training and model freshness ❖ Predictable training runtime ❖ Elasticity and pause-resume ❖ Trainable on ephemeral (non-archived) data ❖ Automatic model/hyper-parameter tuning
❖ Design: Streaming ML algorithms that fit into a 3-function API of Initialize-Update-Finalize (akin to UDA) ❖ All SGD-based procedures (GLMs, factorization machines) ❖ Variants of K-Means, PCA, topic models, forecasting
87
Scalable Incremental ML: SageMaker
❖ Parameter Server-based architecture with streaming workers
88
Massive Task Parallelism: Ray
❖ Advanced ML applications that use reinforcement learning produce large numbers of short-running tasks ❖ Training robots, self-driving cars, etc. ❖ ML-based cluster resource management
https://www.usenix.org/system/files/osdi18-moritz.pdf
❖ Ray is an ML system that automatically scales such tasks from single-node to large clusters
❖ Tasks can have shared state for control
❖ Data is replicated/broadcast
89
Pushing ML Through Joins
❖ Most real-world tabular/relational datasets are multi-table ❖ A central fact table with target; many dimension tables
❖ Key-foreign key joins are ubiquitous to denormalize such data into single table before ML training
❖ Single table has a lot of data redundancy ❖ In turn, many ML algorithms end up
having a lot of computational redundancy
❖ Pushing ML computations through joins enables computations directly over the normalized database
90
Pushing ML Through Joins
❖ Orion first showed how to rewrite ML computations to operate on join input rather than output, aka “factorized ML” ❖ GLMs with BGD, L-BFGS, Newton methods ❖ Rewritten impl. fits within UDA / MapReduce abstractions
https://adalabucsd.github.io/papers/2015_Orion_SIGMOD.pdf https://adalabucsd.github.io/papers/2017_Morpheus_VLDB.pdf
❖ Morpheus offered a unified abstraction based on linear algebra to factorize many ML algorithms in one formalism ❖ K-Means clustering, matrix
factorization, etc.
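A toy illustration of the redundancy argument: scoring a linear model over the denormalized join versus "factorized" directly over the normalized tables. The tables, foreign key, and weights below are made up; the point is that each dimension-table row's partial result is computed once and reused via the FK, instead of once per matching fact row.

```python
import numpy as np

# Fact table S (5 rows, 2 features) with a foreign key into dim table R
S = np.array([[1., 0.], [0., 1.], [1., 1.], [2., 0.], [0., 2.]])
fk = np.array([0, 1, 0, 1, 1])            # which R row each S row joins to
R = np.array([[1., 2., 3.], [4., 5., 6.]])
w_S = np.array([0.5, -0.5])               # weights over S's features
w_R = np.array([1., 0., -1.])             # weights over R's features

# Materialized: denormalize first, then multiply (redundant copies of R rows)
T = np.hstack([S, R[fk]])
scores_full = T @ np.concatenate([w_S, w_R])

# Factorized: push the multiply through the join (each R row computed once)
partial_R = R @ w_R
scores_fact = S @ w_S + partial_R[fk]
# the two score vectors are identical
```

With many fact rows per dimension row, the factorized form does proportionally less arithmetic, which is the computational redundancy the previous slide identifies.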
91
Larger-than-Memory Models
❖ Some ML algorithms may have state that does not fit in RAM ❖ Need to shard model too (not just data) to stage updates ❖ Specific to the ML algorithm’s data access patterns
http://www.cs.cmu.edu/~kijungs/etc/10-405.pdf
Example: Matrix factorization trained with SGD
92
Outline
❖ Basics of Scaling ML Computations ❖ Scaling ML to On-Disk Files ❖ Layering ML on Scalable Data Systems ❖ Custom Scalable ML Systems ❖ Advanced Issues in ML Scalability
93
Evaluating “Scalability”
https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf
❖ Message: Not just scalability but efficiency matters too; report not just the speedup curve but time vs strong (single-node) baselines