ML Benchmark Design Challenges
Peter Mattson, [email protected]
(work by many people in MLPerf community)
ISPASS FastPath 2019 -- March 24th
MLPerf is the work of many
Overview
Background
Major design choices
Benchmarks
Metric
Problems and solutions
Presentation choices
Conclusion
Why benchmark?
Measurement drives innovation.
Example: SPEC → Modern computing
Example: Gauge blocks → Modern industry
Why benchmark ML system performance now?
According to the New York Times, 2018 saw:
● 45 startups working on ML chips
● $1.5B invested in ML chip startups
What is MLPerf?
The first machine learning performance benchmark suite with broad industry and academic support.
MLPerf: looking back at 2018
Feb. -- Concept: We need an ML benchmark suite!
May -- Launch: Hey everyone, we have an ML benchmark suite! Want to submit?
Nov. -- Submissions: Google, Intel, and NVIDIA engineers allowed to sleep.
Dec. -- Results: We post results.
MLPerf: looking ahead to 2019
Apr. -- Announcements: Inference v0.5, Training v0.6
June -- Results: Inference v0.5, Training v0.6
Oct. -- And results again! Inference v0.6, Training v0.7
Dec. -- Host entity: Host organization to maintain the benchmark
Overview
Background
Major design choices
Benchmarks
Metric
Problems and solutions
Presentation choices
Conclusion
What is an ML benchmark?
Train a model on a dataset (e.g. ImageNet) to a target quality (e.g. 75.9%).
Do we specify the model?
Dataset (e.g. ImageNet) → which model? → target quality (e.g. 75.9%)
Choice: two divisions
Closed division: model is specified
Open division: model is not specified
What benchmarks should we use?
Vision: image recognition, object detection, segmentation, video, medical imaging
Language: text to speech, speech to text, translation, natural language processing
Commerce: recommendation, time series
Other: reinforcement learning (games), reinforcement learning (robotics), GANs
But which models for the closed division?
Should we select lowest common denominator, current, or cutting edge?
Image recognition: AlexNet, ResNet, or RetinaNet?
If we specify the model we might want:
Different “complexity” models
Object detection: SSD vs. Mask R-CNN
Different “method” models
Translation: GNMT vs. Transformer
Our choices for v0.5
Mix of importance, availability of data, and readiness of code. Cutting-edge but not bleeding-edge models.
Area | Problem | Dataset | Model
Vision | Image recognition | ImageNet | ResNet
Vision | Object detection | COCO | SSD
Vision | Object detection | COCO | Mask R-CNN
Language | Translation | WMT Eng.-German | GNMT
Language | Translation | WMT Eng.-German | Transformer
Commerce | Recommendation | MovieLens-20M | NCF
Other | Go | Pro games | MiniGo
Overview
Background
Major design choices
Benchmarks
Metric
Problems and solutions
Presentation choices
Conclusion
Metric: throughput vs. time-to-train
Throughput (samples / sec)
Easy / cheap to measure
Higher throughput: lower precision, higher batch size
Fewer epochs: higher precision, lower batch size
Can increase throughput at the cost of total time to train! (See the toy calculation below.)
Time-to-train (end-to-end)
Time to solution! Expensive. High variance. Least bad choice.
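To make the trade-off concrete, here is a toy Python calculation with invented numbers (2,000 vs. 3,000 samples/s; 90 vs. 160 epochs to converge): raising the batch size can lift throughput and still lengthen time-to-train if convergence needs more epochs.

    # Toy numbers, purely illustrative: higher throughput does not imply
    # faster time-to-train if more epochs are needed to reach target quality.
    DATASET_SIZE = 1_281_167  # ImageNet training images

    def time_to_train(samples_per_sec, epochs_to_converge):
        return epochs_to_converge * DATASET_SIZE / samples_per_sec

    baseline = time_to_train(samples_per_sec=2_000, epochs_to_converge=90)
    big_batch = time_to_train(samples_per_sec=3_000, epochs_to_converge=160)

    print(f"baseline:  {baseline / 3600:.1f} h")   # ~16.0 h
    print(f"big batch: {big_batch / 3600:.1f} h")  # ~19.0 h -- slower end-to-end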
How do you define time to train?
Time component | Included? | Rationale
System initialization | Partial or no | May be disproportionate on larger systems running smaller datasets
Preprocessing | No | Need to allow reformatting for fairness; plus, same rationale as system init
Non-deterministic preprocessing | Yes | Changes across epochs
Training | Yes |
Evaluation | Yes, but limited | Evaluating every epoch is more common in research than in production
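A rough Python sketch of where the clock starts and stops under this definition; the system object and its methods here are placeholders for a submitter's implementation, not MLPerf APIs.

    import time

    def timed_run(system, raw_data, target_quality, eval_every_n_epochs=4):
        # Outside the clock: dataset reformatting / deterministic preprocessing
        # and (most) system initialization, per the table above.
        train_set = system.preprocess(raw_data)
        system.initialize()

        start = time.time()                  # clock starts here
        quality, epoch = 0.0, 0
        while quality < target_quality:
            # Inside the clock: non-deterministic per-epoch preprocessing
            # (shuffling, augmentation) and training itself.
            system.train_one_epoch(system.shuffle_and_augment(train_set))
            epoch += 1
            if epoch % eval_every_n_epochs == 0:   # evaluation is timed but limited
                quality = system.evaluate()
        return time.time() - start           # clock stops once target quality is hit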
Overview
Background
Major design choices
Benchmarks
Metric
Problems and solutions
Presentation choices
Conclusion
Problem: allow reimplementation of models
There are multiple competing ML frameworks
Not all architectures support all frameworks
Implementations still require some degree of tuning
Temporary solution: allow submitters to reimplement the benchmarks
Require that models be mathematically equivalent
Exceptions: floating point, whitelist of minor differences
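The slides do not prescribe how equivalence is verified; one way to spot-check a reimplementation (a sketch, assuming you can run both implementations on identical inputs and weights) is to compare outputs within a floating-point tolerance rather than demand bit-exact equality.

    import numpy as np

    def outputs_match(reference_logits, reimplementation_logits, rtol=1e-3, atol=1e-5):
        # Spot-check that two implementations produce numerically close outputs
        # on the same inputs and weights. Tolerances are illustrative; exact
        # bit-equality is not expected because of floating-point differences.
        return np.allclose(np.asarray(reference_logits),
                           np.asarray(reimplementation_logits),
                           rtol=rtol, atol=atol)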
Problem: measure systems, not hyperparameters
Different system sizes require...
Different batch sizes, which require...
Different optimizer hyperparameters
But some working hyperparameters are better than others
Finding good hyperparameters is expensive and not the point of the benchmark
Solution 1: hyperparameter stealing during the review process
Solution 2: a batch-size-to-hyperparameter table (sketched below)
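A hypothetical illustration of such a table (the values are invented, not the official MLPerf settings): submitters look up pre-vetted optimizer settings for the batch size their system needs instead of running their own search.

    # Invented example of a batch-size-to-hyperparameter table; the real tables
    # are maintained per benchmark by the MLPerf community.
    HYPERPARAMS_BY_BATCH_SIZE = {
        256:  {"base_lr": 0.1, "warmup_epochs": 0},
        1024: {"base_lr": 0.4, "warmup_epochs": 5},
        4096: {"base_lr": 1.6, "warmup_epochs": 5},
    }

    def hyperparams_for(batch_size):
        try:
            return HYPERPARAMS_BY_BATCH_SIZE[batch_size]
        except KeyError:
            raise ValueError(f"no vetted hyperparameters for batch size {batch_size}")

The learning rates above simply follow a linear-scaling heuristic; the point is that submitters pick from agreed-upon settings rather than spending compute on hyperparameter search.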
Problem: reduce variance
ML convergence has relatively high variance
Solution (kind of): run each benchmark multiple times
To reduce the margin of error by a factor of x, need roughly x² as many runs = $$$ (see the arithmetic sketch after this list)
Settled for high margins of error:
For vision: 5 runs, 90% of runs on the same system within 5%
For everything else: 10 runs, 90% of runs on the same system within 10%
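The x² cost follows from the standard error of a mean shrinking as 1/√n: halving the error margin takes roughly four times as many runs. A quick arithmetic check in Python:

    # Standard error of the mean over n runs scales as sigma / sqrt(n), so
    # shrinking it by a factor x requires roughly x**2 as many runs.
    def runs_needed(current_runs, error_reduction_factor):
        return current_runs * error_reduction_factor ** 2

    print(runs_needed(current_runs=5, error_reduction_factor=2))   # 20 runs
    print(runs_needed(current_runs=5, error_reduction_factor=4))   # 80 runs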
Overview
Background
Major design choices
Benchmarks
Metric
Problems and solutions
Presentation choices
Conclusion
Present results as raw time-to-train or speedups over reference system?
Raw time-to-train
Makes physical sense, e.g. 579 minutes.
But benchmarks have widely varied running times
Hard to see which results are good
Instead, could present speedups over a reference system, e.g. 12.3x faster
Also makes higher better
Present results raw, with scale, or scale-normalized?
Do you present only the results?
Results lack scale information.
If so, an inefficient larger system can look better than an efficient smaller system.
Could add a supplemental scale
Number of chips, cost, power
Could normalize results by the scaling value
Performance / watt or performance / dollar
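A toy example with invented numbers of why scale matters: a 64-chip system can post a better raw time-to-train than an 8-chip system while being less efficient once results are normalized per chip (the same idea applies to watts or dollars).

    # Invented numbers: raw time-to-train favors the big system, but
    # scale-normalized (per-chip) performance favors the small one.
    systems = {
        "big_system":   {"chips": 64, "time_to_train_min": 30.0},
        "small_system": {"chips": 8,  "time_to_train_min": 120.0},
    }

    for name, s in systems.items():
        perf = 1.0 / s["time_to_train_min"]      # higher is better
        perf_per_chip = perf / s["chips"]
        print(f"{name}: perf={perf:.4f}, perf/chip={perf_per_chip:.6f}")

    # big_system:   perf=0.0333, perf/chip=0.000521
    # small_system: perf=0.0083, perf/chip=0.001042  <- more efficient per chip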
Present results or summarize?
Do you have a single MLPerf score that summarizes all results?
Pro:
Easy to communicate
Do it correctly and consistently
Con:
Oversimplifies -- systems are optimized for different use cases
Users do not care about all use cases equally
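If a single score were reported, one consistent way to aggregate would be a geometric mean of per-benchmark speedups over the reference system; the sketch below illustrates that design option with invented speedups, not a rule MLPerf adopted for v0.5.

    import math

    def summary_score(speedups):
        # Geometric mean of per-benchmark speedups over the reference system.
        return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

    print(round(summary_score([8.0, 12.0, 5.0, 20.0]), 2))  # ~9.9 (speedups invented)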
Results (mlperf.org/results)
Overview
Background
Major design choices
Benchmarks
Metric
Problems and solutions
Presentation choices
Conclusion
Is ML benchmarking important? Yes.
● “We are glad to see MLPerf grow from just a concept to a major consortium supported by a wide variety of companies and academic institutions. The results released today will set a new precedent for the industry to improve upon to drive advances in AI,” reports Haifeng Wang, Senior Vice President of Baidu who oversees the AI Group.
● “Open standards such as MLPerf and Open Neural Network Exchange (ONNX) are key to driving innovation and collaboration in machine learning across the industry,” said Bill Jia, VP, AI Infrastructure at Facebook. “We look forward to participating in MLPerf with its charter to standardize benchmarks.”
● “MLPerf can help people choose the right ML infrastructure for their applications. As machine learning continues to become more and more central to their business, enterprises are turning to the cloud for the high performance and low cost of training of ML models,” – Urs Hölzle, Senior Vice President of Technical Infrastructure, Google.
● “We believe that an open ecosystem enables AI developers to deliver innovation faster. In addition to existing efforts through ONNX, Microsoft is excited to participate in MLPerf to support an open and standard set of performance benchmarks to drive transparency and innovation in the industry.” – Eric Boyd, CVP of AI Platform, Microsoft
● “MLPerf demonstrates the importance of innovating in scale-up computing as well as at all levels of the computing stack — from hardware architecture to software and optimizations across multiple frameworks.” --Ian Buck, vice president and general manager of Accelerated Computing at NVIDIA
Lots of work remains.
Areas that need improvement:
More, better benchmarks
Reduced variance
Open division utility for academia
Better public datasets
Better reference implementations
We need your help to make MLPerf better. Join us at mlperf.org!