ML Benchmark Design Challenges
Peter Mattson, [email protected]
(work by many people in MLPerf community)
ISPASS FastPath 2019 -- March 24th
MLPerf is the work of many
Overview
Background
Major design choices
Benchmarks
Metric
Problems and solutions
Presentation choices
Conclusion
Why benchmark?
Measurement drives innovation.
Example: SPEC → Modern computing
Example: Gauge blocks → Modern industry
Why benchmark ML system performance now?
According to the New York Times, 2018 saw:
● 45 startups working on ML chips
● $1.5B invested in ML chip startups
What is MLPerf?
The first machine learning performance benchmark suite with broad industry and academic support.
MLPerf: looking back at 2018
Feb. -- Concept: We need an ML benchmark suite!
May -- Launch: Hey everyone, we have an ML benchmark suite! Want to submit?
Nov. -- Submissions: Google, Intel, and NVIDIA engineers allowed to sleep.
Dec. -- Results: We post results.
MLPerf: looking ahead to 2019
Apr. -- Announcements: Inference v0.5, Training v0.6
June -- Results: Inference v0.5, Training v0.6
Oct. -- And results again! Inference v0.6, Training v0.7
Dec. -- Host entity: Host organization to maintain the benchmark
Overview
Background
Major design choices
Benchmarks
Metric
Problems and solutions
Presentation choices
Conclusion
What is an ML benchmark?
Train a model on a dataset (e.g. ImageNet) to a target quality (e.g. 75.9%).
Do we specify the model?
Dataset (e.g. ImageNet) → which model? → target quality (e.g. 75.9%)
Choice: two divisions
Closed division: model is specified
Open division: model is not specified
What benchmarks should we use?
Vision: image recognition, object detection, segmentation, video, medical imaging
Language: text to speech, speech to text, translation, natural language processing
Commerce: recommendation, time series
Other: reinforcement learning (games), reinforcement learning (robotics), GANs
But which models for the closed division?
Should we select lowest common denominator, current, or cutting edge?
Image recognition: AlexNet, ResNet, or RetinaNet?
If we specify the model we might want:
Different “complexity” models
Object detection: SSD vs. Mask R-CNN
Different “method” models
Translation: GNMT vs. Transformer
Our choices for v0.5
Mix of importance, availability of data, and readiness of code. Cutting-edge but not bleeding-edge models.
Area | Problem | Dataset | Model
Vision | Image recognition | ImageNet | ResNet
Vision | Object detection | COCO | SSD
Vision | Object detection | COCO | Mask R-CNN
Language | Translation | WMT Eng.-German | GNMT
Language | Translation | WMT Eng.-German | Transformer
Commerce | Recommendation | MovieLens-20M | NCF
Other | Go | Pro games | MiniGo
Overview
Background
Major design choices
Benchmarks
Metric
Problems and solutions
Presentation choices
Conclusion
Metric: throughput vs. time-to-train
Throughput (samples / sec)
Easy / cheap to measure
Higher throughput: lower precision, higher batch size
Fewer epochs: higher precision, lower batch size
Can increase throughput at the cost of total time to train! (See the toy calculation below.)
Time-to-train (end-to-end)
Time to solution! Expensive. High variance. Least bad choice.
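To make the trade-off concrete, here is a toy Python calculation with invented numbers (2,000 vs. 3,000 samples/s; 90 vs. 160 epochs to converge): raising the batch size can lift throughput and still lengthen time-to-train if convergence needs more epochs.

    # Toy numbers, purely illustrative: higher throughput does not imply
    # faster time-to-train if more epochs are needed to reach target quality.
    DATASET_SIZE = 1_281_167  # ImageNet training images

    def time_to_train(samples_per_sec, epochs_to_converge):
        return epochs_to_converge * DATASET_SIZE / samples_per_sec

    baseline = time_to_train(samples_per_sec=2_000, epochs_to_converge=90)
    big_batch = time_to_train(samples_per_sec=3_000, epochs_to_converge=160)

    print(f"baseline:  {baseline / 3600:.1f} h")   # ~16.0 h
    print(f"big batch: {big_batch / 3600:.1f} h")  # ~19.0 h -- slower end-to-end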
How do you define time to train?
Time component | Included? | Rationale
System initialization | Partial or no | May be disproportionate on larger systems running smaller datasets
Preprocessing | No | Need to allow reformatting for fairness; plus, same rationale as system init
Non-deterministic preprocessing | Yes | Changes across epochs
Training | Yes |
Evaluation | Yes, but limited | Evaluating every epoch is more common in research than in production
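A rough Python sketch of where the clock starts and stops under this definition; the system object and its methods here are placeholders for a submitter's implementation, not MLPerf APIs.

    import time

    def timed_run(system, raw_data, target_quality, eval_every_n_epochs=4):
        # Outside the clock: dataset reformatting / deterministic preprocessing
        # and (most) system initialization, per the table above.
        train_set = system.preprocess(raw_data)
        system.initialize()

        start = time.time()                  # clock starts here
        quality, epoch = 0.0, 0
        while quality < target_quality:
            # Inside the clock: non-deterministic per-epoch preprocessing
            # (shuffling, augmentation) and training itself.
            system.train_one_epoch(system.shuffle_and_augment(train_set))
            epoch += 1
            if epoch % eval_every_n_epochs == 0:   # evaluation is timed but limited
                quality = system.evaluate()
        return time.time() - start           # clock stops once target quality is hit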
Overview
Background
Major design choices
Benchmarks
Metric
Problems and solutions
Presentation choices
Conclusion
Problem: allow reimplementation of models
There are multiple competing ML frameworks
Not all architectures support all frameworks
Implementations still require some degree of tuning
Temporary solution: allow submitters to reimplement the benchmarks
Require that models be mathematically equivalent
Exceptions: floating point, whitelist of minor differences
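The slides do not prescribe how equivalence is verified; one way to spot-check a reimplementation (a sketch, assuming you can run both implementations on identical inputs and weights) is to compare outputs within a floating-point tolerance rather than demand bit-exact equality.

    import numpy as np

    def outputs_match(reference_logits, reimplementation_logits, rtol=1e-3, atol=1e-5):
        # Spot-check that two implementations produce numerically close outputs
        # on the same inputs and weights. Tolerances are illustrative; exact
        # bit-equality is not expected because of floating-point differences.
        return np.allclose(np.asarray(reference_logits),
                           np.asarray(reimplementation_logits),
                           rtol=rtol, atol=atol)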
Problem: measure systems, not hyperparameters
Different system sizes require...
Different batch sizes, which require...
Different optimizer hyperparameters
But some working hyperparameters are better than others
Finding good hyperparameters is expensive and not the point of the benchmark
Solution 1: hyperparameter stealing during the review process
Solution 2: a batch-size-to-hyperparameter table (sketched below)
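A hypothetical illustration of such a table (the values are invented, not the official MLPerf settings): submitters look up pre-vetted optimizer settings for the batch size their system needs instead of running their own search.

    # Invented example of a batch-size-to-hyperparameter table; the real tables
    # are maintained per benchmark by the MLPerf community.
    HYPERPARAMS_BY_BATCH_SIZE = {
        256:  {"base_lr": 0.1, "warmup_epochs": 0},
        1024: {"base_lr": 0.4, "warmup_epochs": 5},
        4096: {"base_lr": 1.6, "warmup_epochs": 5},
    }

    def hyperparams_for(batch_size):
        try:
            return HYPERPARAMS_BY_BATCH_SIZE[batch_size]
        except KeyError:
            raise ValueError(f"no vetted hyperparameters for batch size {batch_size}")

The learning rates above simply follow a linear-scaling heuristic; the point is that submitters pick from agreed-upon settings rather than spending compute on hyperparameter search.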
Problem: reduce variance
ML convergence has relatively high variance
Solution (kind of): run each benchmark multiple times
To reduce the margin of error by a factor of x, need roughly x² as many runs = $$$ (see the arithmetic sketch after this list)
Settled for high margins of error:
For vision: 5 runs, 90% of runs on the same system within 5%
For everything else: 10 runs, 90% of runs on the same system within 10%
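The x² cost follows from the standard error of a mean shrinking as 1/√n: halving the error margin takes roughly four times as many runs. A quick arithmetic check in Python:

    # Standard error of the mean over n runs scales as sigma / sqrt(n), so
    # shrinking it by a factor x requires roughly x**2 as many runs.
    def runs_needed(current_runs, error_reduction_factor):
        return current_runs * error_reduction_factor ** 2

    print(runs_needed(current_runs=5, error_reduction_factor=2))   # 20 runs
    print(runs_needed(current_runs=5, error_reduction_factor=4))   # 80 runs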
Overview
Background
Major design choices
Benchmarks
Metric
Problems and solutions
Presentation choices
Conclusion
Present results as raw time-to-train or speedups over reference system?
Raw time-to-train
Makes physical sense, e.g. 579 minutes.
But benchmarks have widely varied running times
Hard to see which results are good
Instead, could present speedups over a reference system, e.g. 12.3x faster
Also makes higher better
Present results raw, with scale, or scale-normalized?
Do you present only the results?
Results lack scale information.
If so, an inefficient larger system can look better than an efficient smaller system.
Could add a supplemental scale
Number of chips, cost, power
Could normalize results by the scaling value
Performance / watt or performance / dollar
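A toy example with invented numbers of why scale matters: a 64-chip system can post a better raw time-to-train than an 8-chip system while being less efficient once results are normalized per chip (the same idea applies to watts or dollars).

    # Invented numbers: raw time-to-train favors the big system, but
    # scale-normalized (per-chip) performance favors the small one.
    systems = {
        "big_system":   {"chips": 64, "time_to_train_min": 30.0},
        "small_system": {"chips": 8,  "time_to_train_min": 120.0},
    }

    for name, s in systems.items():
        perf = 1.0 / s["time_to_train_min"]      # higher is better
        perf_per_chip = perf / s["chips"]
        print(f"{name}: perf={perf:.4f}, perf/chip={perf_per_chip:.6f}")

    # big_system:   perf=0.0333, perf/chip=0.000521
    # small_system: perf=0.0083, perf/chip=0.001042  <- more efficient per chip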
Present results or summarize?
Do you have a single MLPerf score that summarizes all results?
Pro:
Easy to communicate
Do it correctly and consistently
Con:
Oversimplifies -- systems are optimized for different use cases
Users do not care about all use cases equally
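If a single score were reported, one consistent way to aggregate would be a geometric mean of per-benchmark speedups over the reference system; the sketch below illustrates that design option with invented speedups, not a rule MLPerf adopted for v0.5.

    import math

    def summary_score(speedups):
        # Geometric mean of per-benchmark speedups over the reference system.
        return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

    print(round(summary_score([8.0, 12.0, 5.0, 20.0]), 2))  # ~9.9 (speedups invented)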
Results (mlperf.org/results)
Overview
Background
Major design choices
Benchmarks
Metric
Problems and solutions
Presentation choices
Conclusion
Is ML benchmarking important? Yes.
● “We are glad to see MLPerf grow from just a concept to a major consortium supported by a wide variety of companies and academic institutions. The results released today will set a new precedent for the industry to improve upon to drive advances in AI,” reports Haifeng Wang, Senior Vice President of Baidu who oversees the AI Group.
● “Open standards such as MLPerf and Open Neural Network Exchange (ONNX) are key to driving innovation and collaboration in machine learning across the industry,” said Bill Jia, VP, AI Infrastructure at Facebook. “We look forward to participating in MLPerf with its charter to standardize benchmarks.”
● “MLPerf can help people choose the right ML infrastructure for their applications. As machine learning continues to become more and more central to their business, enterprises are turning to the cloud for the high performance and low cost of training of ML models,” – Urs Hölzle, Senior Vice President of Technical Infrastructure, Google.
● “We believe that an open ecosystem enables AI developers to deliver innovation faster. In addition to existing efforts through ONNX, Microsoft is excited to participate in MLPerf to support an open and standard set of performance benchmarks to drive transparency and innovation in the industry.” – Eric Boyd, CVP of AI Platform, Microsoft
● “MLPerf demonstrates the importance of innovating in scale-up computing as well as at all levels of the computing stack — from hardware architecture to software and optimizations across multiple frameworks.” --Ian Buck, vice president and general manager of Accelerated Computing at NVIDIA
Lots of work remains.
Areas that need improvement:
More, better benchmarks
Reduced variance
Open division utility for academia
Better public datasets
Better reference implementations
We need your help to make MLPerf better. Join us at mlperf.org!