PADDLEPADDLE: FAULT-TOLERANT DEEP LEARNING

Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017


A REAL REQUEST FOR AI

▸ How to control TV sets via voice

▸ AI Hub

▸ No. An Alexa in each room?

▸ AI API

▸ No. Business owners don’t want user behavior data to go to AI tech providers.

▸ AI on Cloud

▸ No. GPU instances are too expensive.

▸ AI on on-premise clusters

▸ Yes.

Unisound, a PaddlePaddle collaborator, embedded its speech recognition technology in air conditioners, TV sets, and Android-based mirrors in cars.

CLOUD AND ON-PREMISE CLUSTERS

                   Internet        traditional
big companies      on-premises     on-premises
small companies    cloud           on-premises

THE SOLUTION - GENERAL PURPOSE CLUSTERS

[Architecture diagram: a general-purpose cluster of GPU servers, multi-GPU servers, CPU servers, and so on, with Kubernetes as a distributed operating system on top. The same cluster runs Paddle and Spark for the speech model trainer and the speech API server, nginx and fluentd for serving and log collection, a Kafka log stream feeding an online data process, an offline data process, and Hadoop HDFS holding labeled data and models. Internet clients: Web browsers, mobile apps, IoT devices.]

CHALLENGES - GENERAL PURPOSE CLUSTERS

▸ group replicas of processes into jobs

▸ Web services, data processing pipelines, machine learning jobs.

▸ service isolation and multi-user support

▸ online experiments require the real log data stream, so

▸ we run production jobs and experimental jobs on the same cluster.

▸ priority-based scheduling

▸ a high-priority (production) job can preempt low-priority (experiment) jobs.

▸ make full use of hardware

▸ e.g., schedule processes of a Hadoop job that need network and disk bandwidth and processes of a deep learning job that need GPUs on the same node (sketched after this list).
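To make the last two points concrete, here is a toy scheduler in plain Python. It is not the Kubernetes scheduler, and every name in it (Proc, Node, schedule) is made up for illustration: a high-priority production process that does not fit anywhere evicts the lowest-priority experimental processes on some node, and the per-resource fit check is what lets a bandwidth-hungry Hadoop process and a GPU-hungry deep learning process pack onto the same node.

```python
# Toy scheduler illustrating priority-based preemption and packing by resource
# type. This is NOT the Kubernetes scheduler; all names here are made up.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Proc:
    name: str
    priority: int        # production jobs > experimental jobs
    gpus: int = 0        # GPUs requested
    bandwidth: int = 0   # disk/network bandwidth requested (arbitrary units)


@dataclass
class Node:
    name: str
    gpus: int
    bandwidth: int
    running: List[Proc] = field(default_factory=list)

    def fits(self, p):
        free_gpus = self.gpus - sum(r.gpus for r in self.running)
        free_bw = self.bandwidth - sum(r.bandwidth for r in self.running)
        return free_gpus >= p.gpus and free_bw >= p.bandwidth


def schedule(nodes, p):
    """Place p on some node, preempting lower-priority processes if needed."""
    # 1) Normal placement: a bandwidth-heavy Hadoop process and a GPU-heavy
    #    deep learning process naturally pack onto the same node.
    for n in nodes:
        if n.fits(p):
            n.running.append(p)
            return n, []
    # 2) Preemption: evict the lowest-priority victims until p fits.
    for n in nodes:
        victims = sorted((r for r in n.running if r.priority < p.priority),
                         key=lambda r: r.priority)
        evicted = []
        for v in victims:
            if n.fits(p):
                break
            n.running.remove(v)
            evicted.append(v)
        if n.fits(p):
            n.running.append(p)
            return n, evicted
        n.running.extend(evicted)   # not enough low-priority work; roll back
    return None, []                 # p stays pending
```

In Kubernetes, these two ideas correspond to pod priorities with preemption and to per-resource requests and limits.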

CHALLENGES - FAULT-TOLERANT JOBS

▸ auto-scaling

▸ there are many active users during the day, so the cluster kills deep learning processes and creates more Web service processes.

▸ at night, it kills some Web service processes to run more deep learning processes.

▸ fault-recovery

▸ a job must tolerate a varying number of processes (see the sketch after this list).

▸ speedup vs. fault-recovery

▸ speedup optimizes a job.

▸ speedup with fault-tolerance optimizes the business.
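From a trainer's point of view, "tolerating a varying number of processes" means that a trainer keeps no state that has to survive a restart. The sketch below is a hedged outline rather than PaddlePaddle's real API; fetch_task, report_done, read_minibatches, push_gradients, and pull_parameters are hypothetical placeholders.

```python
# Hedged sketch of a fault-tolerant trainer. The master/model methods used here
# (fetch_task, report_done, read_minibatches, push_gradients, pull_parameters)
# are hypothetical placeholders, not PaddlePaddle's real API.

def trainer_loop(master, model):
    # A trainer keeps no state that must survive a restart: if it is killed by
    # auto-scaling or a node failure, its pending task times out on the master
    # and is re-dispatched to another trainer.
    while True:
        task = master.fetch_task()        # a small slice of the training data
        if task is None:                  # todo queue is empty: this pass is done
            break
        for minibatch in task.read_minibatches():
            grads = model.backward(minibatch)
            model.push_gradients(grads)   # send gradients to the parameter servers
            model.pull_parameters()       # refresh the local model shards
        master.report_done(task)
```

Because all bookkeeping lives in the master and parameter servers, adding trainers only speeds the job up and removing them only slows it down.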

A PADDLEPADDLE JOB

[Diagram: a PaddlePaddle job consists of a master, three trainers, and two parameter servers. Parameter server 1 holds global model shard 1/2 and parameter server 2 holds global model shard 2/2; each of trainer 1, 2, and 3 keeps local model shards 1/2 and 2/2. Trainers exchange gradients/model with the parameter servers and receive tasks from the master.]
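The gradients/model arrows in the diagram can be sketched with a few lines of numpy. This is an illustration under simplifying assumptions (plain SGD, a flat parameter vector), not PaddlePaddle's implementation: the global model is split into one shard per parameter server, and a trainer pushes the matching slice of its gradient to each server, then pulls the updated slices back to rebuild its local model.

```python
# Minimal numpy sketch of sharded parameter servers (an illustration, not
# PaddlePaddle's implementation).

import numpy as np


class ParameterServer:
    """Holds one shard of the global model and applies SGD updates to it."""

    def __init__(self, shard, lr=0.01):
        self.shard = shard
        self.lr = lr

    def push_gradient(self, grad_shard):
        # gradients/model arrow, trainer -> parameter server
        self.shard -= self.lr * grad_shard

    def pull(self):
        # gradients/model arrow, parameter server -> trainer
        return self.shard.copy()


def split(vector, num_shards):
    return np.array_split(vector, num_shards)


# A global model of 10 parameters, sharded across 2 parameter servers.
pservers = [ParameterServer(s) for s in split(np.zeros(10), 2)]

# One step on one trainer: rebuild the local model from the shards, compute a
# (stand-in) gradient, push slice i to parameter server i, pull fresh shards.
local_model = np.concatenate([ps.pull() for ps in pservers])
gradient = np.random.randn(local_model.size)   # placeholder for backprop
for ps, grad_shard in zip(pservers, split(gradient, 2)):
    ps.push_gradient(grad_shard)
local_model = np.concatenate([ps.pull() for ps in pservers])
```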

AUTO FAULT-RECOVERY

[Diagram: jobs A and B each have their own master, and both masters keep their task queues in etcd. Each master maintains todo, pending, and done queues for its tasks (tasks 1-4 in job A, tasks 1-3 in job B). A task is created into the todo queue, moved to pending when dispatched to a trainer, and moved to done when completed; a pending task that times out goes back to todo.]
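The task lifecycle above (created into todo, dispatched into pending, completed into done, timeout back to todo) fits in a short sketch. The class below keeps the queues in memory for brevity; in the design shown in the diagram the master keeps this state in etcd so that a restarted master can recover it. The names and the 600-second timeout are illustrative, not PaddlePaddle's actual code.

```python
# In-memory sketch of the master's task queues and the state transitions shown
# above. Illustrative only; the real master persists this state (e.g., in etcd).

import time
from collections import deque


class Master:
    def __init__(self, tasks, timeout=600):
        self.todo = deque(tasks)   # "created" tasks wait here
        self.pending = {}          # task -> time it was dispatched
        self.done = []
        self.timeout = timeout

    def dispatch(self):
        """Hand one task to a trainer; requeue tasks whose trainer seems dead."""
        now = time.time()
        for task, started in list(self.pending.items()):
            if now - started > self.timeout:   # timeout: pending -> todo
                del self.pending[task]
                self.todo.append(task)
        if not self.todo:
            return None
        task = self.todo.popleft()             # dispatched: todo -> pending
        self.pending[task] = now
        return task

    def complete(self, task):
        """A trainer reports success: completed, pending -> done."""
        self.pending.pop(task, None)
        self.done.append(task)
```

A trainer asks the master for work via something like dispatch() and reports back via complete(); a trainer that dies simply never reports back, and the timeout path returns its task to the todo queue.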

KEEP OPEN

▸ Thanks to the Kubernetes community for their expertise in distributed computing and their code reviews.

▸ We hope to see more traditional industries run their whole business on their on-premise clusters.

▸ PaddlePaddle will stay open.

▸ We are working on open-sourcing more AI technologies based on PaddlePaddle.