mlconf
A REAL REQUEST FOR AI
▸ How to control TV sets via voice
▸ AI Hub
▸ No. An Alexa in each room?
▸ AI API
▸ No. Business owners don’t want user behavior data to go to AI tech providers.
▸ AI on Cloud
▸ No. GPU instances are too expensive.
▸ AI on on-premise clusters
▸ Yes.
Unisound, a PaddlePaddle collaborator, embedded its speech recognition technology in air conditioners, TV sets, and Android-based mirrors in cars.
CLOUD AND ON-PREMISE CLUSTERS
                   Internet       traditional
  big companies    on-premises    on-premises
  small companies  cloud          on-premises
THE SOLUTION - GENERAL PURPOSE CLUSTERS
▸ heterogeneous hardware: GPU servers, multi-GPU servers, CPU servers, …
▸ Kubernetes: a distributed operating system
[Architecture diagram: Internet clients (Web browsers, mobile apps, IoT devices) reach an nginx front end and the speech API server; fluentd collects server logs into Kafka; an online data process and an offline data process write to Hadoop HDFS; labeled data feeds the Paddle/Spark speech model trainer, which produces the model served by the speech API.]
CHALLENGES - GENERAL PURPOSE CLUSTERS
▸ group replicas of processes into jobs
▸ Web services, data processing pipelines, machine learning jobs.
▸ service isolation and multi-user support
▸ online experiments require the real log data stream, so
▸ we run production jobs and experimental jobs on the same cluster.
▸ priority-based scheduling
▸ a high-priority (production) job can preempt low-priority (experiment) jobs.
▸ make full use of hardware
▸ e.g., schedule processes of a Hadoop job that needs network and disk bandwidth together with processes of a deep learning job that needs GPUs on the same node.
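Priority-based preemption of this kind can be expressed directly in Kubernetes. A minimal sketch using the `PriorityClass` API (the class names, pod name, and image are illustrative, not from the talk):

```yaml
# Two priority classes: production jobs can preempt experiment jobs.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: production
value: 1000000
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: experiment
value: 1000
---
# An experimental training pod that a production pod may preempt.
apiVersion: v1
kind: Pod
metadata:
  name: speech-trainer-experiment   # illustrative name
spec:
  priorityClassName: experiment
  containers:
  - name: trainer
    image: paddlepaddle/paddle      # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1
```

When the scheduler cannot place a `production` pod, it may evict `experiment` pods from a node to make room, which is exactly the production-preempts-experiment behavior described above.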
CHALLENGES - FAULT-TOLERANT JOBS
▸ auto-scaling
▸ there are often many active users during the day, so the cluster kills processes of deep learning jobs and creates more Web service processes.
▸ at night, it kills some Web service processes to run more deep learning processes.
▸ fault-recovery
▸ a job must tolerate a varying number of processes.
▸ speedup vs. fault-recovery
▸ speedup optimizes a job.
▸ speedup with fault-tolerance optimizes the business.
A PADDLEPADDLE JOB
[Diagram: a master dispatches tasks to trainer 1, trainer 2, and trainer 3; each trainer holds local model shards 1/2 and 2/2 and exchanges gradients/model with parameter server 1 (global model shard 1/2) and parameter server 2 (global model shard 2/2).]
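The gradient/model exchange in the diagram can be sketched as a toy parameter-server loop. This is an illustration in NumPy, not PaddlePaddle's actual API; all class and method names here are made up:

```python
import numpy as np

class ParameterServer:
    """Holds one shard of the global model and applies pushed gradients."""
    def __init__(self, shard, lr=0.1):
        self.shard = shard          # global model shard (a vector)
        self.lr = lr

    def push(self, grad):
        # A trainer pushes a gradient; the server updates its shard.
        self.shard -= self.lr * grad

    def pull(self):
        # A trainer pulls the latest shard to refresh its local copy.
        return self.shard.copy()

class Trainer:
    """Keeps a local copy of every global model shard."""
    def __init__(self, servers):
        self.servers = servers
        self.local = [ps.pull() for ps in servers]  # local model shards

    def step(self, grads):
        # One iteration: push one gradient per shard, then pull updates.
        for ps, g in zip(self.servers, grads):
            ps.push(g)
        self.local = [ps.pull() for ps in self.servers]

# Two parameter servers, each holding half of the model; three trainers.
servers = [ParameterServer(np.zeros(4)) for _ in range(2)]
trainers = [Trainer(servers) for _ in range(3)]
for t in trainers:
    t.step([np.ones(4), np.ones(4)])   # dummy gradients
# Each shard has now absorbed one gradient from each of the 3 trainers.
```

Real trainers would compute gradients from minibatches and run concurrently; the point of the sharding is that no single machine has to hold or update the whole model.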
AUTO FAULT-RECOVERY
[Diagram: the masters of job A and job B each keep per-job task queues — todo, pending, done — persisted in etcd. Task state transitions: created → todo; dispatched → pending; completed → done; on timeout, a pending task returns to todo.]
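The task lifecycle in the diagram (created → todo, dispatched → pending, completed → done, timed-out tasks back to todo) can be sketched in plain Python. A real master would persist these queues in etcd so that a restarted master recovers them; the names below are illustrative:

```python
import time

class TaskQueue:
    """Master-side task queues: todo -> pending -> done.
    A pending task that times out moves back to todo, so a task
    lost with a crashed worker is eventually re-dispatched."""
    def __init__(self, tasks, timeout=60.0):
        self.todo = list(tasks)      # created tasks start in todo
        self.pending = {}            # task -> dispatch timestamp
        self.done = []
        self.timeout = timeout

    def dispatch(self, now=None):
        # Move one task from todo to pending and hand it to a worker.
        now = time.time() if now is None else now
        self._requeue_timed_out(now)
        if not self.todo:
            return None
        task = self.todo.pop(0)
        self.pending[task] = now
        return task

    def complete(self, task):
        # Worker reports success: pending -> done.
        if task in self.pending:
            del self.pending[task]
            self.done.append(task)

    def _requeue_timed_out(self, now):
        # Timeout: pending -> todo, so another worker can retry it.
        for task, t0 in list(self.pending.items()):
            if now - t0 > self.timeout:
                del self.pending[task]
                self.todo.append(task)

q = TaskQueue(["task 1", "task 2", "task 3"], timeout=60)
t1 = q.dispatch(now=0.0)     # "task 1" -> pending
q.complete(t1)               # "task 1" -> done
t2 = q.dispatch(now=0.0)     # "task 2" -> pending
t3 = q.dispatch(now=100.0)   # "task 2" timed out, requeued; "task 3" dispatched
```

Because every transition is a small atomic change to the queues, storing them in a consistent key-value store like etcd lets a new master process pick up exactly where a failed one left off.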
KEEP OPEN
▸ Thanks to the Kubernetes community for their expertise in distributed computing and their code reviews.
▸ We hope to see more traditional industries run their whole business on their on-premise clusters.
▸ PaddlePaddle will stay open.
▸ We are working on open-sourcing more AI technologies based on PaddlePaddle.