Cloud Deep Learning Platform: Architecture and Practice
Chen Dihao (陈迪豪) / Cui Jianwei (崔建伟)
About Us
Cui Jianwei (崔建伟): Architect, Xiaomi Deep Learning Platform
Chen Dihao (陈迪豪): Architect, 4Paradigm Prophet (先知) Platform
Agenda
❖ Define Cloud Machine Learning
❖ Re-define Cloud Machine Learning
❖ Cloud-ML at 4Paradigm
❖ Cloud-ML at Xiaomi
Define Cloud Machine Learning
! What is Machine Learning
MLP, CNN, RNN/LSTM, RL
Define Cloud Machine Learning
! What is Cloud Machine Learning
! Google Cloud: Google Cloud Machine Learning Engine
  Training: TensorFlow; Prediction: TensorFlow
! Amazon Web Services: Amazon Machine Learning
  Training: EC2 (MXNet); Prediction: SaaS
! Microsoft Azure Cloud: Azure Machine Learning Studio
  Training: Studio (CNTK); Prediction: SaaS
Define Cloud Machine Learning
! Why Cloud Machine Learning
! Training on a local machine means:
! No resource isolation
! No resource sharing
! No cluster orchestration
! No auto-scaling
! No automatic failover
Example: pip install tensorflow
Define Cloud Machine Learning
! Architecture of Cloud Machine Learning
! Cloud Platform Layer: Kubernetes / OpenStack / …
! Machine Learning Layer: TensorFlow / MXNet / …
! Application Layer: Training / Prediction / …
Define Cloud Machine Learning
! Architecture of Cloud Machine Learning
Model development, training jobs, online serving
Define Cloud Machine Learning
! Architecture of Google-like Cloud Machine Learning
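[Diagram: the client sends RESTful requests to the API Service, which creates containers on the K8S cluster (TensorFlow, MXNet, TF Serving). Request labels in the diagram: Online Req, Offline Req.]
! Submit train job: create train container
! Submit prediction job: create predict container
! Create model service: create model container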
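The pairing between client requests and container actions in this diagram can be written as a small dispatch table. This is only a sketch: the dictionary keys, the function name `action_for`, and the string identifiers are hypothetical and are not taken from any real API.

```python
# Hypothetical sketch: map each client request named on the slide to the
# container action the API service performs on the Kubernetes cluster.
# All identifiers here are illustrative, not a real API.
REQUEST_ACTIONS = {
    "submit_train_job": "create_train_container",
    "submit_prediction_job": "create_predict_container",
    "create_model_service": "create_model_container",
}


def action_for(request_type):
    """Return the container action for a request type, or raise ValueError."""
    try:
        return REQUEST_ACTIONS[request_type]
    except KeyError:
        raise ValueError(f"unsupported request type: {request_type!r}") from None


if __name__ == "__main__":
    for request_type in REQUEST_ACTIONS:
        print(request_type, "->", action_for(request_type))
```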
Define Cloud Machine Learning
! Architecture of Google-like Cloud Machine Learning
Step 1: Build docker image
Step 2: Implement API service
Step 3: Submit to Kubernetes
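As one possible illustration of Step 3, the API service could turn a training request into a Kubernetes `batch/v1` Job manifest. The sketch below only builds the manifest as a Python dictionary; the function name, the job name, the image name, and the use of the `nvidia.com/gpu` resource key are assumptions for the example, not details from the talk.

```python
import json


def build_train_job(name, image, command, gpus=0):
    """Build a Kubernetes batch/v1 Job manifest for a single training run."""
    container = {"name": name, "image": image, "command": list(command)}
    if gpus:
        # Assumption: GPUs are exposed through the NVIDIA device plugin.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "containers": [container],
                    "restartPolicy": "Never",
                }
            },
        },
    }


# Hypothetical job name and image, for illustration only.
manifest = build_train_job(
    "mnist-train",
    "registry.example.com/tf-train:latest",
    ["python", "train.py"],
    gpus=1,
)

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
```

A manifest like this could be written to a file and submitted with `kubectl apply -f`, or posted by the API service through a Kubernetes client library.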
Re-define Cloud Machine Learning
! TensorFlow vs Hadoop
! TensorFlow vs Spark
! TensorFlow vs Hive
! TensorFlow vs PowerGraph
! TensorFlow vs Azure ML Studio
! TensorFlow vs H2O / Dataiku / 数加
Re-define Cloud Machine Learning
! We need all of these!
! HDFS: for large data storage
! Hive: for data preprocessing
! Spark: for feature extraction
! Hadoop: for task scheduling
! TensorFlow: for model training
! Kubernetes: for CPU/GPU management
“Super-machine-learning-man”
Re-define Cloud Machine Learning
! We want all of these!
! A closed loop from data preprocessing to online services
! Feature extraction without writing code
! An easy way to define the machine learning process
! Flexible and heterogeneous infrastructure
! Automatic failover and scaling
! Easy for domain experts to use
Cloud-ML at 4Paradigm
! Prophet (先知) platform
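! Simplified data ingestion: supports RDBMS and HDFS data sources
! Simplified data splitting: supports splitting by ratio and splitting by rule
! Simplified feature extraction: supports combinations of continuous and discrete features
! Simplified model training: supports an in-house ultra-high-dimensional LR and algorithms from open-source frameworks
! Simplified model evaluation: supports evaluation metrics such as ROC, Logloss, and K-S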
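For readers unfamiliar with two of the evaluation metrics named above, here is a minimal pure-Python sketch of Logloss and the K-S statistic for binary classification. It is a textbook formulation written for illustration, not the Prophet platform's implementation; the function names and sample data are assumptions.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def ks_statistic(labels, probs):
    """Largest gap between cumulative positive and negative rates over all thresholds.

    Assumes both classes are present in labels.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(probs, labels), reverse=True)
    true_pos = false_pos = 0
    best = 0.0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Consume all samples that share this score before measuring the gap.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        best = max(best, abs(true_pos / positives - false_pos / negatives))
    return best


if __name__ == "__main__":
    y_true = [1, 1, 0, 0]
    y_prob = [0.9, 0.8, 0.2, 0.1]
    print("logloss:", log_loss(y_true, y_prob))
    print("K-S:", ks_statistic(y_true, y_prob))
```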
Cloud-ML at 4Paradigm
! Prophet (先知) platform
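! Recommendations for the No. 1 news app in one country: click-through rate improved by 34%
! Audio recommendations for a top-3 app in the knowledge-sharing field: listen-through rate improved by 43%
! Host recommendations for a top-3 live-show streaming app: viewing time improved by 21%
! Content recommendations for the largest domestic UGC community: click-through rate improved by 93%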
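[Figure: recommendations driven by editors' expert-experience rules compared with machine-learning model (personalized) recommendation; outcomes in each case are labeled "users like it" and "users are indifferent"]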
Cloud-ML at 4Paradigm
http://prophet.4paradigm.com
Cloud-ML Architecture
! Xiaomi Ecosystem Cloud (小米生态云) / Xiaomi Converged Cloud (小米融合云)
! Cloud-ML
  ! PaaS: Dev, Training, Serving
  ! SaaS: Vision, NLP, ASR
! FDS (Xiaomi File Storage Service)
! Docker + Kubernetes
Cloud-ML Main Features
Cloud-ML Usage
! Deployed on 4 clusters
! Used by 150+ developers
! 20+ internal Xiaomi businesses onboarded
! 5 Xiaomi ecosystem-chain companies onboarded
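Cloud-ML Practice: PaaS Improvements
! Dev environment
! Provides model development capability
! Implementation:
  ! Provide images for mainstream compute frameworks, with ssh support
  ! Run the compute framework as a Kubernetes service
! Problems:
  ! Pods may be rescheduled
  ! Data persistence
  ! Exposing ports externally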
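Cloud-ML Practice: PaaS Improvements
! Dev environment data persistence
! FUSE
  ! Supports the main POSIX interfaces
! FDS supports FUSE
  ! A user's bucket can be mounted locally
! Kubernetes supports FUSE
  ! Mount /dev/fuse when starting the Dev Pod
! Cloud-ML supports FUSE
  ! Mount the FDS bucket when creating a Dev environment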
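One way to picture "mount /dev/fuse when starting the Dev Pod" is a Pod manifest that exposes the host's FUSE device to the container through a `hostPath` volume. The sketch below builds such a manifest as a Python dictionary. Everything beyond the `/dev/fuse` path is an assumption made for illustration: the function name, the pod and image names, the `FDS_BUCKET` and `FDS_MOUNT_PATH` environment variables, and the choice of a privileged container (FUSE mounts generally need elevated privileges, but the talk does not say how Cloud-ML grants them).

```python
def build_dev_pod(name, image, bucket, mount_path="/fds"):
    """Build a Pod manifest whose container can see the host's /dev/fuse device."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "dev",
                    "image": image,
                    # Hypothetical variables an entrypoint script could read
                    # to mount the user's FDS bucket at startup.
                    "env": [
                        {"name": "FDS_BUCKET", "value": bucket},
                        {"name": "FDS_MOUNT_PATH", "value": mount_path},
                    ],
                    # Assumption: run privileged so the FUSE mount is permitted.
                    "securityContext": {"privileged": True},
                    "volumeMounts": [
                        {"name": "fuse-device", "mountPath": "/dev/fuse"}
                    ],
                }
            ],
            "volumes": [
                {"name": "fuse-device", "hostPath": {"path": "/dev/fuse"}}
            ],
        },
    }


# Hypothetical names, for illustration only.
dev_pod = build_dev_pod("dev-alice", "registry.example.com/tf-dev:latest", "alice-bucket")
```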
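Cloud-ML Practice: PaaS Improvements
! Dev environment port exposure
! Requirement: open ports in the Dev environment that can be accessed from outside
! Solution:
  ! HAProxy performs the forwarding
  ! Forwarding nodes are started as a service
  ! Firewall rules are configured for EIP:hostport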
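[Diagram: public access reaches HAProxy on the forwarding node and is forwarded to the Dev environment on the compute node; Docker Proxy and Kube-proxy also appear in the diagram]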
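To make the HAProxy forwarding idea concrete, the sketch below renders a TCP `listen` section that forwards one public host port to one Dev environment endpoint. The helper name, section name, port numbers, and backend address are hypothetical; only the general `listen` / `bind` / `mode tcp` / `server` layout follows standard HAProxy configuration syntax.

```python
def haproxy_listen_block(name, host_port, backend_ip, backend_port):
    """Render an HAProxy 'listen' section forwarding host_port to one backend."""
    return "\n".join(
        [
            f"listen {name}",
            f"    bind *:{host_port}",
            "    mode tcp",
            f"    server {name}-backend {backend_ip}:{backend_port} check",
        ]
    )


if __name__ == "__main__":
    # Hypothetical values: forward public port 30022 to ssh in a Dev environment.
    print(haproxy_listen_block("dev-alice-ssh", 30022, "10.0.0.5", 22))
```

The firewall rule for the EIP would then need to allow only the chosen host port.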
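Cloud-ML Practice: PaaS Improvements
! Serving service discovery
! Current state:
  ! NodePort is configured
  ! The control node forwards to the service
! Problems:
  ! The control node is a single point of failure
  ! Identifying services by port is unfriendly to business users
! Solution: name service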
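[Diagram: a collector gathers service-to-pods mappings (service1: pods, service2: pods, …) from Kubernetes into the name service; the client sends a service name to the name service, receives the pods, and sends its requests to those pods]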
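The name service idea can be sketched as an in-memory registry: a collector pushes the current pods for each service, and a client resolves a service name to pod addresses instead of remembering a port. The class, method names, and addresses below are hypothetical, and the round-robin choice in `pick` is an assumption added for the example rather than something stated in the talk.

```python
class NameService:
    """Minimal in-memory mapping from service names to pod addresses."""

    def __init__(self):
        self._pods = {}
        self._cursor = {}

    def update(self, service, pods):
        """Called by the collector with the latest pod addresses of a service."""
        self._pods[service] = list(pods)
        self._cursor[service] = 0

    def resolve(self, service):
        """Return all known pod addresses for a service name."""
        pods = self._pods.get(service)
        if not pods:
            raise LookupError(f"no pods registered for service {service!r}")
        return list(pods)

    def pick(self, service):
        """Return one pod address, cycling through the pods in order."""
        pods = self.resolve(service)
        index = self._cursor[service] % len(pods)
        self._cursor[service] = index + 1
        return pods[index]


if __name__ == "__main__":
    ns = NameService()
    # Hypothetical data a collector might read from Kubernetes.
    ns.update("face-detect", ["10.0.1.4:8500", "10.0.2.7:8500"])
    print(ns.pick("face-detect"), ns.pick("face-detect"), ns.pick("face-detect"))
```

Because clients talk to pods directly after resolving a name, no single control node sits on the request path.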
Cloud-ML Practice: SaaS Services
! Image recognition
! Natural language processing
! Speech recognition
Cloud-ML Practice: SaaS Services
! Usage scenario
[Diagram: images / speech / text; smart devices, App, Server, Cloud-ML SaaS, FDS (Xiaomi File Storage Service)]
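Cloud-ML Practice: SaaS Services
! Image recognition
! Face detection: face position, gender, age
! Object recognition: 1500+ object categories (including scenes such as living room and bedroom)
FaceInfo: topX: 208 topY: 73 width: 403 height: 403 child female age: 5.6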
Cloud-ML Practice: SaaS Services
! Image recognition
Object          Confidence
living room     0.52
dining room     0.14
hall            0.08
waiting room    0.06
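A caller of such an object recognition service would typically keep only the most confident labels. The sketch below does that for the confidence values shown above; the function name, the list-of-tuples format, and the threshold parameter are assumptions for illustration and do not describe the actual Cloud-ML response format.

```python
# Confidence values copied from the example above; the data layout is assumed.
predictions = [
    ("living room", 0.52),
    ("dining room", 0.14),
    ("hall", 0.08),
    ("waiting room", 0.06),
]


def top_labels(predictions, k=1, min_confidence=0.0):
    """Return up to k labels, highest confidence first, above a threshold."""
    ranked = sorted(predictions, key=lambda item: item[1], reverse=True)
    return [label for label, confidence in ranked[:k] if confidence >= min_confidence]


if __name__ == "__main__":
    print(top_labels(predictions, k=2))
    print(top_labels(predictions, k=3, min_confidence=0.5))
```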
Cloud-ML Practice: SaaS Services
! Image recognition
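Cloud-ML Practice
! Future work
! PaaS
  ! Support more training frameworks
    ! Kaldi, CNTK
  ! Make Dev environment state savable
  ! Resource overselling
  ! Seamless integration with the data processing pipeline
! SaaS
  ! Launch more model services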