UNIFY DATA AT MEMORY SPEED
Haoyuan (HY) Li, CEO @ Alluxio Inc.
VAULT Conference 2017, March 2017
HISTORY
• Started at UC Berkeley AMPLab in summer 2012
• Originally named Tachyon
• Rebranded to Alluxio in early 2016
• Open sourced in 2013
• Apache License 2.0
• Latest stable release: Alluxio 1.4.0
• Alluxio 1.5.0 planned for Q2 2017
© 2017 Alluxio Confidential
BIG DATA ECOSYSTEM YESTERDAY
BIG DATA ECOSYSTEM TODAY
BIG DATA ECOSYSTEM ISSUES
BIG DATA ECOSYSTEM WITH ALLUXIO
Enabling Applications to Access Data from Any Storage System at Memory Speed
Application-facing interfaces:
• FUSE Compatible File System
• Hadoop Compatible File System
• Native Key-Value Interface
• Native File System
Storage-facing interfaces:
• GlusterFS Interface
• Amazon S3 Interface
• Swift Interface
• HDFS Interface
FASTEST-GROWING BIG DATA PROJECT
• Formerly named Tachyon, born in the AMPLab
• 500+ contributors from 100+ organizations
• Running world’s largest production clusters
WHY ALLUXIO
Co-located compute and data with memory-speed access to data
Virtualized across different storage systems under a unified namespace
Scale-out architecture
File system API, software only
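The "unified namespace" idea can be illustrated with a toy mount table. This is a minimal sketch, not Alluxio's implementation or API: the class `UnifiedNamespace`, its methods, and the example URIs are all hypothetical names chosen for illustration. It only shows how one logical path tree could be resolved to different storage systems by longest-prefix match.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UnifiedNamespace:
    """Toy mount table (hypothetical): logical path prefix -> storage URI."""
    mounts: Dict[str, str] = field(default_factory=dict)

    def mount(self, prefix: str, storage_uri: str) -> None:
        # Normalize so "/s3/" and "/s3" refer to the same mount point.
        self.mounts[prefix.rstrip("/") or "/"] = storage_uri.rstrip("/")

    def resolve(self, path: str) -> str:
        # Longest prefix first, so nested mounts win over their parents.
        for prefix in sorted(self.mounts, key=len, reverse=True):
            if prefix == "/":
                return self.mounts[prefix] + path
            if path == prefix or path.startswith(prefix + "/"):
                return self.mounts[prefix] + path[len(prefix):]
        raise KeyError(f"no mount covers {path}")


# Example URIs below are made up for illustration.
ns = UnifiedNamespace()
ns.mount("/", "hdfs://namenode:9000/root")
ns.mount("/s3", "s3a://example-bucket/data")

print(ns.resolve("/s3/logs/a.txt"))
print(ns.resolve("/tmp/x"))
```

An application written against the single logical tree does not need to know which storage system backs a given path; only the mount table does.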
ALLUXIO BENEFITS
• Unification: New workflows across any data in any storage system
• Performance: Orders of magnitude improvement in run time
• Flexibility: Choice in compute and storage – grow each independently, buy only what is needed
ALLUXIO DEPLOYMENTS
ALLUXIO USE CASES
On-Demand Analytics & Accelerating I/O to and from remote storage
Managing data across disparate storage systems
Sharing data across workloads at memory speed
MANAGE DATA ACROSS STORAGE SYSTEMS
“We’ve been running in production for over 9 months. Alluxio’s enabled different applications & frameworks to easily interact with data from different storage systems.”
RESULTS
• Efficient data sharing among Spark Streaming, Spark batch and Flink jobs
• Improved the performance of their system with 15x – 300x speedups
• Tiered storage feature manages storage resources including memory, SSD and disk
Qunar uses real-time machine learning for their website ads
• 200+ nodes deployment
• 6 billion logs (4.5 TB) daily
• Mix of Memory + HDD
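To put the daily volume in perspective, the two figures on this slide (6 billion logs, 4.5 TB) can be turned into per-log and per-second averages. This is back-of-the-envelope arithmetic only; it assumes decimal terabytes (1 TB = 10^12 bytes) and a load spread evenly over 24 hours, neither of which the slide states.

```python
LOGS_PER_DAY = 6_000_000_000   # from the slide
BYTES_PER_DAY = 4.5e12         # 4.5 TB, assuming decimal TB (assumption)
SECONDS_PER_DAY = 24 * 60 * 60

avg_log_bytes = BYTES_PER_DAY / LOGS_PER_DAY            # 750.0 bytes per log
logs_per_second = LOGS_PER_DAY / SECONDS_PER_DAY        # roughly 69,000 logs/s
mb_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / 1e6   # roughly 52 MB/s

print(f"average log size: {avg_log_bytes:.0f} bytes")
print(f"average ingest:   {logs_per_second:,.0f} logs/s, {mb_per_second:.1f} MB/s")
```

These are averages; real traffic for a website is likely to peak well above them.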
ON-DEMAND ANALYTICS & ACCELERATE I/O TO/FROM REMOTE STORAGE
“The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds.”
RESULTS
• Data queries are now 30x faster with Alluxio
• Alluxio cluster runs stably, providing over 50TB of RAM space
• By using Alluxio, batch queries that usually lasted over 15 minutes became interactive queries taking less than 30 seconds
PMs run interactive queries to gain insights into their products & business
• 200+ nodes deployment
• 2+ petabytes of storage
• Mix of memory + HDD
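The speedup figures on this slide can be checked with simple ratios. The sketch below uses only the numbers quoted above; the helper `speedup` is a hypothetical name. The Spark SQL quote (100-150 s down to 10-15 s) works out to roughly 10x, while the batch-query figure (over 15 minutes down to under 30 seconds) works out to at least 30x. The slide does not say which measurement the "30x faster" bullet refers to.

```python
def speedup(before_seconds: float, after_seconds: float) -> float:
    """Ratio of old run time to new run time (hypothetical helper)."""
    return before_seconds / after_seconds


# Spark SQL quote: 100-150 s before, 10-15 s after.
quote_low = speedup(100, 15)    # slowest "after" vs fastest "before"
quote_mid = speedup(100, 10)    # same as speedup(150, 15)
quote_high = speedup(150, 10)   # fastest "after" vs slowest "before"

# Batch queries: over 15 minutes before, under 30 seconds after,
# so this ratio is a lower bound.
batch_min = speedup(15 * 60, 30)

print(f"quote: {quote_low:.1f}x to {quote_high:.1f}x (about {quote_mid:.0f}x)")
print(f"batch: at least {batch_min:.0f}x")
```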
Diagram labels: ALLUXIO, Baidu File System
SHARE DATA ACROSS JOBS @ MEMORY SPEED
“Thanks to Alluxio, we now have the raw data immediately available at every iteration & can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity.”
RESULTS
• Barclays workflow iteration time decreased from hours to seconds
• Alluxio enabled workflows that were impossible before
• With data kept only in memory, the I/O cost of loading and storing in Alluxio is now on the order of seconds
Barclays uses query & machine learning to train models for risk management
• 6 node deployment
• 1TB of storage
• Memory only
Diagram labels: ALLUXIO, Relational Database: Teradata