48
Memory-Speed Virtual Distributed Storage: Latest Features and Use Cases August 2016 @ Strata+Hadoop World Beijing Bin Fan, Yupeng Fu 1

Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Embed Size (px)

Citation preview

Page 1: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

M e m o r y - S p e e d V i r t u a l D i s t r i b u t e d S t o r a g e : L a t e s t F e a t u r e s a n d U s e C a s e s

August 2016 @ Strata+Hadoop World Beijing

Bin Fan, Yupeng Fu

1

Page 2: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

About Us

•  Bin Fan

•  Software Engineer @ Alluxio, Inc.

•  Alluxio PMC

•  Worked at Google, MSR

•  CMU, CUHK, USTC

•  Wechat @ apc999_fb

•  Yupeng Fu

•  Software Engineer @ Alluxio, Inc.

•  Alluxio PMC

•  Worked at Palantir, Google

•  UCSD, Tsinghua

•  Wechat @richbird9

2

Page 3: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

•  Team consists of Alluxio creators and top committers

•  Invested by

•  Committed to Alluxio Open Source

•  http://www.alluxio.com

Alluxio Inc.

3

Page 4: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Outline

•  What is Alluxio •  Example use cases

1.  Off-heap memory to alleviate resource pressure 2.  Fast data sharing between jobs 3.  Accelerate access to remote storage 4.  Unified namespace

•  Summary

4

Page 5: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Non-persistent data-storage

software.

ATTRIBUTES

Memory-Speed Virtual Distributed Storage

Scale out architecture.

Virtualized across different storage types under a unified namespace.

Memory-speed access to data.

5

Page 6: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

OPEN SOURCE ALLUXIO: History

• Started in the AMPLab at UC Berkeley –  Summer of 2012

• Open Sourced as Tachyon (renamed to Alluxio) –  Apache 2.0 License

–  Latest Release: version 1.2.0 (July, 2016)

6

Page 7: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

OPEN SOURCE ALLUXIO

v0.2 v0.3 v0.4 v0.5

v0.6 v0.7

v0.8

v1.0

v1.1

7

Page 8: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

OPEN SOURCE ALLUXIO

The fastest growing open-source project in the big data ecosystem. Currently over 300 contributors from over 100 organizations.

8

Page 9: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Outline

•  What is Alluxio •  Example use cases

1.  Off-heap memory to alleviate resource pressure 2.  Fast data sharing between jobs 3.  Accelerate access to remote storage 4.  Unified namespace

•  Summary

9

Page 10: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Application 1

Alluxio

CASE 1: Off-heap Memory Storage to Alleviate Resource Pressure

10

Page 11: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

HDFS / Amazon S3

Consolidating Memory

Spark Job1

Spark Memory

block 1

block 3

Spark Job2

Spark Memory

block 3

block 1

block 1

block 3

block 2

block 4

storage engine & execution engine same process

Data duplicated at memory-level

11

Page 12: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Spark Job1

Spark mem

Spark Job2

Spark mem

HDFS / Amazon S3 block 1

block 3

block 2

block 4

HDFS disk

block 1

block 3

block 2

block 4 Alluxio

In-Memory

block 1

block 3 block 4

Data not duplicated at memory-level

Consolidating Memory

storage engine & execution engine same process

12

Page 13: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Data Resilience during Crashes

Spark Task

Spark Memory block manager

block 1

block 3

HDFS / Amazon S3 block 1

block 3

block 2

block 4

Process crash requires network and/or disk I/O to re-read the data

storage engine & execution engine same process

13

Page 14: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Data Resilience during Crashes

Crash

Spark Memory block manager

block 1

block 3

HDFS / Amazon S3 block 1

block 3

block 2

block 4

Process crash requires network and/or disk I/O to re-read the data

storage engine & execution engine same process

14

Page 15: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

HDFS / Amazon S3

Data Resilience during Crashes

block 1

block 3

block 2

block 4

Crash

storage engine & execution engine same process

Process crash requires network and/or disk I/O to re-read the data

15

Page 16: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Spark Task

Spark Memory block manager

HDFS disk

block 1

block 3

block 2

block 4 Alluxio

In-Memory

block 1

block 3 block 4

Process crash only needs memory I/O to re-read the data

Data Resilience during Crashes

storage engine & execution engine same process

16

Page 17: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Crash

Process crash only needs memory I/O to re-read the data

HDFS disk

block 1

block 3

block 2

block 4 Alluxio

In-Memory

block 1

block 3 block 4

Data Resilience during Crashes

storage engine & execution engine same process

17

Page 18: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

CASE STUDY

Thanks to Alluxio, we now have the raw data immediately available at every iteration and we can skip the costs of loading in terms of time waiting, network traffic, and RDBMS activity. - Henry Powell, Barclays

RESULTS

•  Barclays workflow iteration time decreased from hours to seconds

•  Alluxio enabled workflows that were

impossible before •  By keeping data only in memory, the I/O cost of

loading and storing in Alluxio is now on the order of seconds

Relational Database

Share Data Across Jobs at Memory Speed

•  Reducing time from hours to

seconds

•  Solving issues of Teradata as

under storage

18

Page 19: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Barclays Previous Architecture

Spark

Teradata

Slow ETL, 30+ minutes

Each context restart requires ETL process

again

h"ps://dzone.com/ar1cles/Accelerate-­‐In-­‐Memory-­‐Processing-­‐with-­‐Spark-­‐from-­‐Hours-­‐to-­‐Seconds-­‐With-­‐Tachyon  

19

Page 20: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Barclays Architecture with Alluxio

Spark

Teradata

Alluxio

ETL only once Data still available in Alluxio memory after Job crash

20

Page 21: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

CASE 2: Fast Data sharing between apps

Application 1

Alluxio

Application2

HDFS

21

Page 22: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Spark Job

Spark Memory

block 1

block 3

Hadoop MR Job

YARN

HDFS / Amazon S3 block 1

block 3

block 2

block 4

storage engine & execution engine same process

Data Sharing Between Frameworks

Inter-process sharing slowed down by network and/or disk I/O

22

Page 23: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Data Sharing Between Frameworks

Spark Job

Spark Memory

Hadoop MR Job

YARN

HDFS / Amazon S3 block 1

block 3

block 2

block 4

HDFS disk

block 1

block 3

block 2

block 4 Alluxio

In-Memory

block 1

block 3 block 4

storage engine & execution engine same process

Inter-process sharing can happen at memory speed

23

Page 24: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

CASE STUDY

We’ve been running Alluxio in production for over 9 months, resulting in 15x speedup on average, and 300x speedup at peak service times. - Xueyan Li, Qunar

RESULTS

•  Multiple computing frameworks can share data

at memory speed with Alluxio •  Improved the performance of their system with

15x – 300x speedups •  Alluxio provides various user-friendly APIs,

which simplifies the transition to Alluxio.

Sharing Data Across Different Compute Frameworks

•  6 billion logs (4.5 TB) daily

•  Mix of Memory + HDD

24

Page 25: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Qunar Previous Architecture

Flink Streaming Spark Spark

Streaming Flink

HDFS Cluster

Some jobs cannot finish

Slow access to the storage

25

Page 26: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Qunar Architecture with Alluxio

Flink Streaming

HDFS Cluster

Spark Spark Streaming Flink

Alluxio

Share in-memory data across frameworks

26

Page 27: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

CASE 3: Accelerate access to remote storage

Application

Alluxio

Remote Storage

27

Page 28: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

•  Different compute and storage hardware requirements

•  Scale compute and storage resources independently

•  Cost effective object stores (Amazon S3, Google GCS, Microsoft Azure

Blob Store) are inherently decoupled

Decoupling Compute from Storage

Accessing data requires remote I/O!

28

Page 29: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Remote I/O problems

Application

Remote Storage

every data operation requires data transfer, sometimes over

the WAN

high latency, network throughput

29

Page 30: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Visualizing the Stack with Alluxio

FAST 104 - 105 MB/s

Application

Remote Storage

MODERATE 103 - 104 MB/s

SLOW 102 - 103 MB/s

Alluxio MEM Often

Only When Necessary

Alluxio SSD/HDD

Limited

30

Page 31: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

What if memory capacity is still not enough?

31

Page 32: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Alluxio Manages All Local Storage

MEM

SSD

HDD

Faster

Higher Capacity

32

Page 33: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Configurable Storage Tiers

MEM only

MEM + HDD

SSD only

33

Page 34: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Pluggable Tier Management Policies

Evict stale data to slower tier

Promote hot data to faster tier

34

Page 35: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

CASE STUDY

Baidu File System

The performance was amazing. With Spark SQL alone, it took 100-150 seconds to finish a query; using Alluxio, where data may hit local or remote Alluxio nodes, it took 10-15 seconds. - Shaoshan Liu, Baidu

RESULTS

•  Data queries are now 30x faster with Alluxio •  1000-worker Alluxio cluster run stably,

providing over 50TB of RAM space •  By using Alluxio, batch queries usually lasting

over 15 minutes were transformed into an interactive query taking less than 30 seconds

Accelerate Access to Remote Storage

•  1000+ nodes deployment

•  2+ petabytes of storage

•  Mix of memory + HDD

•  1000-worker cluster

35

Page 36: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

CASE 4: Unified Name Space

36  

Application

Alluxio

AWS S3 HDFS

Page 37: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Common Scenarios

•  At large organizations, data spans many storage systems

•  Legacy storage systems

•  Application logic needs to integrate with different storage systems

37

Page 38: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Unified Name Space An abstraction for applications to access different storage systems through the same interface •  Benefits –  Application written once with Allluxio API –  Transparently add new storage systems –  Seamlessly integrate with existing systems

38

Page 39: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Transparent Naming

Operations over persisted Alluxio objects mapped

transparently to underlying storage

alluxio://Data/Sales <-> hdfs://Data/Sales

Alluxio Storage System (HDFS, S3, …)

alluxio://host:port/

Data Users

Reports Sales Alice Bob

hdfs://host:port/

Data Users

Reports Sales Alice Bob

39

mount

Page 40: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Multiple Storage Systems

•  Unified namespace for multiple data sources

alluxio://Data/Sales -> s3://bucket/Sales

alluxio://Users/Alice -> hdfs://Users/Alice

•  API for on-the-fly mounting / unmounting

Alluxio Storage System A

alluxio://host:port/

Data Users

Alice Bob

hdfs://host:port/

Users

Alice Bob

Storage System B

s3://host/bucket

Reports Sales Reports Sales

40

mount

mount

Page 41: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

A Proliferation of Under Storage Systems

MORE…

41

Page 42: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

CASE STUDY

RESULTS

•  Alluxio’s unified namespace enables different applications and frameworks to easily interact with their data from different storage systems

•  Global data centers across North America and

Asia •  Tiered storage feature manages various

storage resources including memory, SSD and disk

Transparently Manage Data Across Different Storage Systems

•  Storage medium: MEM

•  Clusters in different regions

An Internet Company

Machine Learning Pipelines

42 42

Page 43: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Outline

•  What is Alluxio •  Example use cases

1.  Off-heap memory to alleviate resource pressure 2.  Fast data sharing between jobs 3.  Accelerate access to remote storage 4.  Unified namespace

•  Summary

43

Page 44: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Takeaway: When to use Alluxio

•  Two or more jobs access the same dataset •  Job(s) may not always succeed •  Spark with memory/GC pressure •  Jobs/Apps are pipelined •  Resulting data does not need to be immediately

persisted •  Interact with remote data •  Data stored across different storage systems

44

Page 45: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Try out Alluxio 1.2.0 http://www.alluxio.org/releases

45

Page 46: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

First China Meetups!

•  Nanjing (Saturday, 7/30)

•  Shanghai (Sunday, 7/31)

•  Beijing (Sunday, 8/7)

46

Page 47: Alluxio Use Cases at Strata+Hadoop World Beijing 2016

Resources •  Alluxio Project: http://www.alluxio.org

•  Development: https://github.com/Alluxio/alluxio

•  Meet Friends: http://www.meetup.com/Alluxio

•  Alluxio, Inc.: http://www.alluxio.com

•  Training: http://www.alluxio.com/alluxio-training/

•  Contact us: [email protected]

•  Alluxio Wechat: Alluxio 47

Page 48: Alluxio Use Cases at Strata+Hadoop World Beijing 2016