MosaStore - A Versatile Storage System

Lauro Costa, Abdullah Gharaibeh, Samer Al-Kiswany, Matei Ripeanu, Emalayan Vairavanathan (and many others from UBC, ANL, ORNL)

Networked Systems Laboratory (NetSysLab), University of British Columbia



http://netsyslab.ece.ubc.ca


A golf course …

… a (nudist) beach

(… and 199 days of rain each year)


The Landscape

Storage System Middleware

Diverse platform capabilities: supercomputers, desktop grids, cloud computing

Diverse workload characteristics: workflows, checkpointing, data analysis

Challenge: Design an efficient storage system middleware


Motivation: Underprovisioned storage systems on many HPC platforms (e.g., BlueGene/P at ANL)

[Figure: BlueGene/P I/O path. Labels: 160K cores; Hi-Speed Network; 2.5 GBps per node; 2.5K I/O nodes; 850 MBps per 64 nodes; 10 Gb/s switch complex; GPFS, 24 servers; I/O rate: 8 GBps = 51 KBps per core.]

The shared storage is a bottleneck

There are underutilized resources close to the application
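The per-core figure quoted on this slide follows from dividing the aggregate GPFS rate by the core count. A minimal check, assuming binary units (1 GB = 1024 × 1024 KB) and reading "160K cores" as 160 × 1024; the slide does not state its unit convention, so both are assumptions:

```python
# Back-of-the-envelope check of the per-core I/O rate quoted for BlueGene/P.
# Assumptions (not stated on the slide): binary units, and 160K = 160 * 1024.

AGGREGATE_GBPS = 8          # aggregate GPFS I/O rate from the slide, in GB/s
CORES = 160 * 1024          # "160K cores"
KB_PER_GB = 1024 * 1024     # binary units (assumption)


def per_core_kbps(aggregate_gbps: float, cores: int) -> float:
    """Return each core's share of the aggregate I/O rate, in KB/s."""
    return aggregate_gbps * KB_PER_GB / cores


if __name__ == "__main__":
    print(f"{per_core_kbps(AGGREGATE_GBPS, CORES):.1f} KB/s per core")
```

Under these assumptions the result rounds to the 51 KBps per core shown on the slide; with decimal units the value would be about 50 KBps.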


Solution: a temporary shared datastore

[Figure: the BlueGene/P diagram from the Motivation slide, with a "Shared data-store" label added.]

Nodes dedicated to an application

Storage system coupled with the application’s execution


Benefits

[Figure: the BlueGene/P diagram with the "Shared data-store" label, repeated.]

Storage closer to the application. Ability to specialize

Evaluation: Harnessing ‘Close to Application’ Underutilized Resources

Workflow stage (DOCK6)                              Storage optimization                   Result (8K cores)
Read input, compute, and write temporary results    Cache the input data                   1.06x
Summarize, sort, and select                         Cache temporary files                  11.76x
Archive                                             Asynchronously flush results to GPFS   1.51x

Overall: 1.52x

Exploiting the underutilized resources can critically improve the storage system performance

Zhang et al., “Design and Evaluation of a Collective I/O Model for Loosely-coupled Petascale Programming”, MTAGS ’08.
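A large speedup in one stage yields a modest overall gain when that stage is a small share of the original runtime. The sketch below combines per-stage speedups as 1 / Σ(f_i / s_i), where f_i is each stage's share of the original runtime. The stage speedups are taken from the slide; the time fractions are hypothetical, because the slide does not report the per-stage time split, so the output is not expected to reproduce the 1.52x overall figure.

```python
# Overall speedup of a multi-stage workflow from per-stage speedups.
# Stage speedups come from the slide; the time fractions are HYPOTHETICAL.

from typing import Sequence


def overall_speedup(fractions: Sequence[float], speedups: Sequence[float]) -> float:
    """Weight each stage's speedup by its share of the original runtime:
    overall = 1 / sum(f_i / s_i)."""
    if len(fractions) != len(speedups):
        raise ValueError("need one speedup per stage")
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return 1.0 / sum(f / s for f, s in zip(fractions, speedups))


STAGE_SPEEDUPS = [1.06, 11.76, 1.51]      # from the slide (DOCK6, 8K cores)
HYPOTHETICAL_FRACTIONS = [0.5, 0.4, 0.1]  # made up for illustration only

if __name__ == "__main__":
    print(f"{overall_speedup(HYPOTHETICAL_FRACTIONS, STAGE_SPEEDUPS):.2f}x")
```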

Evaluation: Specialization

[Figure: average bandwidth (MB/s, axis from 0 to 600) versus number of clients (16 to 560), comparing Lustre average and stdchk average.]

MosaStore throughput at larger scale (pool of 35 nodes)

Experiment by: Henry Monti (Virginia Tech) on Cray XT4 cluster at ORNL

Deduplication benefits a checkpointing workload

• 3x higher throughput

• 25-70% less storage space and network effort

• Scales to hundreds of clients

Specialization can critically improve the storage system performance

[S. Al-Kiswany, M. Ripeanu, S. Vazhkudai, A. Gharaibeh, “stdchk: A Checkpoint Storage System for Desktop Grid Computing”, ICDCS ’08]
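Successive checkpoint images often share most of their content, which is why deduplication saves space and network traffic for this workload. The slide does not say how similarity is detected, so the following is only a minimal sketch of one common approach: fixed-size blocks compared by content hash. The block size, the SHA-1 choice, and the `DedupStore` class are assumptions for illustration, not the stdchk implementation.

```python
# Fixed-size block deduplication for successive checkpoint images.
# Illustrative only: block size, hash choice, and class names are assumptions.

import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # tiny on purpose so the example is easy to follow


class DedupStore:
    """Stores each distinct block once, keyed by its content hash."""

    def __init__(self) -> None:
        self.blocks: Dict[str, bytes] = {}      # content hash -> block data
        self.files: Dict[str, List[str]] = {}   # file name -> ordered hashes

    def write(self, name: str, data: bytes) -> int:
        """Store a file; return the number of bytes actually added."""
        new_bytes = 0
        recipe: List[str] = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha1(block).hexdigest()
            if digest not in self.blocks:       # only unseen content is stored
                self.blocks[digest] = block
                new_bytes += len(block)
            recipe.append(digest)
        self.files[name] = recipe
        return new_bytes

    def read(self, name: str) -> bytes:
        """Reassemble a file from its block hashes."""
        return b"".join(self.blocks[d] for d in self.files[name])


if __name__ == "__main__":
    store = DedupStore()
    print(store.write("ckpt1", b"AAAABBBBCCCCDDDD"))  # all four blocks are new
    print(store.write("ckpt2", b"AAAABBBBXXXXDDDD"))  # only one block changed
```

In this toy run the second checkpoint adds a single block, since the other three are already stored.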

Summary so far

• MosaStore: a versatile storage architecture that exploits underutilized resources ‘close’ to the application, and supports specialization and configurability.

• The system is configured at deployment time. Deployment lifetime is coupled with that of the target application.

[S. Al-Kiswany, A. Gharaibeh, M. Ripeanu, “The Case for a Versatile Storage System”, HotStorage’09]

MosaStore - Storage System Prototype

Goals: (1) exploration platform, and (2) support for large-scale computational science research projects.

Versatile Storage

Configurable and extensible storage system that can be specialized for a broad set of apps.

[ICDCS ’08, HotStorage ’09]

StoreGPU

How to harness massively multicore processors to support storage system operations?

[HPDC ’08, JoCC ’09, IPCCC ’09, HPDC ’10]

Cross-layer Optimizations

Can one enable cross-layer optimizations?

[HPDC HotTopics ’08, CCGrid ’12, WSLF ’11]

CMFS API

Automating config. choice

How do I choose a good configuration for my application?

[ERSS ’11, GRID ’10]

• Application → Storage System: applications can present hints on the desired use of the data, e.g., desired replication levels, caching, data importance, etc.

• Storage System → Application: storage can expose storage-level attributes, e.g., file location characteristics, file health status.

Today: applications and storage systems treat data items uniformly

Opportunity: additional information can enable differentiated treatment of data items

[Figure labels: POSIX API, Custom Metadata]
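The two information flows above can be sketched with a small in-memory model in which hints and storage-level attributes both travel as per-file key/value metadata. Everything here is hypothetical: the `HintedStore` class, the attribute names `replication` and `location`, and the placement rule are illustrations of the idea, not the MosaStore interface.

```python
# In-memory model of cross-layer hints carried as per-file custom metadata.
# All names ("HintedStore", "replication", "location") are hypothetical.

from typing import Dict, List


class HintedStore:
    def __init__(self, nodes: List[str]) -> None:
        self.nodes = nodes
        self.metadata: Dict[str, Dict[str, str]] = {}   # path -> hints
        self.placement: Dict[str, List[str]] = {}       # path -> storage nodes

    # Application -> storage system: attach a hint to a file.
    def set_hint(self, path: str, key: str, value: str) -> None:
        self.metadata.setdefault(path, {})[key] = value

    def write(self, path: str) -> None:
        """Place the file, honoring a 'replication' hint if one is present."""
        hints = self.metadata.get(path, {})
        replicas = int(hints.get("replication", "1"))
        replicas = max(1, min(replicas, len(self.nodes)))
        self.placement[path] = self.nodes[:replicas]

    # Storage system -> application: expose a storage-level attribute.
    def get_attr(self, path: str, key: str) -> str:
        if key == "location":
            return ",".join(self.placement[path])
        return self.metadata[path][key]


if __name__ == "__main__":
    store = HintedStore(["node1", "node2", "node3"])
    store.set_hint("/in/database", "replication", "3")   # widely read input
    store.write("/in/database")
    store.write("/tmp/partial")                          # no hint: default
    print(store.get_attr("/in/database", "location"))
    print(store.get_attr("/tmp/partial", "location"))
```

A scheduler could read the `location` attribute to run a task on a node that already holds the file, which is the kind of differentiated treatment the slide argues for.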

Our use-case: A workflow aware file system


Workflow Applications

Montage workflow

File based communication

Irregular and application-dependent data access

100,000s of processes, running for weeks

Generate large I/O volumes (100 TB cumulative).

[Chart: runtime breakdown. Execution 29%, data management 30%, scheduling and idle 41%.]

Source: [Zhao et al., 2012]. 512 BG/P cores, GPFS intermediate file system


I/O patterns in Workflow Applications

• Pipeline

• Broadcast

• Reduce

• Scatter

• Gather

Wozniak et al., “Case studies in storage access by loosely coupled petascale applications”, PDSW, 2009

Application: Montage

[Figure: Montage workflow, annotated with I/O patterns. Stage 5: reduce pattern. Stages 6, 7, 8: pipeline pattern. Stage 9: pipeline pattern. Stage 10: reduce pattern.]


I/O Patterns and Storage Optimizations

Pattern     Optimizations
Pipeline    Locality-aware scheduling
Broadcast   Replication
Reduce      Data placement, locality-aware scheduling
Scatter     Block-level placement, locality-aware scheduling
Gather      Block-level co-placement, locality-aware scheduling

Data-item specific patterns and optimizations!

Need for information flows in both directions

Idea: Cross-layer communication to support this
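One way to read the pattern table is as a lookup the storage system performs when a file carries an access-pattern hint. The sketch below encodes the table from the slide as a dictionary; the `optimizations_for` function, the hint strings, and the empty-list fallback for unknown patterns are hypothetical.

```python
# Map a per-file access-pattern hint to the optimizations listed on the slide.
# The function name, hint strings, and fallback behavior are hypothetical.

from typing import Dict, List

OPTIMIZATIONS: Dict[str, List[str]] = {
    "pipeline": ["locality-aware scheduling"],
    "broadcast": ["replication"],
    "reduce": ["data placement", "locality-aware scheduling"],
    "scatter": ["block-level placement", "locality-aware scheduling"],
    "gather": ["block-level co-placement", "locality-aware scheduling"],
}


def optimizations_for(pattern: str) -> List[str]:
    """Return the optimizations for a pattern hint, or [] if it is unknown."""
    return list(OPTIMIZATIONS.get(pattern.strip().lower(), []))


if __name__ == "__main__":
    # Patterns per Montage stage, as annotated in the Montage figure.
    montage_stages = {5: "reduce", 6: "pipeline", 9: "pipeline", 10: "reduce"}
    for stage, pattern in sorted(montage_stages.items()):
        print(stage, pattern, optimizations_for(pattern))
```

Because the hint is attached per file, two files in the same workflow can receive different treatment, which a uniform storage configuration cannot express.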


A workflow-aware file system

Thesis: cross-layer communication supported by file-level metadata is the key mechanism to enable a workflow-aware file system

Progress so far: promising evaluation of potential gains (CCGrid ’12)

Next step: build the system and evaluate it with applications (SC ’12?)

MosaStore - Storage System Prototype

Goals: (1) exploration platform, and (2) support for large-scale computational science research projects.

Versatile Storage

Configurable and extensible storage system that can be specialized for a broad set of apps.

[ICDCS ’08, HotStorage ’09]

StoreGPU

Harnessing massively multicore processors to support storage system operations.

[HPDC ’08, JoCC ’09, IPCCC ’09, HPDC ’10]

Cross-layer Optimizations

Enable bidirectional cross-layer optimizations.

[HPDC HotTopics ’08, CCGrid ’12, WSLF ’11]

CMFS API

Automating config. choice

How do I choose a good configuration for my application?

[ERSS ’11, GRID ’10]

Thank you