Worldwide File Replication on Grid Datafarm
Osamu Tatebe and Satoshi Sekiguchi
Grid Technology Research Center, National Institute of Advanced Industrial Science and Technology (AIST)
APAN 2003 Conference, Fukuoka, January 2003
ATLAS/Grid Datafarm project: CERN LHC Experiment
(Figure: the LHC, with a perimeter of 26.7 km; detectors for the ALICE and LHCb experiments; the ATLAS detector, 40 m x 20 m and 7000 tons, shown with a truck for scale.)
~2000 physicists from 35 countries
Collaboration between KEK, AIST, Titech, and ICEPP, U Tokyo
Petascale Data-intensive Computing Requirements
- Peta/Exabyte-scale files
- Scalable parallel I/O throughput: > 100 GB/s, hopefully > 1 TB/s, within a system and between systems
- Scalable computational power: > 1 TFLOPS, hopefully > 10 TFLOPS
- Efficient global sharing with group-oriented authentication and access control
- Resource management and scheduling
- System monitoring and administration
- Fault tolerance / dynamic re-configuration
- Global computing environment
Grid Datafarm: Cluster-of-cluster Filesystem with Data Parallel Support
- Cluster-of-cluster filesystem on the Grid
- File replicas among clusters for fault tolerance and load balancing
- Extension of a striping cluster filesystem: arbitrary file block length; filesystem node = compute node + I/O node, each with large, fast local disks; parallel I/O, parallel file transfer, and more
- Extreme I/O bandwidth, > TB/s: exploit data access locality with file affinity scheduling and the local file view
- Fault tolerance: file recovery; write-once files can be re-generated using a command history and re-computation

[1] O. Tatebe, et al., "Grid Datafarm Architecture for Petascale Data Intensive Computing," Proc. of CCGrid 2002, Berlin, May 2002. Available at http://datafarm.apgrid.org/
Distributed disks across the clusters form a single Gfarm file system
- Each cluster generates the corresponding part of the data
- The data are replicated for fault tolerance and load balancing (bandwidth challenge!)
- The analysis process is executed on the node that has the data
(Figure: the Gfarm file system spans nodes in Baltimore, Tsukuba, Indiana, San Diego, and Tokyo.)
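For illustration of the replication step: Gfarm creates file replicas with the gfrep command. The option syntax below is an assumption (flags differed across Gfarm versions); the file name simply reuses the gfarm:input example from the next slide.

% gfrep -N 2 gfarm:input     (assumed syntax: create 2 replicas of each fragment)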
Extreme I/O bandwidth support example: gfgrep - parallel grep
% gfrun -G gfarm:input gfgrep -o gfarm:output regexp gfarm:input
(Figure: gfarm:input consists of fragments input.1 through input.5, stored on Host1.ch, Host2.ch, and Host3.ch at CERN.CH and on Host4.jp and Host5.jp at KEK.JP; gfmd is the metadata server. A gfgrep process runs on each node that holds a fragment and writes the corresponding output.1 through output.5 locally, forming gfarm:output.)

Each gfgrep process executes:
open("gfarm:input", &f1)
create("gfarm:output", &f2)
set_view_local(f1)
set_view_local(f2)
grep regexp
close(f1); close(f2)
File affinity scheduling (a runnable C sketch of the per-node step follows below)
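To make the pattern concrete, here is a small, self-contained C sketch of what one worker does conceptually: it greps its local fragment into a local output fragment using POSIX regex. The fragment paths are placeholders; a real worker would go through the Gfarm I/O library (the open/create/set_view_local calls above) rather than plain local paths.

/* Conceptual stand-in for one gfgrep worker. With the local file
 * view, each worker only touches the fragment stored on its own
 * node, so the whole parallel job reduces to one local grep per
 * node. Paths like "input.3"/"output.3" stand in for the local
 * fragments that set_view_local() would expose. */
#include <stdio.h>
#include <regex.h>

int grep_fragment(const char *regexp, const char *in_path,
                  const char *out_path)
{
    regex_t re;
    char line[4096];
    FILE *in, *out;

    if (regcomp(&re, regexp, REG_EXTENDED | REG_NOSUB) != 0)
        return 1;
    in = fopen(in_path, "r");
    out = fopen(out_path, "w");
    if (in == NULL || out == NULL)
        return 1;

    while (fgets(line, sizeof(line), in) != NULL)
        if (regexec(&re, line, 0, NULL, 0) == 0)  /* line matches */
            fputs(line, out);

    fclose(in);
    fclose(out);
    regfree(&re);
    return 0;
}

int main(int argc, char **argv)
{
    /* e.g. on Host3.ch: ./gfgrep_worker regexp input.3 output.3 */
    if (argc != 4)
        return 2;
    return grep_fragment(argv[1], argv[2], argv[3]);
}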
Design of AIST Gfarm Cluster I
- Cluster node (high density and high performance): 1U, dual 2.4 GHz Xeon, GbE; 480 GB RAID with four 3.5" 120 GB HDDs on a 3ware RAID controller; 136 MB/s on writes, 125 MB/s on reads
- 12-node experimental cluster (operational since Oct 2002): 12U + GbE switch (2U) + KVM switch (2U) + keyboard + LCD; 6 TB RAID in total with 48 disks; 1063 MB/s on writes, 1437 MB/s on reads; 410 MB/s for file replication with 6 streams
- WAN emulation with NIST Net
(Figure: each node provides 480 GB of disk at 120 MB/s and 10 GFlops; nodes are connected by a GbE switch.)
Grid Datafarm US-OZ-Japan Testbed
(Figure: testbed network map. Japanese sites KEK, Titech, AIST, and ICEPP connect over SuperSINET (1 Gbps), Tsukuba WAN (20 Mbps/GbE links), and the Tokyo NOC; APAN/TransPAC OC-12 POS crosses the Pacific via PNWG (OC-12) and StarLight (OC-12 ATM) to the US, where Indiana Univ. connects through the Indianapolis GigaPoP and SDSC through ESnet via the NII-ESnet HEP PVC.)
Sites: KEK, Titech, AIST, ICEPP, SDSC, Indiana U, Melbourne (Australia)
Total disk capacity: 18 TB; disk I/O bandwidth: 6 GB/s
Grid Datafarm for a HEP application
Osamu Tatebe (Grid Technology Research Center, AIST)
Satoshi Sekiguchi (AIST), Youhei Morita (KEK), Satoshi Matsuoka (Titech & NII), Kento Aida (Titech), Donald F. (Rick) McMullen (Indiana), Philip Papadopoulos (SDSC)
SC2002 High-Performance Bandwidth Challenge
Target Application at SC2002: FADS/Goofy
- Monte Carlo simulation framework with Geant4 (C++)
- FADS/Goofy: Framework for ATLAS/Autonomous Detector Simulation / Geant4-based Object-oriented Folly
- http://atlas.web.cern.ch/Atlas/GROUPS/SOFTWARE/OO/domains/simulation/
- Modular I/O package selection: Objectivity/DB and/or ROOT I/O on top of the Gfarm filesystem, with good scalability
- CPU-intensive event simulation with high-speed file replication and/or distribution
Network and cluster configuration for SC2002 Bandwidth Challenge
(Figure: the same testbed map, with the Grid Cluster Federation Booth at SC2002 in Baltimore connected to SCinet at 10 GE through a Force10 E1200; the TransPAC southern route via StarLight OC-12 ATM is shaped to 271 Mbps.)
Sites: KEK, Titech, AIST, ICEPP, SDSC, Indiana U, SC2002
Total bandwidth from/to the SC2002 booth: 2.137 Gbps
Total disk capacity: 18 TB; disk I/O bandwidth: 6 GB/s; peak CPU performance: 962 GFlops
SC2002 Bandwidth Challenge: 2.286 Gbps using 12 nodes!
(Figure: challenge map. Parallel file replication runs between the SC2002 booth in Baltimore and SDSC (San Diego), Indiana Univ., and the Japanese sites AIST, Titech, U Tokyo, and KEK (via Tsukuba WAN, 20 Mbps, and MAFFIN in Tokyo), crossing the US backbone through Seattle and Chicago over links ranging from the 622 Mbps and 271 Mbps trans-Pacific routes to 1 Gbps and 10 Gbps domestic segments.)
Multiple trans-Pacific routes in a single application: 741 Mbps, a record speed.
Network and cluster configuration
- SC2002 booth: 12-node AIST Gfarm cluster connected with GbE; connects to SCinet at 10 GE through a Force10 E1200. Performance in the LAN: network bandwidth 930 Mbps; file transfer bandwidth 75 MB/s (= 629 Mbps)
- GTRC, AIST: a 7-node AIST Gfarm cluster of the same design connects to Tokyo XP with GbE via Tsukuba WAN and Maffin
- Indiana Univ: 15-node PC cluster connected with Fast Ethernet; connects to the Indianapolis GigaPoP with OC-12
- SDSC: 8-node PC cluster connected with GbE; connects to the outside with OC-12
- TransPAC northern and southern routes: the northern route is the default; the southern route is selected by static routing for 3 nodes each at the SC booth and at AIST (a hypothetical example follows below). RTT between AIST and the SC booth: north 199 ms, south 222 ms. The southern route is shaped to 271 Mbps
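As a hypothetical illustration of that static routing, each of the three south-route nodes could carry a per-host route toward its peer; both addresses below are placeholders:

% route add -host <peer-node-address> gw <south-route-gateway>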
Network configuration at the SC2002 booth
(Figure: 12 PC nodes connected by GbE to a Force10 E1200, which uplinks to SCinet at 10 GE and onward over OC-192.)
Challenging points of TCP-based file transfer
- Large latency, high bandwidth (aka an LFN): a big socket buffer is needed for a large congestion window; fast window-size recovery after packet loss (High Speed TCP, an Internet draft by Sally Floyd); network striping
- Packet loss due to real congestion: transfer control
- Poor disk I/O performance: the 3ware RAID with four 3.5" HDDs on each node delivers over 115 MB/s (roughly 1 Gbps of network bandwidth); network striping vs. disk striping access; number of streams, stripe size
- Limited number of nodes: we need to achieve the maximum file transfer performance from each node
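As a minimal sketch of the first point: the socket buffer can be enlarged before connecting, which lets the congestion window grow enough to fill a long fat network. The 610 KB value mirrors the northern-route parameter listed later; everything else is the standard sockets API.

/* Request large send/receive buffers so the TCP congestion
 * window can grow enough to fill a long fat network. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int make_wide_socket(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int buf = 610 * 1024;            /* bytes; northern-route value */

    if (s < 0)
        return -1;
    /* Set before connect()/listen() so window scaling is
     * negotiated with room for the larger window. */
    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &buf, sizeof(buf)) < 0 ||
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, &buf, sizeof(buf)) < 0)
        perror("setsockopt");
    return s;
}

Opening several such sockets per node pair (network striping) then limits the damage of a loss on any single stream, since only that stream's window shrinks.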
Bandwidth measurement result of TransPAC by Iperf
- Northern route: 2 node pairs; southern route: 3 node pairs (100 Mbps streams)
- Total of 753 Mbps (10-sec average); peak available bandwidth: 622 + 271 = 893 Mbps
(Charts: 10-sec and 5-min average bandwidth for the northern and southern routes.)
Bandwidth measurement result between the SC booth and other sites by Iperf (1-min average)
(Chart: series for Indiana Univ, SDSC, TransPAC north, and TransPAC south; the early gap is where 10 GE was not yet available, and the southern-route variation reflects evaluation of different route configurations.)
- The TransPAC northern route shows very high deviation due to a packet loss problem on Abilene between Denver and Kansas City
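For reference, one node pair of such an Iperf measurement, with an enlarged window, parallel streams, and 10-second reporting, could be launched as below; the host name is a placeholder, and the window and stream counts anticipate the transfer parameters listed later:

% iperf -c <remote-host> -w 610K -P 16 -t 60 -i 10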
File replication between US and Japan
Using 4 nodes each in the US and Japan, we achieved 741 Mbps for file transfer! (out of 893 Mbps; 10-sec average bandwidth)
(Chart: 10-sec average bandwidth of the transfer.)
Parameters of US-Japan file transfer

  parameter                    Northern route   Southern route
  socket buffer size           610 KB           250 KB
  traffic control per stream   50 Mbps          28.5 Mbps
  # streams per node pair      16 streams       8 streams
  # nodes                      3 hosts          1 host
  stripe unit size             128 KB           128 KB
  # node pairs   # streams           10-sec average BW   Transfer time (sec)   Average BW
  1 (N1)         16 (N16x1)          -                   113.0                 152 Mbps
  2 (N2)         32 (N16x2)          419 Mbps            115.9                 297 Mbps
  3 (N3)         48 (N16x3)          593 Mbps            139.6                 369 Mbps
  4 (N3 S1)      56 (N16x3 + S8x1)   741 Mbps            150.0                 458 Mbps
File replication performance between the SC booth and other US sites
(Chart: replication bandwidth [MB/s, 0 to 50] between one SC node and US sites versus the number of remote nodes [1 to 5]; series: sc->indiana, indiana->sc, sc->sdsc, sdsc->sc.)
SC02 Bandwidth Challenge Result
We achieved 2.286 Gbps using 12 nodes! (outgoing 1.691 Gbps, incoming 0.595 Gbps)
(Charts: 10-sec, 1-sec, and 0.1-sec average bandwidth.)
Summary
- Petascale data-intensive computing wave; key technology: Grid and cluster
- Grid Datafarm is an architecture for: online > 10 PB storage with > TB/s I/O bandwidth; efficient sharing on the Grid; fault tolerance
- Initial performance evaluation shows scalable performance:
  - 1742 MB/s and 1974 MB/s on writes and reads on 64 cluster nodes of Presto III
  - 443 MB/s using 23 parallel streams on Presto III
  - 1063 MB/s and 1436 MB/s on writes and reads on 12 cluster nodes of AIST Gfarm I
  - 410 MB/s using 6 parallel streams on AIST Gfarm I
  - Metaserver overhead is negligible
- Gfarm file replication achieved 2.286 Gbps at the SC2002 bandwidth challenge, and 741 Mbps out of 893 Mbps between the US and Japan!
- A smart resource broker is needed!!
[email protected]
http://datafarm.apgrid.org/
Special thanks to
Rick McMullen, John Hicks (Indiana Univ, PRAGMA)
Philip Papadopoulos (SDSC, PRAGMA)
Hisashi Eguchi (Maffin)
Kazunori Konishi, Yoshinori Kitatsuji, Ayumu Kubota (APAN)
Chris Robb (Indiana Univ, Abilene)
Force10 Networks, Inc.