Worldwide File Replication on Grid Datafarm
Osamu Tatebe and Satoshi Sekiguchi
Grid Technology Research Center, National Institute of Advanced Industrial Science and Technology (AIST)
APAN 2003 Conference, Fukuoka, January 2003
ATLAS/Grid Datafarm project: CERN LHC Experiment
(Figure: the LHC, with a perimeter of 26.7 km; detectors for the ALICE and LHCb experiments; the ATLAS detector, 40 m x 20 m and 7000 tons, shown with a truck for scale.)
~2000 physicists from 35 countries
Collaboration between KEK, AIST, Titech, and ICEPP, U Tokyo
Petascale Data-intensive Computing Requirements
- Peta/Exabyte-scale files
- Scalable parallel I/O throughput: > 100 GB/s, hopefully > 1 TB/s, within a system and between systems
- Scalable computational power: > 1 TFLOPS, hopefully > 10 TFLOPS
- Efficient global sharing with group-oriented authentication and access control
- Resource management and scheduling
- System monitoring and administration
- Fault tolerance / dynamic re-configuration
- Global computing environment
Grid Datafarm: Cluster-of-cluster Filesystem with Data Parallel Support
- Cluster-of-cluster filesystem on the Grid
- File replicas among clusters for fault tolerance and load balancing
- Extension of a striping cluster filesystem: arbitrary file block length; filesystem node = compute node + I/O node, each with large, fast local disks; parallel I/O, parallel file transfer, and more
- Extreme I/O bandwidth, > TB/s: exploit data access locality with file affinity scheduling and the local file view
- Fault tolerance: file recovery; write-once files can be re-generated using a command history and re-computation

[1] O. Tatebe, et al., "Grid Datafarm Architecture for Petascale Data Intensive Computing," Proc. of CCGrid 2002, Berlin, May 2002. Available at http://datafarm.apgrid.org/
Distributed disks across the clusters form a single Gfarm file system
- Each cluster generates the corresponding part of the data
- The data are replicated for fault tolerance and load balancing (bandwidth challenge!)
- The analysis process is executed on the node that has the data
(Figure: the Gfarm file system spans nodes in Baltimore, Tsukuba, Indiana, San Diego, and Tokyo.)
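For illustration of the replication step: Gfarm creates file replicas with the gfrep command. The option syntax below is an assumption (flags differed across Gfarm versions); the file name simply reuses the gfarm:input example from the next slide.

% gfrep -N 2 gfarm:input     (assumed syntax: create 2 replicas of each fragment)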
Extreme I/O bandwidth support example: gfgrep - parallel grep
% gfrun -G gfarm:input gfgrep -o gfarm:output regexp gfarm:input
(Figure: gfarm:input consists of fragments input.1 through input.5, stored on Host1.ch, Host2.ch, and Host3.ch at CERN.CH and on Host4.jp and Host5.jp at KEK.JP; gfmd is the metadata server. A gfgrep process runs on each node that holds a fragment and writes the corresponding output.1 through output.5 locally, forming gfarm:output.)

Each gfgrep process executes:
open("gfarm:input", &f1)
create("gfarm:output", &f2)
set_view_local(f1)
set_view_local(f2)
grep regexp
close(f1); close(f2)
File affinity scheduling (a runnable C sketch of the per-node step follows below)
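To make the pattern concrete, here is a small, self-contained C sketch of what one worker does conceptually: it greps its local fragment into a local output fragment using POSIX regex. The fragment paths are placeholders; a real worker would go through the Gfarm I/O library (the open/create/set_view_local calls above) rather than plain local paths.

/* Conceptual stand-in for one gfgrep worker. With the local file
 * view, each worker only touches the fragment stored on its own
 * node, so the whole parallel job reduces to one local grep per
 * node. Paths like "input.3"/"output.3" stand in for the local
 * fragments that set_view_local() would expose. */
#include <stdio.h>
#include <regex.h>

int grep_fragment(const char *regexp, const char *in_path,
                  const char *out_path)
{
    regex_t re;
    char line[4096];
    FILE *in, *out;

    if (regcomp(&re, regexp, REG_EXTENDED | REG_NOSUB) != 0)
        return 1;
    in = fopen(in_path, "r");
    out = fopen(out_path, "w");
    if (in == NULL || out == NULL)
        return 1;

    while (fgets(line, sizeof(line), in) != NULL)
        if (regexec(&re, line, 0, NULL, 0) == 0)  /* line matches */
            fputs(line, out);

    fclose(in);
    fclose(out);
    regfree(&re);
    return 0;
}

int main(int argc, char **argv)
{
    /* e.g. on Host3.ch: ./gfgrep_worker regexp input.3 output.3 */
    if (argc != 4)
        return 2;
    return grep_fragment(argv[1], argv[2], argv[3]);
}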
Design of AIST Gfarm Cluster I
- Cluster node (high density and high performance): 1U, dual 2.4 GHz Xeon, GbE; 480 GB RAID with four 3.5" 120 GB HDDs on a 3ware RAID controller; 136 MB/s on writes, 125 MB/s on reads
- 12-node experimental cluster (operational since Oct 2002): 12U + GbE switch (2U) + KVM switch (2U) + keyboard + LCD; 6 TB RAID in total with 48 disks; 1063 MB/s on writes, 1437 MB/s on reads; 410 MB/s for file replication with 6 streams
- WAN emulation with NIST Net
(Figure: each node provides 480 GB of disk at 120 MB/s and 10 GFlops; nodes are connected by a GbE switch.)
Grid Datafarm US-OZ-Japan Testbed
(Figure: testbed network map. Japanese sites KEK, Titech, AIST, and ICEPP connect over SuperSINET (1 Gbps), Tsukuba WAN (20 Mbps/GbE links), and the Tokyo NOC; APAN/TransPAC OC-12 POS crosses the Pacific via PNWG (OC-12) and StarLight (OC-12 ATM) to the US, where Indiana Univ. connects through the Indianapolis GigaPoP and SDSC through ESnet via the NII-ESnet HEP PVC.)
Sites: KEK, Titech, AIST, ICEPP, SDSC, Indiana U, Melbourne (Australia)
Total disk capacity: 18 TB; disk I/O bandwidth: 6 GB/s
Grid Datafarm for a HEP application
Osamu Tatebe (Grid Technology Research Center, AIST)
Satoshi Sekiguchi (AIST), Youhei Morita (KEK), Satoshi Matsuoka (Titech & NII), Kento Aida (Titech), Donald F. (Rick) McMullen (Indiana), Philip Papadopoulos (SDSC)
SC2002 High-Performance Bandwidth Challenge
Target Application at SC2002: FADS/Goofy
- Monte Carlo simulation framework with Geant4 (C++)
- FADS/Goofy: Framework for ATLAS/Autonomous Detector Simulation / Geant4-based Object-oriented Folly
- http://atlas.web.cern.ch/Atlas/GROUPS/SOFTWARE/OO/domains/simulation/
- Modular I/O package selection: Objectivity/DB and/or ROOT I/O on top of the Gfarm filesystem, with good scalability
- CPU-intensive event simulation with high-speed file replication and/or distribution
Network and cluster configuration for SC2002 Bandwidth Challenge
(Figure: the same testbed map, with the Grid Cluster Federation Booth at SC2002 in Baltimore connected to SCinet at 10 GE through a Force10 E1200; the TransPAC southern route via StarLight OC-12 ATM is shaped to 271 Mbps.)
Sites: KEK, Titech, AIST, ICEPP, SDSC, Indiana U, SC2002
Total bandwidth from/to the SC2002 booth: 2.137 Gbps
Total disk capacity: 18 TB; disk I/O bandwidth: 6 GB/s; peak CPU performance: 962 GFlops
SC2002 Bandwidth Challenge: 2.286 Gbps using 12 nodes!
(Figure: challenge map. Parallel file replication runs between the SC2002 booth in Baltimore and SDSC (San Diego), Indiana Univ., and the Japanese sites AIST, Titech, U Tokyo, and KEK (via Tsukuba WAN, 20 Mbps, and MAFFIN in Tokyo), crossing the US backbone through Seattle and Chicago over links ranging from the 622 Mbps and 271 Mbps trans-Pacific routes to 1 Gbps and 10 Gbps domestic segments.)
Multiple trans-Pacific routes in a single application: 741 Mbps, a record speed.
Network and cluster configuration
- SC2002 booth: 12-node AIST Gfarm cluster connected with GbE; connects to SCinet at 10 GE through a Force10 E1200. Performance in the LAN: network bandwidth 930 Mbps; file transfer bandwidth 75 MB/s (= 629 Mbps)
- GTRC, AIST: a 7-node AIST Gfarm cluster of the same design connects to Tokyo XP with GbE via Tsukuba WAN and Maffin
- Indiana Univ: 15-node PC cluster connected with Fast Ethernet; connects to the Indianapolis GigaPoP with OC-12
- SDSC: 8-node PC cluster connected with GbE; connects to the outside with OC-12
- TransPAC northern and southern routes: the northern route is the default; the southern route is selected by static routing for 3 nodes each at the SC booth and at AIST (a hypothetical example follows below). RTT between AIST and the SC booth: north 199 ms, south 222 ms. The southern route is shaped to 271 Mbps
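As a hypothetical illustration of that static routing, each of the three south-route nodes could carry a per-host route toward its peer; both addresses below are placeholders:

% route add -host <peer-node-address> gw <south-route-gateway>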
Network configuration at the SC2002 booth
(Figure: 12 PC nodes connected by GbE to a Force10 E1200, which uplinks to SCinet at 10 GE and onward over OC-192.)
Challenging points of TCP-based file transfer
- Large latency, high bandwidth (aka an LFN): a big socket buffer is needed for a large congestion window; fast window-size recovery after packet loss (High Speed TCP, an Internet draft by Sally Floyd); network striping
- Packet loss due to real congestion: transfer control
- Poor disk I/O performance: the 3ware RAID with four 3.5" HDDs on each node delivers over 115 MB/s (roughly 1 Gbps of network bandwidth); network striping vs. disk striping access; number of streams, stripe size
- Limited number of nodes: we need to achieve the maximum file transfer performance from each node
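As a minimal sketch of the first point: the socket buffer can be enlarged before connecting, which lets the congestion window grow enough to fill a long fat network. The 610 KB value mirrors the northern-route parameter listed later; everything else is the standard sockets API.

/* Request large send/receive buffers so the TCP congestion
 * window can grow enough to fill a long fat network. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int make_wide_socket(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int buf = 610 * 1024;            /* bytes; northern-route value */

    if (s < 0)
        return -1;
    /* Set before connect()/listen() so window scaling is
     * negotiated with room for the larger window. */
    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF, &buf, sizeof(buf)) < 0 ||
        setsockopt(s, SOL_SOCKET, SO_RCVBUF, &buf, sizeof(buf)) < 0)
        perror("setsockopt");
    return s;
}

Opening several such sockets per node pair (network striping) then limits the damage of a loss on any single stream, since only that stream's window shrinks.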
Bandwidth measurement result of TransPAC by Iperf
- Northern route: 2 node pairs; southern route: 3 node pairs (100 Mbps streams)
- Total of 753 Mbps (10-sec average); peak available bandwidth: 622 + 271 = 893 Mbps
(Charts: 10-sec and 5-min average bandwidth for the northern and southern routes.)
Bandwidth measurement result between the SC booth and other sites by Iperf (1-min average)
(Chart: series for Indiana Univ, SDSC, TransPAC north, and TransPAC south; the early gap is where 10 GE was not yet available, and the southern-route variation reflects evaluation of different route configurations.)
- The TransPAC northern route shows very high deviation due to a packet loss problem on Abilene between Denver and Kansas City
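For reference, one node pair of such an Iperf measurement, with an enlarged window, parallel streams, and 10-second reporting, could be launched as below; the host name is a placeholder, and the window and stream counts anticipate the transfer parameters listed later:

% iperf -c <remote-host> -w 610K -P 16 -t 60 -i 10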
File replication between US and Japan
Using 4 nodes each in the US and Japan, we achieved 741 Mbps for file transfer! (out of 893 Mbps; 10-sec average bandwidth)
(Chart: 10-sec average bandwidth of the transfer.)
Parameters of US-Japan file transfer

  parameter                    Northern route   Southern route
  socket buffer size           610 KB           250 KB
  traffic control per stream   50 Mbps          28.5 Mbps
  # streams per node pair      16 streams       8 streams
  # nodes                      3 hosts          1 host
  stripe unit size             128 KB           128 KB
  # node pairs   # streams           10-sec average BW   Transfer time (sec)   Average BW
  1 (N1)         16 (N16x1)          -                   113.0                 152 Mbps
  2 (N2)         32 (N16x2)          419 Mbps            115.9                 297 Mbps
  3 (N3)         48 (N16x3)          593 Mbps            139.6                 369 Mbps
  4 (N3 S1)      56 (N16x3 + S8x1)   741 Mbps            150.0                 458 Mbps
File replication performance between the SC booth and other US sites
(Chart: replication bandwidth [MB/s, 0 to 50] between one SC node and US sites versus the number of remote nodes [1 to 5]; series: sc->indiana, indiana->sc, sc->sdsc, sdsc->sc.)
SC02 Bandwidth Challenge Result
We achieved 2.286 Gbps using 12 nodes! (outgoing 1.691 Gbps, incoming 0.595 Gbps)
(Charts: 10-sec, 1-sec, and 0.1-sec average bandwidth.)
Summary
- Petascale data-intensive computing wave; key technology: Grid and cluster
- Grid Datafarm is an architecture for: online > 10 PB storage with > TB/s I/O bandwidth; efficient sharing on the Grid; fault tolerance
- Initial performance evaluation shows scalable performance:
  - 1742 MB/s and 1974 MB/s on writes and reads on 64 cluster nodes of Presto III
  - 443 MB/s using 23 parallel streams on Presto III
  - 1063 MB/s and 1436 MB/s on writes and reads on 12 cluster nodes of AIST Gfarm I
  - 410 MB/s using 6 parallel streams on AIST Gfarm I
  - Metaserver overhead is negligible
- Gfarm file replication achieved 2.286 Gbps at the SC2002 bandwidth challenge, and 741 Mbps out of 893 Mbps between the US and Japan!
- A smart resource broker is needed!!
[email protected]
http://datafarm.apgrid.org/
Special thanks to
Rick McMullen, John Hicks (Indiana Univ, PRAGMA)
Philip Papadopoulos (SDSC, PRAGMA)
Hisashi Eguchi (Maffin)
Kazunori Konishi, Yoshinori Kitatsuji, Ayumu Kubota (APAN)
Chris Robb (Indiana Univ, Abilene)
Force10 Networks, Inc.