Joint Techs Workshop, TIP 2004, Jan 28, 2004, Honolulu, Hawaii
National Institute of Advanced Industrial Science and Technology
Trans-Pacific Grid Datafarm
Osamu Tatebe
Grid Technology Research Center, AIST
On behalf of the Grid Datafarm Project
Key points of this talk
Trans-Pacific Grid file system and testbed
70 TBytes disk capacity, 13 GB/sec disk I/O performance
Trans-Pacific file replication [SC2003 Bandwidth Challenge]
1.5 TB of data transferred in one hour
Multiple high-speed Trans-Pacific networks: APAN/TransPAC (2.4 Gbps OC-48 POS, 500 Mbps OC-12 ATM), SuperSINET (2.4 Gbps x 2, 1 Gbps available)
6,000 miles
Stable 3.79 Gbps out of the theoretical peak of 3.9 Gbps (97%), using 11 node pairs (MTU 6000 bytes)
We won the "Distributed Infrastructure" award!
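As a quick consistency check on these headline figures (assuming decimal units), 1.5 TB in one hour corresponds to an average of roughly 3.3 Gbps, in line with the stable 3.79 Gbps peak:

```python
# Sanity check of the headline numbers (assumes decimal units: 1 TB = 1e12 bytes).
bytes_moved = 1.5e12   # 1.5 TB replicated
seconds = 3600         # one hour
avg_gbps = bytes_moved * 8 / seconds / 1e9
print(f"average rate: {avg_gbps:.2f} Gbps")  # -> average rate: 3.33 Gbps
```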
[Background] Petascale Data Intensive Computing
Detector for ALICE experiment
Detector for LHCb experiment
High Energy Physics: CERN LHC, KEK Belle
~MB/collision, 100 collisions/sec, ~PB/year; 2,000 physicists, 35 countries
Astronomical Data Analysis: analysis of the whole data set, TB~PB/year/telescope
SUBARU telescope: 10 GB/night, 3 TB/year
[Background 2] Large-scale File Sharing
P2P – exclusive and special-purpose approach
Napster, Gnutella, Freenet, . . .
Grid technology – file transfer, metadata management
GridFTP, Replica Location Service
Storage Resource Broker (SRB)
Large-scale file system – general approach
Legion, Avaki [Grid, no replica management]
Grid Datafarm [Grid]
Farsite, OceanStore [P2P]
AFS, DFS, . . .
Goal and feature of Grid Datafarm
Goal: dependable data sharing among multiple organizations; high-speed data access and high-speed data processing
Grid Datafarm: Grid File System – a global, dependable virtual file system
Integrates CPU + storage
Parallel & distributed data processing
Features
Secure, based on the Grid Security Infrastructure
Scalable with data size and usage scenarios
Location-transparent data access
Automatic and transparent replica access for fault tolerance
High-performance data access and processing by accessing multiple dispersed storages in parallel (file-affinity scheduling)
Grid Datafarm (1): Gfarm file system – world-wide virtual file system [CCGrid 2002]
Transparent access to dispersed file data in a Grid
POSIX I/O APIs, and native Gfarm APIs for extended file-view semantics and replication
Maps the virtual directory tree to physical files
Automatic and transparent replica access for fault tolerance and access-concentration avoidance
[Figure: the Gfarm File System maps a virtual directory tree (/grid, with subtrees such as ggf, jp, aist, gtrc holding file1–file4) onto physical files via file system metadata; file replicas are created on multiple nodes]
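The replica mechanism in the figure can be illustrated with a toy failover loop. This is a hypothetical sketch: the helper names (`lookup_replicas`, `open_on_node`) are illustrative, not the actual Gfarm API.

```python
# Hypothetical sketch of transparent replica access: a metadata lookup maps
# a virtual path to the nodes holding physical replicas, and the client
# falls over to the next replica on failure.
def open_with_failover(virtual_path, lookup_replicas, open_on_node):
    """Try each replica location in turn; raise only if all fail."""
    last_error = None
    for node in lookup_replicas(virtual_path):   # e.g. a metadata-server query
        try:
            return open_on_node(node, virtual_path)
        except IOError as e:
            last_error = e                       # replica unreachable: try next
    raise last_error or FileNotFoundError(virtual_path)

# Toy usage: two replicas, the first node is down.
replicas = {"/grid/jp/aist/gtrc/file1": ["node3", "node7"]}
def open_on_node(node, path):
    if node == "node3":
        raise IOError("node3 unreachable")
    return f"handle:{node}:{path}"

print(open_with_failover("/grid/jp/aist/gtrc/file1",
                         lambda p: replicas[p], open_on_node))
# -> handle:node7:/grid/jp/aist/gtrc/file1
```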
Grid Datafarm (2): High-performance data access and processing support [CCGrid 2002]
World-wide parallel and distributed processing
An aggregate of files = a superfile
Data processing of a superfile = parallel and distributed processing of its member files
Local file view (SPMD parallel file access)
File-affinity scheduling ("owner computes")
[Figure: a superfile of one year of astronomical archival data analyzed by 365-way parallel analysis over the Grid File System and virtual CPUs: world-wide parallel & distributed processing]
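A minimal sketch of the "owner-computes" idea, assuming a hypothetical `location_of` lookup rather than the real Gfarm scheduler: each member file of a superfile is processed on the node that already stores it, so bulk data never crosses the network.

```python
# Illustrative file-affinity scheduling: group member files by the node
# holding them, so each node processes only its local data.
from collections import defaultdict

def schedule_by_affinity(superfile_members, location_of):
    """Build a node -> [files to process locally] plan."""
    plan = defaultdict(list)
    for f in superfile_members:
        plan[location_of(f)].append(f)   # run the task where the data lives
    return dict(plan)

# Toy example: a 4-member "superfile" of nightly archives.
locations = {"night001": "nodeA", "night002": "nodeB",
             "night003": "nodeA", "night004": "nodeC"}
plan = schedule_by_affinity(sorted(locations), locations.get)
print(plan)
# -> {'nodeA': ['night001', 'night003'], 'nodeB': ['night002'], 'nodeC': ['night004']}
```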
Transfer technology in long fat networks
Bandwidth and latency between the US and Japan
1–10 Gbps, 150–300 ms RTT
TCP acceleration
Adjustment of congestion window
Multiple TCP connections
HighSpeed TCP, Scalable TCP, FAST TCP
XCP (not TCP)
UDP-based acceleration
Tsunami, UDT, RBUDP, atou, . . .
Bandwidth prediction without packet loss
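The reason window adjustment and parallel streams matter is the bandwidth-delay product: to keep a long fat pipe full, TCP must keep that many bytes in flight. A quick calculation for the figures above:

```python
# Bandwidth-delay product: bytes in flight needed to fill a long fat pipe.
def window_bytes(gbps, rtt_ms):
    return gbps * 1e9 / 8 * (rtt_ms / 1000)

print(f"{window_bytes(1, 150) / 1e6:.1f} MB")   # 1 Gbps, 150 ms -> 18.8 MB
print(f"{window_bytes(10, 300) / 1e6:.1f} MB")  # 10 Gbps, 300 ms -> 375.0 MB
```

By contrast, a default 64 KB TCP window caps a 150 ms path at roughly 3.5 Mbps, which is why the acceleration approaches listed above are needed.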
Multiple TCP streams sometimes considered harmful . . .
Multiple TCP streams achieve good bandwidth, but excessively congest the network; in fact, they would "shoot oneself in the foot".
[Figure: APAN/TransPAC LA–Tokyo (2.4 Gbps). Per-stream (TxBW0–TxBW2) and total (TxTotal) bandwidth over time, 10 ms averages: high oscillation (not stable), the streams compensate each other, and the aggregate overflows the link (too much congestion); the bandwidth needs to be limited appropriately]
A programmable network testbed device GNET-1
Programmable hardware network testbed
WAN emulation: latency, bandwidth, packet loss, jitter, . . .
Precise measurement: bandwidth at 100 µs resolution; latency and jitter between two GNET-1s
General purpose, very flexible!
Large high-speed memory blocks
IFG-based pace control by GNET-1
[Figure: shaping by GNET-1 (700 Mbps x 3 on APAN LA–Tokyo, 2.4 Gbps). Transmit rates TxBW0–TxBW2 and receive rates RxBW0–RxBW1 hold flat at 700 Mbps; even through a 1 Gbps GNET-1 bottleneck with flow control enabled, there is NO PACKET LOSS]
GNET-1 provides
Precise traffic pacing at any data rate by changing the IFG (Inter-Frame Gap)
A packet-loss-free network, using a large (16 MB) input buffer
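The pacing arithmetic can be reconstructed as follows (my sketch of the idea, not GNET-1's actual implementation): stretching the gap after each frame dilutes the 1 Gbps line rate down to the target rate.

```python
# Sketch of IFG-based pacing arithmetic (a reconstruction, not GNET-1 internals):
# each frame's on-wire time, including preamble and gap, is padded out until the
# average throughput equals the target rate.
def ifg_for_rate(target_bps, frame_bytes=1500, line_bps=1e9, preamble=8):
    # on-wire bytes per frame must stretch by the ratio line_rate / target_rate
    wire_bytes = frame_bytes * line_bps / target_bps
    return wire_bytes - frame_bytes - preamble   # byte-times of gap to insert

print(f"{ifg_for_rate(700e6):.0f} byte-times")  # -> 635 byte-times
```

At 700 Mbps with 1500-byte frames, the gap works out to roughly 635 byte-times in place of the standard minimum of 12.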
Summary of technologies for performance improvement
[Disk I/O performance] Grid Datafarm – a Grid file system with high-performance data-intensive computing support
A world-wide virtual file system that federates local file systems of multiple clusters
It provides scalable disk I/O performance for file replication over high-speed network links and for large-scale data-intensive applications
Trans-Pacific Grid Datafarm testbed: 5 clusters in Japan, 3 clusters in the US, and 1 cluster in Thailand, providing 70 TB of disk capacity and 13 GB/sec of disk I/O performance
It supports file replication for fault tolerance and access-concentration avoidance
[World-wide high-speed network efficient utilization] GNET-1 – a gigabit network testbed device
Provides IFG-based precise rate-controlled flow at any rate
Enables stable and efficient Trans-Pacific network use with HighSpeed TCP
Trans-Pacific Grid Datafarm testbed: network and cluster configuration
[Diagram: Trans-Pacific network and cluster configuration. Japan side: AIST, Titech, KEK, Univ Tsukuba, and NII over Maffin, Tsukuba WAN, SuperSINET, and the APAN Tokyo XP (links of 1, 2.4, 5, and 10 Gbps). US side: SC2003 Phoenix, Indiana Univ, SDSC, and Kasetsart Univ (Thailand) over Abilene, via Los Angeles (2.4 Gbps, measured 2.34 Gbps), New York (2.4 Gbps, measured 950 Mbps), and Chicago (622 Mbps OC-12 ATM, measured 500 Mbps). Clusters: 32 nodes / 23.3 TB / 2 GB/sec; 16 nodes / 11.7 TB / 1 GB/sec (x2); 7 nodes / 3.7 TB / 200 MB/sec; 10 nodes / 1 TB / 300 MB/sec; 147 nodes / 16 TB / 4 GB/sec. Trans-Pacific theoretical peak 3.9 Gbps; Gfarm disk capacity 70 TB; disk read/write 13 GB/sec]
Scientific Data for Bandwidth Challenge
Trans-Pacific file replication of scientific data
For transparent, high-performance, and fault-tolerant access
Astronomical Object Survey on Grid Datafarm [HPC Challenge participant]
World-wide data analysis of the whole archive
652 GBytes data observed by SUBARU telescope
N. Yamamoto (AIST)
Large configuration data from Lattice QCD
Three sets of hundreds of gluon field configurations on a 24^3*48 4-D space-time lattice (3 sets x 364.5 MB x 800 = 854.3 GB)
Generated by the CP-PACS parallel computer at Center for Computational Physics, Univ. of Tsukuba (300Gflops x years of CPU time) [Univ Tsukuba Booth]
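The quoted volume is internally consistent (assuming binary units, 1 GB = 1024 MB):

```python
# Check the lattice-QCD data volume: 3 sets x 800 configurations x 364.5 MB.
sets, configs, mb_each = 3, 800, 364.5
total_gb = sets * configs * mb_each / 1024   # binary units: 1 GB = 1024 MB
print(f"{total_gb:.1f} GB")  # -> 854.3 GB
```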
Network bandwidth on the APAN/TransPAC LA route
[Figure: bandwidth (Gbps) on the APAN/TransPAC LA route, without pacing vs with pacing at 2.3 Gbps (900 + 900 + 500 Mbps). Testbed: PC clusters behind switches at LA and Tokyo, a Force10 E600 and a Juniper M20 router, 10G/3G/2.4G links, GNET-1 pacing, RTT 141 ms. With pacing, a stable transfer rate of 2.3 Gbps]
APAN/TransPAC LA route (1)
APAN/TransPAC LA route (2)
APAN/TransPAC LA route (3)
File replication between Japan and US (network configuration)
[Diagram: Japan–US file replication configuration. PC clusters at Phoenix and Tokyo/Tsukuba connected through switches, a Force10 E600 and Juniper M20 routers, and GNET-1 pacing, over three routes: LA (RTT 141 ms), New York via Abilene (RTT 285 ms), and Chicago via Abilene (RTT 250 ms); link speeds range from 500 Mbps to 10 Gbps]
File replication performance between Japan and US (total)
APAN/TransPAC Chicago
Pacing at 500 Mbps, quite stable
APAN/TransPAC LA (1)
After re-pacing from 800 to 780 Mbps, quite stable
APAN/TransPAC LA (2)
After re-pacing of LA (1), quite stable
APAN/TransPAC LA (3)
After re-pacing of LA (1), quite stable
SuperSINET NYC: re-pacing from 930 to 950 Mbps
Summary
Efficient use around the peak rate in long fat networks
IFG-based precise pacing within the packet-loss-free bandwidth, using GNET-1
-> packet-loss-free network
Stable network flow even with HighSpeed TCP
Disk I/O performance improvement
Parallel disk access using Gfarm
Trans-Pacific file replication performance: 3.79 Gbps out of the theoretical peak of 3.9 Gbps (97%), using 11 node pairs (MTU 6000 bytes)
1.5 TB of data transferred in one hour
Linux 2.4 kernel problem during file replication (transfer)
Network transfer stopped within a few minutes, when the buffer cache was flushed to disk (a Linux kernel bug?)
Defensive solution: set a very short interval for buffer-cache flushing
This limits the file transfer rate to 400 Mbps per node pair
Successful Trans-Pacific-scale data analysis . . . but a scalability problem with the LDAP server used as the metadata server
Further improvement needed
National Institute of Advanced Industrial Science and Technology
Future work
Standardization effort with the GGF Grid File System WG
Foster (world-wide) storage sharing and integration:
dependable data sharing and high-performance data access among several organizations
Application areas
High energy physics experiment
Astronomical data analysis
Bioinformatics, . . .
Dependable data processing in eGovernment and eCommerce
Other applications that need dependable file sharing among several organizations
National Institute of Advanced Industrial Science and Technology
Special thanks to
Hirotaka Ogawa, Yuetsu Kodama, Tomohiro Kudoh, Satoshi Sekiguchi (AIST), Satoshi Matsuoka, Kento Aida (Titech), Taisuke Boku, Mitsuhisa Sato (Univ Tsukuba), Youhei Morita (KEK), Yoshinori Kitatsuji (APAN Tokyo XP), Jim Williams, John Hicks (TransPAC/Indiana Univ)
Eguchi Hisashi (Maffin), Kazunori Konishi, Jin Tanaka, Yoshitaka Hattori (APAN), Jun Matsukata (NII), Chris Robb (Abilene)
Tsukuba WAN NOC team, APAN NOC team, NII SuperSINET NOC team
Force10 Networks
PRAGMA, ApGrid, SDSC, Indiana University, Kasetsart University