Storage on the Lunatic Fringe
NNSA Advanced Simulation and Computing Program (ASC)
Panel at SC2003, November 19
Bill Boas
Lawrence Livermore National Laboratory
UCRL-PRES-201057
This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48.

Agenda
• ASC Program Role
• ASC Storage Roadmap
• File System Requirements
• Programming Model
• Cluster Vision
• Q1 CY’04 OCF Cluster Deployment
• Issues on the Fringe

Role of the Advanced Simulation and Computing Program (ASC)
• ASC Mission: Provide the computational means to assess and certify the safety, performance, and reliability of the nuclear stockpile and its components
• ASC Goals: Deliver predictive codes based on multi-scale modeling, code verification and validation, small-scale experimental data, test data, engineering analysis, and expert judgment
• Started in 1996: approximately 1/8 of the SSP budget
• PathForward and Alliances: support h/w and s/w developments with research, academia, and industry
• Scalable Global Secure File System (SGSFS): awarded to HP, Intel, and CFS in 2002 to develop a high-performance file system to meet ASC goals

ASC Program is more than platforms and physics codes
[Diagram: ASC program elements — Advanced Applications; Materials and Physics Modeling; Integration; Simulation Support; Physical Infrastructure and Platforms; Computational Systems; University Partnerships in Advanced Architectures; Verification and Validation; Problem Solving Environment; VIEWS; PathForward; DISCOM.]

ASC Data Storage and I/O Roadmap (CY02–CY07; each row lists successive milestones across those years)
• ASC Perf. Targets: 30 TF, 1 PB archive, 7–20 GB/s parallel FS, 1 GB/s to archive tape → 70–100 TF, 7 PB archive, 100 GB/s parallel FS, 10 GB/s to archive tape → 200 TF, 25 PB archive, 200 GB/s parallel FS, 20 GB/s to archive tape
• SGSFS: Lustre Lite on Linux → Lustre Lite limited production → Lustre with OST striping → Lustre early production → Lustre stable production
• SIO Libs: limited App use → use by key Apps → broad App use → performance tuned for Lustre
• Archive: HPSS 4.1 production → HPSS 4.5 production → HPSS 5.1 metadata fixes → HPSS 6.1 (replace DCE) → TBD
• DFS: DFS in production → pilot NFSv4 on Linux → deploy NFSv4 → integrate NFSv4 with Lustre
• COTS: 180 GB/disk, 30 MB/s single disk, 300 GB tape capacity, 70 MB/s max tape rate → 600 GB/disk, 80 MB/s single disk, 600 GB tape capacity, 120 MB/s tape rate → 1200 GB/disk, 200 MB/s single disk, 2 TB tape capacity, 200 MB/s tape rate
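
The "SIO Libs" row above covers the scalable I/O libraries that applications call on top of the parallel file system. As a minimal sketch, assuming MPI-IO as the underlying interface (the file path, block size, and data below are illustrative, not taken from the slide), this is the single-shared-file, collective write pattern that OST striping and the multi-GB/s parallel-FS targets are meant to serve:

```c
/* Minimal MPI-IO sketch: every rank writes one contiguous block of a
 * single shared file striped across the parallel file system's OSTs.
 * The path and sizes are illustrative only.  Build with an MPI C
 * compiler, e.g. mpicc. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK_DOUBLES (1 << 20)            /* 8 MB of doubles per rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf = malloc(BLOCK_DOUBLES * sizeof(double));
    for (int i = 0; i < BLOCK_DOUBLES; i++)
        buf[i] = (double)rank;             /* fill with something recognizable */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/p/lustre/checkpoint.dat",  /* hypothetical path */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset; the collective call lets the
     * MPI-IO layer aggregate requests before they reach the OSTs. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK_DOUBLES * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, BLOCK_DOUBLES, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```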

ASCI Scale File System Requirements
• Global Access
• Multi-Gigabyte-per-Second Performance
• Scalable Infrastructure for Clusters and Archive throughout Site or Facility
• Integrated Infrastructure for WAN Access
• Scalable Management and Operational Facilities
• Security

An Identical Programming Model Across Scalable Platforms (at LLNL)
[Diagram: each node runs OpenMP internally, with MPI communication between nodes; storage is reached through local disk, globally shared scalable I/O, and shared serial (NFS) I/O.]
• Idea: Provide a consistent programming model for multiple platform generations and across multiple vendors!
• Idea: Incrementally increase functionality over time!
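
As a concrete illustration of that model (MPI message passing between nodes, OpenMP threading within a node), here is a minimal hybrid sketch; the array size and the reduction it computes are placeholders rather than anything from the slide:

```c
/* Hybrid MPI + OpenMP sketch of the programming model described above:
 * MPI ranks communicate between nodes, OpenMP threads share memory
 * within a node.  The computation itself is a placeholder. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* Ask for an MPI library that tolerates threaded callers. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    static double a[N];
    double local = 0.0, global = 0.0;

    /* OpenMP: shared-memory parallelism within the node. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        a[i] = (double)(rank + i);
        local += a[i];
    }

    /* MPI: communication across nodes. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%g\n",
               nranks, omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}
```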

Architectural Vision for a Linux cluster
[Diagram: example cluster of 1,498 8-way compute nodes with QsNet Elan3 and a dual 1,632-port (51x32D32U+32x64D0U) IBA 12x fabric, plus 100BaseT control and management networks; 4 login nodes with 8x10Gb-Enet; 2 service nodes; 2 metadata (fail-over) servers; 128 gateway nodes @ 1 GB/s delivered Lustre I/O; 84 RAID subsystems at 2 GB/s raw delivered each, Lustre total 106 GB/s; OSTs and MDSs reached via a 1 & 10 GbEnet federated switch and SATA switches.]
System and Storage Parameters (storage):
• 128 OSTs with SATA-attached RAID arrays
• 20 B:F = 2.0 PB of global disk
• Lustre file system with 100 GB/s delivered parallel I/O performance
• Could scale up to 2,048 nodes or over 130 teraFLOP/s
System and Storage Parameters (compute):
• >100 TF/s and 50 TB of memory
• Dual IBA 12x interconnect
• B:F = 12:64 = 0.1875
• ~5 µs MPI latency and 10 GB/s MPI bandwidth
• 1,466*8 = 11,728 MPI tasks (limited by number of compute nodes)
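
As a quick check on the quoted ratios, assuming B:F here means bytes per flop/s of peak rate (my reading, not stated explicitly on the slide):

```latex
% Global disk implied by the 20 B:F figure at a ~100 TF/s peak:
\[
  20\ \tfrac{\mathrm{bytes}}{\mathrm{flop/s}} \times 100\times 10^{12}\ \mathrm{flop/s}
  = 2\times 10^{15}\ \mathrm{bytes} = 2.0\ \mathrm{PB}.
\]
% Interconnect bytes-to-flops ratio quoted as 12:64:
\[
  \tfrac{12}{64} = 0.1875 .
\]
```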

Multi-Cluster Global Scalable Storage
[Diagram: the OCF SGS File System Cluster (OFC) shared across the facility. Client clusters: MCR (1,116 dual-P4 compute nodes, 1152-port QsNet Elan3), ALC (924 dual-P4 compute nodes, 960-port QsNet Elan3), PVC (52 dual-P4 render nodes and 6 dual-P4 display nodes, 128-port Elan3), and Thunder (1,008 4-way Itanium2 compute nodes, 1024-port QsNet Elan4, 16 Itanium2 gateway nodes). Each of MCR and ALC attaches 2 login nodes with 4 Gb-Enet and 32 gateway nodes delivering 190 MB/s of Lustre I/O each over 2x1GbE. Federated Ethernet (multi-mode fiber, single-mode fiber, and copper 1 GigE) ties the clusters to 128 dual-P4 OST heads fronting 2 Gig FC RAID (146, 73, and 36 GB drives; 400–600 terabytes total), the OCF MetaData Cluster (dual MDS), NAS systems, an HPSS archive reached via PFTP (the figure shows 24), and the LLNL external backbone serving users in buildings 439, 451, 113, 116, etc.]
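
For a rough sense of the aggregate bandwidth in that picture (my arithmetic; only the 190 MB/s per-gateway figure and the 32-gateway count come from the slide, and 1 GbE is taken as 125 MB/s of payload):

```latex
\[
  2 \times 125\ \mathrm{MB/s} = 250\ \mathrm{MB/s\ raw\ per\ gateway},\qquad
  \tfrac{190}{250} \approx 76\%\ \mathrm{delivered},\qquad
  32 \times 190\ \mathrm{MB/s} \approx 6.1\ \mathrm{GB/s\ per\ cluster}.
\]
```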

Issues on the Fringe in Parallel/Cluster File Systems
• Multi-Cluster Interconnection
• Metadata Services Scaling
• Performance at Scale
• Scaling Multiple Clusters
• Recovery at Scale
• Availability at Scale
• Geographic dispersion
• Security in Multi-Clusters/Geographic
• $$ at scale of Storage Hardware
• “Non-exotic” direct interconnect use

Storage on the Lunatic Fringe
DISCLAIMER
This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes.
This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.