Computer Science Section, National Center for Atmospheric Research
Department of Computer Science, University of Colorado at Boulder
Blue Gene Experience at the National Center for Atmospheric Research
October 4, 2006
Theron Voran
University of Colorado at Boulder / National Center for Atmospheric Research
Why Blue Gene?
- Extreme scalability, balanced architecture, simple design
- Efficient energy usage
- A change from IBM Power systems at NCAR, but familiar: similar programming model, a chip similar to the Power4, and Linux on the front-end and I/O nodes
- Interesting research platform
Outline
- System Overview
- Applications
- In the Classroom
- Scheduler Development
- TeraGrid Integration
- Other Current Research Activities
Frost Fun Facts
- Collaborative effort: University of Colorado at Boulder (CU), NCAR, University of Colorado at Denver
- Debuted in June 2005, tied for 58th place on the Top500
- 5.73 Tflops peak, 4.71 sustained
- 25 kW loaded power usage
- 4 front-ends, 1 service node, 6 TB usable storage
- Why is it leaning?
Henry Tufo and Rich Loft, with Frost
System Internals
[Diagram: Blue Gene/L system-on-a-chip — two PPC440 cores (one usable as an I/O processor), each with a "double FPU" and 32k/32k L1 caches; a multiported shared SRAM buffer; a shared L3 directory with ECC over 4 MB of EDRAM; a DDR controller with ECC driving 144-bit-wide 256 MB DDR; and links to the torus (6 out and 6 in, each at 1.4 Gbit/s), tree (3 out and 3 in, each at 2.8 Gbit/s), global interrupt (4 global barriers or interrupts), Gbit Ethernet, and JTAG networks.]
More Details
- Chips: PPC440 at 700 MHz, 2 cores per node; 512 MB memory per node; coprocessor vs. virtual-node mode; 1:32 I/O-to-compute node ratio
- Interconnects: 3D torus (154 MB/s in one direction), tree (354 MB/s), global interrupt, GigE, JTAG/IDO
- Storage: 4 Power5 systems as a GPFS cluster, NFS-exported to the BG/L I/O nodes
Frost Utilization
[Chart: BlueGene/L (frost) usage, 1/1/06 through 9/10/06; y-axis shows utilization from 0% to 100%.]
HOMME
High Order Method Modeling Environment
- Spectral element dynamical core
- Proved scalable on other platforms
- Cubed-sphere topology
- Space-filling curves
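HOMME orders the cubed-sphere elements along a space-filling curve and then assigns each processor a contiguous run of that ordering, which keeps neighboring elements on the same or nearby processors. A minimal sketch of the partitioning step (the function name and layout are illustrative, not HOMME's actual code):

```python
# Sketch: divide elements, already ordered along a space-filling curve,
# into contiguous near-equal chunks, one chunk per processor.

def partition_sfc(num_elements, num_procs):
    """Split SFC-ordered element indices 0..num_elements-1 into
    num_procs contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(num_elements, num_procs)
    chunks, start = [], 0
    for p in range(num_procs):
        size = base + (1 if p < extra else 0)  # spread the remainder
        chunks.append(range(start, start + size))
        start += size
    return chunks

# Example: the 32,768-processor runs put 3 elements on each processor.
chunks = partition_sfc(3 * 32768, 32768)
assert all(len(c) == 3 for c in chunks)
```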
HOMME Performance
- Ported in 2004 to the BG/L prototype at TJ Watson, with an eventual goal of a Gordon Bell submission in 2005
- Serial and parallel obstacles: SIMD instructions, eager vs. adaptive routing, mapping strategies
- Result: good scalability out to 32,768 processors (3 elements per processor)
[Diagram: snake mapping on an 8x8x8 3D torus.]
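A toy illustration of the snake idea (not the mapping HOMME actually shipped): visit the torus nodes in boustrophedon order so that consecutive MPI ranks always land on physically adjacent nodes. Wraparound links are ignored in this sketch.

```python
# Sketch: "snake" ordering of an nx x ny x nz mesh. The x-sweep direction
# alternates row by row, and the y-sweep direction alternates plane by
# plane, so each rank is one mesh hop from the next.

def snake_order(nx, ny, nz):
    """Yield (x, y, z) coordinates in boustrophedon (snake) order."""
    for z in range(nz):
        ys = range(ny) if z % 2 == 0 else range(ny - 1, -1, -1)
        for y in ys:
            forward = (y + z) % 2 == 0   # flip x-direction each row
            xs = range(nx) if forward else range(nx - 1, -1, -1)
            for x in xs:
                yield (x, y, z)

coords = list(snake_order(8, 8, 8))

def hops(a, b):
    return sum(abs(i - j) for i, j in zip(a, b))

# Every consecutive pair of ranks is exactly one hop apart on the mesh.
assert all(hops(coords[i], coords[i + 1]) == 1
           for i in range(len(coords) - 1))
```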
HOMME Scalability on 32 Racks
Other Applications
- Popular codes on Frost: WRF, CAM, POP, CICE, MPIKAIA, EULAG, BOB, PETSc
- Used as a scalability test bed in preparation for runs on the 20-rack BG/W system
Classroom Access
- Henry Tufo's 'High Performance Scientific Computing' course at the University of Colorado
- Let students loose on 2048 processors: thinking BIG, throughput and latency studies, scalability tests (Conway's Game of Life), final projects
- Feedback from 'novice' HPC users
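Conway's Game of Life makes a good scalability exercise because each cell update needs only its eight neighbors. A minimal serial sketch of the per-step rule (the class runs would distribute the grid across MPI ranks; this shows just the kernel, with wraparound edges):

```python
# Sketch: one Game of Life step on a 2-D grid with toroidal wraparound.

def life_step(grid):
    rows, cols = len(grid), len(grid[0])

    def neighbors(r, c):
        # Count the eight surrounding cells, wrapping at the edges.
        return sum(grid[(r + dr) % rows][(c + dc) % cols]
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0))

    # A cell is alive next step if it has 3 neighbors, or 2 and is alive.
    return [[1 if (n := neighbors(r, c)) == 3 or (n == 2 and grid[r][c])
             else 0
             for c in range(cols)]
            for r in range(rows)]

# A "blinker" (three cells in a row) oscillates with period 2.
blinker = [[0] * 5 for _ in range(5)]
for c in (1, 2, 3):
    blinker[2][c] = 1
assert life_step(life_step(blinker)) == blinker
```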
Cobalt
- Component-Based Lightweight Toolkit
- Open-source resource manager and scheduler
- Developed by ANL along with NCAR/CU
- Component architecture: communication via XML-RPC; process manager, queue manager, scheduler
- ~3000 lines of Python code
- Manages traditional clusters as well
http://www.mcs.anl.gov/cobalt
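Cobalt's components (queue manager, process manager, scheduler) talk to each other over XML-RPC. A toy sketch of that component style using only Python's standard library; the `ToyQueueManager` class and its methods are illustrative, not Cobalt's actual API:

```python
# Sketch: a stand-in "component" exposing methods over XML-RPC, plus a
# client that calls it -- the communication pattern Cobalt uses.
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy
import threading

class ToyQueueManager:
    """Illustrative component; Cobalt's real queue manager differs."""
    def __init__(self):
        self.jobs = []

    def add_job(self, spec):
        self.jobs.append(spec)
        return len(self.jobs)          # toy job id

    def get_jobs(self):
        return self.jobs

# Serve the component on an ephemeral localhost port.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
port = server.server_address[1]
server.register_instance(ToyQueueManager())
threading.Thread(target=server.serve_forever, daemon=True).start()

# Another component would reach it through an XML-RPC proxy.
client = ServerProxy("http://localhost:%d" % port)
jobid = client.add_job({"user": "student", "nodes": 32})
jobs = client.get_jobs()
server.shutdown()
```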
Cobalt Architecture
Cobalt Development Areas
- Scheduler improvements: efficient packing, multi-rack challenges, simulation ability, tunable scheduling parameters
- Visualization: aid in scheduler development; give users (and admins) a better understanding of machine allocation
- Accounting / project management and logging
- Blue Gene/P
- TeraGrid integration
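Blue Gene partitions come in fixed block sizes, so "efficient packing" is partly a bin-fitting problem: place each job on the smallest free partition that can hold it. A toy first-fit sketch (not Cobalt's scheduler; partition sizes below are illustrative):

```python
# Sketch: greedy placement of jobs onto free partitions, smallest
# adequate partition first.

def first_fit(jobs, partitions):
    """Map job index -> partition index; each partition used at most once."""
    # Consider free partitions in increasing order of size.
    free = sorted(range(len(partitions)), key=lambda i: partitions[i])
    placement = {}
    for j, size in enumerate(jobs):
        for i in free:
            if partitions[i] >= size:
                placement[j] = i
                free.remove(i)        # partition is no longer free
                break
    return placement

# Jobs of 512, 32, and 128 nodes onto partitions of 32/128/512/512 nodes:
placement = first_fit([512, 32, 128], [32, 128, 512, 512])
assert placement == {0: 2, 1: 0, 2: 1}  # each job gets a snug partition
```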
NCAR joins the TeraGrid, June 2006
TeraGrid Testbed
[Diagram: TeraGrid testbed — a CU experimental environment (computational cluster, storage cluster, CSS switch) and an NCAR production environment (NETS switch, Datagrid), linked through the NCAR 1 Gb network and FRGP to NLR and the TeraGrid.]
TeraGrid Activities
- Grid-enabling Frost: Common TeraGrid Software Stack (CTSS); Grid Resource Allocation Manager (GRAM) and Cobalt interoperability; security infrastructure
- Storage cluster: 16 OSTs, 50-100 TB usable storage; 10G connectivity; GPFS-WAN; Lustre-WAN
Other Current Research Activities
- Scalability of CCSM components: POP, CICE
- Scalable solver experiments: efficient communication mapping
- Coupled climate models: petascale parallelism
- Meta-scheduling: across sites; Cobalt vs. other schedulers
- Storage: PVFS2 + ZeptoOS, Lustre
Frost has been a success as a …
- Research experiment: utilization rates
- Educational tool: classroom use, fertile ground for grad students
- Development platform: petascale problems, systems work
Questions?
[email protected]
http://wiki.cs.colorado.edu/BlueGeneWiki