Supercomputing • Communications • Data
NCAR Scientific Computing Division
UCAR CONFIDENTIAL
NCAR's Response to Upcoming OCI Solicitations
Richard Loft
SCD Deputy Director for R&D
Outline
• NSF Cyberinfrastructure Strategy (Track-1 & Track-2)
• NCAR generic strategy for the NSFXX-625 solicitations (Track-2)
• NCAR response to NSF05-625
• NSF Petascale Initiative strategy
• NCAR response to the NSF Petascale Initiative
NSF's Cyberinfrastructure Strategy
• The NSF's HPC acquisition strategy (through FY10) comprises three tracks:
– Track 1: high end, O(1 PFLOPS sustained)
– Track 2: mid-level systems, O(100 TFLOPS): the NSFXX-625 solicitations
  • First instance (NSF05-625) submitted Feb 10, 2006
  • Next instances due: November 30, 2006; November 30, 2007; November 30, 2008
– Track 3: typical university HPC, O(1-10 TFLOPS)
• The purpose of the Track-1 system will be to achieve revolutionary advancement and breakthroughs in science and engineering.
Solicitation NSF05-625: Towards a Petascale Computing Environment for Science and Engineering
• Award: September 2006
• System in production by May 31, 2007
• $30,000,000 or $15,000,000; operating costs funded under a separate action
• The Resource Provider (RP) serves the broad science community - open access
• Allocations by LRAC/MRAC or "their successors"
• Two 10 Gb/s TeraGrid links
NCAR's Overall NSFXX-625 Strategy
• Leverage NCAR/SCD expertise in production HPC.
• Get a production system:
– No white-box Linux solutions.
– Stay on the path to usable petascale systems.
• NCAR is a TeraGrid outsider, so the proposal must address two areas:
– Leverage experience with general scientific users.
– Lack of Grid consulting experience.
• Emphasize, but don't overemphasize, geosciences.
• In proposing, NCAR has a facility problem:
– Minimize costs: power, administrative staff, level of support.
• Creative plan for remote user support and education.
NSF05-625 PartnersNSF05-625 Partners
Facility Partner Facility Partner End-to-End System Supplier End-to-End System Supplier User Support Network -User Support Network -
– NCAR Consulting Service GroupNCAR Consulting Service Group– University partnersUniversity partners
NSF05-625 Facility Partner
• The NCAR Mesa Lab (ML) facility is full after ICESS. Key points:
– A new datacenter is needed whether or not NCAR wins the NSF05-625 solicitation.
– Because of the short timeline, a new datacenter never factors into the strategy for NSFXX-625.
• Identified a colocation facility. Facility features:
– Local (Denver-Boulder area)
– State-of-the-art, high-availability center
– Currently 4 x 2 MW generators of power available
– Familiar with large-scale deployments
– Dark fiber readily available (good connectivity)
NSF05-625 Supercomputer System Details
• Two systems: capability + capacity
• ~80 TFLOPS combined
• Robotic tape storage system, ~12 PB
NCAR NSF05-625 User Support Plan
• Largest potential differentiator in the proposal - let's do something unique!
• The system will be used by the generic scientist, so the support plan must:
– Be extensible to domains other than geoscience
– Address grid user support
• Strategy leverages the OSCER-led IGERT proposal:
– Combine the teaching of computational science with user support
– Embed application support expertise in key institutions
– Build education and training materials through university partnerships
Track-1 System Background
• Source of funds: Presidential Innovation Initiative announced in the State of the Union address.
• Performance goal: 1 PFLOPS sustained on "interesting problems".
• Science goal: breakthroughs.
• Use model: 12 research teams per year using the whole system for days or weeks at a time.
• Capability system: large everything and fault tolerant.
• Single system in one location.
• Not a requirement that the machine be upgradable.
Track-1 Project Parameters
• Funds: $200M over 4 years, starting FY07
– Single award
– Money is for an end-to-end system (as in 625)
– Not intended to fund a facility
– Release of funds tied to meeting hardware and software milestones
• Deployment stages:
– Simulator
– Prototype
– Petascale system operates: FY10-FY15
• Operations for FY10-15 funded separately.
Two-Stage Award Process Timeline
• Solicitation out: May 2006 (???)
• [ HPCS down-select: June 2006 ]
• Preliminary proposal due: August 2006
– Down-selection (invitation to 3-4 to write a full proposal)
• Full proposal due: January 2007
• Site visits: Spring 2007
• Award: September 2007
NSF's view of the problem
• NSF recognizes the facility challenge (power, cooling, space) of this system.
• Therefore NSF welcomes collaborative approaches:
– University & federal lab
– University & commercial data center
– University & state government
– University consortium
• NSF recognizes that applications will need significant modification to run on this system:
– User support plan
– Expects the proposer to discuss needs in this area with experts in key applications areas.
The Cards in NCAR’s HandThe Cards in NCAR’s Hand
NCAR …NCAR …– Is a leader in making the case that geoscience Is a leader in making the case that geoscience
grand challenge problems need petascale grand challenge problems need petascale computing.computing.
– Has many grand challenge problems to offer itself.Has many grand challenge problems to offer itself.– Has experience at large processor counts.Has experience at large processor counts.– Has recently connected to the TeraGrid, and is Has recently connected to the TeraGrid, and is
moving towards becoming a full-fledged Resource moving towards becoming a full-fledged Resource Provider.Provider.
NCAR Response OptionsNCAR Response Options
Do NothingDo Nothing Focus on Petascale Geoscience ApplicationsFocus on Petascale Geoscience Applications Partner with a lead institution or consortiumPartner with a lead institution or consortium Lead a Tier-1 proposalLead a Tier-1 proposal
Questions, Comments?
The Relationship Between OCI's Roadmap and NCAR's Datacenter Project
Richard Loft
SCD Deputy Director for R&D
Projected CCSM Computing Requirements Exceed Moore's Law
Thanks to Jeff Kiehl/Bill Collins
NSF’s Cyberinfrastructure StrategyNSF’s Cyberinfrastructure Strategy
The NSF’s HPC acquisition strategy (through FY10) for HPC is The NSF’s HPC acquisition strategy (through FY10) for HPC is for three Tracks:for three Tracks:– Track 1: High End O(1 PFLOPS sustained)Track 1: High End O(1 PFLOPS sustained)
– Track 2: Mid level system O(100 TFLOPS) Track 2: Mid level system O(100 TFLOPS) NSFXX-625NSFXX-625 First instance (NSF05-625) submitted First instance (NSF05-625) submitted Feb 10, 2006Feb 10, 2006 Next instances due:Next instances due:
– November 30, 2006
– November 30, 2007
– November 30, 2008
– Track 3: Typical University HPC O(1-10 TFLOPS)Track 3: Typical University HPC O(1-10 TFLOPS) The purpose of the Track-1 system will be to achieve The purpose of the Track-1 system will be to achieve
revolutionary advancement and breakthroughs in science and revolutionary advancement and breakthroughs in science and engineering.engineering.
NCAR strategic goals:
• NCAR will stay in the top echelon of geoscience computing centers.
• NCAR's immediate strategic goal is to be a Track-2 center.
• To do this, NCAR must be integrated with NSF's cyberinfrastructure plans.
• This means both connecting to and ultimately operating within the TeraGrid framework.
• The TeraGrid is evolving, so this is a moving target.
NCAR new facility
• The NCAR Mesa Lab (ML) facility is full after ICESS. Key points:
– A new datacenter is needed whether or not NCAR wins the NSF05-625 solicitation.
– Because of the short timeline, a new datacenter never factors into the strategy for NSFXX-625.
– Right now, we can't handle a modest budget augmentation for computing with the current facility.
Mesa Lab is full after the ICESS procurement
• ICESS = Integrated Computing Environment for Scientific Simulation
• We're sitting at 980 kW right now.
• Deinstalling bluesky will give us back 450 kW.
• This leaves about 600 kW of headroom.
• The ICESS procurement is expected to deliver a system with a maximum power requirement of 500-600 kW.
• This is not enough to house $15M-$30M of equipment from NSF05-625, for example.
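As a rough sanity check on those numbers, here is the arithmetic spelled out as a minimal sketch; the 1.2 MW Mesa Lab ceiling is the figure quoted with the power chart below.

```python
# Back-of-the-envelope check of the Mesa Lab power headroom (all values in kW).
mesa_lab_limit = 1200   # maximum equipment power at the Mesa Lab ("1.2 MW")
current_load   = 980    # "sitting at 980 kW right now"
bluesky_return = 450    # freed by deinstalling bluesky
icess_max      = 600    # upper end of the expected ICESS power requirement

after_deinstall = current_load - bluesky_return       # ~530 kW still in use
headroom = mesa_lab_limit - after_deinstall           # ~670 kW, i.e. roughly 600 kW
print(f"Headroom before ICESS: ~{headroom} kW")
print(f"Headroom after ICESS:  ~{headroom - icess_max} kW")  # little left for NSF05-625 gear
```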
[Chart: SCD Computer Facility Equipment Power Consumption (kW), 0-1000 kW, Jan-97 through Jan-06, broken out by system: Cray C90, Cray T3D, Cray J90s, HP SPP2000, SGI O2K ute, SGI O2K dataproc, SGI Origin3800, Compaq ES40, IBM POWER3 (blackforest & babyblue), IBM POWER4 (bluesky, thunder & bluedawn), IBM Linux, IBM BlueGene/L, IBM POWER5 (bluevista), and network & enterprise systems.]
Max power at the Mesa Lab is 1.2 MW!
We’re fast running out of power…
Preparing for the Petascale
Richard Loft
SCD Deputy Director for R&D
What to expect in HEC?
• Much more parallelism.
• A good deal of uncertainty regarding node architectures:
– Many threads per node.
• Continued ubiquity of Linux/Intel systems.
• There will be vector systems.
• Emergence of exotic architectures.
• The largest (petascale) systems are likely to have special features:
– Power-aware design (small memory?)
– Fault-tolerant design features
– Lightweight compute-node kernels
– Custom networks
Top 500: Speed of Supercomputers vs. Time
Top 500: Number of Processors vs. Time
HEC in 2010
• Based on history, we should expect 4K-8K CPU systems to be commonplace by the end of the decade.
• The largest systems on the Top500 list should be 1-10 PFLOPS.
• Parallelism in the largest system, estimated for 2010:
– Assuming a 5 GHz clock, a dual-FMA CPU delivers 20 GFLOPS peak.
– 1 PFLOPS peak = 50K CPUs.
– 10 PFLOPS peak = 500K CPUs.
– Large vector systems (if they exist) will still be highly parallel.
– To justify using the largest systems, applications must use a sizable fraction of the resource.
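Spelled out, the peak-rate arithmetic behind those CPU counts looks like the sketch below, using only the slide's assumptions of a 5 GHz clock and two fused multiply-add units per CPU.

```python
# Peak-performance arithmetic behind the 2010 parallelism estimate.
clock_hz      = 5e9    # assumed clock speed: 5 GHz
fma_units     = 2      # assumed dual FMA (fused multiply-add) units per CPU
flops_per_fma = 2      # one multiply plus one add per FMA

peak_per_cpu  = clock_hz * fma_units * flops_per_fma   # 20 GFLOPS peak per CPU
cpus_for_1pf  = 1e15 / peak_per_cpu                     # 50,000 CPUs for 1 PFLOPS peak
cpus_for_10pf = 1e16 / peak_per_cpu                     # 500,000 CPUs for 10 PFLOPS peak
print(f"{peak_per_cpu / 1e9:.0f} GFLOPS/CPU: "
      f"{cpus_for_1pf:,.0f} CPUs for 1 PFLOPS, {cpus_for_10pf:,.0f} for 10 PFLOPS")
```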
Range of Plausible Architectures: 2010
• Power issues will slow the rate of increase in clock frequency. This will drive the trend towards massive parallelism.
• All scalar systems will have multiple CPUs per socket (chip). Currently there are 2 CPUs per socket; by 2008, 4 CPUs per socket will be commonplace.
• 2010 scalar architectures will likely continue this trend. 8 CPUs per socket are possible - the Cell chip already has 8 synergistic processors.
• The key unknown is which cluster-on-a-chip architecture will be most effective.
• Vector systems will be around, but at what price?
• Wildcards:
– Impact of the DARPA HPCS program
– Exotics: FPGAs, PIMs, GPUs
How to make science staff aware of coming changes?
• NCAR must develop a science-driven plan for exploiting petascale systems at the end of the decade.
• Briefed the NCAR Director, DD, and the CISL and ESSL Directors.
• Meetings (SEWG at CCSM Breckenridge).
• Organizing NSF workshops on petascale geoscience benchmarking, scheduled in Washington, DC (June 1-2) and at NCAR (TBD).
• Have initiated internal petascale discussions:
– CGD-SCD joint meetings
– Peta_ccsm mail list
– Peta_ccsm Swiki site
• Through activities like these, NSA should take a leadership role.
What must be done to secure resources to improve scalability?
• We must help ourselves:
– Invest judiciously in computational science where possible.
– Leverage application development partnerships (SciDAC, etc.).
• Write proposals:
– Support for applications development for the Track-1 system can be built into an NCAR partnership deal.
– NSF has indicated an independent funding track for applications; NCAR should aggressively pursue those funding sources.
• New ideas can help, e.g. the space-filling-curve work in POP (next slides).
POP Space Filling Curves: partition for 8 processors
Credit: John Dennis, SCD
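The slide above is a figure; to make the idea concrete, here is a minimal space-filling-curve partitioner. It is only an illustration in the spirit of the POP work credited above, not POP code: the Morton (Z-order) curve, the 8 x 8 grid, and the function names are assumptions chosen for brevity.

```python
# Illustrative space-filling-curve partitioning of an 8 x 8 grid across 8 processors.
# Hypothetical sketch: POP's actual decomposition is not reproduced here.

def morton_index(i: int, j: int, bits: int = 3) -> int:
    """Interleave the bits of (i, j) to get the cell's position along a Z-order curve.
    bits=3 is enough for an 8 x 8 grid."""
    idx = 0
    for b in range(bits):
        idx |= ((i >> b) & 1) << (2 * b + 1)
        idx |= ((j >> b) & 1) << (2 * b)
    return idx

def partition(n: int = 8, nprocs: int = 8):
    """Order the n x n cells along the curve, then cut the curve into nprocs equal chunks."""
    cells = sorted(((i, j) for i in range(n) for j in range(n)),
                   key=lambda c: morton_index(*c))
    chunk = len(cells) // nprocs
    owner = {}
    for rank in range(nprocs):
        for cell in cells[rank * chunk:(rank + 1) * chunk]:
            owner[cell] = rank
    return owner

if __name__ == "__main__":
    owner = partition()
    for i in range(8):          # print the ownership map row by row
        print(" ".join(str(owner[(i, j)]) for j in range(8)))
```

Because consecutive cells along the curve are spatially close, each of the 8 processors receives a compact patch of the grid, which is the locality property space-filling-curve partitionings aim for.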
POP 1/10 Degree BG/L Improvements
[Chart: POP 1/10-degree performance, showing the BG/L space-filling-curve (SFC) improvement.]
Questions, Comments?
Top 500 Processor Types: Intel taking over
[Chart: Top500 systems by processor type, 1993-2005; categories include SIMD, Vector, and Scalar, with processor families Sparc, MIPS, Intel, HP, Power, and Alpha.]
Today Intel is inside 2/3 of the Top500 machines
The commodity onslaught…
• The Linux/Intel cluster is taking over the Top500.
• Linux has not penetrated the major weather, ocean, and climate centers yet. Reasons:
– System maturity (SCD experience)
– Scalability of dominant commodity interconnects
– Combinatorics (Linux flavor, processor, interconnect, compiler)
• But it affects NCAR indirectly because:
– Ubiquity = opportunity
– Universities are deploying them.
– NCAR must rethink the services it provides to the universities.
– It puts strain on all community software development activities.