1
ORNL Computing Story
Arthur Maccabe, Director, Computer Science and Mathematics Division
May 2010
Managed by UT-Battelle for the Department of Energy
2
Our vision for sustained leadership and scientific impact
• Provide the world’s most powerful open resource for capability computing
• Follow a well-defined path for maintaining world leadership in this critical area
• Attract the brightest talent and partnerships from all over the world
• Deliver leading-edge science relevant to the missions of DOE and key federal and state agencies
• Invest in education and training

Unique opportunity for multi-agency collaboration based on requirements and technology
3
ORNL has a role in providing a healthy HPC ecosystem for several agencies

Tiers, ordered by breadth of access:

Ubiquitous access to data and workstation-level computing (>10,000 users)
• Mid-range computing (clouds or clusters) and data centers
• 2009: Jobs with ~1 to 10^2 CPU cores
• Software either commercially available or developed internally
• Knowledge discovery tools and problem-solving environment
• High-speed LAN and WAN

Capacity computing (>1,000 users)
• Large hardware systems and mid-range computers (clusters)
• 2009: Jobs with ~10^3 CPU cores
• Applications having some scalability developed and maintained, portals, user support
• High-speed WAN

Capability computing (>100 users)
• Leadership computing at the exascale
• 2009: Jobs with ~10^5 CPU cores
• Most urgent, challenging, and important problems
• Scientific computing support
• Scalable applications developed and maintained
• Computational endstations for community codes
• High-speed WAN
4
ORNL’s computing resources

Networks: Science Data Net, TeraGrid, Internet2, ESnet (connected via network routers)

Scientific Visualization Lab
• EVEREST: 30 ft × 8 ft powerwall, 35 million pixels
• EVEREST cluster

Data analysis
• LENS: 32 nodes
• EWOK: 81 nodes

Experimental systems
• Keeneland (NSF): GPU-based compute cluster
• IBM BG/P, Power 7, Cray XMT

Center-wide file system
• Spider: 192 data servers, 10 PB of disk, 240 GB/sec

Archival storage
• HPSS: stored data 8 PB, capacity 30 PB

December 2009 summary
• Supercomputers: 6
• Cores: 376,812
• Memory: 516 TB
• Petaflops: 3.85
• Disks: 14,350 TB

Systems
• Leadership capability: DOE Jaguar, 2.3 PF (224,256 cores, 300 TB memory); NSF Kraken, 1 PF (99,072 cores, 132 TB memory)
• Capacity computing: DOE Jaguar, 240 TF (31,328 cores, 62 TB memory); ORNL Frost (2,048 cores, 3 TB memory)
• Climate: NSF Athena, 164 TF (18,060 cores, 18 TB memory); NOAA system TBD, 1 PF+ (cores and memory TBD)
• Oak Ridge Institutional Clusters (LLNL model): multiprogrammatic and national security (classified) clusters
5
We are DOE’s lead laboratory for open scientific computing

Why ORNL?
• World’s most capable complex for computational science: infrastructure, staff, multiagency programs
• Outcome: Sustained world leadership in transformational research and scientific discovery using advanced computing

Strategy
• Provide the nation’s most capable site for advancing through the petascale to the exascale and beyond
• Execute multiagency programs for advanced architecture, extreme-scale software R&D, and transformational science
• Attract top talent, deliver an outstanding user program, educate and train next-generation researchers

Leadership areas
• Delivering leading-edge computational science for DOE missions
• Deploying and operating the computational resources required to tackle national challenges in science, energy, and security
• Scaling applications through the petascale to the exascale

Infrastructure
• Multiprogram Computational Data Center: infrastructure for 100–250 petaflops and 1 exaflops systems

Jaguar performance today: 2,300 TF+
6
Advancing Scientific Discovery
*5 of the top 10 ASCR science accomplishments in the Breakthroughs report used OLCF resources and staff

Electron pairing in HTc cuprates*
The 2D Hubbard model admits a superconducting state for cuprates and exhibits an electron pairing mechanism most likely caused by spin-fluctuation exchange. PRL (2007, 2008)

Taming turbulent heat loss in fusion reactors*
Advanced understanding of energy loss in tokamak fusion reactor plasmas. PRL (vol. 99) and Physics of Plasmas (vol. 14)

How does a pulsar get its spin?*
Discovered the first plausible mechanism for a pulsar’s spin that fits observations, namely the shock wave created when the star’s massive iron core collapses. Nature, January 4, 2007

Carbon sequestration
Simulations of carbon dioxide sequestration show where the greenhouse gas flows when pumped into underground aquifers.

Stabilizing a lifted flame*
Elucidated the mechanisms that allow a flame to burn stably above burners, namely increasing the fuel or surrounding-air co-flow velocity. Combustion and Flame (2008)

Shining the light on dark matter*
A glimpse into the invisible world of dark matter, finding that dark matter evolving in a galaxy such as our Milky Way remains identifiable and clumpy. Nature, August 7, 2008
7
Science applications are scaling on Jaguar

Science Area | Code      | Contact       | Cores   | Total Performance        | Notes
Materials    | DCA++     | Schulthess    | 150,144 | 1,300 TF*                | Gordon Bell Winner
Materials    | LSMS      | Eisenbach     | 149,580 | 1,050 TF                 |
Seismology   | SPECFEM3D | Carrington    | 149,784 | 165 TF                   | Gordon Bell Finalist
Weather      | WRF       | Michalakes    | 150,000 | 50 TF                    |
Climate      | POP       | Jones         | 18,000  | 20 sim yrs/CPU day       |
Combustion   | S3D       | Chen          | 144,000 | 83 TF                    |
Fusion       | GTC       | PPPL          | 102,000 | 20 billion particles/sec |
Materials    | LS3DF     | Lin-Wang Wang | 147,456 | 442 TF                   | Gordon Bell Winner
Chemistry    | NWChem    | Apra          | 96,000  | 480 TF                   |
Chemistry    | MADNESS   | Harrison      | 140,000 | 550+ TF                  | Finalist
8
Utility infrastructure to support multiple data centers
Build a 280 MW substation on ORNL campus
Upgrade building power to 25 MW
Deploy a 10,000+ ton chiller plant
Upgrade UPS and generator capability
9
ORNL’s data center: Designed for efficiency
• Flywheel-based UPS for highest efficiency
• Variable-speed chillers save energy
• Liquid cooling is 1,000 times more efficient than air cooling
• 13,800-volt power into the building saves on transmission losses
• Vapor barriers and positive air pressure keep humidity out of the computer center
• 480-volt power to computers saves $1M in installation costs and reduces losses

Result: With a PUE of 1.25, ORNL has one of the world’s most efficient data centers
10
Innovative best practices needed to increase computer center efficiency

Best-practice categories (from “High Performance Buildings for High Tech Industries in the 21st Century,” Dale Sartor, Lawrence Berkeley National Laboratory): equipment cooling, chiller plant, facility electrical systems, IT equipment, and cross-cutting issues.

Measures include: air management; air economizers and free cooling; humidification control alternatives; centralized air handlers; low-pressure-drop air distribution; fan efficiency; direct liquid cooling; cooling plant optimization; variable-speed chillers, pumps, and drives; heat recovery; UPS systems; self-generation and standby generation; AC-DC distribution; lighting; power supply efficiency; sleep/standby loads; IT equipment fans; motor efficiency; right sizing; maintenance; commissioning and continuous benchmarking; redundancies; charging for space and power; and building envelope.
11
Today, ORNL’s facility is among the world’s most efficient data centers
Power usage effectiveness (PUE) = total data center power / IT equipment power
ORNL’s PUE = 1.25
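As a worked example of this definition (the 1.25 value is the slide’s; the 10 MW IT load is illustrative): at PUE = 1.25, a 10 MW IT load corresponds to roughly 12.5 MW of total facility power, with the remaining ~2.5 MW going to cooling, power conversion, lighting, and other overhead.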
12
ORNL’s state-of-the-art, laboratory-owned network is directly connected to every major R&E network at multiple lambdas
13
Center-wide file system
• “Spider” provides a shared, parallel file system for all systems
  – Based on the Lustre file system
• Demonstrated bandwidth of more than 240 GB/s
• Over 10 PB of RAID-6 capacity
  – 13,440 1-TB SATA drives
• 192 storage servers
• Available from all systems via our high-performance, scalable I/O network (InfiniBand)
• Currently mounted on over 26,000 client nodes
14
Completing the simulation environment to meet the science requirements
• EVEREST powerwall
• Remote visualization cluster
• End-to-end cluster
• Application development cluster
• Data archive: 25 PB
• XT5 and XT4 compute systems
• Spider file system
• Login nodes
• Scalable I/O Network (SION): 4x DDR InfiniBand backplane network
15
We have increased system performance by 1,000 times since 2004
• Hardware scaled from single-core through dual-core and quad-core to dual-socket, 12-core SMP nodes
• Scaling applications and system software is the biggest challenge
• NNSA and DoD have funded much of the basic system architecture research
  – Cray XT based on Sandia Red Storm
  – IBM BG designed with Livermore
  – Cray X1 designed in collaboration with DoD
• DOE SciDAC and NSF PetaApps programs are funding scalable application work, advancing many apps
• DOE-SC and NSF have funded much of the library and applied math work, as well as tools
• Computational liaisons are key to using the deployed systems

System progression (2005–2009): Cray X1, 3 TF → Cray XT3 single-core, 26 TF → Cray XT3 dual-core, 54 TF → Cray XT4, 119 TF → Cray XT4 quad-core, 263 TF → Cray XT5 systems (12-core, dual-socket SMP), 2,000+ TF and 1,000 TF
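As a rough consistency check using the slide’s own numbers: roughly 3 TF on the Cray X1 versus 2,000+ TF and 1,000 TF on the 2009 Cray XT5 systems (over 3,000 TF combined) is very close to the stated factor of 1,000 in peak performance.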
16
Science requires advanced computational capability: 1,000x over the next decade

Mission: Deploy and operate the computational resources required to tackle global challenges
Vision: Maximize scientific productivity and progress on the largest-scale computational problems
• Deliver transforming discoveries in climate, materials, biology, energy technologies, etc.
• Ability to investigate otherwise inaccessible systems, from regional climate impacts to energy grid dynamics
• Providing world-class computational resources and specialized services for the most computationally intensive problems
• Providing a stable hardware/software path of increasing scale to maximize productive applications development

Roadmap:
• 2009: Cray XT5, 2+ PF – leadership system for science
• 2011: OLCF-3, 10–20 PF – leadership system with some HPCS technology
• 2015: OLCF-4, 100–250 PF, based on DARPA HPCS technology
• 2018: OLCF-5, 1 EF
17
Jaguar: World’s most powerful computer, designed for science from the ground up
• Peak performance: 2+ petaflops
• System memory: 300 terabytes
• Disk space: 10 petabytes
• Disk bandwidth: 240 gigabytes/second
• System power: 7 MW
18
National Institute for Computational Sciences: University of Tennessee and ORNL partnership
• 16,704 six-core AMD Opteron™ processors
• 1,042 teraflops
• 130 TB memory
• 3.3 PB disk space
• 48 service and I/O nodes
World’s most powerful academic supercomputer
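A quick arithmetic check on the figures above: 16,704 processors × 6 cores ≈ 100,224 cores, and 130 TB of memory spread across them is roughly 1.3 GB per core.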
19
An International, Dedicated High-End Computing Project to Revolutionize Climate Modeling

Collaborators
• COLA: Center for Ocean-Land-Atmosphere Studies, USA
• ECMWF: European Centre for Medium-Range Weather Forecasts
• JAMSTEC: Japan Agency for Marine-Earth Science and Technology
• UT: University of Tokyo
• NICS: National Institute for Computational Sciences, University of Tennessee

Project
Use dedicated HPC resources – the Cray XT4 (Athena) at NICS – to simulate global climate change at the highest resolution ever. Six months of dedicated access.

Codes
• NICAM: Nonhydrostatic Icosahedral Atmospheric Model
• IFS: ECMWF Integrated Forecast System

Expected outcomes
• Better understand global mesoscale phenomena in the atmosphere and ocean
• Understand the impact of greenhouse gases on the regional aspects of climate
• Improve the fidelity of models simulating mean climate and extreme events
20
ORNL / DOD HPC collaborations
Peta/exa-scale HPC technology collaborations in support of national security
• System design, performance, and benchmark studies
• Wide-area network investigations
• Extreme Scale Software Center
  – Focused on widening the usage and improving the productivity of the next generation of “extreme-scale” supercomputers
  – Systems software, tools, environments, and applications development
  – Large-scale system reliability, availability, and serviceability (RAS) improvements
• Facility: 25 MW power, 8,000 tons cooling, 32,000 ft2 raised floor
21
We are partners in the $250M DARPA HPCS program
Prototype Cray system to be deployed at ORNL

HPCS program focus areas (slide courtesy of DARPA)
Fill the critical technology and capability gap between today’s systems (late-1980s HPC technology) and future quantum/bio computing

Impact:
• Performance (time-to-solution): Speed up critical national security applications by a factor of 10 to 40
• Programmability (idea-to-first-solution): Reduce cost and time of developing application solutions
• Portability (transparency): Insulate research and operational application software from the system
• Robustness (reliability): Apply all known techniques to protect against outside attacks, hardware faults, and programming errors

Applications:
• Intelligence/surveillance, reconnaissance, cryptanalysis, weapons analysis, airborne contaminant modeling, and biotechnology
22
The next big climate challenge
Nature, Vol. 453, Issue No. 7193, March 15, 2008
• Develop a strategy to revolutionize prediction of the climate through the 21st century to help address the threat of global climate change
• The current inadequacy in providing robust estimates of risk to society is strongly influenced by limitations in computer power
• A World Climate Research Facility (WCRF) for climate prediction should be established to enable the national centers to accelerate progress in improving operational climate prediction at decadal to multi-decadal lead times
• A central component of the WCRF will be one or more dedicated high-end computing facilities; enabling this revolution in climate prediction, with systems at least 10,000 times more powerful than currently available computers, is vital for regional climate predictions that underpin mitigation policies and adaptation needs
23
Oak Ridge Climate Change Science Institute
A new multi-disciplinary, multi-agency organization bringing together ORNL’s climate change programs to promote a cohesive vision of climate change science and technology at ORNL
• >100 staff members matrixed from across ORNL
• World’s leading computing systems (>3 PF in 2009)
• Specialized facilities and laboratories

James Hack, Director
David Bader, Deputy Director

Programs: Computation; Observation & Experiment; Data; Geologic Sequestration; Synthesis Science; Developing Areas
24
Oak Ridge Climate Change Science Institute: Integration of models, measurements, and analysis

Facilities and infrastructure
• High-performance computing: OLCF, NICS, NOAA
• Data systems, knowledge discovery, networking
• Observation networks
• Experimental manipulation facilities

Earth system models from local to global scales
• Atmosphere
• Ocean
• Ice
• Terrestrial and marine biogeochemistry
• Land use
• Hydrologic cycle

Process understanding: Observation, experiment, theory
• Aerosols, water vapor, clouds, atmosphere dynamics
• Ocean dynamics and biogeochemistry
• Ice dynamics
• Terrestrial ecosystem feedbacks and response
• Land-use trends and projections
• Extreme events, hydrology, aquatic ecology

Integrated assessment
• Adaptation
• Mitigation
• Infrastructure
• Energy and economics

Partnerships will be essential
http://climatechangescience.ornl.gov
25
What we would like to be able to say about climate-related impacts within the next 5 years
• What specific changes will be experienced, and where?
• When will the changes begin, and how will they evolve over the next two to three decades?
• How severe will the changes be? How do they compare with historical trends and events?
• What will be the impacts over space and time?
  – People (e.g., food, water, health, employment, social structure)
  – Nature (e.g., biodiversity, water, fisheries)
  – Infrastructure (e.g., energy, water, buildings)
• What specific – and effective – adaptation tactics are possible?
• How might adverse impacts be avoided or mitigated?
26
High-resolution Earth system modeling: A necessary core capability

Objectives and impact
• Strategy: Develop predictive global simulation capabilities for addressing climate change consequences
• Driver: Higher-fidelity simulations with improved predictive skill on decadal time scales and regional space scales
• Objective: Configurable high-resolution, scalable atmospheric, ocean, terrestrial, cryospheric, and carbon component models to answer policy- and planning-relevant questions about climate change
• Impact: Exploration of renewable energy resource deployment, carbon mitigation strategies, and climate adaptation scenarios (agriculture, energy and water resource management, protection of vulnerable infrastructure, national security)

Figure captions: mesoscale-resolved column-integrated water vapor (Jaguar XT5 simulation); eddy-resolved sea surface temperature (Jaguar XT5 simulation); net ecosystem exchange of CO2
27
NOAA collaboration as an example
Interagency Agreement with the Department of Energy Oak Ridge Operations Office
• Signed August 6, 2009
• Five-year agreement
• $215M Work for Others (initial investment of $73M)
• Facilities and science activity
• Provides dedicated, specialized high-performance computing collaborative services for climate modeling
• Builds on an existing three-year MOU with DOE
• Common research goal to develop, test, and apply state-of-the-science computer-based global climate simulation models based on a strong scientific foundation

Collaboration with Oak Ridge National Laboratory
• Synergy among research efforts across multiple disciplines and agencies
• Leverages substantial existing infrastructure on campus
• Access to world-class network connectivity
• Leverages ORNL Extreme Scale Software Center
• Utilizes proven project management expertise
28
Delivering Science
• “Effective speed” increases come from faster hardware and improved algorithms
• Science Prospects and Benefits with HPC in the Next Decade
• Speculative requirements for scientific applications on HPC platforms (2010–2020)
• Architectures and applications can be co-designed to create synergy in their respective evolutions

Having HPC facilities embedded in an R&D organization (CCSD) composed of staff with expertise in computer science, math, scalable application development, knowledge discovery, and computational science enables integrated approaches to delivering new science.

Domain science – a partnership of experiment, theory, and simulation working toward shared goals – draws on applied math, computer science, theoretical and computational science, and fields such as nanoscience to produce new science.
29
We have a unique opportunity for advancing math and computer science critical to mission success through multi-agency partnership
• Two national centers of excellence in HPC architecture and software established in 2008
  – Funded by DOE and DOD
  – Major breakthrough in recognition of our capabilities

Institute for Advanced Architectures and Algorithms (IAA)
• ORNL-Sandia partnership
• Jointly funded by NNSA and SC in 2008, ~$7.4M

Extreme Scale Software Development Center
• $7M in 2008
• Aligned with DOE-SC interests

IAA is the medium through which architectures and applications can be co-designed in order to create synergy in their respective evolutions.
30
Preparing for the exascale by analyzing long-term science drivers and requirements
• We have recently surveyed, analyzed, and documented the science drivers and application requirements envisioned for exascale leadership systems in the 2020 timeframe
• These studies help to:
  – Provide a roadmap for the ORNL Leadership Computing Facility
  – Uncover application needs and requirements
  – Focus our efforts on those disruptive technologies and research areas in need of our and the HPC community’s attention
31
What will an EF system look like?
• All projections are daunting
  – Based on projections of existing technology, both with and without “disruptive technologies”
  – Assumed to arrive in the 2016–2020 timeframe
• Example 1
  – 400 cabinets, 115K nodes @ 10 TF per node, 50–100 PB, optical interconnect, 150–200 GB/s injection bandwidth per node, 50 MW
• Examples 2–4 (DOE “Townhall” report*)

*www.er.doe.gov/ASCR/ProgramDocuments/TownHall.pdf
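Sanity-checking Example 1’s numbers: 115,000 nodes × 10 TF per node ≈ 1.15 EF of peak performance, and at 50 MW that corresponds to roughly 23 GF per watt of system power.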
32
Moving to the Exascale
• The U.S. Department of Energy requires exaflops computing by 2018 to meet the needs of the science communities that depend on leadership computing
• Our vision: Provide a series of increasingly powerful computer systems and work with the user community to scale applications to each of the new computer systems
  – Today: Upgrade of Jaguar to 6-core processors in progress
  – OLCF-3 Project: New 10–20 petaflops computer based on early DARPA HPCS technology

OLCF roadmap from the 10-year plan (2008–2019):
• Today: 2 PF (6-core) and 1 PF systems
• OLCF-3: 10–20 PF
• Future systems: 100 PF, a 250 PF HPCS system, and 1 EF
• Facilities: ORNL Computational Sciences Building; ORNL Multiprogram Computing and Data Center (140,000 ft2); ORNL Multipurpose Research Facility
33
We have always had inflection points where technology changed (1970–2030)
• Vector era: USA, Japan
• Massively parallel era: USA, Japan, Europe
• Multi-core era: a new paradigm in computing
34
What do the science codes need?
What system features do the applications need to deliver the science?
• 20 PF in the 2011–2012 time frame, with 1 EF by the end of the decade
• Applications want powerful nodes, not lots of weak nodes
  – Lots of FLOPS and OPS
  – Fast, low-latency memory
  – Memory capacity ≥ 2 GB/core
• Strong interconnect

System features considered: node peak FLOPS, node memory capacity, memory bandwidth, memory latency, interconnect bandwidth, interconnect latency, disk bandwidth, disk latency, large storage capacity, archival capacity, WAN bandwidth, MTTI
35
How will we deliver these features, and address the power problem?
DARPA ExaScale Computing Study (Kogge et al.): We can’t get to the exascale without radical changes
• Clock rates have reached a plateau and even gone down
• Power and thermal constraints restrict socket performance
• Multi-core sockets are driving up required parallelism and scalability
Future systems will get performance by integrating accelerators on the socket (already happening with GPUs)
• AMD Fusion™
• Intel Larrabee
• IBM Cell (power + synergistic processing units)
• This has happened before (3090+array processor, 8086+8087, …)
36
OLCF-3 system description
• Same number of cabinets, cabinet design, and cooling as Jaguar
• Operating system upgrade of today’s Cray Linux Environment
• New Gemini interconnect
  – 3-D torus
  – Globally addressable memory
  – Advanced synchronization features
• New accelerated node design
• 10–20 PF peak performance
• Much larger memory
• 3x larger and 4x faster file system
• ≈10 MW of power
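Read against Spider’s figures earlier in the deck (10 PB, 240 GB/s), the “3x larger and 4x faster file system” implies roughly 30 PB of capacity and on the order of 1 TB/s of bandwidth; this is an inference from the stated multipliers, not a specification from the slide.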
37
OLCF-3 node description
• Accelerated node design
  – Next-generation interconnect
  – Next-generation AMD processor
  – Future NVIDIA accelerator
• Fat nodes
  – 70 GB memory
  – Very high-performance processors
  – Very high memory bandwidth
(Diagram: two nodes, Node 0 and Node 1, connected by the interconnect)
38
NVIDIA’s commitment to HPC
Features for computing on GPUs
• Added high-performance 64-bit arithmetic
• Adding ECC and parity that other GPU vendors have not added – critical for a large system
• Larger memories
• Dual copy engines for simultaneous execution and copy
• S1070 has 4 GPUs exclusively for computing
  – No video-out cables
• Development of CUDA and recently announced work with PGI on Fortran CUDA
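To illustrate the “dual copy engines” bullet above, here is a minimal sketch (not from the presentation) of the CUDA usage pattern that feature accelerates: work split across two streams so that a host-device transfer in one stream can overlap computation in the other. The scale kernel, buffer names, sizes, and stream count are illustrative assumptions.

```cuda
// Minimal sketch: overlapping host-device copies with kernel execution using
// two CUDA streams, the pattern that dual copy engines are built to speed up.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int N = 1 << 20;                 // elements per chunk (illustrative)
    const size_t bytes = N * sizeof(float);

    float *h_buf[2], *d_buf[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMallocHost((void **)&h_buf[s], bytes);  // pinned memory, required for async copies
        cudaMalloc((void **)&d_buf[s], bytes);
        cudaStreamCreate(&stream[s]);
        for (int i = 0; i < N; ++i) h_buf[s][i] = 1.0f;
    }

    // Each stream issues copy-in, kernel, copy-out. With two copy engines the GPU
    // can move data for one stream while it computes for the other.
    for (int s = 0; s < 2; ++s) {
        cudaMemcpyAsync(d_buf[s], h_buf[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        scale<<<(N + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], N, 2.0f);
        cudaMemcpyAsync(h_buf[s], d_buf[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    printf("h_buf[0][0] = %f (expect 2.0)\n", h_buf[0][0]);

    for (int s = 0; s < 2; ++s) {
        cudaFreeHost(h_buf[s]);
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
    return 0;
}
```

With a single copy engine, host-to-device and device-to-host transfers from different streams serialize against each other; with two engines (and pinned host buffers), a transfer in each direction can proceed while kernels execute.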
“NVIDIA Shifts GPU Clusters Into Second Gear,” Michael Feldman, HPCwire Editor, May 4, 2009:
GPU-accelerated clusters are moving quickly from the “kick the tires” stage into production systems, and NVIDIA has positioned itself as the principal driver for this emerging high performance computing segment. The company’s Tesla S1070 hardware, along with the CUDA computing environment, are starting to deliver real results for commercial HPC workloads. For example, Hess Corporation has a 128-GPU cluster that is performing seismic processing for the company. The 32 S1070s (4 GPUs per board) are paired with dual-socket quad-core CPU servers and are performing at the level of about 2,000 dual-socket CPU servers for some of their workloads. For Hess, that means it can get the same computing horsepower for 1/20 the price and for 1/27 the power consumption.
39
Our strengths in key areas will ensure success (science, petaflops, exaflops)
• Exceptional expertise and experience; in-depth applications expertise in house; strong partners; proven management team
• Driving architectural innovation needed for exascale; superior global and injection bandwidths; purpose-built for scientific computing; leverages DARPA HPCS technologies
• Broad system software development partnerships; experienced performance/optimization tool development teams; partnerships with vendors and agencies to lead the way; leverage DOD and NSF investments
• Power (reliability, availability, cost); space (current and growth path); global network access capable of 96 × 100 Gb/s
• Multidisciplinary application development teams; partnerships to drive application performance; science base and thought leadership
40
www.ornl.gov
Oak Ridge National Laboratory: Meeting the challenges of the 21st century