

Prof. Wu FENG, [email protected], 540-231-1192 Depts. of Computer Science and Electrical & Computer Engineering

© W. Feng, January 2009


Overview: Efficient Parallel Computing

•  Interdisciplinary
–  Approaches: Experimental, Simulative, and Theoretical
–  Abstractions: Systems Software, Middleware, Applications
•  Synergistic Research Sampling
–  Emerging Chip Multiprocessors: Multicore, GPUs, Cell / PS3
•  Performance Analysis & Optimization: Systems & Applications
–  Automated Process-to-Core Mapping: SyMMer
–  Mixin Layers & Concurrent Mixin Layers (Tilevich)
–  Green Supercomputing: EcoDaemon & Green500
–  Pairwise Sequence Search: BLAST and Smith-Waterman
–  Episodic Data Mining (Ramakrishnan, Cao)
–  Electrostatic Potential (Onufriev)

IP2M Award


Overview: Efficient Parallel Computing

•  Interdisciplinary
–  Approaches: Experimental, Simulative, and Theoretical
–  Abstractions: Systems Software, Middleware, Applications
•  Synergistic Research Sampling
–  Hybrid Circuit- and Packet-Switched Networking
•  Application-Awareness in Network Infrastructure: FAATE
•  Composable Protocols for High-Performance Networking


Overview: Efficient Parallel Computing

•  Interdisciplinary
–  Approaches: Experimental, Simulative, and Theoretical
–  Abstractions: Systems Software, Middleware, Applications
•  Synergistic Research Sampling
–  Large-Scale Bioinformatics
•  mpiBLAST: Parallel Sequence Search
–  "Finding Missing Genes" in Genomes (Setubal)
•  ParaMEDIC: Parallel Metadata Environment for Distributed I/O & Computing

www.mpiblast.org

Storage Challenge Award


Overview: Efficient Parallel Computing

•  Interdisciplinary
–  Approaches: Experimental, Simulative, and Theoretical
–  Abstractions: Systems Software, Middleware, Applications
•  Synergistic Research Sampling
–  Virtual Computing for K-12 Pedagogy & Research (Gardner)
•  Characterization & Optimization of Virtual Machines
•  Cybersecurity & Scalability: Private vs. Public Addresses

Best Undergraduate Student Poster Award


The Synergy Lab (http://synergy.cs.vt.edu/)

•  Ph.D. Students
–  Jeremy Archuleta
–  Marwa Elteir
–  Song Huang
–  Thomas Scogland
–  Sushant Sharma
–  Kumaresh Singh
–  Shucai Xiao

•  M.S. Students
–  Ashwin Aji
–  Rajesh Gangam
–  Ajeet Singh

•  B.S. Students
–  Jacqueline Addesa
–  William Gomez
–  Gabriel Martinez

•  Faculty Collaborators
–  Kirk Cameron
–  Yong Cao
–  Mark Gardner
–  Alexey Onufriev
–  Dimitris Nikolopoulos
–  Naren Ramakrishnan
–  Adrian Sandu
–  João Setubal
–  Eli Tilevich


Prof. Wu FENG, [email protected], 540-231-1192 Depts. of Computer Science and Electrical & Computer Engineering

© W. Feng, January 2009


Why Green Supercomputing?

•  Premise
–  We need horsepower to compute but horsepower also translates to electrical power consumption.
•  Consequences
–  Electrical power costs $$$$.
–  "Too much" power affects efficiency, reliability, availability.
•  Idea
–  Autonomic energy and power savings while maintaining performance in the supercomputer or datacenter.
•  Low-Power Approach: Power Awareness at Integration Time
•  Power-Aware Approach: Power Awareness at Run Time

Wall Street Journal, January 29, 2007

Green Destiny: Low-Power Supercomputer

Green Destiny “Replica”: Traditional Supercomputer

Only Difference? The Processors

© W. Feng, Mar. 2007


Outline

•  Motivation & Background
–  Where is High-End Computing (HEC)?
–  The Need for Efficiency, Reliability, and Availability
•  Supercomputing in Small Spaces (http://sss.cs.vt.edu/)
–  Distant Past: Green Destiny (2001-2002)
–  Recent Past to Present: Evolution of Green Destiny (2003-2007)
•  Architectural
–  MegaScale, Orion Multisystems, IBM Blue Gene/L
•  Software-Based
–  EnergyFit™: Power-Aware Run-Time System (β adaptation)

•  Conclusion and Future Work


Where is High-End Computing?

We have spent decades focusing on performance, performance, performance

(and price/performance).


Where is High-End Computing?
TOP500 Supercomputer List

•  Benchmark
–  LINPACK: Solves a (random) dense system of linear equations in double-precision (64-bit) arithmetic.
•  Introduced by Prof. Jack Dongarra, U. Tennessee & ORNL
•  Evaluation Metric
–  Performance (i.e., Speed)
•  Floating-Point Operations Per Second (FLOPS)
•  Web Site
–  http://www.top500.org/


Where is High-End Computing?
Gordon Bell Awards at SC

•  Metrics for Evaluating Supercomputers (or HEC)
–  Performance (i.e., Speed)
•  Metric: Floating-Point Operations Per Second (FLOPS)
–  Price/Performance → Cost Efficiency
•  Metric: Acquisition Cost / FLOPS
•  Example: VT System X cluster
•  Performance & price/performance are important metrics, but …


Where is High-End Computing?

… more "performance" → more horsepower → more power consumption
→ less efficiency, reliability, and availability


Where is High-End Computing?

… more "performance" → more horsepower → more power consumption
→ less efficiency, reliability, and availability


Moore’s Law for Power Consumption?

[Chart: Chip maximum power density (watts/cm²) vs. process generation, from 1.5µ down to 0.07µ (roughly 1985-2004), for Intel processors from the i386 and i486 through the Pentium, Pentium Pro, Pentium II, Pentium III, and Pentium 4 (Willamette, Prescott). Power density has surpassed that of a heating plate, has surpassed that of a nuclear reactor, and is headed toward that of a rocket nozzle. Source: Intel.]


Where is High-End Computing?

… more "performance" → more horsepower → more power consumption
→ less efficiency, reliability, and availability


Efficiency of HEC Systems

•  "Performance" and "Price/Performance" Metrics …
–  Lower efficiency, reliability, and availability.
–  Higher operational costs, e.g., admin, maintenance, etc.
•  Examples
–  Computational Efficiency
•  Relative to Peak: Actual Performance / Peak Performance
•  Relative to Space: Performance / Sq. Ft.
•  Relative to Power: Performance / Watt
–  Performance: 10,000-fold increase (since the Cray C90).
•  Performance/Sq. Ft.: Only 65-fold increase.
•  Performance/Watt: Only 300-fold increase.

–  Massive construction and operational costs associated with powering and cooling.


Reliability & Availability of HEC Systems

Systems              CPUs        Reliability & Availability
ASCI Q               8,192       MTBF: 6.5 hrs. 114 unplanned outages/month.
                                 – HW outage sources: storage, CPU, memory.
ASCI White           8,192       MTBF: 5 hrs. (2001) and 40 hrs. (2003).
                                 – HW outage sources: storage, CPU, 3rd-party HW.
PSC Lemieux          3,016       MTBF: 9.7 hrs. Availability: 98.33%.
Google (projected    ~450,000    ~550 reboots/day; 2-3% machines replaced/yr.
from 2003)                       – HW outage sources: storage, memory.
                                 Availability: ~100%.


Ubiquitous Need for Efficiency, Reliability, and Availability

•  Requirement: Near-100% availability with efficient and reliable resource usage.
–  E-commerce, enterprise apps, online services, ISPs, data and HPC centers supporting R&D.
•  Problems
–  Frequency of Service Outages
•  65% of IT managers report that their web sites were unavailable to customers over a 6-month period.
–  Cost of Service Outages
•  NYC stockbroker: $6,500,000 / hour
•  Amazon.com: $1,860,000 / hour
•  eBay (22 hours): $225,000 / hour
•  Social Effects: negative press, loss of customers who "click over" to a competitor (e.g., Google vs. Ask Jeeves)

Adapted from David Patterson, IEEE IPDPS 2001


Outline

•  Motivation & Background
–  Where is High-End Computing (HEC)?
–  The Need for Efficiency, Reliability, and Availability
•  Supercomputing in Small Spaces (http://sss.cs.vt.edu/)
–  Distant Past: Green Destiny (2001-2002)
–  Recent Past to Present: Evolution of Green Destiny (2003-2007)
•  Architectural
–  MegaScale, Orion Multisystems, IBM Blue Gene/L
•  Software-Based
–  EnergyFit™: Power-Aware Run-Time System (β adaptation)

•  Conclusion and Future Work


Supercomputing in Small Spaces
Established 2001 at LANL. Now at Virginia Tech.

•  Goals
–  Improve efficiency, reliability, and availability in large-scale supercomputing systems (HEC, server farms, distributed systems)
•  Autonomic energy and power savings while maintaining performance in a large-scale computing system.
–  Reduce the total cost of ownership (TCO).
•  TCO = Acquisition Cost ("one time") + Operation Cost (recurring); a rough worked example follows below.
•  Crude Analogy
–  Today's Supercomputer or Datacenter
•  Formula One Race Car: Wins on raw performance, but reliability is so poor that it requires frequent maintenance. Throughput is low.
–  "Supercomputing in Small Spaces" Supercomputer or Datacenter
•  Toyota Camry: Loses on raw performance, but high reliability and efficiency result in high throughput (i.e., miles driven/month → answers/month).
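To make the TCO split concrete, here is a rough back-of-the-envelope example; the acquisition cost, power draw, and electricity rate are hypothetical figures chosen only for illustration, not numbers from the slides:

\[
\text{TCO} \approx \underbrace{\$2{,}000{,}000}_{\text{acquisition, one time}} \;+\; \underbrace{500\,\text{kW} \times 8760\,\text{h/yr} \times \$0.10/\text{kWh}}_{\approx\,\$438{,}000\ \text{per year, power alone}}
\]

Over a four-year lifetime, the recurring operation cost (before cooling, space, and administration) is already comparable to the acquisition cost, which is the point of the Formula One vs. Toyota Camry analogy.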

http://sss.cs.vt.edu/


Green Destiny Supercomputer (circa December 2001 – February 2002)

•  A 240-Node Cluster in Five Sq. Ft.
•  Each Node
–  1-GHz Transmeta TM5800 CPU w/ High-Performance Code-Morphing Software running Linux 2.4.x
–  640-MB RAM, 20-GB hard disk, 100-Mb/s Ethernet
•  Total
–  240 Gflops peak (Linpack: 101 Gflops in March 2002.)
–  150 GB of RAM (expandable to 276 GB)
–  4.8 TB of storage (expandable to 38.4 TB)
–  Power Consumption: Only 3.2 kW (diskless)
•  Reliability & Availability
–  No unscheduled downtime in 24-month lifetime.
•  Environment: A dusty 85°-90° F warehouse!

Featured in The New York Times, BBC News, and CNN. Now in the Computer History Museum.

2003 WINNER

Equivalent Linpack to a 256-CPU SGI Origin 2000

(On the TOP500 List at the time)


Michael S. Warren, Los Alamos National Laboratory © W. Feng, May 2002

Green Destiny: Low-Power Supercomputer

Green Destiny “Replica”: Traditional Supercomputer

Only Difference? The Processors

© W. Feng, Mar. 2007


Parallel Computing Platforms Running an N-body Gravitational Code

Machine                      Avalon      ASCI     ASCI     Green
                             Beowulf     Red      White    Destiny
Year                         1996        1996     2000     2002
Performance (Gflops)         18          600      2500     58
Area (ft²)                   120         1600     9920     5
Power (kW)                   18          1200     2000     5
DRAM (GB)                    36          585      6200     150
Disk (TB)                    0.4         2.0      160.0    4.8
DRAM density (MB/ft²)        300         366      625      3000
Disk density (GB/ft²)        3.3         1.3      16.1     960.0
Perf/Space (Mflops/ft²)      150         375      252      11600
Perf/Power (Mflops/watt)     1.0         0.5      1.3      11.6
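As a quick check, the derived rows follow directly from the raw rows of the table; for example, for Green Destiny:

\[
\frac{58\ \text{Gflops}}{5\ \text{ft}^2} = 11{,}600\ \text{Mflops/ft}^2,
\qquad
\frac{58\ \text{Gflops}}{5\ \text{kW}} = 11.6\ \text{Mflops/watt}.
\]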


Yet in 2002 …

•  "Green Destiny is so low power that it runs just as fast when it is unplugged."
•  "The slew of expletives and exclamations that followed Feng's description of the system …"
•  "In HPC, no one cares about power & cooling, and no one ever will …"
•  "Moore's Law for Power will stimulate the economy by creating a new market in cooling technologies."


Outline

•  Motivation & Background
–  Where is High-End Computing (HEC)?
–  The Need for Efficiency, Reliability, and Availability
•  Supercomputing in Small Spaces (http://sss.cs.vt.edu/)
–  Distant Past: Green Destiny (2001-2002)
–  Recent Past to Present: Evolution of Green Destiny (2003-2007)
•  Architectural
–  MegaScale, Orion Multisystems, IBM Blue Gene/L
•  Software-Based
–  EnergyFit™: Power-Aware Run-Time System (β adaptation)
–  EcoDaemon™: Power-Aware Operating System Daemon

•  Conclusion and Future Work

[Diagram: Cluster Technology + Low-Power Systems Design + Linux, but in the form factor of a workstation: a cluster workstation.]

© W. Feng, Aug. 2007


•  LINPACK Performance
–  14 Gflops
•  Footprint
–  3 sq. ft. (24" x 18")
–  1 cu. ft. (24" x 4" x 18")
•  Power Consumption
–  170 watts at load

•  How does this compare with a traditional desktop?

Inter-University Project: MegaScale
http://www.para.tutics.tut.ac.jp/megascale/
Univ. of Tsukuba Booth @ SC2004, Nov. 2004.


IBM Blue Gene/L
Now the fastest supercomputer in the world.

Packaging hierarchy:
–  Chip (2 processors): 2.8/5.6 GF/s, 4 MB
–  Compute Card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
–  Node Card (32 chips, 4x4x2; 16 Compute Cards): 90/180 GF/s, 8 GB DDR
–  Cabinet (32 Node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
–  System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR

October 2003: BG/L half-rack prototype, 500 MHz, 512 nodes / 1024 proc., 2 TFlop/s peak, 1.4 TFlop/s sustained. © 2004 IBM Corporation


Outline

•  Motivation & Background
–  Where is High-End Computing (HEC)?
–  The Need for Efficiency, Reliability, and Availability
•  Supercomputing in Small Spaces (http://sss.cs.vt.edu/)
–  Distant Past: Green Destiny (2001-2002)
–  Recent Past to Present: Evolution of Green Destiny (2003-2007)
•  Architectural
–  MegaScale, Orion Multisystems, IBM Blue Gene/L
•  Software-Based
–  EnergyFit™: Power-Aware Run-Time System (β adaptation)
–  EcoDaemon™: Power-Aware Operating System Daemon

•  Conclusion and Future Work


Software-Based Approach: Power-Aware Computing

•  Self-Adapting Software for Energy Efficiency
–  Conserve power & energy WHILE maintaining performance.

•  Observations
–  Many commodity technologies support dynamic voltage & frequency scaling (DVFS), which allows changes to the processor voltage and frequency at run time.
–  A computing system can trade off processor performance for power reduction.
•  Power ∝ V²f, where V is the supply voltage of the processor and f is its frequency.
•  Processor performance ∝ f. (A small numerical sketch follows below.)
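The sketch below applies the two proportionalities above to a purely CPU-bound task; it is a minimal illustration, and the frequency/voltage pairs and cycle count are hypothetical values chosen only to show the shape of the trade-off:

```python
# Rough numerical illustration of the two proportionalities above:
# dynamic power scales roughly as V^2 * f, while a purely CPU-bound
# task's run time scales as 1/f. The (f, V) operating points and the
# cycle count below are hypothetical values chosen only for illustration.

def dynamic_power(v, f_ghz):
    """Relative dynamic power, P ~ V^2 * f (arbitrary units)."""
    return v * v * f_ghz

def cpu_bound_time(work_cycles, f_ghz):
    """Run time in seconds for a purely CPU-bound task."""
    return work_cycles / (f_ghz * 1e9)

WORK = 2e12                 # hypothetical CPU work: 2 * 10^12 cycles
HIGH = (2.0, 1.4)           # hypothetical setting: 2.0 GHz at 1.4 V
LOW = (1.0, 1.0)            # hypothetical setting: 1.0 GHz at 1.0 V

for f, v in (HIGH, LOW):
    power = dynamic_power(v, f)
    time_s = cpu_bound_time(WORK, f)
    print(f"{f:.1f} GHz @ {v:.1f} V: relative power {power:.2f}, "
          f"time {time_s:.0f} s, relative energy {power * time_s:.0f}")
```

Even for this worst case (CPU-bound), the lower setting roughly halves energy while doubling run time; the next slides show that for memory- and I/O-bound codes the run-time penalty is far smaller, which is where DVFS really pays off.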



Power-Aware HPC via DVFS: Key Observation

•  Execution time of many programs is insensitive to CPU speed change.

[Chart: Execution time (%) vs. CPU speed (GHz) for the NAS IS benchmark. Slowing the clock speed by 50% degrades performance by only 4%.]


Power-Aware HPC via DVFS: Key Idea

•  Applying DVFS to these programs will result in significant power and energy savings at a minimal performance impact.

[Chart: Execution time (%) vs. energy usage (%) for the NAS IS benchmark across settings from 0.8 GHz to 2 GHz, showing the performance constraint and the energy-optimal DVS schedule.]


Why is Power Awareness via DVFS Hard?

•  What is the cycle time of a processor?
–  Frequency ≈ 2 GHz → Cycle Time ≈ 1 / (2 × 10⁹) s = 0.5 ns
•  How long does the system take to scale voltage and frequency?
–  O(10,000,000 cycles); see the arithmetic below.
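Taking the slide's order-of-magnitude figure at face value, a single voltage/frequency transition on a 2-GHz processor costs roughly

\[
\frac{10^{7}\ \text{cycles}}{2 \times 10^{9}\ \text{cycles/s}} = 5\ \text{ms},
\]

about ten million times the 0.5-ns cycle time, so DVFS decisions must be made at coarse scheduling intervals rather than cycle by cycle.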


Problem Formulation: LP-Based Energy-Optimal DVFS Schedule

•  Definitions
–  A DVFS system exports n settings { (fi, Pi) }.
–  Ti : total execution time of a program running at setting i
•  Given a program with deadline D, find a DVS schedule (t1*, …, tn*) such that
–  If the program is executed for ti seconds at setting i, the total energy usage E is minimized, the deadline D is met, and the required work is completed. (One possible LP formulation is sketched below.)
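One way to write the schedule described above as a linear program; this is a reconstruction from the slide's definitions (treating ti / Ti as the fraction of the program's work done at setting i), not a formula copied from the slides:

\[
\min_{t_1,\ldots,t_n}\; E=\sum_{i=1}^{n} P_i\,t_i
\quad\text{subject to}\quad
\sum_{i=1}^{n}\frac{t_i}{T_i}=1,\qquad
\sum_{i=1}^{n} t_i \le D,\qquad t_i \ge 0.
\]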


Performance Modeling

•  Traditional Performance Model
–  T(f) = (1 / f) * W, where T(f) (in seconds) is the execution time of a task running at f and W (in cycles) is the amount of CPU work to be done.
•  Problems?
–  W needs to be known a priori. Difficult to predict.
–  W is not always constant across frequencies.
–  It predicts that the execution time will double if the CPU speed is cut in half. (Not so for memory & I/O-bound; see the note below.)
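One way to see the third problem (this decomposition is illustrative; the exact two-coefficient model referenced on the next slide is not reproduced here): if only part of the run time scales with the clock,

\[
T(f) \;=\; \frac{W_{\text{cpu}}}{f} \;+\; T_{\text{off-chip}},
\]

then halving f less than doubles T(f) whenever the off-chip (memory/I/O) term is nonzero.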


Single-Coefficient β Performance Model

•  Our Formulation
–  Define the relative performance slowdown δ as T(f) / T(fmax) - 1
–  Re-formulate the two-coefficient model as a single-coefficient model (see the sketch below).
–  The coefficient β is computed at run time using a regression method on the past MIPS rates reported from the built-in PMU.
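The single-coefficient model is usually written as T(f)/T(fmax) = 1 + β·(fmax/f - 1); that form is taken from the SC|05 paper cited below rather than from this slide. The sketch shows one plausible way to fit β from a few timed runs and to pick the slowest frequency whose predicted slowdown stays within a bound δ; the sample frequencies and timings are hypothetical, and the actual run-time system estimates β on-line from PMU-reported MIPS rates instead.

```python
import numpy as np

def estimate_beta(freqs_ghz, times_s, f_max_ghz):
    """Least-squares fit of beta in  T(f)/T(fmax) = 1 + beta * (fmax/f - 1)."""
    freqs = np.asarray(freqs_ghz, dtype=float)
    times = np.asarray(times_s, dtype=float)
    t_max = times[np.argmax(freqs)]            # run time at the top frequency
    x = f_max_ghz / freqs - 1.0                # (fmax/f - 1) for each sample
    y = times / t_max - 1.0                    # observed relative slowdown
    return float(np.dot(x, y) / np.dot(x, x))  # slope of a line through the origin

def target_frequency(beta, delta, f_max_ghz, f_min_ghz):
    """Slowest frequency whose predicted slowdown stays within delta."""
    if beta <= 0.0:
        return f_min_ghz                       # run time insensitive to CPU speed
    f_star = f_max_ghz / (1.0 + delta / beta)
    return min(f_max_ghz, max(f_min_ghz, f_star))

# Hypothetical samples: a memory-bound code whose run time barely grows
# as the clock drops from 2.0 GHz to 1.0 GHz.
freqs = [2.0, 1.6, 1.2, 1.0]
times = [100.0, 101.1, 102.6, 104.0]

beta = estimate_beta(freqs, times, f_max_ghz=2.0)
print(f"estimated beta = {beta:.3f}")
print(f"slowest frequency for a 5% slowdown bound: "
      f"{target_frequency(beta, 0.05, 2.0, 0.8):.2f} GHz")
```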

C. Hsu and W. Feng. “A Power-Aware Run-Time System for High-Performance Computing,” SC|05, Nov. 2005.


Current DVFS Scheduling Algorithms

•  2step (i.e., CPUSPEED via SpeedStep; sketched below):
–  Using a dual-speed CPU, monitor CPU utilization periodically.
–  If utilization > pre-defined upper threshold, set the CPU to fastest; if utilization < pre-defined lower threshold, set the CPU to slowest.
•  nqPID: A refinement of the 2step algorithm.
–  Recognize the similarity of DVFS scheduling to a classical control-systems problem → modify a PID (Proportional-Integral-Derivative) controller to suit the DVFS scheduling problem.
•  freq: Reclaims the slack time between the actual processing time and the worst-case execution time.
–  Track the amount of remaining CPU work Wleft and the amount of remaining time before the deadline Tleft.
–  Set the desired CPU frequency at each interval to fnew = Wleft / Tleft.
–  The algorithm assumes that the total amount of work in CPU cycles is known a priori, which, in practice, is often unpredictable and not always constant across frequencies.
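As a concrete illustration of the first policy above, here is a minimal sketch of an interval-based 2step controller. The two speeds, the utilization thresholds, and the utilization trace are hypothetical; a real implementation (e.g., the cpuspeed daemon) reads utilization and sets frequencies through the operating system.

```python
# Minimal sketch of the interval-based 2step policy described above: a
# dual-speed CPU, sampled utilization, and two pre-defined thresholds.
# The speeds, thresholds, and utilization trace are hypothetical values.

FAST_GHZ, SLOW_GHZ = 2.0, 0.8
UPPER, LOWER = 0.80, 0.30      # hypothetical utilization thresholds

def twostep(current_ghz, utilization):
    """Return the CPU speed to use for the next interval."""
    if utilization > UPPER:
        return FAST_GHZ        # busy: jump to the fastest setting
    if utilization < LOWER:
        return SLOW_GHZ        # mostly idle: drop to the slowest setting
    return current_ghz         # in between: leave the speed alone

speed = FAST_GHZ
for util in [0.95, 0.60, 0.20, 0.10, 0.85]:   # one sample per interval
    speed = twostep(speed, util)
    print(f"utilization {util:.2f} -> next interval at {speed:.1f} GHz")
```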


β-Adaptation on Sequential Codes (SPEC)

[Chart: relative time / relative energy with respect to total execution time and system energy usage; SMALLER numbers are BETTER.]


NAS Parallel on an Athlon-64 Cluster


NAS Parallel on an Opteron Cluster


Outline

•  Motivation & Background
–  Where is High-End Computing (HEC)?
–  The Need for Efficiency, Reliability, and Availability
•  Supercomputing in Small Spaces (http://sss.cs.vt.edu/)
–  Distant Past: Green Destiny (2001-2002)
–  Recent Past to Present: Evolution of Green Destiny (2003-2007)
•  Architectural
–  MegaScale, Orion Multisystems, IBM Blue Gene/L
•  Software-Based
–  EnergyFit™: Power-Aware Run-Time System (β adaptation)

•  Conclusion and Future Work


Opportunities Galore!

•  Component Level
–  CPU, e.g., β (Library vs. OS)
–  Memory
–  Disk
–  System
•  Resource Co-scheduling
–  Across Systems, e.g., Job Scheduling
•  Abstraction Level
–  Application Transparent
•  Compiler: β at compile time
•  Library: β at run time
•  OS: EnergyFit, EcoDaemon
–  Application Modified (Too intrusive)
–  Parallel Library Modified (e.g., P. Raghavan & B. Norris, IPDPS'05)
•  Directed Monitoring
–  Hardware performance counters


Selected References: Available at http://sss.cs.vt.edu/

Green Supercomputing
•  "Green Supercomputing Comes of Age," IT Professional, Jan.-Feb. 2008.
•  "The Green500 List: Encouraging Sustainable Supercomputing," IEEE Computer, Dec. 2007.
•  "Green Supercomputing in a Desktop Box," IEEE Workshop on High-Performance, Power-Aware Computing at IEEE IPDPS, Mar. 2007.
•  "Making a Case for a Green500 List," IEEE Workshop on High-Performance, Power-Aware Computing at IEEE IPDPS, Apr. 2006.
•  "A Power-Aware Run-Time System for High-Performance Computing," SC|05, Nov. 2005.
•  "The Importance of Being Low Power in High-Performance Computing," Cyberinfrastructure Technology Watch (NSF), Aug. 2005.
•  "Green Destiny & Its Evolving Parts," International Supercomputer Conf. (Innovative Supercomputer Architecture Award), Apr. 2004.
•  "Making a Case for Efficient Supercomputing," ACM Queue, Oct. 2003.
•  "Green Destiny + mpiBLAST = Bioinfomagic," 10th Int'l Conf. on Parallel Computing (ParCo'03), Sept. 2003.
•  "Honey, I Shrunk the Beowulf!," Int'l Conf. on Parallel Processing, Aug. 2002.


Conclusion

•  Recall that …
–  We need horsepower to compute but horsepower also translates to electrical power consumption.
•  Consequences
–  Electrical power costs $$$$.
–  "Too much" power affects efficiency, reliability, availability.
•  Solution
–  Autonomic energy and power savings while maintaining performance in the supercomputer and datacenter.
•  Low-Power Architectural Approach
•  Power-Aware Commodity Approach


Acknowledgments

•  Contributions
–  Jeremy S. Archuleta (LANL & VT): Former LANL Intern, now Ph.D. student
–  Chung-hsing Hsu (LANL): Former LANL Postdoc
–  Michael S. Warren (LANL): Former LANL Colleague
–  Eric H. Weigle (UCSD): Former LANL Colleague
•  Sponsorship & Funding (Past)
–  AMD
–  DOE Los Alamos Computer Science Institute
–  LANL Information Architecture Project
–  Virginia Tech


Wu FENG [email protected]

http://sss.cs.vt.edu/

http://synergy.cs.vt.edu/

http://www.green500.org/

http://www.mpiblast.org/

http://www.chrec.org/