Operational Machines: ASCI White
Presented to SOS7

Mark Seager
[email protected]
925-423-3141

ICCD ADH for Advanced Technology
Lawrence Livermore National Laboratory

This work was performed under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48.


Q1: Is your machine living up to the performance expectations? If yes, how? If not, what is the root cause?

• ASCI White is providing robust cycles to the tri-laboratory community
• Application performance relative to peak is less than expected
  – FMA is only 5-16% of floating point arithmetic instructions issued
  – Some users sacrifice (turn off) compiler optimization to have strict reproducibility
• Modern coding techniques lead to poor memory bandwidth utilization
  – Low cache-line payload utilization
  – OOP and non-uniform grids require several memory references per floating point operation
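
One reason optimization and strict reproducibility can conflict is that a fused multiply-add rounds once, while a separate multiply and add round twice, so the same source expression can give different bits depending on whether the compiler emits FMA. The sketch below is only an illustration of that effect, not something taken from the slides: it emulates the single rounding with exact rational arithmetic, and the function names and input values are hypothetical.

```python
from fractions import Fraction


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Multiply, round to double, then add and round again (two roundings)."""
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate FMA: form a*b + c exactly, then round to double once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Hypothetical inputs chosen so the exact product, 1 - 2**-60,
# is not representable as a double and rounds to 1.0.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print("unfused:", unfused_multiply_add(a, b, c))
print("fused:  ", fused_multiply_add(a, b, c))
```

With these inputs the unfused result is exactly zero while the fused result keeps the small residual, which is the kind of bit-level difference that leads some users to disable optimization.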

Q2: What is the MTBI?

[Figure: MTBF for White, Frost, and Ice, 1/5/2001 through 2/5/2003. Axes: Hours (White, Frost, Ice), 0 to 180; Hr/Node, 0 to 120,000. Series: MTBF (hr), MTBF (hr/node), Linear (MTBF (hr/node)), with trendline y = 30.629x - 1E+06, R² = 0.2034.]

NH-2 MTBF is about 26,000 hr/node, or 51 hours for White. Typical applications (of 1/3 machine size or smaller) run for weeks at a time.
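
The per-node and whole-machine figures above are related by simple arithmetic if node failures are treated as independent with a constant rate. The sketch below works through that relationship. The 512-node count for White, the two-week run length, and the function names are assumptions for illustration and do not appear on the slide.

```python
def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """MTBF of a set of nodes, assuming independent, constant-rate failures."""
    return node_mtbf_hours / node_count


def expected_interrupts(run_hours: float, node_mtbf_hours: float,
                        nodes_used: int) -> float:
    """Expected number of node failures hitting a job over its run time."""
    return run_hours * nodes_used / node_mtbf_hours


NODE_MTBF_HOURS = 26_000   # from the slide
WHITE_NODES = 512          # assumption: node count of ASCI White

# Full machine: 26,000 / 512 is about 51 hours, matching the slide.
print(system_mtbf_hours(NODE_MTBF_HOURS, WHITE_NODES))

# A job on one third of the machine, running two weeks (assumed length).
third_of_machine = WHITE_NODES // 3
print(system_mtbf_hours(NODE_MTBF_HOURS, third_of_machine))
print(expected_interrupts(14 * 24, NODE_MTBF_HOURS, third_of_machine))
```

Under these assumptions a job that runs for weeks on a third of the machine should expect more than one hardware interrupt, which is why long runs depend on checkpoint and restart.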

What are the topmost reasons for HW interrupts?

[Figure: Count of Id (0 to 300) by Type and Subtype, stacked by week ending 7/26/2002 through 2/21/2003. Filters: Sect = whit, Imp = yes. Type: HW. Subtypes: HW-CPU, HW-IO, HW-LOCAL_DISK, HW-MEMORY, HW-MOTHERBOARD, HW-NODE_SWAP, HW-OTHER, HW-POWER_SUPPLY, HW-SSA_ADAPTER, HW-SSA_DISK, HW-SSA_OTHER, HW-SWITCH, HW-COLONY_ADAPT, HW-RIO, HW-POWER, HW-SSA_DISK-Hot, HW-DATARAM_MEMO, HW-3RD_PARTY_DI.]

What are the topmost reasons for SW interrupts?

[Figure: Count of Id (0 to 120) by Type and Subtype, stacked by week ending 7/26/2002 through 2/21/2003. Filters: Sect = whit, Imp = yes. Types: SW, zLocal. Subtypes: SW-COMM_SS, SW-GPFS, SW-LL_POE, SW-OS, SW-OTHER, SW-COMPILER, LOCAL-DPCS, LOCAL-NETWORK, LOCAL-OTHER, LOCAL-POWER.]

What is the average utilization rate?

[Figure: Utilization (percent), 0 to 100, by month from Dec-96 through Feb-99. Series: Blue, Sky, White, Frost, Ice.]

Q3: What is the primary complaint, if any, from the users?

• Not enough time on the machine
  – Users want more access
• Scalability of MPI
  – MPI_ALLREDUCE
  – MPI_BARRIER
• Extremely long job startup