Transcript
Page 1: Warped-DMR: Light-weight Error detection for GPGPU · 2012-12-08

For Inter-Warp DMR

• The Replay Checker checks the instruction types and commands a replay (DMR) in the following cycle if the types differ

• When instructions of the same type are issued consecutively, the ReplayQ holds them so that they can be verified any time later, whenever the corresponding execution unit becomes available

  – Each entry stores the opcode, operands, and original execution result for 32 threads (around 500 B per entry)
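The ~500 B entry size is consistent with storing per-thread values for a full 32-thread warp. A back-of-envelope check (the field widths and operand count are assumptions, not stated on the poster):

```python
# Back-of-envelope check of the ~500 B ReplayQ entry size.
# Assumed: 32-bit values, up to three source operands per instruction.
threads = 32
result_bytes  = 4 * threads       # one 32-bit original result per thread
operand_bytes = 3 * 4 * threads   # up to three 32-bit source operands per thread
opcode_bytes  = 4                 # opcode + metadata for the whole warp
total = result_bytes + operand_bytes + opcode_bytes
print(total)  # 516 -- the same order as the ~500 B quoted above
```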

Inter-Warp DMR: Exploiting underutilized resources among heterogeneous units

• In any fully utilized warp, the unused execution units conduct DMR of the previous warp's unverified execution

• If the stored original execution result and the new result mismatch → ERROR detected!!
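The inter-warp mechanism can be sketched as a small behavioral model in Python (illustrative only, not the authors' hardware; the entry fields follow the ReplayQ description above, and `alu` is a hypothetical stand-in for an execution unit):

```python
# Behavioral sketch of Inter-Warp DMR: unverified instructions wait in a
# ReplayQ; when the matching execution unit is idle in a later cycle, the
# instruction is re-executed and compared against the stored original result.
from collections import deque

class ReplayQ:
    def __init__(self, size):
        self.size = size
        self.q = deque()

    def enqueue(self, entry):
        # entry: (unit, opcode, operands, original_result), roughly the
        # opcode/operands/result stored per 32-thread warp on the poster.
        if len(self.q) >= self.size:
            return False              # queue full: must verify immediately
        self.q.append(entry)
        return True

    def replay_on_idle(self, idle_unit, execute):
        """Re-execute the oldest queued instruction for an idle unit."""
        for entry in list(self.q):
            unit, opcode, operands, original = entry
            if unit == idle_unit:
                self.q.remove(entry)
                redo = execute(opcode, operands)
                return redo == original   # False -> ERROR detected
        return None                       # nothing to verify on this unit

def alu(opcode, operands):
    # Hypothetical fault-free execution unit for the sketch.
    if opcode == "add":
        return operands[0] + operands[1]

rq = ReplayQ(size=10)
rq.enqueue(("SP", "add", (1, 2), 3))    # correct original result
rq.enqueue(("SP", "add", (4, 5), 42))   # corrupted original result
print(rq.replay_on_idle("SP", alu))     # True  -> results match, OK
print(rq.replay_on_idle("SP", alu))     # False -> mismatch, ERROR detected!!
```

A bounded queue matters: as the results section shows, the ReplayQ size trades stalls (small queue) against storage, which is why overhead falls as the queue grows.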

Warped-DMR

Light-weight Error detection for GPGPU

Hyeran Jeon and Murali Annavaram

University of Southern California

MOTIVATION

ARCHITECTURAL SUPPORT

WARPED-DMR ABSTRACT

CONTACT

Hyeran Jeon

Email: [email protected]

Murali Annavaram

Email: [email protected]

For many scientific applications that commonly run on supercomputers, program correctness is as important as performance. A few soft or hard errors could lead to corrupt results and can potentially waste days or even months of computing effort. In this research we exploit unique architectural characteristics of GPGPUs to propose a light-weight error detection method, called Warped Dual Modular Redundancy (Warped-DMR). Warped-DMR detects errors in computation by relying on opportunistic spatial and temporal dual-modular execution of code. Warped-DMR is light-weight because it exploits the underutilized parallelism in GPGPU computing for error detection. Error detection spans both within a warp and between warps, called intra-warp and inter-warp DMR, respectively. Warped-DMR achieves 96% error coverage while incurring a worst-case 16% performance overhead, without extra execution units or programmer effort.

Intra-Warp DMR: Exploiting underutilized resources among homogeneous units

• For any underutilized warp, the inactive threads within the warp duplicate the active threads' execution

• The active mask gives a hint for duplication selection

• If the results of the inactive and active threads mismatch → ERROR detected!!

For Intra-Warp DMR

• The Register Forwarding Unit (RFU) makes each pair of active and inactive threads use the same operands: the RFU forwards the active thread's register value to the inactive thread according to the active mask

  – Overhead: 0.08 ns and 390 µm² @ Synopsys Design Compiler

• Thread-Core mapping increases error coverage by modifying the thread-core affinity in the scheduler
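A behavioral sketch of the intra-warp pairing (illustrative Python, not the RFU hardware; the pairing of lanes and function names are assumptions — in hardware the forwarding happens on register values in flight, and a real fault would make the duplicated result differ):

```python
# Sketch of Intra-Warp DMR: each inactive lane is paired with an active lane,
# receives the active lane's operands (the RFU's job), re-executes the
# instruction, and a comparator checks the two results.

def intra_warp_dmr(active_mask, operands, op):
    """active_mask: 0/1 per lane; operands: per-lane operand tuples."""
    lanes = len(active_mask)
    active   = [i for i in range(lanes) if active_mask[i]]
    inactive = [i for i in range(lanes) if not active_mask[i]]

    # Active lanes execute normally.
    results = {i: op(*operands[i]) for i in active}

    # Each paired inactive lane duplicates an active lane's work using the
    # forwarded operands; a mismatch means an error in one of the two units.
    errors = []
    for a, b in zip(active, inactive):
        dup = op(*operands[a])        # inactive lane b redoes lane a's work
        if dup != results[a]:
            errors.append(a)          # mismatch -> ERROR detected!!
    return results, errors

# Usage: mask 1100 -- two active lanes, two idle lanes available for DMR.
res, errs = intra_warp_dmr([1, 1, 0, 0], [(1, 2), (3, 4), None, None],
                           lambda x, y: x + y)
print(res)   # {0: 3, 1: 7}
print(errs)  # []  (this software model is fault-free, so no mismatch)
```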

• Scientific computing is different from multimedia

  – Correctness matters

  – Some vendors began to add memory protection schemes to GPUs

• But what about execution units?

  – A larger portion of the die area is assigned to execution units in GPUs

  – Vast number of cores → higher probability of computation errors

• Underutilization among Homogeneous Units

  – Since threads within a warp share a PC value, in a diverged control flow some threads must execute one path while the others sit idle

• Underutilization among Heterogeneous Units

  – The dispatcher issues an instruction to one of three execution units at a time

  – In the worst case, two of the three execution units become idle

[Chart: execution-time breakdown (0–100%) vs. number of active threads, from 32 down to 3]

RESULTS

Warped-DMR (Intra-Warp DMR + Inter-Warp DMR) covers 96% of computations with 16% performance overhead, without extra execution units

BACKGROUND

• Instructions are executed in batches of threads called warps (or wavefronts)

  – Threads within a warp run in lock-step by sharing a PC

• Instructions are categorized into 3 types and executed on the corresponding execution units

  – Arithmetic operations on SPs, memory operations on LD/ST units, transcendental instructions (e.g., sin, cosine) on SFUs
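The 3-way categorization can be sketched as a tiny dispatch rule (the opcode lists are illustrative, loosely following the PTX mnemonics shown in the poster's figures):

```python
# Sketch of the 3-type instruction categorization: each opcode is routed to
# one of the SP, LD/ST, or SFU execution units (opcode lists are assumptions).

def unit_for(opcode):
    base = opcode.split(".")[0]
    if base in ("sin", "cos", "rsqrt", "ex2"):
        return "SFU"    # transcendental instructions
    if base in ("ld", "st"):
        return "LD/ST"  # memory operations
    return "SP"         # arithmetic operations

print(unit_for("sin.f32"))        # SFU
print(unit_for("ld.shared.f32"))  # LD/ST
print(unit_for("add.f32"))        # SP
```

This per-type routing is exactly what creates the heterogeneous underutilization Warped-DMR exploits: in any cycle, the two unit groups not targeted by the issued instruction are idle.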

[Diagram: GPU hierarchy — a kernel is split into thread blocks, warps, and threads; each SM contains a scheduler/dispatcher, register file, SP/LD-ST/SFU execution units, and local memory; SMs share global memory]

[Timeline: typical GPU execution — in each cycle only one of the SP, LD/ST, and SFU unit groups is busy]

warp4: sin.f32 %f3, %f1
warp1: ld.shared.f32 %f20, [%r99+824]
warp2: add.f32 %f16, %f14, %f15
warp1: ld.shared.f32 %f21, [%r99+956]
warp2: add.f32 %f18, %f12, %f17
warp3: ld.shared.f32 %f2, [%r70+4]

if (cond) {
    b++;
} else {
    b--;
}
a = b;
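The divergent branch above leaves some lanes idle in every cycle. A minimal lock-step SIMT model (a 4-lane warp is assumed here for brevity) makes the complementary active masks visible:

```python
# Minimal lock-step SIMT model of the branch above (4-lane warp assumed).
# On divergence the warp executes both paths serially under complementary
# active masks; masked-off lanes sit idle -- the slack Intra-Warp DMR reuses.

def run_branch(cond, b):
    taken_mask     = [c for c in cond]        # lanes executing b++
    not_taken_mask = [not c for c in cond]    # lanes executing b--

    for i, active in enumerate(taken_mask):   # pass 1: if-path
        if active:
            b[i] += 1
    for i, active in enumerate(not_taken_mask):  # pass 2: else-path
        if active:
            b[i] -= 1
    return b

print(run_branch([True, True, False, False], [10, 10, 10, 10]))
# [11, 11, 9, 9]
```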

[Diagram: with a divergent branch, SP1 executes b++ while SP2 idles, then SP2 executes b-- while SP1 idles; with Intra-Warp DMR the idle SP duplicates the active SP's instruction (DMR) and the pair of results is compared (COMP) — same → OK, different → ERROR!! followed by Flush & Error Handling]

[Diagram: with Inter-Warp DMR, SP/LD-ST/SFU units left idle in a cycle re-execute (DMR) unverified instructions from previous warps]

< Code | Typical GPU execution | With Intra-Warp DMR >

< Code | Typical GPU execution | With Inter-Warp DMR >

< Execution time breakdown with respect to the number of active threads >

Over 30% of BitonicSort's execution time runs with only 16 active threads; 40% of BFS's execution time runs with a single active thread

2 types of Underutilization in GPGPU computing

WARPED-DMR : EXPLOITING THE UNDERUTILIZATIONS FOR ERROR DETECTION

Can we use these idle resources?

[Diagram: Intra-Warp DMR datapath — thread registers th0.r0/r1 … th3.r0/r1 feed four SPs through RF → EXE → WB; the Register Forwarding Unit, driven by the active mask (e.g., 1100), forwards the active threads' register values (th3.r1, th2.r1) to the inactive lanes so the operand vector becomes th3.r1 th2.r1 th3.r1 th2.r1, and a Comparator checks the duplicated results — mismatch → ERROR!!]

[Diagram: Inter-Warp DMR pipeline — instructions flow DEC → RF → EXE across heterogeneous units (SP cores, MEM, SFU); instructions are enqueued into and searched in the ReplayQ, idle units perform the DMR re-execution, and a CHECKER compares the replayed result against the original ("same" → OK)]

< Error coverage w.r.t. SIMT cluster organization and Thread-to-Core mapping >

[Chart: error coverage (%) with 4-core and 8-core clusters — 89.60, 91.91, and 96.43 across the evaluated configurations]

< Normalized Kernel Simulation Cycles w.r.t. ReplayQ size >

[Chart: normalized simulation cycles of 1.41, 1.32, 1.24, and 1.16 for ReplayQ sizes 0, 1, 5, and 10 (recommended) — the largest size is consistent with the 16% worst-case overhead reported above]