Dataflow: A Complement to Superscalar

Mihai Budiu – Microsoft Research

Pedro V. Artigas – Carnegie Mellon University

Seth Copen Goldstein – Carnegie Mellon University

Computer Architecture: A Simplified History

[Figure: timeline, 1967 to 1990; tracks: superscalar, dataflow]

This Work

• Re-evaluate dataflow
  – Same workloads as superscalar (C programs: Mediabench, Spec)
  – Modern performance analysis tool (whole-program critical path)

• Use of superscalar mechanisms in dataflow

Why Study Dataflow

• Naturally exploit ILP
• Potentially very high ILP
• Simple, regular microarchitecture
• Very low power [1/1000 superscalar]
• Suitable for stream processing

Outline

• Motivation
• ASH: A Static Dataflow Model
• Explaining bottlenecks
• Conclusions

Application-Specific Hardware

C program

Compiler

Dataflow IR

Computation = Dataflow

x = a & 7;
...
y = x >> 2;

Program       Dataflow IR      Circuits
Operations    Nodes            Pipeline stages
Variables     Def-use edges    Channels (wires)

Pure dataflow: no program counter
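The mapping above (operations become nodes, variables become channels, and nothing sequences the nodes except token availability) can be sketched in C. This is a minimal illustration, not the ASH implementation: the `Channel`, `Graph`, `step`, and `run` names are hypothetical, and the software loop only emulates what the hardware would do concurrently.

```c
#include <assert.h>
#include <stdbool.h>

/* A channel (wire) carries one token: a value plus a "valid" flag. */
typedef struct { int value; bool valid; } Channel;

/* Hypothetical two-node graph for: x = a & 7; y = x >> 2; */
typedef struct {
    Channel a;   /* input channel */
    Channel x;   /* def-use edge between the two nodes */
    Channel y;   /* output channel */
} Graph;

/* Fire every node whose input token is present and whose output
   channel is empty. Returns the number of nodes that fired. */
static int step(Graph *g) {
    int fired = 0;
    /* The downstream node is checked first, so a token advances
       at most one stage per step, as in a pipeline. */
    if (g->x.valid && !g->y.valid) {
        g->y.value = g->x.value >> 2;
        g->y.valid = true;
        g->x.valid = false;
        fired++;
    }
    if (g->a.valid && !g->x.valid) {
        g->x.value = g->a.value & 7;
        g->x.valid = true;
        g->a.valid = false;
        fired++;
    }
    return fired;
}

/* Inject one token on a and run until no node can fire. */
static int run(int a) {
    Graph g = { { a, true }, { 0, false }, { 0, false } };
    while (step(&g) > 0) { }
    assert(g.y.valid);  /* the single input token must reach y */
    return g.y.value;
}
```

There is no instruction pointer in `step`: the only thing that decides whether a node computes is the state of its input and output channels.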

Basic Computation = Pipeline Stage

[Figure: a + operation with a latch]

Control Flow => Data Flow

[Figure: control-flow constructs as dataflow nodes; edge labels: data, predicate, p]

• Merge (label)
• Gateway
• Split (branch)

int sum=0, i;
for (i=0; i < 100; i++)
    sum += i*i;
return sum;

[Figure: dataflow graph of this loop; node labels: +1, < 100, !]
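One way to read this slide is that the loop's branch turns into predicates that steer tokens: values go around the back edge while `i < 100` holds and leave the loop when it does not. The sketch below emulates that in C under stated assumptions: `gateway` and `sum_of_squares` are hypothetical names, the `while` loop merely stands in for tokens circulating in hardware, and the merge nodes are modeled by the variables `i` and `sum`.

```c
#include <assert.h>
#include <stdbool.h>

/* Gateway: forwards the data token only when its predicate is true.
   Returns whether a token was forwarded. */
static bool gateway(int data, bool pred, int *out) {
    if (pred) *out = data;
    return pred;
}

/* Emulates: for (i=0; i<limit; i++) sum += i*i; return sum; */
static int sum_of_squares(int limit) {
    int i = 0, sum = 0;   /* merge nodes, holding the loop-entry values */
    int result = 0;
    bool done = false;
    while (!done) {
        bool p = i < limit;   /* loop predicate */
        int next_i = 0, next_sum = 0;
        /* Split: under p the tokens take the back edge ... */
        bool back_i = gateway(i + 1, p, &next_i);
        bool back_sum = gateway(sum + i * i, p, &next_sum);
        /* ... and under !p the sum leaves the loop. */
        done = gateway(sum, !p, &result);
        /* Exactly one of the two directions receives tokens. */
        assert(back_i == back_sum && back_i != done);
        if (back_i) { i = next_i; sum = next_sum; }
    }
    return result;
}
```

Note that `sum + i * i` is computed before the predicate is consulted; only the gateway decides whether the value is used.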

Comparison: Idealized Simulation

• Compared to 4-wide OOO SimpleScalar
• Same operation latencies
• Same memory hierarchy (LSQ, L1, L2)
• not free

Obvious!

ASH runs at full dataflow speed and has no resource limitations, so the CPU cannot do any better (if the compilers are equally good)

SpecInt95, ASH vs 4-way OOO

Outline

• Motivation
• ASH: A Static Dataflow Model
• Dissection: explaining bottlenecks
• Conclusions

The Scalpel

[Figure: tool flow; labels: C, CASH, ASH, Simulator, ASH trace, drawings]

Dynamic Critical Path

Automatic analysis

The (Loop) Body

for (j = 0; X[j].r != 0xF; j++)
    if (X[j].r == i)
        break;

SpecINT95: 124.m88ksim, init_processor()
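To experiment with this loop outside the benchmark, it can be wrapped in a small function. The sketch below is an assumption-laden stand-in, not the m88ksim source: the `Entry` layout and the `find_entry` name are hypothetical, and the 20-byte record size is inferred from the `addiu $v1,$v1,20` stride in the MIPS gcc code; only the fields `r` and `q` appear in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: five 4-byte fields, 20 bytes in total. */
typedef struct {
    int32_t r;
    int32_t q;
    int32_t pad[3];   /* assumed filler; not named in the slides */
} Entry;

/* Returns the index of the first entry whose r equals i, or the index
   of the 0xF sentinel if there is no such entry. The table must end
   with an entry whose r is 0xF. */
static int find_entry(const Entry *X, int32_t i) {
    assert(X != NULL);
    int j;
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
    return j;
}
```

Each iteration needs the loaded `X[j].r` twice (once per comparison), and the next load address depends on `j`, which is what makes the load and the two predicates the interesting part of the loop.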

[Figure: dataflow graph of the loop body; labels: load predicate, loop predicate, sizeof(X[j]), definition]

MIPS gcc Code

LOOP:
L1: beq $v0,$a1,EXIT ; X[j].r == i
L2: addiu $v1,$v1,20 ; &X[j+1].r
L3: lw $v0,0($v1) ; X[j+1].r
L4: addiu $a0,$a0,1 ; j++
L5: bne $v0,$a3,LOOP ; X[j+1].r != 0xF

L1=>L2=>L3=>L5=>L1

4-instruction loop-carried dependence
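A dependence cycle that spans one iteration bounds how fast the loop can run: the next iteration cannot start the cycle until the previous one has finished it, so the iteration time is at least the sum of the latencies on the cycle. The helper below is a small worked sketch of that arithmetic; `recurrence_bound` is a hypothetical name and the slides do not give per-instruction latencies.

```c
#include <assert.h>
#include <stddef.h>

/* Lower bound, in cycles per iteration, imposed by one loop-carried
   dependence cycle: the sum of the latencies of the instructions on it. */
static int recurrence_bound(const int *latency, size_t n) {
    int total = 0;
    for (size_t k = 0; k < n; k++) {
        assert(latency[k] > 0);   /* every instruction takes time */
        total += latency[k];
    }
    return total;
}
```

For example, assuming 1 cycle each for `beq`, `addiu`, and `bne` and 2 cycles for `lw` (illustrative values only), the chain L1, L2, L3, L5 gives `recurrence_bound((int[]){1, 1, 2, 1}, 4)`, that is, 5 cycles per iteration.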

If Branch Prediction Correct

L1=>L2=>L3=>L5=>L1

L2: addiu $v1,$v1,20 ; &X[j+1].r

L3: lw $v0,0($v1) ; X[j+1].r

L4: addiu $a0,$a0,1 ; j++

SpecInt95, perfect prediction

[Chart: speed-up; labels: prediction, no data]

Critical Path with Prediction

Loads are not speculative

Prediction + Load Speculation

~4 cycles! Load not pipelined (self-anti-dependence)

ack edge

OOO Pipe Snapshot

IF DA EX WB CT

L3 L3 L3

register renaming

L2: addiu $v1,$v1,20 ; &X[j+1].r

L3: lw $v0,0($v1) ; X[j+1].r

L4: addiu $a0,$a0,1 ; j++

Conclusions: Limitations of Static Dataflow

1. dataflow state is “more” distributed

2. “control” dependences still limit ILP

3. nontrivial to squash distributed speculation

4. good prediction may need global information

5. self-antidependences can be critical

(removed by register renaming)

6. distributed computation => more remote accesses

7. more synchronization in dataflow (“join” is not free)

Unrolling Does Not Help

for (i = 0; i < 64; i++) {
    for (j = 0; X[j].r != 0xF; j += 2) {
        if (X[j].r == i)
            break;
        if (X[j+1].r == 0xF)
            break;
        if (X[j+1].r == i)
            break;
    }
    Y[i] = X[j].q;
}

when 1 iteration

How Performance Is Evaluated

[Figure labels: Unlimited ILP, static dataflow, LSQ, L1 8K, L2 1/4M, SimpleScalar]

Last-Arrival Events

• Event enabling the generation of a result
• May be an ack
• Critical path = collection of last-arrival edges

1. Start from last node
2. Trace back along last-arrival edges
3. Some edges may repeat
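The trace-back procedure can be sketched in C over a recorded execution trace. This is a minimal illustration under assumptions, not the authors' analysis tool: the `Event` record, `last_arrival`, and `critical_path` are hypothetical names, arrival times are assumed to be non-negative cycle counts, and an input may stand for either a data token or an ack.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INPUTS 4
#define NO_EVENT (-1)

/* One dynamic event in a hypothetical execution trace. */
typedef struct {
    int inputs[MAX_INPUTS];    /* indices of the producer events (data or ack) */
    int arrival[MAX_INPUTS];   /* cycle at which each input arrived */
    int n_inputs;              /* 0 for a source event */
} Event;

/* Index of the producer whose signal arrived last (the one that
   enabled this event), or NO_EVENT for a source event. */
static int last_arrival(const Event *e) {
    assert(e->n_inputs >= 0 && e->n_inputs <= MAX_INPUTS);
    int best = NO_EVENT, best_time = -1;
    for (int k = 0; k < e->n_inputs; k++) {
        if (e->arrival[k] > best_time) {
            best_time = e->arrival[k];
            best = e->inputs[k];
        }
    }
    return best;
}

/* Start from event `last` and follow last-arrival edges backwards.
   Writes the visited events (last to first) into `path`, up to `cap`
   entries, and returns how many were written. */
static size_t critical_path(const Event *trace, int last,
                            int *path, size_t cap) {
    size_t n = 0;
    for (int cur = last; cur != NO_EVENT && n < cap;
         cur = last_arrival(&trace[cur])) {
        path[n++] = cur;
    }
    return n;
}
```

Aggregating the returned path by the static graph edge each step corresponds to would show which edges repeat most often, which is the kind of information step 3 refers to.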

History

Out-of-order, Branch pred, Speculation

• Tomasulo: IBM 360
• Thornton: CDC, 1964
• Karp: Graph model
• Smith: Br pred, 1981
• Fisher: VLIW
• Cocke: Superscalar
• Smith: Precise spec
• Dennis: Dataflow lang
• Burger: TRIPS, 2001
• Oskin: WaveScalar
• Arvind: Tagged-token
• Papadopoulos: Monsoon
