80
Shonan, September 2018 Magnus Själander Hardware-Software Codesign for Efficient Computing Magnus Själander Associate Professor Norwegian University of Science and Technology

Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Hardware-Software Codesign

for Efficient Computing

Magnus Själander

Associate Professor

Norwegian University of Science and Technology

Page 2: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

The Free Lunch is Over

• Technology and frequency scaling are at a standstill

• Most (all) designs are power limited

• Performance improvements require solutions that are

more energy efficient, two examples:

– Co-optimization of hardware and software

– Specialization

• Both cases require strong compiler support to extract

and represent program properties

Page 3: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

1. Rethinking the HW-SW Interface

2. Software Programmable Accelerator

3. A New Intermediate Representation

Page 4: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Collaborators

Kim-Anh Tran Alexandra Jimborean Trevor Carlson Stefanos Kaxiras Konstantinos Koukos

Vasileios SpiliopoulosYaman Umuroglu Lahiru Rasnayake Nico Reissmann Jan Christian Meyer

Page 5: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

1. Rethinking the HW-SW InterfaceSWOOP: Software-Hardware Co-design for Non-speculative, Execute-Ahead, In-Order Cores K. A. Tran, A. Jimborean, T. E. Carlsson, K. Koukus M. Själander, and S. Kaxiras

PLDI 2018

Page 6: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

for (int i=0; i<N; i++) {

tmp = x[i];

if (tmp > 0)

sum = tmp + sum;

}

Page 7: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 8: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 9: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 10: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 11: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 12: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 13: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 14: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 15: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 16: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 17: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 18: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 19: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

Page 20: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

• SWOOP targets loops

Page 21: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

for( ;;) {

}

A

E

Load registers

Execut e

Access

• SWOOP targets loops

• Amenable loops are split

into an Access and an

Execute phase (DAE)

Page 22: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

for( ;;) {

}

A

E

Load registers

Execut e

Access

Check m iss

• SWOOP targets loops

• Amenable loops are split

into an Access and an

Execute phase (DAE)

• Check miss instruction

Page 23: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

for( ;;) {

}

A

E

Load registers

A+ 1

A+ 2

Alt ernat ive Pat h

Execut e

Access

Check m iss

• SWOOP targets loops

• Amenable loops are split

into an Access and an

Execute phase (DAE)

• Check miss instruction

• Branch to alternative path

on load miss

Page 24: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

for( ;;) {

}

A

E

Load registers

A+ 1

E+ 1

A+ 2

E+ 2

Alt ernat ive Pat h

Execut e

Access

Check m iss

E

• SWOOP targets loops

• Amenable loops are split

into an Access and an

Execute phase (DAE)

• Check miss instruction

• Branch to alternative path

on load miss

Page 25: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

A0

E0

for( ;;) {

}

A

E

Norm al Execut ion

Load registers

A+ 1

E+ 1

A+ 2

E+ 2

Alt ernat ive Pat h

Execut e

Access

Check m iss

E

Page 26: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

A0

E0

A1

E1

for( ;;) {

}

A

E

Norm al Execut ion

Load registers

A+ 1

E+ 1

A+ 2

E+ 2

Alt ernat ive Pat h

Execut e

Access

Check m iss

E

Page 27: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

A0

E0

A1

E1

A2

E2

for( ;;) {

}

A

E

Norm al Execut ion

Load registers

A+ 1

E+ 1

A+ 2

E+ 2

Alt ernat ive Pat h

Execut e

Access

Check m iss

E

Page 28: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

A0

E0

A1

E1

A2

E2

A0for( ;;) {

}

A

E

Norm al Execut ion

Load registers

A+ 1

E+ 1

A+ 2

E+ 2

Alt ernat ive Pat h

Execut e

Access

Alt erna t ive Pat h

Check m iss

E

Page 29: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

A0

E0

A1

E1

A2

E2

A0for( ;;) {

}

A

E

Long latency load

Norm al Execut ion

Load registers

A+ 1

E+ 1

A+ 2

E+ 2

Alt ernat ive Pat h

Execut e

Access

Alt erna t ive Pat h

Check m iss

E

Page 30: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

A0

E0

A1

E1

A2

E2

A0

A1

for( ;;) {

}

A

E

Long latency load

Norm al Execut ion

Load registers

A+ 1

E+ 1

A+ 2

E+ 2

Alt ernat ive Pat h

Execut e

Access

Alt erna t ive Pat h

Check m iss

E

Page 31: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

A0

E0

A1

E1

A2

E2

A0

A1

A2

for( ;;) {

}

A

E

Long latency load

Norm al Execut ion

Load registers

A+ 1

E+ 1

A+ 2

E+ 2

Alt ernat ive Pat h

Execut e

Access

Alt erna t ive Pat h

Check m iss

E

Page 32: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

A0

E0

A1

E1

A2

E2

A0

A1

A2

E0

for( ;;) {

}

A

E

Long latency load

Norm al Execut ion

Load registers

A+ 1

E+ 1

A+ 2

E+ 2

Alt ernat ive Pat h

Execut e

Access

Alt erna t ive Pat h

Check m iss

E

Page 33: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

A0

E0

A1

E1

A2

E2

A0

A1

A2

E0

E1

for( ;;) {

}

A

E

Long latency load

Norm al Execut ion

Load registers

A+ 1

E+ 1

A+ 2

E+ 2

Alt ernat ive Pat h

Execut e

Access

Alt erna t ive Pat h

Check m iss

E

Page 34: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

A0

E0

A1

E1

A2

E2

A0

A1

A2

E0

E1

E2

for( ;;) {

}

A

E

Long latency load

Norm al Execut ion

Load registers

A+ 1

E+ 1

A+ 2

E+ 2

Alt ernat ive Pat h

Execut e

Access

Alt erna t ive Pat h

Check m iss

E

Page 35: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

SWOOP Software/Execution Model

loop

body

for( ;;) {

}

DAE

Use registers

A0

E0

A1

E1

A2

E2

A0

A1

A2

E0

E1

A3

E2

for( ;;) {

}

A

E

Long latency load

Norm al Execut ion

Load registers

A+ 1

E+ 1

A+ 2

E+ 2

Alt ernat ive Pat h

Execut e

Access

Alt erna t ive Pat h

Check m iss

E

Page 36: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Challenges

Creating efficient decoupled access execute phases

• memory dependencies

• complex control flow

• register pressure

while being non-speculative and minimizing code

duplication

Page 37: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Increased Register Pressure

Page 38: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Increased Register Pressure

Increased number of registers

Increased live ranges

Page 39: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Increased Register PressureA register spill stores the value

to the stack, which is a use of

the value and causes a stall

Page 40: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Increased Register Pressure

However, communication is

local to the Access and Execute

Page 41: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Increased Register Pressure

ContextHowever, communication is

local to the Access and Execute

Page 42: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Increased Register Pressure

However, communication is

local to the Access and Execute

Simple context register remapping in hardware

Context

Page 43: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Context Register Remapping

• A hardware-software solution to reduce register pressure

• Only registers written in an Access phase are remapped

• A register is remapped at most once per Access phase

• Contexts are software controlled

• Avoids increasing the register pressure

• Significant reduction of remappings compared to OoO

Page 44: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Evaluation Setup

• SWOOP compiler based on LLVM

• Sniper Multi-Core Simulator + McPAT

• SWOOP– Based on an in-order core similar to an ARM Cortex A7

– One extra pipeline stage to model context remapping

• Compared against– In-Order (similar to a Cortex A7)

– In-Order with perfect L3 cache (no main memory accesses)

– Out-of-Order (similar to a Cortex A15)

– Clairvoyance (software-only solution — CGO 2017)

• SPEC CPU 2006, CIGAR, and NAS benchmark suites (high MPKI)

Page 45: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Performance

Page 46: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Performance

Higher is better

Page 47: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Performance

Page 48: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Performance

+34%

vs.

In-Order

Page 49: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Performance

+34%

vs.

In-Order

-7%

vs. Perf-L3

Page 50: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Performance

+34%

vs.

In-Order

-7%

vs. Perf-L3

+26%

vs.

Perf-L3

+82%

vs.

Perf-L3

Page 51: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Performance

+34%

vs.

In-Order

-7%

vs. Perf-L3

+26%

vs.

Perf-L3

+82%

vs.

Perf-L3

+26%

vs.

OoO

Page 52: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Performance

+19%

vs.

Clairvoyance[1] K.-A. Tran et al., Clairvoyance: Look-ahead compile-time scheduling, CGO 2017

Page 53: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Energy Efficiency

Page 54: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Energy Efficiency

Lower is better

Page 55: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Energy Efficiency

-23%

vs.

In-Order

Page 56: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Energy Efficiency

-23%

vs.

In-Order

+57%

OoO vs. InO

Page 57: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Summary

• SWOOP is a non-speculative hardware-software design

• SWOOP jumps ahead to independent regions of code

• 34% performance improvement over an In-Order core

• 23% energy reduction over an In-Order core

Page 58: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

2. Software Programmable Accelerator

BISMO: A Scalable Bit-Serial Matrix Multiplication Overlay for Reconfigurable ComputingYaman Umuroglu, Lahiru Rasnayake, and Magnus Själander

FPL 2018

Page 59: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

The Need for Variable-Precision Matrix Multiply

• Not all applications need the same arithmetic precision

• Example: mixed-precision in Quantized Neural Networks (QNNs)

8-bit

8-bitx 8-

bit8-bitx

2-bit

4-bitx 2-

bit4-bitx

1-bit

1-bitx

Weights and

activations can have

different precision

Different layers can

have different precision

Page 60: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Hardware for Variable Precision• HW typically fixed precision, SW benefits from variable precision

• Mismatch causes power, performance, and area overheads

• Potential solutions to the problem

Memory

Cover-All Datapath

8b x 8b

1b x 8b

1b x 4b

Under-utilizationNeed to pick the precisions

Dynamic Reconfiguration

Reconfiguration overhead

Memory

1b x 1b

<< +

Bit-Serial Datapath

One datapath that can do all precisions

Memory

8b x 8b

1b x 8b

1b x 4b

Page 61: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Approach: Bit-Serial Matrix Multiplication

1) Integer matrix = weighted sum of binary matrices

2) Integer matrix multiply = weighted sum of binary matrix multiplications

3) Parallelize inside each binary matrix multiplication for throughput

PE

PE

PE

PE

PE

PE

Page 62: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Example: Bit-Serial Matrix Multiplication

2-bit integer matrix =

weighted sum of 2

binary matrices

2-bit integer matrix multiply =

weighted sum of 4 binary matrix

multiplications

Page 63: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

BISMO Hardware Architecture

• 3-stage pipeline

– Fetch: 2D DMA read

– Execute: Bit-serial mat.mul

– Result: 2D DMA write

– All running in parallel

• BRAM matrix buffers

– Fill/empty via 2D DMA

• Explicit stage synchronization

with token FIFOs

Page 64: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Execute Stage: Dot Product Array (DPA)

• Array of Dot Product Units (DPU)

– Parallelizes a single binary matrix multiply

– 2D broadcast interconnect

– Configurable array dimensions Dm and Dn

• Array fed by matrix buffers

– Address sequencer reads out matrix data

as specified by instruction

– Matrix buffers filled by 2D DMA

Dmrows

Dn

columns

Page 65: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

The Heart of BISMO: Dot Product Unit (DPU)

• Performs weighted binary dot product

– Power-of-two weight via left shift

– Optional negation for signed values

• Configurable width Dk for parallel operation

– Dk binary MACs every cycle

x-bit by y-bit, N-element dot product = Nxy binary MACs = Nxy / Dk cycles

Page 66: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018

Instruction Set and Programming Model

● 5-instruction vector ISA

● Supports tiling, bit position skipping, latency hiding ...

Run weighted binary dotproduct on BRAM data

and accumulate

2D DMA descriptors tomove data between

DRAM ⇔ BRAM

Send/receive tokens representing shared

buffer regions

Page 67: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Evaluation

• Prototype implementation

– Hardware generator in Chisel

– Minimal software stack in C++

– Vivado 2017.4 for synthesis

• Hardware platform

– PYNQ-Z1 board (Zynq Z-7020)

– USB power meter for the entire PYNQ board

• Evaluated on

– Resource cost (synthesis results)

– Throughput, efficiency, power (execution results)

Page 68: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Performance & Efficiency

Largest design: 8x256x8 array, 45 kLUTs, 6.5 binary TOPS @200 MHz

– w-bit by a-bit performance = 6.5 TOPS / (w × a)

Page 69: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Power

• Up to 1.4 binary TOPS/W in power efficiency, including DRAM

• Average power breakdown: 65% idle, 25% fetch/result, 10% execute

Page 70: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Summary

• Bit-serial matrix multiplication overlay with configurable dimensions

• Up to 6.5 binary TOPS on a PYNQ-Z1

• Runtime proportional to product of matrix precisions

• Software-programmable with a 5-instruction ISA

https://github.com/EECS-NTNU/bismo

Page 71: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

3. A New Intermediate Representation

RVSDG: An Intermediate Representation for Optimizing Compilers

Nico Reissmann, Jan Christian Meyer, and Magnus Själander

Manuscript

Page 72: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Motivation

• The key to performance in modern architectures is

exploitation of parallelism

• This is needed for programming

– CMPs and GPUs

– but also for specialization (HLS)

• Thus, we need to be able to easily extract parallelism

from programs in order to further improve performance

Page 73: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Current State

• However, currently used program representations are

mostly sequential in nature (e.g., CFG)

• They are a remnant of a time when sequential

programming was sufficient and parallelization was done

dynamically by the processor

We need to rethink our models

Page 74: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

RVSDG: Regionalized Value State

Dependence Graph

• Single unified IR that normalizes program representation

• Exposes hierarchical structure of the program

– Functions

– Loops

– Conditionals

• Explicitly exposes parallelism

– Not just ILP

– Functions and loops

Page 75: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Example of CFG vs. RVSDG

Page 76: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Parallelism Explicitly Encoded

Page 77: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

The JLM Compiler

• Prototype compiler that uses RVSDG for optimizations

• Plugs into the LLVM compilation flow

• https://github.com/EECS-NTNU/jlm

C/C++ Clang LLVM IR

JLMLLVM

IR

Page 78: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

Conclusion

Page 79: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

High Efficiency Computing

• Today's computing systems are increasingly power

constrained

• Challenges established assumptions and force us to

rethink how we construct programs and build systems

• We need a holistic approach, where combined hardware

and software optimizations are performed across the

complete system stack

Page 80: Hardware-Software Codesign for Efficient Computing · 2019. 3. 31. · Hardware-Software Codesign for Efficient Computing Magnus Själander ... communication is local to the Access

Shonan, September 2018Magnus Själander

We are hiring

Associate Professor in Compiler Design

PhD graduates and Postdocs as well as more senior researchers

are encouraged to apply