IEEE International Symposium on Rapid System Prototyping – Montreal, Canada – October 4, 2013

Riccardo Cattaneo, Christian Pilato, Gianluca C. Durelli, Marco D. Santambrogio and Donatella Sciuto

Politecnico di Milano, Italy

SMASH: A Heuristic Methodology for Designing Partially Reconfigurable MPSoCs

Christian Pilato, Politecnico di Milano

What is an FPGA?

• Hardware device that can be customized after fabrication to execute a specific functionality
  – Distinct hardware blocks are “intrinsically” running in parallel on the device
• Heterogeneous grid of interconnected components
  – Look-up tables (LUTs), block RAMs (BRAMs), digital signal processors (DSPs), switch matrices, input/output blocks (IOBs), etc.
• Possibility to reuse resources by reconfiguring part of the logic at run time (partial reconfiguration)


Heterogeneous SoCs with FPGAs

• Highly coupled heterogeneous systems
  – Zynq platform: dual-core ARM Cortex-A9 tightly coupled with a Xilinx Artix-7 FPGA
  – High-speed, low-latency reconfigurable interconnect

[Figure: Avnet ZedBoard (Zynq-7000-based development board)]

[Figure: Coarse-grained overview of the Zynq-7000 All Programmable SoC]


Design Challenges and Motivation

• The hardware engineer needs to:
  – partition the application into blocks (partitioning)
  – determine which parts are better executed in hardware (mapping and scheduling)
  – generate the system (architecture refinement)
• Partial reconfiguration allows reusing the same logic across different tasks
  – More tasks can be ported to hardware
  – Significant overhead must be taken into account

The steps are strictly interdependent!


SMASH: Proposed Methodology

• Design Space Exploration
  – determines the proper mapping and scheduling
• Architecture Refinement
  – customizes the architectural template to derive the corresponding platform

[Figure: SMASH flow. Inputs: DAG, Architecture Template, Implementations. The Design Space Exploration stage (Mapping and Scheduling Heuristic plus (Fast) Solution Evaluation) produces a Solution; the Architecture Refinement stage turns it into the Architecture Solution.]


Mapping and Scheduling

Input:
• Task graph (DAG)
• Architectural template
  – Identifies resource constraints
• Implementations
  – List of different trade-offs in terms of performance and resources

Output:
• Implementation and component for each task
• Order of execution

[Figure: architectural template. An FPGA hosting CPU0, CPU1, reconfigurable regions RR0, RR1, RR2, a static core IP0, the ICAP, a shared memory and an I/O interface, all attached to an interconnection channel.]


Implementation vs. Component

• Each task can have multiple alternative implementations on the same component
  – Faster implementations usually require more resources
• Some tasks can share implementations to execute the same functionality multiple times
  – Hardware reuse: no reconfiguration is required
• Implementation is more related to functionality and resources
• Component is more related to where the task is actually executed
  – Processor or hardware module


SMASH: Execution Overview

• Simultaneous MApping and Scheduling Heuristic

[Figure: SMASH iteration flowchart. Boxes: Generate trace, Schedule trace, Evaluate metrics, Store solution. Decision: Termination? (No: run another iteration; Yes: return best solution).]
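The iteration in the flowchart can be sketched as a plain loop. This is only an illustration: the callables `generate_trace`, `schedule_trace` and `evaluate`, the lower-is-better cost, and the fixed iteration budget used as termination test are all assumptions, not details given on the slide.

```python
from typing import Callable, List, Tuple

# A trace is an ordered list of (task, component, implementation) decisions.
Trace = List[Tuple[str, str, str]]


def smash_loop(generate_trace: Callable[[], Trace],
               schedule_trace: Callable[[Trace], object],
               evaluate: Callable[[object], float],
               max_iterations: int = 100):
    """Run the generate / schedule / evaluate / store loop and keep the best trace."""
    best_trace, best_cost = None, float("inf")
    history = []
    for _ in range(max_iterations):        # "Termination?" (assumed: fixed budget)
        trace = generate_trace()           # "Generate trace"
        schedule = schedule_trace(trace)   # "Schedule trace"
        cost = evaluate(schedule)          # "Evaluate metrics" (assumed: lower is better)
        history.append((cost, trace))      # "Store solution"
        if cost < best_cost:
            best_trace, best_cost = trace, cost
    return best_trace, best_cost, history  # "Return best solution"
```

Keeping the three stages as separate callables mirrors the slide's point that different evaluation methods can be plugged in without touching the loop.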


Exploring Mapping and Scheduling

• Exploration based on the Serial Generation Scheme (SGS)
  – Constructive approach to better handle design constraints
  – A decision is not taken if it would lead to a constraint violation
• Different combinations of mapping and scheduling
  – Each decision represents a mapping of a task to an implementation and a processing element
  – The order of selection represents the priority values for resolving scheduling conflicts on the resources
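A minimal sketch of such a constructive generation is given below. It is not the authors' implementation: the area model (each component reserves the largest area among the implementations mapped on it), the single `area_budget` constraint and the uniform random choice are assumptions made to keep the example short.

```python
import random


def generate_trace(preds, options, area_budget, rng):
    """Build a trace decision by decision, never taking an inadmissible one.

    preds:   task -> set of predecessor tasks (the DAG)
    options: task -> list of (component, implementation, area) alternatives
    Returns a list of (task, component, implementation), or None on a dead end.
    """
    trace = []
    done = set()
    reserved = {}  # component -> area reserved so far (hypothetical area model)
    while len(done) < len(preds):
        # Tasks whose predecessors have all been decided already.
        ready = sorted(t for t in preds if t not in done and preds[t] <= done)
        admissible = []
        for task in ready:
            for comp, impl, area in options[task]:
                extra = max(0, area - reserved.get(comp, 0))
                # Skip decisions that would violate the resource constraint.
                if sum(reserved.values()) + extra <= area_budget:
                    admissible.append((task, comp, impl, area))
        if not admissible:
            return None
        task, comp, impl, area = rng.choice(admissible)
        reserved[comp] = max(reserved.get(comp, 0), area)
        trace.append((task, comp, impl))
        done.add(task)
    return trace
```

The position of each decision in the returned list doubles as its scheduling priority, which is how one trace encodes both mapping and ordering.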



Ant Colony Optimization

• Our proposed approach is based on Ant Colony Optimization (ACO) to limit infeasible solutions
  – Cooperative behavior of the ants while searching
  – Each ant has different possibilities at each step and takes stochastic decisions, composing a trace
• Stochastic principles guarantee exploration (a probability is generated for each admissible decision at each step)
• Feedback guarantees the exploitation of good parts of the solutions


Algorithm Overview

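• Pseudo-code of the proposed ACO-based exploration:

[Figure: pseudo-code listing (not recoverable from the extracted text), annotated with three phases: "Exploration: generating trace", "Mapping decision" and "Exploitation: updating global information".]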


Stochastic Selection Process

• At each decision point d, the probability of assigning a candidate j (task/communication) to an implementation point i (implementation + processing element) is:

$$p_{d,i,j} = \frac{[G_{d,i,j}]\,[L_{d,i,j}]}{\sum_{k,n} [G_{d,k,n}]\,[L_{d,k,n}]}$$

  where G is the global heuristic and L is the local heuristic
• Global information G: feedback information
  – Probability that the decision leads to a good solution
• Local heuristic L: problem-specific hint
  – “Adjusted” by the global heuristic if wrong
• Roulette wheel and extraction of a combination (i, j)
  – A probability is generated iff the resources required by the resulting PEs can be satisfied by the architecture
  – There is always the possibility of adding a new PE or reusing an existing one (platform customization)
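The selection rule can be illustrated with a short sketch. The dictionary layout (keys are `(i, j)` pairs at one decision point), the function names and the plain product of G and L are assumptions for illustration; only admissible combinations are passed in, so inadmissible ones never receive a probability.

```python
import random


def selection_probabilities(G, L, admissible):
    """Normalize G * L over the admissible (i, j) combinations of one decision point."""
    weights = {c: G[c] * L[c] for c in admissible}
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("no admissible combination with positive weight")
    return {c: w / total for c, w in weights.items()}


def roulette_wheel(probabilities, rng):
    """Extract one combination with probability proportional to its share."""
    r = rng.random()
    cumulative = 0.0
    selected = None
    for choice, p in probabilities.items():
        selected = choice
        cumulative += p
        if r < cumulative:
            break
    return selected  # falls back to the last entry if rounding leaves r uncovered


# Example with hypothetical values: two implementation points for task "A".
G = {("p1", "A"): 1.0, ("p2", "A"): 3.0}
L = {("p1", "A"): 1.0, ("p2", "A"): 1.0}
probs = selection_probabilities(G, L, [("p1", "A"), ("p2", "A")])
pick = roulette_wheel(probs, random.Random(0))
```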

More about SMASH

• Simultaneous MApping and Scheduling Heuristic

[Figure: SMASH iteration flowchart. Boxes: Generate trace, Schedule trace, Evaluate metrics, Store solution. Decision: Termination? (No: run another iteration; Yes: return best solution).]


Trace Generation and Evaluation

• Evaluation is performed only on the complete trace
  – Updated version of the original task graph (TG), augmented with communications and reconfigurations
  – Reconfiguration is taken into account from the early stages of the design process
• Possibility to include different evaluation methods
  – Analytical estimations vs. TLM simulations
• Decisions composing the best solution are reinforced
  – As time goes on, the best trace is identified
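The reinforcement step can be sketched with the textbook ACO update (evaporation followed by a deposit on the best trace). The slide only states that the best decisions are reinforced, so the evaporation rate `rho`, the constant `reward` and the function name are assumptions borrowed from standard ACO.

```python
def reinforce(G, best_trace, rho=0.1, reward=1.0):
    """Update the global information G in place and return it.

    G:          decision -> global information value
    best_trace: decisions composing the best solution found so far
    rho:        evaporation rate (assumed parameter)
    reward:     amount deposited on each best decision (assumed parameter)
    """
    for decision in G:
        G[decision] *= (1.0 - rho)  # evaporation: older feedback fades
    for decision in best_trace:
        G[decision] = G.get(decision, 0.0) + reward  # reinforce best decisions
    return G
```

Repeating this update after every iteration makes the decisions of good traces progressively more likely to be extracted again.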



Scheduling Definition

Input
• Task graph (DAG)
• Trace: ordered list of mapping decisions (task-component-implementation)

Output
• Start/end time estimations for each task

Goal
• Reduce total execution time

Task   Component   Implementation
A      p1          impl_0
B      p2          impl_1
C      p1          impl_2
D      p3          impl_3
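As an illustration of this input/output contract, the sketch below assigns start and end times to the example trace. The DAG edges (A before B and C, both before D) and the durations are hypothetical, since the slide's figure is not recoverable, and communications and reconfigurations are ignored here. The trace is assumed to list every task after its predecessors.

```python
def schedule(preds, trace, duration):
    """Assign (start, end) times to each task, following the trace order.

    preds:    task -> list of predecessor tasks
    trace:    ordered list of (task, component, implementation)
    duration: implementation -> execution time (hypothetical values)
    """
    end = {}    # task -> end time
    free = {}   # component -> time at which it becomes available
    times = {}
    for task, comp, impl in trace:
        ready = max((end[p] for p in preds[task]), default=0)
        start = max(ready, free.get(comp, 0))  # wait for data and for the component
        finish = start + duration[impl]
        times[task] = (start, finish)
        end[task] = finish
        free[comp] = finish
    return times


preds = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
trace = [("A", "p1", "impl_0"), ("B", "p2", "impl_1"),
         ("C", "p1", "impl_2"), ("D", "p3", "impl_3")]
duration = {"impl_0": 2, "impl_1": 3, "impl_2": 4, "impl_3": 1}
times = schedule(preds, trace, duration)
makespan = max(finish for _, finish in times.values())
```

With these made-up durations, B and C overlap on different components, and D starts only after the slower of the two has finished.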


Scheduling: Methodology Overview

[Figure: SMASH scheduler. The task graph and the trace enter "Create extended task graph"; the resulting extended task graph goes through "Actual scheduling (assign times)" and "Evaluate metrics", which outputs the metrics.]


Extended TG: Communications

Adding explicit tasks based on the communication topology


Extended TG: Reconfigurations

• A reconfiguration task is introduced iff:
  – Two processing tasks are mapped on the same component, and
  – Their implementations are different, i.e., the module cannot be reused
• Insertion of a reconfiguration task:
  – New edges are introduced from all WRITEs exiting the source processing task to the reconfiguration
  – New edges are introduced from the reconfiguration to all the READs entering the target processing task
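The insertion rule can be sketched as follows. The node naming (`reconf_<source>_<target>`), the `writes`/`reads` dictionaries listing the communication tasks of each processing task, and the choice to compare only consecutive tasks on the same component are assumptions made for illustration.

```python
def insert_reconfigurations(trace, writes, reads):
    """Return the reconfiguration tasks and the new edges they require.

    trace:  ordered list of (task, component, implementation)
    writes: task -> list of WRITE communication tasks exiting it
    reads:  task -> list of READ communication tasks entering it
    """
    last = {}  # component -> (previous task, its implementation)
    reconfs, edges = [], []
    for task, comp, impl in trace:
        if comp in last and last[comp][1] != impl:
            src = last[comp][0]
            node = f"reconf_{src}_{task}"
            reconfs.append(node)
            # Source WRITEs must complete before the module is replaced.
            edges += [(w, node) for w in writes.get(src, [])]
            # Target READs must wait until the new module is in place.
            edges += [(node, r) for r in reads.get(task, [])]
        last[comp] = (task, impl)
    return reconfs, edges


trace = [("A", "p1", "impl_0"), ("B", "p2", "impl_1"),
         ("C", "p1", "impl_2"), ("D", "p3", "impl_3")]
writes = {"A": ["W_A_B", "W_A_C"]}   # hypothetical communication tasks
reads = {"C": ["R_A_C"]}
reconfs, edges = insert_reconfigurations(trace, writes, reads)
```

In the example trace, A and C share component p1 with different implementations, so exactly one reconfiguration is inserted between them; two tasks sharing both component and implementation would produce none.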



Extended TG: Reconfigurations

Task   Component   Implementation
A      p1          impl_0
B      p2          impl_1
C      p1          impl_2
D      p3          impl_3


Trace Evaluation

Possibility to integrate different policies to generate the corresponding schedule


Architecture Refinement

• The actual platform instance is derived from the resulting decisions
  – Hardware modules with only one task assigned are converted into static IP blocks
  – Hardware modules with more than one task assigned are represented as reconfigurable regions
• Integration with the generation of the run-time manager to manage reconfigurations
  – Still work in progress and manually performed
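The classification rule in the first bullet can be sketched in a few lines. The function name, the `processors` set used to tell software components apart and the component names in the example are hypothetical.

```python
from collections import defaultdict


def refine(trace, processors):
    """Classify each hardware module by the number of tasks mapped on it.

    trace:      ordered list of (task, component, implementation)
    processors: set of component names that are CPUs (not hardware modules)
    """
    tasks_per_module = defaultdict(list)
    for task, comp, _impl in trace:
        if comp not in processors:
            tasks_per_module[comp].append(task)
    return {
        module: "static IP" if len(tasks) == 1 else "reconfigurable region"
        for module, tasks in tasks_per_module.items()
    }


trace = [("A", "hw0", "impl_0"), ("B", "hw1", "impl_1"),
         ("C", "hw0", "impl_2"), ("D", "cpu0", "sw_D")]
platform = refine(trace, processors={"cpu0"})
```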



Experimental Evaluation

• Synthetic benchmarks (TGFF)
  – Focus on scalability of the approach
  – Possibility to evaluate different task graph patterns
• Resulting systems (platform instance and extended task graph with mapping/scheduling decisions) converted into virtual platforms
  – Validation of the different solutions assuming correctness of the execution
• Simulations performed with Synopsys Platform Architect
  – VPU performance annotations extracted from the tasks’ implementations


Experimental Setup

• Three different classes of experiments:
  – Static: the FPGA area is divided into a set of up to KS static IP cores (no partial reconfiguration)
  – Mixed: both IP cores and reconfigurable regions can be used, with an upper bound of KM IPs and RM reconfigurable regions
  – Reconfigurable: architectures with no more than KR regions
• Reconfigurable regions can also be deployed as static cores in the final architecture if only one task is assigned to them


Experimental Results

         |           static            |            mixed            |       reconfigurable
#Tasks   | IPs  RRs  HW tasks  #Reconf | IPs  RRs  HW tasks  #Reconf | IPs  RRs  HW tasks  #Reconf
12       | 7    0    7         0       | 7    0    7         0       | 6    0    6         0
20       | 20   0    20        0       | 18   1    20        1       | 17   1    19        1
31       | 30   0    30        0       | 20   4    31        7       | 16   7    30        7
41       | 30   0    30        0       | 18   8    40        14      | 12   12   40        16
52       | 30   0    30        0       | 17   9    51        25      | 8    17   51        26
60       | 30   0    30        0       | 15   10   53        28      | 10   14   51        27
70       | 30   0    30        0       | 17   9    55        28      | 9    16   58        33
83       | 30   0    30        0       | 15   11   80        54      | 6    19   81        56
90       | 30   0    30        0       | 23   3    31        5       | 9    12   39        18
100      | 30   0    30        0       | 16   7    46        23      | 3    17   53        33

[Figure: speedup (y-axis, 0 to 3) versus number of tasks (x-axis: 12, 20, 31, 41, 52, 60, 70, 83, 90, 100) for the static, mixed and reconfigurable configurations.]

Small task graphs cannot benefit from reconfiguration

Large task graphs are affected by communication overhead


Conclusions and Future Work

• SMASH is an automated methodology to design reconfigurable systems
  – It determines the mapping and scheduling of the different tasks
  – It allows customizing the architectural template
• Future work
  – Integration of floorplanning procedures to compute and validate physical constraints of the blocks
  – Automatic generation of the platform specification


End…


http://www.fp7-faster.eu/
