Generating Configurable Hardware from Parallel Patterns

Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Chris De Sa, Christos Kozyrakis, Kunle Olukotun
Stanford University
ASPLOS 2016
Motivation

- Increasing interest in using FPGAs as accelerators
  - Key advantage: Performance/Watt
- Key domains:
  - Big data analytics, image processing, financial analytics, scientific computing, search
Problem: Programmability

- Verilog and VHDL are too low-level for software developers
- High-level synthesis (HLS) tools need user pragmas to help discover parallelism
  - C-based input, with pragmas requiring hardware knowledge
- Limited in exploiting data locality
- Difficult to synthesize complex datapaths with nested parallelism
Hardware Design - HLS

Add 512 integers stored in external DRAM:

  // Naive HLS input: every access goes to DRAM
  // (function name added here; the slide omits it)
  void sum(int* mem) {
    mem[512] = 0;
    for (int i = 0; i < 512; i++) {
      mem[512] += mem[i];
    }
  }

(Figure: SumModule connected to DRAM)

27,236 clock cycles for computation: two orders of magnitude too long!
Optimized Design - HLS

  #define CHUNKSIZE (sizeof(MPort)/sizeof(int))  // width of DRAM controller interface
  #define LOOPCOUNT (512/CHUNKSIZE)

  // (function name added here; the slide omits it)
  void sum(MPort* mem) {
    MPort buff[LOOPCOUNT];
    memcpy(buff, mem, sizeof(buff));    // burst access into a local variable
    int sum = 0;
    for (int i = 0; i < LOOPCOUNT; i++) {
  #pragma PIPELINE
      for (int j = 0; j < CHUNKSIZE; j++) {
  #pragma UNROLL
        // bit shifting to extract individual elements from the wide port
        sum += (int)(buff[i] >> j*sizeof(int)*8);
      }
    }
    mem[512] = sum;
  }

302 clock cycles for computation

Optimizations required: matching the width of the DRAM controller interface, burst access, use of a local variable, special compiler directives, loop restructuring, and bit shifting to extract individual elements.
So, we need to ...

- Use higher-level abstractions
  - Productivity: developer focuses on the application
  - Performance:
    - Capture locality to reduce off-chip memory traffic
    - Exploit parallelism at multiple nesting levels
- Smart compiler generates efficient hardware
Parallel Patterns

- Constructs with special properties with respect to parallelism and memory access

(Figure: dataflow of map, zip, reduce, and groupBy; groupBy buckets elements under key1, key2, key3)
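As a concrete illustration of the four constructs in the figure, the same patterns exist on ordinary Scala collections. This is a sketch for intuition only: the deck's DSL versions carry the same semantics but are compiled to hardware rather than executed sequentially.

```scala
object PatternsDemo {
  val xs = Seq(1, 2, 3, 4)
  val ys = Seq(10, 20, 30, 40)

  val mapped  = xs.map(_ * 2)                           // map: independent per-element work
  val zipped  = xs.zip(ys).map { case (a, b) => a + b } // zip: pairwise combination
  val reduced = xs.reduce(_ + _)                        // reduce: associative combination
  val grouped = xs.groupBy(_ % 2)                       // groupBy: bucket elements by key
}
```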
Why Parallel Patterns?

- Concise
- Can express a large class of workloads in the machine learning and data analytics domains
- Capture rich semantic information about parallelism and memory access patterns
- Enable powerful transformations using pattern matching and rewrite rules
- Enable generating efficient code for different architectures
Parallel Pattern Language

- A data-parallel language that supports parallel patterns
- Example application: k-means

  // Compute closest mean for each 'sample':
  //   1. Compute distance with each mean
  //   2. Select the mean with the shortest distance
  val clusters = samples groupBy { sample =>
    val dists = kMeans map { mean =>
      mean.zip(sample){ (a,b) => sq(a - b) } reduce { (a,b) => a + b }
    }
    Range(0, dists.length) reduce { (i,j) =>
      if (dists(i) < dists(j)) i else j
    }
  }

  // Compute average of each cluster:
  //   1. Compute sum of all assigned points
  //   2. Compute number of assigned points
  //   3. Divide each dimension of sum by count
  val newKmeans = clusters map { e =>
    val sum   = e reduce { (v1,v2) => v1.zip(v2){ (a,b) => a + b } }
    val count = e map { v => 1 } reduce { (a,b) => a + b }
    sum map { a => a / count }
  }
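A minimal executable analogue of the k-means snippet, written with plain Scala collections rather than the deck's DSL. The helper `sq` and the data values are illustrative, not from the slides.

```scala
object KMeansStep {
  def sq(x: Double): Double = x * x

  // Toy data: four 2-D samples and two current means (illustrative values).
  val samples = Seq(Seq(0.0, 0.0), Seq(0.0, 1.0), Seq(10.0, 10.0), Seq(10.0, 11.0))
  val kMeans  = Seq(Seq(0.0, 0.0), Seq(10.0, 10.0))

  // Assignment step: group each sample by the index of its closest mean.
  val clusters = samples.groupBy { sample =>
    val dists = kMeans.map { mean =>
      mean.zip(sample).map { case (a, b) => sq(a - b) }.reduce(_ + _)
    }
    (0 until dists.length).reduce { (i, j) => if (dists(i) < dists(j)) i else j }
  }

  // Update step: the average of each cluster becomes the new mean.
  val newKmeans = clusters.map { case (k, pts) =>
    val sum   = pts.reduce { (v1, v2) => v1.zip(v2).map { case (a, b) => a + b } }
    val count = pts.map(_ => 1).reduce(_ + _)
    k -> sum.map(_ / count)
  }
}
```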
Our Approach

Parallel Patterns
  -> Pattern Transformations (Fusion, Pattern Tiling, Code Motion)
  -> Tiled Parallel Pattern IR
  -> Hardware Generation (Memory Allocation, Template Selection, Metapipeline Analysis)
  -> MaxJ HGL
  -> Bitstream Generation
  -> FPGA Configuration
- High-level parallel patterns help productivity (frontend: Delite)
- Data locality improved with parallel pattern tiling transformations
- Nested parallelism exploited with hierarchical pipelines and double buffers
- Generate MaxJ, which in turn generates VHDL
Parallel Pattern Tiling: MultiFold

- Tiling using polyhedral analysis limits data access patterns to affine functions of loop indices
- Current parallel patterns cannot represent tiling
- A new parallel pattern, multiFold, describes tiled computation

(Figure: multiFold accumulates input tiles tile0..tile3 into output tiles out_tile0 and out_tile1)
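A sequential sketch of multiFold's semantics, per my reading of the pattern (not the paper's implementation): each index produces an (offset, tile) pair, and tiles landing at the same offset are merged into the accumulator with the given combine function.

```scala
object MultiFoldDemo {
  // multiFold(n){ i => (offset, tile) }{ combine }:
  // each iteration writes a tile into the accumulator at `offset`,
  // combining elementwise with whatever is already there.
  def multiFold[A](n: Int)(init: Vector[A])
                  (body: Int => (Int, Vector[A]))
                  (combine: (A, A) => A): Vector[A] = {
    var acc = init
    for (i <- 0 until n) {
      val (off, tile) = body(i)
      for (j <- tile.indices)
        acc = acc.updated(off + j, combine(acc(off + j), tile(j)))
    }
    acc
  }

  // Sum four input tiles of size 2 into a single output tile (offset 0 always):
  val in  = Vector(1, 2, 3, 4, 5, 6, 7, 8)
  val out = multiFold(4)(Vector(0, 0)) { i => (0, in.slice(2 * i, 2 * i + 2)) } (_ + _)
}
```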
Parallel Pattern Tiling: MultiFold

(Figure: map, reduce, and groupBy over tiles tile0..tile3 are each expressed as multiFold instances producing out_tile0 and out_tile1; groupBy additionally carries its key function)
kMeans: Untiled

- Data-dependent (non-affine) access to 'sum' and 'count'
- Lots of data locality
- Typically, n >> k

(Figure: dataflow from samples (n x d) through minDist/minDistIdx to sum (k x d), count (k), and newKmeans (k x d))

kMeans #reads: n * k * d
Parallel Pattern Strip Mining

- Transform each parallel pattern into nested patterns
- Strip-mined patterns enable computation reordering
- Insert copies to enhance locality
  - Copies guide creation of on-chip buffers

Parallel pattern:

  map(d){i => 2*x(i)}

Strip-mined pattern:

  multiFold(d/b){ii =>
    xTile = x.copy(b + ii)
    (ii, map(b){i => 2*xTile(i)})
  }
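A small sequential check of what the transformation means (illustrative sizes, plain Scala): the strip-mined form copies one tile of `x` at a time, computes over the tile, and writes it back at its offset, producing the same result as the flat map.

```scala
object StripMineDemo {
  val d = 8; val b = 2
  val x = Vector.tabulate(d)(i => i + 1) // 1..8

  // Flat pattern: map(d){ i => 2*x(i) }
  val flat = Vector.tabulate(d)(i => 2 * x(i))

  // Strip-mined: multiFold(d/b){ ii => copy a tile, map over the tile }
  val stripMined = {
    var out = Vector.fill(d)(0)
    for (t <- 0 until d / b) {
      val ii    = t * b
      val xTile = x.slice(ii, ii + b)                 // models x.copy: an on-chip tile
      val tile  = Vector.tabulate(b)(i => 2 * xTile(i))
      for (j <- 0 until b) out = out.updated(ii + j, tile(j))
    }
    out
  }
}
```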
Strip Mining: kMeans

(Figure: kMeans strip-mined into kMeansBlock over samplesBlock, with tiles of size bs and bk; minDist/minDistIdx, sum, and count are computed per block)

kMeans #reads: n * k * d
Parallel Pattern Interchange

- Reorder nested patterns
- Move 'copy' operations out toward outer pattern(s)
- Improves locality and reuse of on-chip memory

Strip-mined patterns:

  multiFold(m/b0,n/b1){ii,jj =>
    xTl = x.copy(b0+ii, b1+jj)
    ((ii,jj), map(b0,b1){i,j =>
      multiFold(p/b2){kk =>
        yTl = y.copy(b1+jj, b2+kk)
        (0, multiFold(b2){ k =>
          (0, xTl(i,j) * yTl(j,k))
        }{(a,b) => a + b})
      }{(a,b) => a + b}
    })
  }

Interchanged patterns:

  multiFold(m/b0,n/b1){ii,jj =>
    xTl = x.copy(b0+ii, b1+jj)
    ((ii,jj), multiFold(p/b2){kk =>
      yTl = y.copy(b1+jj, b2+kk)
      (0, map(b0,b1){i,j =>
        (0, multiFold(b2){ k =>
          (0, xTl(i,j) * yTl(j,k))
        }{(a,b) => a + b})
      })
    }{(a,b) => map(b0,b1){i,j => a(i,j) + b(i,j)}})
  }
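The interchange above can be sanity-checked in plain Scala on a small matrix multiply (illustrative sizes; output tiling omitted for brevity, only the k dimension is tiled). Hoisting the k-tile loop outside the per-element map and merging partial tiles with an elementwise add leaves the result unchanged, while each y tile is now fetched once per output tile instead of once per output element.

```scala
object InterchangeDemo {
  val m = 4; val n = 4; val p = 4; val b2 = 2
  val x = Vector.tabulate(m, p)((i, k) => i + k + 1) // m x p
  val y = Vector.tabulate(p, n)((k, j) => 2 * k + j) // p x n

  // Strip-mined order: each output element reduces over all k-tiles.
  val beforeInterchange = Vector.tabulate(m, n) { (i, j) =>
    (0 until p / b2).map { t =>
      (0 until b2).map(k => x(i)(t * b2 + k) * y(t * b2 + k)(j)).sum
    }.sum
  }

  // Interchanged order: the k-tile loop is outermost; partial result tiles
  // are combined with an elementwise add (the outer multiFold's combine).
  val afterInterchange = {
    var acc = Vector.fill(m)(Vector.fill(n)(0))
    for (t <- 0 until p / b2) {
      val partial = Vector.tabulate(m, n) { (i, j) =>
        (0 until b2).map(k => x(i)(t * b2 + k) * y(t * b2 + k)(j)).sum
      }
      acc = Vector.tabulate(m)(i => Vector.tabulate(n)(j => acc(i)(j) + partial(i)(j)))
    }
    acc
  }
}
```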
Pattern Interchange: kMeans

(Figure: after interchange, kMeansBlock tiles (size bk) are reused across each samplesBlock tile (size bs), with minDist/minDistIdx, sum, and count per block)

kMeans #reads: (n / bs) * k * d
Template Selection

Memories:

  Template       | Description                                          | IR Construct
  Buffer         | Scratchpad memory                                    | Statically sized array
  Double buffer  | Buffer coupling two stages in a metapipeline         | Metapipeline
  Cache          | Tagged memory exploiting locality in random accesses | Non-affine accesses

Pipelined Execution Units:

  Template       | Description                                  | IR Construct
  Vector         | SIMD parallelism                             | Map over scalars
  Reduction tree | Parallel reduction of associative operations | MultiFold over scalars
  Parallel FIFO  | Buffer ordered outputs of dynamic size       | FlatMap over scalars
  CAM            | Fully associative key-value store            | GroupByFold over scalars

Controllers:

  Template     | Description                                            | IR Construct
  Sequential   | Coordinates sequential execution                       | Sequential IR node
  Parallel     | Coordinates parallel execution                         | Independent IR nodes
  Metapipeline | Executes nested parallel patterns in pipelined fashion | Outer parallel pattern with multiple inner patterns
  Tile memory  | Fetches tiles of data from off-chip memory             | Transformer-inserted array copy
Metapipelining

- Hierarchical pipeline: a "pipeline of pipelines"
  - Exploits nested parallelism
  - Inner stages can be other nested patterns or combinational logic
- Does not require the iteration space to be known statically
  - Does not require complete unrolling of inner patterns
- Intermediate data from each stage automatically stored in double buffers
  - Allows stages to have variable execution times
- No need to calculate an initiation interval (II)
  - Uses asynchronous control signals to begin the next iteration
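A toy software model of the double-buffer mechanism between two stages (a sketch for intuition, not the generated hardware): the producer writes one half of a ping-pong buffer while the consumer reads the other half, and the halves swap each iteration, so the stages overlap without clobbering each other's data.

```scala
object DoubleBufferDemo {
  val iterations = 4
  var ping = Vector(0, 0) // half being written this iteration
  var pong = Vector(0, 0) // half being read this iteration
  val stage2Seen = collection.mutable.ArrayBuffer[Vector[Int]]()

  for (it <- 0 until iterations) {
    // Stage 2 consumes what stage 1 produced *last* iteration...
    if (it > 0) stage2Seen += pong
    // ...while stage 1 concurrently fills the other half for this iteration.
    ping = Vector(it, it * 10)
    // End of iteration: swap the halves.
    val t = ping; ping = pong; pong = t
  }

  // Stage 2 observed each iteration's data exactly one iteration later.
  val observed = stage2Seen.toVector
}
```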
Metapipeline - 4 stages

Metapipeline - Intuition

  map(N) { r =>
    row = matrix.slice(r)
    diff = map(D) { i =>
      row(i) - sub(i)
    }
    vprod = map(D,D) { (i,j) =>
      diff(i) * diff(j)
    }
  }

(Figure: the body becomes a 4-stage metapipeline: Pipe1 is a tile memory controller loading 'row'; Pipe2 loads 'row' and 'sub', subtracts, and stores 'diff'; Pipe3 loads 'diff', multiplies, and stores the outer product 'vprod'; Pipe4 is a tile memory controller storing 'vprod'. Successive iterations r = 1..5 overlap across the stages.)
Metapipeline Analysis

- Detects metapipelines in the tiled parallel pattern IR
- Detection
  - Chain of producer-consumer parallel patterns within the body of another parallel pattern
- Scheduling
  - Topological sort of the IR of the parallel pattern body
  - Produces a list of stages, where each stage consists of one or more independent parallel patterns
  - Promotes intermediate buffers to double buffers
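The scheduling step can be sketched in a few lines (node names are illustrative, not from the paper; assumes the producer-consumer graph is a DAG): repeatedly peel off every pattern whose producers have already been scheduled, and each peel becomes one metapipeline stage, so independent patterns share a stage.

```scala
object StageScheduleDemo {
  // deps(n) = the set of producers node n consumes from.
  def stages(nodes: Set[String], deps: Map[String, Set[String]]): List[Set[String]] = {
    var remaining = nodes
    var out = List.empty[Set[String]]
    while (remaining.nonEmpty) {
      // A node is ready when none of its producers remain unscheduled.
      val ready = remaining.filter { n =>
        deps.getOrElse(n, Set.empty[String]).intersect(remaining).isEmpty
      }
      out = out :+ ready
      remaining = remaining -- ready
    }
    out
  }

  // A linear chain like the metapipeline example: load -> diff -> vprod -> store.
  val result = stages(
    Set("load", "diff", "vprod", "store"),
    Map("diff" -> Set("load"), "vprod" -> Set("diff"), "store" -> Set("vprod"))
  )
}
```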
Putting It All Together: kMeans

(Figure: the generated design - a samplesBlock double buffer and kmeansTile load feed a metapipeline of vector distance (norm) units, a scalar distance reduction tree producing (MinDist, Idx) into a minIdx double buffer, and increment/add units accumulating into sum and count buffers; a divide stage writes a new kmeans double buffer, followed by a kmeansTile store)

1. Load kmeans
2. Metapipeline: calculate sum and count
3. Metapipeline: calculate new kmeans, store results

Similar to (and more general than) hand-written designs [1]

[1] Hussain et al., "FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data", AHS 2011
Experimental Setup

- Board: Altera Stratix V
  - 48 GB DDR3 off-chip DRAM, 6 memory channels
  - Board connected to host via PCIe
- Execution time reported = FPGA execution time
  - CPU <-> FPGA communication and FPGA configuration time not included
- Goal: how beneficial are tiling and metapipelining?
Experimental Setup

- Baseline: auto-generated MaxJ
  - Representative of state-of-the-art HLS tools
- Baseline optimizations
  - Pipelined execution of innermost loops
  - Parallelized (unrolled) inner loops
  - Parallelism factor chosen by hand
  - Data locality captured at the level of a DRAM burst (384 bytes)
- Parallelism factors are kept consistent across baseline and optimized versions from our flow
Evaluation

(Figure: per-benchmark speedup and resource-utilization charts)
Results Summary

- Speedup with tiling: up to 15.5x
- Speedup with tiling + metapipelining: up to 39.4x
- Minimal (often positive!) impact on resource usage
  - Tiled designs have fewer off-chip data loaders and storers
Summary

- Two key optimizations, tiling and metapipelining, to generate efficient FPGA designs from parallel patterns
- Automatic tiling transformations placing fewer restrictions on memory access patterns
- Analysis to automatically infer designs with metapipelines and double buffers
- Significant speedups of up to 39.4x with minimal impact on FPGA resource utilization