Generating Configurable Hardware From Parallel Patterns. Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Chris De Sa, Christos Kozyrakis, Kunle Olukotun. Stanford University. ASPLOS 2016.

Generating Configurable Hardware From Parallel Patternscsl.stanford.edu/~christos/publications/2016.delitehw.asplos.pdf · Problem: Programmability n Verilog and VHDL too low level


Generating Configurable Hardware From Parallel Patterns

Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Chris De Sa, Christos Kozyrakis, Kunle Olukotun

Stanford University

ASPLOS 2016


Motivation

• Increasing interest in using FPGAs as accelerators
    • Key advantage: performance per watt

• Key domains:
    • Big data analytics, image processing, financial analytics, scientific computing, search


Problem: Programmability

• Verilog and VHDL too low level for software developers

• High-level synthesis (HLS) tools need user pragmas to help discover parallelism
    • C-based input, pragmas requiring hardware knowledge

• Limited in exploiting data locality

• Difficult to synthesize complex datapaths with nested parallelism


Hardware Design - HLS

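Add 512 integers stored in external DRAM:

    // The function name "sum_module" is assumed; the slide omits it.
    void sum_module(int *mem) {
        mem[512] = 0;
        for (int i = 0; i < 512; i++) {
            mem[512] += mem[i];   // every iteration goes to DRAM
        }
    }

[Figure: Sum Module connected to DRAM]

27,236 clock cycles for computation. Two orders of magnitude too long!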


Optimized Design - HLS

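    // MPort is a wide word matching the width of the DRAM controller interface.
    // The function name "sum_module" is assumed; the slide omits it.
    #define CHUNKSIZE (sizeof(MPort)/sizeof(int))
    #define LOOPCOUNT (512/CHUNKSIZE)

    void sum_module(MPort *mem) {
        MPort buff[LOOPCOUNT];
        // Burst access: copy all 512 ints into a local buffer.
        memcpy(buff, mem, LOOPCOUNT * sizeof(MPort));

        int sum = 0;   // local variable instead of a DRAM location
        for (int i = 0; i < LOOPCOUNT; i++) {
    #pragma PIPELINE
            for (int j = 0; j < CHUNKSIZE; j++) {
    #pragma UNROLL
                // Bit shifting extracts the j-th int from the wide word.
                sum += (int)(buff[i] >> (j * sizeof(int) * 8));
            }
        }
        ((int *)mem)[512] = sum;
    }

302 clock cycles for computation

What the optimized version relies on:

• Width of DRAM controller interface
• Burst access
• Use of a local variable
• Special compiler directives
• Loop restructuring
• Bit shifting to extract individual elements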


So, we need to ...

• Use higher-level abstractions
    • Productivity: developer focuses on application
    • Performance:
        • Capture locality to reduce off-chip memory traffic
        • Exploit parallelism at multiple nesting levels

• Smart compiler generates efficient hardware


Parallel Patterns

• Constructs with special properties with respect to parallelism and memory access

[Figure: the four patterns map, zip, reduce, and groupBy; groupBy is drawn with buckets key1, key2, key3]
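As a rough illustration of the four patterns, here is how they look on ordinary Scala collections. This is a minimal sketch using the standard library only, not the deck's parallel pattern language; the object name PatternsDemo and the sample data are made up for the example.

```scala
object PatternsDemo {
  val xs = Vector(1, 2, 3, 4)
  val ys = Vector(10, 20, 30, 40)

  // map: apply a function to every element independently
  val doubled: Vector[Int] = xs.map(x => 2 * x)

  // zip: combine two collections element by element
  val sums: Vector[Int] = xs.zip(ys).map { case (a, b) => a + b }

  // reduce: combine all elements with an associative operator
  val total: Int = xs.reduce(_ + _)

  // groupBy: bucket elements by a key function
  val byParity: Map[Int, Vector[Int]] = xs.groupBy(x => x % 2)
}
```

Each pattern fixes how elements may be processed in parallel and how memory is accessed, which is the property the compiler exploits.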


Why Parallel Patterns?

• Concise
• Can express large class of workloads in the machine learning and data analytics domain
• Captures rich semantic information about parallelism and memory access patterns
• Enables powerful transformations using pattern matching and re-write rules
• Enables generating efficient code for different architectures


Parallel Pattern Language

• A data-parallel language that supports parallel patterns
• Example application: k-means

    // Compute closest mean for each 'sample'
    val clusters = samples groupBy { sample =>
      // 1. Compute distance with each mean
      val dists = kMeans map { mean =>
        mean.zip(sample){ (a,b) => sq(a - b) } reduce { (a,b) => a + b }
      }
      // 2. Select the mean with shortest distance
      Range(0, dists.length) reduce { (i,j) =>
        if (dists(i) < dists(j)) i else j
      }
    }

    // Compute average of each cluster
    val newKmeans = clusters map { e =>
      // 1. Compute sum of all assigned points
      val sum = e reduce { (v1,v2) => v1.zip(v2){ (a,b) => a + b } }
      // 2. Compute number of assigned points
      val count = e map { v => 1 } reduce { (a,b) => a + b }
      // 3. Divide each dimension of sum by count
      sum map { a => a / count }
    }
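The k-means example above is written in the parallel pattern language. For readers who want to run it, the following is a sketch of the same two phases on plain Scala collections. It assumes points are Vector[Double]; the names KMeansStep, closest, and step are hypothetical, and means that attract no samples are simply dropped.

```scala
object KMeansStep {
  type Point = Vector[Double]

  def sq(x: Double): Double = x * x

  // Index of the mean closest to `sample` (squared Euclidean distance).
  def closest(sample: Point, means: Vector[Point]): Int = {
    val dists = means.map { mean =>
      mean.zip(sample).map { case (a, b) => sq(a - b) }.reduce(_ + _)
    }
    dists.indices.reduce((i, j) => if (dists(i) < dists(j)) i else j)
  }

  // One iteration: group samples by closest mean, then average each group.
  // Means with no assigned samples are dropped in this sketch.
  def step(samples: Vector[Point], means: Vector[Point]): Map[Int, Point] = {
    val clusters = samples.groupBy(s => closest(s, means))
    clusters.map { case (idx, pts) =>
      val sum = pts.reduce((v1, v2) => v1.zip(v2).map { case (a, b) => a + b })
      val count = pts.map(_ => 1).reduce(_ + _)
      idx -> sum.map(_ / count)
    }
  }
}
```

The structure mirrors the slide: a groupBy whose key function is itself a map and a reduce, followed by a map over clusters containing nested reduces.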


Our Approach

[Figure: compilation flow]

Parallel Patterns
  → Pattern Transformations (Fusion, Pattern Tiling, Code Motion)
  → Tiled Parallel Pattern IR
  → Hardware Generation (Memory Allocation, Template Selection, Metapipeline Analysis)
  → MaxJ HGL
  → Bitstream Generation
  → FPGA Configuration


Our Approach

• High-level parallel patterns help productivity
• Data locality improved with parallel pattern tiling transformations
• Nested parallelism exploited with hierarchical pipelines and double buffers
• Generate MaxJ to generate VHDL

[Figure: the compilation flow from Parallel Patterns to FPGA Configuration, annotated with the points above and labeled "Delite"]


Parallel Pattern Tiling: MultiFold

• Tiling using polyhedral analysis limits data access patterns to affine functions of loop indices

• Current parallel patterns cannot represent tiling

• New parallel pattern describes tiled computation

[Figure: multiFold consumes input tiles tile0, tile1, tile2, tile3 and produces output tiles out_tile0 and out_tile1]
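To make the idea of a fold that writes into several output tiles concrete, here is a simplified sketch. It is not the signature used by the authors' compiler: the function multiFold below, its parameters, and the assignment of input tiles to output slots are assumptions chosen for illustration.

```scala
object MultiFoldSketch {
  // Simplified multiFold: every index in 0 until n yields (output slot, value);
  // values landing in the same slot are merged with `combine`.
  def multiFold[A](n: Int, slots: Int, zero: A)(f: Int => (Int, A))(combine: (A, A) => A): Vector[A] =
    (0 until n).foldLeft(Vector.fill(slots)(zero)) { (acc, i) =>
      val (slot, value) = f(i)
      acc.updated(slot, combine(acc(slot), value))
    }

  // Four input tiles folded into two outputs.
  // Assumed mapping: tiles 0 and 1 feed slot 0, tiles 2 and 3 feed slot 1.
  val tiles = Vector(Vector(1, 2), Vector(3, 4), Vector(5, 6), Vector(7, 8))
  val out: Vector[Int] = multiFold(tiles.length, 2, 0)(t => (t / 2, tiles(t).sum))(_ + _)
}
```

Because each index names the output location it updates, the pattern can describe a computation over tiles without requiring affine access functions.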


Parallel Pattern Tiling: MultiFold

[Figure: multiFold drawn in terms of existing patterns: map, reduce, and a groupBy key, between input tiles tile0 to tile3 and output tiles out_tile0 and out_tile1]


kMeans: Untiled

• Data-dependent (non-affine) access to 'sum' and 'count'

• Lots of data locality

• Typically, n >> k

[Figure: arrays samples, kMeans, minDist, minDistIdx, sum, count, and newKmeans, with dimensions labeled n, k, and d]

kMeans #reads: n * k * d


Parallel Pattern Strip Mining

• Transform parallel pattern → nested patterns

• Strip-mined patterns enable computation reordering

• Insert copies to enhance locality
    • Copies guide creation of on-chip buffers

Parallel Patterns:

    map(d){i => 2*x(i)}

Strip Mined Patterns:

    multiFold(d/b){ii =>
      xTile = x.copy(b + ii)
      (i, map(b){i => 2*xTile(i)})
    }
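The strip-mining rewrite can be checked on ordinary collections. The sketch below assumes the tile size b divides the length d, uses slice as a stand-in for the x.copy tile load, and uses hypothetical names (StripMine, plain, stripMined).

```scala
object StripMine {
  // Plain pattern: map(d){ i => 2 * x(i) }
  def plain(x: Vector[Int]): Vector[Int] = x.map(2 * _)

  // Strip-mined: outer loop over tiles of size b, inner map over one tile.
  // `slice` plays the role of x.copy: it models loading one tile on chip.
  def stripMined(x: Vector[Int], b: Int): Vector[Int] = {
    require(b > 0 && x.length % b == 0, "sketch assumes b divides d")
    (0 until x.length / b).flatMap { ii =>
      val xTile = x.slice(ii * b, (ii + 1) * b)
      xTile.map(2 * _)
    }.toVector
  }
}
```

Both versions compute the same result; the nested form only makes the tile and its copy explicit so that later passes can reorder and buffer them.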


Strip Mining: kMeans

[Figure: samples (n) and kMeans (k) are processed in blocks samplesBlock (bs) and kMeansBlock (bk); minDist, minDistIdx, sum, and count are shown as before]

kMeans #reads: n * k * d


Parallel Pattern Interchange

• Reorder nested patterns
• Move 'copy' operations out toward outer pattern(s)
• Improves locality and reuse of on-chip memory

Strip Mined Patterns:

    multiFold(m/b0,n/b1){ii,jj =>
      xTl = x.copy(b0+ii, b1+jj)
      ((ii,jj), map(b0,b1){i,j =>
        multiFold(p/b2){kk =>
          yTl = y.copy(b1+jj, b2+kk)
          (0, multiFold(b2){ k =>
            (0, xTl(i,j)* yTl(j,k))
          }{(a,b) => a + b})
        }{(a,b) => a + b}
      })
    }

Interchanged Patterns:

    multiFold(m/b0,n/b1){ii,jj =>
      xTl = x.copy(b0+ii, b1+jj)
      ((ii,jj), multiFold(p/b2){kk =>
        yTl = y.copy(b1+jj, b2+kk)
        (0, map(b0,b1){i,j =>
          (0, multiFold(b2){ k =>
            (0, xTl(i,j)* yTl(j,k))
          }{(a,b) => a + b})
        })
      }{(a,b) => map(b0,b1){i,j =>
        a(i,j) + b(i,j)
      }})
    }
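One way to see why interchange helps is to count how often the y tile copy runs for a single (ii, jj) tile of x. The sketch below only counts loop iterations and does no arithmetic on tiles; the names InterchangeCopies, copiesBefore, copiesAfter, and pTiles (standing for p/b2) are hypothetical.

```scala
object InterchangeCopies {
  // Before interchange the y tile copy sits inside the map over (i, j),
  // so it runs once per element and per kk tile.
  def copiesBefore(b0: Int, b1: Int, pTiles: Int): Int = {
    var copies = 0
    for (i <- 0 until b0; j <- 0 until b1; kk <- 0 until pTiles) copies += 1
    copies
  }

  // After interchange the loop over kk is outside the map,
  // so each y tile is copied once and reused by every (i, j).
  def copiesAfter(b0: Int, b1: Int, pTiles: Int): Int = {
    var copies = 0
    for (kk <- 0 until pTiles) {
      copies += 1 // y.copy hoisted above the map
      for (i <- 0 until b0; j <- 0 until b1) () // compute on xTl and yTl
    }
    copies
  }
}
```

With b0 = b1 = 4 and three kk tiles, the copy count in this model drops from 48 to 3.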


Pattern Interchange: kMeans

[Figure: samples (n) and kMeans (k) processed in blocks samplesBlock (bs) and kMeansBlock (bk), with minDist, minDistIdx, sum, and count]

kMeans #reads: (n / bs) * k * d
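The two read counts quoted on the kMeans slides, n * k * d before tiling and (n / bs) * k * d after interchange, differ by a factor of bs. A small worked example, with sizes picked arbitrarily for illustration and assuming bs divides n:

```scala
object KMeansReads {
  // Reads of the kMeans array, using the formulas from the slides.
  def untiled(n: Long, k: Long, d: Long): Long = n * k * d
  def interchanged(n: Long, k: Long, d: Long, bs: Long): Long = (n / bs) * k * d

  // Example sizes (assumptions): one million samples, 8 means, 16 dimensions, blocks of 1000 samples.
  val before: Long = untiled(1000000L, 8L, 16L)
  val after: Long = interchanged(1000000L, 8L, 16L, 1000L)
}
```

Here the count falls from 128,000,000 to 128,000, a reduction equal to the samples block size.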


Template Selection

Memories | Description | IR Construct
Buffer | Scratchpad memory | Statically sized array
Double buffer | Buffer coupling two stages in a metapipeline | Metapipeline
Cache | Tagged memory exploits locality in random accesses | Non-affine accesses

Pipe. Exec. Units | Description | IR Construct
Vector | SIMD parallelism | Map over scalars
Reduction tree | Parallel reduction of associative operations | MultiFold over scalars
Parallel FIFO | Buffer ordered outputs of dynamic size | FlatMap over scalars
CAM | Fully associative key-value store | GroupByFold over scalars

Controllers | Description | IR Construct
Sequential | Coordinates sequential execution | Sequential IR node
Parallel | Coordinates parallel execution | Independent IR nodes
Metapipeline | Execute nested parallel patterns in a pipelined fashion | Outer parallel pattern with multiple inner patterns
Tile memory | Fetch tiles of data from off-chip memory | Transformer-inserted array copy
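The table reads naturally as a lookup from IR construct to hardware template. The following sketch encodes it as a Scala pattern match. The type and case names are invented for the example and are not the compiler's actual IR classes; only the pairings come from the table.

```scala
object TemplateSelection {
  sealed trait IRConstruct
  case object StaticallySizedArray extends IRConstruct
  case object MetapipelineIntermediate extends IRConstruct
  case object NonAffineAccess extends IRConstruct
  case object MapOverScalars extends IRConstruct
  case object MultiFoldOverScalars extends IRConstruct
  case object FlatMapOverScalars extends IRConstruct
  case object GroupByFoldOverScalars extends IRConstruct
  case object SequentialNode extends IRConstruct
  case object IndependentNodes extends IRConstruct
  case object OuterPatternWithInnerPatterns extends IRConstruct
  case object InsertedArrayCopy extends IRConstruct

  // Template chosen for each IR construct, following the table row by row.
  def template(c: IRConstruct): String = c match {
    case StaticallySizedArray          => "Buffer"
    case MetapipelineIntermediate      => "Double buffer"
    case NonAffineAccess               => "Cache"
    case MapOverScalars                => "Vector"
    case MultiFoldOverScalars          => "Reduction tree"
    case FlatMapOverScalars            => "Parallel FIFO"
    case GroupByFoldOverScalars        => "CAM"
    case SequentialNode                => "Sequential controller"
    case IndependentNodes              => "Parallel controller"
    case OuterPatternWithInnerPatterns => "Metapipeline controller"
    case InsertedArrayCopy             => "Tile memory controller"
  }
}
```

Making the trait sealed lets the Scala compiler warn if a construct is left without a template.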


Metapipelining

• Hierarchical pipeline: a “pipeline of pipelines”
    • Exploits nested parallelism

• Inner stages could be other nested patterns or combinational logic
    • Does not require iteration space to be known statically
    • Does not require complete unrolling of inner patterns

• Intermediate data from each stage automatically stored in double buffers
    • Allows stages to have variable execution times

• No need to calculate initiation interval (II)
    • Use asynchronous control signals to begin next iteration
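A rough timing model shows what overlapping stages buys. The sketch below assumes fixed per-stage cycle counts and a simple handshake rule (a stage starts an iteration once its producer has finished that iteration and it has finished its own previous one). It ignores back pressure from finite buffers, and the names and cycle counts are made up.

```scala
object MetapipelineTiming {
  // Sequential execution: every iteration runs all stages back to back.
  def sequential(stageCycles: Vector[Long], iterations: Int): Long =
    stageCycles.sum * iterations

  // Metapipelined execution under the handshake rule described above.
  def metapipelined(stageCycles: Vector[Long], iterations: Int): Long = {
    require(stageCycles.nonEmpty, "need at least one stage")
    // done(s) holds the finish time of stage s for the latest iteration it ran.
    val done = Array.fill(stageCycles.length)(0L)
    for (_ <- 0 until iterations; s <- stageCycles.indices) {
      val producerDone = if (s == 0) 0L else done(s - 1)
      done(s) = math.max(producerDone, done(s)) + stageCycles(s)
    }
    done.last
  }
}
```

For four stages of 10, 30, 20, and 10 cycles over 5 iterations, this model gives 350 cycles sequentially and 190 cycles metapipelined; the slowest stage sets the steady-state rate.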


Metapipeline - 4 stages

    map(N) { r =>
      row = matrix.slice(r)
      diff = map(D) { i =>
        row(i) - sub(i)
      }
      vprod = map(D,D) { (i,j) =>
        diff(i) * diff(j)
      }
      vprod
    }

Metapipeline - Intuition

[Figure: four-stage metapipeline built from the code above. Pipe 1 is a TileMemController for row; Pipe 2 loads row and sub, subtracts, and stores diff; Pipe 3 loads diff, multiplies, and stores vprod; Pipe 4 is a TileMemController for vprod. Successive iterations r = 1, 2, 3, ... occupy different stages at the same time.]


Metapipeline Analysis

• Detects metapipelines in the tiled parallel pattern IR

• Detection
    • Chain of producer-consumer parallel patterns within the body of another parallel pattern

• Scheduling
    • Topological sort of IR of parallel pattern body
    • List of stages, where each stage consists of one or more independent parallel patterns

• Promote intermediate buffers to double buffers
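The scheduling step can be pictured as a level-by-level topological sort. The sketch below takes a dependency map (each pattern mapped to the patterns it consumes from) and groups patterns into stages. It is an illustration under that assumed representation, not the authors' implementation; StageScheduler and schedule are hypothetical names.

```scala
object StageScheduler {
  // deps maps each pattern to the patterns whose output it consumes.
  // Returns stages in order; patterns within one stage are independent.
  def schedule(deps: Map[String, Set[String]]): Vector[Set[String]] = {
    @annotation.tailrec
    def loop(remaining: Map[String, Set[String]],
             done: Set[String],
             acc: Vector[Set[String]]): Vector[Set[String]] =
      if (remaining.isEmpty) acc
      else {
        // A pattern is ready once all of its producers have been scheduled.
        val ready = remaining.filter { case (_, ds) => ds.subsetOf(done) }.keySet
        require(ready.nonEmpty, "dependency cycle or unknown producer")
        loop(remaining -- ready, done ++ ready, acc :+ ready)
      }
    loop(deps, Set.empty[String], Vector.empty[Set[String]])
  }
}
```

Each boundary between consecutive stages is where an intermediate buffer would be promoted to a double buffer.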


Putting It All Together: kMeans

[Figure: generated kMeans hardware in three steps:
 1. Load kmeans
 2. Metapipeline: calculate sum and count
 3. Metapipeline: calculate new kmeans, store results
 Blocks shown: kmeans tile load, samples tile load, vector distance (norm) units, scalar distance (tree +), (MinDist, Idx), vector add (+), increment (Inc), divide (/), new kmeans tile store.
 Memories shown: kmeansBlock buffer, samplesBlock double buffers, minIdx double buffer, sum buffer, count buffer, new kmeans double buffer.]

Similar to (and more general than) hand-written designs [1]

[1] Hussain et al., "FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data", AHS 2011


Experimental Setup

• Board:
    • Altera Stratix V
    • 48 GB DDR3 off-chip DRAM, 6 memory channels
    • Board connected to host via PCI-e

• Execution time reported = FPGA execution time
    • CPU ↔ FPGA communication, FPGA configuration time not included

• Goal: How beneficial is tiling and metapipelining?


Experimental Setup

• Baseline
    • Auto-generated MaxJ
    • Representative of state-of-the-art HLS tools

• Baseline optimizations
    • Pipelined execution of innermost loops
    • Parallelized (unrolled) inner loops
    • Parallelism factor chosen by hand
    • Data locality captured at the level of a DRAM burst (384 bytes)

• Parallelism factors are kept consistent across baseline and optimized versions from our flow


Evaluation

[Figures: evaluation result charts on two slides]


Results Summary

• Speedup with tiling: up to 15.5x
• Speedup with tiling + metapipelining: up to 39.4x
• Minimal (often positive!) impact on resource usage
    • Tiled designs have fewer off-chip data loaders and storers


Summary

• Two key optimizations, tiling and metapipelining, to generate efficient FPGA designs from parallel patterns

• Automatic tiling transformations placing fewer restrictions on memory access patterns

• Analysis to automatically infer designs with metapipelines and double buffers

• Significant speedups of up to 39.4x with minimal impact on FPGA resource utilization