Generating Configurable Hardware from Parallel Patterns

Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Chris De Sa, Christos Kozyrakis, Kunle Olukotun

Stanford University

ASPLOS 2016

Motivation

- Increasing interest in using FPGAs as accelerators
  - Key advantage: performance/watt
- Key domains: big data analytics, image processing, financial analytics, scientific computing, search

Problem: Programmability

- Verilog and VHDL are too low level for software developers
- High-level synthesis (HLS) tools need user pragmas to help discover parallelism
  - C-based input, with pragmas requiring hardware knowledge
- Limited in exploiting data locality
- Difficult to synthesize complex datapaths with nested parallelism

Hardware Design - HLS

Add 512 integers stored in external DRAM:

    // Naive HLS input: accumulate directly into DRAM, one element per access.
    // (Function name lost in extraction; restored here.)
    void sum512(int* mem) {
      mem[512] = 0;
      for (int i = 0; i < 512; i++) {
        mem[512] += mem[i];
      }
    }

Generated hardware: a single Sum module attached to DRAM.

27,236 clock cycles for computation: two orders of magnitude too long!

Optimized Design - HLS

    #define CHUNKSIZE (sizeof(MPort)/sizeof(int))
    #define LOOPCOUNT (512/CHUNKSIZE)

    // Optimized HLS input: burst-read into a local buffer over the wide
    // DRAM-controller interface (MPort), then extract elements by shifting.
    void sum512(MPort* mem) {
      MPort buff[LOOPCOUNT];
      memcpy(buff, mem, LOOPCOUNT * sizeof(MPort)); // burst access
      int sum = 0;                                  // local accumulator
      for (int i = 0; i < LOOPCOUNT; i++) {
    #pragma PIPELINE
        for (int j = 0; j < CHUNKSIZE; j++) {
    #pragma UNROLL
          // Bit shifting to extract individual elements from the wide word
          sum += (int)(buff[i] >> j*sizeof(int)*8);
        }
      }
      mem[512] = sum;
    }

302 clock cycles for computation.

Required expertise: sizing to the width of the DRAM controller interface, burst access, a local accumulator, special compiler directives, loop restructuring, and bit shifting to extract individual elements.

So, we need to...

- Use higher-level abstractions
  - Productivity: developer focuses on the application
  - Performance:
    - Capture locality to reduce off-chip memory traffic
    - Exploit parallelism at multiple nesting levels
- A smart compiler generates efficient hardware

Parallel Patterns

- Constructs with special properties with respect to parallelism and memory access
- Examples: map, zip, reduce, groupBy

[Figure: dataflow of map, zip, reduce, and groupBy (buckets key1, key2, key3)]
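The four patterns behave like ordinary collection operations. A minimal Python sketch (illustrative stand-ins, not the paper's staged Scala constructs):

```python
from functools import reduce
from collections import defaultdict

def pat_map(xs, f):                 # map: apply f to every element
    return [f(x) for x in xs]

def pat_zip(xs, ys, f):             # zip: combine elements pairwise
    return [f(a, b) for a, b in zip(xs, ys)]

def pat_reduce(xs, f):              # reduce: combine with an associative f
    return reduce(f, xs)

def pat_group_by(xs, key):          # groupBy: bucket elements by key
    groups = defaultdict(list)
    for x in xs:
        groups[key(x)].append(x)
    return dict(groups)
```

For example, `pat_group_by([1, 2, 3, 4], lambda x: x % 2)` buckets the odd and even elements, mirroring the key1/key2/key3 buckets in the figure.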

Why Parallel Patterns?

- Concise
- Can express a large class of workloads in the machine learning and data analytics domains
- Capture rich semantic information about parallelism and memory access patterns
- Enable powerful transformations using pattern matching and rewrite rules
- Enable generating efficient code for different architectures

Parallel Pattern Language

- A data-parallel language that supports parallel patterns
- Example application: k-means

    // Compute the closest mean for each 'sample':
    // 1. Compute the distance to each mean
    // 2. Select the mean with the shortest distance
    val clusters = samples groupBy { sample =>
      val dists = kMeans map { mean =>
        mean.zip(sample){ (a,b) => sq(a - b) } reduce { (a,b) => a + b }
      }
      Range(0, dists.length) reduce { (i,j) =>
        if (dists(i) < dists(j)) i else j
      }
    }

    // Compute the average of each cluster:
    // 1. Compute the sum of all assigned points
    // 2. Compute the number of assigned points
    // 3. Divide each dimension of sum by count
    val newKmeans = clusters map { e =>
      val sum   = e reduce { (v1,v2) => v1.zip(v2){ (a,b) => a + b } }
      val count = e map { v => 1 } reduce { (a,b) => a + b }
      sum map { a => a / count }
    }
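For intuition, here is a hypothetical plain-Python analogue of one k-means step with the same pattern structure (groupBy over samples, then map/zip/reduce per cluster); the name `kmeans_step` is mine, not from the paper:

```python
def kmeans_step(samples, means):
    # Squared Euclidean distance: zip + map + reduce over one dimension pair
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # groupBy: bucket each sample under the index of its closest mean
    clusters = {}
    for s in samples:
        dists = [dist(m, s) for m in means]                  # map over means
        idx = min(range(len(dists)), key=dists.__getitem__)  # index-reduce
        clusters.setdefault(idx, []).append(s)

    # map over clusters: per-dimension sum of members divided by member count
    new_means = []
    for idx in sorted(clusters):
        pts = clusters[idx]
        sums = [sum(dim) for dim in zip(*pts)]               # zip + reduce
        new_means.append([v / len(pts) for v in sums])
    return new_means
```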

Our Approach

Parallel Patterns
  -> Pattern Transformations (Fusion, Pattern Tiling, Code Motion)
  -> Tiled Parallel Pattern IR
  -> Hardware Generation (Memory Allocation, Template Selection, Metapipeline Analysis)
  -> MaxJ HGL
  -> Bitstream Generation
  -> FPGA Configuration


- High-level parallel patterns (written in Delite) help productivity
- Data locality is improved with parallel-pattern tiling transformations
- Nested parallelism is exploited with hierarchical pipelines and double buffers
- MaxJ is generated, which in turn generates VHDL



Parallel Pattern Tiling: MultiFold

- Tiling using polyhedral analysis limits data access patterns to affine functions of loop indices
- Current parallel patterns cannot represent tiling
- A new parallel pattern, multiFold, describes tiled computation

[Figure: multiFold combines input tiles tile0..tile3 into output tiles out_tile0 and out_tile1]

[Figure: the same multiFold decomposed into a map over input tiles tile0..tile3, a groupBy on the destination key, and a per-key reduce into out_tile0 and out_tile1]
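A minimal Python model of multiFold's semantics as drawn in the figure: each tile index yields a destination offset plus a partial tile, and overlapping contributions are combined with the fold function rather than overwritten (all names are mine):

```python
def multi_fold(n_tiles, init, func, combine):
    # init: initial accumulator; func(ii) -> (destination offset, partial tile)
    acc = list(init)
    for ii in range(n_tiles):
        loc, tile = func(ii)                     # where to write, and what
        for off, v in enumerate(tile):
            acc[loc + off] = combine(acc[loc + off], v)  # fold, not overwrite
    return acc
```

Mirroring the figure, four input tiles folded into two output slots: `multi_fold(4, [0, 0], lambda ii: (ii % 2, [ii + 1]), lambda a, b: a + b)` sums tiles 0 and 2 into the first slot and tiles 1 and 3 into the second.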

kMeans: Untiled

- Data-dependent (non-affine) accesses to 'sum' and 'count'
- Lots of data locality
- Typically, n >> k

[Figure: untiled kMeans dataflow. samples (n x d) and kMeans (k x d) feed the distance computation, producing minDist/minDistIdx, sum (k x d), count (k), and newKmeans. kMeans #reads: n * k * d]

Parallel Pattern Strip Mining

- Transform a parallel pattern into nested patterns
- Strip-mined patterns enable computation reordering
- Insert copies to enhance locality
  - Copies guide the creation of on-chip buffers

Parallel pattern:

    map(d){ i => 2*x(i) }

Strip-mined:

    multiFold(d/b){ ii =>
      xTile = x.copy(b + ii)
      (ii, map(b){ i => 2*xTile(i) })
    }
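The transformation above can be sketched in plain Python: the flat map over d elements becomes an outer loop over tiles, each of which copies its tile (modeling the on-chip buffer fetch) and runs the inner map locally.

```python
def strip_mined_map(x, b, f):
    # Strip-mined map: multiFold over d/b tiles of an inner map of width b.
    d = len(x)
    out = [None] * d
    for ii in range(0, d, b):          # outer multiFold over tile starts
        x_tile = x[ii:ii + b]          # copy: models the on-chip tile buffer
        for i, v in enumerate(x_tile): # inner map over the tile
            out[ii + i] = f(v)
    return out
```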

Strip Mining: kMeans

[Figure: strip-mined kMeans dataflow. samples (n x d) is processed in blocks samplesBlock (bs x d) against kMeans blocks kMeansBlock (bk x d), producing minDist/minDistIdx, sum, and count. kMeans #reads: n * k * d]

Parallel Pattern Interchange

- Reorder nested patterns
- Move 'copy' operations out toward the outer pattern(s)
- Improves locality and reuse of on-chip memory

Strip-mined patterns:

    multiFold(m/b0, n/b1){ (ii,jj) =>
      xTl = x.copy(b0+ii, b1+jj)
      ((ii,jj), map(b0,b1){ (i,j) =>
        multiFold(p/b2){ kk =>
          yTl = y.copy(b1+jj, b2+kk)
          (0, multiFold(b2){ k =>
            (0, xTl(i,j) * yTl(j,k))
          }{ (a,b) => a + b })
        }{ (a,b) => a + b }
      })
    }

Interchanged patterns:

    multiFold(m/b0, n/b1){ (ii,jj) =>
      xTl = x.copy(b0+ii, b1+jj)
      ((ii,jj), multiFold(p/b2){ kk =>
        yTl = y.copy(b1+jj, b2+kk)
        (0, map(b0,b1){ (i,j) =>
          multiFold(b2){ k =>
            (0, xTl(i,j) * yTl(j,k))
          }{ (a,b) => a + b }
        })
      }{ (a,b) => map(b0,b1){ (i,j) => a(i,j) + b(i,j) } })
    }
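The interchanged tiled matrix multiply above can be mimicked in plain Python. In this sketch (illustrative only; loop bounds are clamped for partial tiles) the reduction over k-tiles sits outside the (i, j) map, so each y tile is fetched once and reused across the whole b0 x b1 output tile while partial results accumulate:

```python
def tiled_matmul(x, y, b0, b1, b2):
    # x: m x p, y: p x n; b0/b1/b2 are the tile sizes for m, n, and p.
    m, p = len(x), len(x[0])
    n = len(y[0])
    out = [[0] * n for _ in range(m)]
    for ii in range(0, m, b0):
        for jj in range(0, n, b1):
            for kk in range(0, p, b2):       # interchanged: tile-reduce outermost
                # accumulate this k-tile's contribution to the (ii, jj) tile
                for i in range(ii, min(ii + b0, m)):
                    for j in range(jj, min(jj + b1, n)):
                        out[i][j] += sum(x[i][k] * y[k][j]
                                         for k in range(kk, min(kk + b2, p)))
    return out
```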

Pattern Interchange: kMeans

[Figure: interchanged kMeans dataflow. Each samplesBlock (bs x d) is reused across all kMeans blocks (bk x d) while minDist/minDistIdx, sum, and count accumulate. kMeans #reads: (n / bs) * k * d]


Template Selection

Memories:
- Buffer: scratchpad memory (IR construct: statically sized array)
- Double buffer: buffer coupling two stages in a metapipeline (IR construct: metapipeline)
- Cache: tagged memory that exploits locality in random accesses (IR construct: non-affine accesses)
- Tile memory: fetches tiles of data from off-chip memory (IR construct: transformer-inserted array copy)

Pipelined execution units:
- Vector: SIMD parallelism (IR construct: Map over scalars)
- Reduction tree: parallel reduction of associative operations (IR construct: MultiFold over scalars)
- Parallel FIFO: buffers ordered outputs of dynamic size (IR construct: FlatMap over scalars)
- CAM: fully associative key-value store (IR construct: GroupByFold over scalars)

Controllers:
- Sequential: coordinates sequential execution (IR construct: Sequential IR node)
- Parallel: coordinates parallel execution (IR construct: independent IR nodes)
- Metapipeline: executes nested parallel patterns in a pipelined fashion (IR construct: outer parallel pattern with multiple inner patterns)

Metapipelining

- Hierarchical pipeline: a "pipeline of pipelines"
- Exploits nested parallelism
  - Inner stages can be other nested patterns or combinational logic
  - Does not require the iteration space to be known statically
  - Does not require complete unrolling of inner patterns
- Intermediate data from each stage is automatically stored in double buffers
  - Allows stages to have variable execution times
  - No need to calculate an initiation interval (II)
  - Asynchronous control signals begin the next iteration
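The double-buffer handoff described above can be modeled in a few lines of Python. This toy model (my own, not the paper's generator) overlaps a producer stage filling one buffer for iteration t+1 with a consumer stage draining the other buffer for iteration t; only the swap point is synchronized, so no static initiation interval is needed:

```python
def metapipeline(tiles, stage1, stage2):
    # Two-stage metapipeline with a double buffer between the stages.
    results = []
    front, back = None, None
    for tile in tiles:
        back = stage1(tile)                 # producer fills the "back" buffer
        if front is not None:
            results.append(stage2(front))   # consumer drains the "front" buffer
        front, back = back, front           # swap: the double-buffer handoff
    if front is not None:
        results.append(stage2(front))       # drain the final iteration
    return results
```

Because the two stages only meet at the swap, either stage may take a variable amount of time per iteration without stalling the other's internal pipeline.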

Metapipeline - 4 Stages

    map(N) { r =>
      row   = matrix.slice(r)
      diff  = map(D)   { i     => row(i) - sub(i) }
      vprod = map(D,D) { (i,j) => diff(i) * diff(j) }
    }

Metapipeline - Intuition

[Figure: the body becomes a 4-stage metapipeline. Pipe1: tile memory controller loads 'row'. Pipe2: 'sub' stage computes 'diff' (load, subtract, store). Pipe3: 'vprod' stage (load, multiply, store). Pipe4: tile memory controller stores 'vprod'. Successive iterations r = 1..5 overlap across the stages, with double buffers between them.]

Metapipeline Analysis

- Detects metapipelines in the tiled parallel pattern IR
- Detection: a chain of producer-consumer parallel patterns within the body of another parallel pattern
- Scheduling: topological sort of the IR of the parallel pattern body
  - Produces a list of stages, where each stage consists of one or more independent parallel patterns
  - Promotes intermediate buffers to double buffers
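The scheduling step above (topological sort, then grouping independent patterns into stages) can be sketched as follows; `deps` maps each pattern to the patterns it consumes from, and all names are hypothetical:

```python
def schedule_stages(deps):
    # deps: pattern -> set of producer patterns it consumes from.
    # Stage number = longest producer chain ending at the pattern, so
    # patterns with equal depth are mutually independent and share a stage.
    depth = {}
    def d(node):
        if node not in depth:
            depth[node] = 1 + max((d(p) for p in deps[node]), default=0)
        return depth[node]
    for node in deps:
        d(node)
    stages = {}
    for node, lvl in depth.items():
        stages.setdefault(lvl, []).append(node)
    return [sorted(stages[lvl]) for lvl in sorted(stages)]
```

For the diamond body a -> {b, c} -> d, this yields three stages, with b and c sharing the middle stage; the buffers written between stages are the ones promoted to double buffers.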

Putting It All Together: kMeans

[Figure: the generated kMeans design.
 1. Load the kMeans tile (kmeansTile Load into the kmeansBlock buffer).
 2. Metapipeline: calculate sum and count - vector distance (norm) units over the samplesTile, a scalar distance reduction tree, (MinDist, Idx) selection, and accumulation (+, Inc) into the sum and count buffers.
 3. Metapipeline: calculate the new kMeans (divide sum by count) and store the results (New kmeansTile Store).
 samplesBlock, minIdx, and the new kMeans use double buffers.]

Similar to (and more general than) hand-written designs [1].

[1] Hussain et al., "FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data", AHS 2011.

Experimental Setup

- Board: Altera Stratix V
  - 48 GB DDR3 off-chip DRAM, 6 memory channels
  - Board connected to the host via PCIe
- Execution time reported = FPGA execution time
  - CPU <-> FPGA communication and FPGA configuration time not included
- Goal: how beneficial are tiling and metapipelining?

- Baseline: auto-generated MaxJ
  - Representative of state-of-the-art HLS tools
- Baseline optimizations:
  - Pipelined execution of innermost loops
  - Parallelized (unrolled) inner loops
  - Parallelism factor chosen by hand
  - Data locality captured at the level of a DRAM burst (384 bytes)
- Parallelism factors are kept consistent across the baseline and the optimized versions from our flow

Evaluation

[Figure: per-benchmark speedup and resource-usage charts]

Results Summary

- Speedup with tiling: up to 15.5x
- Speedup with tiling + metapipelining: up to 39.4x
- Minimal (often positive!) impact on resource usage
  - Tiled designs have fewer off-chip data loaders and storers

Summary

- Two key optimizations, tiling and metapipelining, to generate efficient FPGA designs from parallel patterns
- Automatic tiling transformations that place fewer restrictions on memory access patterns
- An analysis to automatically infer designs with metapipelines and double buffers
- Significant speedups of up to 39.4x with minimal impact on FPGA resource utilization