Generating Configurable Hardware from Parallel Patterns

Raghu Prabhakar, David Koeplinger, Kevin J. Brown, HyoukJoong Lee, Chris De Sa, Christos Kozyrakis, Kunle Olukotun
Stanford University
ASPLOS 2016
Motivation

- Increasing interest in using FPGAs as accelerators
  - Key advantage: Performance/Watt
- Key domains:
  - Big data analytics, image processing, financial analytics, scientific computing, search
Problem: Programmability

- Verilog and VHDL are too low-level for software developers
- High-level synthesis (HLS) tools need user pragmas to help discover parallelism
  - C-based input, with pragmas requiring hardware knowledge
- Limited in exploiting data locality
- Difficult to synthesize complex datapaths with nested parallelism
Hardware Design - HLS

Add 512 integers stored in external DRAM:

  // Naive HLS input: every access goes to DRAM
  // (function name added here; the slide omits it)
  void sum(int* mem) {
    mem[512] = 0;
    for (int i = 0; i < 512; i++) {
      mem[512] += mem[i];
    }
  }

(Figure: SumModule connected to DRAM)

27,236 clock cycles for computation: two orders of magnitude too long!
Optimized Design - HLS

  #define CHUNKSIZE (sizeof(MPort)/sizeof(int))  // width of DRAM controller interface
  #define LOOPCOUNT (512/CHUNKSIZE)

  // (function name added here; the slide omits it)
  void sum(MPort* mem) {
    MPort buff[LOOPCOUNT];
    memcpy(buff, mem, sizeof(buff));    // burst access into a local variable
    int sum = 0;
    for (int i = 0; i < LOOPCOUNT; i++) {
  #pragma PIPELINE
      for (int j = 0; j < CHUNKSIZE; j++) {
  #pragma UNROLL
        // bit shifting to extract individual elements from the wide port
        sum += (int)(buff[i] >> j*sizeof(int)*8);
      }
    }
    mem[512] = sum;
  }

302 clock cycles for computation

Optimizations required: matching the width of the DRAM controller interface, burst access, use of a local variable, special compiler directives, loop restructuring, and bit shifting to extract individual elements.
So, we need to ...

- Use higher-level abstractions
  - Productivity: developer focuses on the application
  - Performance:
    - Capture locality to reduce off-chip memory traffic
    - Exploit parallelism at multiple nesting levels
- Smart compiler generates efficient hardware
Parallel Patterns

- Constructs with special properties with respect to parallelism and memory access

(Figure: dataflow of map, zip, reduce, and groupBy; groupBy buckets elements under key1, key2, key3)
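As a concrete illustration of the four constructs in the figure, the same patterns exist on ordinary Scala collections. This is a sketch for intuition only: the deck's DSL versions carry the same semantics but are compiled to hardware rather than executed sequentially.

```scala
object PatternsDemo {
  val xs = Seq(1, 2, 3, 4)
  val ys = Seq(10, 20, 30, 40)

  val mapped  = xs.map(_ * 2)                           // map: independent per-element work
  val zipped  = xs.zip(ys).map { case (a, b) => a + b } // zip: pairwise combination
  val reduced = xs.reduce(_ + _)                        // reduce: associative combination
  val grouped = xs.groupBy(_ % 2)                       // groupBy: bucket elements by key
}
```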
Why Parallel Patterns?

- Concise
- Can express a large class of workloads in the machine learning and data analytics domains
- Capture rich semantic information about parallelism and memory access patterns
- Enable powerful transformations using pattern matching and rewrite rules
- Enable generating efficient code for different architectures
Parallel Pattern Language

- A data-parallel language that supports parallel patterns
- Example application: k-means

  // Compute closest mean for each 'sample':
  //   1. Compute distance with each mean
  //   2. Select the mean with the shortest distance
  val clusters = samples groupBy { sample =>
    val dists = kMeans map { mean =>
      mean.zip(sample){ (a,b) => sq(a - b) } reduce { (a,b) => a + b }
    }
    Range(0, dists.length) reduce { (i,j) =>
      if (dists(i) < dists(j)) i else j
    }
  }

  // Compute average of each cluster:
  //   1. Compute sum of all assigned points
  //   2. Compute number of assigned points
  //   3. Divide each dimension of sum by count
  val newKmeans = clusters map { e =>
    val sum   = e reduce { (v1,v2) => v1.zip(v2){ (a,b) => a + b } }
    val count = e map { v => 1 } reduce { (a,b) => a + b }
    sum map { a => a / count }
  }
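A minimal executable analogue of the k-means snippet, written with plain Scala collections rather than the deck's DSL. The helper `sq` and the data values are illustrative, not from the slides.

```scala
object KMeansStep {
  def sq(x: Double): Double = x * x

  // Toy data: four 2-D samples and two current means (illustrative values).
  val samples = Seq(Seq(0.0, 0.0), Seq(0.0, 1.0), Seq(10.0, 10.0), Seq(10.0, 11.0))
  val kMeans  = Seq(Seq(0.0, 0.0), Seq(10.0, 10.0))

  // Assignment step: group each sample by the index of its closest mean.
  val clusters = samples.groupBy { sample =>
    val dists = kMeans.map { mean =>
      mean.zip(sample).map { case (a, b) => sq(a - b) }.reduce(_ + _)
    }
    (0 until dists.length).reduce { (i, j) => if (dists(i) < dists(j)) i else j }
  }

  // Update step: the average of each cluster becomes the new mean.
  val newKmeans = clusters.map { case (k, pts) =>
    val sum   = pts.reduce { (v1, v2) => v1.zip(v2).map { case (a, b) => a + b } }
    val count = pts.map(_ => 1).reduce(_ + _)
    k -> sum.map(_ / count)
  }
}
```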
Our Approach

Parallel Patterns
  -> Pattern Transformations (Fusion, Pattern Tiling, Code Motion)
  -> Tiled Parallel Pattern IR
  -> Hardware Generation (Memory Allocation, Template Selection, Metapipeline Analysis)
  -> MaxJ HGL
  -> Bitstream Generation
  -> FPGA Configuration
- High-level parallel patterns help productivity (frontend: Delite)
- Data locality improved with parallel pattern tiling transformations
- Nested parallelism exploited with hierarchical pipelines and double buffers
- Generate MaxJ, which in turn generates VHDL
Parallel Pattern Tiling: MultiFold

- Tiling using polyhedral analysis limits data access patterns to affine functions of loop indices
- Current parallel patterns cannot represent tiling
- A new parallel pattern, multiFold, describes tiled computation

(Figure: multiFold accumulates input tiles tile0..tile3 into output tiles out_tile0 and out_tile1)
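A sequential sketch of multiFold's semantics, per my reading of the pattern (not the paper's implementation): each index produces an (offset, tile) pair, and tiles landing at the same offset are merged into the accumulator with the given combine function.

```scala
object MultiFoldDemo {
  // multiFold(n){ i => (offset, tile) }{ combine }:
  // each iteration writes a tile into the accumulator at `offset`,
  // combining elementwise with whatever is already there.
  def multiFold[A](n: Int)(init: Vector[A])
                  (body: Int => (Int, Vector[A]))
                  (combine: (A, A) => A): Vector[A] = {
    var acc = init
    for (i <- 0 until n) {
      val (off, tile) = body(i)
      for (j <- tile.indices)
        acc = acc.updated(off + j, combine(acc(off + j), tile(j)))
    }
    acc
  }

  // Sum four input tiles of size 2 into a single output tile (offset 0 always):
  val in  = Vector(1, 2, 3, 4, 5, 6, 7, 8)
  val out = multiFold(4)(Vector(0, 0)) { i => (0, in.slice(2 * i, 2 * i + 2)) } (_ + _)
}
```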
Parallel Pattern Tiling: MultiFold

(Figure: map, reduce, and groupBy over tiles tile0..tile3 are each expressed as multiFold instances producing out_tile0 and out_tile1; groupBy additionally carries its key function)
kMeans: Untiled

- Data-dependent (non-affine) access to 'sum' and 'count'
- Lots of data locality
- Typically, n >> k

(Figure: dataflow from samples (n x d) through minDist/minDistIdx to sum (k x d), count (k), and newKmeans (k x d))

kMeans #reads: n * k * d
Parallel Pattern Strip Mining

- Transform each parallel pattern into nested patterns
- Strip-mined patterns enable computation reordering
- Insert copies to enhance locality
  - Copies guide creation of on-chip buffers

Parallel pattern:

  map(d){i => 2*x(i)}

Strip-mined pattern:

  multiFold(d/b){ii =>
    xTile = x.copy(b + ii)
    (ii, map(b){i => 2*xTile(i)})
  }
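A small sequential check of what the transformation means (illustrative sizes, plain Scala): the strip-mined form copies one tile of `x` at a time, computes over the tile, and writes it back at its offset, producing the same result as the flat map.

```scala
object StripMineDemo {
  val d = 8; val b = 2
  val x = Vector.tabulate(d)(i => i + 1) // 1..8

  // Flat pattern: map(d){ i => 2*x(i) }
  val flat = Vector.tabulate(d)(i => 2 * x(i))

  // Strip-mined: multiFold(d/b){ ii => copy a tile, map over the tile }
  val stripMined = {
    var out = Vector.fill(d)(0)
    for (t <- 0 until d / b) {
      val ii    = t * b
      val xTile = x.slice(ii, ii + b)                 // models x.copy: an on-chip tile
      val tile  = Vector.tabulate(b)(i => 2 * xTile(i))
      for (j <- 0 until b) out = out.updated(ii + j, tile(j))
    }
    out
  }
}
```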
Strip Mining: kMeans

(Figure: kMeans strip-mined into kMeansBlock over samplesBlock, with tiles of size bs and bk; minDist/minDistIdx, sum, and count are computed per block)

kMeans #reads: n * k * d
Parallel Pattern Interchange

- Reorder nested patterns
- Move 'copy' operations out toward outer pattern(s)
- Improves locality and reuse of on-chip memory

Strip-mined patterns:

  multiFold(m/b0,n/b1){ii,jj =>
    xTl = x.copy(b0+ii, b1+jj)
    ((ii,jj), map(b0,b1){i,j =>
      multiFold(p/b2){kk =>
        yTl = y.copy(b1+jj, b2+kk)
        (0, multiFold(b2){ k =>
          (0, xTl(i,j) * yTl(j,k))
        }{(a,b) => a + b})
      }{(a,b) => a + b}
    })
  }

Interchanged patterns:

  multiFold(m/b0,n/b1){ii,jj =>
    xTl = x.copy(b0+ii, b1+jj)
    ((ii,jj), multiFold(p/b2){kk =>
      yTl = y.copy(b1+jj, b2+kk)
      (0, map(b0,b1){i,j =>
        (0, multiFold(b2){ k =>
          (0, xTl(i,j) * yTl(j,k))
        }{(a,b) => a + b})
      })
    }{(a,b) => map(b0,b1){i,j => a(i,j) + b(i,j)}})
  }
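The interchange above can be sanity-checked in plain Scala on a small matrix multiply (illustrative sizes; output tiling omitted for brevity, only the k dimension is tiled). Hoisting the k-tile loop outside the per-element map and merging partial tiles with an elementwise add leaves the result unchanged, while each y tile is now fetched once per output tile instead of once per output element.

```scala
object InterchangeDemo {
  val m = 4; val n = 4; val p = 4; val b2 = 2
  val x = Vector.tabulate(m, p)((i, k) => i + k + 1) // m x p
  val y = Vector.tabulate(p, n)((k, j) => 2 * k + j) // p x n

  // Strip-mined order: each output element reduces over all k-tiles.
  val beforeInterchange = Vector.tabulate(m, n) { (i, j) =>
    (0 until p / b2).map { t =>
      (0 until b2).map(k => x(i)(t * b2 + k) * y(t * b2 + k)(j)).sum
    }.sum
  }

  // Interchanged order: the k-tile loop is outermost; partial result tiles
  // are combined with an elementwise add (the outer multiFold's combine).
  val afterInterchange = {
    var acc = Vector.fill(m)(Vector.fill(n)(0))
    for (t <- 0 until p / b2) {
      val partial = Vector.tabulate(m, n) { (i, j) =>
        (0 until b2).map(k => x(i)(t * b2 + k) * y(t * b2 + k)(j)).sum
      }
      acc = Vector.tabulate(m)(i => Vector.tabulate(n)(j => acc(i)(j) + partial(i)(j)))
    }
    acc
  }
}
```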
Pattern Interchange: kMeans

(Figure: after interchange, kMeansBlock tiles (size bk) are reused across each samplesBlock tile (size bs), with minDist/minDistIdx, sum, and count per block)

kMeans #reads: (n / bs) * k * d
Template Selection

Memories:

  Template       | Description                                          | IR Construct
  Buffer         | Scratchpad memory                                    | Statically sized array
  Double buffer  | Buffer coupling two stages in a metapipeline         | Metapipeline
  Cache          | Tagged memory exploiting locality in random accesses | Non-affine accesses

Pipelined Execution Units:

  Template       | Description                                  | IR Construct
  Vector         | SIMD parallelism                             | Map over scalars
  Reduction tree | Parallel reduction of associative operations | MultiFold over scalars
  Parallel FIFO  | Buffer ordered outputs of dynamic size       | FlatMap over scalars
  CAM            | Fully associative key-value store            | GroupByFold over scalars

Controllers:

  Template     | Description                                            | IR Construct
  Sequential   | Coordinates sequential execution                       | Sequential IR node
  Parallel     | Coordinates parallel execution                         | Independent IR nodes
  Metapipeline | Executes nested parallel patterns in pipelined fashion | Outer parallel pattern with multiple inner patterns
  Tile memory  | Fetches tiles of data from off-chip memory             | Transformer-inserted array copy
Metapipelining

- Hierarchical pipeline: a "pipeline of pipelines"
  - Exploits nested parallelism
  - Inner stages can be other nested patterns or combinational logic
- Does not require the iteration space to be known statically
  - Does not require complete unrolling of inner patterns
- Intermediate data from each stage automatically stored in double buffers
  - Allows stages to have variable execution times
- No need to calculate an initiation interval (II)
  - Uses asynchronous control signals to begin the next iteration
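A toy software model of the double-buffer mechanism between two stages (a sketch for intuition, not the generated hardware): the producer writes one half of a ping-pong buffer while the consumer reads the other half, and the halves swap each iteration, so the stages overlap without clobbering each other's data.

```scala
object DoubleBufferDemo {
  val iterations = 4
  var ping = Vector(0, 0) // half being written this iteration
  var pong = Vector(0, 0) // half being read this iteration
  val stage2Seen = collection.mutable.ArrayBuffer[Vector[Int]]()

  for (it <- 0 until iterations) {
    // Stage 2 consumes what stage 1 produced *last* iteration...
    if (it > 0) stage2Seen += pong
    // ...while stage 1 concurrently fills the other half for this iteration.
    ping = Vector(it, it * 10)
    // End of iteration: swap the halves.
    val t = ping; ping = pong; pong = t
  }

  // Stage 2 observed each iteration's data exactly one iteration later.
  val observed = stage2Seen.toVector
}
```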
Metapipeline - 4 stages

Metapipeline - Intuition

  map(N) { r =>
    row = matrix.slice(r)
    diff = map(D) { i =>
      row(i) - sub(i)
    }
    vprod = map(D,D) { (i,j) =>
      diff(i) * diff(j)
    }
  }

(Figure: the body becomes a 4-stage metapipeline: Pipe1 is a tile memory controller loading 'row'; Pipe2 loads 'row' and 'sub', subtracts, and stores 'diff'; Pipe3 loads 'diff', multiplies, and stores the outer product 'vprod'; Pipe4 is a tile memory controller storing 'vprod'. Successive iterations r = 1..5 overlap across the stages.)
Metapipeline Analysis

- Detects metapipelines in the tiled parallel pattern IR
- Detection
  - Chain of producer-consumer parallel patterns within the body of another parallel pattern
- Scheduling
  - Topological sort of the IR of the parallel pattern body
  - Produces a list of stages, where each stage consists of one or more independent parallel patterns
  - Promotes intermediate buffers to double buffers
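The scheduling step can be sketched in a few lines (node names are illustrative, not from the paper; assumes the producer-consumer graph is a DAG): repeatedly peel off every pattern whose producers have already been scheduled, and each peel becomes one metapipeline stage, so independent patterns share a stage.

```scala
object StageScheduleDemo {
  // deps(n) = the set of producers node n consumes from.
  def stages(nodes: Set[String], deps: Map[String, Set[String]]): List[Set[String]] = {
    var remaining = nodes
    var out = List.empty[Set[String]]
    while (remaining.nonEmpty) {
      // A node is ready when none of its producers remain unscheduled.
      val ready = remaining.filter { n =>
        deps.getOrElse(n, Set.empty[String]).intersect(remaining).isEmpty
      }
      out = out :+ ready
      remaining = remaining -- ready
    }
    out
  }

  // A linear chain like the metapipeline example: load -> diff -> vprod -> store.
  val result = stages(
    Set("load", "diff", "vprod", "store"),
    Map("diff" -> Set("load"), "vprod" -> Set("diff"), "store" -> Set("vprod"))
  )
}
```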
Putting It All Together: kMeans

(Figure: the generated design - a samplesBlock double buffer and kmeansTile load feed a metapipeline of vector distance (norm) units, a scalar distance reduction tree producing (MinDist, Idx) into a minIdx double buffer, and increment/add units accumulating into sum and count buffers; a divide stage writes a new kmeans double buffer, followed by a kmeansTile store)

1. Load kmeans
2. Metapipeline: calculate sum and count
3. Metapipeline: calculate new kmeans, store results

Similar to (and more general than) hand-written designs [1]

[1] Hussain et al., "FPGA implementation of k-means algorithm for bioinformatics application: An accelerated approach to clustering microarray data", AHS 2011
Experimental Setup

- Board: Altera Stratix V
  - 48 GB DDR3 off-chip DRAM, 6 memory channels
  - Board connected to host via PCIe
- Execution time reported = FPGA execution time
  - CPU <-> FPGA communication and FPGA configuration time not included
- Goal: how beneficial are tiling and metapipelining?
Experimental Setup

- Baseline: auto-generated MaxJ
  - Representative of state-of-the-art HLS tools
- Baseline optimizations
  - Pipelined execution of innermost loops
  - Parallelized (unrolled) inner loops
  - Parallelism factor chosen by hand
  - Data locality captured at the level of a DRAM burst (384 bytes)
- Parallelism factors are kept consistent across baseline and optimized versions from our flow
Evaluation

(Figure: per-benchmark speedup and resource-utilization charts)
Results Summary

- Speedup with tiling: up to 15.5x
- Speedup with tiling + metapipelining: up to 39.4x
- Minimal (often positive!) impact on resource usage
  - Tiled designs have fewer off-chip data loaders and storers
Summary

- Two key optimizations, tiling and metapipelining, to generate efficient FPGA designs from parallel patterns
- Automatic tiling transformations placing fewer restrictions on memory access patterns
- Analysis to automatically infer designs with metapipelines and double buffers
- Significant speedups of up to 39.4x with minimal impact on FPGA resource utilization