GENETIC ALGORITHM BASED FPGA PLACEMENT ON GPU
SUNDAR SRINIVASAN SENTHILKUMAR T. R.
[Diagram: CLB netlist → placement on FPGA]
FPGA PLACEMENT PROBLEM
•Input – A technology-mapped netlist of Configurable Logic Blocks (CLBs) realizing a given circuit.
•Output – The CLB netlist placed in a two-dimensional array of slots such that the total wirelength is minimized.
COST FUNCTION:

$$\mathrm{Cost} = \sum_{i=1}^{N_{nets}} q(i)\left[\frac{bb_x(i)}{C_{av,x}(i)} + \frac{bb_y(i)}{C_{av,y}(i)}\right]$$
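As a concrete illustration of the cost function, here is a minimal Python sketch of the half-perimeter bounding-box term, under the simplifying assumptions q(i) = 1 and unit average channel capacities; the helper name and data layout are hypothetical, not from the original implementation.

```python
def bounding_box_cost(nets, positions):
    """Half-perimeter wirelength cost, assuming q(i) = 1 and
    C_av,x(i) = C_av,y(i) = 1 for every net.

    nets: list of nets, each a list of block ids.
    positions: dict mapping block id -> (x, y) slot coordinates.
    """
    total = 0.0
    for net in nets:
        xs = [positions[b][0] for b in net]
        ys = [positions[b][1] for b in net]
        # bb_x + bb_y: width plus height of the net's bounding box.
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# The 6-terminal net from the backup slide: spread of 4 in x and 2 in y.
net = [[0, 1, 2, 3, 4, 5]]
pos = {0: (0, 0), 1: (4, 2), 2: (1, 1), 3: (2, 0), 4: (3, 2), 5: (2, 1)}
print(bounding_box_cost(net, pos))  # -> 6.0
```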
EXISTING TECHNIQUES FOR FPGA PLACEMENT
VPR - Uses Simulated Annealing (SA) with Adaptive Annealing Schedule
Tabu Search Based Method
Force Directed Placement
Genetic Algorithm Based Placement
Partitioning and Clustering Based Techniques
Which one should we choose for parallelization?
All of the above algorithms are heavily time-consuming.
Most of them are not easily parallelizable, except the Genetic Algorithm.
Genetic Algorithm is a population-based optimization technique. It runs through many iterations called generations.
Genetic Algorithm is heavily computation-intensive, but it is an efficient optimization technique for NP-hard problems.
Each generation has at least two cost evaluation phases that are suitable for parallelization. → Reason: each individual's cost evaluation is independent of the others.
GENETIC ALGORITHM FOR FPGA PLACEMENT
Encoding
An individual (chromosome) in a genetic algorithm is a string of integers (genes). These integers (genes) represent the positions of all the CLBs and IO blocks that are to be placed on the FPGA. The index of a gene corresponds to a particular CLB or IO block.
Example:
Consider a chromosome that represents the placement of 7 CLBs and 3 IO blocks
[Figure: FPGA grid showing the placed positions of CLBs C1–C7 and IO blocks IO1–IO3]
8 10 15 19 21 22 27 6 18 17
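A small Python sketch of how this encoding can be decoded into a placement; the block naming scheme (C1…C7, IO1…IO3) follows the slide, while the `decode` helper itself is an assumption for illustration.

```python
def decode(chromosome, num_clbs):
    """Return {block_name: slot} for a chromosome whose genes list
    the CLB slots first, followed by the IO slots."""
    placement = {}
    for i, slot in enumerate(chromosome):
        if i < num_clbs:
            placement[f"C{i + 1}"] = slot        # gene index -> CLB
        else:
            placement[f"IO{i - num_clbs + 1}"] = slot  # gene index -> IO block
    return placement

# The example chromosome from the slide: 7 CLBs followed by 3 IO blocks.
print(decode([8, 10, 15, 19, 21, 22, 27, 6, 18, 17], 7))
```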
GENETIC ALGORITHM FOR FPGA PLACEMENT
8 10 15 19 21 22 27 6 18 17
9 1 19 4 22 10 26 12 24 23
13 10 16 20 21 22 27 23 17 11
22 10 14 13 1 3 4 6 12 23
...
9 15 21 26 8 22 14 12 18 17
Initialization of Population → Cost Evaluation
121.89 89.4 78.12 110.34 ... 99.0
Bounding Box Cost
$$\text{Bounding Box Cost} = \sum_{i=1}^{N_{nets}} q(i)\left[\frac{bb_x(i)}{C_{av,x}(i)} + \frac{bb_y(i)}{C_{av,y}(i)}\right]$$
Tournament Selection :
Pick two individuals from the population randomly and select the best one out of them based on the bounding box cost.
Repeat this “pop_size” times so that another population is created with the better individuals.
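The selection step above can be sketched in a few lines of Python; this is a minimal illustration (lower bounding-box cost wins), not the original implementation.

```python
import random

def tournament_selection(population, costs, rng=random):
    """Binary tournament: pick two individuals at random, keep the one
    with the lower bounding-box cost; repeat pop_size times so that a
    new population of better individuals is created."""
    selected = []
    for _ in range(len(population)):
        a = rng.randrange(len(population))
        b = rng.randrange(len(population))
        winner = a if costs[a] <= costs[b] else b
        selected.append(population[winner])
    return selected

random.seed(0)
pop = [[8, 10], [9, 1], [13, 10], [22, 10]]
costs = [121.89, 89.4, 78.12, 110.34]
new_pop = tournament_selection(pop, costs)
print(len(new_pop))  # -> 4
```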
GENETIC ALGORITHM FOR FPGA PLACEMENT
Crossover :
Perform Partially Mapped Crossover for the CLB part and the IO part of the chromosome separately, because CLBs must not be placed in IO positions and vice versa.
Parents:
8 10 15 19 21 22 27 | 6 18 17
9 1 19 4 22 10 26 | 12 24 23
Offspring:
8 10 19 4 22 21 27 | 6 24 17
9 1 15 19 21 10 26 | 12 18 23
(CLB part | IO part)
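A textbook sketch of Partially Mapped Crossover applied to one part (CLB or IO) of a chromosome. This simplified illustration assumes both parents use the same set of slot values, which is the classical PMX setting; it is not a transcription of the slides' implementation.

```python
def pmx(p1, p2, lo, hi):
    """Partially Mapped Crossover: the child inherits p1's segment
    [lo, hi) and takes the remaining genes from p2, resolving
    conflicts through the segment's value mapping."""
    segment = set(p1[lo:hi])
    mapping = {p1[i]: p2[i] for i in range(lo, hi)}
    child = list(p2)
    child[lo:hi] = p1[lo:hi]
    for i in list(range(lo)) + list(range(hi, len(p1))):
        gene = p2[i]
        while gene in segment:   # follow the mapping chain out of the segment
            gene = mapping[gene]
        child[i] = gene
    return child

print(pmx([1, 2, 3, 4, 5, 6, 7, 8, 9],
          [9, 3, 7, 8, 2, 6, 5, 1, 4], 3, 7))  # -> [9, 3, 2, 4, 5, 6, 7, 1, 8]
```

The mapping chain guarantees that the child remains a valid permutation, i.e. no slot is used twice.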
Mutation :
Choose a chromosome at random, then choose two genes at random and swap them. This is done separately for the CLB part and the IO part, for the reason mentioned above.
8 10 15 19 21 22 27 6 18 17
8 21 15 19 10 22 27 6 18 17
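The swap mutation above can be sketched as follows; the helper name and the `[lo, hi)` range convention are assumptions for illustration.

```python
import random

def swap_mutation(chromosome, lo, hi, rng=random):
    """Swap two randomly chosen genes within [lo, hi), so the CLB part
    and the IO part can be mutated separately."""
    mutant = list(chromosome)
    i = rng.randrange(lo, hi)
    j = rng.randrange(lo, hi)
    mutant[i], mutant[j] = mutant[j], mutant[i]
    return mutant

random.seed(1)
parent = [8, 10, 15, 19, 21, 22, 27, 6, 18, 17]
child = swap_mutation(parent, 0, 7)      # mutate only the CLB part
print(sorted(child) == sorted(parent))   # -> True (same slots, reordered)
```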
8 10 15 19 21 22 27 6 18 17
9 1 19 4 22 10 26 12 24 23
13 10 16 20 21 22 27 23 17 11
22 10 14 13 1 3 4 6 12 23
...
9 15 21 26 8 22 14 12 18 17
Population Cost Evaluation
121.89 89.4 78.12 110.34 ... 99.0
Bounding Box Cost
$$\text{Bounding Box Cost} = \sum_{i=1}^{N_{nets}} q(i)\left[\frac{bb_x(i)}{C_{av,x}(i)} + \frac{bb_y(i)}{C_{av,y}(i)}\right]$$
PARALLEL IMPLEMENTATION OF COST EVALUATION PHASE
Cost evaluations of the individuals are independent of each other and involve heavy floating-point calculations. This is also the most time-consuming phase of the Genetic Algorithm.
So Cost Evaluation can be PARALLELIZED !!!
Multiple threads evaluate the costs of independent individuals.
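On the GPU, one thread evaluates one individual. As a host-side analogy only (not the CUDA kernel itself), the same structure is an independent map over the population; the worker count and cost function here are stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_population(population, cost_fn, workers=4):
    """Evaluate every individual's cost concurrently. Each evaluation
    reads only its own chromosome, so no synchronization is needed;
    this independence is what makes the phase parallelizable."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(cost_fn, population))

# Hypothetical stand-in cost function, just for demonstration.
print(evaluate_population([[1, 2, 3], [4, 5, 6]], sum))  # -> [6, 15]
```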
PARALLELIZATION STRATEGIES IMPLEMENTED
Global Memory Implementation
Place the netlist structures on the global memory
Place the population on the global memory
Shared Memory Implementation
Place the netlist structures on the global memory (Too large to be in SM)
Netlist structures are about 100 KB of memory for typical benchmarks
Place the population on the shared memory
Copying the population to local registers is not possible because the chromosomes are stored as arrays.
The shared memory implementation is expected to offer more speed-up than the global memory implementation, because the chromosomes are accessed very frequently during the bounding box calculation.
GLOBAL MEMORY IMPLEMENTATION
Memory space is not a constraint, but
Memory accesses take too long
Speed-up is very low
Since the cost is evaluated based on a netlist, it is highly difficult for the host or the programmer to coalesce the memory accesses.
Before actually running the program, very little can be done about the memory access pattern.
Next trial : increase the number of threads.
At most, pop_size threads can be invoked in a kernel.
Increasing the number of threads did not help because the memory latency increased.
GLOBAL MEMORY IMPLEMENTATION
Why not increase the granularity of the kernel?
Calculate each net's cost in a separate thread. The number of threads increases by a factor of num_nets.
This increases the occupancy of the kernel and reduces the probability of multiple threads accessing the same location.
BUT, this again did not work, because of warp divergence:
Threads in a warp were accessing non-consecutive locations, depending on the netlist and the placement encoded in the chromosome.
Conclusions from global memory implementation
Global Memory
• Reduces the performance drastically when multiple threads access the same location.
• Warp Divergence reduces the performance.
SHARED MEMORY IMPLEMENTATION
The population array was moved to the shared memory
The netlist could not be moved because of the memory limitation.
The number of individuals that can be moved to shared memory is limited by the size of the netlist, i.e. the number of blocks to be placed.
According to our genotype (encoding of individuals), the memory required per individual is
4 bytes * number of blocks
For example, if the number of blocks to be placed is 100,
the memory required for a single individual is 400 bytes.
Shared memory space is only 16 KB.
Therefore at most 16000 / 400 = 40 individuals fit in one kernel, i.e. 40 threads per kernel.
Very low occupancy, but the speed-up was better than with global memory.
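The budget arithmetic above can be sketched directly (16 KB taken as 16000 bytes, following the slide; the function name is an illustration):

```python
def max_individuals_in_shared_memory(num_blocks_to_place,
                                     smem_bytes=16000, gene_bytes=4):
    """Number of individuals (and hence threads per kernel) whose
    chromosomes fit in one SM's shared memory."""
    bytes_per_individual = gene_bytes * num_blocks_to_place
    return smem_bytes // bytes_per_individual

print(max_individuals_in_shared_memory(100))  # -> 40
```

This is why the thread count shrinks as the circuit grows: doubling the number of blocks to be placed halves the number of chromosomes that fit.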
SHARED MEMORY IMPLEMENTATION
Why not use all the shared memory resources from all the Streaming Multiprocessors ?
This is possible because only the population arrays are placed in shared memory and each thread evaluates the cost of only one individual.
The threads in an SM need access only to the individuals in that SM, not to those in other SMs.
The compiler takes care of the internal memory partitioning among the 8 shared memory spaces, based on the memory accesses made by the threads.
In order to activate all 8 multiprocessors, enough threads need to be invoked. But we do not have that many cost evaluations in a kernel!
So we invoked dummy threads.
These dummy threads are present just to increase the occupancy and activate multiple SMs at the point of kernel invocation.
Will this not cause a load imbalance ?
These dummy threads complete their execution almost immediately after invocation.
Also, it is managed by setting appropriate grid and block dimensions and indexing.
Does Shared memory solve all our problems ?
NO
We still face warp divergence.
Low occupancy
Bank conflicts: multiple simultaneous accesses to the same memory bank.
These are the reasons for not achieving the ideal speed-up.
Why not use Registers ?
Main drawbacks:
Registers do not support arrays, so the individuals cannot be moved to registers.
Limited number of registers available per SM: 8192 registers per SM, which creates a bottleneck on the number of threads that can be invoked.
Registers are used only for temporaries during the cost evaluation phase.
RESULTS AND DISCUSSION
[Figure: Speed-up vs. number of blocks for the global memory implementation; speed-up ranges roughly from 0 to 1.2 over 0 to 250 blocks]
Global Memory
✔ Speed-up not very good.
✔ Speed-up reduces as the number of blocks increases.
Reasons:
✔ Simultaneous memory accesses increase with the increase in the number of blocks.
✔ More blocks lead to more randomness; warp divergence increases.
✔ The number of threads does not increase with the number of blocks to be placed.
RESULTS AND DISCUSSION
Shared Memory Implementation
✔ Speed up increases when compared to global memory implementation
✔ The number of threads that can be invoked in a single kernel is influenced by the problem size
✔ As the problem size increases the Memory Bank Conflicts increase
[Figure: Shared Memory Implementation: speed-up (roughly 1 to 6) vs. number of blocks to be placed (40 to 240)]
[Figure: Limitation of Shared Memory: number of threads that can be invoked (roughly 100 to 600) vs. number of blocks (40 to 240)]
RESULTS AND DISCUSSION
[Figure: Speed-up vs. number of blocks to be placed (roughly 1 to 6 over 40 to 240 blocks)]
The shared memory implementation is up to 5 times faster than the global memory implementation.
The difference is more prominent for smaller circuits.
In the shared memory implementation, the number of kernel calls increases as the circuit size increases.
LESSONS LEARNED
✔ Implement Global memory before Shared memory
Helps to predict the shared memory challenges
Explore multiple strategies of threading before moving to shared memory. Pick the best method and move on to shared memory version.
✔ Memory bank conflicts
Reduce the probability of multiple threads accessing the same memory locations simultaneously. Think carefully before moving to fine-grained parallelism.
✔ Warp divergence in Global Memory is more severe than in Shared memory
✔ Using the Shared Memory of multiple Streaming Multiprocessors is possible.
The block and grid dimensions have to be chosen appropriately to activate multiple SMs.
For e.g., using 64 thread blocks in a kernel helps you utilize the shared memory of all the SMs.
✔ Never blame the hardware architecture before completely exploring software optimizations.
CONCLUSIONS
➢ A genetic algorithm based FPGA placement was implemented on GPU
➢ Different strategies for implementing the parallelization were explored and the best one was chosen (population array in shared memory)
➢ A speed up curve was obtained with respect to the size of the netlist.
➢ A speed-up of up to 5.2X was achieved for smaller circuit sizes
Back up Slides
Half-perimeter Wire Length Model
Net with 6 terminals
Bounding Box Cost Calculation
6 terminal net
BB Cost of this net = 4 (horizontal distance) + 2 (vertical distance) = 6
Total cost is a summation over all the nets
GENETIC ALGORITHM
FPGA CAD FLOW