GENETIC ALGORITHM BASED FPGA PLACEMENT ON GPU
SUNDAR SRINIVASAN SENTHILKUMAR T. R.
[Diagram: CLB netlist → placement on FPGA]
FPGA PLACEMENT PROBLEM
•Input – A technology-mapped netlist of Configurable Logic Blocks (CLBs) realizing a given circuit.
•Output – The CLB netlist placed in a two-dimensional array of slots such that the total wirelength is minimized.
COST FUNCTION:

$$\mathrm{Cost} = \sum_{i=1}^{N_{nets}} q(i)\left[\frac{bb_x(i)}{C_{av,x}(i)} + \frac{bb_y(i)}{C_{av,y}(i)}\right]$$
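As a concrete illustration of the cost function, here is a minimal Python sketch of the half-perimeter bounding-box term, under the simplifying assumptions q(i) = 1 and unit average channel capacities; the helper name and data layout are hypothetical, not from the original implementation.

```python
def bounding_box_cost(nets, positions):
    """Half-perimeter wirelength cost, assuming q(i) = 1 and
    C_av,x(i) = C_av,y(i) = 1 for every net.

    nets: list of nets, each a list of block ids.
    positions: dict mapping block id -> (x, y) slot coordinates.
    """
    total = 0.0
    for net in nets:
        xs = [positions[b][0] for b in net]
        ys = [positions[b][1] for b in net]
        # bb_x + bb_y: width plus height of the net's bounding box.
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# The 6-terminal net from the backup slide: spread of 4 in x and 2 in y.
net = [[0, 1, 2, 3, 4, 5]]
pos = {0: (0, 0), 1: (4, 2), 2: (1, 1), 3: (2, 0), 4: (3, 2), 5: (2, 1)}
print(bounding_box_cost(net, pos))  # -> 6.0
```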
EXISTING TECHNIQUES FOR FPGA PLACEMENT
VPR - Uses Simulated Annealing (SA) with Adaptive Annealing Schedule
Tabu Search Based Method
Force Directed Placement
Genetic Algorithm Based Placement
Partitioning and Clustering Based Techniques
Which one should we choose for parallelization?
All of the above algorithms are heavily time-consuming.
Most of them are not easily parallelizable, except the Genetic Algorithm.
Genetic Algorithm is a population-based optimization technique. It runs through many iterations called generations.
Genetic Algorithm is heavily computation-intensive, but it is an efficient optimization technique for NP-hard problems.
Each generation has at least two cost evaluation phases that are suitable for parallelization. → Reason: each individual's cost evaluation is independent of the others.
GENETIC ALGORITHM FOR FPGA PLACEMENT
Encoding
An individual (chromosome) in a genetic algorithm is a string of integers (genes). These integers (genes) represent the positions of all the CLBs and IO blocks that are to be placed on the FPGA. The index of a gene corresponds to a particular CLB or IO block.
Example:
Consider a chromosome that represents the placement of 7 CLBs and 3 IO blocks
[Figure: FPGA grid showing the placed positions of CLBs C1–C7 and IO blocks IO1–IO3]
8 10 15 19 21 22 27 6 18 17
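A small Python sketch of how this encoding can be decoded into a placement; the block naming scheme (C1…C7, IO1…IO3) follows the slide, while the `decode` helper itself is an assumption for illustration.

```python
def decode(chromosome, num_clbs):
    """Return {block_name: slot} for a chromosome whose genes list
    the CLB slots first, followed by the IO slots."""
    placement = {}
    for i, slot in enumerate(chromosome):
        if i < num_clbs:
            placement[f"C{i + 1}"] = slot        # gene index -> CLB
        else:
            placement[f"IO{i - num_clbs + 1}"] = slot  # gene index -> IO block
    return placement

# The example chromosome from the slide: 7 CLBs followed by 3 IO blocks.
print(decode([8, 10, 15, 19, 21, 22, 27, 6, 18, 17], 7))
```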
GENETIC ALGORITHM FOR FPGA PLACEMENT
8 10 15 19 21 22 27 6 18 17
9 1 19 4 22 10 26 12 24 23
13 10 16 20 21 22 27 23 17 11
22 10 14 13 1 3 4 6 12 23
...
9 15 21 26 8 22 14 12 18 17
Initialization of Population → Cost Evaluation
121.89 89.4 78.12 110.34 ... 99.0
Bounding Box Cost
$$\text{Bounding Box Cost} = \sum_{i=1}^{N_{nets}} q(i)\left[\frac{bb_x(i)}{C_{av,x}(i)} + \frac{bb_y(i)}{C_{av,y}(i)}\right]$$
Tournament Selection :
Pick two individuals from the population randomly and select the best one out of them based on the bounding box cost.
Repeat this “pop_size” times so that another population is created with the better individuals.
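The selection step above can be sketched in a few lines of Python; this is a minimal illustration (lower bounding-box cost wins), not the original implementation.

```python
import random

def tournament_selection(population, costs, rng=random):
    """Binary tournament: pick two individuals at random, keep the one
    with the lower bounding-box cost; repeat pop_size times so that a
    new population of better individuals is created."""
    selected = []
    for _ in range(len(population)):
        a = rng.randrange(len(population))
        b = rng.randrange(len(population))
        winner = a if costs[a] <= costs[b] else b
        selected.append(population[winner])
    return selected

random.seed(0)
pop = [[8, 10], [9, 1], [13, 10], [22, 10]]
costs = [121.89, 89.4, 78.12, 110.34]
new_pop = tournament_selection(pop, costs)
print(len(new_pop))  # -> 4
```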
GENETIC ALGORITHM FOR FPGA PLACEMENT
Crossover :
Perform Partially Mapped Crossover for the CLB part and the IO part of the chromosome separately, because CLBs must not be placed in IO positions and vice versa.
Parents:
8 10 15 19 21 22 27 | 6 18 17
9 1 19 4 22 10 26 | 12 24 23
Offspring:
8 10 19 4 22 21 27 | 6 24 17
9 1 15 19 21 10 26 | 12 18 23
(CLB part | IO part)
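A textbook sketch of Partially Mapped Crossover applied to one part (CLB or IO) of a chromosome. This simplified illustration assumes both parents use the same set of slot values, which is the classical PMX setting; it is not a transcription of the slides' implementation.

```python
def pmx(p1, p2, lo, hi):
    """Partially Mapped Crossover: the child inherits p1's segment
    [lo, hi) and takes the remaining genes from p2, resolving
    conflicts through the segment's value mapping."""
    segment = set(p1[lo:hi])
    mapping = {p1[i]: p2[i] for i in range(lo, hi)}
    child = list(p2)
    child[lo:hi] = p1[lo:hi]
    for i in list(range(lo)) + list(range(hi, len(p1))):
        gene = p2[i]
        while gene in segment:   # follow the mapping chain out of the segment
            gene = mapping[gene]
        child[i] = gene
    return child

print(pmx([1, 2, 3, 4, 5, 6, 7, 8, 9],
          [9, 3, 7, 8, 2, 6, 5, 1, 4], 3, 7))  # -> [9, 3, 2, 4, 5, 6, 7, 1, 8]
```

The mapping chain guarantees that the child remains a valid permutation, i.e. no slot is used twice.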
Mutation :
Choose a chromosome at random, then choose two genes at random and swap them. This is done separately for the CLB part and the IO part, for the reason mentioned above.
8 10 15 19 21 22 27 6 18 17
8 21 15 19 10 22 27 6 18 17
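The swap mutation above can be sketched as follows; the helper name and the `[lo, hi)` range convention are assumptions for illustration.

```python
import random

def swap_mutation(chromosome, lo, hi, rng=random):
    """Swap two randomly chosen genes within [lo, hi), so the CLB part
    and the IO part can be mutated separately."""
    mutant = list(chromosome)
    i = rng.randrange(lo, hi)
    j = rng.randrange(lo, hi)
    mutant[i], mutant[j] = mutant[j], mutant[i]
    return mutant

random.seed(1)
parent = [8, 10, 15, 19, 21, 22, 27, 6, 18, 17]
child = swap_mutation(parent, 0, 7)      # mutate only the CLB part
print(sorted(child) == sorted(parent))   # -> True (same slots, reordered)
```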
8 10 15 19 21 22 27 6 18 17
9 1 19 4 22 10 26 12 24 23
13 10 16 20 21 22 27 23 17 11
22 10 14 13 1 3 4 6 12 23
...
9 15 21 26 8 22 14 12 18 17
Population Cost Evaluation
121.89 89.4 78.12 110.34 ... 99.0
Bounding Box Cost
$$\text{Bounding Box Cost} = \sum_{i=1}^{N_{nets}} q(i)\left[\frac{bb_x(i)}{C_{av,x}(i)} + \frac{bb_y(i)}{C_{av,y}(i)}\right]$$
PARALLEL IMPLEMENTATION OF COST EVALUATION PHASE
Cost evaluations of the individuals are independent of each other and involve heavy floating-point calculations. This is also the most time-consuming phase of the Genetic Algorithm.
So Cost Evaluation can be PARALLELIZED !!!
Multiple threads evaluate the costs of independent individuals.
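On the GPU, one thread evaluates one individual. As a host-side analogy only (not the CUDA kernel itself), the same structure is an independent map over the population; the worker count and cost function here are stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_population(population, cost_fn, workers=4):
    """Evaluate every individual's cost concurrently. Each evaluation
    reads only its own chromosome, so no synchronization is needed;
    this independence is what makes the phase parallelizable."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(cost_fn, population))

# Hypothetical stand-in cost function, just for demonstration.
print(evaluate_population([[1, 2, 3], [4, 5, 6]], sum))  # -> [6, 15]
```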
PARALLELIZATION STRATEGIES IMPLEMENTED
Global Memory Implementation
Place the netlist structures on the global memory
Place the population on the global memory
Shared Memory Implementation
Place the netlist structures on the global memory (Too large to be in SM)
Netlist structures are about 100 KB of memory for typical benchmarks
Place the population on the shared memory
Copying the population to local registers is not possible because the chromosomes are stored as arrays.
The shared memory implementation is expected to offer more speed-up than the global memory implementation, because the chromosomes are accessed very frequently during the bounding box calculation.
GLOBAL MEMORY IMPLEMENTATION
Memory space is not a constraint, but
Memory accesses take too long
Speed-up is very low
Since the cost is evaluated based on a netlist, it is highly difficult for the host or the programmer to coalesce the memory accesses.
Before actually running the program, very little can be done about the memory access pattern.
Next trial : increase the number of threads.
At most, pop_size threads can be invoked in a kernel.
Increasing the number of threads did not help because the memory latency increased.
GLOBAL MEMORY IMPLEMENTATION
Why not increase the granularity of the kernel?
Calculate each net's cost in a separate thread. The number of threads increases by a factor of num_nets.
This increases the occupancy of the kernel and reduces the probability of multiple threads accessing the same location.
BUT, this again did not work, because of warp divergence:
Threads in a warp were accessing non-consecutive locations, depending on the netlist and the placement encoded in the chromosome.
Conclusions from global memory implementation
Global Memory
• Reduces the performance drastically when multiple threads access the same location.
• Warp Divergence reduces the performance.
SHARED MEMORY IMPLEMENTATION
The population array was moved to the shared memory
The netlist could not be moved because of the memory limitation.
The number of individuals that can be moved to shared memory is limited by the size of the netlist, i.e. the number of blocks to be placed.
According to our genotype (encoding of individuals), the memory required per individual is
4 bytes * number of blocks
For example, if the number of blocks to be placed is 100,
the memory required for a single individual is 400 bytes.
Shared memory space is only 16 KB.
Therefore at most 16000 / 400 = 40 individuals fit in one kernel, i.e. 40 threads per kernel.
Very low occupancy, but the speed-up was better than with global memory.
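The budget arithmetic above can be sketched directly (16 KB taken as 16000 bytes, following the slide; the function name is an illustration):

```python
def max_individuals_in_shared_memory(num_blocks_to_place,
                                     smem_bytes=16000, gene_bytes=4):
    """Number of individuals (and hence threads per kernel) whose
    chromosomes fit in one SM's shared memory."""
    bytes_per_individual = gene_bytes * num_blocks_to_place
    return smem_bytes // bytes_per_individual

print(max_individuals_in_shared_memory(100))  # -> 40
```

This is why the thread count shrinks as the circuit grows: doubling the number of blocks to be placed halves the number of chromosomes that fit.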
SHARED MEMORY IMPLEMENTATION
Why not use all the shared memory resources from all the Streaming Multiprocessors ?
This is possible because only the population arrays are placed in shared memory and each thread evaluates the cost of only one individual.
The threads in an SM need access only to the individuals in that SM, not to those in other SMs.
The compiler takes care of the internal memory partitioning among the 8 shared memory spaces, based on the memory accesses made by the threads.
In order to activate all 8 multiprocessors, enough threads need to be invoked. But we do not have that many cost evaluations in a kernel!
So we invoked dummy threads.
These dummy threads are present just to increase the occupancy and activate multiple SMs at the point of kernel invocation.
Will this not cause a load imbalance ?
These dummy threads complete their execution almost immediately after invocation.
Also, it is managed by setting appropriate grid and block dimensions and indexing.
Does Shared memory solve all our problems ?
NO
We still face warp divergence.
Low occupancy
Bank conflicts: multiple simultaneous accesses to the same memory bank.
These are the reasons for not achieving the ideal speed-up.
Why not use Registers ?
Main drawbacks:
Registers do not support arrays, so the individuals cannot be moved to registers.
Limited number of registers available per SM: 8192 registers per SM, which creates a bottleneck on the number of threads that can be invoked.
Registers are used only for temporaries during the cost evaluation phase.
RESULTS AND DISCUSSION
[Figure: Speed-up vs. number of blocks for the global memory implementation; speed-up ranges roughly from 0 to 1.2 over 0 to 250 blocks]
Global Memory
✔ Speed-up not very good.
✔ Speed-up reduces as the number of blocks increases.
Reasons:
✔ Simultaneous memory accesses increase with the increase in the number of blocks.
✔ More blocks lead to more randomness; warp divergence increases.
✔ The number of threads does not increase with the number of blocks to be placed.
RESULTS AND DISCUSSION
Shared Memory Implementation
✔ Speed up increases when compared to global memory implementation
✔ The number of threads that can be invoked in a single kernel is influenced by the problem size
✔ As the problem size increases the Memory Bank Conflicts increase
[Figure: Shared Memory Implementation: speed-up (roughly 1 to 6) vs. number of blocks to be placed (40 to 240)]
[Figure: Limitation of Shared Memory: number of threads that can be invoked (roughly 100 to 600) vs. number of blocks (40 to 240)]
RESULTS AND DISCUSSION
[Figure: Speed-up vs. number of blocks to be placed (roughly 1 to 6 over 40 to 240 blocks)]
The shared memory implementation is up to 5 times faster than the global memory implementation.
The difference is more prominent for smaller circuits.
In the shared memory implementation, the number of kernel calls increases as the circuit size increases.
LESSONS LEARNED
✔ Implement Global memory before Shared memory
Helps to predict the shared memory challenges
Explore multiple strategies of threading before moving to shared memory. Pick the best method and move on to shared memory version.
✔ Memory bank conflicts
Reduce the probability of multiple threads accessing the same memory locations simultaneously. Think carefully before moving to fine-grained parallelism.
✔ Warp divergence in Global Memory is more severe than in Shared memory
✔ Using the Shared Memory of multiple Streaming Multiprocessors is possible.
The block and grid dimensions have to be chosen appropriately to activate multiple SMs.
For e.g., using 64 thread blocks in a kernel helps you utilize the shared memory of all the SMs.
✔ Never blame the hardware architecture before completely exploring software optimizations.
CONCLUSIONS
➢ A genetic algorithm based FPGA placement was implemented on GPU
➢ Different strategies for implementing the parallelization were explored and the best one was chosen (population array in shared memory)
➢ A speed up curve was obtained with respect to the size of the netlist.
➢ A speed-up of up to 5.2X was achieved for smaller circuit sizes
Back up Slides
Half-perimeter Wire Length Model
Net with 6 terminals
Bounding Box Cost Calculation
6 terminal net
BB Cost of this net = 4 (horizontal distance) + 2 (vertical distance) = 6
Total cost is a summation over all the nets
GENETIC ALGORITHM
FPGA CAD FLOW