“Placement for Hierarchical Interconnect based FPGA Devices”

A final project report submitted by

Kesava R. [email protected]

ID: 998-10-3850

Introduction:

Why FPGAs Came into Existence:

By the early 1980s, most of the logic circuits in typical systems were absorbed by a handful of standard large scale integrated circuits (LSI): microprocessors, bus/IO controllers, system timers etc.

But every system still had the need for random "glue logic" to help connect the large ICs: generating global control signals (for resets etc.).

Systems consisted of a few big LSI components and lots of small, low-density SSI and MSI components.

Custom ICs are sometimes designed to replace the large amount of glue logic. This helps in reducing system complexity and manufacturing cost and in improving performance, but custom ICs are very expensive and delay the introduction of the product. Therefore the custom IC approach was only viable for products with very high volume, and which were not time-to-market sensitive. FPGAs were introduced as an alternative to custom ICs for implementing glue logic. FPGAs now also compete with microprocessors in dedicated and embedded applications. FPGAs have a performance advantage over microprocessors because circuits can be customized for the task at hand, whereas microprocessors must provide special functions in software (many cycles).

Some of the benefits of programmable logic versus mask-programmed gate arrays are:

• Faster Design and Verification: An FPGA design can be designed and verified quickly, while the same process requires several weeks with gate arrays. There are no non-recurring engineering costs, no test vectors to generate, and no delay while waiting for prototypes to be manufactured.

• Design Changes without Penalty: Because the devices are software-configured and user-programmed, modifications are much less risky and can be made anytime, in a matter of minutes or hours, as opposed to the weeks it would take with a gate array. This results in significant cost savings in design and production.

• Shortest Time-to-Market: Because many component products are available in common packages with common footprints, designs often can be migrated to higher or lower density devices, or even across some product families, without any printed circuit board changes. Design ideas, represented in text or schematic format, are converted into a configuration data file for an FPGA or CPLD device using step development software running on a PC or workstation.

The pictures below show some commercial FPGAs available in the market:

The above architecture represents the island-style architecture. Xilinx produces these types of devices. Each configurable logic block contains four lookup tables. The interconnects between the blocks are programmable. The input/output blocks can be used for both input and output. The wire length cost is a linear function of the distance between the blocks.

Below is the picture of the hierarchical-style architecture FPGA:

Altera produces these kinds of FPGAs. Each LAB consists of a minimum of 100 and a maximum of 160 lookup tables. The LABs are connected to each other with the help of global interconnects. The placement cost is not proportional to the distance between two blocks. The cost is calculated with different techniques, which are discussed later in this document.

The above information provides a brief introduction to FPGAs. We now turn to the objective of the project:

Objective: To generate a legal placement for hierarchical interconnect based FPGAs such that

1. Design performance is maximized: Design performance means that the maximum pin-to-pin delay in the FPGA is minimized and the total area of the circuit is reduced.

2. Routing congestion is minimized: This takes care of the even distribution of routing wires all over the circuit.

The diagram below shows the placement and routing flow in the Quartus software developed by Altera:

[Figure: Quartus flow: HDL design, synthesis, placement (min-cut based recursive partitioning), routing, programming bit generation]

At the initial stage, a design written in a hardware description language (Verilog or VHDL) is fed to the Quartus software. Quartus then performs synthesis, during which redundant logic is removed. After synthesis, placement is done using min-cut based recursive partitioning. After placement, routing is done, and finally the programming bits are generated and fed to the FPGA.

We modify the placement procedure used by the Quartus software: instead of min-cut based recursive partitioning, we use bottom-up clustering.

Min-cut based recursive partitioning is described below:

It generates design partitions such that

1. Cut size between partitions is minimized: Each edge between the clusters is assigned a weight based on the criticality of the edge. The more weight an edge has, the more critical it is. So while making cuts, the tool tries to avoid cutting edges that have more weight.

2. Cluster size constraints are satisfied: Each LAB can hold a maximum of 160 lookup tables. So while making clusters, the tool makes sure that the cluster size limit is not violated.

3. Timing-driven partitioning minimizes the crossing of critical nets: As mentioned earlier, in order to meet the timing constraints, the critical nets are not cut.

We now look at the problems associated with the above procedure:

1. Minimizing cut size does not directly minimize design delay. This can be explained as follows:

[Figure: Case (1): a single critical edge with weight = 10. Case (2): non-critical edges whose weights sum to 11.]

Consider the above two cases, where the tool needs to choose which cut to make. In case (1), the weight of the single edge is 10, so the edge is considered very critical. In case (2), the sum of the weights of all the edges is 11. Because the sum of the weights in the second case is larger, the tool does not cut those edges; it cuts the edge in the first case instead. But the single edge in case (1) is more critical than the edges in case (2). So the timing information is lost here, and min-cut based partitioning performs badly.
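The choice described above can be reproduced with a minimal sketch. The names `cut_cost` and `choose_cut` are hypothetical, and splitting the total weight of 11 in case (2) into eleven unit-weight edges is an assumption made only for illustration; the report gives just the sum.

```python
# Edge weights for the two candidate cuts discussed above.
case1_edges = [10]        # one highly critical edge
case2_edges = [1] * 11    # assumed split: eleven non-critical edges, total weight 11


def cut_cost(edge_weights):
    """Weighted cut size: the sum of the weights of all edges that would be cut."""
    return sum(edge_weights)


def choose_cut(candidates):
    """Return the name of the candidate cut with the smallest weighted cut size."""
    return min(candidates, key=lambda name: cut_cost(candidates[name]))


candidates = {"case 1": case1_edges, "case 2": case2_edges}
chosen = choose_cut(candidates)
# The weighted cut size prefers the cut that severs the most critical edge.
print(chosen, cut_cost(candidates[chosen]), max(candidates[chosen]))
```

A pure sum-of-weights objective cannot tell one edge of weight 10 apart from many edges of weight 1, which is why the critical edge ends up being cut.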

2. Routing congestion: The main idea in min-cut based partitioning is to reduce as much as possible the number of wires crossing between two partitions. But this actually worsens the routing and increases routing congestion considerably. This is shown below:

Case (1)

Case (2)

During the initial stages of partitioning (case 1), the number of edges crossing between the clusters is small, so routing congestion is very low in those areas. But once the clusters become smaller (case 2), the number of edges crossing between them is very high, and so is the routing congestion in those areas. As a result of this uneven distribution of edges, routing demand is high in some areas and low in others. This can be overcome by distributing the wires evenly.

Motivation:

1. Bottom-up clustering groups closely connected components. As a result, routing congestion is reduced.

2. Placement with a wire length based cost function. As a result, design delay is reduced.
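As an illustration of the first point, here is a minimal sketch of greedy bottom-up clustering under a capacity limit. It is not the clustering algorithm used in the project (that is done by T-VPack); the function name `bottom_up_cluster`, the edge-list input format, and the merge rule (always merge the two most strongly connected clusters that still fit) are assumptions.

```python
from collections import defaultdict


def bottom_up_cluster(num_luts, edges, capacity=160):
    """Greedily merge the two most strongly connected clusters while the
    merged cluster still fits within `capacity` LUTs.

    edges: list of (lut_a, lut_b, weight) tuples.
    Returns a sorted list of clusters, each a sorted list of LUT indices.
    """
    cluster_of = list(range(num_luts))            # LUT index -> cluster id
    members = {i: {i} for i in range(num_luts)}   # cluster id -> set of LUTs

    while True:
        # Total connection weight between every pair of distinct clusters.
        conn = defaultdict(int)
        for a, b, w in edges:
            ca, cb = cluster_of[a], cluster_of[b]
            if ca != cb:
                conn[(min(ca, cb), max(ca, cb))] += w
        # Keep only the merges that respect the capacity limit.
        legal = {pair: w for pair, w in conn.items()
                 if len(members[pair[0]]) + len(members[pair[1]]) <= capacity}
        if not legal:
            break
        # Strongest connection first; ties go to the lowest cluster ids.
        ca, cb = max(legal, key=lambda p: (legal[p], -p[0], -p[1]))
        for lut in members[cb]:
            cluster_of[lut] = ca
        members[ca] |= members.pop(cb)

    return sorted(sorted(m) for m in members.values())


example_edges = [(0, 1, 5), (1, 2, 4), (3, 4, 6), (4, 5, 3), (2, 3, 1)]
print(bottom_up_cluster(6, example_edges, capacity=3))
```

With a capacity of 3, the weakly connected LUTs 2 and 3 end up in different clusters, so only the lightest edge crosses a cluster boundary.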

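The proposed flow:

[Figure: proposed flow: HDL design, synthesis, clustering, simulated annealing placement, routing, programming bit generation]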

Initially, the HDL design is taken, synthesized and then clustered. After clustering, simulated annealing placement is done. After that, routing is done and finally the programming bits are generated.
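The simulated annealing placement step can be sketched as follows. This is a generic swap-based annealer, not VPR's implementation: the half-perimeter cost, the cooling schedule, and all parameter values (`temp`, `cooling`, `moves_per_temp`, `seed`) are assumptions chosen for illustration.

```python
import math
import random


def wirelength(placement, nets):
    """Sum of half-perimeter bounding boxes over all nets."""
    total = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal(placement, nets, temp=5.0, cooling=0.9, moves_per_temp=50, seed=0):
    """Swap-based simulated annealing; returns the best placement seen and its cost."""
    rng = random.Random(seed)
    current = dict(placement)
    cost = wirelength(current, nets)
    best, best_cost = dict(current), cost
    blocks = sorted(current)
    while temp > 0.01:
        for _ in range(moves_per_temp):
            a, b = rng.sample(blocks, 2)
            # Propose swapping the locations of two blocks (keeps the placement legal).
            current[a], current[b] = current[b], current[a]
            new_cost = wirelength(current, nets)
            delta = new_cost - cost
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost                      # accept the move
                if cost < best_cost:
                    best, best_cost = dict(current), cost
            else:
                # Reject the move: undo the swap.
                current[a], current[b] = current[b], current[a]
        temp *= cooling                              # cool down
    return best, best_cost
```

Because every move is a swap, each location holds exactly one block throughout, so the result is always a legal placement. Worse moves are accepted with probability exp(-delta / temp), which lets the search escape local minima while the temperature is high.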

Framework: VPR (Versatile Place and Route): VPR is a placement and routing tool for array-based FPGAs, and T-VPack is a logic block packing (clustering) program. VPR was written to allow circuits to be placed and routed on a wide variety of FPGAs to facilitate comparisons of different architectures. It takes two input files: a netlist describing the circuit to be placed and routed, and a description of the FPGA architecture. Optionally, one can also input a placement file to VPR if one desires that an existing placement only be routed.

VPR targets island-style devices, so we need to adapt it before we can use it for hierarchical devices. The main parameters to be changed are:

1. Each LAB in a hierarchical device contains a maximum of 160 LUTs, whereas an island-style logic block contains only 4 LUTs. So, while clustering, we need to make sure that each cluster holds as many LUTs as possible for the hierarchical device.

2. The cost function needs to be changed. The cost calculation in island style follows the bounding box approximation method. The cost is linearly proportional to the distance between the two blocks.

The formula for calculating the cost in island style is:

$$\mathrm{Cost} = \sum_{n=1}^{N_{\mathrm{nets}}} q(n) \left[ \frac{bb_x(n)}{C_{avg}} + \frac{bb_y(n)}{C_{avg}} \right]$$

$bb_x(n)$ and $bb_y(n)$ are the horizontal and vertical spans of the bounding box of net $n$, respectively.

$C_{avg}$ is the average channel capacity.

$q(n)$ is the error (correction) factor for net $n$, which compensates for the bounding box underestimating the wiring of nets with more than three terminals.
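The island-style cost can be sketched in a few lines. The function names and the placement format are hypothetical, and `crossing_factor` is only a stand-in for q(n): it returns 1 for nets with up to three terminals and grows linearly after that, which is an assumption and not the table VPR actually uses.

```python
def crossing_factor(num_terminals):
    """Stand-in for q(n): 1.0 up to three terminals, then an assumed linear growth."""
    return 1.0 if num_terminals <= 3 else 1.0 + 0.04 * (num_terminals - 3)


def island_cost(nets, placement, c_avg=1.0, q=crossing_factor):
    """Bounding-box cost: sum over nets of q(n) * (bbx / C_avg + bby / C_avg)."""
    total = 0.0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        bbx = max(xs) - min(xs)   # horizontal span of the bounding box
        bby = max(ys) - min(ys)   # vertical span of the bounding box
        total += q(len(net)) * (bbx / c_avg + bby / c_avg)
    return total


example_placement = {"a": (0, 0), "b": (3, 0), "c": (3, 4)}
example_nets = [("a", "b"), ("a", "b", "c")]
print(island_cost(example_nets, example_placement, c_avg=2.0))
```

In this example the two-terminal net contributes 3/2 and the three-terminal net contributes 3/2 + 4/2, so the cost grows linearly with the distance between blocks, as stated above.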

Cost Calculation for Hierarchical Devices:

It consists of two types of routing:

1. Intra-cluster routing (the cost is zero for this).

2. Inter-cluster routing: This is of different types.

The chip can be divided into four quadrants, and the cost is calculated based on the locations of the two blocks that a net connects.

Quadrant: If the two blocks are within the same quadrant, a single wire is used.

Half of the chip: If the two blocks are within the same half of the chip, a double wire is used.

Same row: If the two blocks are in the same row but in different quadrants, a double wire is used.

Same column: If the two blocks are in the same column but in different quadrants, a double wire is used.

Neither: If the blocks lie in cross (diagonally opposite) quadrants, two double wires are needed.

All the above cases are considered while calculating the cost.
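The case analysis above can be sketched as a small classifier. The relative wire costs (1 for a single wire, 2 for a double wire), the function names, and the use of grid coordinates for LAB locations are assumptions; the report does not give the actual cost values used in the modified VPR code.

```python
SINGLE_WIRE_COST = 1   # assumed relative cost of a single wire
DOUBLE_WIRE_COST = 2   # assumed relative cost of a double wire


def quadrant(loc, width, height):
    """Return (column half, row half) of a LAB location on a width x height grid."""
    x, y = loc
    return (int(x >= width / 2), int(y >= height / 2))


def inter_lab_cost(a, b, width, height):
    """Classify the connection between LAB locations a and b.

    Returns (case name, cost) following the cases listed above.
    """
    if a == b:
        return ("intra-cluster", 0)
    qa, qb = quadrant(a, width, height), quadrant(b, width, height)
    if qa == qb:
        return ("quadrant", SINGLE_WIRE_COST)
    if a[1] == b[1]:
        return ("same row", DOUBLE_WIRE_COST)
    if a[0] == b[0]:
        return ("same column", DOUBLE_WIRE_COST)
    if qa[0] == qb[0] or qa[1] == qb[1]:
        return ("half of the chip", DOUBLE_WIRE_COST)
    return ("neither", 2 * DOUBLE_WIRE_COST)


print(inter_lab_cost((1, 1), (6, 6), width=8, height=8))
```

The row and column cases are tested before the half-of-the-chip case only so that they are reported separately; all three use one double wire, so the order does not change the cost.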

Results: All the above cases are implemented in the VPR source code to obtain the co-ordinates of the clusters.

The tables below show that we outperform Altera's Quartus tool in every aspect we measured: number of lookup tables, maximum pin-to-pin delay, and synthesis and fitting time.

Table: Comparing the number of lookup tables used. Columns: test bench; direct mapping to Quartus (lookup tables); cluster and then map to Quartus (lookup tables, two columns).

In the above table, the first column lists the test benches we used to compare the results. The second column shows the number of lookup tables used by Quartus. The third and fourth columns show the results we obtained with the bottom-up clustering method. We outperform the Quartus tool in every test case by a wide margin, which shows that we save a lot of area.

Table: Comparing the maximum pin-to-pin delay. Columns: test bench; delay (ns) for direct mapping to Quartus; delay (ns) for cluster and then map to Quartus (two columns).

The above table shows the maximum pin-to-pin delay in the circuit. The second column shows the Quartus output. The third and fourth columns show the results with our technique. We also outperform Quartus in maximum design delay.

Table: Comparing the synthesis and fitting delay. Columns: test bench; delay (minutes) for direct mapping to Quartus; delay (minutes) for cluster and then map to Quartus (two columns).

Here too, we outperform Quartus in the time taken for synthesis and fitting.

We can conclude that bottom-up clustering groups closely connected components together, which places the connected components near each other and, as a result, makes routing easier. The results also show that placing closely related nets together reduces the design delay and the area.

First, take the HDL code, give it directly to Quartus, and record the number of lookup tables and the timing information.

Second, take the HDL code and convert it into BLIF format, cluster it, do simulated annealing placement, convert the clustered BLIF back into HDL code, and feed that into Quartus.

Acknowledgements: Thanks to Srini Krishnamoorthy for project-related discussions.

References:

1. “Performance-Driven Multi-Level Clustering with application to Hierarchical FPGA Mapping”,Jason Cong

2. “Placement Algorithms for Datapath-Oriented FPGAs”, Poplavko

3. “Timing-Driven Placement for Hierarchical Programmable-Logic Devices”, Michael Hutton

4. “FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based FPGA Designs”, Jason Cong
