HARDWARE ACCELERATION OF WINDOWBIG-DIGIT (WBD) MULTIPLICATION FOR
EMBEDDED APPLICATIONS
LIM EE WAH
UNIVERSITI SAINS MALAYSIA
2015
HARDWARE ACCELERATION OF WINDOWBIG-DIGIT (WBD) MULTIPLICATION FOR
EMBEDDED APPLICATIONS
by
LIM EE WAH
A Dissertation submitted for partial fulfilment of therequirement for the degree of Master of Science
(Electronic Systems Design Engineering)
August 2015
ACKNOWLEDGEMENTS
I would like to acknowledge all those who have made this dissertation a success. With great
pleasure, I would like to extend my sincere thanks and gratitude to my dissertation adviser Prof.
Dr. Rizal for his patience, continuing support and guidance throughout the dissertation.
I extend my heartfelt thanks and gratitude to my fellow Intel colleague Suhaini Annuar and
Dr. Vishnu Paramasivam, who gave their valuable time and guidance in making this thesis a
success.
I would also like to thanks Shahram Jahani and Hani Al-Mimi on their contribution on
inventing the ZOT-Binary and window Big Digits (wBD) algorithm.
ii
TABLE OF CONTENTS
Acknowledgements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Abstrak . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
CHAPTER 1 – INTRODUCTION
1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
CHAPTER 2 – LITERATURE REVIEW
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Multiplication Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.1 Classical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2.2 Karatsuba . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.3 Toom-Cook Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Numbering Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 Booth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Non-Adjacent Form (NAF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.3 ZOT-Binary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.4 Window Big Digits (wBD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4 Big-Digits Multiplication with Precomputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.1 Multiplication of ZOT-Binary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.2 Multiplication of wBD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
iii
2.5 Hardware Acceleration for Big Number Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6 ARM Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6.1 ARM Instruction Set Architecture (ISA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6.2 Registers and Register Banking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.6.3 CPU Operating Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.7 Hardware Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7.1 Hardware Description Language (HDL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.7.2 Field Programmable Gate Array (FPGA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.8 Benchmarking Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.8.1 ARM based System on Chip (SoC) System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.8.2 Performance Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
CHAPTER 3 – DESIGN METHODOLOGY
3.1 Development Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Software Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2.1 Control and Data Flow Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.2 Associated Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2(a) Big Digit (BD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2(b) BD Partial Product (PP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.2(c) Multiplication Lookup Table (MLUT) . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.3 Compilation and Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Top Level Architecture Diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 Parameterization and Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.3.3 Main Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.3(a) Avalon Slave Interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3.3(b) Binary (BIN) to Big Digits (BD) Translator . . . . . . . . . . . . . . . . . . . . 42
iv
3.3.3(c) BD Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.3(d) Multiplication Lookup Table (MLUT) . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.3.3(e) Partial Product (PP) Accumulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.4 SoC Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
CHAPTER 4 – RESULTS AND DISCUSSION
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2 Simulation Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Hardware Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.4 Benchmark Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
CHAPTER 5 – CONCLUSION AND FUTURE WORKS
5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
APPENDIX A – SOFTWARE IMPLEMENTATION OF CLASSICAL BIGINTEGER MULTIPLICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
APPENDIX B – SOFTWARE IMPLEMENTATION OF WBD MULTIPLICATION . . 69
APPENDIX C – HARDWARE IMPLEMENTATION OF WBD MULTIPLICATION . 73
C.1 TOP_IF.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
C.2 TOP.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
C.3 TRANSLATOR.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
C.3.1 BIN.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
C.3.2 BIN2BD.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
C.3.3 BD.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
v
C.4 MULTIPLIER.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
C.4.1 MLUT.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
C.4.2 PP_FIFO.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
C.5 ACCUMULATOR.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
C.5.1 ASYNC_LATCH.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
APPENDIX D – TEST SCRIPT FOR HARDWARE IMPLEMENTATION OF WBDMULTIPLICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
vi
LIST OF TABLES
Page
Table 2.1 Steps of computing Non-Adjacent-Form for decimal 1510 14
Table 2.2 Big-One-Big-One MLUT 18
Table 2.3 Big-Two-Big-Two MLUT 19
Table 2.4 Big-One-Big-Two MLUT 19
Table 3.1 Register Map of the Proposed Hardware Accelerator 51
Table 3.2 Example of 48-bits Operand Loading 52
Table 4.1 Example of 128-bits Operand Loading 54
Table 4.2 Fitter Result 56
Table 4.3 Utilization Breakdown 57
Table 4.4 Timing Analysis Summary 58
Table 4.5 Speed Comparison of Classical versus wBD Multiplication 59
Table 4.6 Speed Comparison of Software versus Hardware Accelerated wBDMultiplications 59
Table 4.7 Speed Comparison of Classical versus Hardware Accelerated wBDMultiplication 60
vii
LIST OF FIGURES
Page
Figure 2.1 Integer Multiplication Algorithm (Operand Scanning form) 4
Figure 2.2 Graphical Representation of Classical Multiplication Algorithm 4
Figure 2.3 Recursive Karatsuba Algorithm, C = KA(A,B) (Kodali et al., 2014) 5
Figure 2.4 Final multiplication based on values in Toom3 equations 9
Figure 2.5 Algorithm of Computing the NAF of a positive integer (Hankersonet al., 2004) 13
Figure 2.6 Algorithm to Compute the wBD of a positive integer 17
Figure 2.7 Systolic Architecture for Montgomery Algorithm (Perin et al., 2011) 21
Figure 2.8 Basic Processing Element Architecture for Systolic MontgomeryMultiplier (Perin et al., 2011) 21
Figure 2.9 Architecture of Recursive 128-bit Karatsuba-Ofman’s multiplication(RKM) (Wajih et al., 2008) 22
Figure 2.10 Registers Map for Each CPU Mode in ARM Architecture. 23
Figure 2.11 Illustration of a Logic Element in FPGA 25
Figure 2.12 Architecture of Benchmark Platform 27
Figure 3.1 Development Flow 30
Figure 3.2 Operation Flow for Software Implementation 31
Figure 3.3 Typedef of major data structure of BD symbols. 33
Figure 3.4 Code Snippet showing assingment of two BD symbol to C datastructures. 33
Figure 3.5 Typedef of PP data structure. 33
Figure 3.6 Representation of BD and PP data structure in softwareimplementation. 34
Figure 3.7 MLUT code in software implementation. 35
Figure 3.8 Makefile for test programs compilation. 36
Figure 3.9 Top Level Block Diagram of Proposed Hardware Accelerator 37
Figure 3.10 Interface between AXI Slave and internal RAM 41
viii
Figure 3.11 Port listing for AXI interface 41
Figure 3.12 Read and Write Waveform of Avalon Slave Interface. 41
Figure 3.13 Binary to BD Block Diagram. 42
Figure 3.14 Binary to BD Translation State Machines 42
Figure 3.15 Binary to BD Translation State Machines - Constrained by WindowSize 43
Figure 3.16 BD Multiplier Block Diagram. 44
Figure 3.17 BD Multiplier Data Path Design. 45
Figure 3.18 BD Multiplier Controller State. 46
Figure 3.19 Block Diagram of MLUT implementation. 47
Figure 3.20 Complete Verilog code for MLUT ROM 48
Figure 3.21 Partial Product Accumulator Data Path Design 49
Figure 3.22 Data Flow When PP_o f f set ∈ [0,15] 50
Figure 3.23 Data Flow When PP_o f f set ∈ [31,16] 51
Figure 4.1 Waveform During Input Operands Loading 55
Figure 4.2 Waveform While Reading Multiplier’s Result 55
Figure 4.3 FPGA Floorplan Utilization Map 58
Figure 4.4 Benchmark results 60
ix
LIST OF ABBREVIATIONS
FPGA Field Programmable Gate Array
SoC System on Chip
AXI Advanced eXtensible Interface
ZOT Zero-One-Two
ECC Elliptic Curve Crypto
BD Big Digit
wBD Window Big Digit
NAF Non Adjacent Form
MLUT Multiplication Look-up Table
PP Partial Product
ALM Adaptive Logic Modules
x
PERKAKASAN PECUTAN PENDARABAN DIGIT BESARTETINGKAT (WBD) UNTUK SISTEM TERBENAM
ABSTRAK
Satu algoritma pendaraban baru yang bernama Digit Besar tetingkap (wBD) telah dica-
dangkan kebelakangan ini. Algoritma ini berasaskan sistem penomboran Digit Besar (BD)
dan bersasarkan nombor besar yang beribu-ribu bit. Berat Hamming bagi sistem pernombor-
an wBD hanya n4.6 berbanding dengan n
2 bagi sistem binari, n3 bagi nombor tanpa bersebelahan
(NAF) dan nw+1 bagi NAF tetingkap (wNAF). Berat Hamming yang rendah bermaksud bilangan
daraban separa yang rendah, yang mana akan mengurangkan bilangan langkah yang diperlukan
dalam sesuate operasi pendaraban. Oleh itu, sistem pernomboran wBD dapat mempercepatk-
an proses pendaraban secara keseluruhan. Algoritma wBD telah dianalisis dan dibandingkan
dengan kaedah pendaraban yang lain di tahap algoritma. Namun, belum ada karya yang meng-
enai pelaksanaan in peringkatan perkakasan diterbit. Untuk membolehkan penggunaan sistem
penomboran wBD secara meluas dalam sistem terbenam, sesuatu rekabentuk sistem pengecut-
an pendarab wBD yang optimum telah diperkenalkan dalam kerja ini. Dalam kajian ini, satu
sistem pengecutan pendarab wBD telah direka dengan menggunakan Verilog dan diprototaipk-
an dalam platform FPGA. Sistem pengecutan ini dilengkapi dengan port AXI dan disepadukan
ke dalam sistem SoC yang berasaskan pemproses ARM bagi tujuan perbandingan. Pendaraban
256-bit dengan menggunakan sistem pengecut ini didapati adalah 340 kali ganda lebih cepat
daripada pendaraban klasik yang dijalankan melalui teknik perisian. Ini menunjukkan bahawa
sistem pendaraban wBD boleh dilaksanakan dengan optimumnya dalam peringkat perkakasan
dan dapat mempercepatkan operasi pendaraban yang cuma berdasarkan perisian.
xi
HARDWARE ACCELERATION OF WINDOW BIG-DIGIT(WBD) MULTIPLICATION FOR EMBEDDED APPLICATIONS
ABSTRACT
Window Big-Digit (wBD) is a recently proposed multiplication algorithm. This algorithm
relies on Big-Digit (BD) numbering system and is targeting big integer with thousands bits.
The hamming weight of wBD representation is only n4.6 compared to n
2 for binary, n3 for Non-
adjacent form (NAF) and to nw+1 for window-NAF (wNAF). Low hamming weight of multi-
plicand proportionately reduces the number of immediate partial products, which in turn will
reduce the number of steps required in a multiplication function. Hence, wBD number system
could be an excellent candidate to speed up overall multiplication process. The wBD algo-
rithm has been analyzed and benchmarked against other multiplication methods in algorithmic
level. However, there is no published works regarding hardware implementation of the al-
gorithm yet. In order to enable boarder adoption of the wBD numbering system in resource
constrained embedded systems, an optimized hardware accelerator design is introduced in this
work. In this study, the hardware implementation of wBD multiplier is designed using Ver-
ilog and prototyped in FPGA platform. The accelerator is equipped with AXI interface and
integrated into an ARM-based SoC system for benchmarking purpose. The test programs are
The hardware-accelerated 256-bits multiplication is found that to be 340 fold faster than pure
software implementation of classical multiplication. This shows that wBD algorithm can be
optimally implemented in hardware and demonstrates excellent speed gain over pure software
implementation.
xii
CHAPTER 1
INTRODUCTION
Window big-digit (wBD) is a new window-based translation method that relied on big-digit
(BD) recoding method, which is originally based on ZOT-binary number system. The ham-
ming weight of ZOT-binary numbers is lowest among some popular recoding methods, such
as Non-adjacent Form (NAF) and window NAF (wNAF) (Jahani and Samsudin, 2013b). Low
hamming weight of multiplicand will proportionately reduces the number of immediate partial
products, which in turn will reduce the number of steps required in the multiplication opera-
tion. Thus, wBD number system could be an excellent candidate to speed up systems with high
occurrences of large integer multiplication, such as Elliptic Curve Crypto (ECC) system.
1.1 Problem Statement
Hani Al-Mimi et al. (Mini et al., 2013) has implemented and tested the wBD multiplication
algorithm with MIRACL library, which is a C/C++ based library . But, the manipulation
on integers is not native to most of the existing computer architecture which rely on binary
numbering structure of 8, 16, 32 or 64 bits. Thus, the pure software implementation may
not be a viable solution especially for embedded systems that has limited resource and power
constrained. Besides the register size, single threaded processors tend to do the computationally
intensive processing in sequential nature and unable to take advantage of the parallelism and
optimization that exist in some operations. Therefore, in order to enable boarder adoption of
the wBD numbering system in resource constrained embedded systems, an optimized hardware
accelerator design will be introduced.
1
1.2 Research Objective
The main objective of this research is to build an efficient hardware accelerator prototype of
wBD-based multiplication. Followings are the sub-objectives:
1. The architectural implementation is targeting efficiency as well as provides significant
improvement over equivalent pure software implementation.
2. Prototype of the hardware accelerator will be emulated in FPGA development board.
3. The speed improvement of the hardware accelerator is to be benchmarked against the
pure software implementation of classical multiplication under a same platform.
1.3 Thesis Overview
This thesis is organized into five main chapters. Chapter 2 discusses the background and hard-
ware implementation of several multiplication algorithms reported in literature. In addition,
topics related to hardware prototyping such as ARM architecture and Hardware Description
Language (HDL) are also discussed. In Chapter 3, the development methodology of software
and hardware implementation are elaborated. Detailed block diagrams and data flows are also
discussed in the chapter. Result and discussion are listed down in Chapter 4. Finally, the
conclusion and future works are discussed in Chapter 5.
2
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
Multiplication between large numbers is crucial in wide range of applications, such as cryptog-
raphy and scientific calculations. Thus, there have been many research efforts in enhancing the
multiplication speed (Khan et al., 2015). Generally, there are two ways to speed up the multipli-
cation process, which is using new algorithm or by decoding the operands of the multiplication
into different numbering system.
2.2 Multiplication Algorithms
Among the algorithms for large number multiplication are Classical, Karatsuba (Karatsuba and
Ofman, 1963), Toom-Cook (Toom, 1963) and Schönhage-Strassen (Schönhage and Strassen,
1971). The first two algorithms are especially popular among others in cryptography applica-
tions.
2.2.1 Classical
Classical (also known as Schoolbook) algorithm is a fundamental integer multiplication routine
that use the basic operand scanning methods. Figure 2.1 describes the operand scanning form
of Classical algorithm (Hankerson et al., 2004).
3
Input: Integers A,B ∈ [0, p−1]Output: C = A×BSet C[i]← 0 for 0≤ i≤ t−1for i from 0→ t−1 do(UV )←C[i+ j]+A[i].B[ j]+UC[i+ j]←V
end forreturn C
Figure 2.1: Integer Multiplication Algorithm (Operand Scanning form)
In classical algorithm, the operands will be scanned from left to right, and the partial prod-
uct will be accumulated to generate the final result. Figure 2.2 illustrates a simple classical
multiplication of A = (a3a2a1a0)base (multiplicand) and B = (b2b1b0)base (multiplier). Both
numbers are represented in an arbitrary base.
a3 a2 a1 a0
× b2 b1 b0
a3b0 a3b0 a3b0 a3b0
+ a3b1 a3b1 a3b1 a3b1
+ a3b2 a2b2 a1b2 a0b2
s6 s5 s4 s3 s2 s1 s0
Figure 2.2: Graphical Representation of Classical Multiplication Algorithm
Most simple digital systems implement the multiplication function according to classical ap-
proach and interpreting the operands using radix-2 representation, e.g. binary. The numbering
system is straightforward in digital implementation because multiplication with a single bit of
binary symbol (0 or 1) can be reduced to zero or replaced with the shifted version of operand.
4
2.2.2 Karatsuba
The Karatsuba Algorithm was published by Karatsuba and Ofman in 1962 (Karatsuba and
Ofman, 1963). It is considered as one of the fastest ways to multiply large numbers. The al-
gorithm adopts the binary splitting scheme over its divide and conquer paradigm. This method
requires only four simple additions and three multiplications for each iterations.
To apply the algorithm, both operands are first split into two n/2 digits numbers. In next
iteration, the halves are then being split again in half. Since every iteration chop the number
by half, the algorithm will be terminated after t = log2n steps (Kodali et al., 2014). Figure
2.3 shows the recursive Karatsuba algorithm with assumption that both operands have even
number of bits.
Input: Positive integers A = (an, ...,a0)r and B = (bn, ...,b0)r having n+1 base rdigitsOutput: C = A×BIf n = 1 then return C = A×BSplit A,B into two equal parts:
A = AL× rn/2 +AR and B = BL× rn/2 +BR
Compute the following:d1 = KA(AL,BL);d0 = KA(AR,BR) and d0,1 = KA(AR +AL,BR +BL
return C = d1× rn +(d0,1−d0−d1)× rn/2 +d0
Figure 2.3: Recursive Karatsuba Algorithm, C = KA(A,B) (Kodali et al., 2014)
For example, let’s examine a case where n= 2 and base= 10. In this example, the operands
could be split into halves, that is, into their decimal digits as in (2.1).
b
a
= r×10+ s
= p×10+q
(2.1)
5
Let’s say, we have a = 47 and b = 93 as illustrated in (2.2). Then,
r
p
= 9,s = 3
= 4,q = 7
(2.2)
The product a×b can now be written in terms of the digits as in (2.3).
a×b
= (p× r)×102 +(p× s+q× r)×10+q× s
= (p×10+q)× (r×10+ s)
(2.3)
Continuing with the example of a = 47 and b = 93 as in (2.4).
47×93
= 4371
= (4×9)×102 +(4×3+7×9)×10+(7×3)
(2.4)
By writing the product a× b of the two-digit numbers a and b as above, it can be shown
that the result can be computed using four multiplications of one-digit numbers, followed by
additions. This is how the divide-and-conquer technique being adopted in Recursive Karatsuba
algorithm. Besides this, the Karatsuba algorithm reduces the operation a×b from four multi-
plication steps each iteration to only three steps. These three multiplication steps are used to
compute the following three auxiliary products as in (2.5):
w
v
u
= q× s
= (q− p)× (s− r)
= p× r
(2.5)
The above three auxiliary variables can be related back to the a×b as been described in (2.6).
6
u+w− v
= p× s+q× r
= p× r+q× s− (q− p)× (s− r)
(2.6)
So, equation (2.7) can then be derived by representing equations (2.6) back into (2.3).
a×b = u×102 +(u+w− v)×10+w (2.7)
Generally, Karatsuba multiplication algorithm has a better complexity compared to Classi-
cal multiplication algorithm (Fang and Li, 2007). However, overhead of Karatsuba multiplica-
tion algorithm in lower range numbers (less than 255 digits) has caused designer to combine the
Classical and Karatsuba to achieve better overall algorithm efficiency (Jahani and Samsudin,
2013a). However, for software implementation, classical multiplication algorithm generally
gives faster result than Karatsuba in numbers that is less than 255 digits due to long and ineffi-
cient carry chains on the CPU pipeline. (Fang and Li, 2007).
2.2.3 Toom-Cook Algorithm
Similarly, Toom-Cook algorithm also used the divide-and-conquer paradigm. Same as Karat-
suba multiplication, Toom-Cook algorithm also operates by breaking the input numbers into
chunks of smaller size. The larger products are then expressed in terms of equations of smaller
pieces. Toom-Cook applied recursively on these smaller pieces as well. There is a lot of sim-
ilarity between this method with the Fast Fourier Transform (FFT), as both are working on
the same principles of polynomial multiplication (Crandall and Pomerance, 2006) and (Zuras,
1994).
In Toom-Cook algorithm, the input numbers are split into chunks of given size, and then
7
the inputs will be written as a polynomial using the chunk size as radix. The polynomials will
be evaluated at a set of points and its values will be multiplied together at those points. With
the products, the product polynomial can then be determined. Finally, substitution of the radix
returns the final answer.
Let’s take an example of multiplying 123,456 by 987,654 as in (2.8). In this case, Toom-3
algorithm can be applied. First, each numbers will be split into 2 chunks with 3 digits in length.
Then, choose a set of points, for example {−1,0,1}
A×B = 123,456×987,654 (2.8)
By splitting A and B and writing the chunks into polynomials form, we will have:
B(x)
A(x)
= 987x+654
= 123x+456
(2.9)
Then, substitute the x with a set of points to get the values of A(x),B(x)andA(x)×B(x).
x =−1→ A(−1) = 333, B(−1) =−333, A(−1)×B(x) =−110,889
x = 0→ A(0) = 456, B(0) = 654, A(0)×B(x) = 298,224
x = 1→ A(1) = 579, B(1) = 1641, A(1)×B(x) = 950,139
In next step, the polynomial P(x) = A(x)B(x) can be solved based on the values at the three
points. The solution is of the form as in (2.10):
P(x) = p2x2 + p1x+ p0 (2.10)
The values of x can be substituted into the (2.10) with the points {−1,0,1} to get a set of
simultaneous equations that are linear in the unknowns pi as in (2.11):
8
P(1)
P(0)
P(−1)
= 1∗ p2 +1∗ p1 + p0 = 950,139
= 0∗ p2 +0∗ p0 + p0 = 298,224
= 1∗ p2−1∗ p1 + p0 =−110,889
(2.11)
By solving following matrix as in (2.12), the values for the unknowns pi as shown in (2.13)
can be obtained. 1 −1 1
0 0 1
1 1 1
p2
p1
p0
=
−110,889
298,224
950,139
(2.12)
p0
p1
p2
= 298,224
= 520,514
= 121,401
(2.13)
With values in equation (2.13), the final multiplication result can be calculated as shown in
Figure 2.4.
121,401,000,000
+ 520,514,000
+ 298,224
121,981,812,224
Figure 2.4: Final multiplication based on values in Toom3 equations
Steps such as evaluating the values and solving the equation seems difficult, but it turns out
9
to be relatively straight forward for a computer to execute. Toom-Cook algorithm only requires
some simple operations such as addition, subtraction, shifts, division by small constants and
multiplication with one-third of the original inputs.
2.3 Numbering Systems
Besides optimizing the multiplication algorithm, the other solution to improve the arithmetic
function is by representing the operands using other numbering system. In this regards, several
numbering systems have been introduced.
2.3.1 Booth
Booth proposed an algorithm to speed up the Classical multiplication computation in 1951
(Booth, 1951). The implementation of Booth algorithm requires repeatedly addition of either
A or S to the partial product P, which the A and S can be calculated from steps mentioned. The
resultant P will then be rightward shifted. Also, note that the addition of A or S with P can be
done by using ordinary unsigned binary addition.
As described are the steps for Booth algorithm (Booth, 1951). Let m be the multiplicand
and r be the multiplier; and x and y represent the number of bits in m and r respectively.
1. First the values of A, S and initial value of P have to be determined. These numbers
should have a length equal to (x+ y+1) of bits.
• A: Pad the most significant (leftmost) bits with the value of m. Fill the remaining
(y+1) bits with zeros.
• S: Pad the most significant bits with the value of âLŠm in two’s complement nota-
tion. Fill the remaining (y+1) bits with zeros.
10
• P: Pad the most significant x bits with zeros. Then append the value of r to the
right of this. Pad the least significant (rightmost) bit with a zero.
2. Then, the two least significant (rightmost) bits of P are to be determined.
• If they equal to 01, then find the value of P+A and ignore any overflow.
• If they equal to 10, then find the value of P+S and ignore any overflow.
• If they equal to 00, do nothing and reuse the same P directly in the next iteration.
• If they equal to 11, do nothing and reuse the same P directly in the next iteration.
3. Arithmetically shift the value calculated in the 2nd step by a single bit to the right. Let
set P to this new value.
4. Repeat steps 2 and 3 for y times.
5. Finally, drop the least significant bit from P. This is the product of m and r.
As an example, to find the multiplication result of 3 and -4, we will have following variables
to start with: m = 3, r = -4, x = 4 and y = 4 as in (2.14).
m= 00112,−m = 11012,r = 11002 (2.14)
Then, the A, S and P can be calculated by applying the algorithm listed in step 1 as in 2.15,
P
S
A
= 0000110002
= 1101000002
= 0011000002
(2.15)
Now, start the Booth multiplication by performing following loop for four iterations:
1. P = 0000110002. The last two bits are 002.
11