Download pdf - HARDWARE ACCELERATION OF WINDOW BIG-DIGIT (WBD ...eprints.usm.my/40703/1/Hardware_Acceleration_of_Window_Big-Digit_(Wbd)_Multiplication...Dr. Rizal for his patience, continuing support

HARDWARE ACCELERATION OF WINDOWBIG-DIGIT (WBD) MULTIPLICATION FOR

EMBEDDED APPLICATIONS

LIM EE WAH

UNIVERSITI SAINS MALAYSIA

2015

HARDWARE ACCELERATION OF WINDOWBIG-DIGIT (WBD) MULTIPLICATION FOR

EMBEDDED APPLICATIONS

by

LIM EE WAH

A Dissertation submitted for partial fulfilment of therequirement for the degree of Master of Science

(Electronic Systems Design Engineering)

August 2015

ACKNOWLEDGEMENTS

I would like to acknowledge all those who have made this dissertation a success. With great

pleasure, I would like to extend my sincere thanks and gratitude to my dissertation adviser Prof.

Dr. Rizal for his patience, continuing support and guidance throughout the dissertation.

I extend my heartfelt thanks and gratitude to my fellow Intel colleague Suhaini Annuar and

Dr. Vishnu Paramasivam, who gave their valuable time and guidance in making this thesis a

success.

I would also like to thanks Shahram Jahani and Hani Al-Mimi on their contribution on

inventing the ZOT-Binary and window Big Digits (wBD) algorithm.

ii

TABLE OF CONTENTS

Acknowledgements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

List of Abbreviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

Abstrak . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii

CHAPTER 1 – INTRODUCTION

1.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Research Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

CHAPTER 2 – LITERATURE REVIEW

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2 Multiplication Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2.1 Classical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.2.2 Karatsuba . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2.3 Toom-Cook Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.3 Numbering Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.1 Booth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.2 Non-Adjacent Form (NAF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3.3 ZOT-Binary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.4 Window Big Digits (wBD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4 Big-Digits Multiplication with Precomputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4.1 Multiplication of ZOT-Binary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.4.2 Multiplication of wBD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

iii

2.5 Hardware Acceleration for Big Number Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.6 ARM Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.6.1 ARM Instruction Set Architecture (ISA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.6.2 Registers and Register Banking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.6.3 CPU Operating Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.7 Hardware Modeling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.7.1 Hardware Description Language (HDL) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.7.2 Field Programmable Gate Array (FPGA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.8 Benchmarking Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.8.1 ARM based System on Chip (SoC) System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.8.2 Performance Measurement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

CHAPTER 3 – DESIGN METHODOLOGY

3.1 Development Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2 Software Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2.1 Control and Data Flow Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.2.2 Associated Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2.2(a) Big Digit (BD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2.2(b) BD Partial Product (PP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.2.2(c) Multiplication Lookup Table (MLUT) . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.2.3 Compilation and Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.3 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3.1 Top Level Architecture Diagram. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.3.2 Parameterization and Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.3.3 Main Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.3.3(a) Avalon Slave Interface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.3.3(b) Binary (BIN) to Big Digits (BD) Translator . . . . . . . . . . . . . . . . . . . . 42

iv

3.3.3(c) BD Multiplier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.3.3(d) Multiplication Lookup Table (MLUT) . . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.3.3(e) Partial Product (PP) Accumulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3.3.4 SoC Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

CHAPTER 4 – RESULTS AND DISCUSSION

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.2 Simulation Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.3 Hardware Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.4 Benchmark Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

CHAPTER 5 – CONCLUSION AND FUTURE WORKS

5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.2 Future Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

APPENDICES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

APPENDIX A – SOFTWARE IMPLEMENTATION OF CLASSICAL BIGINTEGER MULTIPLICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

APPENDIX B – SOFTWARE IMPLEMENTATION OF WBD MULTIPLICATION . . 69

APPENDIX C – HARDWARE IMPLEMENTATION OF WBD MULTIPLICATION . 73

C.1 TOP_IF.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

C.2 TOP.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

C.3 TRANSLATOR.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

C.3.1 BIN.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

C.3.2 BIN2BD.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

C.3.3 BD.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

v

C.4 MULTIPLIER.v. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

C.4.1 MLUT.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

C.4.2 PP_FIFO.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

C.5 ACCUMULATOR.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

C.5.1 ASYNC_LATCH.v . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

APPENDIX D – TEST SCRIPT FOR HARDWARE IMPLEMENTATION OF WBDMULTIPLICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

vi

LIST OF TABLES

Page

Table 2.1 Steps of computing Non-Adjacent-Form for decimal 1510 14

Table 2.2 Big-One-Big-One MLUT 18

Table 2.3 Big-Two-Big-Two MLUT 19

Table 2.4 Big-One-Big-Two MLUT 19

Table 3.1 Register Map of the Proposed Hardware Accelerator 51

Table 3.2 Example of 48-bits Operand Loading 52

Table 4.1 Example of 128-bits Operand Loading 54

Table 4.2 Fitter Result 56

Table 4.3 Utilization Breakdown 57

Table 4.4 Timing Analysis Summary 58

Table 4.5 Speed Comparison of Classical versus wBD Multiplication 59

Table 4.6 Speed Comparison of Software versus Hardware Accelerated wBDMultiplications 59

Table 4.7 Speed Comparison of Classical versus Hardware Accelerated wBDMultiplication 60

vii

LIST OF FIGURES

Page

Figure 2.1 Integer Multiplication Algorithm (Operand Scanning form) 4

Figure 2.2 Graphical Representation of Classical Multiplication Algorithm 4

Figure 2.3 Recursive Karatsuba Algorithm, C = KA(A,B) (Kodali et al., 2014) 5

Figure 2.4 Final multiplication based on values in Toom3 equations 9

Figure 2.5 Algorithm of Computing the NAF of a positive integer (Hankersonet al., 2004) 13

Figure 2.6 Algorithm to Compute the wBD of a positive integer 17

Figure 2.7 Systolic Architecture for Montgomery Algorithm (Perin et al., 2011) 21

Figure 2.8 Basic Processing Element Architecture for Systolic MontgomeryMultiplier (Perin et al., 2011) 21

Figure 2.9 Architecture of Recursive 128-bit Karatsuba-Ofman’s multiplication(RKM) (Wajih et al., 2008) 22

Figure 2.10 Registers Map for Each CPU Mode in ARM Architecture. 23

Figure 2.11 Illustration of a Logic Element in FPGA 25

Figure 2.12 Architecture of Benchmark Platform 27

Figure 3.1 Development Flow 30

Figure 3.2 Operation Flow for Software Implementation 31

Figure 3.3 Typedef of major data structure of BD symbols. 33

Figure 3.4 Code Snippet showing assingment of two BD symbol to C datastructures. 33

Figure 3.5 Typedef of PP data structure. 33

Figure 3.6 Representation of BD and PP data structure in softwareimplementation. 34

Figure 3.7 MLUT code in software implementation. 35

Figure 3.8 Makefile for test programs compilation. 36

Figure 3.9 Top Level Block Diagram of Proposed Hardware Accelerator 37

Figure 3.10 Interface between AXI Slave and internal RAM 41

viii

Figure 3.11 Port listing for AXI interface 41

Figure 3.12 Read and Write Waveform of Avalon Slave Interface. 41

Figure 3.13 Binary to BD Block Diagram. 42

Figure 3.14 Binary to BD Translation State Machines 42

Figure 3.15 Binary to BD Translation State Machines - Constrained by WindowSize 43

Figure 3.16 BD Multiplier Block Diagram. 44

Figure 3.17 BD Multiplier Data Path Design. 45

Figure 3.18 BD Multiplier Controller State. 46

Figure 3.19 Block Diagram of MLUT implementation. 47

Figure 3.20 Complete Verilog code for MLUT ROM 48

Figure 3.21 Partial Product Accumulator Data Path Design 49

Figure 3.22 Data Flow When PP_o f f set ∈ [0,15] 50

Figure 3.23 Data Flow When PP_o f f set ∈ [31,16] 51

Figure 4.1 Waveform During Input Operands Loading 55

Figure 4.2 Waveform While Reading Multiplier’s Result 55

Figure 4.3 FPGA Floorplan Utilization Map 58

Figure 4.4 Benchmark results 60

ix

LIST OF ABBREVIATIONS

FPGA Field Programmable Gate Array

SoC System on Chip

AXI Advanced eXtensible Interface

ZOT Zero-One-Two

ECC Elliptic Curve Crypto

BD Big Digit

wBD Window Big Digit

NAF Non Adjacent Form

MLUT Multiplication Look-up Table

PP Partial Product

ALM Adaptive Logic Modules

x

PERKAKASAN PECUTAN PENDARABAN DIGIT BESARTETINGKAT (WBD) UNTUK SISTEM TERBENAM

ABSTRAK

Satu algoritma pendaraban baru yang bernama Digit Besar tetingkap (wBD) telah dica-

dangkan kebelakangan ini. Algoritma ini berasaskan sistem penomboran Digit Besar (BD)

dan bersasarkan nombor besar yang beribu-ribu bit. Berat Hamming bagi sistem pernombor-

an wBD hanya n4.6 berbanding dengan n

2 bagi sistem binari, n3 bagi nombor tanpa bersebelahan

(NAF) dan nw+1 bagi NAF tetingkap (wNAF). Berat Hamming yang rendah bermaksud bilangan

daraban separa yang rendah, yang mana akan mengurangkan bilangan langkah yang diperlukan

dalam sesuate operasi pendaraban. Oleh itu, sistem pernomboran wBD dapat mempercepatk-

an proses pendaraban secara keseluruhan. Algoritma wBD telah dianalisis dan dibandingkan

dengan kaedah pendaraban yang lain di tahap algoritma. Namun, belum ada karya yang meng-

enai pelaksanaan in peringkatan perkakasan diterbit. Untuk membolehkan penggunaan sistem

penomboran wBD secara meluas dalam sistem terbenam, sesuatu rekabentuk sistem pengecut-

an pendarab wBD yang optimum telah diperkenalkan dalam kerja ini. Dalam kajian ini, satu

sistem pengecutan pendarab wBD telah direka dengan menggunakan Verilog dan diprototaipk-

an dalam platform FPGA. Sistem pengecutan ini dilengkapi dengan port AXI dan disepadukan

ke dalam sistem SoC yang berasaskan pemproses ARM bagi tujuan perbandingan. Pendaraban

256-bit dengan menggunakan sistem pengecut ini didapati adalah 340 kali ganda lebih cepat

daripada pendaraban klasik yang dijalankan melalui teknik perisian. Ini menunjukkan bahawa

sistem pendaraban wBD boleh dilaksanakan dengan optimumnya dalam peringkat perkakasan

dan dapat mempercepatkan operasi pendaraban yang cuma berdasarkan perisian.

xi

HARDWARE ACCELERATION OF WINDOW BIG-DIGIT(WBD) MULTIPLICATION FOR EMBEDDED APPLICATIONS

ABSTRACT

Window Big-Digit (wBD) is a recently proposed multiplication algorithm. This algorithm

relies on Big-Digit (BD) numbering system and is targeting big integer with thousands bits.

The hamming weight of wBD representation is only n4.6 compared to n

2 for binary, n3 for Non-

adjacent form (NAF) and to nw+1 for window-NAF (wNAF). Low hamming weight of multi-

plicand proportionately reduces the number of immediate partial products, which in turn will

reduce the number of steps required in a multiplication function. Hence, wBD number system

could be an excellent candidate to speed up overall multiplication process. The wBD algo-

rithm has been analyzed and benchmarked against other multiplication methods in algorithmic

level. However, there is no published works regarding hardware implementation of the al-

gorithm yet. In order to enable boarder adoption of the wBD numbering system in resource

constrained embedded systems, an optimized hardware accelerator design is introduced in this

work. In this study, the hardware implementation of wBD multiplier is designed using Ver-

ilog and prototyped in FPGA platform. The accelerator is equipped with AXI interface and

integrated into an ARM-based SoC system for benchmarking purpose. The test programs are

The hardware-accelerated 256-bits multiplication is found that to be 340 fold faster than pure

software implementation of classical multiplication. This shows that wBD algorithm can be

optimally implemented in hardware and demonstrates excellent speed gain over pure software

implementation.

xii

CHAPTER 1

INTRODUCTION

Window big-digit (wBD) is a new window-based translation method that relied on big-digit

(BD) recoding method, which is originally based on ZOT-binary number system. The ham-

ming weight of ZOT-binary numbers is lowest among some popular recoding methods, such

as Non-adjacent Form (NAF) and window NAF (wNAF) (Jahani and Samsudin, 2013b). Low

hamming weight of multiplicand will proportionately reduces the number of immediate partial

products, which in turn will reduce the number of steps required in the multiplication opera-

tion. Thus, wBD number system could be an excellent candidate to speed up systems with high

occurrences of large integer multiplication, such as Elliptic Curve Crypto (ECC) system.

1.1 Problem Statement

Hani Al-Mimi et al. (Mini et al., 2013) has implemented and tested the wBD multiplication

algorithm with MIRACL library, which is a C/C++ based library . But, the manipulation

on integers is not native to most of the existing computer architecture which rely on binary

numbering structure of 8, 16, 32 or 64 bits. Thus, the pure software implementation may

not be a viable solution especially for embedded systems that has limited resource and power

constrained. Besides the register size, single threaded processors tend to do the computationally

intensive processing in sequential nature and unable to take advantage of the parallelism and

optimization that exist in some operations. Therefore, in order to enable boarder adoption of

the wBD numbering system in resource constrained embedded systems, an optimized hardware

accelerator design will be introduced.

1

1.2 Research Objective

The main objective of this research is to build an efficient hardware accelerator prototype of

wBD-based multiplication. Followings are the sub-objectives:

1. The architectural implementation is targeting efficiency as well as provides significant

improvement over equivalent pure software implementation.

2. Prototype of the hardware accelerator will be emulated in FPGA development board.

3. The speed improvement of the hardware accelerator is to be benchmarked against the

pure software implementation of classical multiplication under a same platform.

1.3 Thesis Overview

This thesis is organized into five main chapters. Chapter 2 discusses the background and hard-

ware implementation of several multiplication algorithms reported in literature. In addition,

topics related to hardware prototyping such as ARM architecture and Hardware Description

Language (HDL) are also discussed. In Chapter 3, the development methodology of software

and hardware implementation are elaborated. Detailed block diagrams and data flows are also

discussed in the chapter. Result and discussion are listed down in Chapter 4. Finally, the

conclusion and future works are discussed in Chapter 5.

2

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

Multiplication between large numbers is crucial in wide range of applications, such as cryptog-

raphy and scientific calculations. Thus, there have been many research efforts in enhancing the

multiplication speed (Khan et al., 2015). Generally, there are two ways to speed up the multipli-

cation process, which is using new algorithm or by decoding the operands of the multiplication

into different numbering system.

2.2 Multiplication Algorithms

Among the algorithms for large number multiplication are Classical, Karatsuba (Karatsuba and

Ofman, 1963), Toom-Cook (Toom, 1963) and Schönhage-Strassen (Schönhage and Strassen,

1971). The first two algorithms are especially popular among others in cryptography applica-

tions.

2.2.1 Classical

Classical (also known as Schoolbook) algorithm is a fundamental integer multiplication routine

that use the basic operand scanning methods. Figure 2.1 describes the operand scanning form

of Classical algorithm (Hankerson et al., 2004).

3

Input: Integers A,B ∈ [0, p−1]Output: C = A×BSet C[i]← 0 for 0≤ i≤ t−1for i from 0→ t−1 do(UV )←C[i+ j]+A[i].B[ j]+UC[i+ j]←V

end forreturn C

Figure 2.1: Integer Multiplication Algorithm (Operand Scanning form)

In classical algorithm, the operands will be scanned from left to right, and the partial prod-

uct will be accumulated to generate the final result. Figure 2.2 illustrates a simple classical

multiplication of A = (a3a2a1a0)base (multiplicand) and B = (b2b1b0)base (multiplier). Both

numbers are represented in an arbitrary base.

a3 a2 a1 a0

× b2 b1 b0

a3b0 a3b0 a3b0 a3b0

+ a3b1 a3b1 a3b1 a3b1

+ a3b2 a2b2 a1b2 a0b2

s6 s5 s4 s3 s2 s1 s0

Figure 2.2: Graphical Representation of Classical Multiplication Algorithm

Most simple digital systems implement the multiplication function according to classical ap-

proach and interpreting the operands using radix-2 representation, e.g. binary. The numbering

system is straightforward in digital implementation because multiplication with a single bit of

binary symbol (0 or 1) can be reduced to zero or replaced with the shifted version of operand.

4

2.2.2 Karatsuba

The Karatsuba Algorithm was published by Karatsuba and Ofman in 1962 (Karatsuba and

Ofman, 1963). It is considered as one of the fastest ways to multiply large numbers. The al-

gorithm adopts the binary splitting scheme over its divide and conquer paradigm. This method

requires only four simple additions and three multiplications for each iterations.

To apply the algorithm, both operands are first split into two n/2 digits numbers. In next

iteration, the halves are then being split again in half. Since every iteration chop the number

by half, the algorithm will be terminated after t = log2n steps (Kodali et al., 2014). Figure

2.3 shows the recursive Karatsuba algorithm with assumption that both operands have even

number of bits.

Input: Positive integers A = (an, ...,a0)r and B = (bn, ...,b0)r having n+1 base rdigitsOutput: C = A×BIf n = 1 then return C = A×BSplit A,B into two equal parts:

A = AL× rn/2 +AR and B = BL× rn/2 +BR

Compute the following:d1 = KA(AL,BL);d0 = KA(AR,BR) and d0,1 = KA(AR +AL,BR +BL

return C = d1× rn +(d0,1−d0−d1)× rn/2 +d0

Figure 2.3: Recursive Karatsuba Algorithm, C = KA(A,B) (Kodali et al., 2014)

For example, let’s examine a case where n= 2 and base= 10. In this example, the operands

could be split into halves, that is, into their decimal digits as in (2.1).

b

a

= r×10+ s

= p×10+q

(2.1)

5

Let’s say, we have a = 47 and b = 93 as illustrated in (2.2). Then,

r

p

= 9,s = 3

= 4,q = 7

(2.2)

The product a×b can now be written in terms of the digits as in (2.3).

a×b

= (p× r)×102 +(p× s+q× r)×10+q× s

= (p×10+q)× (r×10+ s)

(2.3)

Continuing with the example of a = 47 and b = 93 as in (2.4).

47×93

= 4371

= (4×9)×102 +(4×3+7×9)×10+(7×3)

(2.4)

By writing the product a× b of the two-digit numbers a and b as above, it can be shown

that the result can be computed using four multiplications of one-digit numbers, followed by

additions. This is how the divide-and-conquer technique being adopted in Recursive Karatsuba

algorithm. Besides this, the Karatsuba algorithm reduces the operation a×b from four multi-

plication steps each iteration to only three steps. These three multiplication steps are used to

compute the following three auxiliary products as in (2.5):

w

v

u

= q× s

= (q− p)× (s− r)

= p× r

(2.5)

The above three auxiliary variables can be related back to the a×b as been described in (2.6).

6

u+w− v

= p× s+q× r

= p× r+q× s− (q− p)× (s− r)

(2.6)

So, equation (2.7) can then be derived by representing equations (2.6) back into (2.3).

a×b = u×102 +(u+w− v)×10+w (2.7)

Generally, Karatsuba multiplication algorithm has a better complexity compared to Classi-

cal multiplication algorithm (Fang and Li, 2007). However, overhead of Karatsuba multiplica-

tion algorithm in lower range numbers (less than 255 digits) has caused designer to combine the

Classical and Karatsuba to achieve better overall algorithm efficiency (Jahani and Samsudin,

2013a). However, for software implementation, classical multiplication algorithm generally

gives faster result than Karatsuba in numbers that is less than 255 digits due to long and ineffi-

cient carry chains on the CPU pipeline. (Fang and Li, 2007).

2.2.3 Toom-Cook Algorithm

Similarly, Toom-Cook algorithm also used the divide-and-conquer paradigm. Same as Karat-

suba multiplication, Toom-Cook algorithm also operates by breaking the input numbers into

chunks of smaller size. The larger products are then expressed in terms of equations of smaller

pieces. Toom-Cook applied recursively on these smaller pieces as well. There is a lot of sim-

ilarity between this method with the Fast Fourier Transform (FFT), as both are working on

the same principles of polynomial multiplication (Crandall and Pomerance, 2006) and (Zuras,

1994).

In Toom-Cook algorithm, the input numbers are split into chunks of given size, and then

7

the inputs will be written as a polynomial using the chunk size as radix. The polynomials will

be evaluated at a set of points and its values will be multiplied together at those points. With

the products, the product polynomial can then be determined. Finally, substitution of the radix

returns the final answer.

Let’s take an example of multiplying 123,456 by 987,654 as in (2.8). In this case, Toom-3

algorithm can be applied. First, each numbers will be split into 2 chunks with 3 digits in length.

Then, choose a set of points, for example {−1,0,1}

A×B = 123,456×987,654 (2.8)

By splitting A and B and writing the chunks into polynomials form, we will have:

B(x)

A(x)

= 987x+654

= 123x+456

(2.9)

Then, substitute the x with a set of points to get the values of A(x),B(x)andA(x)×B(x).

x =−1→ A(−1) = 333, B(−1) =−333, A(−1)×B(x) =−110,889

x = 0→ A(0) = 456, B(0) = 654, A(0)×B(x) = 298,224

x = 1→ A(1) = 579, B(1) = 1641, A(1)×B(x) = 950,139

In next step, the polynomial P(x) = A(x)B(x) can be solved based on the values at the three

points. The solution is of the form as in (2.10):

P(x) = p2x2 + p1x+ p0 (2.10)

The values of x can be substituted into the (2.10) with the points {−1,0,1} to get a set of

simultaneous equations that are linear in the unknowns pi as in (2.11):

8

P(1)

P(0)

P(−1)

= 1∗ p2 +1∗ p1 + p0 = 950,139

= 0∗ p2 +0∗ p0 + p0 = 298,224

= 1∗ p2−1∗ p1 + p0 =−110,889

(2.11)

By solving following matrix as in (2.12), the values for the unknowns pi as shown in (2.13)

can be obtained. 1 −1 1

0 0 1

1 1 1

p2

p1

p0

=

−110,889

298,224

950,139

(2.12)

p0

p1

p2

= 298,224

= 520,514

= 121,401

(2.13)

With values in equation (2.13), the final multiplication result can be calculated as shown in

Figure 2.4.

121,401,000,000

+ 520,514,000

+ 298,224

121,981,812,224

Figure 2.4: Final multiplication based on values in Toom3 equations

Steps such as evaluating the values and solving the equation seems difficult, but it turns out

9

to be relatively straight forward for a computer to execute. Toom-Cook algorithm only requires

some simple operations such as addition, subtraction, shifts, division by small constants and

multiplication with one-third of the original inputs.

2.3 Numbering Systems

Besides optimizing the multiplication algorithm, the other solution to improve the arithmetic

function is by representing the operands using other numbering system. In this regards, several

numbering systems have been introduced.

2.3.1 Booth

Booth proposed an algorithm to speed up the Classical multiplication computation in 1951

(Booth, 1951). The implementation of Booth algorithm requires repeatedly addition of either

A or S to the partial product P, which the A and S can be calculated from steps mentioned. The

resultant P will then be rightward shifted. Also, note that the addition of A or S with P can be

done by using ordinary unsigned binary addition.

As described are the steps for Booth algorithm (Booth, 1951). Let m be the multiplicand

and r be the multiplier; and x and y represent the number of bits in m and r respectively.

1. First the values of A, S and initial value of P have to be determined. These numbers

should have a length equal to (x+ y+1) of bits.

• A: Pad the most significant (leftmost) bits with the value of m. Fill the remaining

(y+1) bits with zeros.

• S: Pad the most significant bits with the value of âLŠm in two’s complement nota-

tion. Fill the remaining (y+1) bits with zeros.

10

• P: Pad the most significant x bits with zeros. Then append the value of r to the

right of this. Pad the least significant (rightmost) bit with a zero.

2. Then, the two least significant (rightmost) bits of P are to be determined.

• If they equal to 01, then find the value of P+A and ignore any overflow.

• If they equal to 10, then find the value of P+S and ignore any overflow.

• If they equal to 00, do nothing and reuse the same P directly in the next iteration.

• If they equal to 11, do nothing and reuse the same P directly in the next iteration.

3. Arithmetically shift the value calculated in the 2nd step by a single bit to the right. Let

set P to this new value.

4. Repeat steps 2 and 3 for y times.

5. Finally, drop the least significant bit from P. This is the product of m and r.

As an example, to find the multiplication result of 3 and -4, we will have following variables

to start with: m = 3, r = -4, x = 4 and y = 4 as in (2.14).

m= 00112,−m = 11012,r = 11002 (2.14)

Then, the A, S and P can be calculated by applying the algorithm listed in step 1 as in 2.15,

P

S

A

= 0000110002

= 1101000002

= 0011000002

(2.15)

Now, start the Booth multiplication by performing following loop for four iterations:

1. P = 0000110002. The last two bits are 002.

11