STUDIES ON HIGH-SPEED HARDWARE ...lib.tkk.fi/Diss/2008/isbn9789512295906/isbn9789512295906.pdfHelsinki University of Technology Department of Signal Processing and Acoustics Espoo

Helsinki University of TechnologyDepartment of Signal Processing and Acoustics

Espoo 2008 Report 5

STUDIES ON HIGH-SPEED HARDWARE IMPLEMENTATIONOF CRYPTOGRAPHIC ALGORITHMS

Kimmo Jarvinen

Dissertation for the degree of Doctor of Science in Technology to be presented withdue permission of the Faculty of Electronics, Communications and Automation forpublic examination and debate in Auditorium S1 at Helsinki University of Technology(Espoo, Finland) on the 21st of November, 2008, at 12 noon.

Helsinki University of TechnologyFaculty of Electronics, Communications and AutomationDepartment of Signal Processing and Acoustics

Teknillinen korkeakouluElektroniikan, tietoliikenteen ja automaation tiedekuntaSignaalinkasittelyn ja akustiikan laitos

Distribution:Helsinki University of TechnologyDepartment of Signal Processing and AcousticsP.O. Box 3000FIN-02015 HUTTel. +358-9-451 3211Fax. +358-9-452 3614E-mail: [email protected]

c© Kimmo Jarvinen

ISBN 978-951-22-9589-0 (Printed)ISBN 978-951-22-9590-6 (Electronic)ISSN 1797-4267

Multiprint OyEspoo 2008

AB

ABSTRACT OF DOCTORAL DISSERTATION HELSINKI UNIVERSITY OF TECHNOLOGY P.O. BOX 1000, FI-02015 TKK http://www.tkk.fi

Author

Name of the dissertation

Manuscript submitted Manuscript revised

Date of the defence

Monograph Article dissertation (summary + original articles)

Faculty Department Field of research Opponent(s) Supervisor Instructor

Abstract

Keywords

ISBN (printed) ISSN (printed)

ISBN (pdf) ISSN (pdf)

Language Number of pages

Publisher

Print distribution

The dissertation can be read at http://lib.tkk.fi/Diss/

VÄITÖSKIRJAN TIIVISTELMÄ TEKNILLINEN KORKEAKOULU PL 1000, 02015 TKK http://www.tkk.fi

AB

Tekijä

Väitöskirjan nimi

Käsikirjoituksen päivämäärä Korjatun käsikirjoituksen päivämäärä

Väitöstilaisuuden ajankohta

Monografia Yhdistelmäväitöskirja (yhteenveto + erillisartikkelit)

Tiedekunta Laitos Tutkimusala Vastaväittäjä(t) Työn valvoja Työn ohjaaja

Tiivistelmä

Asiasanat

ISBN (painettu) ISSN (painettu)

ISBN (pdf) ISSN (pdf)

Kieli Sivumäärä

Julkaisija

Painetun väitöskirjan jakelu

Luettavissa verkossa osoitteessa http://lib.tkk.fi/Diss/

Acknowledgments

The research work for this thesis was carried out in the Department ofSignal Processing and Acoustics (formerly, Signal Processing Laboratory),

Helsinki University of Technology (TKK). First of all, I would like to thank mysupervisor, Prof. Jorma Skytta, for the support he gave me during these years.He always took care of the ever-increasing university bureaucracy and let meconcentrate almost 100 %-ly to my research work, which I greatly appreciate.Sincere thanks go also to Dr. Matti Tommiska for patient guidance in the earlyphases of this work.

The pre-examiners of this thesis, Dr. Panu Hamalainen and Dr. Toomas P.Plaks, are greatly acknowledged for insightful comments which, I believe, helpedme to improve the manuscript of this thesis significantly.

I had the privilege of being in the Graduate School of Electronics, Telecom-munications and Automation (GETA) from 2004 to 2008. I thank former di-rector of GETA, Prof. Iiro Hartimo, current director, Prof. Ari Sihvola, andcoordinator, Marja Leppaharju, for making GETA such an exemplary graduateschool. Without GETA’s support making of this thesis would have been muchmore difficult, or perhaps impossible.

Signal Processing Laboratory was always a nice place to work. The thanksfor that go to everyone who have worked there over the years, but I would liketo thank, especially, Juha Forsten, Antti Hamalainen, Jaakko Kairus, Esa Kor-pela, Pekka Korpinen, Keijo Lansikunnas, Dr. Jarno Martikainen, Sampo Ojala,Taneli Riihonen, Dr. Matti Rintamaki, Kati Tenhonen, and Dr. Matti Tom-miska. Also, I express my gratitude to the laboratory’s former secretary, AnneJaaskelainen, and secretary, Mirja Lemetyinen, because they never hesitated tohelp me in various practical problems.

GETA, Prof. Graham Jullien, and Prof. Vassil Dimitrov arranged me anopportunity to visit the ATIPS Lab at the University of Calgary, Canada, forthree months period in autumn 2005. It is hard to exaggerate the importance ofthis visit for my thesis work, and I am therefore most grateful for this wonderfulopportunity that I had. Especially, I would like to thank Prof. Dimitrov, Prof.Michael Jacobson, Jr., Dr. Laurent Imbert, Dr. Zhun Huang, and Andy Chan,with whom I had the pleasure of working during my short stay in Calgary.

Since June 2008, I have been affiliated with the Department of Informationand Computer Science at TKK. I would like to thank everyone there for warmlywelcoming me. In particular, I thank my new boss, Prof. Kaisa Nyberg, for

— i —

giving me time to finalize this thesis, and Billy Bob Brumley for fruitful cooper-ation that started already long before I joined the department. He also deservesthanks for commenting the manuscript of this thesis.

The research work was financed by GETA and two projects: GO-SEC (2003–2005), financed by Tekes and several Finnish telecommunication companies, andPLA (2006–2008), financed entirely by Tekes. My doctoral thesis work was fi-nancially supported also by the Nokia Foundation, Emil Aaltosen saatio, Tek-niikan edistamissaatio (TES), and Elektroniikkainsinoorien saatio, and I greatlyappreciate their generosity.

Finally, I would like to thank those who, although without having a directeffect on my research work, have influenced the outcome of this work in manyother ways. Therefore, I thank my parents, brothers, and sister for providingme a solid background and always supporting me in everything I have chosento do in my life. My warmest thanks go to Hanna for love, encouragement, andsupport she has given me over the years.

Espoo, October 2008

Kimmo Jarvinen

— ii —

Table of Contents

Acknowledgments i

Table of Contents iii

List of Publications vii

List of Abbreviations ix

List of Symbols xi

1 Introduction 11.1 Motivation for the Thesis . . . . . . . . . . . . . . . . . . . . . . 11.2 Scope of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 21.3 Contributions of the Thesis . . . . . . . . . . . . . . . . . . . . . 31.4 Structure of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . 51.5 Summary of Publications and Author’s Contribution . . . . . . . 5

2 Overview of Cryptography 92.1 Secret-key Cryptography . . . . . . . . . . . . . . . . . . . . . . . 11

2.1.1 Data Encryption Standard (DES) . . . . . . . . . . . . . 112.1.2 Advanced Encryption Standard (AES) . . . . . . . . . . . 12

2.2 Public-key Cryptography . . . . . . . . . . . . . . . . . . . . . . 122.2.1 Diffie-Hellman Key Exchange . . . . . . . . . . . . . . . . 132.2.2 RSA Cryptosystems . . . . . . . . . . . . . . . . . . . . . 132.2.3 ElGamal Cryptosystems . . . . . . . . . . . . . . . . . . . 142.2.4 Digital Signature Algorithm (DSA) . . . . . . . . . . . . . 152.2.5 Elliptic Curve Cryptosystems . . . . . . . . . . . . . . . . 152.2.6 On Difficulties of the Hard Problems . . . . . . . . . . . . 17

2.3 Cryptographic Hash Algorithms . . . . . . . . . . . . . . . . . . . 17

3 Hardware Implementation of Cryptographic Algorithms 213.1 Implementation Platforms . . . . . . . . . . . . . . . . . . . . . . 22

3.1.1 General-Purpose Processors . . . . . . . . . . . . . . . . . 233.1.2 Application Specific Integrated Circuits (ASICs) . . . . . 243.1.3 Low-Cost Environments . . . . . . . . . . . . . . . . . . . 24

— iii —

3.2 Field Programmable Gate Arrays (FPGAs) . . . . . . . . . . . . 243.2.1 Modern FPGAs and Recent Trends . . . . . . . . . . . . 263.2.2 FPGA Design Flow . . . . . . . . . . . . . . . . . . . . . 273.2.3 Cryptographic Algorithms in FPGAs . . . . . . . . . . . . 283.2.4 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3 Metrics for Evaluating Implementations . . . . . . . . . . . . . . 313.4 Side-Channel Attacks . . . . . . . . . . . . . . . . . . . . . . . . 32

4 Finite Fields 354.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354.2 Prime Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374.3 Binary Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

4.3.1 Polynomial Basis . . . . . . . . . . . . . . . . . . . . . . . 394.3.2 Normal Basis . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.4 Inversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424.5 Notes on Implementations . . . . . . . . . . . . . . . . . . . . . . 43

5 Advanced Encryption Standard 475.1 Description of AES . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.1.1 Decryption . . . . . . . . . . . . . . . . . . . . . . . . . . 495.2 Implementation of AES . . . . . . . . . . . . . . . . . . . . . . . 49

5.2.1 Memory-based Implementations . . . . . . . . . . . . . . 515.2.2 Combinatorial Implementations . . . . . . . . . . . . . . . 52

5.3 Literature Review of AES Implementations . . . . . . . . . . . . 545.3.1 ASIC Implementations . . . . . . . . . . . . . . . . . . . . 555.3.2 FPGA Implementations . . . . . . . . . . . . . . . . . . . 575.3.3 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . 58

6 Elliptic Curve Cryptography 616.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 616.2 Arithmetic on Elliptic Curves . . . . . . . . . . . . . . . . . . . . 63

6.2.1 Affine Coordinates . . . . . . . . . . . . . . . . . . . . . . 646.2.2 Standard Projective Coordinates . . . . . . . . . . . . . . 646.2.3 Lopez-Dahab Coordinates . . . . . . . . . . . . . . . . . . 65

6.3 Elliptic Curve Point Multiplication . . . . . . . . . . . . . . . . . 666.3.1 Methods with Precomputations . . . . . . . . . . . . . . . 696.3.2 Multiple Point Multiplication . . . . . . . . . . . . . . . . 70

6.4 Koblitz Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . 716.5 Literature Review of ECC Implementations . . . . . . . . . . . . 74

6.5.1 Typical Structure of ECC Processors . . . . . . . . . . . . 796.5.2 Parallelism in ECC Processors . . . . . . . . . . . . . . . 806.5.3 Flexibility of ECC Processors . . . . . . . . . . . . . . . . 816.5.4 Optimizations for Specific Curves or Algorithms . . . . . 826.5.5 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . 83

7 Results 857.1 AES-related Contributions . . . . . . . . . . . . . . . . . . . . . . 85

7.1.1 AES Surveys . . . . . . . . . . . . . . . . . . . . . . . . . 857.1.2 Fast AES Implementation . . . . . . . . . . . . . . . . . . 86

7.2 ECC-related Contributions . . . . . . . . . . . . . . . . . . . . . 86

— iv —

7.2.1 ECC Implementations for General Curves . . . . . . . . . 867.2.2 Utilizing Parallelism with Koblitz Curves . . . . . . . . . 877.2.3 Efficient Converters for Koblitz Curves . . . . . . . . . . . 877.2.4 Adaptation of the DBNS to Koblitz Curves . . . . . . . . 88

8 Conclusions 89

Bibliography 93

Errata of Publications 121

— v —

— vi —

List of Publications

This thesis consists of an overview and of the following publications which arereferred to in the text by their Roman numerals.

I. Kimmo Jarvinen, Matti Tommiska and Jorma Skytta, “Comparative Sur-vey of High-Performance Cryptographic Algorithm Implementations onFPGAs,” IEE Proceedings: Information Security, vol. 152, no. 1, Oct.2005, pp. 3–12.

II. Kimmo U. Jarvinen, Matti T. Tommiska and Jorma O. Skytta, “A FullyPipelined Memoryless 17.8 Gbps AES-128 Encryptor,” in Proceedings ofthe 11th ACM International Symposium on Field-Programmable Gate Ar-rays, FPGA 2003, Monterey, California, USA, Feb. 23–25, 2003, pp.207-215.

III. Kimmo Jarvinen, Matti Tommiska and Jorma Skytta, “A Scalable Ar-chitecture for Elliptic Curve Point Multiplication,” in Proceedings of the2004 IEEE International Conference on Field-Programmable Technology,FPT 2004, Brisbane, Queensland, Australia, Dec. 6–8, 2004, pp. 303-306.

IV. Kimmo Jarvinen and Jorma Skytta, “On Parallelization of High-SpeedProcessors for Elliptic Curve Cryptography,” IEEE Transactions on VeryLarge Scale Integration (VLSI) Systems, vol. 16, no. 9, Sep. 2008, pp.1162–1175.

V. Kimmo Jarvinen, Juha Forsten and Jorma Skytta, “FPGA Design ofSelf-certified Signature Verification on Koblitz Curves,” in Proceedings ofthe Workshop on Cryptographic Hardware and Embedded Systems, CHES2007, Vienna, Austria, Sep. 10–13, 2007, Lecture Notes in Computer Sci-ence, vol. 4727, Springer, pp. 256-271.

VI. Kimmo Jarvinen and Jorma Skytta, “Fast Point Multiplication on KoblitzCurves: Parallelization Method and Implementations,” Microprocessorsand Microsystems, in press, 11 pages.

VII. Kimmo U. Jarvinen and Jorma O. Skytta, “High-Speed Elliptic CurveCryptography Accelerator for Koblitz Curves,” in Proceedings of IEEE

— vii —

International Symposium on Field-programmable Custom Computing Ma-chines, FCCM 2008, Stanford, California, USA, Apr. 14–15, 2008, inpress, 10 pages.

VIII. Kimmo Jarvinen, Juha Forsten and Jorma Skytta, “Efficient Circuitry forComputing τ -adic Non-Adjacent Form,” in Proceedings of the 13th IEEEInternational Conference on Electronics, Circuits and Systems, ICECS2006, Nice, France, Dec. 10–13, 2006, pp. 232-235.

IX. Billy Bob Brumley and Kimmo Jarvinen, “Koblitz Curves and IntegerEquivalents of Frobenius Expansions” in Revised Selected Papers of the14th Annual Workshop on Selected Areas in Cryptography, SAC 2007,Ottawa, Ontario, Canada, Aug. 16–17, 2007, Lecture Notes in ComputerScience, vol. 4876, Springer, pp. 126-137.

X. V.S. Dimitrov, K.U. Jarvinen, M.J. Jacobson, Jr., W.F. Chan, and Z.Huang, “FPGA Implementation of Point Multiplication on Koblitz CurvesUsing Kleinian Integers,” in Proceedings of the Workshop on CryptographicHardware and Embedded Systems, CHES 2006, Yokohama, Japan, Oct.10–13, 2006, Lecture Notes in Computer Science, vol. 4249, Springer, pp.445-459.

XI. Vassil S. Dimitrov, Kimmo U. Jarvinen, Michael J. Jacobson, Jr., WaiFong (Andy) Chan and Zhun Huang, “Provably Sublinear Point Multi-plication on Koblitz Curves and Its Hardware Implementation,” IEEETransactions on Computers, vol. 57, no. 11, Nov. 2008, pp. 1469–1481.

— viii —

List of Abbreviations

3DES Triple-DESAES Advanced Encryption StandardAES-128 AES with a 128-bit keyAES-192 AES with a 192-bit keyAES-256 AES with a 256-bit keyALM Adaptive Logic ModuleALU Arithmetic Logic UnitALUT Adaptive Look-Up TableANSI American National Standards InstituteASIC Application Specific Integrated CircuitCBC Cipher Block ChainingCHES Cryptographic Hardware and Embedded SystemsCLB Configurable Logic BlockCMOS Complementary Metal Oxide SemiconductorCPLD Complex Programmable Logic DeviceCTR CounTeRDBNS Double-Base Number SystemDES Data Encryption StandardDLP Discrete Logarithm ProblemDoS Denial-of-ServiceDPA Differential Power AnalysisDSA Digital Signature AlgorithmDSP Digital Signal Processing / ProcessorECB Electronic Code BookECC Elliptic Curve CryptographyECDLP Elliptic Curve Discrete Logarithm ProblemECDSA Elliptic Curve Digital Signature AlgorithmFAP Field Arithmetic ProcessorFCCM Field-programmable Custom Computing MachinesFIPS Federal Information Processing StandardFPGA Field Programmable Gate ArrayFSM Finite State Machinegcd Greatest Common DivisorGCM Galois Counter ModeGLV Gallant, Lambert, and Vanstone

— ix —

GNB Gaussian Normal BasisHDL Hardware Description LanguageHECC Hyper-Elliptic Curve CryptographyIDEA International Data Encryption AlgorithmIEEE Institute of Electrical and Electronics EngineersIFP Integer Factorization ProblemJSF Joint Sparse FormLAB Logic Array Blocklsb Least Significant BitLUT Look-Up TableMD Message Digestmsb Most Significant BitNAF Non-Adjacent FormNAFw Width-w Non-Adjacent FormNBS National Bureau of StandardsNIST National Institute of Standards and TechnologyNSA National Security AgencyONB Optimal Normal BasisPAL Programmable Array LogicPLA Packet Level Authentication / Programmable Logic ArrayRACE Research and development in Advanced Communications

technologies in EuropeRAM Random Access MemoryRFID Radio Frequency IDentificationRIPEMD RACE Integrity Primitives Evaluation MDRISC Reduced Instruction Set ComputerROM Read-Only MemoryRSA Rivest, Shamir, and AdlemanSHA Secure Hash AlgorithmSPA Simple Power AnalysisSPN Substitution-Permutation NetworkSSL Secure Sockets LayerVHDL Very high speed integrated circuit HDLXOR eXclusive-ORτJSF τ -adic Joint Sparse FormτNAF τ -adic Non-Adjacent FormτNAFw Width-w τ -adic Non-Adjacent Form

— x —

List of Symbols

∗ Binary operation⊗ Bitwise logical and operation (AND)⊕ Bitwise logical exclusive-or operation (XOR)b·c The largest smaller or equal integerd·e The smallest larger or equal integer(·)−1 Inverse〈·〉 Bit string notation〈·〉x Hexadecimal notationa, . . . , e Elements of groups, rings, fields, and vector spacesa1, a2, a3, a4, a6 Curve parametersa(x), . . . , d(x) Elements of F2m with polynomial basisA, . . . , J Temporary variables of point operations in P and LDA AdditionA Affine coordinatesAddRoundKey(·, ·) Round key mixing transformation of the AESC Ciphertextd Exponent used in decryptiondN Number of ones in a multiplication matrixdK(·) Decryption function with a key KD Multiplier digit-sizeDi Data after round i in block ciphersdeg(·) Degree ofe Exponent used in encryptioneK(·) Encryption function with a key KE Elliptic curveEa2 Koblitz curve (a2 ∈ {0, 1})E(Fq) Group of points on an elliptic curve E over FqF FieldFq Finite field with q elementsF∗q Multiplicative group of FqFp Prime fieldF2m Binary fieldFq[x] Ring of polynomials over Fqf Clock frequency

— xi —

f(x) Binary polynomial of x ∈ Ff(·, ·) Round function of a block cipherF (·, ·) F -function of normal basis multiplicationg Generator of FqG Groupgcd(·, ·) Greatest common divisorh HashH(·) Hamming weight, the number of nonzero termsHn(·) Joint Hamming weight of n expansionsHash(·) Cryptographic hash algorithmi, j, n, r, . . . , u Positive integersI Inversionl Latency` Bit lengthk Positive (random) integerk An array of binary expansionsK KeyKi Round key for the ith roundKe Encryption keyKd Decryption keyKeySchedule(·) Key schedule of the AESLD Lopez-Dahab coordinatesm Dimension of an extension fieldM Message, plain textM MultiplicationMixColumns(·) Column transformation of the AESNr Number of roundsord(·) Order ofO The point at infinityp Prime numberp The number of parallel operationsp(x) Irreducible polynomialP Point on an elliptic curve, the base pointP Projective coordinatesq Order of a finite fieldQ Point on an elliptic curve, the result pointR RingRoundKeys Round keys of the AESS SquaringS(·) S-boxShiftRows(·) Row permutation transformation of the AESState State in the AESSubBytes(·) Substitution transformation of the AESt Computation timeT ThroughputTi(·) T-box iTN Multiplication matrix of normal basisV Vector spacew Window sizex, y Coordinates of a point in A

— xii —

X,Y, Z Coordinates of a point in P or LDα Normal element of F2m

δ, µ Auxiliary variables for Koblitz curves∆ Temporary variable of composite field inversionλ Temporary variable of point operations in Aξ Element of a multiplication matrix TN

σ Field isomorphismτ Complex rootϕ(·) Euler’s totient functionφ(·) Frobenius endomorphismψ, ω Coefficients of irreducible polynomials

— xiii —

— xiv —

Chapter 1

Introduction

Information security is a fundamental requirement for an operational in-formation society. Although issues considered as information security, such

as privacy of communication and reliable authentication, have been importantthroughout history, developments in digital computing and information tech-nology have set new requirements and challenges for them. The importance ofinformation security has grown because new technologies have made accessingand misusing confidential information easier and more profitable. Hence, infor-mation security, previously considered mostly by militaries and governments,has become an issue having relevance even for an average person living in amodern information society because of its significance in commerce, telecom-munication, etc.

Cryptography, the art and science of hiding data, plays a central role inachieving information security. It has become a fundamental part of commu-nication and commercial applications in the Internet as well as in many otherdigital applications. Cryptography is deployed with cryptographic algorithms,mathematical functions for hiding messages and retrieving hidden messages.This thesis studies and proposes methods to efficiently implement certain cryp-tographic algorithms.

1.1 Motivation for the Thesis

The motivation for this thesis, as perhaps for all engineering, arises from theclassic utilitarian view which was well captured by Cilardo et al. [67] who stated,

“A mathematical construction with practical applications, such asa cryptographic algorithm, has no real interest, in an engineeringsense, as long as methods for feasible implementation are not avail-able.”

In other words, efficient implementation techniques for cryptographic algorithmsare necessary in order to use them in practice.

Applications using cryptographic algorithms often set very tight require-ments for implementations. Low-cost applications, such as mobile hand-held

— 1 —

devices, sensor networks, or smart cards, to name a few, often set strict con-straints on available resources, such as logic, memory, or power [90]. On theother hand, high-speed applications, such as high-speed network servers, re-quire very high computation speeds in order to prevent cryptography from be-coming the bottleneck [105, 126]. Hence, designing implementations fulfillingthe requirements is often a challenging task and implementing cryptographicalgorithms has been widely studied in academia.

Naturally, general-purpose microprocessors can be used for implementingcryptographic algorithms but, in that case, the execution of computationally de-manding cryptographic algorithms easily becomes the limiting factor for overallperformance [63]. Hence, dedicated hardware implementations are commonlyrequired to offload cryptographic algorithms away from the microprocessor [63].Reprogrammable hardware has established itself as one of the most attractiveplatforms for cryptographic algorithms because of the combination of speed andflexibility [293]. This thesis concentrates on Field Programmable Gate Arrays(FPGAs) which are commonly considered very feasible implementation plat-forms among implementors of cryptographic algorithms [293]. However, a ma-jority of the ideas and architectures presented in this thesis can be applied alsoto Application Specific Integrated Circuits (ASICs), and some even for softwareimplementations.

1.2 Scope of the Thesis

The thesis concentrates on a secret-key cryptographic algorithm, Advanced En-cryption Standard (AES) [206], and public-key cryptographic algorithms basedon arithmetic on elliptic curves, Elliptic Curve Cryptography (ECC) [146, 194].AES has become a true de facto algorithm for secret-key cryptography and itis used in countless applications worldwide ranging from RFID tags [95] andsensor networks [120] to heavily-loaded servers [105] and optical networks [126].Because of the very wide use of AES, it is justified to focus this study solelyto AES. ECC is in many ways superior to traditional public-key algorithms,such as the widely-used RSA [239], because it offers similar levels of securitywith considerably shorter keys [32, 164] and faster performance [88, 251]. Inthe recent years, ECC has gained popularity also in practical systems and itis used, e.g., in Windows Vista [160], OpenSSL [218], and German biometricpassports [43, 44], to name a few. The National Security Agency (NSA) ofthe United States is also strongly promoting the use of ECC [210]. ECC hasthus become a mainstream method for public-key cryptography and its use isincreasing. Hence, the selection of ECC is justified.

Because this thesis considers both secret-key and public-key cryptographicalgorithms, the results also enable designing cryptosystems where a secret key isfirst shared by two parties with the public-key algorithm, i.e., with ECC, and,then, the actual encryption is performed with the faster secret-key algorithm,i.e., with AES. As mentioned, both AES and ECC are algorithms which areused in numerous practical applications and they are likely to remain popularalso in the foreseeable future. Consequently, most of the findings of this thesiscan be directly applied in practice.

Requirements for cryptographic implementations are diverse, but the pri-mary requirement considered in this thesis is speed of computation. Before a

— 2 —

cryptographic algorithm can be adopted to a practical application, there mustexist ways to implement it so that it matches the speed requirements of theapplication. For instance, if traffic in a high-speed network is encrypted withcryptographic algorithms, the implementations must be able to compute thealgorithms with throughputs matching the throughput of the network, or oth-erwise they become the bottleneck for the entire communication. An exampleof an application that sets very high requirements for computation speed is thePacket Level Authentication (PLA) communication scheme [46, 47], which wasthe main target application, especially, for the implementations presented in Vand VII. Other examples include heavily-loaded e-commerce servers [105] andoptical network switches [126].

Most publications of this thesis concern ECC. The emphasis is on methodsallowing increased use of parallelism in ECC operations, or more specifically inelliptic curve point multiplication, the principal operation of ECC. Point mul-tiplication is recursive in nature and it is thus hard to parallelize. Many of thecontributions focus on a class of elliptic curves, called Koblitz curves [148], whichoffer faster computation, but require certain conversions for integers. Koblitzcurves have attained a lot of interest in theoretical cryptography research andthey are listed in major ECC standards, such as [52, 205], but, prior to thisthesis, they have attained only limited interest in the community implementingcryptographic algorithms in hardware.

The focus of this thesis is on the algorithmic level and in methods allowingefficient implementation. This thesis does not discuss higher level aspects, suchas cryptographic protocols. Also lower levels, such as gate or transistor levels,are not discussed in this thesis. All contributions address the problem of achiev-ing fast computation of cryptographic algorithms efficiently. The scope of thethesis can be described as finding answers to the following questions:

— Which algorithms are the most suitable for hardware (or FPGA) imple-mentations and how could they be improved in order to increase theirsuitability?

— How should these algorithms be implemented in order to provide the bestpossible results?

Author’s other publications on topics related to cryptography, in additionto the publications of this thesis, include an automated design generator forECC [137], studies of cryptographic hash algorithms, MD5 and SHA-1 [138,139], and a recent work on fast decompressions in ECC [41].

1.3 Contributions of the Thesis

This thesis has contributed to several areas of research in cryptographic im-plementations. More specifically, the contributions enable faster implementa-tions of AES and ECC than those that were available previously, and they thusprovide improvements predominantly to applications requiring high speed im-plementations. The following lists the main contributions of this thesis. Moredetailed discussions are given in the publications and in Ch. 7.

The AES-related contributions (I, II) include

— 3 —

— a review of existing FPGA-based implementations of widely-used secret-key cryptographic algorithms including AES and hash algorithms (I) anda review complementing and updating this review (Ch. 5),

— a study illuminating the relationship between achieved throughput andconsumed area in AES implementations also when they use embeddedmemory of FPGAs (I), and

— a high-speed implementation of AES which was the fastest FPGA-basedimplementation at the time of publication (II).

The most significant contributions are those considering ECC (III–XI) andthey include

— a review of existing ECC implementations on FPGAs and ASICs (Ch. 6),

— an implementation optimized for the structure of FPGAs (III),

— a study on the effects of parallelism in ECC implementations and toolsfor evaluating and optimizing parallelism in these implementations (IV),

— a method that splits point multiplications on Koblitz curves for parallelprocessors and FPGA implementations utilizing this method (IV),

— an FPGA design for computing signature verifications, the most demand-ing operations of the PLA, with very high speed (V),

— improvements to precomputation algorithms used in these signature veri-fications (V),

— a method utilizing parallelism and interleaved point operations on Koblitzcurves (VI),

— implementations of the above method (VI) and an optimized implemen-tation which further improves on this by utilizing similarity of certainpoint multiplication algorithms (VII); these implementations achieve thefastest point multiplications currently available in the literature,

— hardware-optimized algorithms for integer conversions related to Koblitzcurves and their implementations (VIII, IX), and

— an introduction of a new multiple-base expansion to Koblitz curves andits FPGA implementations (X, XI).

These contributions increase knowledge of implementing AES and ECC inhardware, and more specifically in FPGAs. The most substantial results arethose that provide methods to implement ECC with increased amounts of par-allelism, because they result in very fast FPGA implementations. Especially,the results related to Koblitz curves dramatically improve existing solutions.The contributions of this thesis could be directly applied in various practicalsystems and, as a consequence, they also have significant practical relevance.

— 4 —

1.4 Structure of the Thesis

This thesis comprises an overview and eleven publications which are appendedin the end. The remainder of this overview is structured as follows:

Ch. 2 presents an overview of cryptography.

Ch. 3 discusses implementation platforms of cryptographic algorithms.

Ch. 4 introduces preliminaries of finite fields and their implementations.

Ch. 5 discusses AES and its implementations.

Ch. 6 discusses ECC and its implementations.

Ch. 7 summarizes the results of the thesis.

Ch. 8 draws conclusions.

Ch. 2 and 3 give general descriptions of cryptography and cryptographichardware, whereas Ch. 4–6 present detailed descriptions of both theory andpractice of the algorithms discussed in the publications. Ch. 7 and 8 presentthe results and conclude their importance for the research field.

1.5 Summary of Publications and Author’sContribution

The following provides short summaries of the eleven publications of this thesis.The author had a major role in research and writing in all these publications,and the author’s contribution in every particular publication is specified in theend of each summary.

Publication I The article presents a comparative survey of cryptographic al-gorithm implementations on FPGAs. The study concentrates on high-speed im-plementations of AES, International Data Encryption Algorithm (IDEA), MD5,and SHA-1, but other secret-key algorithms and hash algorithms are studiedshortly as well. The survey addressed the well-recognized problem of comparingimplementations using embedded memory to memory-free implementations. Itwas concluded that FPGAs suit well for cryptographic algorithms.

The author is responsible for all research and writing, except for the study ofIDEA in Sec. 2.2 which was written by Dr. Matti Tommiska who also made im-provement suggestions to the rest of the paper. Prof. Jorma Skytta supervisedthe research.

Publication II The paper describes an implementation of AES. The imple-mentation provided the highest reported throughput at the time of publication,and the most important contribution of the paper was thus the speed of the im-plementation. Composite fields were utilized in order to increase the efficiencyof the implementation. Xilinx Virtex FPGAs were used for demonstration butthe architecture itself is generic and applies to both FPGAs and ASICs.

All hardware designs are due to the author. The author wrote all of the pa-per; except Secs. 1, 6, and 7. Dr. Matti Tommiska provided extensive support in

— 5 —

both research work and writing of the paper. He also introduced the possibilityof using composite fields to the author and wrote the above mentioned sections.Prof. Jorma Skytta supervised the research.

Publication III The paper presents an implementation of ECC optimizedfor the structure of Xilinx Virtex FPGAs. The implementation was designedso that it could be easily generated with an automatic VHDL (Very high speedintegrated circuit Hardware Description Language) generator which was pre-sented in [137]. The achieved point multiplication times were among the fastestpublished results at the time of publication, and even the fastest ones for certainparameters.

The author designed and implemented all hardware architectures. The au-thor also wrote the entire paper while Dr. Matti Tommiska supervised the workand made many observations in both research work and writing of the paper.Prof. Jorma Skytta supervised the research.

Publication IV The article discusses parallelization of ECC processors anddescribes a generic processor architecture which is used in analyzing effects ofparallelism. The article presents a comprehensive study of existing paralleliza-tion techniques and, most importantly, presents a novel parallelization techniqueusing parallel processors which results in considerable reductions in latency, es-pecially, on Koblitz curves but also on general curves with certain presupposi-tions. A number of FPGA-based implementations are presented for both generaland Koblitz curves.

The author is responsible for all theory, hardware implementations, and theentire writing of the manuscript. Prof. Jorma Skytta supervised the researchand presented comments on the manuscript.

Publication V The paper presents an accelerator designed specifically forverifying self-certified signatures on Koblitz curves. This operation acts as thebottleneck in the PLA. It was concluded that up to 166,000 signatures canbe verified in second with a single FPGA by using massive parallelism. Thepaper also presents certain improvements to precomputations involved in veri-fications. The main contribution of the paper is that it shows that considerableimprovements in the number of operations per second are achievable by toler-ating slightly longer computation times for single operations.

The author is responsible for everything presented in the paper and he wrotethe entire paper. Juha Forsten helped in verifying the designs on an FPGA andhe also made certain improvements to the manuscript. Prof. Jorma Skyttasupervised the research.

Publication VI The article discusses ECC on Koblitz curves. It introducesa new parallelization method which achieves optimal multiplier utilization byinterleaving successive operations. The method has several advantages overthe best previously presented methods (the ones presented in IV). First, theachievable critical path is shorter; second, the method enables implementingboth general and Koblitz curves efficiently in the same hardware; third, themethod increases the attractiveness of using Koblitz curves; and, fourth, efficient

— 6 —

implementations can be designed by using the method as demonstrated in thepaper on FPGAs.

The author is alone responsible for all theory, implementations, and writing.Prof. Jorma Skytta supervised the research.

Publication VII The paper presents an implementation designed specificallyfor Koblitz curves. The implementation can compute point multiplications andsums of up to three point multiplications and, hence, it supports all ECC op-erations required in the PLA, confer V which supports only verifications. Theimplementation utilizes the common structure of the algorithms and specificprocessing units optimized for different parts of the algorithms, especially, themethod presented in VI. Both fast computation time for single operations andhigh throughput are achieved with the implementation and, to the author’sknowledge, it outperforms all other implementations available in the literature.

The author is responsible for everything presented in the paper and all writ-ing. Prof. Jorma Skytta supervised the research.

Publication VIII The paper describes an architecture for integer conversionsrelated to Koblitz curves. To the author’s knowledge, it was (and still is) theonly publication presenting such converters. The converter is compact andfast and, hence, it has significance in hardware realization of ECC on Koblitzcurves. Prior to the publication of the paper, a general-purpose processor wasused for the conversions in all publications. The contributions of the paperinclude modifications to conversion algorithms, which make them more suitablefor hardware, and an efficient converter architecture.

The author is responsible for all research, implementations, and writing inthe publication. Juha Forsten helped in testing the converter and gave somecomments during the writing of the paper. Prof. Jorma Skytta supervised theresearch.

Publication IX This paper discusses conversions related to Koblitz curvestoo, but to the other direction than in VIII. Such conversions have signifi-cance, e.g., in digital signature algorithms. The benefit compared to conversionsdiscussed in VIII is that conversions can be computed in parallel with otheroperations. The paper also presents efficient hardware implementations of theconversion, and such converters were not presented previously in the literature.

The author designed and implemented the hardware architecture. Theoret-ical work and ideas are due to Billy Bob Brumley. The author made certainsuggestions on how to efficiently compute the conversions and wrote Sec. 4.Brumley wrote the rest of the paper.

Publication X The paper presents multiple-base expansions, analogous tothe Double-Base Number System (DBNS) [79, 80], and they are useful in ECCon Koblitz curves. Two types of representations, one with two bases and theother with three, were suggested and it was proven that the number of termsin the three-base expansions is sublinear in the bit length of an integer. Thesame was conjectured for the two-base expansions based on extensive numer-ical evidence. The practical feasibility of the two-base method was shown by

— 7 —

designing an FPGA implementation. The implementation used the state-of-the-art algorithms and achieved high speed with moderate area requirements. Theimplementation was the fastest published ECC implementation at the time ofpublication. The suggested expansions were the first multiple-base expansionspresented for Koblitz curves which were shown to be feasible in practice.

The author designed and implemented the entire hardware implementation,except the multiple-base expansion converter which was designed and imple-mented by Dr. Zhun Huang. Theoretical work and ideas on multiple-baseexpansions are due to Prof. Vassil Dimitrov and Prof. Michael Jacobson, Jr.The software implementation and numerical results were made by Andy Chan.The author is responsible for the writing of Sec. 4, excluding Sec. 4.3, and apart of Sec. 5. The co-authors wrote the rest of the paper.

Publication XI The article extends X. It discusses multiple-base expansionswith more details and provides additional numerical evidence supporting feasi-bility of the expansions. A new section discussing inherent parallelism availablein the expansions is included together with a description of an FPGA imple-mentation utilizing this parallelism.

The most important new part, Sec. 5, was written solely by the author, andthe author is alone responsible for all research related to that section. Otherwise,the responsibilities were the same as in X.

— 8 —

Chapter 2

Overview of Cryptography

Cryptography, the art and science of keeping messages secure, has a longhistory which can be traced back to ancient Egypt some 4000 year ago [189].

Undoubtedly, cryptanalysis, the art and science of revealing messages hiddenby means of cryptography, has an equally long history. Together cryptographyand cryptanalysis are called cryptology. When the history of cryptology isdiscussed, it must be emphasized that historical cryptology has little commonwith modern cryptology which started in the 20th century, and which is thetopic of this thesis.

First, some fundamental terminology is introduced. The message, which isto be kept in secret, is referred to as plaintext. The process of hiding its contentis called encryption and the encrypted message is referred to as ciphertext. Theprocess of receiving the content of plaintext back from ciphertext is decryption.A cryptographic algorithm is the mathematical function used for encryptingand decrypting messages. A modern cryptographic algorithm always includes akey. A cryptographic algorithm, plaintexts, ciphertexts, and keys are referredto as cryptosystem.

Fig. 2.1 shows the basic communication model where Alice (A) and Bob(B) are communicating over an unsecured channel, and Eve (E) is trying toeavesdrop their communication. In order to prevent Eve from obtaining thecontent of a message, M , Alice uses a cryptographic algorithm to turn M into

A B

E

Unsecured channel

Figure 2.1: Communication model

— 9 —

a ciphertext, C, i.e. she performs

C = eK(M) (2.1)

where eK(·) is the encryption function with the key, K. Alice sends C over theunsecured channel to Bob who performs

M = dK(C) (2.2)

where dK(·) is the decryption function. Thus, Bob has received M but Eve,who does not have K, is unable to get M from C unless she can break thecryptographic algorithm. [122, 258]

Let Ke and Kd be the keys used in encryption and decryption, respectively.Cryptographic algorithms where Ke can be computed from Kd, and vice versa,are called secret-key1 cryptographic algorithms. The secrecy of keys is essentialfor the security of secret-key cryptographic algorithms because if Eve obtainsKe, she can easily compute Kd. Thus, Alice and Bob must exchange keys byusing a secured channel which makes key distribution one of the most difficultproblems in many applications. The most efficient way to attack a secret-keyalgorithm should be an exhaustive search through the key space. Even in thatcase, a secret-key algorithm is weak if it is realistic to assume that it couldbe possible to build a machine that can perform an exhaustive search in somereasonable time with realistic cost. [258]

Secret-key algorithms were the only option to deploy cryptography untilthe mid-1970s when Whitfield Diffie and Martin E. Hellman published theirlandmark paper, “New Directions in Cryptography” [76]. They introduced theconcept of public-key cryptography2 where Ke can be easily computed from Kd

but Kd cannot be computed from Ke (in any reasonable time). Thus, Ke canbe made public. Communication between Alice and Bob operates so that Bobpublishes his Ke. Alice uses it to encrypt M and sends C over the unsecuredchannel. Only Bob has access to Kd because it cannot be computed from Ke,and Bob is thus the only one able to decrypt C. Public-key cryptographicalgorithms are based on hard problems where “hard” means that a problemis considered impossible to solve with any computational resources availablecurrently or in the (near) future. In proportion, a problem is “weak” if there aremethods for solving the problem with some amount of computational resourcesavailable at the time although acquiring such resources might be unreachablefor all but the richest governments or corporations. Thus, the hardness of aproblem is open to interpretation. [258]

Moreover, the required level of security depends on confidentiality of thedata being encrypted. Hence, an adequate level of security may be achievedwith weak algorithms. For example, Data Encryption Standard (DES) wasused until late-1990s by some corporations although it was commonly consideredweak already in the 1980s [164]. As a rule of thumb, if the cost of breaking acryptosystem exceeds the value of encrypted data, then the cryptosystem canbe considered adequately secure [258]. However, cryptography alone does notprovide information security, but it is rather a set of techniques [189]. In additionto cryptographic algorithms, security relies also on many other aspects. In

1Terms such as private-key or symmetric(-key) are also commonly used in the literature.2Also called as asymmetric(-key) cryptography.

— 10 —

fact, failures in practical cryptosystems, such as automatic teller machines, arecommonly due to implementation and management errors [11]. Hence, in orderto make a system secure, all sides of an application must be secured: protocols,algorithms, implementations, etc., [164] and security considerations should be anorganic part in designing applications where security is an issue [231]. The abovedoes not diminish the importance of secure cryptographic algorithms becausethey have an essential role in building a secure system in the first place. Securityis always compromised if a severe flaw is found in a cryptographic algorithm.

Secret-key cryptography is considered in more detail in Sec. 2.1 and public-key cryptography is discussed in Sec. 2.2. Cryptographic hash algorithms areone-way functions which have a great significance in many cryptosystems, andthey are considered in Sec. 2.3.

2.1 Secret-key Cryptography

Secret-key cryptographic algorithms are categorized into either stream ciphers orblock ciphers based on how they manipulate data. A stream cipher handles dataone symbol, typically a bit, at a time whereas a block cipher encrypts data infixed-length blocks. Most cryptographic algorithms used today are block ciphersand stream ciphers are used predominately in situations where transmissionerrors are probable and implementation resources are limited [189], such asmobile communication devices. Stream ciphers are not considered further inthis thesis.

Modern block ciphers are usually constructed using a Feistel structure ora Substitution-Permutation Network (SPN), e.g., DES uses a Feistel structureand AES is an SPN. In both cases, encryption and decryption are performed byiteratively applying a round function, Di = f(Di−1,Ki), where Di−1 is the dataat iteration i−1 and Ki is the round key for iteration i. Round keys are derivedfrom the key, K, with a routine called key schedule. The round function, f(·, ·),includes substitutions performed with the so-called S-boxes which are predefinedtables, permutations that rearrange the input data using fixed rules, and roundkey mixing which is typically performed with an exclusive-or (XOR). [274]

An enormous amount of different block ciphers have been proposed and arein use in different applications. Probably the two most important ones, DESand AES, are discussed in Secs. 2.1.1 and 2.1.2, respectively.

2.1.1 Data Encryption Standard (DES)

In 1972, National Bureau of Standards (NBS) of the United States, nowadaysNational Institute of Standards and Technology (NIST), initiated a program forcomputer security which included the development of an encryption standard.After initial problems in finding a candidate that would fulfill the requirementsset for the standard, they received a promising candidate called Lucifer fromIBM. Lucifer was evaluated and revised by the NSA before its adaptation as aFederal Information Processing Standard (FIPS) in 1976. The standard [203]was published in January 1977. The development process and introductionof DES was perhaps the most important landmark in the history of academicresearch on cryptosystems. It was the first time in history when a high-security

— 11 —

cryptographic algorithm became available to everyone and, hence, it marked thebeginning of wide-scale academic research on cryptography. [258]

It had become obvious that the key length of DES was too short to besecure against exhaustive key search and, thus, a strengthened variant calledTriple-DES (3DES) was included into the final reaffirmation of the standardin 1999 [204]. DES was withdrawn officially in 2005 and it has been replacedby AES. DES is not considered further in this thesis because of its decreasingimportance in contemporary cryptosystems.

2.1.2 Advanced Encryption Standard (AES)

Because it had become obvious that DES had exceeded the end of its reason-able lifetime, NIST began the process to replace DES in January 1997. An opencompetition was announced for algorithms in September of the same year. Therequirements for the algorithm were 128-bit block size, support for keylengthsof 128, 192, and 256 bits, and royalty free availability. Altogether 15 candi-date algorithms were submitted to the competition. After the first round, fivefinalist algorithms were selected; namely, Mars, RC6, Rijndael, Serpent, andTwofish. [45, 211]

Criteria for the selection of the winner included security, performance, andalgorithm characteristics. All finalists were stated to be secure enough to beselected as AES, and there were no known intellectual property issues for anyof the algorithms. Thus, performance was the key issue which finally decidedthe winner. Performance was evaluated on various platforms including 8-bitembedded processors, 32-bit Pentium processors (the most important), DigitalSignal Processors (DSPs), FPGAs, and ASICs. [45, 211]

In October 2000, NIST announced that Rijndael was selected as the new AESalgorithm. The reasons why Rijndael was selected were its strong performanceon basically all platforms and the ease of hardware implementation [45, 211].Rijndael was also the most popular choice in straw polls at the final AES con-ference, and its selection received an approving reception in the community [45].Rijndael was officially adopted as the AES on December 6, 2001 [206]. Hence-forth, AES always refers to Rijndael in this thesis3. Details of the algorithm arenot considered here because they are discussed thoroughly in Sec. 5.1.

2.2 Public-key Cryptography

As mentioned, public-key cryptography was invented by Diffie and Hellman inthe mid-1970s. The main idea that they presented in [76] was that differentkeys Ke and Kd could be used for encryption and decryption so that it is easyto derive Ke from Kd but it would be infeasible to find Kd from Ke. Diffieand Hellman suggested using discrete exponentiation as a method for obtainingKe. Because of its fundamental nature, their key exchange method is reviewedin Sec. 2.2.1 and it is followed by short descriptions of certain other public-keyalgorithms currently used in many practical applications. Only basic versions

3To be precise, there is a small difference between AES and Rijndael: Rijndael acceptsblock sizes of 128, 192, and 256 bits, whereas AES defines only a fixed block size of 128 bits.In this thesis, both terms refer to the official AES unless stated otherwise.

— 12 —

are presented and they may be insecure against certain attacks, such as the-man-in-the-middle attack; see [258], for example.

Public-key cryptography is used in key exchange, encryption, and digitalsignatures. Key exchange and encryption are self-explanatory: Key exchangepermits two parties to safely agree a shared secret key over an unsecured channeland encryption allows data to be encrypted without sharing a secret key. Digitalsignatures can be seen as digital counterparts for handwritten signatures andthey provide unforgeable proofs of authorship and authenticity as well as non-repudiation.

2.2.1 Diffie-Hellman Key Exchange

First, Alice selects a random integer Kd,A from an interval [1, p− 1] where p isa prime. Then, she computes and publishes

Ke,A = gKd,A (mod p) (2.3)

where g is a fixed element of a finite field Fp. Finite fields will be discussedin more detail in Ch. 4. Alice’s private-key and public-key are Kd,A and Ke,A,respectively. Similarly, Bob selects Kd,B and computes and publishes Ke,B .Now, Alice and Bob get a common secret key K, which can be used in a secret-key cryptographic algorithm, by computing

K = gKd,AKd,B = (Ke,B)Kd,A = (Ke,A)Kd,B (mod p). (2.4)

In order to compute K, Eve, who knows only Ke,A and Ke,B , must computeeither Kd,A from Ke,A or Kd,B from Ke,B , i.e., she must solve the followingDiscrete Logarithm Problem (DLP)

Kd = loggKe (mod p). (2.5)

DLP is believed to be a hard problem if g and p are chosen carefully. Thus, itis impossible for Eve to obtain K. [76]

2.2.2 RSA Cryptosystems

Soon after Diffie and Hellman’s paper, Ronald L. Rivest, Adi Shamir, andLeonard Adleman published a cryptosystem [239] which is now known, afterits inventors, as the RSA cryptosystem. A key pair is generated in RSA so thatone randomly selects two primes p1 and p2 of the same bit length `/2 where `is referred to as the security parameter [122]. Then, one computes

n = p1p2 and ϕ(n) = (p1 − 1)(p2 − 1), (2.6)

and selects an integer e from the interval [1, ϕ(n)] so that gcd(e, ϕ(n)) = 1, i.e.,the greatest common divisor (gcd) of e and ϕ(n) is one4. Finally, one computesd = e−1 (mod ϕ(n)). The public-key is Ke = (n, e) and the private-key isKd = d. RSA encryption and signature schemes are based on the fact that

Med ≡M (mod n) (2.7)4ϕ(n) is Euler’s totient function which gives the number of positive integers smaller than

n and coprime to n [274].

— 13 —

for all integers M . [239]The RSA encryption operates so that Alice computes

C = MeB (mod nB), (2.8)

by using Bob’s public-key Ke,B = (nB , eB). She sends C over the unsecuredchannel to Bob who decrypts the message by computing

M = CdB (mod nB) (2.9)

where dB is Bob’s private-key Kd,B . [239]In the RSA signature scheme, Kd is used in signing (encryption) and Ke

in verification (decryption), but this does not change the fact that Kd must beprivate while Ke is public. Bob signs a message M as follows. First, he computesa hash, h = Hash(M), of M by using a cryptographic hash algorithm whichare discussed in more detail in Sec. 2.3. At this point, it suffices to state that his a fixed-length bit string which can be considered as the “fingerprint” of M .Second, Bob generates the signature s by computing

s = hdB (mod nB) (2.10)

and sends s and M to Alice. Alice verifies that M was signed by Bob so that,first, she computes h, as Bob did, and, second, she computes

h′ = seB (mod nB). (2.11)

Alice accepts the signature if h = h′, else she rejects it. [239]RSA is believed to depend on the difficulty of Integer Factorization Problem

(IFP) because Eve is unable to compute d from e if she does not know ϕ(n).Furthermore, she is unable to find ϕ(n) unless she can factor p1 and p2 out fromn, which requires solving the IFP. The best known method for factorization iscurrently the number field sieve factoring [18].

2.2.3 ElGamal Cryptosystems

Taher ElGamal [92] expanded the idea of Diffie and Hellman by introducingencryption and signature schemes based on the DLP in 1985. The ElGamalencryption scheme is considered next. Let private-keys and public-keys be asin Diffie-Hellman in Sec. 2.2.1. Then, Alice encrypts a plaintext, M , with thefollowing equations by using a random integer k ∈ [1, p−1] and Bob’s public-key,Ke,B :

C1 = gk (mod p)

C2 = MKke,B (mod p).

(2.12)

Then, Alice sends (C1, C2) to Bob who decrypts the ciphertext by computing

M = C2C−Kd,B

1 (mod p). (2.13)

— 14 —

2.2.4 Digital Signature Algorithm (DSA)

Digital Signature Algorithm (DSA), first presented by David W. Kravitz in [155],is included in standards by American National Standards Institute (ANSI) [8],NIST [205], and Institute of Electrical and Electronics Engineers (IEEE) [132,133], among others. DSA specifies how to sign and verify messages with cryp-tographic signatures. Consider a case where Bob signs a message, M , and Aliceverifies the signature.

The domain parameters are agreed so that p is a prime, n is a prime divisorof p − 1 and g is a generator of Fn; see Ch. 4. First, Bob generates a key pair(Kd,B ,Ke,B) by selecting a random integer Kd,B ∈ [1, n− 1] and by computing

Ke,B = gKd,B (mod p). (2.14)

Bob signs a message, M , by computing its hash, h = Hash(M). The signa-ture (s, r) is computed as follows:

r = (gk mod p) (mod n)

s = k−1(h+ rKd,B) (mod n)(2.15)

where k ∈ [1, n− 1] is a random integer and k−1 is its multiplicative inverse inFn; see Ch. 4. Bob published the domain parameters and Ke,B and sends themessage, M , and the signature, (r, s), to Alice. [205]

When Alice receives a message, M , and an attached signature, (r, s), whichis claimed to be signed by Bob, she retrieves the domain parameters and Bob’spublic key Ke,B . She verifies the signature by computing h = Hash(M) andthe following formulae:

u1 = s−1h (mod n)

u2 = s−1r (mod n)v = (gu1Ku2

e,B mod p) (mod n).

(2.16)

Alice accepts the signature if v = r, else she rejects it. [205]If Bob indeed generated the signature Alice received and the message was

not tampered after signing, s received by Alice is as in (2.15) which gives

k = s−1(h+ rKd,B) = u1 + u2Kd,B (mod n). (2.17)

Thus, gu1Ku2e,B = gu1gK

u2d,B = gu1+u2Kd,B = gk (mod n), and v = r which shows

that Bob generated the signature and M was not tampered after the generation.This proof was adapted from [140].

2.2.5 Elliptic Curve Cryptosystems

The use of elliptic curves in cryptography was independently proposed by NeilKoblitz [146] and Victor Miller [194] in 1985. Since then, large amounts ofwork has been done on ECC in academia and increasingly also in industry. Thesecurity of ECC is based on the hardness of a problem called Elliptic CurveDiscrete Logarithm Problem (ECDLP) which is an elliptic curve analogue ofthe DLP. ECC operates analogously to the cryptosystems based on the DLPwith the exception that exponentiations are replaced by elliptic curve operations.

— 15 —

0 2000 4000 6000 8000 10000 12000 14000 16000 18000 200000

100

200

300

400

500

600

Discrete logarithm or factorization (bits)

Elli

ptic

cur

ves

(bits

)

Figure 2.2: Approximative key sizes of ECC and public-key cryptosystems based onthe DLP or the IFP for a similar level of security [32]. The corresponding key sizesfor the curves specified by NIST [205] are plotted with dotted lines.

Elliptic curve operations are not considered here because a comprehensive reviewis presented in Ch. 6.

Standardization has great significance in ECC because there is a multitudeof parameters to choose from, they have a large impact on design decisions, andimplementations using different parameters usually fail to cooperate [67]. ECCis included in the following major standards: ANSI X9.62 [8] and X9.63 [9],IEEE 1363 [132, 133], FIPS 186-2 [205] (NIST), ISO/IEC 15946-1/2/3/4 [135]and 18033-2 [134], and SECG [51, 52].The DSA standard [205] also includes anelliptic curve variant of DSA called Elliptic Curve Digital Signature Algorithm(ECDSA) which is equivalent with the algorithm described in Sec. 2.2.4 with theexception that the exponentiations are replaced by elliptic curve operations [132,205]. A thorough presentation of ECDSA is given in [140].

The main reason why ECC has attained so much interest during the pasttwenty years is depicted in Fig. 2.2. The curve in Fig. 2.2 shows an approx-imation of the required key sizes of ECC vs. conventional public-key cryp-tosystems, i.e., schemes based on the DLP and the IFP, for a similar level ofsecurity. Fig. 2.2 is based on a formula from [32]. Fig. 2.2 also plots the cor-responding key sizes for the five elliptic curves recommended by NIST [205],henceforth called the NIST curves. When the ECC key sizes are 163, 233, 283,409, or 571 bits, the key sizes of cryptosystems based on the DLP or the IFPrequired for a similar level of security are approximately 890, 2030, 3220, 7820,or 17730 bits. These values are naturally highly approximative and they arebased on the assumption that there are no sub-exponential algorithms for theECDLP [32]. Further estimates on key sizes with equivalent security levels areprovided in [164] where they favor ECC even more.

— 16 —

Technical reasons supporting the use of RSA instead of ECC are basicallynonexistent. The reasons that have prevented ECC from becoming the primarypublic-key cryptosystems are most probably the tradition of using RSA and thefact that mathematics related to ECC are more difficult to understand. Thefield of ECC also contains many patents (many of them owned by CerticomCorporation [53], the leading company in ECC solutions) whose validity has notyet expired (confer RSA) which may have hindered the use of ECC. However,ECC has already been included in many widely-used standards and adapted tomany major applications.

2.2.6 On Difficulties of the Hard Problems

Although public-key cryptosystems ultimately rely on conjectures that hardproblems are indeed impossible to solve, several studies have supported suchconjectures. In 1991, RSA Laboratories published a challenge where cash prizesup to $200,000 (US dollars) were offered for successful factorizations of productsof two primes (n in Sec. 2.2.2). The challenge was set in order to motivateresearchers and, thus, increase knowledge on the difficulty of the IFP [246]. Thelargest challenge solved is 640-bit n that was factored in November 2005 [246].Nowadays, the smallest bit length of n recommended for use in practice is 1024bits. The challenge ended in 2007 when RSA Laboratories stated that enoughknowledge on the cryptographic strength of the IFP exists making the challengeirrelevant [246].

In 1997, Certicom Corporation launched a challenge for ECC, similar to theRSA challenge. The largest problem solved thus far is ECDLP with a 109-bitkey which was solved in April 2004 [50]. They used 2600 computers running17 months (roughly equivalent to one Athlon XP 3200+ running non-stop for1200 years) [50]. When the key size is 163 bits, the smallest size recommendedby NIST [205], the ECDLP is roughly 100,000,000 times harder [50]. In 2001, amodel presented in [164] predicted with various presumed parameters5 that 163-bit ECC will remain secure up to year 2021. More pessimistic parameters whichassumed considerable advances in methods for solving the ECDLP gave year2011 [164], but such advances have not yet been presented. A recent academicsecurity analysis [192] estimated that manufacturing and running ASICs thatcould solve the ECDLP with a 163-bit key in one year would cost $2,200,000,000,half of which is for power consumption. Hence, the 163-bit ECDLP is likely toremain unsolvable for several years to come.

2.3 Cryptographic Hash Algorithms

Cryptographic hash algorithms play a central role in many modern cryptosys-tems because they can provide data integrity. A hash algorithm produces ahash6 for an input message. The hash acts as a “fingerprint” of the message.This fingerprint can be used in ensuring integrity of a message because if themessage changes, so does its hash. Thus, a user can check the integrity of a

5Such as, DES was secure up to 1982, computing power doubles every 18 months (Moore’slaw), budget doubles every 10 years, and no new cryptanalytic methods are developed for theECDLP.

6Also commonly called message digest (MD), hash-code, hash-result, hash-value, etc.

— 17 —

message by computing a hash and comparing it to a previously computed hash.If the hashes are different, someone has edited the message. [189, 274]

Let Hash(·) be a cryptographic hash algorithm. Then the hash, h, of amessage, M , is given by

h = Hash(M) (2.18)

where M is a binary string with an arbitrary length and h has a fixed length.Two principal properties of a hash algorithm are compression and ease of com-putation [189]. Compression means that the function maps an input of an arbi-trary length to an output of a relatively small fixed length. Ease of computationsimply means that it is computationally easy to derive h from M .

In order to a hash algorithm to be secure, it must be resistant against thefollowing three problems [274].

— It must be computationally infeasible to find a preimage, i.e., given ahash, h, find M such that h = Hash(M). If a preimage is hard to find,the function is said to be preimage resistant.

— It must be hard to find a second input M ′ which has the same hash withany specific input M . If it is hard to find M ′ 6= M such that Hash(M ′) =Hash(M), the hash algorithm is said to be second preimage resistant.

— The function must be collision resistant, i.e., it must be hard to find twoarbitrary inputs M and M ′, such that M 6= M ′, for which Hash(M) =Hash(M ′).

If the first and second problem are infeasible to solve, a hash algorithm is re-ferred to as a one-way function [189]. If any of the three problems is “easy” tosolve, a hash algorithm is broken in cryptographical sense. However, in someapplications where cryptographic hash algorithms are used, it suffices that hashalgorithms are resistant only against certain problems, e.g., when used in a one-way password file, the only requirement is that the hash algorithm is preimageresistant [189].

Hash algorithms may involve a key, K, as well. Hash algorithms with a keyare referred to as keyed hash algorithms whereas algorithms without a key arecalled unkeyed hash algorithms. Unkeyed hash algorithms can be viewed as aspecial case of keyed hash algorithms where the key is constant. If an unkeyedhash algorithm is used, the hashes must be protected so that they cannot bealtered but, when a keyed hash algorithm is used with a secret key, the hashescan be public or transmitted over an unsecured channel without sacrificing theintegrity of a message [274]. Unkeyed hash algorithms are commonly used inmanipulation detection codes and keyed hash algorithms in message authenti-cation codes [189].

Cryptographic hash algorithms commonly used in practical applications in-clude MD algorithms proposed by Ronald L. Rivest of which MD5 [238] isthe latest, Secure Hash Algorithms (SHA) [207] recommended by the NIST,RIPEMD7 algorithms of which RIPEMD-160 and its extensions [81] are thenewest, and Whirlpool [21]. The author has published FPGA implementationsof MD5 [139] and a combined implementation of MD5 and SHA-1 [138].

7Research and Development in Advanced Communications Technologies in Europe (RACE)Integrity Primitives Evaluation Message Digest

— 18 —

Several cryptographic hash algorithms have been compromised recently sothat collisions can be found “easily”. MD5 and SHA-1 are among these algo-rithms. Thus, these hash algorithms should be used only in applications whichdo not require collision resistivity. [286, 287]

The SHA algorithms will be replaced with an algorithm developed throughan open competition similar to AES. The call for new hash algorithms waspublished in November 2007 in [208] and the new algorithm is scheduled to beselected in 2012.

— 19 —

— 20 —

Chapter 3

Hardware Implementation ofCryptographic Algorithms

Naturally, in order to use cryptographic algorithms in practice they needto be implemented. This chapter presents the most common platforms on

which cryptographic algorithms are implemented and discusses their suitability.The emphasis is on FPGAs because they are used in the publications.

Designing a good cryptographic implementation is a challenging task. Thereare at least the following requirements for implementations of cryptographicalgorithms and security systems in general:

Speed An implementation should be fast enough to ensure that execution ofcryptographic algorithms does not slow down a system significantly. Be-cause computation of cryptographic algorithms dominates in computa-tional complexity of many protocols, the need for fast implementations isevident [231]. Achieving fast performance is often a difficult task and it hasattained considerable amount of interest in both industry and academia.This requirement often yields a need for hardware acceleration becausemany studies have shown that software implementations cannot achieverequired performance levels with reasonable costs [118].

Resources Many environments set tight constraints to resources available forimplementations of cryptographic algorithms [231]. If resources are low onthe whole, only a small portion of those few resources is usually devotedfor cryptography which complicates implementing high-security crypto-graphic algorithms. The aforementioned constraints may include powerconsumption, amount of available logic or memory, etc. Arguably, theimportance of this requirement is increasing because of the emerging useof cryptographic algorithms in various low-cost applications [90], such assmart cards or mobile hand-held devices.

Costs Use of cryptographic algorithms should not increase the total cost ofa product significantly. This requirement sets strict constraints for de-signers, especially, if other requirements, such as speed, are considered

— 21 —

too. This is naturally an important requirement for any product but itsimportance is apparent in low-price consumer products.

Invisibility Implementations of cryptographic algorithms and other securitymeasures should be invisible, transparent, and easy to use for an end user.In other words, they should not cause any considerable harm for the user.The reason for this is that, if they do, it is likely that the algorithms willnot be used or they will be misused in a way that compromises security.A classical example is an exhaustive use of complex passwords which caneven weaken the system because users commit passwords to paper if theycannot remember them.

Exactness Designs must implement algorithms exactly according to the speci-fications, i.e., cryptographic algorithms are in general intolerant for errorscaused by optimizations. In many applications, such as Digital SignalProcessing (DSP), it is possible to optimize implementations by allow-ing small errors caused by a reduced computation accuracy, for instance.However, cryptographic algorithms usually fail to operate correctly if evenone bit is incorrect. On the other hand, optimizations inside the algorithmare possible as long as they do not change the output; see Sec. 5.2.2, forexample.

Security An implementation should not leak any information which can po-tentially compromise security. Although this first seems as a trivial re-quirement, it is—possibly together with an incompetent user—one of themost important issues affecting overall security. Actually, it has beenstated that it is more likely that a real world cryptosystem is broken be-cause of the weaknesses of implementations rather than the weaknesses ofcryptographic algorithms [294].

As this thesis is about hardware implementation of cryptographic algorithms,it is of interest to examine how the above requirements relate to hardware imple-mentations. Invisibility requirement seldom has any direct impact on hardwareimplementations because the user interface is usually software engineer’s workrather than hardware engineer’s. Even hardware designers must indirectly takethe end user’s comfort into account through other requirements. All other re-quirements need to be considered by a hardware designer and the importanceof a particular requirement depends on implementation platforms and applica-tions.

3.1 Implementation Platforms

A coarse categorization for implementations, which has already been used above,is to divide them to software and hardware implementations. A software imple-mentation is an implementation running on a general-purpose processor; suchas Intel Core-2 Duo processors. The role of an implementor is to write softwarefor a processor by using a programming language. These implementations aretherefore referred to as software implementations although the processor itself ishardware. In a hardware implementation the implementor designs the hardwarewhich performs the desired function.

— 22 —

Reprogrammable logic refers to digital logic devices whose functions can bereprogrammed throughout the device [281]. The place of reprogrammable logicin the aforementioned categorization is problematic because it can be catego-rized to either category on a sound basis. On one hand, an implementor doesnot design the actual hardware but produces a “software”, i.e., a program-ming file which programs the hardware to implement the desired function. Onthe other hand, the way in which an implementor works is almost identicalto hardware design process and all decisions (s)he makes during the processare closer to implementing hardware than software. Besides, people workingon reprogrammable logic in industry or academia almost always have a hard-ware background. In this thesis, reprogrammable logic is considered hardware.The implementations presented in the publications are all designed primarilyfor FPGAs. However, the ideas used in them apply usually to ASIC-basedimplementations, too.

General-purpose processors, ASICs, and certain low-cost environments arediscussed in the following sections concentrating on their use for cryptographicalgorithms. The prime target devices of this thesis, FPGAs, are discussed inmore detail Sec. 3.2.

3.1.1 General-Purpose Processors

General-purpose processors execute a program which is a series of instructions.Instructions supported by a processor are referred to as an instruction set. In-struction sets usually include instructions for integer arithmetic, boolean op-erations, data handling, and control flow as well as, e.g., for floating pointarithmetic. Programmer writes a program, either in assembly or high-level lan-guages, such as C, which is then compiled into instructions. The performanceof a program depends on programmer’s skill, the instruction set provided by aprocessor, and the efficiency of hardware.

The advantages of general-purpose processors are easy programming and up-dating and the fact that the same code is usually easy to use in different proces-sors. Typical disadvantages are slow performance and high power consumption.These disadvantages emphasize if a cryptosystem is implemented using proces-sors with instructions sets that do not naturally support operations requiredin cryptographic algorithms. For instance, instruction sets of microprocessorsusually support integer and floating point arithmetic and boolean operations,whereas cryptographic algorithms often use more “exotic” arithmetic such asfinite fields.

Academics—and more recently also processor manufacturers—have recog-nized the problems caused by limited instructions, and instruction set extensionsfor cryptography are gaining interest [89]. Such extensions have been proposed,e.g., in [89, 112, 279]. On the commercial side, better support for cryptographyhas been included at least in Sun Microsystems’ new UltraSPARC T2 micro-processors which include specific cryptographic accelerator blocks supportingvarious cryptographic algorithms including (3)DES, AES, SHA-1, MD5, RSA,and ECC [276]. Such instruction set extensions can provide considerable perfor-mance increases and, thus, make general-purpose microprocessors more feasibleplatforms for cryptography.

— 23 —

3.1.2 Application Specific Integrated Circuits (ASICs)

As the name suggests, ASICs are integrated circuits designed for a specific ap-plication. Contrary to general-purpose circuits, such as microprocessors, ASICscan perform only functions that they are designed to perform. In ASICs, adesign is implemented directly on silicon and, hence, it cannot be changed afterthe chip is manufactured. The resulted implementation can be highly optimizedin terms of speed, area, and power consumption, but the inability to change thedesign after manufacturing is a serious disadvantage. Modern ASIC processesare also very expensive and time consuming and, hence, they are nowadays usedpredominantly in large volume consumer products, such as mobile phones.

Traditionally cryptographic algorithms were always implemented as ASICswhereas, nowadays, a large majority of practical cryptosystems are undoubtedlysoftware implementations running on general-purpose microprocessors. How-ever, ASICs still have significant importance for cryptographic implementationsin environments where microprocessors, or even FPGAs, cannot be used, e.g., inenvironments requiring very low power consumption or high encryption speeds.

3.1.3 Low-Cost Environments

The use of cryptographic algorithms is increasing in low-cost environments,such as smart cards, radio frequency identification (RFID) tags, etc. [90]. Thereason is that they often handle sensitive data, such as health-monitoring, anti-counterfeit, or biometric data, which yields a need for strong encryption [90].Although implementations for low-cost environments use the same techniquesthat are used in implementations targeting to high speed, i.e., ASICs, micro-processors, or FPGAs, the requirements set by these environments differ greatlyfrom the ones discussed previously. As a consequence, low-cost environments arediscussed as a separate topic. Low-cost implementations require difficult trade-offs between security, cost, and performance [90]. Besides, they are used in themost insecure environments [231] which makes implementations more vulnera-ble for physical attacks, as will be discussed in Sec. 3.4. Hence, cryptographyin low-cost environments is an important and difficult research area.

Modern cryptographic algorithms, such as AES, have been developed sothat they achieve good performance in software. This may not be a problemfor smart cards because their processors have become powerful enough to runcryptographic algorithms originally designed for high-end microprocessors [229].However, this sets challenges for designing efficient implementations for certainlow-cost environments, such as RFID tags [90]. Hence, several block ciphersdeveloped specifically for low-cost applications have been proposed recently in-cluding DESXL (a low cost variant of DES) [163], Hight [127], and Present [33].Implementing public-key cryptography is costlier but ECC has been shown tobe a viable solution even for low-cost environments [23, 116]. A recent surveyof the field of low-cost cryptography is given by Eisenbarth et al. in [90].

3.2 Field Programmable Gate Arrays (FPGAs)

Implementations on general-purpose processors are flexible in the sense thatprograms are easily written and updated by using high-level programming lan-guages. As a downside, they are slow and consume considerable amounts of

— 24 —

Low High

Low

High

Performance

Fle

xibi

lity

General−PurposeProcessors

Field−ProgrammableGate Arrays

(FPGAs)

Application−SpecificIntegrated Circuits

(ASICs)

Figure 3.1: A performance-flexibility comparison of general-purpose processors,ASICs, and FPGAs [34, 281]. Performance can be understood as pure speed or powerefficiency, or as their combination (throughput/Watt).

power which practically rule general-purpose processors out from many appli-cations. ASICs, on the other hand, are less flexible but offer high performanceand low power consumption. Thus, there is a significant gap between general-purpose processors and ASICs [34]. Reconfigurable logic fills this gap and—in an ideal case—combines the best of both; namely, the speed of ASICs tothe flexibility of general-purpose processors [281]. This is illustrated in theperformance-flexibility graph of Fig. 3.1 [34, 281].

Reconfigurable logic includes a wide scope of programmable devices in-cluding PLA (Programmable Logic Array), PAL (Programmable Array Logic),CPLD (Complex Programmable Logic Device), and FPGA. However, FPGAsare the only ones which have any major interest from the cryptography point-of-view and, henceforth, the terms reconfigurable logic and device always refer toFPGAs in this thesis. This presentation concentrates on FPGAs manufacturedby the two leading vendors: Altera and Xilinx. Other FPGA manufacturersinclude Achronix, Actel, Atmel, Cypress, Lattice, and QuickLogic.

FPGAs consists of reconfigurable functional units, interconnections, and in-terface. Reconfigurable functional units are used for implementing the logicneeded in a design and they are connected with the reconfigurable intercon-nections. Reconfigurable interfacing is used for communication with the rest ofa system. As a rule of thumb, the more reconfigurable an architecture is theslower it is and the more power it consumes. [280]

Reconfigurability depends on the size of reconfigurable units referred to asgranularity. An architecture where reconfigurable functional units are small iscalled fine-grained and an architecture where they are large is coarse-grained. Afine-grained functional unit typically operates on a small number of bits and acts

— 25 —

LUT &Carry

Control

FFD Q

LUT &Carry

Control

FFD Q

COUT

CIN

BX

BY

F1F2F3F4

G1G2G3G4

Y

YQ

X

XQ

Figure 3.2: A simplified presentation of a Virtex-E/II slice (adapted from [299, 302]).

as a Look-Up Table (LUT) with a small number of inputs and a single bit out-put [280]. An example of a coarse-grained functional unit is an Arithmetic LogicUnit (ALU) [280] and, thus, coarse-grained architectures are usually better inapplications which are intensive on arithmetic [281]. The granularity of recon-figurable interconnections also varies in different reprogrammable architecturesso that fine-grained interconnections allow each bit to be connected separatelywhereas coarse-grained interconnections connect several bits at a time [280].Most modern commercial FPGAs use fine-grained architectures where recon-figurable functional units are typically 4-to-1-bit LUTs which are arranged inclusters [280].

Programming an FPGA differs considerably from general-purpose processorsbecause a general-purpose processor program is stored into a separate memoryas instructions, whereas programming of an FPGA directly implements logicfunctions and interconnections inside the device [290].

3.2.1 Modern FPGAs and Recent Trends

In Xilinx Virtex-E and Virtex-II FPGAs, which are used as implementationplatforms in I–III and X, the functional units are 4-to-1-bit LUTs which arearranged in clusters called slices and Configurable Logic Blocks (CLBs). A CLBcomprises four slices which consists of two 4-to-1-bit LUTs, two registers, andcarry and control logic [299, 302]. Fig. 3.2 presents a Xilinx Virtex slice. TheLUTs can be configured also as a 16-bit RAM (Random Access Memory) [299,302] or 16-bit shift register [302]. The Altera Stratix II architecture, used in IV–IX and XI, differs from the Virtex architecture so that the clusters called LogicArray Blocks (LABs) consists of eight subblocks named Adaptive Logic Modules(ALMs) which fracture into two Adaptive LUTs (ALUTs), two registers, andcarry and control logic [6]. ALUTs are flexible so that they can implement upto a 7-to-1-bit LUT [6]. Virtex-E, Virtex-II, and Stratix II devices all contain

— 26 —

considerable amounts of embedded memory [6, 299, 302].Modern FPGAs typically contain hardwired logic units specialized for cer-

tain commonly used tasks, such as DSP [280]. These units provide improve-ments in performance and power consumption in those specialized tasks butthey are useless in many applications where these operations are not directlyneeded [280]. Virtex-II has hardwired 18×18-bit multipliers [302] and Stratix IIhas more flexible DSP blocks which are useful in various DSP-related tasks [6],for example.

In 2005, Todman et al. [280] listed main trends in reconfigurable architec-tures as follows. First, there is a trend to increase granularity of logic units inorder to reduce the amount of interconnections, e.g., Stratix II 7-input LUT [6].Second, the number of specific embedded function units, e.g., DSP blocks andembedded memories, is increasing. Third, the use of soft cores, and espe-cially soft core processors, is growing. A soft core is implemented using re-programmable logic by using synthesizable function provided by a vendor [280].Although they are slower and require more area than hard cores which arehardwired onto the silicon, they provide upgradeability and they are easy to in-tegrate into the rest of a system [280]. Xilinx offers a soft core processor calledMicroBlaze [300] and Altera calls their newest soft core processor Nios II [5].

Since Todman’s predictions, Xilinx has launched Virtex-5 devices [303] andAltera has introduced Stratix III family FPGAs [7]. Xilinx has increased thegranularity of Virtex-5 by introducing 5-to-1-bit and 6-to-2-bit LUTs and Al-tera’s Stratix III is still using up to 7-to-1-bit ALUTs. The number of specializedembedded function units has continued growing in the new device families, andalso Xilinx has introduced more flexible DSP blocks [303]. Both Xilinx andAltera are still supporting their soft core processors. Hence, the developmenthas obviously had the direction predicted in [280]. There is also an evidentdirection which began by the introduction of Virtex-4 and which has continuedin Virtex-5 and Stratix III as well; namely, the vendors introduce sub-familiestailored for certain types of applications. Virtex-4 has sub-families LX, SX, andFX for logic applications, DSP applications, and embedded platform applica-tions, respectively [301]. Virtex-5 also has an additional sub-family specializedin advanced serial connectivity [303]. Stratix III has a balanced sub-family Lfor mainstream logic applications, a memory and multiplier rich sub-family Efor data-centric applications, and a sub-family GX with high-speed connectivityfor high bandwidth applications [7].

3.2.2 FPGA Design Flow

A design flow of FPGA-based implementations is presented in Fig. 3.3. The flowbegins from design specifications and ends to a working design in an FPGA. Themost laborious part is the writing of Hardware Description Language (HDL).VHDL was used as an HDL for all designs of this thesis. The functionality ofthe HDL is verified by functional simulations. The verified HDL is mapped toa gate-level netlist with logic synthesis. Placement in the place & route processdedicates operations to specific reconfigurable functional units and routing con-nects these units by using reconfigurable interconnections. The place & routeis constraint-driven. A place & route of a complex design is computationallydemanding and often takes several hours with a modern desktop computer. Theplace & route returns a programming bitfile and a timing model description of

— 27 —

netlistGate-levelLogic

synthesisHDLPlace &route

bitfileProgramming

modelTiming

Timingsimulation

Functionalsimulation

Designspecifications FPGA

Figure 3.3: FPGA design flow

the result. The timing model is verified to fulfill timing requirements set in thespecifications with timing simulations. Finally, the FPGA is programmed withthe programming file.

The entire design flow can be performed within design softwares provided byFPGA vendors. Logic synthesis programs are included in the design softwaresbut several third party tools are available, too. Simulation software is oftenprovided by a third party because only simple simulators are included in thedesign softwares. Place & route tools are always provided by the FPGA vendoras well as software used in programming the FPGA.

As mentioned, writing HDL is the most laborious part. Automatic HDLgenerators, which automatically map specifications to HDL, can considerablyreduce time required in this step. Typically, such generators can produce com-petitive results compared to hand-optimized versions only if they are realizedfor a small and well-constrained set of specifications, e.g., for one cryptographicalgorithm with specific parameters. The author developed a simple generatorfor ECC in [137].

3.2.3 Cryptographic Algorithms in FPGAs

Cryptographic algorithms often dominate in computational costs of securityprotocols which yields a need for acceleration; for example, over one half of theprocessing time in Secure Sockets Layer (SSL) is consumed in cryptographicalgorithms [231]. Acceleration is typically performed by delegating the mostexpensive operations to an accelerator while the rest of the application is imple-mented in the main system which is usually a general-purpose processor [281].Consider DSA presented in Sec. 2.2.4 as an example. The most demanding taskin DSA is the modular exponentiation and, hence, DSA can be accelerated bydelegating exponentiations to an accelerator while the main processor performsother operations concurrently. Because this thesis predominantly considers im-plementing cryptographic algorithms using FPGAs, the following assumes that

— 28 —

Memorycaches

I/Ointerface

(a)

(b)(c)

(e)

Processor

(d)

Processor

Figure 3.4: Classification of reconfigurable systems [69, 280]; (a) external stand-alone processing unit, (b) attached processing unit, (c) coprocessor, (d) reconfigurablefunctional unit, and (e) processor embedded in reconfigurable logic.

the accelerators are implemented with reconfigurable logic. Many of the princi-ples are, however, valid also for other implementation platforms, such as ASICs.

Fig. 3.4 shows the classification of reconfigurable systems. The first fourclasses were presented by Compton and Hauck in [69] and their classificationwas complemented by Todman et al. in [280] with the class where the pro-cessor is embedded in reconfigurable logic. The closer the reconfigurable unitis to the control processor the faster the communication between the unit andthe processor is but, as a downside, the amount of reconfigurability is usuallysmaller and more control is required [280]. The external stand-alone processingunit favors applications which are computationally intensive but require littlecommunication and the reconfigurable functional unit, e.g., reconfigurable in-structions inside a processor, is ideal for small but frequently used tasks [69].The attached processing unit and coprocessor classes can be seen as tradeoffsbetween these two extremes. The last class, where the processor is embeddedin reconfigurable logic, is currently emerging in many applications because thegrowth in FPGA resources has enabled implementing entire systems within oneFPGA. This approach combines many of the advantages of the other four classesbut requires a large (and expensive) device. The embedded processor can beeither hard core, such as PowerPC in Xilinx Virtex-4 FX [301], or soft core,such as Nios II in Altera FPGAs [5]. Notice that all classes except (d) canbe implemented with a stand-alone FPGA. Thus, the implementations of thisthesis suit for all other classes except (d).

FPGAs have several advantages in cryptographic applications [91, 293]:

Algorithm agility It is possible to easily switch from an operation to an-other. This has significant importance in cryptographic applications be-

— 29 —

cause modern cryptosystems must often support many encryption algo-rithms. It is also possible to delete broken algorithms and introduce newalgorithms into the system.

Algorithm upload It is possible to easily update devices which are alreadyin use. This is, again, important for cryptographic applications becauseparameters of the current cryptosystem may need to be changed, e.g.,key length needs to be increased or an algorithm needs to be changed asstandards expire, e.g., DES needs to be changed to AES.

Architecture efficiency Significantly more efficient implementations can bedesigned with fixed parameters in certain cases. Reprogrammability ofFPGAs allows designing optimized implementations for all parametersseparately because the device can be reprogrammed when parameters arechanged whereas an ASIC design must support all parameters and suchoptimizations are out of reach. One example is the finite field multiplica-tion in polynomial basis; see Ch. 4. If an irreducible polynomial is fixed,reductions required in multiplication can be hardwired resulting in fasterand smaller multipliers. If the design must support arbitrary irreduciblepolynomials, such optimizations cannot be made.

Resource efficiency Many protocols are hybrid in the sense that a public-keyalgorithm is used in the beginning for key exchange after which the actualcommunication is encrypted using a secret-key algorithm. Because thesealgorithms are not used simultaneously, they can be switched on-the-flyby reprogramming the device which leads to a considerable reduction inrequired resources.

Algorithm modification Parts of standardized algorithms, e.g., S-boxes, mayneed to be modified, and such modifications are not a problem because ofreprogrammability.

Throughput Considerable increase in throughput is achievable with FPGAswhen compared to general-purpose processors as shown in an extensivenumber of publications; see, e.g., [59, 106, 107, 280].

Cost efficiency FPGA-based projects are typically significantly cheaper thanASIC-based projects when the products are targeted to low-to-mediumsized markets. The main reasons are shorter time-to-market and lowerinitial costs. ASICs become cheaper when the production quantities risebecause production of a single chip is cheaper after the often very largestartup cost.

It has been stated based on the above reasonings that cryptographic al-gorithms are prime candidates for FPGA-based implementations [281]. Thesuitability of FPGAs for cryptography has been pointed out also in other pub-lications, e.g., [34, 280].

Although embedded function units, such as DSP blocks, certainly improveperformance and power consumption in many applications, they have little usein most cryptographic applications, with the exception of embedded memories.The reason for this is that cryptographic algorithms commonly require “exotic”operations, such as arithmetic in finite fields, which are not supported by the

— 30 —

embedded blocks of modern FPGAs. Introduction of blocks specialized forcryptographic operations would naturally benefit the performance and powerconsumption of cryptographic algorithms on FPGAs, but currently none of theFPGAs available at market supports cryptography if hardwired cryptomodulesfor programming bitstream decryption are not taken into account.

3.2.4 Comparisons

Generally, FPGAs can provide speedups up to 500 times and power savings of35–70% compared to general-purpose processors [280]. However, the achievableimprovements depend on the implemented application. Cryptographic applica-tions usually achieve significant speedups on FPGAs, e.g., speedups of over 1000times have been reported for ECC [59]. The efficiency of FPGAs compared togeneral-purpose processors originates from parallel and pipelined processing andfrom the fact that general-purpose processors are often ill-suited for operationsused in cryptography. Operations causing problems include modular arithmeticand bit-level manipulations on bitvectors with uncommon lengths, i.e., otherthan 2w and longer than the wordlength. They cause problems, especially, inpublic-key algorithms [293].

A recent study in [158] showed that FPGA-based designs are on average 3.4to 4.6 times slower than ASIC-based designs and implementations on FPGAsconsume about 35 times more silicon area. Hence, the speed-area product isover 100 times better for ASICs than for FPGAs. FPGAs also consume about14 times more dynamic power. Area and power differences reduce substantiallyby using dedicated memory and multiplier blocks whereas their effect on per-formance is small. These values suggest that the price of reprogrammability isrelatively high. [158]

However, an obvious weakness of the study of [158] is that it simply synthe-sizes a design for both ASICs and FPGAs and compares the results. Althoughthe values themselves are solid, omitting such advantages as architecture andresource efficiency skews the results because these advantages have a direct im-pact on the above values. For example, as discussed previously, FPGAs allowoptimizing field multipliers for a fixed field but such optimizations are usuallyout of reach in ASICs. Thus, the difference of FPGAs and ASICs is certainlysmaller in cryptographic applications where the effects of these advantages areoften significant.

3.3 Metrics for Evaluating Implementations

Thus far, speed and performance were used as arbitrary terms and more precisedefinitions are necessary. Ways of measuring speed are diverse. Supposedlythe simplest is to report computation time, t, i.e., the time in seconds requiredfor a single operation, encryption of a single block, for example. Another wayof expressing speed is to provide the number of clock cycles required for asingle operation, referred to as latency1, l. In single clock systems, latency and

1Sometimes latency may also refer to computation time but, in this thesis, latency refersstrictly to the number of clock cycles, except in II where it refers to computation time.

— 31 —

computation time are linked through the following equation:

t =lf

(3.1)

where f is a clock frequency. Latency is usually more informative than computa-tion time because it is independent of the clock frequency. However, latency andcomputation time are not always sufficient for measuring speed because imple-mentations capable of computing several operations simultaneously may havelong latencies and computation times but can still be considered fast. Through-put, T, the number of operations per second, has considerable importance inevaluating cryptographic processors. If a design can process p operations simul-taneously, T, t, and l are related through the equation:

T =pt

=pfl. (3.2)

When designs implementing secret-key cryptographic algorithms are considered,it is a common practice to report the number of bits that the design can processin second. In that case, throughput given by (3.2) is multiplied by the numberof bits processed in one operation, e.g., with the block size.

Objectives vary in designing cryptographic implementations. A commonobjective is computation acceleration where the goal is in minimizing t and/ormaximizing T. Another objective is implementing an algorithm with as fewresources as possible. Supposedly, the most common objective is a combinationof these two objectives; namely, maximization of speed under some constraintson resources. It can be measured with area-latency (or area-computation time,etc.) efficiencies, e.g., with a product of latency and area of an implementation.

3.4 Side-Channel Attacks

Traditional cryptanalysis is out of the scope of this thesis but cryptanalyticmethods called side-channel attacks form a serious threat for many crypto-graphic implementations and a few words about them are thus in order. Side-channel attacks are methods where Eve is trying to reveal secret keys by exploit-ing information leaked by an implementation. Side-channel attacks are based,e.g., on time, power, and electromagnetic measurements and they have beensuccessfully applied, especially, against smart cards [271]. Actually, the mainthreat for practical cryptosystems is not traditional cryptanalysis but rather theweaknesses of implementations [294].

Although smart cards and simple general-purpose processors are generallymore vulnerable for side-channel attacks than hardware implementations, suc-cessful applications of certain attacks have been reported on FPGA implemen-tations as well [271]. The first such attack was demonstrated in [223] wherepower analysis attacks were successfully applied to an ECC implementation.The following discusses certain principal side-channel attacks and their coun-termeasures.

Timing analysis was introduced by Paul C. Kocher in [152]. Timing attacksare based on information about computation time. It can be possible to retrieveinformation about a secret key by measuring computation time variations de-pending on the secret key [152]. A successful practical implementation of theattack was demonstrated for smart cards in [75].

— 32 —

Time

Pow

er

S S S S SM M M

Figure 3.5: Simple artificial example of SPA in revealing a secret exponent in expo-nentiation using binary method; S stands for squaring and M for multiplication.

Simple Power Analysis (SPA) and Differential Power Analysis (DPA) wereintroduced by Kocher et al. in [151] and they exploit information leaked bypower consumption. The idea is that instantaneous power consumption is linkedto an instruction or operation currently being executed. Consider exponenti-ation as an example. Let the exponent be secret as in Diffie-Hellman key ex-change, for example; see Sec. 2.2.1. If an exponentiation is performed withthe binary method (see Sec. 6.3), each bit in an exponent results in a squar-ing operation, whereas multiplications are performed only for bits which areones. It may be possible to reveal the secret exponent by observing instanta-neous power consumption. Fig. 3.5 illustrates SPA by presenting an examplepower trace2. Although both squaring and multiplication require equal amountof time, they are clearly distinguishable because their power traces differ fromeach other. Thus, it is obvious that the bit string used as the exponent inFig. 3.5 is 〈10011〉.

Because both timing analysis and SPA exploit the feature that some opera-tions can be distinguished from each other by time or power consumption, theobvious way to countermeasure these attacks is to make these operations indis-tinguishable. In practice this is done by using side-channel atomic blocks whichconsist of exactly the same operations performed in the same order [60]. Usuallyachieving side-channel atomicity requires that certain dummy (fake) operationsare performed thus reducing performance. It is, however, possible to achieveside-channel atomicity with low overhead as presented for exponentiation andelliptic curves in [60], for example.

SPA operates directly on a single power trace whereas DPA analyzes a large

2The power trace in Fig. 3.5 does not represent any actual measurement as it was artificiallycreated for the sake of exemplification.

— 33 —

number of traces with statistical methods [151]. It is considerably harder toprotect an implementation against DPA because it can distinguish dummy op-erations from real ones if there is enough data available and the power model ofthe implementation is good enough. Protecting implementations against DPAand other statistical attacks is generally very hard but attacking can be madeconsiderably more difficult by time randomization, noise addition, masking, orwith dynamic and differential logic styles [271].

Side-channel attacks can be based on electromagnetic emanations, as well.They are referred to as electromagnetic side-channel attacks and they, too, havesimple and differential methods similar to SPA and DPA. Electromagnetic side-channel attacks were adapted to AES in FPGAs in [49].

Although side-channel attacks form a serious threat in many applications,hardware implementations are usually much harder to attack than software im-plementations [271] and FPGAs are sometimes even harder to attack than ASICimplementations [294]. In general, provably secure implementations cannot bemade. Thus, the designer’s role is to evaluate possible threats and implementcountermeasures that measure up with them. As the implementations of thisthesis mostly target to high-performance applications where the risk that anattacker is able to measure, e.g., power consumption is low, countermeasuresagainst side-channel attacks have not been considered by any great extent inthe publications.

Development of side-channel attacks has been intensive during the past fewyears. The most important forum for developments in side-channel attacks isprobably the Workshop on Cryptographic Hardware and Embedded Systems(CHES) and, thus, interested readers should consult the proceedings of CHESfor further information. Various attacks are discussed also in [10]. Compre-hensive reviews of power analysis attacks and their countermeasures are givenin [180, 226].

— 34 —

Chapter 4

Finite Fields

This chapter presents preliminaries of finite fields which have importancein many cryptographic algorithms. Both AES and ECC use finite field

arithmetic and, thus, finite fields are considered before discussing the algorithms.The efficiency of arithmetic in an underlying finite field greatly determines theefficiency of an entire cryptosystem and implementation techniques are thusdiscussed in the end of this chapter.

Finite fields are commonly referred to as Galois fields in order to honor thecontributions of Evariste Galois, a French mathematician, who lived in the early19th century. His work includes some of the most fundamental results of thetheory of finite fields. Finite fields have importance in many branches of math-ematics. In addition to their extensive use in cryptography, finite fields havepractical applications also in coding theory; see, e.g., [171]. Because of thesepractical applications, finite fields and, especially, their implementations havebeen studied also by computer scientists and electrical engineers. Finite fieldsare extensively studied and an enormous amount of literature exists. Hence, it isimpossible to provide an all-inclusive survey. The following study concentrateson the most relevant topics for implementations of cryptographic algorithmswith the emphasis being on issues related to efficient hardware implementation.

The following discussion is twofold: First, a more theoretical descriptionis given and, second, practical implementation related aspects are discussed.The basic properties of algebraic structures of groups, rings, fields, and vectorspaces are introduced in Sec. 4.1. Two types of finite fields primarily usedin cryptographic algorithms, namely prime and binary fields, are consideredin Secs. 4.2 and 4.3, respectively. Sec. 4.4 discusses the most complex fieldoperation, inversion. Implementation related aspects are discussed in Sec. 4.5.

4.1 Preliminaries

This section presents preliminaries of algebraic structures: groups, rings, fields,and vector spaces. The discussion is based on comprehensive presentationsavailable in [85, 122, 189].

— 35 —

A set G and a binary operation ∗ form a group (G, ∗) if they satisfy thefollowing axioms

1. The operation is associative, i.e., a ∗ (b ∗ c) = (a ∗ b) ∗ c for a, b, c ∈ G.

2. There exists an identity element e ∈ G such that e ∗ a = a ∗ e = a for alla ∈ G.

3. Every element a ∈ G has an inverse element b ∈ G such that a∗b = b∗a =e.

Furthermore, the group (G, ∗) is said to be Abelian (or commutative) if it sat-isfies

4. a ∗ b = b ∗ a for all a, b ∈ G.

If the group operation is multiplication ×, the group (G,×) is said to be amultiplicative group. In that case, the identity element is denoted by 1 and theinverse element by a−1. If the group operation is addition +, the group (G,+)is an additive group, and the identity element is 0 and the inverse element is−a.

The order of the group, ord(G), is the number of elements in the set G. Iford(G) is finite, the group G is finite. The order of an element a ∈ G, denotedby ord(a), is the smallest positive integer, n, for which an = e. If there is anelement a ∈ G such that for each b ∈ G there is an integer i for which b = ai,the group G is cyclic and a is a generator of G.

A ring (R,+,×) consists of the set R and two binary operations + (addition)and × (multiplication), and it satisfies the following axioms

1. (R,+) is an Abelian group with an identity element 0.

2. The operation × is associative, i.e., a×(b×c) = (a×b)×c for all a, b, c ∈ R.

3. There is a multiplicative identity element 1 such that 1 6= 0 with theproperty that 1× a = a× 1 = a for all a ∈ R.

4. The operation × is distributive over the operation +, i.e., a × (b + c) =(a× b) + (a× c) and (b+ c)× a = (b× a) + (c× a) for all a, b, c ∈ R.

Furthermore, the ring (R,+,×) is a commutative ring if it satisfies

5. a× b = b× a for all a, b ∈ R.

An element a ∈ (R,+,×) is an invertible element if there exists an elementb ∈ (R,+,×) such that a× b = 1.

A field, henceforth denoted by F, is a commutative ring in which all nonzeroelements are invertible. A field with q elements is said to be finite if q is finite,and a finite field is denoted1 by Fq. Again, the order of Fq, ord(Fq), is thenumber of elements in Fq, and Fq exists if and only if q is a prime or a powerof a prime, i.e., q = pm where m is a positive integer. If m = 1, Fp is calleda prime field and, if m ≥ 2, Fpm is called an extension field. Extension fields,where p = 2, i.e., F2m , are called binary fields (or fields with characteristictwo), and they are commonly deployed in cryptosystems. There is essentially

1The notation GF (q) (Galois field) is also commonly used in the literature and in I–III.

— 36 —

only one finite field with an order q and therefore two fields Fq and F′q arestructurally the same and only the labeling of elements is different, i.e., Fq andF′q are isomorphic.

By definition, Fq has two operations: addition and multiplication. Subtrac-tion in Fq is defined through addition so that a− b = a+ (−b) for all a, b ∈ Fqso that −b is a unique element with the property: b + (−b) = 0. Similarly,division is defined for all a ∈ Fq and b ∈ Fq\{0} using multiplication so thata/b = a×b−1 where b−1, the inverse of b, is unique and fulfills b×b−1 = 1. Here-after, multiplication is denoted by juxtaposition and × is omitted. The nonzeroelements in Fq, i.e., the elements in Fq\{0}, form a multiplicative group of Fq,denoted by F∗q , which is cyclic with ord(F∗q) = q − 1. Hence,

aq = a (4.1)

for all a ∈ Fq, also known as Fermat’s Little Theorem: ap ≡ a (mod p).A vector space V over a field F is an additive Abelian group (V,+) with a

multiplication, F × V → V , if the following axioms are satisfied for all a, b ∈ Fand c, d ∈ V .

1. a(c+ d) = ac+ ad.

2. (a+ b)c = ac+ bc.

3. (ab)c = a(bc).

4. 1c = c.

The elements of V are called vectors and the elements of F are scalars. Thefinite field Fpm can be considered as a vector space over Fp. Thus, the elementsof Fpm are vectors and the elements of Fp are scalars. The dimension of thevector space is m and there are many ways to select a basis.

4.2 Prime Fields

Let p be a prime. Integers modulo p form the set {0, 1, 2, . . . , p−2, p−1} whichconstructs a finite field with addition and multiplication modulo p as discussedabove. The field is denoted by Fp and called prime field.

Operations in Fp are computed by using normal integer arithmetic and byreducing the result modulo p. The result is thus the unique integer remainderobtained with a division by p. Consider the prime field F17, for example. Theelements in F17 are {0, 1, 2, . . . , 16}. An addition 12 + 7 = 2 because the re-mainder of 19 divided by 17 is 2. Similarly, 2− 16 = 3 and 5× 12 = 9 because−14 ≡ 3 (mod 17) and 60 ≡ 9 (mod 17), respectively. Furthermore, 5−1 = 7because 5× 7 ≡ 1 (mod 17).

Modular reductions using divisions of large integers are expensive. If p isfixed or changed infrequently (which usually is the case in cryptography), modu-lar reductions can be performed more efficiently with a series of multiplications,additions, and shifts by using precomputed data [22]. This technique is knownas the Barrett’s reduction.

A widely-used method circumventing the problem of expensive reductionsfor arbitrary p was introduced by Peter L. Montgomery in [196]. It uses Mont-gomery representation for integers. The Montgomery representation of a ∈

— 37 —

[0, p − 1] is aR mod p where R > p such that gcd(R, p) = 1. The value,R = 2dlog2 pe, where d·e denotes rounding to the closest larger integer, is com-monly used in practice. For b ∈ [0, Rp − 1], Montgomery reduction is givenby bR−1 mod p. When R is a power of the radix, the Montgomery reductionrequires only multiplications, additions, and shifts. The conversion from theMontgomery representation to conventional integer representation is performedwith the Montgomery reduction. The Montgomery representation does not giveany computational advantage for a single multiplication. However, as multipli-cations in the Montgomery representation are cheap compared to conventionalmodular multiplications, serious enhancements in computational requirementsoccur for several multiplications performed in the Montgomery representation.Hence, Montgomery representation is useful, e.g., in modular exponentiation(Diffie-Hellman, RSA, and ElGamal) and ECC over Fp.

Consider multiplication 5 × 12 mod 17. Let R = 25 = 32. Hence, R−1 ≡ 8(mod 17). The Montgomery representations are 7 and 10 for 5 and 12, re-spectively. An integer multiplication gives 7 × 10 = 70 and the Montgomeryreduction is 70 × 8 = 16 (mod 17) which is cheap to compute. The result ofthe multiplication is finally given by the Montgomery reduction: 16 × 8 = 9(mod 17).

Certain reduction methods are developed for primes of special forms. Re-ductions for primes with the form p = 2t − c where c is small (fits into oneword) can be computed with additions and two multiplications by c [70]. Thistechnique is known as the Crandall reduction. It is commonly known that if pis a Mersenne number, i.e., p = 2t − 1, modular reductions can be computedwith a single addition (the most significant half plus the least significant half).Unfortunately, Mersenne primes (primes which are Mersenne numbers) are rare.Generalized Mersenne primes [265] consist of a small number of powers of two,e.g., p = 2192 − 264 − 1. Modular reduction with generalized Mersenne primesare inexpensive (only additions and subtractions) [265]. The work of [265] wasrecently generalized for even wider class of primes with some reductions in per-formance [64]. Many practical cryptosystems use primes of special form. Forinstance, all primes listed by NIST in [205] are (generalized) Mersenne primesand [52] includes primes for which Crandall reduction can be applied.

4.3 Binary Fields

The elements of F2m are represented as a vector space over F2 with respectto a basis. As mentioned, there are many alternatives for the basis, and allrepresentations with different bases are isomorphic. Because the two elementsof F2 can be represented with a bit, m bits are required in total to representelements of F2m . Addition of two elements in F2m is simply performed coefficientwise, i.e., bitwise, but multiplication requires information about dependenciesbetween the elements. [85]

F2 has two elements: {0, 1}. Addition is computed modulo 2 and thus0 + 0 = 0, 0 + 1 = 1, and 1 + 1 = 0, i.e., addition is a logical XOR operationdenoted by ⊕. Multiplication is 0 × 0 = 0, 0 × 1 = 0, and 1 × 1 = 1, whichis computed by using a logical AND operation denoted by ⊗. The absenceof carry makes binary fields suitable from implementation point-of-view, andbinary fields are usually considerably faster than prime fields.

— 38 —

4.3.1 Polynomial Basis

Let f(x) be a polynomial whose coefficients are elements of F, i.e.,

f(x) =∞∑i=0

fixi (4.2)

where fi ∈ F. The ring of polynomials over F is denoted by F[x]. Let deg(f(x))denote the degree of f(x), i.e., the largest integer m for which fi 6= 0. Apolynomial f(x) is an irreducible polynomial over F if it cannot be presentedas a product of two polynomials with positive degrees [189]. Henceforth, anirreducible polynomial is denoted by p(x).

Let p(x) be an irreducible polynomial over F2 with deg(p(x)) = m given by

p(x) = xm +m−1∑i=0

pixi = xm + p′(x) (4.3)

where pi ∈ F2. Then, the binary field F2m can be constructed by setting

F2m : F2[x]/p(x), (4.4)

and the elements of F2m can be represented with the basis {1, x, x2, . . . , xm−1}.From now on, such basis is referred to as polynomial basis2. An element a(x) ∈F2m is represented with a polynomial basis as follows:

a(x) =m−1∑i=0

aixi (4.5)

where ai ∈ F2. In order to simplify, a bitvector representation is commonlyused so that a(x) = 〈am−1am−2 . . . a1a0〉. The identity element of addition, 0,is 〈00 . . . 00〉 and the identity element of multiplication, 1, is 〈00 . . . 01〉. [83, 122]

Addition and multiplication are performed modulo p(x). An addition a(x)+b(x) where a(x), b(x) ∈ F2m is simply given by

a(x) + b(x) =m−1∑i=0

(ai ⊕ bi)xi . (4.6)

Thus, addition is a bitwise XOR of the bitvectors representing a(x) and b(x).Subtraction is the same because XOR is its own inverse operation. [122]

Multiplication a(x)b(x) where a(x), b(x) ∈ F2m is more complex. First,a(x) and b(x) are multiplied by using ordinary polynomial multiplication and,second, the result is given by taking the remainder after polynomial division byp(x). The result of the ordinary polynomial multiplication, denoted by c(x) =∑2m−2i=0 cix

i, is a polynomial with deg(c(x)) < 2m − 1. The coefficients of c(x)are given by

ci =i∑

k=0

ak ⊗ bi−k . (4.7)

The number of ANDs can be reduced at the expense of more XORs by adaptingtechniques presented in [143], called Karatsuba multiplication. [122]

2Also called standard or canonical basis

— 39 —

Finally, the multiplication result is obtained by computing

a(x)b(x) = c(x) mod p(x). (4.8)

For p(x) given by (4.3), the computation of (4.8) is based on the followingcongruence: [122]

c(x) = c2m−2x2m−2 + . . .+ c1x

1 + c0

≡ (c2m−2xm−2 + . . .+ cm)p′(x)

+ cm−1xm−1 + . . .+ c1x+ c0 (mod p(x))

(4.9)

which is applied one bit at a time starting from the msb. The procedure iscontinued while deg(c(x)) ≥ m. Fixing p(x) drastically reduces computationalcomplexity, especially, if p(x) is sparse, i.e., only few pi are ones in (4.3). Hence,trinomials (three nonzero terms) or pentanomials (five nonzero terms) are oftenused in practical cryptosystems. [122]

A special case of multiplication, where a(x) = b(x), is called squaring anddenoted by a2(x). Squaring is computationally cheaper than multiplicationbecause c(x) is simply given by

c(x) =m−1∑i=0

aix2i, (4.10)

i.e., c(x) is obtained by inserting zeros between each bit in the bitvector rep-resenting a(x) [122]. The result of squaring is obtained by computing (4.8).Complexities of squarings are discussed in more detail in [296], for example.

Consider the field F24 as an example. The polynomial p(x) = x4 + x + 1is irreducible over F2 and, thus, F24 can be constructed as F2[x]/p(x). Now,consider two elements of F24 , a(x) = 〈0110〉 = x2 + x and b(x) = 〈1011〉 =x3 +x+1. Then, a(x)+b(x) = 〈1101〉 and a(x)b(x) = 〈1111〉 since (x2 +x)(x3 +x+ 1) = x5 + x4 + x3 + x and (x5 + x4 + x3 + x) mod p(x) = x3 + x2 + x+ 1.The inverse of a(x) is a−1(x) = 〈0111〉 because (x2 + x)(x2 + x + 1) = x4 + xand (x4 + x) mod p(x) = 1.

Let p1(x) be an irreducible polynomial over F2 with deg(p1(x)) = m1. Then,F2m1 : F2[x]/p1(x). Let p2(y) be an irreducible polynomial over F2m1 withdeg(p2(y)) = m2. Then, F(2m1 )m2 : F2m1 [y]/p2(y). These fields are calledcomposite fields. Composite fields introduce computational advantages overconventional binary fields, because arithmetic operations in F(2m1 )m2 can becomputed with a series of operations in F2m1 . Naturally, this can be extendedto F(...(2m1 )m2 ...)mn . Composite fields are frequently used, e.g., in hardwareimplementations of AES where computational complexity of inversion in F28

reduces considerably with composite field representations; see Sec. 5.2.2. How-ever, ECC over composite fields has vulnerabilities against certain cryptanalyticattacks [263]. Prior to finding these vulnerabilities, and even afterwards, theyhave been used in implementations for efficiency reasons.

Because reductions are expensive also in F2m , their computation requiresattention. The idea of the Barrett’s reduction can be applied also for F2m overpolynomial basis [74]. An adaption of the Montgomery multiplication for F2m

was introduced in [149] and improved in [298]. Furthermore, analogously tousing primes of special forms, the use of special form p(x), e.g., trinomials,

— 40 —

pentanomials, all-one polynomials, or equally-spaced polynomials, offers con-siderably faster reductions than general irreducible polynomials as discussedabove.

4.3.2 Normal Basis

A normal basis is constructed by using a normal element over F2. The elementα ∈ F2m is a normal element if α, α2, . . . , α2m−1 are linearly independent overF2. Then, {α, α2, α22

, . . . , α2m−1} is a normal basis of F2m over F2. It iscommonly known that a normal basis can be found for all F2m [202]. An elementa ∈ F2m can be represented by using the basis as follows:

a =m−1∑i=0

aiα2i (4.11)

where ai ∈ F2. A bitvector notation is used also for normal basis. The identityelement of addition, 0, is 〈00 . . . 00〉 and the identity element of multiplication,1, is 〈11 . . . 11〉. [85]

Addition a+b where a, b ∈ F2m is essentially the same as in polynomial basisand it is given by

a+ b =m−1∑i=0

(ai ⊕ bi)α2i. (4.12)

Normal bases are attractive mainly because squaring is very efficient withthem. Obviously, squaring (4.11) gives

a2 =m−1∑i=0

aiα2i+1

(4.13)

and (4.1) gives that α2m = α. Thus, squaring is a cyclic shift of the bitvectorrepresenting a, i.e.,

a2 = 〈am−2am−3 . . . a0am−1〉. (4.14)

Multiplication ab, where a, b ∈ F2m , is more involved and it is computedusing a multiplication matrix, TN , with entries, ξi,h, satisfying [83]

α2iα =m−1∑h=0

ξi,hα2h so that α2iα2j =

m−1∑h=0

ξi−j,h−jα2h . (4.15)

Then the bits of c = ab are given by [83]

ch =m−1∑i=0

m−1∑j=0

ξi−j,h−j ⊗ ai ⊗ bj . (4.16)

The number of ones, dN , in TN defines the complexity of multiplication,and it is bounded by 2m − 1 ≤ dN ≤ m2 [202]. If dN is on the lower bound,the basis is said to be an Optimal Normal Basis (ONB). There are two typesof ONBs, referred to as Type I and Type II ONBs [202]. An ONB does notexist for all m. In fact, the density is about 23 % for m < 1200 [202]. Forexample, an ONB does not exist for F2163 and, as a consequence, the basis used

— 41 —

in IV–VI, X, and XI is not an ONB. A comprehensive evaluation of the lowestcomplexities achievable with different m is provided in [183]. A low complexitynormal basis, called Gaussian Normal Basis (GNB), can be constructed if anONB does not exist [14]. GNBs have been included in many ECC standards,such as [8, 132, 205], and the basis used in the above mentioned publications ofthis thesis is a GNB.

As shown in (4.16), the leftmost bit of c = ab, cm−1, can be computed froma and b with a 2m-to-1-bit logic function as cm−1 = F (a, b). The function iscalled F -function. Because squaring is a cyclic shift, the leftmost bit of thesquaring, c2, is cm−2. On the other hand, the leftmost bit is again given by theF -function as cm−2 = F (a2, b2). Following this down to c0 leads to the followingformula:

c = ab =m−1∑i=0

F (a2m−1−i, b2

m−1−i)α2i (4.17)

which defines the Massey-Omura multiplier. [217, 285]

4.4 Inversion

Inversion, i.e., given an element a ∈ Fq finding an element a−1 ∈ Fq such thataa−1 = 1, is much more involved than addition or multiplication. Moreover,it is commonly required in practical applications of finite fields and, hence,its efficient implementation is important. There are essentially two ways tocompute an inversion in Fq: extended Euclidean algorithm and Fermat’s LittleTheorem.

Extended Euclidean algorithm is an extension of the Euclidean greatest com-mon divisor algorithm which, in addition to finding gcd(a, b), also gives c and dsatisfying

ac+ bd = gcd(a, b) . (4.18)

If b is a prime and a < b or if b is irreducible with deg(b) = m and deg(a) < m,gcd(a, b) = 1. Thus, by setting b = p, the algorithm returns c such that ac ≡ 1(mod p), i.e., c = a−1 in Fp. Analogously, if b = p(x), then c(x) = a−1(x) inF2m . [122]

The standard version of the extended Euclidean algorithm requires severalinteger (for Fp) or polynomial (for F2m) divisions, which can be costly. A binarygcd algorithm which requires only additions, shifts (divisions by 2), and paritytests was introduced in [273]. An algorithm for binary inversion in F2m wasgiven in [30]. An algorithm, called almost inverse algorithm, which returnsa weighted inverse, xta−1(x), was presented in [260]. Algorithms for efficientinversions in the Montgomery representation were given in [142, 256]. Normally,division b/a is computed by multiplying b with a−1, but modifications of theabove algorithms performing a direct division have been given in [54, 106], forexample.

Inversion based on Fermat’s Little Theorem uses consecutive squarings andmultiplications in Fq. It follows directly from (4.1) that for a ∈ Fq

a−1 = aq−2 . (4.19)

— 42 —

Hence, inversion in F2m is given by a−1 = a2m−2. Optimizing this exponen-tiation returns to the theory of addition chains3. In exponentiation, the ad-dition chain consists of multiplications and squarings required to exponentiatea. A straightforward application of the binary method results in an additionchain with the cost of m− 2 multiplications and m− 1 squarings [285], because2m − 2 = 〈111 . . . 1110〉2 (the length is m bits). Because m is known a pri-ori, addition chains with lower costs than binary method can be searched forexponentiation. Efficient addition chains proposed in [136], called Itoh-Tsujiiinversion, reduce the cost to blog2(m− 1)c+H(m− 1)− 1 multiplications andm− 1 squarings, where b·c denotes rounding to the closest smaller integer andH(·) is the Hamming weight, the number of ones in the binary representation.The Itoh-Tsujii inversion is efficient, especially, in normal basis because squar-ings are almost free [136], but it can be efficiently used in polynomial basis,too [113, 295].

Because inversions are typically more expensive than other field operations,it is of interest to trade inversions to other operations. A method, known asMontgomery’s trick [197], trades inversions to multiplications. It is based on theobservation: a−1 = b(ab)−1 and b−1 = a(ab)−1, which is generalized for n inver-sions resulting in a cost of one inversion and 3(n− 1) multiplications [197, 261].Montgomery’s trick is a general method, but some specific methods for tradinginversions to other operations are presented in Secs. 5.2.2 and 6.2, which discussAES S-box optimization with composite fields and elliptic curve arithmetic inprojective coordinates, respectively.

4.5 Notes on Implementations

The following presents certain notes on implementation related aspects. Thisdiscussion is not all-inclusive because of the superfluity of papers discussingefficient implementation of finite fields in various applications. However, thefollowing points give a brief overview of issues related to implementations offinite field arithmetic and present the most commonly used multiplier struc-tures. The discussion focuses on F2m and multiplication because Fp is not usedin the publications of this thesis and multiplication dominates in both compu-tation time and resource requirements of practical cryptosystems. The subjectis revisited in Sec. 6.5 which discusses ECC processors.

A bit-serial multiplier computes one bit of the product of two elements ofF2m in one clock cycle, thus, requiring m clock cycles in total. A bit-parallelmultiplier computes all bits of the product simultaneously in one clock cycle.Obviously, the bit-parallel multiplier consumes a lot of area while the bit-serialmultiplier is slow. A digit-serial multiplier is a tradeoff between these two ex-tremes and it computes D bits of the product in one clock cycle resulting indm/De clock cycles for an entire multiplication. As a result of the rounding, onlycertain values of D reduce latency and only they should be considered for use inpractical applications in order to prevent unnecessary use of area. For instance,D should be chosen from the set {1− 15, 17, 19, 21, 24, 28, 33, 41, 55, 82, 163} forF2163 used in all ECC publications of this thesis. Furthermore, there existsa value of D optimizing the latency-area product of a multiplier. This value

3An addition chain is the chain of additions required to get a starting from 1 by using onlyvalues that have appeared previously in the chain.

— 43 —

is usually small (see IV) which challenges practical feasibilities of digit-serialmultipliers with large D and, especially, bit-parallel multipliers.

A multiplier is said to be fixed if it supports only one specific field and theability to support several fields is called flexibility. As a rule of thumb, themore flexible a multiplier is the slower it is and the more area it requires. Anextreme example is squaring in polynomial basis which can be computed inone clock cycle with a fixed (sparse) irreducible polynomial but requires severalclock cycles with an arbitrary irreducible polynomial.

A classical bit-parallel multiplier for polynomial basis was presented in [181,182], and it is nowadays referred to as the Mastrovito multiplier after its in-vertor. A Mastrovito multiplier combines the polynomial multiplication andreduction phases to a single matrix multiplication. An optimized Mastrovitomultiplier for trinomials was developed in [277] which was later generalized forarbitrary irreducible polynomials in [117]. A systematic approach for designingMastrovito multipliers was presented in [305]. A disadvantage of Mastrovitomultipliers is that flexible multipliers cannot be designed. Other bit-parallelmultipliers have been presented, e.g., in [233, 235].

Because m is usually large in public-key cryptography applications, bit-serialor digit-serial multipliers are commonly preferred in practice. Ordinary polyno-mial multiplication, as given by (4.7), and Karatsuba multiplication [109, 143,283] can be implemented in both bit-serial and digit-serial fashion. Complex-ity of reduction has been studied with various types of irreducible polynomialsin [295, 297]. The following multipliers interleave polynomial multiplication andreduction steps but still incorporate different circuitries for both of them con-trary to the Mastrovito multiplier. A multiplier, called super-serial multiplier,suitable for applications where resources are very scarce was presented in [219]and it allows trading off resources to even longer latency than that of the bit-serial multiplier. Digit-serial multipliers for polynomial basis were introducedin [269]. They were improved in [156] by introducing new multiplier architec-tures with different numbers of accumulators in order to minimize the criticalpath and by studying their optimality in terms of speed and area. A digit-serialmultiplier utilizing LUTs suitable for software implementations was introducedin [123], but it can be efficiently implemented in hardware as well [123, 177]. Itwas used in VI and VII.

If the reduction step is interleaved with polynomial multiplication, it canbe computed very efficiently even with arbitrary irreducible polynomials by us-ing the so-called partial reduction techniques [115] where a result of ordinarypolynomial multiplication is reduced only to a congruent polynomial with adegree higher than m. Partial reduction embedded in an efficient digit-serialmultiplier can provide support for arbitrary irreducible polynomials (up to cer-tain m) with half the speed compared to fixed polynomials [87]. Combinationof specific reduction circuitries for a few fixed irreducible polynomials and cir-cuitry for arbitrary irreducible polynomials provides flexible multipliers withfast computation for frequently used fields [87, 114, 115].

A number of papers have discussed multiplication in normal basis. A classicalmultiplier structure for normal basis is the Massey-Omura multiplier describedin [217, 285]. Normal basis multiplication is easy to parallelize because thesame F -function is used for all bits of the result, as shown in (4.17). Bit-serial architectures have been presented in [1, 217, 285], digit-serial multipliersin [232, 234], and bit-parallel multipliers in [278, 285], to name a few. Several

— 44 —

multipliers exploit specific types of normal basis. A multiplier structure for ONBwas presented in [1]. A bit-parallel multiplier, called Sunar-Koc multiplier, thatis smaller than the bit-parallel Massey-Omura multiplier was given for Type IIONBs in [278]. Multiplier architectures specifically for GNBs were presentedin [159, 236].

Two bit-parallel polynomial basis multipliers, based on ordinary polyno-mial multiplication and modified Karatsuba method [109, 143], and normalbasis multipliers, digit-serial (D = 117) Massey-Omura and bit-parallel Sunar-Koc multipliers, were compared in [109] on a Xilinx Virtex-II FPGA in thecase of F2233 for which exists a Type II ONB. They concluded that their mod-ified Karatsuba multiplier outperforms ordinary polynomial multiplier. TheMassey-Omura and Sunar-Koc multipliers had similar time-area products, butSunar-Koc was faster. Sunar-Koc was also faster than both polynomial basismultipliers, but Karatsuba had clearly the best time-area product.

Polynomial basis is superior over normal basis in software; see, e.g., [121,213]. Both bases result in efficient implementations in hardware and, thus,normal basis is a viable choice especially in applications where the numberof squarings is high, such as ECC with Koblitz curves. Commonly supportfor normal basis is implemented in software by utilizing the isomorphism ofdifferent representations of Fq so that elements are converted to polynomial basiswhere computations are carried out followed by a conversion back to normalbasis. However, such conversions can be costly and direct implementation canbe more efficient in certain applications [213] and software implementations ofnormal basis have been proposed, such as [213, 232, 242]. Conversions betweennormal basis and polynomial basis are very efficient with Type I ONB and all-one polynomials which enables efficient implementations as shown in [150, 166].However, Type I ONBs are not practical in ECC because they can exist only ifm is even [202], although it should be a prime for security reasons; see, e.g., [99].

Because many standards in (elliptic curve) cryptography, e.g., [51, 52, 205],define both prime and binary fields, an implementation must support both Fpand F2m in order to cover all standardized fields. Hence, processors combin-ing support for Fp and F2m have attained interest. The first such processorswere presented in [106, 257] which presented unified Montgomery multiplierssupporting both prime and binary fields with arbitrary field sizes. The workof [106] was later extended in [107]. Another multiplier for Fp and F2m basedon Montgomery multiplication was given in [255]. A multiplier which does notuse Montgomery representation was presented in [111]. A low-power arithmeticunit supporting addition, multiplication, and extended Euclidean algorithm inFp and F2m was presented in [291] and instruction set extensions for a 32-bitRISC (Reduced Instruction Set Computer) processor supporting both Fp andF2m were developed in [112]. A multiplier supporting both Fp and F2m typicallyrequires only slightly more area than an Fp multiplier and, in fact, the supportfor F2m can be added to an Fp multiplier with no additional area cost [255].In all cases, F2m operations clearly outperform Fp operations with comparablefield sizes.

As mentioned, inversion is the most complex field operation. Hence, its fastcomputation is important in applications where inversions are frequently used.Inversion algorithms based on extended Euclidean algorithm are preferred insoftware implementations, because they usually result in slightly faster results.However, Fermat’s Little Theorem computes inversions without an additional

— 45 —

inverter circuitry, namely with successive multiplications and squarings, whichmakes it a feasible alternative in hardware implementations.

To summarize, an enormous amount of work has been done on realizing finitefield arithmetic and many architectures are available having merits in variousdifferent target applications. The research field can be considered mature andthis thesis does not introduce any major contributions to this field. A scalablemultiplier that can be easily generated with an automatic generator and thatsuits for the 4-to-1-bit LUT structure of Xilinx FPGAs was, however, introducedin III. Otherwise, the contributions of this thesis are on higher algorithm levelsthan field arithmetic and they are discussed in the following chapters.

— 46 —

Chapter 5

Advanced Encryption Standard

This chapter studies AES from the hardware implementation point-of-viewand provides an extensive review of implementation methods developed for

AES. The objective is to review the state-of-the-art of AES implementations andto point out the relationship of I and II to other studies. The focus is againon FPGAs, but also ASIC implementations are discussed in detail in order toprovide an inclusive view of the state of hardware implementation of AES andbecause many techniques used in FPGAs have been first introduced for ASICs.Software implementations are not discussed in the following description but, tothe author’s knowledge, the fastest reported software has been written by HelgerLipmaa in assembly and it achieves throughput of over 1.5 Gbps on a 3.2 GHzPentium 4 [172]. This value is a useful benchmark for hardware implementa-tions. It should be noted, however, that slower hardware implementations canalso be useful because low price or power consumption may be required whichtypically cannot be achieved with general-purpose microprocessors.

AES algorithm is presented in Sec. 5.1. Different implementation techniquespresented for AES are discussed in Sec. 5.2 and Sec. 5.3 reviews and comparespublished ASIC and FPGA implementations of AES.

5.1 Description of AES

The following description of AES is based on [71, 206]. AES has a block size of128 bits and there are three options for the key size: 128, 192, or 256 bits. AES-128, AES-192, and AES-256 henceforth refer to AES and the key size. AES isan iterative SPN and the number of rounds, Nr, depends on the key size sothat Nr = 10 for AES-128, Nr = 12 for AES-192, and Nr = 14 for AES-256.

AES operates on an intermediate result called the State. At the beginningof encryption, the 128-bit M is interpreted as a two dimensional table or amatrix with bytes as elements where the first byte of M is s0,0 and the last oneis s3,3. The rows and columns of the matrix are referred to as rows and columnsof the State. AES involves four different transformations which are all appliedto the State. Separate round keys are used in different rounds and they arederived in a key schedule. Once all rounds have been processed, C is build up

— 47 —

so that s0,0 becomes the first byte of C and s3,3 becomes the last one. Alg. 5.1shows the outline of the AES algorithm. The transformations are considered inthe following.

Algorithm 5.1 Outline of the AES encryptionInput: M (128 bits), K (128, 192 or 256 bits)Output: C (128 bits)1. State←M2. RoundKeys← KeySchedule(K)3. State← AddRoundKey(State,RoundKeys[0])4. for i = 1 to Nr − 1 do5. State← SubBytes(State)6. State← ShiftRows(State)7. State←MixColumns(State)8. State← AddRoundKey(State,RoundKeys[i])9. end for

10. State← SubBytes(State)11. State← ShiftRows(State)12. State← AddRoundKey(State,RoundKeys[Nr])13. C ← State

The AddRoundKey(·, ·) transformation embeds key into the State by per-forming a 128-bit bitwise XOR, i.e.,

AddRoundKey(State,RoundKeys[i]) = State⊕RoundKeys[i]. (5.1)

AddRoundKey(·, ·) is the only transformation in AES involving the key.The SubBytes(·) transformation is a non-linear operation which operates

on each byte of the State independently. A byte is considered to be an elementof the field

F28 : F2[x] / x8 + x4 + x3 + x+ 1 . (5.2)

It is, thus, presented as∑7i=0 bix

i so that b0 is the lsb. SubBytes(·) consistsof two phases:

1. Computation of an inversion in F28 . The element 〈00〉x maps to itself.

2. An affine transformation over F2 defined as

b′i = bi ⊕ b(i+4) mod 8 ⊕ b(i+5) mod 8 ⊕ b(i+6) mod 8 ⊕ b(i+7) mod 8 ⊕ ci (5.3)

where c = 〈63〉x and 0 ≤ i ≤ 7.

The ShiftRows(·) transformation operates on the 32-bit rows of the Stateindependently. Each byte of the row i is shifted cyclically to the left by i byteswith 0 ≤ i ≤ 3.

The MixColumns(·) transformation operates on the 32-bit columns of theState independently. Each column is considered as an element of the finitefield

F(28)4 : F28 [x] / x4 + 1 (5.4)

where F28 is defined by (5.2). The columns are then multiplied with a fixedpolynomial

a(x) = 〈03〉xx3 + 〈01〉xx2 + 〈01〉xx+ 〈02〉x . (5.5)

— 48 —

The KeySchedule(·) routine expands the length of the key, K, to 128(Nr+1) bits. The result is used as RoundKeys[i] in encryption as presented inAlg. 5.1. K is placed into the beginning of RoundKeys and, thus, K formsthe first 128, 192, or 256 bits of RoundKeys depending on the key size. Fromthis point forward, KeySchedule(·) builds RoundKeys one 32-bit word at atime with SubBytes(·), XORs, and shifts.

More detailed descriptions of AES are available in [71, 206].

5.1.1 Decryption

Decryption inverts all transformations performed in encryption. Thus, the orderin which transformations are performed is reversed and the transformationsare replaced by their inverses. However, because the inverse of SubBytes(·)operates byte-wise, it can be computed before the inverse of ShiftRows(·) anda structure similar to Alg. 5.1 can be devised also for decryption [71, 206].

The AddRoundKey(·, ·) is its own inverse and it remains unchanged. Theinverse of the ShiftRows(·) shifts bytes to the right, not to the left, andthe inverse of the SubBytes(·) performs an inverse of the affine transfor-mation followed by an inversion in F28 which is its own inverse operationby definition. The inverse of the MixColumns(·) multiplies a column witha(x)−1 = 〈0b〉xx3 + 〈0d〉xx2 + 〈09〉xx+ 〈0e〉x. The KeySchedule(·) is exactlythe same for both encryption and decryption but the RoundKeys[i] are usedin reversed order. [71, 206]

5.2 Implementation of AES

Although AES is a well-defined standard, there are certain choices that can bemade and both academic and commercial AES implementations typically im-plement only a part of the full AES. For instance, only encryption or decryptionis implemented or support only for 128-bit keys is included in order to minimizearea and increase speed. Many published designs, e.g., [37, 42, 55, 86, 154, 169,201, 309] and II, include only AES-128 encryption. Some implementations,such as [86, 91, 97, 201], do not implement KeySchedule(·) and they requirethat RoundKeys are computed by an external processor or that the key Kis fixed and RoundKeys are precomputed. The support for KeySchedule(·)is referred to as key agility and if an implementation can switch keys withoutadditional delays, it is fully key agile. Achieving full key agility can requirelarge amounts of area, especially, if several blocks are encrypted or decryptedsimultaneously. Implementations supporting the full AES, i.e., all key sizes withkey agility, encryption, and decryption, are presented in [4, 179, 228, 275].

AES can be implemented with basic techniques used in implementing blockciphers [91]. They include unrolling and pipelining. The following discussion isbased on categorizations from [100, 120, 306]. Different unrolling and pipelin-ing techniques are depicted in Fig. 5.1. An iterative architecture implementsonly one round of the algorithm which is then used iteratively. A loop-unrolledarchitecture performs several rounds in one iteration. If all rounds are unrolled,an architecture is referred to as fully unrolled. Naturally, the more rounds areunrolled the more area is required to implement the architecture. Pipelining canbe categorized as outer-round or inner-round pipelining. Outer-round pipelining

— 49 —

Round

(a)

Round

Round

nro

unds

(b)

Round

Round

Round

Round

Nr

rounds

(c)

Round

nro

unds

Round

(d)

Round

nro

unds

Round

(e)

Figure 5.1: Implementation techniques (adapted from [100, 120, 306]): (a) iterative,(b) loop-unrolled, (c) fully unrolled, (d) loop-unrolled with outer-round pipeline, and(e) loop-unrolled with inner-round and outer-round pipeline. The black bars depictregister stages. The number of inner-round pipeline stages may be higher than oneand the number of outer-round pipeline stages is bounded by the number of unrolledrounds.

— 50 —

includes registers between two or several rounds, whereas inner-round pipeliningadds registers inside a round between or in transformations. In either case, thetotal number of pipeline stages determines the number of simultaneous encryp-tions. The highest throughputs are achieved with fully unrolled architecturesutilizing aggressive inner-round pipelining.

If the entire State is operated simultaneously, the width of the data path is128 bits. However, even the iterative architecture with a 128-bit data path maybe too large for certain applications. A technique called folding uses smallerdata path, such as 32 or 8 bits, so that the same circuitry is used for differentparts of the State during several clock cycles. Folding is nontrivial becauseShiftRows(·) complicates accessing bytes of the State and results in a largeswitching circuitry [61]. The penalty in speed can be considerable in foldedarchitectures but, nonetheless, even lower speed is adequate for many consumerapplications where low cost is often essential [104].

Throughput of an AES implementation is given by the following equationwhich is similar to (3.2) but gives the number of bits processed per second

T = 128pfl

(5.6)

where f, l, and p are as defined in Sec. 3.3; hence, p is the number of pipelinestages. In a fully unrolled (and pipelined) design, p = l resulting in T = 128f.

Block ciphers are used in different modes of operations. The simplest modeof operation is Electronic Code Book (ECB) mode where each block is encryptedseparately. Thus, the same plaintext encrypted with the same key always re-sults in the same ciphertext. This can be avoided by using more sophisticatedmodes of operations. They commonly require that the ciphertext of the previ-ous block is available in encryption of the next block which prevents efficientpipelining. However, certain modes of operations, e.g., Cipher Block Chaining(CBC) mode, allow pipelining in decryption but not in encryption whereas cer-tain modes of operations, e.g., counter (CTR) mode, allow pipelining in both.The reader is referred to basic textbooks on cryptography, e.g., [258, 274], forfurther information.

5.2.1 Memory-based Implementations

As discussed in Sec. 5.1, SubBytes(·) operates the State byte-wise. It com-putes an inversion in F28 and performs an affine transformation. These opera-tions can be combined and performed with a single look-up from a precomputedLUT of 256 8-bit elements and such a table is called S-box. This approach isparticularly feasible for software, but also for FPGAs where logic resources canbe saved at the expense of a few embedded memory blocks, e.g., BlockRAMsin Xilinx FPGAs. In total 16 S-boxes are required if the entire SubBytes(·) iscomputed in parallel, and the KeySchedule(·) requires additional 4 S-boxes.This approach is straightforward and it has been successfully used in numerouspapers, such as [62, 97, 185, 248, 272, 304]. However, memory requirementsmay restrict the use of this approach in certain low-cost environments and fullyunrolled designs. There are also certain other issues that require commenting.

Because SubBytes(·) and its inverse differ, efficient implementation of bothencryption and decryption in the same design is not easy. A straightforwardimplementation requires two different LUTs: one for encryption and one for

— 51 —

decryption. This problem can be circumvented with an approach using com-binatorial logic for the affine transformation and its inverse and a LUT onlyfor inversion in F28 [176, 240]. Hence, both encryption and decryption can beimplemented with the same memory requirements than plain encryption or de-cryption with an increase in logic requirements and computation delay. Anotherapproach is to use two ROMs (Read-Only Memories) including precomputedvalues for the S-boxes of SubBytes(·) and its inverse [185]. Encryption anddecryption are then carried out with LUTs implemented in programmable em-bedded memory, e.g., BlockRAMs in Xilinx Virtex FPGAs, whose contents areloaded from one of the ROMs at the beginning [185]. However, switching la-tencies may become a problem if encryption and decryption modes need to beswitched frequently.

SubBytes(·), ShiftRows(·), and MixColumns(·) can be combined intocolumn-wise LUTs, referred to as T-boxes. This approach was first proposed bythe inventors of AES for 32-bit software [71], but it can be efficiently adapted tomemory-rich FPGAs, too, and it has been used at least in [56, 86, 97, 103, 186].The quadruple memory requirements compared to the S-box approach [97] area downside of this method.

Let S(a) denote an S-box. Then, four T-boxes needed in AES encryptionare given as follows [71]:

T0(a) =

S(a)× 〈02〉xS(a)× 〈01〉xS(a)× 〈01〉xS(a)× 〈03〉x

T1(a) =

S(a)× 〈03〉xS(a)× 〈02〉xS(a)× 〈01〉xS(a)× 〈01〉x

T2(a) =

S(a)× 〈01〉xS(a)× 〈03〉xS(a)× 〈02〉xS(a)× 〈01〉x

T3(a) =

S(a)× 〈01〉xS(a)× 〈01〉xS(a)× 〈03〉xS(a)× 〈02〉x

(5.7)

each of which can be implemented with a LUT of 256 32-bit elements. Thevalues of a new column are received by using the bytes of an old column asinputs for the T-boxes and XORing the 32-bit outputs. In T-boxes used indecryption, 〈01〉x, 〈02〉x, and 〈03〉x are replaced by 〈0b〉x, 〈0d〉x, 〈09〉x, and〈0e〉x.

As shown in Alg. 5.1, MixColumns(·) is not computed in the last roundand, hence, an implementation must be capable of computing only SubBytes(·)(or its inverse), i.e., S(a). This is not a major problem in encryption because theT-boxes include plain S-boxes as well (those multiplied by 〈01〉x) [71]. However,in decryption the S-boxes are not available and they must be multiplied out ofthe T-boxes which requires additional logic.

The T-boxes provide faster implementations than the S-boxes, but expect-edly with increased memory requirements [97]. Because modern FPGAs areusually memory-rich, the T-boxes can provide very fast performance with lowlogic requirements as shown in [56, 86, 186] as well as very small logic require-ments [245].

5.2.2 Combinatorial Implementations

Memory must be available in order to implement the previously discussed ap-proaches efficiently. However, it is important to have an efficient method for

— 52 —

implementing SubBytes(·) with combinatorial logic, especially, in ASIC im-plementations. Efficient combinatorial implementation of the S-boxes is impor-tant also in high-speed implementations on FPGAs because throughput can beincreased by pipelining the S-boxes which is not possible with memory-basedimplementations. Straightforward derivations of combinatorial expressions forSubBytes(·), either directly from the truth-table of the S-box or with Fer-mat’s Little Theorem, result in fast but large circuitries [201]. Composite fields,discussed in Sec. 4.3.1, reduce the complexity of combinatorial inversion consid-erably.

Using composite fields in the S-boxes was proposed by Rijmen [237]. Theidea is that although all representations of F28 are isomorphic, computationalcomplexities differ between representations and it is more efficient to computeinversions in a composite field rather than in the field specified by (5.2).

A byte is first mapped with an isomorphism, σ, to

F(24)2 : F24 [y]/y2 + ψy + ω . (5.8)

An inversion for an element, a = a1y + a0 where ai ∈ F24 : F2[y]/y4 + y + 1, isthen computed with the following equation

(a1y + a0)−1 = a1∆y + (a1 + a0ψ)∆ (5.9)

where ∆ = (a21ω+a1a0ψ+a2

0)−1 [237]. Finally, the result is mapped back to theoriginal representation with an isomorphism, σ−1. Hence, an inversion in F28 hasbeen replaced by two isomorphisms and an inversion and some multiplications,squarings, and additions in F24 . [237]

The constants ψ and ω effect both the complexity of operations of (5.9) andmappings, σ and σ−1. The values can be chosen freely as long as y2 + ψy + ωremains irreducible over F24 . If ψ = 1, there are four possible values for ω and,for each of them, there are seven alternatives for σ and σ−1 [247].

The above approach can be used further so that operations of F28 are de-composed into operations of F2 with irreducible trinomials as follows [254]

F((22)2)2 : F(22)2 [y]/y2 + ψ0y + ω0 (5.10)

F(22)2 : F22 [y]/y2 + ψ1y + ω1 (5.11)

F22 : F2[y]/y2 + ψ2y + ω2 . (5.12)

If ψ0,1,2 = 1, then there are eight possible values for ω0, two for ω1, and onefor ω2 [308]. Decompositions beyond F24 can be beneficial for ASIC implemen-tations but they do not reduce the complexity of inversion in typical FPGAimplementations because the 4-to-1-bit LUT structure provides all arithmeticoperations in F24 already with the minimum cost of 4 LUTs [105]. Besides, it issimple to directly derive equations for operations in F24 and they provide moreefficient inversions in F24 than further decompositions [128, 307]. However, thesimplest isomorphisms, σ and σ−1, have been presented by decomposing F28 toF((22)2)2 [48, 308]. Naturally, there are no restrictions for the basis of F24 , andnormal bases have been suggested at least in [48, 275].

The first practical implementation of composite fields was presented in [247]where the composite field F(24)2 was used for all operations of AES, i.e., alsoMixColumns(·) was performed in the composite field. Thus, σ was applied

— 53 —

to the plaintext and the key once in the beginning and the resulted cipher-text was mapped with σ−1 in the end. This was also the approach takenin II. However, it has become evident that it is more efficient for the over-all area consumption and critical path to perform only inversions in compositefields [48, 254, 292, 307, 308]. The reasons are that the simple multiplicationsof MixColumns(·) become much more complex in composite fields and eitherσ or σ−1 can be embedded in affine transformations which reduces the overheadof these mappings. The first implementations using composite fields to computeonly the inversion were presented in [254] using F((22)2)2 and [292] using F(24)2 .

Evaluations of the optimal composite field constructions have been presentedin [48, 190, 308]. The study of [190] optimized the implementation of [254] bystudying effects of different irreducible polynomials to the complexity of theisomorphisms. In [48], various different constructions including normal baseswere studied resulting in the smallest S-box available in the literature. Slightlylarger circuit with shorter critical path was given in [308]. The exact valuesare 91 XORs and 36 ANDs for [48] with a critical path of 23 XORs and 4ANDs and 120 XORs and 35 ANDs for [308] with a critical path shorter by 4XORs. The difference in area consumption is notable with all above mentioneddecompositions compared to straightforward implementations from the truth-table of the S-box which requires about 1500-3000 ASIC gates with differenttechniques [201]. A study on the most energy efficient construction in F((22)2)2

was provided in [200]. The above mentioned complexity studies are not valid forFPGAs where LUT structures change both area requirements and critical pathsfrom the above values as shown for one construction in [309]. However, optimalpipeline constructions in composite fields for FPGAs were studied in [103, 105].

Although SubBytes(·) dominates in area consumption and computationtime, MixColumns(·) also deserves attention, especially, when combined withits inverse; see, e.g., [98, 129, 254, 306]. A basic implementation technique,which was described already in the standard [206], builds around multiplica-tions by 〈02〉x which can be implemented with shifts and 4 XORs [306]. BothMixColumns(·) and its inverse can be implemented with these multiplica-tions and additions (XORs). When both encryption and decryption are im-plemented, area can be reduced by sharing multiplications by 〈02〉x [307] orby decomposing MixColumns(·) out of its inverse [71]. The latter approachutilizes the fact that the polynomial used in multiplication in the inverse ofMixColumns(·), a−1(x), can be expressed by using a(x) of MixColumns(·),(5.5), as a−1(x) = (〈04〉xx2+〈05〉x)a(x) (mod x4+1). As a consequence, the in-verse of MixColumns(·) is computed by first applying multiplications by 〈04〉xand 〈05〉x followed by MixColumns(·).

5.3 Literature Review of AES Implementations

This literature review complements the review presented in I. This reviewsurveys progress that has took place after the writing of I and considers alsopublished ASIC implementations which were not discussed in I. This reviewalso updates the previous ones presented in [118, 120, 293] by including im-plementations which appeared after their publication. The scope is also widercompared to [120], which focused only on low-cost implementations. Details ofimplementations are collected into Table 5.1. Selected implementations from

— 54 —

I are also discussed in order to avoid discontinuations, and in particular in-cluded in Table 5.1 for reader’s convenience. If a publication presents severalimplementations, only the ones which are the most comparable with other imple-mentations in the literature and, especially, with II were selected. CommercialAES implementations were not included in this review because only limited andmarketing-oriented information is available.

5.3.1 ASIC Implementations

This review begins with ASIC implementations. The first studies on ASIC-basedimplementation of AES (Rijndael) were published during the AES selectionprocess by Ichikawa et al. [131] and Weeks et al. [288]. Rijndael was found to besuperior compared to other algorithms in [131] whereas the study of [288] refusedto put algorithms in any particular order, but the values seem to support bothRijndael and Serpent. Rijndael was found superior to Serpent in a study [176]that appeared after the decision of AES.

A multitude of ASIC implementations have been published since the se-lection of Rijndael as AES. Most of them are iterative or loop-unrolled archi-tectures with 128-bit data paths achieving throughputs ranging from severalhundred Mbps to Gbps level. Such architectures have been presented, for ex-ample, by Alam et al. [4], Hsiao et al. [128], Kosaraju et al. [153], Kuo andVerbauwhede [157], Lutz et al. [176], Mangard et al. [179], and Su et al. [275].Important aspects that have attained interest in the papers are methods tocombine encryption and decryption with key agility and support for differentkey sizes. Alam et al. [4] presented an ASIC-based implementation support-ing the full AES (all key sizes with encryption and decryption) and additionalplaintext sizes of 192 and 256 bits. They used composite fields in the imple-mentation. Hsiao et al. [128] optimized logic functions of all transformations ofAES and described an ASIC implementation supporting encryption and decryp-tion with key agility. Kosaraju et al. [153] presented a straightforward ASICimplementation which achieves full key agility. Kuo and Verbauwhede [157]presented a fully key agile implementation of the full AES. Their design usesS-boxes implemented with combinatorial logic without composite fields and thedata path is 256 bits. Special attention was given for an efficient implemen-tation of KeySchedule(·). The implementation of [157] was later improvedin [282] which also studied power consumption. Lutz et al. [176] compared im-plementations of Rijndael and Serpent as mentioned above. Mangard et al. [179]presented a highly regular and scalable iterative ASIC implementation for bothencryption and decryption with full key agility and support for the CBC modeof operation. Su et al. [275] implemented the full AES with key agility by uti-lizing composite field F(24)2 with a normal basis. Next, two interesting specialcases, very low cost and high throughput implementations, are discussed.

Low cost, meaning low area and/or power consumption, is required in manyapplications, including smart cards, home consumer electronics, and mobile de-vices, and there is an obvious need for minimizing the cost of AES implemen-tations. As discussed earlier, the key techniques for achieving low cost are datapath folding and composite fields for the S-boxes. The first implementationaiming to a small area was presented by Satoh et al. [254]. They presented bothsmall and high throughput ASIC implementations which use the composite fieldF((22)2)2 in SubBytes(·). They also presented how encryption and decryption

— 55 —

Table 5.1: AES implementations. The six implementations on the top were includedin I and the implementations from II are in the bottom.

Reference Year Device Area requirements1 Features2 T (Mbps)

Chodowiec [61] 2003 Spartan-II 30-6 222 sl., 3 BRAM EDK1 166Hodjat [125] 2004 Virtex-II Pro 20-7 9446 sl. EK1 21640Pramstaller [228] 2004 Virtex-E 1000-8 1125 sl. EDK123 215Rouvroy [245] 2004 Spartan-III 50-4 163 sl., 3 BRAM EDK1 208Zambreno [304] 2004 Virtex-II 4000 16938 sl. EK1 23570Zhang [307] 2004 Virtex-E 1000-8 11022 sl. EDK1 21556Alam [4] 2007 0.18 µm CMOS 21000 gt. EDK123M 384Brokalaikis [37] 2005 Virtex-II 2000-5 1122 sl., 8 BRAM EK1 1941Bulens [42] 2008 Virtex-5 400 sl. EK1 4100Charot [55] 2003 Apex 20 —— EK1 10800Chaves [56] 2006 Virtex-II Pro 20-7 3513 sl., 80 BRAM ED1 34760Chaves [56] 2006 Virtex-II Pro 20-7 515 sl., 12 BRAM ED1 2332Drimer [86] 2008 Virtex-5 SX95T-3 428 sl., 80 BRAM,

160 DSPE1 55000

Feldhofer [96] 2005 0.35 µm CMOS 3400 gt. EDK1 9.9Good [103] 2005 Spartan-III 2000-5 17425 sl. EDK1 25107Good [103] 2005 Virtex-E 2000-8 16693 sl. EDK1 23654Good [104] 2006 Spartan-II 15-6 122 sl., 2 BRAM EDK1 2.18Good [105] 2007 Spartan-III 4000-5 20720 sl. EDK1 30835Good [105] 2007 Virtex-II 8000-5 31674 sl. EDK1 28526Hamalainen [119] 2006 0.13 µm CMOS 3100 gt. EK1 121Hsiao [128] 2005 0.18 µm CMOS 18540 gt. EDK1 1500Hwang [130] 2006 0.18 µm CMOS 199000 gt. EK1 3840Hwang [130] 2006 0.18 µm CMOS 596000 gt. EK1C 990Ichikawa [131] 2000 0.35 µm CMOS 612834 gt. EDK1 1950.03Kosaraju [153] 2006 0.35 µm CMOS —— EDK1 232.7Kotturi [154] 2005 Virtex-II Pro 70-7 5408 sl., 200 BRAM EK1 29770Kuo [157] 2001 0.18 µm CMOS 173000 gt. EK123M 1820Liberatori [169] 2007 Spartan-III 200 822 sl. EK1 224.12Lutz [176] 2002 0.6 µm CMOS ∼300000 trans. EDK1 2260Mangard [179] 2003 0.6 µm CMOS 10799 gt. EDK1 128Morioka [201] 2004 0.13 µm CMOS 167566 gt. E1 11600Satoh et al. [254] 2001 0.11 µm CMOS 21337 gt. EDK1 2609.11Satoh et al. [254] 2001 0.11 µm CMOS 5400 gt. EDK1 311.09Satoh [253] 2006 0.13 µm CMOS 297542 gt. EK123G 42670Su [275] 2003 0.35 µm CMOS 58430 gt. EDK123 2008Verbauwhede [282] 2003 0.18 µm CMOS 173000 gt. EK123M 2290Zhou [309] 2007 Virtex-4 LX40-12 16396 sl. EK1G 20608II 2003 Virtex-E 1000-8 11719 sl. EK1 16540II 2003 Virtex-II 2000-5 10750 sl. EK1 178001 sl. = slice, gt. = gate, BRAM = BlockRAM, DSP = DSP block, trans. = transistor2 E = Encryption, D = Decryption, K = Key agile, 1 = AES-128, 2 = AES-192, 3 = AES-256,

G = Galois Counter Mode, M = Also 192 and 256-bit M , C = Countermeasures against DPA

— 56 —

can be compactly combined in SubBytes(·) and MixColumns(·). Feldhoferet al. [96] implemented key agile AES-128 encryption and decryption with only3400 gates with throughput of nearly 10 Mbps by using an 8-bit data path.Hamalainen et al. [119] introduced an 8-bit implementation of the AES-128 en-cryption using parallel AES round and KeySchedule(·, ·) and they achieveda throughput of 121 Mbps with only 3100 gates. This represents the smallestpublished ASIC implementation. However, it must be emphasized when com-pared to [96] that [119] does not include decryption. A review of low-cost AESimplementations was presented in [120] which concluded that the best architec-tures for wireless sensor networks requiring only low encryption speed with verylow cost are the 8-bit dedicated architectures from [96, 119].

Throughputs exceeding 10 Gbps are required in heavily-loaded servers andoptical network switches, for example, and fully unrolled and pipelined archi-tectures are therefore of interest. The implementation of Rudra et al. [247]was the first one to utilize composite fields. They implemented AES so thatthe entire AES was computed in the composite field F(24)2 , but they providedonly estimates for speed and area of such an implementation. Morioka andSatoh [201] presented that a throughput of over 11 Gbps can be reached evenwith an iterative AES implementation by utilizing combinatorial T-boxes de-rived directly from the truth-table of the S-box. As a consequence, their im-plementation supports also the CBC mode of operation with maximal through-put. Satoh [253] presented an implementation supporting all key sizes with athroughput of over 40 Gbps for AES-128. The design also implements all logicrequired in the GCM (Galois Counter Mode) mode of operation [209], i.e., alsoa 128-bit finite field multiplier is included. Hodjat and Verbauwhede [126] pre-sented area-throughput tradeoffs for fully unrolled ASIC-based architecturesachieving throughputs of over 30 Gbps. They concluded that the best re-sults are achieved by using composite fields for SubBytes(·) with an offlineKeySchedule(·) unit. They did not present any exact implementation re-sults but showed that throughputs ranging from 30 Gbps to nearly 70 Gbps areachievable with 150000-275000 gates.

As discussed in Sec. 3.4, side-channel attacks form a serious threat in certainapplications and countermeasures are thus required. Hwang et al. [130] pre-sented an AES coprocessor using differential logic style which aims to an equalpower consumption regardless of the operation being executed and thwarts DPAattacks as a result. They showed how differential logic style can make DPA at-tacks considerably more difficult, but with relatively high costs in area, powerconsumption, and throughput. They claimed that their coprocessor is the firstpractical implementation that is resistant against DPA.

5.3.2 FPGA Implementations

FPGAs are appealing platforms for cryptographic algorithms as discussed inSec. 3.2 and, therefore, numerous FPGA-based implementations of AES havebeen published. Three fastest FPGA implementations included in I were pre-sented by Hodjat and Verbauwhede [125], Zambreno et al. [304], and Zhanget al. [307]. They all used full unrolling and pipelining and utilized compositefields in SubBytes(·) [125, 304, 307]. High throughput with FPGAs has beenthe target in numerous other publications, too. Charot et al. [55] presented afully key agile AES-128 implementation which supports the CTR mode of op-

— 57 —

eration on an Altera APEX FPGA and studied effects of pipelining. Good andBenaissa [103] presented implementations where throughput was maximized bycarefully balancing pipeline registers so that the critical path was formed by 3LUTs. The implementations were optimized further in [105] where the criticalpath was reduced to only 2 LUTs. This required careful design and floor-planning in order to prevent routing delays from degrading the results. As aconsequence, they achieved a throughput of over 30 Gbps even in a low-costSpartan-III FPGA [105]. Kotturi et al. [154] presented a straightforward fullyunrolled and aggressively pipelined implementation of AES-128 encryption us-ing BlockRAMs. Brokalakis et al. [37] presented a straightforward, but efficient,iterative implementation of AES-128 encryption using BlockRAMs for S-boxes.Chaves et al. [56] presented folded and fully unrolled FPGA-based AES im-plementations utilizing T-boxes implemented in BlockRAMs. Zhou et al. [309]presented an AES implementation in the GCM mode of operation and theyoptimized their implementation for 4-to-1-bit LUTs.

As discussed in Sec. 3.2.1, the recent FPGA families have introduced severalnew features, such as larger memories and granularity, and they have changedthe ways how AES can be implemented. Utilization of the new FPGA archi-tectures has been studied in two very recent papers by Bulens et al. [42] andDrimer et al. [86]. Bulens et al. [42] studied the effects of the new Virtex-5architecture for AES implementations and they concluded that the new 6-bitLUT structure is well-suited for implementing SubBytes(·). A straightforwardimplementation of the S-box directly from the truth-table requires only 32 6-bitLUTs whereas it requires 144 4-bit LUTs [42]. Hence, the increase in granularityhas positive effects for AES implementations. Drimer et al. [86] presented anFPGA implementation which extensively utilizes BlockRAMs for the T-boxesand DSPs for AddRoundKey(·, ·). As a result, their fully unrolled implemen-tation of AES-128 encryption using 80 BlockRAMs and 160 DSPs requires only428 slices and achieves a throughput of 55 Gbps [86]. This represents the fastestFPGA-based implementation currently available in the literature.

Although FPGAs are generally unsuitable for low-cost applications becauseof their high power consumption compared to ASICs, several low-cost FPGAimplementations of AES have been presented. Three smallest implementationsconsidered in I were presented by Chodowiec and Gaj [61], Pramstaller andWolkerstorfer [228] and Rouvroy et al. [245]. Only slices were used in [228]whereas [61] used BlockRAMs to store the State and the S-boxes and [244] usedthem to implement the T-boxes. Liberatori et al. [169] presented a straightfor-ward implementation with a 64-bit data path that requires fewer slices than [228]and no BlockRAMs, but it supports only AES-128 encryption. To the au-thor’s knowledge, the smallest FPGA-based AES implementation was intro-duced by Good and Benaissa [103] and it was later considered in more de-tail in [104]. In the architecture, the S-box was implemented with combina-torial logic and other operations were implemented with a single multiplier-accumulator in F28 [103, 104]. The implementation utilized an 8-bit data pathand optimized instruction sets and microcodes.

5.3.3 Comparisons

Difficulty of comparing published AES implementations is a problem that isgenerally recognized in the community. In the author’s opinion, the most thor-

— 58 —

ough discussion on aspects related to these difficulties was provided by Drimeret al. in [86] and, hence, the following discussion is based mainly on their paper.The most obvious problem is that published AES implementations commonlyimplement different functions (different key sizes, encryption and/or decryption,inclusions of KeySchedule(·, ·), modes of operation, etc.). The large range oftarget applications and implementation platforms, ASICs with different CMOS(Complementary Metal Oxide Semiconductor) processes and various FPGAs,makes comparison very challenging. The lamentable fact that authors some-times left out necessary information, such as speed grades or sizes of devices,makes comparison even more difficult. Even though functions were the same andall information was available, fair comparison of FPGA implementations is noteasy because, as Drimer et al. concluded, resources combining several smallerresources, such as slices, ALMs, etc., are not good metrics for comparisons forthree reasons [86]:

— Their “definitions” differ among device families, e.g., slices in Virtex fam-ilies consist of different numbers of LUTs with different sizes; see Sec. 3.2.

— It cannot be differentiated whether only part or all of the resources areused, e.g., one can use only one LUT and no registers or all LUTs andregisters from a slice or an ALM.

— Listing and comparing only the number of such resources neglects thenumber of other blocks, e.g., embedded memory or DSP blocks. Hence,metrics such as throughput per slice are seldom sufficient for fair compar-ison.

The same points also cause problems when LUTs are considered [86]. Publish-ing source codes would help in comparisons because the same target devices anddesign softwares could be used [86]. However, it would not solve problems en-tirely because authors would probably—even without intention—optimize theirown designs more carefully [86].

Hence, unconditionally fair comparison is impossible without a specific ap-plication in mind [86] and implementations presented in Table 5.1 should not becompared one-to-one only based on the values given in the table. For instance,the implementation of [86] using numerous BlockRAMs and DSPs but only fewslices can be the most useful if logic resources are scarce, but BlockRAMs andDSPs are unused [86]. If the situation is vice versa, other architectures, suchas [105, 304, 307], are more feasible.

As discussed, difficulty of fair comparison is an issue that requires attention.I includes one of few efforts for providing fair(er) comparisons. BlockRAMsand slices were mapped to a common quantity by utilizing equations from [248]which helped to compare designs implemented with older Virtex family FPGAs.However this method does not provide unconditionally fair comparisons either,because it does not differentiate between different utilization factors inside slicesor BlockRAMs and also neglects the subjective factor that in certain applica-tions BlockRAMs can indeed be considered free. Values obtained similarly asin I are not included in Table 5.1 because the considerably wider scope of targetdevices makes the use of such values impossible. Notice that the above dis-cussion on the difficulty of comparisons is valid also for ECC implementationsdiscussed in Ch. 6.

— 59 —

Despite the above, certain implementations which represent the state-of-the-art of AES implementation are raised above others and discussed next. Theseare discussed in contexts of different target applications in order to provide fairercomparison. However, it should be noted that some of the designs discussedabove, but which are not given in the following selections, may have meritswhich make them the best suitable solutions for certain applications.

If an application requires a low-cost ASIC implementation, then implementa-tions by either Feldhofer et al. [96] or Hamalainen et al. [119] should be preferredso that [96] should be selected if both encryption and decryption are needed,[119] otherwise. When low-cost FPGA implementations are required, then oneof the implementations by Chodowiec and Gaj [61], Good and Benaissa [104],Liberatori et al. [169], or Rouvroy et al. [245] should be chosen depending onthe target device. Of these, the smallest one was presented in [104] and it doesnot set any specific requirements for the target device. For very high through-put ASIC implementations, implementation techniques proposed by Hodjat andVerbauwhede [126], Morioka and Satoh [201], or Satoh [253] should preferreddepending on which modes of operations are required. In FPGAs, the besthigh throughput architectures are available in the papers presented by Goodand Benaissa [105], Chaves et al. [56], and Drimer et al. [86], and their mutualsuperiority depends on how much BlockRAMs and DSPs are available on thetarget device. If support for the full AES is needed in ASICs, the design pre-sented by Su et al. [275] is the best option. The only design supporting thefull AES that has been targeted to FPGAs was presented by Pramstaller andWolkerstorfer [228]. When pure speed is considered, speedups of several tensof times can be achieved compared to the fastest software implementation byLipmaa [172].

The above review and comparisons demonstrate that AES is suitable forvarious environments and means for its implementation for nearly all imaginableapplications exist. Consequently, the research field of AES implementation canbe considered mature.

— 60 —

Chapter 6

Elliptic Curve Cryptography

This chapter discusses both ECC algorithms and implementations. Theemphasis is, again, on issues related to hardware implementation and, es-

pecially, FPGAs. The objectives of this chapter are to provide an overview ofECC, to present details of algorithms used in the publications, III–XI, and toreview existing literature on ECC implementations.

Many of the publications, namely IV–IX, were written during the PLAproject. The PLA is a countermeasure against Denial-of-Service (DoS) attacksin computer networks [46, 47], e.g., in the Internet. A cryptographic signature isattached to each packet and it is verified from node to node when the packet pro-ceeds through the network [46, 47]. Hence, the computational requirements arevery high making hardware acceleration essential. ECC and Koblitz curves areused in the PLA in order to provide short signatures and adequate performance.Hardware implementations for the PLA are presented in the publications of thisthesis, and a comparable software implementation is presented in [40].

This chapter is organized as follows. Preliminaries of ECC are presented inSec. 6.1 followed by detailed descriptions of algorithms used in ECC in Secs. 6.2and 6.3. A special class of elliptic curves called Koblitz curves, which has beenextensively used in the publications, is discussed in Sec. 6.4. Review of existingliterature on hardware implementations of ECC is given and implementationsof this thesis are compared to them in Sec. 6.5.

6.1 Preliminaries

Theory of elliptic curves is rich and deep. Elliptic curves have been studiedby mathematicians over a hundred years [122]. They arise naturally in manymathematical problems but were considered only as a beautiful branch of math-ematics with little practical use. However, in the recent decades elliptic curveshave been used in solving a variety of problems. They have been used, e.g., infactoring integers [165] and proving Fermat’s Last Theorem [289]. In 1985, NealKoblitz [146] and Victor Miller [194] independently proposed the use of ellipticcurves in public-key cryptography after which an enormous amount of work hasbeen done on elliptic curves in academia.

— 61 —

An elliptic curve is defined as the set of solutions for the following so-calledaffine version of the Weirstraß equation [32]:

E : y2 + a1xy + a3y = x3 + a2x2 + a4x+ a6. (6.1)

where a1, a2, a3, a4, a6 ∈ F.A point P = (x, y) is on the curve E if it satisfies (6.1). Also a point called

the point at infinity, denoted by O, is considered as a point on E. The origin ofthe name is that O can be thought of as lying infinitely far up the y-axis whencurves are defined over real numbers [32]. Hence, the set of points on E definedover F is

E(F) = {(x, y) ∈ F × F | y2 + a1xy + a3y = x3 + a2x2 + a4x+ a6} ∪ O. (6.2)

It is possible to define an addition of two points, P1, P2 ∈ E(F), and produce athird point, P3 = P1 + P2 ∈ E(F), by using arithmetic operations in F. Thisaddition is called point addition. The set E(F) forms an additive Abelian groupwith the addition rule and O serves as the identity element; see Sec. 4.1. [122]

Elliptic curve point multiplication1 is defined over the Abelian group and itis given as follows

Q = kP = P + P + . . .+ P︸︷︷︸k times

(6.3)

where Q,P ∈ E(F) and k is a positive integer. The point P is called the basepoint and Q is the result point.

ECDLP is the problem of finding k if P and Q are known. Methods forsolving ECDLP are not considered in this thesis; however, extensive reviewscan be found in [32] and Part V of [68], for example.

The order of a point P , ord(P ), is the smallest integer, t, for which tP = O.The integer k in (6.3) should be chosen so that it fulfills

1 < k ≤ ord(P )− 1. (6.4)

When curves over Fq are considered, the length of the binary expansion of k, `,is bounded by ` ≤ dlog2 qe. When q = 2m, this gives ` ≤ m. [122]

Elliptic curve point multiplication, (6.3), is hierarchical in nature. The oper-ation decomposes into three levels as shown in Fig. 6.1. This thesis concentratesin improving the two highest levels of the hierarchy and offers only minor con-tributions to the level of finite field arithmetic.

The lowest level, finite field arithmetic, was discussed in detail in Ch. 4,and it is not considered further here. Point operations are computed by usingfield arithmetic. Computational costs of point operations, i.e., the numberof required field operations, can be reduced mainly by representing points indifferent coordinates. When a point is represented with two coordinates asP = (x, y) as above, it is said to be in affine coordinates, or A for short. Ifinversions in F are considerably more expensive than multiplications (whichusually is the case), it is more efficient to use projective coordinates which areconsidered in more detail in Sec. 6.2 which discusses elliptic curve arithmetic,i.e., point operations.

The highest level, elliptic curve point multiplication, is computed by usingpoint operations and these algorithms are discussed in Sec. 6.3. If the base point

1The term scalar multiplication is also commonly used in the literature

— 62 —

Multiplication Addition Squaring Inversion

Pointaddition

Pointdoubling

Pointmultiplication

The middle level

Point operations

The highest level

Elliptic curve point multiplication

The lowest level

Finite field arithmetic

Figure 6.1: Hierarchy of point multiplication is depicted on the left-hand side. Typi-cal construction of operations in this hierarchy is shown on the right-hand side. Notice,however, that all operations may not be needed or implemented in all algorithms, e.g.,squaring may not be included, or additional operations can be used, e.g., point halv-ings or Frobenius maps. Inversion can be interpreted also as a separate level betweenthe lowest level and the middle level because it can be computed with a series of otherfinite field operations.

P is fixed, i.e., P is the same for all point multiplications, the computation canbe accelerated considerably with precomputations involving P . These methodswill be discussed in Sec. 6.3.1. Certain applications, including digital signaturealgorithms, require computing sums of point multiplications, such as k(1)P(1) +k(2)P(2). There are methods for speeding up these so-called multiple pointmultiplications and they are discussed in Sec. 6.3.2. A family of curves onwhich point multiplication can be computed considerably faster than on generalcurves is called Koblitz curves and they are considered in detail in Sec. 6.4.

Hyper-Elliptic Curve Cryptography (HECC) [147], a generalization of ECC,achieves a similar level of security than ECC with smaller finite fields, thus,resulting in faster finite field arithmetic. However, point operations are morecomplex. Recent studies [249, 250] suggest that HECC is slower than ECC withsimilar levels of security. HECC is not studied in detail in this thesis; the readershould consult [68], for example, for details.

6.2 Arithmetic on Elliptic Curves

The following discusses arithmetic on elliptic curves, the middle level of thehierarchy of Fig. 6.1. The discussion is restricted to so-called non-supersingularelliptic curves over F2m , because they are used in the publications. Non-supersingular curves are preferred also in other published ECC implementations.Supersingular curves, which were preferred previously for implementation effi-ciency reasons, should be avoided because ECDLP can be reduced to a DLPin an extension of the underlying field [188]. Formulae for curves over Fp areavailable in [84, 122], for example. The following formulae and notations arepresented as in [84].

Non-supersingular curves over F2m are defined by

E : y2 + xy = x3 + a2x2 + a6 (6.5)

— 63 —

with a2, a6 ∈ F2m so that a6 6= 0. The implementations of this thesis mainlyfocus on elliptic curves standardized by NIST in FIPS 186-2 [205]. Specifically,the curves NIST B-163, K-163, K-233, and K-283 were considered in the pub-lications and they all have the form of (6.5). The K-curves are Koblitz curves,and they will be discussed in more detail in Sec. 6.4.

Elliptic curve point multiplication, (6.3), is typically computed using twoprincipal operations on the curve; namely, point addition and point doubling.Let P1 and P2 be two points in E(F). Point addition is P3 = P1 + P2 withP1 6= ±P2 and it results in P3 ∈ E(F). Point doubling refers to a case whereP1 = P2, and it is denoted by P3 = 2P1. Computational costs of elliptic curveoperations depend on the number of operations required in F. This discussionfocuses on curves over F2m . The following notations are used for operations inF2m : I denotes inversion, M multiplication, S squaring, and A addition.

6.2.1 Affine Coordinates

Point addition P3 = P1+P2 = (x3, y3) for points P1 = (x1, y1) and P2 = (x2, y2)such that P1 6= ±P2 is given by

λ =y1 + y2x1 + x2

, x3 = λ2 +λ+x1 +x2 + a2, y3 = λ(x1 +x3) +x3 + y1 . (6.6)

Point doubling P3 = 2P1 = (x3, y3) for a point P1 = (x1, y1) is given by

λ = x1 +y1x1, x3 = λ2 + λ+ a2, y3 = λ(x1 + x3) + x3 + y1 . (6.7)

Thus, point addition costs I + 2M + S + 8A and point doubling costs I + 2M +S + 6A which are almost the same because additions are cheap. Given a pointP = (x, y), its point negation, −P , is given by (x, x+ y) which costs A. [84]

As discussed in Ch. 4, inversions are more expensive than other field opera-tions; e.g., an Itoh-Tsujii inversion requires 9M+162S in F2163 [136]. Inversionscan be avoided in point addition and point doubling by using projective coordi-nates with the expense of more multiplications. Projective coordinates usuallyresults in considerable speed enhancements in practical cryptosystems. Twoprojective coordinate systems used in the publications are discussed in the fol-lowing.

6.2.2 Standard Projective Coordinates

By using standard projective coordinates, P, a point is represented with thetriple (X,Y, Z) so that it represents the point (X/Z, Y/Z) in A if Z 6= 0 andO = (0, 1, 0) otherwise.

Point addition P3 = P1 +P2, where Pi = (Xi, Yi, Zi) and P1 6= ±P2, is givenby:

A = Y1Z2 + Z1Y2, B = X1Z2 + Z1X2, C = B2,

D = Z1Z2, E = (A2 +AB + a2C)D +BC,

X3 = BE, Y3 = (AX1 + Y1B)CZ2 + (A+B)E, Z3 = (BC)D.

(6.8)

— 64 —

Point doubling P3 = 2P1 is given by:

A = X21 , B = A+ Y1Z1, C = X1Z1,

D = C2, E = B2 +BC + a2D,

X3 = CE, Y3 = (B + C)E +A2C, Z3 = CD.

(6.9)

Neither of the operations requires inversions, and the costs are 16M + 2S + 6Aand 8M + 4S + 5A for point addition and point doubling, respectively. Pointnegation is given by (X,X +Y,Z), and it costs A. It depends on the cost of aninversion whether this representation is faster than A. [84]

A very efficient algorithm for point multiplication is achievable in P byadapting Montgomery’s idea from [197] to curves with the form of (6.5) aspresented by Lopez and Dahab in [175]. This method, henceforth referred toas Montgomery point multiplication, relies on the fact that y-coordinate is notneeded during point multiplication because it can be recovered in the end [175,197]. The method will be discussed in more detail in Sec. 6.3, but the pointoperation level formulae are given in the following. The x-coordinate of pointaddition, P1 + P2, is obtained with the following formulae [175]:

Z3 = (X1Z2 +X2Z1)2, X3 = xZ3 +X1Z2X2Z1 (6.10)

where x is the x-coordinate of the base point P . The x-coordinate of pointdoubling, 2P1, is given by the formulae [175]:

X3 = X41 + a6Z

41 = (X2

1 +√a6Z

21 )2, Z3 = X2

1Z21 . (6.11)

The cost of (6.10) is 4M + S + 2A while (6.11) costs 2M + 3S + A if√a6 is

precomputed [175]. The y-coordinate is recovered by computing x1 = X1/Z1

and x2 = X2/Z2 and by using the formula [175]:

y1 =(x1 + x)

((x1 + x) (x2 + x) + x2 + y

)x

+ y (6.12)

where (x, y) is the base point P . The y-coordinate can be recovered with the costof I+10M+S+6A [175]. Standard projective coordinates, P, and Montgomerypoint multiplication were used in III and IV.

6.2.3 Lopez-Dahab Coordinates

When Lopez-Dahab coordinates [174], LD, are in use, a point is representedwith the triple (X,Y, Z) which represent the point (X/Z, Y/Z2) in A whenZ 6= 0 and O = (1, 0, 0) otherwise.

Point addition, P3 = P1 + P2, is carried out with the formulae [124]

A = X1Z2, B = X2Z1, C = A2, D = B2, E = A+B,

F = C +D, G = Y1Z22 , H = Y2Z

21 , I = G+H, J = IE,

Z3 = FZ1Z2, X3 = A(H +D) +B(C +G),Y3 = (AJ + FG)F + (J + Z3)X3,

(6.13)

and point doubling, P3 = 2P1, is given by the following formulae [161]:

A = X1Z1, B = X21 , C = B + Y1, D = AC,

Z3 = A2, X3 = C2 +D + a2Z3, Y3 = (Z3 +D)X3 +B2Z3.(6.14)

— 65 —

Table 6.1: The costs of point operations

Operation P1 P2 P3 Cost

P3 = P1 + P2 A A A I + 2M + S + 8AP3 = 2P1 A - A I + 2M + S + 6AP3 = −P1 A - A AP3 = P1 + P2 P P P 16M + S + 6AP3 = 2P1 P - P 8M + 4S + 5AP3 = −P1 P - P AP1 7→ P3 P - A I + 2MP3 = P1 + P2 (x-coord.) P P P 4M + S + 2AP3 = 2P1 (x-coord.) P - P 2M + 3S + AP1 7→ P3 (and y-coord.) P - A I + 10M + S + 6AP3 = P1 + P2 LD LD LD 13M + 4S + 9AP3 = 2P1 LD - LD 5M + 4S + 5AP3 = P1 + P2 LD A LD 9M + 5S + 9AP3 = −P1 LD - LD M + AP3 = P1 7→ P3 LD - A I + 2M + S

The costs of the above point addition and point doubling are 13M+4S+9A and5M + 4S + 5A, respectively. Point negation is given by (X,XZ + Y, Z) whichcosts M + A. The attractiveness of LD coordinates is based on a very efficientpoint addition when the point P2 is represented in A. This is called mixedcoordinate point addition and the formulae for (X3, Y3, Z3) = (X1, Y1, Z1) +(x2, y2) are given by [3]

A = Y1 + y2Z21 , B = X1 + x2Z1, C = BZ1,

Z3 = C2, D = x2Z3,

X3 = A2 + C(A+B2 + a2C),

Y3 = (D +X3)(AC + Z3) + (y2 + x2)Z23 .

(6.15)

This has the cost of 9M + 5S + 9A. Further reductions by M if a2 ∈ {0, 1}and by A if a2 = 0 occur on Koblitz curves; see Sec. 6.4. The Lopez-Dahabcoordinates, LD, were used in IV–VII, X, and XI.

Table 6.1 summaries all above-discussed point operations and their costs.

6.3 Elliptic Curve Point Multiplication

Elliptic curve point multiplication is computed with algorithms which are analo-gous to those used in exponentiation. A comprehensive survey of exponentiationis provided in [108]. This section gives a review of a few most commonly usedalgorithms for elliptic curve point multiplication by concentrating on the algo-rithms used in the publications.

A standard method for computing (6.3) is called binary method2 where theinteger k is represented with binary expansion as

k =`−1∑i=0

ki2i (6.16)

2Also called double-and-add method (or square-and-multiply method in exponentiation).

— 66 —

where ki ∈ {0, 1}. When binary method operates k from the right to the left,i.e., from the lsb to the msb, 2iP is added to the result point Q when ki = 1.Consider 18P as an example. Binary expansion of 18 is 〈10010〉. One firstcomputes 2P with one point doubling and adds it to Q after which 16P iscomputed with three point doublings and the result is added to Q giving 18P .A modification which operates k from the left to the right is commonly usedin practical implementations and in IV–VII, X, and XI. It is presented inAlg. 6.1 and it has the benefit that only Q needs to be accumulated while Premains unchanged.

Algorithm 6.1 Left-to-right binary method for elliptic curve point multiplica-tionInput: P ∈ E(F), integer k =

∑`−1i=0 ki2

i where ki ∈ {0, 1}Output: Q = kPQ← Ofor i = `− 1 downto 0 doQ← 2Qif ki = 1 thenQ← Q+ P

end ifend forreturn(Q)

Obviously, the computational cost of the binary method depends on thenumber of ones in the binary expansion of k. Again, H(k) denotes the Hammingweight of k, i.e., the number of nonzero terms in k. The computational cost ofthe binary method is `−1 point doublings and H(k)−1 point additions becausethe first point addition is simply a substitution. On average, half of the bits ina binary expansion are ones; thus, H(k) ≈ `/2.

Because H(k) has a direct impact on performance, it is of interest to reduceit. Signed-binary representations, k =

∑`−1i=0 ki2

i where ki ∈ {0,±1}, are partic-ularly interesting because point subtractions have approximately the same costwith point additions3 [198]. Point subtraction, P1−P2, is simply point additionwhere P2 is negated, and point negation is cheap as discussed in Sec. 6.2.

A Non-Adjacent Form4 (NAF) is a representation that has the minimumH(k) among all signed-binary representations which makes it well-suited forpractical applications. When k is in NAF, two consecutive digits are nevernonzero, i.e., kiki+1 = 0 for all i. Every positive integer k has a unique NAFwith a length which is at most one bit longer than its binary expansion. Mostimportantly, the average density of nonzero terms in NAF is 1/3 which resultsin H(k) ≈ `/3. An NAF is constructed from a binary expansion starting fromthe lsb by replacing strings of j ones with 10 . . . 01, where the number of zerosis j − 1 and 1 = −1.[82]

Consider 1853P as an example. The binary method requires seven pointadditions because 1853 = 〈11100111101〉. The number of point additions and

3Signed-binary representations cannot be used as efficiently in exponentiation, e.g., inRSA or ElGamal, unless precomputations are allowed, because divisions are considerablymore expensive than multiplications.

4In DSP, such representations are commonly called canonic signed-digit representations.

— 67 —

subtractions reduces to four if 1853 is given in NAF as 〈100101000101〉. Pointmultiplication requires in this case an additional point doubling but the speedupis, nonetheless, considerable.

If used naıvely, NAF doubles requirements for storage space compared tobinary expansion because two bits are needed to represent one signed-bit. How-ever, if storage space is an issue, the easily-disposable compact encoding pre-sented in [141] allows representing NAF with at most one extra bit comparedto binary expansion. The same paper suggest also a technique for selectingwell-distributed random NAFs [141].

One advantage of using NAF is that H(k) can be reduced without any pre-computations involving the base point P . If such precomputations are allowed,H(k) can be reduced by using methods which are discussed in more detail inSec. 6.3.1. If the base point is fixed and there are no strict memory constraints,those methods can provide considerable speedups.

Computational complexity of point multiplication can be reduced also byusing a DBNS in representing k. When k is represented in the DBNS, it hasthe form:

k =∑i,j

ki,j2i3j (6.17)

where ki,j ∈ {0, 1} and i, j are non-negative integers [79, 80]. Obviously, suchrepresentations are not unique and it is possible to find representations whichare very sparse, i.e., H(k) is small [80]. Hence, fewer point operations arerequired but extra computation is needed in conversions to the DBNS. In [79,80], Dimitrov et al. proposed the use of the DBNS in DSP applications andexponentiation. The idea has been adapted to elliptic curve point multiplicationin [66, 77, 78].

Binary method has the weakness that, if point additions and point doublingsare distinguishable, it may be possible to retrieve confidential information withside-channel attacks because point additions are performed only when k 6= 0.This can be avoided by using a structure called Montgomery’s ladder [197] wherepoint doubling and point addition are both performed for each bit. Montgomerypoint multiplication is outlined in Alg. 6.2 [197]. As mentioned in Sec. 6.2.2,Montgomery point multiplication can be adapted to curves with a form of (6.5)as shown by Lopez and Dahab in [175] and it results in a very efficient pointmultiplication algorithm. Point multiplication is carried out in P without infor-

Algorithm 6.2 Point multiplication using Montgomery’s ladder

Input: P ∈ E(F), integer k =∑`−1i=0 ki2

i where ki ∈ {0, 1} and k`−1 = 1Output: Q = kPP1 ← P and P2 ← 2Pfor `− 2 downto 0 do

if ki = 0 thenP2 ← P1 + P2 and P1 ← 2P1

elseP1 ← P1 + P2 and P2 ← 2P2

end ifend forreturn(Q = P1)

— 68 —

mation about the y-coordinate, i.e., only X and Z are updated. In the beginningP1 and P2 are computed from P with 2S+A. Point multiplication is computedwith Alg. 6.2 so that point addition and point doubling are performed for all bitswith (6.10) and (6.11), respectively, requiring together only 6M + 4S + 3A periteration. The y-coordinate is recovered in the end by computing (6.12). [175]

Computation of the binary method can be accelerated also by replacingpoint doublings with more efficiently computable operations. Point doublingscan be replaced with very cheap Frobenius maps on Koblitz curves as will bediscussed in Sec. 6.4. Koblitz curves are defined over F2m , but analogouslyefficiently-computable endomorphisms are used in the GLV (Gallant, Lambert,and Vanstone) method for a class of curves over Fp [101]. Another approachwhich gives smaller improvements but applies to a wider class of curves is calledpoint halving, proposed independently by Erik W. Knudsen [145] and RichardSchroeppel (at the rump session of CRYPTO 2000). Point halving, the inverseoperation of point doubling, is the operation P3 = 1

2P1 such that 2P3 = P1.After manipulations on k, computationally cheaper point halvings can be usedinstead of point doublings which leads to faster point multiplication [145].

6.3.1 Methods with Precomputations

The common idea in the methods presented in this section is that the com-putational cost of (6.3) is reduced by precomputing some data depending onP . These methods are especially useful if P is fixed because, in that case, pre-computations need to be performed only once; however, considerable speedupmay be achievable even though P is not fixed. Naturally, storage for precom-puted data must be available. Analogous methods exist for exponentiation andthe following methods are their adaptations for elliptic curves; see [82, 108] forcomprehensive reviews.

The first method, which dates to 1939, is called 2w-ary algorithm [35] wherethe integer k is represented in radix-2w as

k =`−1∑i=0

ki(2w)i (6.18)

where ki ∈ [0, 2w − 1]. The precomputed points are 3P, 5P, . . . , (2w − 1)P .Each ki is given in the form ki = 2su where u is odd; s = w and u = 0 ifki = 0 [82]. Then each ki transforms into the operation 2s(2w−sQ+uP ) on thecurve. The computation of (6.3) requires `− 1 point doublings and on average(2w − 1)/2wd`/we − 1 point additions.

Consider 846P as an example with w = 3. Precomputed points are 3P, 5P ,and 7P and k is represented as 〈1 101 001 110〉2 = 〈1516〉23 . First, one setsQ = P because k3 = 1. Because 5 = 20 · 5, one computes 20(23P + 5P ) = 13P ,and continues by computing 20(23(13P )+P ) = 105P because 1 = 20 ·1. Finally,the result is given by 2(22(105P ) + 3P ) = 846P because 6 = 21 · 3. Hence,point multiplication requires 9 point doublings and 3 point additions excludingprecomputations requiring 3 point additions and 1 point doubling.

The 2w-ary algorithm slices k by using a fixed window. A sliding windowanalogue results in slightly more optimal representations because it skips consec-utive zeros [82]. In the case of the above example, a sliding window with w = 3results in the following windowing: 〈1 101 00 111 0〉2, and the number of point

— 69 —

additions reduces by one. The window methods can be used also for signed-digit representations. A generalization of NAF called width-w NAF, or NAFwfor short, gives a representation whose H(k) ≈ m/(w + 1) with precomputedpoints 3P, 5P, . . . , (2w−1 − 1)P [122].

The above ideas can be used in accelerating (6.3) even if P is not fixed.The following methods are useful only with a fixed P . The simplest idea isto precompute 2iP for all integers i ∈ [1,m − 1]. This shortens runtime byeliminating all point doublings, but requires storage for m points. Combinationof the previous idea with the 2w-ary algorithm results in an algorithm [36],called fixed-base windowing method. The method is based on the followingequation [122]:

kP =t−1∑i=0

ki(2wiP ) =2w−1∑j=1

j ∑i:ki=j

2wiP

(6.19)

where ki ∈ [0, 2w − 1], t = d`/we and points 2wiP are precomputed for integersi ∈ [0, t − 1]. The fixed-base windowing method computes (6.3) with approx-imately 2w + t − 3 point additions [36, 122]. The idea can be generalized toNAF which results in an average runtime of 2w+1/3 + d(` + 1)/we − 2 pointadditions [36, 122]. Furthermore, Montgomery’s trick, discussed in Sec. 4.4, canbe applied to reduce the number of inversions if A coordinates are in use [195].

The fixed-base comb method [170] is also a method involving precomputa-tions with P . The integer k is represented as a binary array of w rows andt = d`/we columns. Points are precomputed for all possible values of thecolumns. Point multiplication operates on the array one column at a time sothat a precomputed value of the column is added. Moving to the next columnimplies a point doubling. It requires t − 1 point doublings and, on average,(2w− 1)/2wt− 1 point additions [122, 170]. The number of point doublings canbe reduced to dt/2e − 1 by using two precomputation tables [122, 170].

Methods for speeding up (6.3) when the integer k is fixed exist but they arenot considered in this thesis because they have little practical importance inECC. Such methods are reviewed in [82], for example.

6.3.2 Multiple Point Multiplication

Multiple point multiplication refers to a sum of point multiplications, i.e.,

Q =n−1∑i=0

k(i)P(i) (6.20)

where k(i) are integers and P(i) ∈ E(F). Multiple point multiplications are usedin various schemes including digital signatures, e.g., multiple point multiplica-tions with n = 2 are required in ECDSA verifications [140] and verifications ofself-certified identity based signatures in the PLA require multiple point mul-tiplications with n = 3 [38]. Of course, multiple point multiplications can becomputed naıvely with n separate point multiplications and n − 1 point addi-tions combining their results. This requires

∑n−1i=0 H(k(i)) − 1 point additions

and∑n−1i=0 (`(i) − 1) point doublings and it is possible to do much better.

Shamir’s trick, a method attributed to Shamir by ElGamal in [92], operatesin the following way when n = 2. Bits of k(0) and k(1) are scanned from the left

— 70 —

to the right and point doubling is performed when moving from a bit to another.When the present bits of k(0)

k(1)are 1

0 , 01 , or 1

1 points P(0), P(1), or P(0) + P(1) areadded, respectively. The point P(0) + P(1) should be precomputed. Clearly,Shamir’s trick is easy to generalize for n integers. Let k denote an array of nbinary expansions and let Hn(k) be its joint Hamming weight, i.e., the numberof nonzero columns in k. Shamir’s trick requires Hn(k)− 1 point additions and`− 1 point doublings excluding precomputations which require 2n−n− 1 pointadditions.

As the computational cost depends on Hn(k), it is of interest to reduce itonce again. Because the probability of a zero bit in a binary expansion is 0.5,the probability of a zero column in an array of n integers is simply 0.5n whichgives Hn(k) ≈ `(1− 0.5n). If the integers are in NAF, the probability becomes(2/3)n. This can be improved because signed-binary representations are notunique. Thus, one can use signed-binary representations which maximize thenumber of zero columns in the array. One such representation is referred to asJoint Sparse Form (JSF) and it is unique and has the minimum Hn(k) amongall signed-binary representations of n integers [230, 268]. Jerome A. Solinaspresented an algorithm for obtaining JSF for a pair of integers in [268]. Ageneralization for n integers is due to John Proos [230]. Probabilities of nonzerocolumns are presented in Table 6.2 with different values of n. A variant ofJSF called simple JSF, which also has a minimal Hn(k) but is more efficient tocompute, was introduced in [110].

If n is relatively small—which usually is the case—it can be advantageousto combine window methods with multiple point multiplications. As a con-sequence, a large number of points need to be precomputed, but their com-putational cost can be reduced by eliminating inversions with Montgomery’strick [215], discussed in Sec. 4.4. Montgomery’s trick provides considerablespeedups even though window methods are not in use as shown in V for n = 3.

6.4 Koblitz Curves

Koblitz curves [148] are a class of non-supersingular curves defined over F2 by

Ea2 : y2 + xy = x3 + a2x2 + 1 (6.21)

where a2 ∈ {0, 1}. The NIST curves K-163, K-233, and K-283 [205], which areused in IV–XI, are Koblitz curves and have the above form. Koblitz curves arean attractive family of curves because Frobenius endomorphisms can replacepoint doublings. The Frobenius endomorphism, φ : Ea2(F2m)→ Ea2(F2m), is a

Table 6.2: Probabilities of nonzero columns in n integers with different representa-tions [39, 230]

n Binary NAF JSF

1 0.5000 0.3333 —2 0.7500 0.5555 0.50003 0.8750 0.7037 0.58974 0.9375 0.8025 0.64255 0.9688 0.8683 0.6727

— 71 —

map such thatφ(x, y) = (x2, y2) and φ(O) = O. (6.22)

Frobenius maps are obviously very cheap if compared to point operations dis-cussed in Sec. 6.2 because only two or three squarings are required depending onthe coordinate system. Squarings are efficient especially in normal basis wherethey are simple bitvector rotations as discussed in Ch. 4. Successive applicationsof Frobenius maps are simply φt(x, y) = (x2t, y2t).

It can be shown that the following equation stands for all P ∈ Ea2(F2m):

µφ(P )− φ2(P ) = 2P (6.23)

where µ = (−1)1−a2 . Thus, the Frobenius map can be seen as a complex numberτ satisfying

µτ − τ2 = 2 (6.24)

which has two solutions, and it is chosen that

τ =µ+√−7

2. (6.25)

It is now possible to multiply points in Ea2(F2m) with elements of a ring ofpolynomials in τ with integer coefficients, given by

k =`−1∑i=0

kiτi (6.26)

where ki are integers. Obviously, it is of interest to find such a representationwhere the length, `, and the nonzero digits, ki, are small. [148]

It follows directly from (6.24) that every element k of the ring can be ex-pressed in a canonical form as

k = k0 + k1τ. (6.27)

If an integer, k, is represented with a binary expansion, each bit ki requirespoint doubling. Point addition is computed if ki = 1, as discussed previously.Similarly, if k is represented with a τ -adic expansion of (6.26) where ki ∈ {0, 1},each ki requires a Frobenius map and ki = 1 yields point addition. Such rep-resentation can be found by repeatedly dividing the integer k by τ . The digitski ∈ {0, 1} are the remainders of the divisions. This procedure is analogous withthe computation of binary representation where divisions by 2 are applied. [148]

Analogously with the NAF, it is possible to represent k in τ -adic Non-Adjacent Form (τNAF) where ki ∈ {0,±1} and kiki+1 = 0 for all i, i.e., thereare no adjacent nonzero digits. Algorithms for computing τNAF and width-wτNAF, or τNAFw for short, were presented by Solinas in [264, 266, 267]. Theaverage density of τNAF is the same as for binary NAF, i.e., 1/3. The problemis, however, that the length of an expansion, `, doubles to approximately 2mif the τNAF conversion algorithm from [264, 266, 267] is applied directly to aninteger.

It is possible to circumvent this problem because the following equationstands for all P ∈ Ea2(F2m) [187]:

(φm − 1)P = φmP − P = P − P = O. (6.28)

— 72 —

This implies that if k′ ≡ k (mod τm − 1), then k′P = kP . τNAF of theremainder of k divided by τm − 1 has a length of approximately m which isonly one half of the length of τNAF of k [264, 266, 267]. It can be proven thatk′P = kP stands for all P ∈ Ea2(F2m) also in the case k′ ≡ k (mod δ) whereδ = (τm−1)/(τ−1) [266, 267]. The strategy is to find k′ = k0+k1τ ≡ k (mod δ)with the smallest possible norm and then convert it to τNAF. The result has alength of approximately m. Finding k′ ≡ k (mod δ) can be expensive, althoughan efficient partial reduction algorithm was proposed in [267].

Another approach is to first convert an integer k to τNAF and then factor(τm − 1) out of the result. The problems in this approach are that the τNAFconversion takes longer and the non-adjacency is lost in the reduction and itsrecovery may be computationally demanding. This approach was outlined atleast in [177]. This method yields efficient hardware implementations as shownin VIII.

If reduced τNAFs are used, the average computational cost of (6.3) is onlym/3 point additions because point doublings are replaced by Frobenius mapswhich are almost free. This is a significant reduction compared to computa-tions on general curves. Hence, Koblitz curves offer major benefits over generalcurves.

However, before fast point multiplications can be utilized, conversions needto be applied to integers which require time and resources. In order to avoidthese conversions, random τ -adic expansions can be generated in a simple man-ner and they have been shown to be well-distributed [162]. Such representationsare troublesome in certain applications including digital signatures because thebinary equivalent is required which yields a need for a conversion from τ -adicto binary expansion. An efficient algorithm for the conversion from τ -adic tobinary expansion and its hardware implementation were presented in IX.

Many methods, which were discussed earlier, can be adapted to Koblitzcurves and, in some cases, even further improvements are gained comparedto their general curve equivalents. τNAFs can be compressed similarly asNAFs and the method for selecting random NAFs gives almost perfectly dis-tributed τNAFs, as well [141]. Adaptations of DBNS-related expansions toKoblitz curves were presented in X and in [15, 16], but practical significance ofsuch methods was first demonstrated in X. Point halving gives computationaladvantages also on Koblitz curves even though there are no point doublingsto be replaced [17, 19]. Computing one point halving reduces the number ofrequired point additions from m/3 to m/4 [19]. Unfortunately, Montgomerypoint multiplication has not been adapted to Koblitz curves. Window meth-ods, on the other hand, can be computed without requirements for storagespace as presented in [216, 284] by exploiting the inexpensiveness of Frobeniusmaps. The technique operates so that an entire k is scanned once for everyprecomputed point and, hence, it requires storage only for the point currentlyat hand [216, 284]. The inexpensiveness of Frobenius maps was used also in IVallowing efficient parallelization of point multiplication. Multiple point multi-plications and the idea of JSF can be used also for τ -adic representations. Analgorithm for obtaining τ -adic Joint Sparse Form (τJSF) for a pair of integerswas proposed by Ciet et al. in [65] and it was generalized for n integers by Brum-ley in [39]. Also τJSFs have the probabilities of nonzero columns presented inTable 6.2.

— 73 —

6.5 Literature Review of ECC Implementations

This section reviews published implementations of ECC on both ASICs andFPGAs in chronological order. The FPGA-based implementations are consid-ered in more detail because of the focus of this thesis. Details of implementationsdiscussed in the following are collected to Table 6.3. The table includes the de-sign which is the most comparable with the implementations of the publicationsof this thesis when several implementations are presented. Other reviews of ECCimplementations are available in [26, 193].

The first hardware implementation of ECC was published by Agnew et al.in [2] in 1993. They presented a processor architecture with a bit-serial normalbasis multiplier for F2155 . They provided performance estimates for both super-singular and non-supersingular curves. The first FPGA-based ECC architecturewas presented by Rosner in [243] in 1998. Both of these implementations useparameters which are nowadays considered insecure, i.e., composite fields areconsidered in [2, 243] and supersingular curves are used in [2]. Nevertheless,at least [2] is commonly considered one of the landmark works in the field ofcryptographic hardware implementation. It was also considerably ahead of itstime because implementations of ECC did not start to gain serious interest inacademia before the turn of the millennium.

In 1999, Gao et al. presented ECC implementations on a low-cost XilinxFPGA in [102]. However, the very small field sizes (m ≤ 53) used in [102] wereinsecure already in 1999. Orlando and Paar [219] also presented performanceestimates of ECC with their super-serial multiplier at the same conference, IEEEInternational Symposium on Field-Programmable Custom Computing Machines(FCCM) 1999.

In 2000, Leung et al. presented a microcoded processor architecture forFPGAs in [168]. The paper also presented an automated design generatorwhich facilitated generation of several designs with different design parame-ters. The microcoded architecture provided easily modifiable point operations.This work was later continued by Leong and Leung in [167] which providedfurther studies on the architecture as well as the effects of coordinate selection,field size, and parallelism. In [106, 107], Goodman and Chandrakasan presentedan ASIC implementation targeting to low-power applications. The architecturewas reconfigurable in the sense that it supported several irreducible polyno-mials and implemented support for both Fp and F2m . The architecture alsoperformed SHA-1 hashing and RSA and, hence, supported practically all of theIEEE P1363 standard [132]. The papers concluded that ECC is both faster andmore energy-efficient than RSA. Energy-efficiency comparisons to FPGA andsoftware based implementations were also provided and the ASIC implemen-tation was claimed to be 30-180 times more energy-efficient than FPGA andover three orders of magnitude better than software. Okada et al. presented anECC processor based on a novel field multiplier architecture in [214]. The mul-tiplier can be seen as a generalization of the super-serial multiplier presentedin [219] and it supports arbitrary irreducible polynomials. This was also thefirst paper considering FPGA-based implementation of Koblitz curves. Koblitzcurves were found to be nearly twice as fast as general curves. The architecturedid not, however, include circuitry for τ -adic conversions. Orlando and Paarpresented an FPGA-based processor optimized for a fixed irreducible polyno-mial in [220]. Irreducible polynomials could be changed by reprogramming the

— 74 —

FPGA. They provided comparisons between Montgomery point multiplicationand binary method and concluded that Montgomery point multiplication is su-perior. Notice that [106, 214, 220] were all published in CHES 2000.

In 2001, Orlando and Paar [221] published an FPGA-based ECC processorfor Fp using Montgomery multiplier with precomputed LUTs stored in embed-ded memory. Ernst et al. [94] discussed implementation of ECC by using anautomated VHDL generator. Again, the generator provided multiple designswith various design parameters in a short time.

In 2002, Kerins et al. [144] presented an FPGA-based processor optimizedfor a(x)b(x)/c(x) mod p(x) in F2m . The processor was designed to supportpoint multiplication in A, where the above computation is useful. Compre-hensive comparisons of various aspects of ECC implementations were given byBednara et al. in [27, 28]. Polynomial and normal bases were compared andpolynomial basis was found to give better results. Coordinate systems and pointmultiplication algorithms were compared and Montgomery point multiplicationwas shown to provide the best results. Choosing the number of multipliersin the processor was also briefly discussed. Ernst et al. [93] presented a low-cost implementation for computing kP on E(F2133) by using an Atmel FPGAboard. In [114], Gura et al. presented an ECC implementation integrated intoOpenSSL. Computations were optimized for F2m with certain fixed polynomialsbut support for arbitrary polynomials was also included. The fixed polynomialswere shown to provide ten times faster performance. The architecture was im-proved by introducing partial reduction for arbitrary polynomials in [115] whichreduced the gap to six times. Even further improvements were shown to be pos-sible by Eberle et al. in [87] where a new multiplier was introduced reducing thedifference to only two times. This is very impressive considering the complexityof polynomial reduction with an arbitrary irreducible polynomial. Schroeppel etal. [259] published a low-power ASIC implementation of ElGamal signatures onelliptic curves. The implementation utilized point halving, composite field F2178

which was claimed to be secure, and included logic for performing all operationsrequired in signature generation and verification, i.e., also SHA-1 was included.Potgieter and van Dyk [227] presented a scalable and flexible processor archi-tecture for F2m . Their architecture was based on a multiplier combining theclassical and Montgomery multipliers.

In 2003, Satoh and Takano [255] presented a processor architecture imple-menting support for both F2m and Fp with different irreducible polynomialsand primes. It was concluded that F2m is 3-6 times faster than Fp and sup-port for F2m was achieved with no additional cost compared to Fp. Nguyen etal. [212] studied implementation of ECC on a reconfigurable computer. Theirstudy concentrated on partitioning between software and hardware (FPGA).They concluded that the entire point multiplication should be implemented inhardware, but it suffices that only the finite field arithmetic is implementedin VHDL. Ors et al. [222] described an ECC processor over Fp. They used amultiplier based on systolic array architecture and Montgomery representation.The processor is capable of computing RSA as well.

In 2004, Lutz and Hasan [178] (based on Lutz’s Master’s thesis [177]) pre-sented a processor architecture computing point multiplication on both generaland Koblitz curves. The implementation was optimized for Koblitz curves sothat all coordinates of Frobenius maps were computed in parallel with squarersattached to registers. The implementation did not include a τ -adic converter.

— 75 —

Table 6.3: Published ECC implementations

Reference Year Curve Device Area requirements1 Features2 t (µs) T (ops)

Agnew [2] 1993 E(F2155 ), NB Gate array ∼11,000 gt. —— ——Ansari [12] 2006 E(F2163 ), PB Virtex-II 2000 8,300 LUT, 1,100 FF, 7 BRAM 41 24,000Ansari [13] 2007 E(F2163 ), PB 0.18µm CMOS 36000 gt., 1KB RAM 62 19,000Bajracharya [20] 2004 E(F2233 ), NB Virtex-II 6000 59% of sl. 201 5,000Batina [24] 2005 E(F2179 ), PB Virtex 800 11,811 sl. C 557 1,800Batina [25] 2006 E(F2163 ), PB Virtex-II Pro 8769 sl. H 667 1,500Bednara [28] 2002 E(F2191 ), PB Virtex 1000-4 —— O 2270 440Benaissa [29] 2006 E(F2163 ), PB Virtex-E 2000 6,010 LUT, 504 FF, 12 BRAM F 804 1,200Chelton [57] 2008 E(F2163 ), PB Virtex-4 L200 16209 sl. O 19.55 51,120Chen [58] 2007 E(Fp), 256-bit 0.13µm CMOS 122000 gt. 1010 990Cheung [59] 2005 E(F2162 ), NB Virtex-II 6000 —— O 100 10,000Daly [72] 2004 E(Fp), 160-bit Virtex-E 2000-6 3434 sl. O —— ——Daneshbeh [73] 2004 —— 0.18µm CMOS 35,000 gt. F —— ——Eberle [87] 2003 E(F2163 ), PB Virtex-E 2000-7 20,068 LUT, 6321 FF F 302.29 3,308Ernst [94] 2001 E(F2155 ), NB XC4085XLA 63% of CLB 1290 775Ernst [93] 2002 E(F2113 ), PB AT94K40 96% of resources 1400 700Gao [102] 1999 E(F253 ), NB XC4044XL 1626 CLB O 370 2,670Goodman [107] 2001 E(F2176 ), PB 0.25µm CMOS 2.9× 2.9 mm2 FI 6950 140Gura [114] 2002 E(F2163 ), PB Virtex-E 2000-7 20,068 LUT, 6321 FF (F)I 143.81 6,987Gura [114] 2002 E(F2163 ), PB Virtex-E 2000-7 20,068 LUT, 6321 FF FI 1554 644Gura [115] 2002 E(F2163 ), PB Virtex-E 2000-7 —— F 930 1,075Kerins [144] 2002 E(F2151 ), PB Virtex-E 2000-6 4,048 sl. F 5100 200Leong [167] 2002 E(F2113 ), NB Virtex 1000-6 1,410 sl. O 4300 230Leung [168] 2000 E(F2113 ), NB Virtex 300-4 1,290 sl. O 3700 270Liu [173] 2006 E(F2163 ), PB Virtex-II 2000 910 CLB and 1 BRAM 1900 530Lutz [177, 178] 2004 E(F2163 ), PB Virtex-E 2000 10,017 LUT, 1,930 FF 233 4,300Lutz [177, 178] 2004 E1(F2163 ), PB Virtex-E 2000 10,017 LUT, 1,930 FF 75 13,300McIvor [184] 2006 E(Fp), < 2256 Virtex-II Pro 125-7 15,755 sl., 256 mult. 3860 260Mentens [191] 2004 E(F2160 ), PB Virtex 800-4 150,678 equiv. gt. O 3801 260Morales-S. [199] 2006 E(F2163 ), PB Virtex-II 4000 7342 sl. O 1020 9801 sl. = slice, gt. = gate, BRAM = BlockRAM, FF = Flip-Flop, mult. = embedded multiplier, M4K and M512 = Stratix II embedded memory blocks2 F = Flexible, O = Other designs also included in the paper, C = Countermeasures against side-channel attacks, H = HECC, B = Both F2m and Fp,

M = Multiple point multiplications, I = Includes other operations, e.g., SHA-1, T = τ -adic converter Continued on next page. . .

—76

—

Reference Year Curve Device Area requirements1 Features2 t (µs) T (ops)

Nguyen [212] 2003 E(F2233 ), PB Virtex-II 6000 39% of 33,792 sl. O 2979 336Okada [214] 2000 E(F2163 ), PB Flex 10K 250-2 —— F 80300 12Okada [214] 2000 E1(F2163 ), PB Flex 10K 250-2 —— F 45600 22Orlando [220] 2000 E(F2167 ), PB Virtex-E 400-8 3002 LUT, 1769 FF, 10 BRAM O 210 4,800Orlando [221] 2001 E(Fp), P-192 Virtex-E 1000-8 11,416 LUT, 5,735 FF, 35 BRAM 3000 330

Ozturk [224] 2004 E(Fp), (2167 + 1)/3 0.13µm CMOS 34,390 gt. O 3100 320

Ors [222] 2003 E(Fp), 160-bit Virtex-E 1000-8 11,227 LUT, 6,959 FF 14414 69Peter [225] 2007 E(F2163 ), PB 0.25µm CMOS 1.0 mm2 (F)O 83 12,000Potgieter [227] 2002 E(F2163 ), PB Spartan-II 200 —— F 3776 260Rodrıguez-H. [241] 2004 E(F2191 ), PB Virtex-E 2600 17630 sl. 63 16,000Rosner [243] 1998 E(F(28)21 ), PB XC4062XL-1 1810 sl. O 4470 220

Sakiyama [249] 2006 E(F2163 ), PB Virtex-II Pro 30 8450 sl. FOH 280 3,600Sakiyama [251] 2007 E(Fp), 256-bit Spartan-III 5000-5 27597 sl. O 17700 56Sakiyama [250] 2007 E(F2163 ), PB 0.13µm CMOS 115000 gt. (F)OH 29 34,000Saqib [252] 2004 E(F2191 ), PB Virtex-E 3200 18314 sl., 24 BRAM 56.44 17,700Satoh [255] 2003 E(F2160 ), PB 0.13µm CMOS 106,659 gt. FB 190 5,300Satoh [255] 2003 E(Fp), 192-bit 0.13µm CMOS 106,659 gt. FB 1440 700Schroeppel [259] 2002 E(F(289)2 ), PB 0.5µm CMOS 191,000 gt. I 4400 230

Shu [262] 2005 E(F2163 ), PB Virtex-E 2000-7 25,763 LUT, 7,467 FF O 48 21,000Sozzani [270] 2005 E(F2163 ), PB 0.13µm CMOS 0.66 mm2 27 36,805Zhou [310] 2007 E(F2163 ), PB Stratix II S15C3 6589 ALUT, 5777 FF, memory I 750 1,300III 2004 E(F2163 ), PB Virtex-II 8000-5 18,076 sl. O 106 9,400IV 2007 E(F2163 ), NB Stratix II S180C3 11,800 ALM, several M4K O 48.88 20,458IV 2007 E1(F2163 ), NB Stratix II S180C3 13,472 ALM, several M4K OT 25.81 49,318V 2007 E1(F2163 ), NB Stratix II S180C3 67,467 ALM, 305 M4K, 240 M512 MT 114.2 166,000VI 2007 E1(F2163 ), PB Stratix II S180C3 26,148 ALM O 4.91 203,670VII 2008 E1(F2163 ), PB Stratix II S180C3 16,930 ALM, 21 M4K MT 16.36 161,290X 2006 E1(F2163 ), NB Virtex-II 2000-6 8,745 sl., 10 BRAM, 2 mult. T 35.75 28,000XI 2008 E1(F2163 ), NB Stratix II S180C3 8799 ALM, 36 M4K, 20 M512, 4 DSP T 35.04 28,500XI 2008 E1(F2163 ), NB Stratix II S180C3 28328 ALM, 66 M4K, 52 M512, 4 DSP T 13.38 74,7001 sl. = slice, gt. = gate, BRAM = BlockRAM, FF = Flip-Flop, mult. = embedded multiplier, M4K and M512 = Stratix II embedded memory blocks2 F = Flexible, O = Other designs also included in the paper, C = Countermeasures against side-channel attacks, H = HECC, B = Both F2m and Fp,

M = Multiple point multiplications, I = Includes other operations, e.g., SHA-1, T = τ -adic converter

—77

—

Mentens et al. [191] presented an implementation for Fp using Montgomerymultiplier. Rodrıguez-Henrıquez et al. [241] presented an implementation com-puting Montgomery point multiplication with Karatsuba multipliers over F2191 .The paper also studied parallelization and concluded that up to four multiplierscan be used in speeding up Montgomery point multiplication. Slightly improvedresults were presented in [252]. Bajracharya et al. [20] discussed implementa-tion on a reconfigurable computer similarly as in [212] with different designparameters and considerably improved results. Ozturk et al. [224] presented alow-power ASIC implementation for ECC over Fp. The cost of modular opera-tions was reduced by introducing modulus scaling techniques and a new inver-sion algorithm together with an efficient hardware implementation. Daneshbehand Hasan presented an ASIC-based implementation utilizing combined mul-tiplier and inverter in [73]. The implementation handled different irreduciblepolynomials and field sizes. Daly et al. [72] introduced an FPGA-based ALUsupporting AB/C operation in Fp using Montgomery representation.

In 2005, Sozzani et al. [270] discussed ASIC implementation of ECC and pre-sented an implementation capable of computing both Montgomery point multi-plication and binary method at the same time. This was achieved by prioritizingmultiplier for multiplication-intensive Montgomery point multiplication and in-verter for inversion-intensive binary method in A. ECC was also compared toRSA and it was concluded that ECC is both faster and more area efficient.Batina et al. [24] presented an implementation which takes side-channel attacksinto account on all design levels. Cheung et al. [59] introduced a customizable ar-chitecture and provided several implementations which were generated with anautomated design generator. They also independently showed how Montgomerypoint multiplication can be computed with four parallel multipliers similarlyas in [241]. Also the problem of comparing designs on different FPGAs wasaddressed by studying the implementations on different Xilinx FPGAs. Shuet al. [262] presented high-speed implementations utilizing parallel multipliersspecialized for certain operations.

In 2006, Batina et al. [25] compared ECC and HECC. They concluded thatHECC is faster but requires more area. However, they used binary method inP for ECC which is not the most efficient method. Liu et al. [173] proposeda method for computing point multiplication with shared LUTs in multiplierssuggested in [123]. Benaissa et al. [29] presented a new word-level multiplierfor arbitrary irreducible polynomials and provided point multiplication timingswith a processor using the multiplier. McIvor et al. [184] studied implementa-tion of ECC over Fp. Their processor included a new combined modular inverterand bit-parallel multiplier. Morales-Sandoval and Feregrino-Uribe evaluateddifferent implementation techniques and parallelization alternatives in [199].They targeted implementations especially to mobile applications. Sakiyamaet al. [249] presented a processor for both ECC and HECC. They exploitedparallelism in operations and provided comparisons between ECC and HECCand concluded that ECC is faster. Another study from the same authors [250],continuing the work of [249], described a parallelized ASIC implementation ofECC and HECC and also preferred ECC. Ansari and Hasan [12] introduced acarefully implemented processor where the multiplier is always kept busy withno idle cycles. The processor implemented Montgomery point multiplicationwith polynomial basis.

In 2007, Sakiyama et al. [251] presented a reconfigurable ALU supporting

— 78 —

RSA and ECC over 256-bit Fp. They concluded that ECC outperforms RSAby about factor of two. Chen et al. [58] presented a systolic array based archi-tecture for Fp arithmetic and used it for ECC. Ansari and Wu [13] presentedan ECC implementation employing an efficient digit-serial multiplier in poly-nomial basis and utilized parallelism in point operations which resulted in afast ECC implementation. Peter et al. [225] studied the effects of support-ing several irreducible polynomials in ECC implementations. They concludedthat supporting a few irreducible polynomials can be done without major de-crease in speed and increase in area. However, supporting arbitrary irreduciblepolynomials is costly but, nonetheless, considerably more efficient than soft-ware. Zhou et al. [310] presented an implementation on a Nios II soft-coreprocessor utilizing custom functions for multiplication and inversion. Hence,the implementation falls to the class (e) in the classification of Fig. 3.4. Theyconcluded that considerable speedup is achieved with the approach comparedto pure software on Nios II. In 2008, Chelton and Benaissa [57] presented animplementation which appears to be the fastest FPGA-based implementationusing general curves. Their processor utilized a carefully pipelined multiplierand supported custom instructions: multiply-accumulate, multiply-square, andrepeated-square. The processor achieved point multiplication times of 19.55µsand 33.05µs in a Virtex-4 and Virtex-E FPGAs, respectively.

As can be seen above, a large number of ECC implementations have beenpublished. The implementations target to various applications with differ-ent requirements and introduce numerous different implementation techniques.Hence, the following discussion is presented in order to provide an overview ofthe implementations and to ensure fairer comparison in Sec. 6.5.5.

6.5.1 Typical Structure of ECC Processors

The most typical structure of an ECC implementation is depicted in Fig. 6.2.Such a structure can be identified from a large majority of publications discussedabove. It can be found also from implementations presented in III–V, X, andXI. The structure includes a Field Arithmetic Processor (FAP) which supportsarithmetic in Fq. FAPs always include dedicated processing units at least formultiplication and addition and memory or registers for storing elements of Fq.An FAP implements the lowest level of the hierarchy of Fig. 6.1 and higherlevels are implemented in control logic with a Finite State Machine (FSM)and/or microcode. If both FSM and microcode are used, the FSM implementsthe highest level of the hierarchy while point operation level is implementedwith the microcode.

Depending on Fq, other operations besides multiplication and addition arealso commonly supported. Squaring is typically supported for F2m whereas Fpimplementations usually include an inverter. The reasons are obvious: Squaringis very cheap in F2m , especially in normal basis or with a fixed irreducible poly-nomial, and inversion in Fp would be very time consuming without a dedicatedinverter. If an inverter is not available, inversion is computed with Fermat’s Lit-tle Theorem, (4.19). However, inverters are commonly included also for F2m , es-pecially, with polynomial basis. Other operations that may be supported includeshifts, comparisons, and custom-instructions, such as multiply-and-accumulate.

Control logic for an FAP is usually, but not always (see e.g. [115]), includedin implementations presented in the papers. FAPs which do not include control

— 79 —

FAP

Memory/Register file Inverter(∗)Multiplier

Adder/Subtractor Squarer(∗)

Q

Control logic(∗)

Microcode(∗)FSMk

P

Figure 6.2: Typical structure of an ECC implementation. (*) not necessarily availablein all FAPs.

logic can be used only in a close relationship with a controlling processor, i.e., inthe classification of Fig. 3.4, they can be used as coprocessors (c), reconfigurablefunction units (d), or embedded processing units (e). All implementations pre-sented in the publications of this thesis include control logic and, therefore, theycan be used in all roles given in Fig. 3.4, although their sizes probably preventtheir use as a reconfigurable function unit.

6.5.2 Parallelism in ECC Processors

The possibility of using parallelism is the key factor that gives advantage forFPGA (and ASIC) implementations, but the use of parallelism in ECC is farfrom trivial. Because multiplications dominate in costs of point operations (ifprojective coordinates are in use), efficient computation of multiplications isemphasized in almost all published implementations. Use of parallelism is easyin finite field operations as discussed in Ch. 4. However, highly parallelizedmultipliers, i.e., bit-parallel or digit-serial with large digit size, have poor area-latency efficiencies which degrade their feasibility in practice. Besides, when anarea of a multiplier grows, routing in the place & route becomes much harderwhich degrades results even further. These aspects were studied in IV.

Another approach is to reduce the number of multiplications on the criticalpath by using multiple multipliers. This utilizes parallelism in point operations.

— 80 —

However, data dependencies prevent achieving full multiplier utilization withparallel multipliers which leads to poor area-latency efficiency. For example,point addition and point doubling of Montgomery point multiplication can becomputed with critical paths of six, three, or two multiplications with one, two,or four multipliers, respectively [59, 241]. Hence, full multiplier utilization canbe achieved only with one or two multipliers.

The situation is even worse when parallelism is utilized on the highest levelof the hierarchy of Fig. 6.1. Point multiplication algorithms are recursive and,thus, contain only limited amounts of intrinsic parallelism. For example, pointoperations in Alg. 6.1 must be computed recursively because the result of a pointaddition is needed in point doublings, and vice versa. However, point additionand point doubling can be computed in parallel in Montgomery point multi-plication, Alg. 6.2, [241, 262] and right-to-left binary method [212]. Beyondthat, consecutive point operations must be computed recursively, which limitsthe amount of exploitable parallelism. However, this problem can be circum-vented with Koblitz curves by exploiting the inexpensiveness of Frobenius mapsas shown in IV. It is also possible to use more parallelism in point operationsby interleaving successive point additions as shown in VI. Sozzani et al. [270]took another approach by computing two point multiplications simultaneouslywith different algorithms, one of which is multiplication-intensive and the otherwhich is inversion-intensive.

6.5.3 Flexibility of ECC Processors

In many practical applications, implementations must handle elliptic curve op-erations with various different parameters, such as curve, field size, etc. In suchapplications, support for all parameters must be implemented. This support isreferred to as flexibility. It is one of the most important features determiningspeed and area requirements of an implementation and, therefore, comparingdesigns with different levels of flexibility is difficult.

Many implementations having the structure of Fig. 6.2 provide inherent flex-ibility on the two highest levels of the hierarchy of Fig. 6.1, e.g., point operationscan be modified easily by uploading a new microcode. On the other hand, thesame point multiplication algorithms and even point operation formulae can beused for a variation of curves which limits the need for such flexibility. However,if an FAP fails to support different finite fields, an implementation using thatFAP is always limited to only few elliptic curves, and it cannot be consideredflexible. Hence, support for various finite fields is the feature that has the mostcrucial effect on flexibility.

At simplest, limited field flexibility can be achieved by including specificcircuitries for a small number of fields. For example, support for a few binaryfields with polynomial basis can be achieved with specific reduction circuitriesfor a few irreducible polynomials. This approach provides adequate flexibilityfor many applications without major decrease in speed. However, the approachis not viable if a large number of fields must be supported.

Reduction with an arbitrary irreducible polynomial is iterative which makesit inherently slow. However, it can be computed fairly efficiently with methodssuch as the partial reduction [115]. On the other hand, fixed reduction circuitrycan perform an entire reduction in one clock cycle with a high clock frequencywith a few XORs if the polynomial is sparse (e.g., trinomial or pentanomial).

— 81 —

Combination of the two above approaches, i.e., combining the support forboth arbitrary and specific irreducibles as shown in [87, 114, 115], is a viablesolution in many practical applications. Reduction circuitries for specific irre-ducible polynomials provide fast computation for a few frequently used fieldswhile arbitrary fields are supported by a slower generic reduction circuitry. How-ever, even this approach restricts to binary fields with polynomial basis.

Normal bases do not suit well for flexible designs. Although it is easy todesign flexible adders and squarers for normal bases, multipliers supportingarbitrary fields cannot be implemented efficiently, because they would need tosupport multiple F -functions; see Sec. 4.3.2. Therefore, an implementationshould include circuitry for computing different F -functions on-the-fly and itshould be able to update the F -function circuitry accordingly. At least to theauthor’s knowledge, such multipliers have not been published, and it is hard tosee how such multipliers could be devised without serious reductions in speedand increases in area.

Thus far, only flexibility of F2m has been discussed. Support for both F2m

and Fp can be added by utilizing techniques discussed in Sec. 4.5. It should benoted that Fp is slower than F2m , and speedups compared to software are alsoconsiderably lower.

Flexibility is more crucial for ASICs than for FPGAs because FPGAs canprovide flexibility through reprogramming; see the advantages provided by re-programmability from Sec. 3.2. Of course, if parameters need to be changedfrequently, reprogramming times start to dominate which makes this approachunpractical and, thus, flexible FAPs may be needed in FPGAs as well. Any-how, design flexibility is undoubtedly more important in ASIC implementationswhere designs cannot be changed after manufacturing.

6.5.4 Optimizations for Specific Curves or Algorithms

The structure of Fig. 6.2 can be applied in elliptic curve point multiplicationover all sets of parameters, e.g., coordinate systems. However, structures opti-mized for certain operations have been presented and they provide considerablespeed increases with the expense of reduced generality. A specific structure forMontgomery point multiplication was presented in [241]. Point addition andpoint doubling are computed in parallel with processing units build around onemultiplier and several registers. Control logic is very simple: Only the bit kiis used for selecting the inputs of point addition and point doubling processingunits, see Alg. 6.2. However, the structure does not include logic for (6.12), i.e.,it outputs only X and Z coordinates of the result point, Q. Similar approachwas presented in [262], but their implementation included a specific processingunit also for coordinate conversions.

The structure of Fig. 6.2 can be optimized for Koblitz curves by attachingsquarers to the registers holding Q [178]. This enhances the performance onKoblitz curves without sacrificing generality. The architecture of VI can beseen as an adaptation of the ideas of [241, 262] to Koblitz curves, althoughthe architecture also utilizes point operation interleaving allowing even furtherimprovements in speed. The structure of VII builds upon VI and introducesspecific units for precomputations, τ -adic conversions, for-loop computation,and coordinate conversion.

Complex instructions, e.g., A(B+D)+C in [250] or a(x)b(x)/c(x) mod p(x)

— 82 —

in [144], can be seen as optimizations for specific algorithms. For example,a(x)b(x)/c(x) mod p(x) is useful for point operations in A, but it has little usewith projective coordinates.

6.5.5 Comparisons

Details of published ECC designs have been collected into Table 6.3. Fair com-parison is extremely difficult (confer AES implementations) and all issues dis-cussed in Sec. 5.3.3 cause difficulties also for ECC comparisons. However, com-paring ECC designs is even more difficult because the variety of implementedalgorithms is larger and this variety has great impacts on implementation re-sults. For instance, curves and field sizes have major influences on both speedand area requirements, as can be seen, for example, from the tables in VI whichlist implementation results with different field sizes. Nevertheless, certain imple-mentations that represent the state-of-the-art are given in the following togetherwith estimates of how designs presented in III–VII, X, and XI compare withother published designs.

To the author’s knowledge, the fastest design for general curves was pre-sented by Chelton and Benaissa in [57] where point multiplication time 19.55µswas achieved on NIST B-163 with a Virtex-4 FPGA. The fastest design forKoblitz curves is given in VI for NIST K-163 on which point multiplication takes4.91µs in a Stratix II FPGA excluding τ -adic conversions. When throughputis considered instead of computation time, the implementations presented in Vand VII outperform other published designs, because they are the only onesoptimized for high throughput, with the exception of [270], i.e., throughput issimply the inverse of computation time for other designs. The most compactimplementations given in Table 6.3 use parameters which are insecure, but com-pact implementations using secure parameters are given, for example, by Liuet al. in [173] and Orlando and Paar in [220]. Compact implementations forboth general and Koblitz curves are listed also in IV. However, very compactFPGA-based implementations are still missing from the literature. Support forarbitrary irreducible polynomials was achieved efficiently by Eberle et al. in [87]with a technique called partial reduction. They also included fixed reductioncircuitries for specific irreducible polynomial in order to accelerate their com-putation [87, 114, 115]. When support for both F2m and Fp is needed, a goodoption is the implementation presented in [255].

Table 6.3 shows that the designs of this thesis compare favorably with otherdesigns from the literature. The implementations are faster than most otherpublished implementations and, in fact, VI and VII utilizing Koblitz curvesrepresent the fastest published ECC implementations. The fastest general curveimplementation, which was presented in [57], outperforms the implementationspresented in IV, but it used polynomial basis whereas IV uses a normal basis.Table 6.3 also shows that Koblitz curves give considerable enhancements inspeed compared to general curves. They are more than twice as fast as generalcurves even including the time of τ -adic conversions. Koblitz curves are thereforehighly feasible alternatives in applications requiring very fast computation.

The area requirements of implementations of this thesis are quite large. Theyprevent using the implementations in constrained applications, but the area re-quirements are in line with other published implementations. Besides, the valuesgiven in Table 6.3 reflect only the area requirements of the fastest implementa-

— 83 —

tions and smaller implementations were also presented in the publications. Allarchitectures presented in the publications are also scalable in the sense thatspeed can be traded off for smaller area. In the publications, the target has beenin high speed which has resulted in large area requirements but, for example,Tables VII and VIII of IV show that compact designs can be realized with thesame architectures.

The use of fixed parameters, especially, fixed fields, give a large speed advan-tage compared to flexible designs and, hence, the implementations of this thesisare not comparable with flexible designs. It has been shown that fixed designsare at least about twice as fast as flexible designs [87]. However, because theimplementations of this thesis use FPGAs, the disadvantages of fixed designsare reduced as flexibility can be achieved in many cases by reprogramming theFPGA.

— 84 —

Chapter 7

Results

This chapter summarizes the main research results of this thesis and identi-fies their significance to the research field of cryptographic implementation.

Details, such as exact computation times, etc., are not considered in this chapterbut they are available in the appended publications. The contributions of thisthesis are twofold as they relate to either AES or ECC, and they are discussedseparately in Secs. 7.1 and 7.2, respectively.

7.1 AES-related Contributions

AES was discussed in I, which surveyed and compared AES implementations,and II, which presented a fast AES implementation for FPGAs. The followingtwo sections discuss their results.

7.1.1 AES Surveys

The survey provided in I gave an overview of secret-key cryptographic imple-mentations using FPGAs, as well as hash algorithms. The emphasis was onAES and the survey was the first review concentrating to high-speed implemen-tations; to the author’s knowledge, FPGA implementations of AES had beenreviewed only in [293] where the emphasis was on security questions of FPGAsas implementation platforms. The implementations were compared both interms of throughput per slice and throughput per area which consisted of theslice and BlockRAM utilization. A fairer comparison was provided than inother comparisons which had only considered slice count. However, even thenew metric cannot provide an unconditionally fair comparison as discussed inSec. 5.3.3. Nonetheless, the survey provided valuable information of the stateof FPGA-based AES implementations at the time of publication.

The survey presented in Ch. 5 should be considered as a contribution of thisthesis as well because, to the author’s knowledge, it is the most comprehensiveand up-to-date review of published ASIC and FPGA implementations availablein the literature at the time of writing this thesis. Thus, it complements otherreviews previously presented in the literature, such as I and [118, 120, 293].

— 85 —

7.1.2 Fast AES Implementation

The main contribution of the implementation of II was its speed. In retrospec-tive, the approach taken in II is not the most efficient because it computes theentire encryption in F(24)2 . The design could be improved also in many other as-pects; for instance, the pipeline should be balanced better and the isomorphismsshould be optimized. Nonetheless, the implementation of II was the fastest pub-lished implementation of AES-128 encryption at the time of publication andpresented the first published FPGA implementation utilizing composite fields.Hence, despite it weaknesses from the current point-of-view, it represented thestate-of-the-art in achieved throughput at the time of publication, and it is stillcited many recent publications.

7.2 ECC-related Contributions

The main contributions of this thesis relate to ECC, and ECC was the subjectin III–XI. The focus was on high-speed implementation using parallelism. Op-erations of ECC are recursive in nature which restricts the use of parallelism. Inthis thesis, methods enabling increased parallelism were studied and developed.They resulted in faster implementations than existing methods, as shown inthe comparison of Sec. 6.5.5. The studies considered both general and Koblitzcurves, the emphasis being in Koblitz curves.

The implementations presented in IV–XI use curves specified in [205], i.e.,the NIST curves, and III uses curves from [52] recommended by SECG. Despitethis fact, the contributions of this thesis are curve independent in the sense thatthey can be easily adapted to curves defined in different standards, or evento non-standardized curves. Of course, the methods developed specifically forKoblitz curves set restrictions for curve selection but even they are not restrictedto any particular Koblitz curves.

7.2.1 ECC Implementations for General Curves

General curves were studied in III and IV. A fast and scalable architecturesuitable for an automated VHDL generator [137] was presented in III. Thearchitecture was optimized for the 4-to-1-bit LUT structure of the target FPGA.It was shown to result in very fast point multiplication times, including someof the fastest results available in the literature at the time of publication. Inretrospective, the weakest link in the implementation was the use of a largecontrol FSM which resulted in a large circuitry.

One of the key publications of this thesis, IV, also considered general curves.The main contribution of IV concerning general curves was that it illuminatedthe effects of parallelization in ECC implementations which was not adequatelyaddressed in the existing literature. A generic architecture was presented andtools were provided for analyzing parallelization on different levels of the hier-archy of Fig. 6.1. The paper concluded that large field multipliers are inefficientand better results are obtained with several smaller multipliers in parallel.

— 86 —

7.2.2 Utilizing Parallelism with Koblitz Curves

Koblitz curves were considered in IV–XI. The main contribution of IV, es-pecially useful for Koblitz curves, was the method to parallelize point multipli-cation for several FAPs. The method resulted in implementations which werefaster and required less area than traditional implementations utilizing eitherone large field multiplier or several smaller ones.

Throughput maximization with parallel FAPs was studied in V. It wasconcluded that allowing slightly longer computation times for single operationsresults in a considerably higher throughput because more parallel FAPs fit intoan FPGA. This conclusion is, again, due to the fact that multipliers with largedigit sizes are inefficient. Improvements to precomputation algorithms relatedto three-term multiple point multiplications were also developed and they wereshown to provide major enhancements compared to a straightforward approach.To the author’s knowledge, V was the first paper presenting ECC implemen-tations which prioritize throughput over computation time of a single pointmultiplication. Arguably, the focus in high-speed ECC implementations willshift to high throughput in the future. The reasons are that the computationtimes achievable with current knowledge are adequate for most applications, butthere is still a need for higher throughput. Furthermore, increasing throughputby shortening computation time increases area considerably, whereas parallelprocessing leads to high throughput with reasonable area, as shown in V.

The parallelization methods presented in IV and V utilized parallelism onthe highest level of the hierarchy of Fig. 6.1. Data dependencies restrict efficientuse of parallelism on the middle level of the hierarchy, but observations madein VI helped to circumvent this problem by interleaving consecutive point ad-ditions. Hence, the method utilized parallelism on both the highest and middlelevels of the hierarchy. The proposed method was shown to be highly feasibleand scalable in practice. It resulted in the shortest point multiplication timesavailable in the literature today. The same idea was used in VII, which in-troduced also further optimizations by utilizing similar structures of windowmethods and multiple point multiplications. An FPGA implementation withboth short computation time and high throughput was achieved by designingoptimized processing units for different parts of the algorithms.

These results prove that highly specialized implementations, which use largeamounts of parallelism and features of specific curves, can provide very largespeedups compared to more general implementations. Such specializations canbe done in FPGAs without major reductions in generality of the system becausean FPGA can be reprogrammed if other parameters are needed. The results alsodemonstrate the efficiency of Koblitz curves in hardware implementations. Priorto this thesis, only few publications had discussed implementations using Koblitzcurves, but this thesis shows that Koblitz curves give major speed improvementseven with τ -adic conversions. Hence, the results of this thesis can improvethe attractiveness of Koblitz curves in hardware implementations and, as aconsequence, also their use in practical systems.

7.2.3 Efficient Converters for Koblitz Curves

Because point multiplication times have been reduced to only a few microsec-onds, the traditional approach to compute τ -adic conversions in host processors

— 87 —

as in [178, 214] is not a viable solution as it becomes the bottleneck. Hence,efficient hardware implementations of τ -adic conversions are necessary. Hard-ware implementations of τ -adic conversions were not presented in the literatureprior to VIII and IX and, thus, they delivered undeniable novelty. Algorithmmodifications and an efficient implementation for converting integers into theτNAF were presented in VIII. The implementation was later used in IV, V,and VII. The converter from IX performs conversions to the other direction,from τ -adic representation to binary integer. Such conversions are useful inapplications where random τ -adic expansions are used but also their integerequivalents are needed. Such applications include, for example, ECDSA withKoblitz curves. They could be realized with the converter of VIII as well, butthe conversions of IX provide performance improvements because they can becomputed in parallel with point multiplications.

7.2.4 Adaptation of the DBNS to Koblitz Curves

An adaptation of the DBNS to τ -adic representations was proposed in X wherek was represented as

∑i,j ki,jτ

i(τ − 1)j . The representation was considerablysparser than the τNAF and, as a result, achieved point multiplication timeswere shorter. The paper also presented a variation using three bases which hasserious theoretical significance because it results in the first provably sublinearpoint multiplication algorithm for Koblitz curves. Different adaptations withthe form

∑i,j ki,jτ

i3j were proposed almost at the same time with X in [15,16], but their practical usefulness was not verified. However, the results ofthe FPGA implementation proved that the representation suggested in X wasindeed practical. The work of X was extended to a journal paper, XI, includingalso a study of the inherent parallelism included in the DBNS. It was shownthat exploiting this parallelism leads to a very fast FPGA implementation.

— 88 —

Chapter 8

Conclusions

This thesis studied implementation of cryptographic algorithms. The sub-ject has importance in many practical systems which use cryptographic al-

gorithms for improving security. Cryptographic algorithms are omnipresent inmodern communication in which information security, such as confidentialityof communication or reliable authentication, are absolute necessities. The the-sis studied techniques enabling fast computation of cryptographic algorithmswhich cannot be achieved with software-based solutions. Hardware accelerationis commonly required in important applications, such as heavily-loaded networkservers. Efficient implementation techniques are necessary in order to providehigh-security cryptographic algorithms to such applications and, hence, the sub-ject has been studied intensively in both academia and industry. The focus ofthe thesis was on a secret-key cryptographic algorithm, AES, which was studiedin I and II, and public-key cryptography algorithms, ECC, which were the topicof III–XI. The most important contributions of the thesis consider ECC.

Because cryptographic algorithms are used in a variety of applications rang-ing from low cost environments, such as RFID tags, to high speed systems,such as optical networks, the requirements set for cryptographic implementa-tions differ. The most important requirements in low cost environments arelow resource utilization and power consumption whereas high speed systems re-quire short computation times and high throughputs. These requirements areusually out of reach with general-purpose microprocessors which are too slowand power consuming. On the other hand, ASICs are often too expensive andthey lack flexibility which is essential for many cryptosystems. All presentedimplementations of this thesis targeted to FPGAs. However, many of the ideasand architectures can be adapted to ASICs as well. The focus of implementa-tions was on accelerating computation and, hence, the target applications areenvironments which require very high computation speed.

In the introduction, Ch. 1, the scope of this thesis was described as findinganswers to certain questions. The first ones asked which algorithms are themost suitable for hardware and how should they be improved in order to in-crease suitability. As shown in this thesis, the answers to these questions dependmuch on the implementation platform and target application. For instance, in

— 89 —

AES, both memory-based and combinatorial approaches, discussed in Sec. 5.2.1and 5.2.2, respectively, have merits. Their mutual superiority depends on howmuch memory is available and whether operations are pipelined, etc. On theECC side, this thesis showed that Koblitz curves are highly suitable also forhardware implementations and they result in significant speedups compared togeneral curves. It was also asked what kind of implementations provide the bestresults. This thesis has answered this question in many ways. For instance, al-though it is obvious that parallelism can improve the speed of implementations,this thesis showed that it must be carefully studied where to focus parallelismin order to maximize efficiency.

The thesis presented improvements both to algorithms and techniques forimplementing them. Improvements to algorithms included modifications thatmade them more suitable for hardware implementations and reorganizationsthat enabled increased amount of parallelism. Practical implementations werepresented in all publications and they exploited the results of the algorithmmodifications. Many of the implementations were the fastest ones available inthe literature at the time of publication which verifies the efficiency of bothimplementations and techniques used in them.

Next, the contributions of this thesis are shortly revisited, and certain topicsfor future research are pointed out, first, for AES and, then, for ECC. Efficientimplementation techniques of AES have been studied extensively in the litera-ture and they have considerable importance in many practical systems becauseof the wide distribution of AES. AES-related contributions of this thesis werethe surveys and comparisons of high-speed FPGA-based implementations (I,Ch. 5) and an efficient combinatorial implementation (II). It can be concluded,based on the discussion of Ch. 5, that hardware implementation of AES is amature field of research and it is unlikely that any new techniques considerablyimproving implementations will emerge. This conclusion is supported by thefacts that techniques to implement AES have been extensively studied alreadyfor several years and methods to implement AES in practically all applicationsexist. However, introduction of some new features in FPGAs can produce asmall renaissance to the field because they can open new directions for imple-mentations. Indications of such possibilities can be found from the very recentpapers [42, 86] which utilized the new Virtex-5 architecture. However, as statedin [42], such improvements arise directly from the new architecture and do notnecessarily include any theoretical novelty.

As mentioned, the most important contributions relate to ECC. They in-clude techniques increasing speed of ECC computations, especially studies onparallelization of ECC operations, and issues related to Koblitz curves. Sev-eral methods and architectures were presented for Koblitz curves including newmethods to utilize parallel processing (IV, V), a method to interleave point ad-ditions (VI, VII), architectures for conversions needed for Koblitz curves (VIII,IX), and an adaptation of the DBNS which was shown to be highly feasible withan FPGA implementation (X, XI). There exist a plethora of published ECCdesigns as well. Despite the amount of work that has been done, the field ofECC implementations has not been fully matured and certain open problemsstill exist. Although basic solutions for implementing ECC are available andwell-known, means to include ECC in very demanding environments requiring,e.g., very low cost or high throughput, are not fully known. Minimizing arearequirements is indispensable in order to provide ECC, or any kind of high-

— 90 —

security public-key cryptography for that matter, to low-cost environments. Ithas started to gain interest in the community [23, 116] and will probably bean important research topic in the near future. Combining low area require-ments with efficient side-channel countermeasures, which are essential becauseof the viability of side-channel attacks in low-cost environments, is a challengingtask that requires more research. On the other hand, the focus in high-speedimplementations will probably shift to high throughput in the future. The rea-sons are that, arguably, adequately short computation times are achievable withcurrent knowledge, but there would still be need for higher throughput. Nat-urally, throughput can by improved by shortening computation times but suchan approach easily leads to poor area efficiency, whereas parallel processing pro-vides more area efficient solutions, as shown in V. To the author’s knowledge,the first studies focusing on maximizing throughput were presented in V andVII. New algorithms and curves providing more efficient computations havebeen introduced even recently, e.g., [31], which naturally increases also the needfor hardware implementation related research. Another topic, which will prob-ably attain interest in the community, is implementation of hash algorithms.When NIST decides the finalists of the new hash algorithm competition, theirhardware implementation will likely be an active research topic.

To conclude, this thesis has increased knowledge of how to accelerate com-putation of mainstream cryptographic algorithms with hardware accelerators,implemented specifically in FPGAs. The most important results are those en-abling faster computation of ECC by using increased parallelism because theyshow that high-security public-key cryptography can be used also in computa-tionally demanding systems. The publications of this thesis include the fastestreported ECC implementations currently available in the literature. The contri-butions of the thesis can be directly adapted to practical systems. Consequently,they can—in their part—enable increasing use of high-security cryptographicalgorithms in various practical systems and, thus, improve the quality of suchsystems.

— 91 —

— 92 —

Bibliography

[1] Agnew, G.B., Mullin, R.C., Onyszchuk, I.M., and Vanstone, S.A., “An im-plementation for a fast public-key cryptosystem,” Journal of Cryptology,vol. 3, no. 2, Jan. 1991, pp. 63–79.

[2] Agnew, G.B., Mullin, R.C., and Vanstone, S.A., “An implementation ofelliptic curve cryptosystems over F2155 ,” IEEE Journal on Selected Areasin Communications, vol. 11, no. 5, Jun. 1993, pp. 804–813.

[3] Al-Daoud, E., Mahmod, R., Rushdan, M., and Kilicman, A., “A newaddition formula for elliptic curves over GF (2n),” IEEE Transactions onComputers, vol. 51, no. 8, Aug. 2002, pp. 972–975.

[4] Alam, M., Ray, S., Mukhopadhayay, D., Ghosh, S., RoyChowdhury, D.,and Sengupta, I., “An area optimized reconfigurable encryptor for AES-Rijndael,” in Proceedings of the Conference and Exhibition on Design,Automation and Test in Europe, DATE 2007, Nice, France, Apr. 16–20,2007, pp. 1116–1121.

[5] Altera Corporation, “Nios II processor reference handbook,” Manual,ver. 7.2.0, Oct. 2007, http://www.altera.com/literature/hb/nios2/n2cpunii5v1.pdf (Oct. 9, 2008).

[6] ———, “Stratix II device handbook,” Data sheet, May 2007, http://www.altera.com/literature/hb/stx2/stratix2 handbook.pdf (Oct. 9, 2008).

[7] ———, “Stratix III device handbook,” Data sheet, Jul. 2008,http://www.altera.com/literature/hb/stx3/stratix3 handbook.pdf (Oct.9, 2008).

[8] American National Standards Institute (ANSI), “Public key cryptographyfor the financial services industry: The elliptic curve digital signaturealgorithm (ECDSA),” ANSI X9.62, 1999.

[9] ———, “Public key cryptography for the financial services industry: Keyagreement and key transport using elliptic curve cryptography,” ANSIX9.63, 2001.

— 93 —

[10] Anderson, R., Bond, M., Clulow, J., and Skorobogatov, S., “Crypto-graphic processors—a survey,” Proceedings of the IEEE, vol. 94, no. 2,Feb. 2006, pp. 357–369.

[11] Anderson, R.J., “Why cryptosystems fail,” Communications of the ACM,vol. 37, no. 11, Nov. 1994, pp. 32–40.

[12] Ansari, B. and Anwar Hasan, M., “High performance architecture of el-liptic curve scalar multiplication,” Techical Report CORR 2006-01, Uni-versity of Waterloo, Canada, 2006.

[13] Ansari, B. and Wu, H., “Efficient finite field processor for GF (2163) and itsVLSI implementation,” in Proceedings of the 4th International Conferenceon Information Technology: New Generations, ITNG 2007, Las Vegas,Nevada, USA, Apr. 2–4, 2007, pp. 1021–1026.

[14] Ash, D.W., Blake, I.F., and Vanstone, S.A., “Low complexity normalbases,” Discrete Applied Mathematics, vol. 25, no. 3, Nov. 1989, pp. 191–210.

[15] Avanzi, R., Dimitrov, V.S., Doche, C., and Sica, F., “Extending scalarmultiplication using double bases,” in Proceedings of the 12th Interna-tional Conference on the Theory and Application of Cryptology and Infor-mation Security, Advances in Cryptology — ASIACRYPT 2006, Shang-hai, China, Dec. 3–7, 2006, Lecture Notes in Computer Science, vol. 4284,Springer, pp. 130–144.

[16] Avanzi, R. and Sica, F., “Scalar multiplication on Koblitz curves usingdouble bases,” in Revised Selected Papers of the 1st International Confer-ence on Cryptology in Vietnam, Progress in Cryptology — VIETCRYPT2006, Hanoi, Vietnam, Sep. 25–28, 2006, Lecture Notes in Computer Sci-ence, vol. 4341, Springer, pp. 131–146.

[17] Avanzi, R.M., Ciet, M., and Sica, F., “Faster scalar multiplication onKoblitz curves combining point halving with the Frobenius endomor-phism,” in Proceedings of the 7th International Workshop on Theory andPractice in Public Key Cryptography, PKC 2004, Singapore, Mar. 1–4,2004, Lecture Notes in Computer Science, vol. 2947, Springer, pp. 28–40.

[18] Avanzi, R.M. and Cohen, H., “Compositeness and primality testing—factoring,” in Handbook of Elliptic and Hyperelliptic Curve Cryptography(H. Cohen and G. Frey, eds.), chap. 25, Chapman & Hall/CRC, 2006, pp.591–614.

[19] Avanzi, R.M., Heuberger, C., and Prodinger, H., “Minimality of the Ham-ming weight of the τ -NAF for Koblitz curves and improved combinationwith point halving,” in Revised Selected Papers of the 12th InternationalWorkshop on Selected Areas on Cryptography, SAC 2005, Kingston, On-tario, Canada, Aug. 11–12, 2005, Lecture Notes in Computer Science, vol.3897, Springer, pp. 332–344.

[20] Bajracharya, S., Shu, C., Gaj, K., and El-Ghazawi, T., “Implementationof elliptic curve cryptosystems over GF (2n) in optimal normal basis on

— 94 —

a reconfigurable computer,” in Proceedings of the International Confer-ence on Field Programmable Logic and Application, FPL 2004, Antwerp,Belgium, Aug. 29–Sep. 1, 2004, Lecture Notes in Computer Science, vol.3203, Springer, pp. 1001–1005.

[21] Barreto, P.S.L.M. and Rijmen, V., “The Whirlpool hashing function,” On-line document, May 24, 2003, http://planeta.terra.com.br/informatica/paulobarreto/whirlpool.zip (Oct. 9, 2008).

[22] Barrett, P., “Implementing the Rivest Shamir and Adleman public keyencryption algorithm on a standard digital signal processor,” in Advancesin Cryptology — CRYPTO ’86, Santa Barbara, California, USA, Aug. 11–15, 1986, Lecture Notes in Computer Science, vol. 263, Springer, pp. 311–323.

[23] Batina, L., Guajardo, J., Kerins, T., Mentens, N., Tuyls, P., and Ver-bauwhede, I., “Public-key cryptography for RFID-tags,” in Proceedings ofthe 5th IEEE International Conference on Pervasive Computing and Com-munications, PerCom 2007, White Plains, New York, USA, Mar. 19–23,2007, pp. 217–222.

[24] Batina, L., Mentens, N., Preneel, B., and Verbauwhede, I., “Side-channelaware design: Algorithms and architectures for elliptic curve cryptographyover GF (2n),” in Proceedings of the 16th International Conference onApplication-Specific Systems, Architecture and Processors, ASAP 2005,Samos, Greece, Jul. 23–25, 2005, pp. 350–355.

[25] ———, “Flexible hardware architectures for curve-based cryptography,”in Proceedings of the 2006 IEEE International Symposium on Circuitsand Systems, ISCAS 2006, Island of Kos, Greece, May 21–24, 2006, pp.4839–4842.

[26] Batina, L., Ors, S.B., Preneela, B., and Vandewalle, J., “Hardware ar-chitectures for public key cryptography,” Integration, the VLSI Journal,vol. 34, no. 1-2, 2003, pp. 1–64.

[27] Bednara, M., Daldrup, M., Teich, J., von zur Gathen, J., and Shokrollahi,J., “Tradeoff analysis of FPGA based elliptic curve cryptography,” in Pro-ceedings of the 2002 IEEE International Symposium on Circuits and Sys-tems, ISCAS 2002, Phoenix-Scottdale, Arizona, USA, May 26–29, 2002,vol. 5, pp. 797–800.

[28] Bednara, M., Daldrup, M., von zur Gathen, J., Shokrollahi, J., and Teich,J., “Reconfigurable implementation of elliptic curve crypto algorithms,”in Proceedings of the International Parallel and Distributed ProcessingSymposium, IPDPS 2002, Reconfigurable Architectures Workshop, RAW2002, Ft. Lauderdale, Florida, USA, Apr. 15–19, 2002, pp. 157–164.

[29] Benaissa, M. and Lim, W.M., “Design of flexible GF (2m) elliptic curvecryptography processors,” IEEE Transactions on Very Large Scale Inte-gration (VLSI) Systems, vol. 14, no. 6, Jun. 2006, pp. 659–662.

[30] Berlekamp, E.R., Algebraic Coding Theory, McGraw-Hill, 1968.

— 95 —

[31] Bernstein, D.J., Lange, T., and Farashahi, R.R., “Binary Edwardscurves,” in Proceedings of the Workshop on Cryptographic Hardware andEmbedded Systems, CHES 2008, Washington, DC, USA, Aug. 10–13, 2008,Lecture Notes in Computer Science, vol. 5154, Springer, pp. 244–265.

[32] Blake, I., Seroussi, G., and Smart, N., Elliptic Curves in Cryptography,London Mathematical Society Lecture Notes Series, vol. 265, CambridgeUniversity Press, 1999.

[33] Bogdanov, A., Knudsen, L.R., Leander, G., Paar, C., Poschmann, A.,Robshaw, M.J.B., Seurin, Y., and Vikkelsoe, C., “PRESENT: An ultra-lightweight block cipher,” in Proceedings of the Workshop on Crypto-graphic Hardware and Embedded Systems, CHES 2007, Vienna, Austria,Sep. 10–13, 2007, Lecture Notes in Computer Science, vol. 4727, Springer,pp. 450–466.

[34] Bouldin, D., “Enhancing electronic systems with reconfigurable hard-ware,” IEEE Circuits and Devices Magazine, vol. 22, no. 3, May–Jun.2006, pp. 32–36.

[35] Brauer, A., “On addition chains,” Bulletin of the American MathematicalSociety, vol. 45, no. 10, Oct. 1939, pp. 736–739.

[36] Brickell, E.F., Gordon, D.M., McCurley, K.S., and Wilson, D.B., “Fastexponentiation with precomputation,” in Proceedings of the Workshopon the Theory and Application of Cryptographic Techniques, Advancesin Cryptology — EUROCRYPT ’92, Balatonfured, Hungary, May 24–28,1992, Lecture Notes in Computer Science, vol. 658, Springer, pp. 200–207.

[37] Brokalakis, A., Kakarountas, A.P., and Goutis, C.E., “A high-throughputarea efficient FPGA implementation of AES-128 encryption,” in Proceed-ings of the IEEE Workshop on Signal Processing Systems Design and Im-plementation, SiPS 2005, Athens, Greece, Nov. 2–4, 2005, pp. 116–121.

[38] Brumley, B.B., “Efficient three-term simultaneous elliptic scalar multipli-cation with applications,” in Proceedings of the 11th Nordic Workshop onSecure IT Systems, NordSec 2006, Linkoping, Sweden, 2006, pp. 105–116.

[39] ———, “Left-to-right signed-bit τ -adic representations of n integers,” inProceedings of the 8th International Conference on Information and Com-munications Security, ICICS 2006, Raleigh, North Carolina, USA, Dec.4–7, 2006, Lecture Notes in Computer Science, vol. 4307, Springer, pp.469–478.

[40] ———, “Implementing cryptography for packet level authentication,” inProceedings of the 2008 International Conference on Security and Man-agement, SAM 2008, Las Vegas, Nevada, USA, Jul. 14–17, 2008, CSREAPress, pp. 475–480.

[41] Brumley, B.B. and Jarvinen, K.U., “Fast point decompression for stan-dard elliptic curves,” in Proceedings of the 5th European PKI Workshop,EuroPKI 2008, Trondheim, Norway, Jun. 16–17, 2008, Lecture Notes inComputer Science, vol. 5057, Springer, pp. 134–149.

— 96 —

[42] Bulens, P., Standaert, F.X., Quisquater, J.J., Pellegrin, P., and Rouvroy,G., “Implementation of the AES-128 on Virtex-5 FPGAs,” in Proceedingsof the 1st International Conference on Cryptology in Africa, Progress inCryptology — AFRICACRYPT 2008, Casablanca, Morocco, Jun. 11–14,2008, Lecture Notes in Computer Science, vol. 5023, Springer, pp. 16–26.

[43] Bundesamt fur Sicherheit in der Informationstechnik (BSI), “Ellipticcurve cryptography based on ISO 15946,” Technical Guideline TR-03111, ver. 1.00, Feb. 14, 2007, http://www.bsi.de/literat/tr/tr03111/BSI-TR-03111.pdf (Oct. 9, 2008).

[44] ———, “Advanced security mechanisms for machine readable travel doc-uments — extended access control (EAC),” Technical Guideline TR-03110, ver. 1.11, Feb. 21, 2008, http://www.bsi.de/fachthem/epass/TR-03110 v111.pdf (Oct. 9, 2008).

[45] Burr, W.E., “Selecting the advanced encryption standard,” IEEE Securityand Privacy, vol. 1, no. 2, Mar.–Apr. 2003, pp. 43–52.

[46] Candolin, C., Securing Military Decision Making in a Network-CentricEnvironment, Ph.D. thesis, Helsinki University of Technology, 2005.

[47] Candolin, C., Lundberg, J., and Kari, H., “Packet level authenticationin military networks,” in Proceedings of the 6th Australian InformationWarfare & IT Security Conference, Geelong, Australia, Nov. 2005.

[48] Canright, D., “A very compact S-box for AES,” in Proceedings of theWorkshop on Cryptographic Hardware and Embedded Systems, CHES2005, Edinburgh, Scotland, UK, Aug. 29–Sep. 1, 2005, Lecture Notesin Computer Science, vol. 3659, Springer, pp. 441–456.

[49] Carlier, V., Chabanne, H., Dottax, E., and Pelletier, H., “Electromagneticside channels of an FPGA implementation of AES,” Cryptology ePrintArchive, Report 2004/145, 2004, http://eprint.iacr.org/ (Oct. 9, 2008).

[50] Certicom, Corp., “The Certicom ECC challenge,” Web page, http://www.certicom.ca/index.php/the-certicom-ecc-challenge (Oct. 9, 2008).

[51] Certicom Research, “SEC 1: Elliptic curve cryptography,” Standards forEfficient Cryptography, Sep. 20, 2000.

[52] ———, “SEC 2: Recommended elliptic curve domain parameters,” Stan-dards for Efficient Cryptography, Sep. 20, 2000.

[53] ———, “Certicom patent letter,” Online document, Feb. 10, 2005, http://www.secg.org/download/aid-398/certicom patent letter SECG.pdf (Oct.9, 2008).

[54] Chang Shantz, S., “From Euclid’s GCD to Montgomery multiplication tothe great divide,” Techical Report SMLI TR-2001-95, Sun Microsystems,Inc., Jun. 2001.

— 97 —

[55] Charot, F., Yahya, E., and Wagner, C., “Efficient modular-pipelined AESimplementation in counter mode on Altera FPGA,” in Proceedings of the13th International Conference on Field-Programmable Logic and Appli-cations, FPL 2003, Lisbon, Portugal, Sep. 1–3, 2003, Lecture Notes inComputer Science, vol. 2778, Springer, pp. 282–291.

[56] Chaves, R., Kuzmanov, G., Vassiliadis, S., and Sousa, L., “Reconfigurablememory based AES co-processor,” in Proceedings of the 20th InternationalParallel and Distributed Processing Symposium, IPDPS 2006, the 13th Re-configurable Architectures Workshop, RAW 2006, Rhodes Island, Greece,Apr. 25–26, 2006.

[57] Chelton, W.N. and Benaissa, M., “Fast elliptic curve cryptography onFPGA,” IEEE Transactions on Very Large Scale Integration (VLSI) Sys-tems, vol. 16, no. 2, Feb. 2008, pp. 198–205.

[58] Chen, G., Bai, G., and Chen, H., “A high-performance elliptic curve cryp-tographic processor for general curves over GF (p) based on a systolicarithmetic unit,” IEEE Transactions on Circuits and Systems—Part II:Express Briefs, vol. 54, no. 5, May 2007, pp. 412–416.

[59] Cheung, R.C.C., Telle, N.J., Luk, W., and Cheung, P.Y.K., “Customiz-able elliptic curve cryptosystem,” IEEE Transactions on Very Large ScaleIntegration (VLSI) Systems, vol. 13, Sep. 2005, pp. 1048–1059.

[60] Chevallier-Mames, B., Ciet, M., and Joye, M., “Low-cost solutions forpreventing simple side-channel analysis: Side-channel atomicity,” IEEETransactions on Computers, vol. 53, no. 6, Jun. 2006, pp. 760–768.

[61] Chodowiec, P. and Gaj, K., “Very compact FPGA implementation ofthe AES algorithm,” in Proceedings of the Workshop on CryptographicHardware and Embedded Systems, CHES 2003, Cologne, Germany, Sep.8–10, 2003, Lecture Notes in Computer Science, vol. 2779, Springer, pp.319–333.

[62] Chodowiec, P., Khuon, P., and Gaj, K., “Fast implementations of secret-key block ciphers using mixed inner- and outer-round pipelining,” in Pro-ceedings of the 2001 ACM/SIGDA 9th International Symposium on FieldProgrammable Gate Arrays, Monterey, California, USA, Feb. 11–13, 2001,pp. 94–102.

[63] Chou, W., “Inside SSL: Accelerating secure transactions,” IEEE IT Pro-fessional, vol. 4, no. 5, Sep.–Oct. 2002, pp. 37–41.

[64] Chung, J. and Hasan, M.A., “Low-weight polynomial form integers for effi-cient modular multiplication,” IEEE Transactions on Computers, vol. 56,no. 1, Jan. 2007, pp. 44–57.

[65] Ciet, M., Lange, T., Sica, F., and Quisquater, J.J., “Improved algorithmsfor efficient arithmetic on elliptic curves using fast endomorphisms,” inProceedings of the International Conference on the Theory and Applica-tions of Cryptographic Techniques, Advances in Cryptology — EURO-CRYPT 2003, Warsaw, Poland, May 4–8, 2003, Lecture Notes in Com-puter Science, vol. 2656, Springer, pp. 388–400.

— 98 —

[66] Ciet, M. and Sica, F., “An analysis of double base number systems and asublinear scalar multiplication algorithm,” in Proceedings of the 1st Inter-national Conference on Cryptology in Malaysia, Progress in Cryptology —Mycrypt 2005, Kuala Lumpur, Malaysia, Sep. 28–30, 2005, Lecture Notesin Computer Science, vol. 3715, Springer, pp. 171–182.

[67] Cilardo, A., Coppolino, L., Mazzocca, N., and Romano, L., “Elliptic curvecryptography engineering,” Proceedings of the IEEE, vol. 94, no. 2, Feb.2006, pp. 395–406.

[68] Cohen, H. and Frey, G. (eds.), Handbook of Elliptic and HyperellipticCurve Cryptography, Chapman & Hall/CRC, 2006.

[69] Compton, K. and Hauck, S., “Reconfigurable computing: A survey ofsystems and software,” ACM Computing Surveys, vol. 34, no. 2, Jun.2002, pp. 171–210.

[70] Crandall, R.E., “Method and apparatus for public key exchange in a cryp-tographic system,” United States Patent 5,159,632, Oct. 27, 1992.

[71] Daemen, J. and Rijmen, V., The Design of Rijndael: AES — The Ad-vanced Encryption Standard, Springer, 2002.

[72] Daly, A., Marnane, W., Kerins, T., and Popovici, E., “An FPGA imple-mentation of a GF (p) ALU for encryption processors,” Microprocessorsand Microsystems, vol. 28, no. 5-6, Aug. 2004, pp. 253–260.

[73] Daneshbeh, A.K. and Hasan, M.A., “Area efficient high speed ellipticcurve cryptoprocessor for random curves,” in Proceedings of the Inter-national Conference on Information Technology: Coding and Computing,ITCC 2004, Las Vegas, Nevada, USA, Apr. 5–7, 2004, vol. 2, pp. 588–592.

[74] Dhem, J.F., “Efficient modular reduction algorithm in Fq[x] and its appli-cation to ”left to right” modular multiplication in F2[x],” in Proceedings ofthe Workshop on Cryptographic Hardware and Embedded Systems, CHES2003, Cologne, Germany, Sep. 8–10, 2003, Lecture Notes in ComputerScience, vol. 2779, Springer, pp. 203–213.

[75] Dhem, J.F., Koeune, F., Leroux, P.A., Mestre, P., Quisquater, J.J., andWillems, J.L., “A practical implementation of the timing attack,” in Pro-ceedings of the 3rd International Conference on Smart Card Research andApplications, CARDIS ’98, Louvain-la-Neuve, Belgium, Sep. 14–16, 1998,Lecture Notes in Computer Science, vol. 1820, Springer, pp. 167–182.

[76] Diffie, W. and Hellman, M.E., “New directions in cryptography,” IEEETransactions on Information Theory, vol. 22, no. 6, Nov. 1976, pp. 644–654.

[77] Dimitrov, V., Imbert, L., and Mishra, P.K., “Efficient and secure ellipticcurve point multiplication using double-base chains,” in Proceedings of the11th International Conference on the Theory and Application of Cryptol-ogy and Information Security, Advances in Cryptology — ASIACRYPT2005, Chennai, India, Dec. 4–8, 2005, Lecture Notes in Computer Science,vol. 3788, Springer, p. 59–78.

— 99 —

[78] ———, “The double-base number system and its application to ellipticcurve cryptography,” Mathematics of Computation, vol. 77, no. 262, Apr.2008, pp. 1075–1104.

[79] Dimitrov, V.S., Jullien, G.A., and Miller, W.C., “An algorithm for mod-ular exponentiation,” Information Processing Letters, vol. 66, no. 3, May1998, pp. 155–159.

[80] ———, “Theory and applications of the double-base number system,”IEEE Transactions on Computers, vol. 48, no. 10, Oct. 1999, pp. 1098–1106.

[81] Dobbertin, H., Bosselaers, A., and Preneel, B., “RIPEMD-160: Astrengthened version of RIPEMD,” in Proceedings of the 3rd InternationalWorkshop on Fast Software Encryption, FSE 1996, Cambridge, UK, Feb.21–23, 1996, Lecture Notes in Computer Science, vol. 1039, Springer, pp.71–82.

[82] Doche, C., “Exponentiation,” in Handbook of Elliptic and HyperellipticCurve Cryptography (H. Cohen and G. Frey, eds.), chap. 9, Chapman &Hall/CRC, 2006, pp. 145–168.

[83] ———, “Finite field arithmetic,” in Handbook of Elliptic and HyperellipticCurve Cryptography (H. Cohen and G. Frey, eds.), chap. 11, Chapman &Hall/CRC, 2006, pp. 201–237.

[84] Doche, C. and Lange, T., “Arithmetic of elliptic curves,” in Handbookof Elliptic and Hyperelliptic Curve Cryptography (H. Cohen and G. Frey,eds.), chap. 13, Chapman & Hall/CRC, 2006, pp. 267–302.

[85] Doche, C. and Lubicz, D., “Algebraic background,” in Handbook of Ellip-tic and Hyperelliptic Curve Cryptography (H. Cohen and G. Frey, eds.),chap. 2, Chapman & Hall/CRC, 2006, pp. 19–37.

[86] Drimer, S., Guneysu, T., and Paar, C., “DSPs, BRAMs and a pinch oflogic: New recipes for AES on FPGAs,” in Proceedings of the 16th AnnualIEEE Symposium on Field-programmable Custom Computing Machines,FCCM 2008, Stanford, California, USA, Apr. 14–15, 2008, to appear.

[87] Eberle, H., Gura, N., and Chang-Shantz, S., “A cryptographic processorfor arbitrary elliptic curves over GF (2m),” in Proceedings of the IEEEInternational Conference on Application-Specific Systems, Architectures,and Processors, ASAP 2003, The Hague, The Netherlands, Jun. 24–26,2003, pp. 444–454.

[88] Eberle, H., Gura, N., Chang Shantz, S., Gupta, V., Rarick, L., and Sun-daram, S., “A public-key cryptographic processor for RSA and ECC,” inProceedings of the 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2004, Galveston,Texas, USA, Sep. 27–29, 2004, pp. 98–110.

[89] Eberle, H., Shantz, S., Gupta, V., Gura, N., Rarick, L., and Spracklen,L., “Accelerating next-generation public-key cryptosystems on general-purpose CPUs,” IEEE Micro, vol. 25, no. 2, Mar.–Apr. 2005, pp. 52–59.

— 100 —

[90] Eisenbarth, T., Kumar, S., Paar, C., Poschmann, A., and Uhsadel, L., “Asurvey of lightweight-cryptography implementations,” IEEE Design andTest of Computers, vol. 24, no. 6, Nov.–Dec. 2007, pp. 522–533.

[91] Elbirt, A.J., Yip, W., Chetwynd, B., and Paar, C., “An FPGA-based per-formance evaluation of the AES block cipher candidate algorithm final-ists,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems,vol. 9, no. 4, Aug. 2001, pp. 545–557.

[92] ElGamal, T., “A public key cryptosystem and a signature scheme based ondiscrete logarithms,” IEEE Transactions on Information Theory, vol. 31,no. 4, Jul. 1985, pp. 469–472.

[93] Ernst, M., Jung, M., Madlener, F., Huss, S., and Blumel, R., “A reconfig-urable system on chip implementation for elliptic curve cryptography overGF (2n),” in Proceedings of the Workshop on Cryptographic Hardware andEmbedded Systems, CHES 2002, Redwood City, California, USA, Aug.13–15, 2002, Lecture Notes in Computer Science, vol. 2523, Springer, pp.381–399.

[94] Ernst, M., Klupsch, S., Hauck, O., and Huss, S.A., “Rapid prototypingfor hardware accelerated elliptic curve public-key cryptosystems,” in Pro-ceedings of the 12th International Workshop on Rapid System Prototyping,RSP 2001, Monterey, California, USA, Jun. 25–27, 2001, pp. 24–29.

[95] Feldhofer, M., Dominikus, S., and Wolkerstorfer, J., “Strong authenti-cation for RFID systems using the AES algorithm,” in Proceedings ofthe Workshop on Cryptographic Hardware and Embedded Systems, CHES2004, Cambridge, Massachusetts, USA, Aug. 10–13, 2004, Lecture Notesin Computer Science, vol. 3156, Springer, pp. 357–370.

[96] Feldhofer, M., Wolkerstorfer, J., and Rijmen, V., “AES implementation ona grain of sand,” IEE Proceedings: Information Security, vol. 152, no. 1,Oct. 2005, pp. 13–20.

[97] Fischer, V. and Drutarovsky, M., “Two methods of Rijndael implementa-tion in reconfigurable hardware,” in Proceedings of the Workshop on Cryp-tographic Hardware and Embedded Systems, CHES 2001, Paris, France,May 14–16, 2001, Lecture Notes in Computer Science, vol. 2162, Springer,pp. 77–92.

[98] Fischer, V., Drutarovsky, M., Chodowiec, P., and Gramain, F., “InvMix-Column decomposition and multilevel resource sharing in AES implemen-tations,” IEEE Transactions on Very Large Scale Integration (VLSI) Sys-tems, vol. 13, no. 8, Aug. 2005, pp. 989–992.

[99] Frey, G. and Lange, T., “Algebraic realizations of DL systems,” in Hand-book of Elliptic and Hyperelliptic Curve Cryptography (H. Cohen andG. Frey, eds.), chap. 23, Chapman & Hall/CRC, 2006, pp. 547–572.

[100] Gaj, K. and Chodowiec, P., “Fast implementation and fair comparisonof the final candidates for advanced encryption standard using field pro-grammable gate arrays,” in Proceedings of the Cryptographers’ Track at

— 101 —

the RSA Conference 2001, Topics in Cryptology — CT-RSA 2001, SanFrancisco, California, USA, Apr. 8–12, 2001, Lecture Notes in ComputerScience, vol. 2020, Springer, pp. 84–99.

[101] Gallant, R.P., Lambert, R.J., and Vanstone, S.A., “Faster point multipli-cation on elliptic curves with efficient endomorphisms,” in Proceedings ofthe 21st Annual International Cryptology Conference, Advances in Cryp-tology — CRYPTO 2001, Santa Barbara, California, USA, Aug. 19–23,2001, Lecture Notes in Computer Science, vol. 2139, Springer, pp. 190–200.

[102] Gao, L., Shrivastava, S., Lee, H., and Sobelman, G.E., “A compact fastvariable key size elliptic curve cryptosystem coprocessor,” in Proceedingsof the 7th Annual IEEE Symposium on Field-Programmable Custom Com-puting Machines, FCCM 1999, Napa Valley, California, USA, Apr. 21–23,1999, pp. 304–305.

[103] Good, T. and Benaissa, M., “AES on FPGA from the fastest to the small-est,” in Proceedings of the Workshop on Cryptographic Hardware and Em-bedded Systems, CHES 2005, Edinburgh, Scotland, UK, Aug. 29–Sep. 1,2005, Lecture Notes in Computer Science, vol. 3659, Springer, pp. 427–440.

[104] ———, “Very small FPGA application-specific instruction processor forAES,” IEEE Transactions on Circuits and Systems—Part I: Regular Pa-pers, vol. 53, no. 7, Jul. 2006, pp. 1477–1486.

[105] ———, “Pipelined AES on FPGA with support for feedback modes (ina multi-channel environment),” IET Information Security, vol. 1, no. 1,Mar. 2007, pp. 1–10.

[106] Goodman, J. and Chandrakasan, A., “An energy efficient reconfigurablepublic-key cryptography processor architecture,” in Proceedings of theWorkshop on Cryptographic Hardware and Embedded Systems, CHES2000, Worcester, Massachusetts, USA, Aug. 17–18, 2000, Lecture Notesin Computer Science, vol. 1965, Springer, pp. 175–190.

[107] ———, “An energy-efficient reconfigurable public-key cryptography pro-cessor,” IEEE Journal of Solid-State Circuits, vol. 36, no. 11, Nov. 2001,pp. 1808–1820.

[108] Gordon, D.M., “A survey of fast exponentiation methods,” Journal ofAlgorithms, vol. 27, no. 1, Apr. 1998, pp. 129–146.

[109] Grabbe, C., Bednara, M., Teich, J., von zur Gathen, J., and Shokrollahi,J., “FPGA designs of parallel high performance GF (2233) multipliers,” inProceedings of the 2003 IEEE International Symposium on Circuits andSystems, ISCAS 2003, Bangkok, Thailand, May 25–28, 2003, vol. 2, pp.268–271.

[110] Grabner, P.J., Heuberger, C., and Prodinger, H., “Distribution resultsfor low-weight binary representations for pairs of integers,” TheoreticalComputer Science, vol. 319, no. 1-3, Jun. 2004, pp. 307–331.

— 102 —

[111] Großschadl, J., “A bit-serial unified multiplier architecture for finite fieldsGF (p) and GF (2m),” in Proceedings of the Workshop on CryptographicHardware and Embedded Systems, CHES 2001, Paris, France, May 14–16, 2001, Lecture Notes in Computer Science, vol. 2162, Springer, pp.202–219.

[112] Großschadl, J. and Savas, E., “Instruction set extensions for fast arith-metic in finite fields GF (p) and GF (2m),” in Proceedings of the Workshopon Cryptographic Hardware and Embedded Systems, CHES 2004, Cam-bridge, Massachusetts, USA, Aug. 10–13, 2004, Lecture Notes in Com-puter Science, vol. 3156, Springer, pp. 133–147.

[113] Guajardo, J. and Paar, C., “Itoh-Tsujii inversion in standard basis andits application in cryptography and codes,” Designs, Codes and Cryptog-raphy, vol. 25, no. 2, Feb. 2002, pp. 207–216.

[114] Gura, N., Chang Shantz, S., Eberle, H., Gupta, S., Gupta, V., Finchel-stein, D., Goupy, E., and Stebila, D., “An end-to-end systems approachto elliptic curve cryptography,” in Proceedings of the Workshop on Cryp-tographic Hardware and Embedded Systems, CHES 2002, Redwood City,California, USA, Aug. 13–15, 2002, Lecture Notes in Computer Science,vol. 2523, Springer, pp. 349–365.

[115] Gura, N., Eberle, H., and Chang Shantz, S., “Generic implementations ofelliptic curve cryptography using partial reduction,” in Proceedings of the9th ACM conference on Computer and Communications Security, CCS2002, Washington, DC, USA, Nov. 18–22, 2002, vol. 1, pp. 108–116.

[116] Gura, N., Patel, A., Wander, A., Eberle, H., and Shantz, S., “Comparingelliptic curve cryptography and RSA on 8-bit CPUs,” in Proceedings ofthe Workshop on Cryptographic Hardware and Embedded Systems, CHES2004, Cambridge, Massachusetts, USA, Aug. 10–13, 2004, Lecture Notesin Computer Science, vol. 3156, Springer, pp. 119–132.

[117] Halbutogullari, A. and Koc, C.K., “Mastrovito multipliers for general ir-reducible polynomials,” IEEE Transactions on Computers, vol. 49, no. 5,May 2000, pp. 503–518.

[118] Hamalainen, P., Cryptographic Security Design and Hardware Architec-tures for Wireless Local Area Networks, Ph.D. thesis, Tampere Universityof Technology, 2006.

[119] Hamalainen, P., Alho, T., Hannikainen, M., and Hamalainen, T.D., “De-sign and implementation of low-area and low-power AES encryption hard-ware core,” in Proceedings of the 9th EUROMICRO Conference on DigitalSystem Design: Architectures, Methods and Tools, DSD 2006, Dubrovnik,Croatia, Aug. 30–Sep. 1, 2006, pp. 577–583.

[120] Hamalainen, P., Hannikainen, M., and Hamalainen, T.D., “Review ofhardware architectures for advanced encryption standard implementationsconsidering wireless sensor networks,” in Proceedings of the 7th Interna-tional Workshop on Embedded Computer Systems: Architectures, Mod-eling, and Simulation, SAMOS 2007, Samos, Greece, Jul. 16–19, 2007,Lecture Notes in Computer Science, vol. 4599, Springer, pp. 443–453.

— 103 —

[121] Hankerson, D., Hernandez, J.L., and Menezes, A., “Software implemen-tation of elliptic curve cryptography over binary fields,” in Proceedings ofthe Workshop on Cryptographic Hardware and Embedded Systems, CHES2000, Worcester, Massachusetts, USA, Aug. 17–18, 2000, Lecture Notesin Computer Science, vol. 1965, Springer, pp. 1–24.

[122] Hankerson, D., Menezes, A., and Vanstone, S., Guide to Elliptic CurveCryptography, Springer, 2004.

[123] Hasan, M.A., “Look-up table-based large finite field multiplication inmemory constrained cryptosystems,” IEEE Transactions on Computers,vol. 49, no. 7, Jul. 2000, pp. 749–758.

[124] Higuchi, A. and Takagi, N., “A fast addition algorithm for elliptic curvearithmetic in GF (2n) using projective coordinates,” Information Process-ing Letters, vol. 76, no. 3, Dec. 15 2000, pp. 101–103.

[125] Hodjat, A. and Verbauwhede, I., “A 21.54 Gbits/s fully pipelined AESprocessor on FPGA,” in Proceedings of 2004 IEEE Symposium on Field-programmable Custom Computing Machines, FCCM 2004, Napa, Califor-nia, USA, Apr. 20–23, 2004, pp. 308–309.

[126] ———, “Area-throughput trade-offs for fully pipelined 30 to 70 Gbits/sAES processors,” IEEE Transactions on Computers, vol. 55, no. 4, Apr.2006, pp. 366–372.

[127] Hong, D., Sung, J., Hong, S., Lim, J., Lee, S., Koo, B.S., Lee, C., Chang,D., Lee, J., Jeong, K., Kim, H., Kim, J., and Chee, S., “HIGHT: A newblock cipher suitable for low-resource device,” in Proceedings of the Work-shop on Cryptographic Hardware and Embedded Systems, CHES 2006,Yokohama, Japan, Oct. 10–13, 2006, Lecture Notes in Computer Science,vol. 4249, Springer, pp. 46–59.

[128] Hsiao, S.F., Chen, M.C., Tsai, M.Y., and Lin, C.C., “System-on-chip im-plementation of the whole advanced encryption standard processor usingreduced XOR-based sum-of-product operations,” IEE Proceedings: Infor-mation Security, vol. 152, no. 1, Oct. 2005, pp. 21–30.

[129] Hsiao, S.F., Chen, M.C., and Tu, C.S., “Memory-free low-cost designs ofadvanced encryption standard using common subexpression eliminationfor subfunctions in transformations,” IEEE Transactions on Circuits andSystems—Part I: Regular Papers, vol. 53, no. 3, Mar. 2006, pp. 615–626.

[130] Hwang, D.D., Tiri, K., Hodjat, A., Lai, B.C., Yang, S., Schaumont, P., andVerbauwhede, I., “AES-based security coprocessor IC in 0.18-µm CMOSwith resistance to differential power analysis side-channel attacks,” IEEEJournal of Solid-State Circuits, vol. 41, no. 4, Apr. 2006, pp. 781–791.

[131] Ichikawa, T., Kasuya, T., and Matsui, M., “Hardware evaluation of theAES finalists,” in Proceedings of the 3rd Advanced Encryption StandardCandidate Conference, New York, New York, USA, Apr. 13–14, 2000, pp.279–285.

— 104 —

[132] Institute of Electrical and Electronics Engineers (IEEE), “IEEE standardspecifications for public-key cryptography,” IEEE Std 1363-2000, Jan. 30,2002.

[133] ———, “IEEE standard specifications for public-key cryptography—amendment 1: Additional techniques,” IEEE Std 1363a-2004, Jul. 22,2004.

[134] International Organization for Standardization / International Elec-trotechnic Commission (ISO/IEC), “Encryption algorithms,” ISO/IEC18033, Part 2, 2002.

[135] ———, “Techniques based on elliptic curves,” ISO/IEC 15946, Parts 1-4,2002.

[136] Itoh, T. and Tsujii, S., “A fast algorithm for computing multplicativeinverses in GF (2m) using normal bases,” Information and Computation,vol. 78, no. 3, Sep. 1988, pp. 171–177.

[137] Jarvinen, K., Tommiska, M., and Skytta, J., “A VHDL generator for ellip-tic curve cryptography,” in Proceedings of the 14th International Confer-ence on Field Programmable Logic and Application, FPL 2004, Antwerp,Belgium, Aug. 29–Sep. 1, 2004, Lecture Notes in Computer Science, vol.3203, Springer, pp. 1098–1100.

[138] ———, “A compact MD5 and SHA-1 co-implementation utilizing algo-rithm similarities,” in Proceedings of the 2005 International Conferenceon Engineering of Reconfigurable Systems and Algorithms, ERSA 2005,Las Vegas, Nevada, USA, Jun. 27–30, 2005, pp. 48–54.

[139] ———, “Hardware implementation analysis of the MD5 hash algorithm,”in Proceedings of the 38th Annual Hawaii International Conference onSystem Sciences, HICSS-38, Big Island, Hawaii, USA, Jan. 3–6, 2005, p.298 (abstract), full paper in the CD proceedings and IEEE Xplore.

[140] Johnson, D., Menezes, A., and Vanstone, S., “The elliptic curve digitalsignature algorithm (ECDSA),” International Journal of Information Se-curity, vol. 1, no. 1, Aug. 2001, pp. 36–63.

[141] Joye, M. and Tymen, C., “Compact encoding of non-adjacent forms withapplications to elliptic curve cryptography,” in Proceedings of the 4th In-ternational Workshop on Practice and Theory in Public Key Cryptosys-tems, PKC 2001, Cheju Island, Korea, Feb. 13–15, 2001, Lecture Notesin Computer Science, vol. 1992, Springer, pp. 353–364.

[142] Kaliski, B.S., “The Montgomery inverse and its applications,” IEEETransactions on Computers, vol. 44, no. 8, Aug. 1995, pp. 1064–1065.

[143] Karatsuba, A. and Ofman, Y., “Multiplication of multidigit numbers onautomata,” Doklady Akademii Nauk SSSR, vol. 145, no. 2, Jul. 1962, pp.293–294, English translation: Soviet Physics — Doklady, vol. 7, no. 7,Jan. 1963, pp. 595-596.

— 105 —

[144] Kerins, T., Popovici, E., Marnane, W., and Fitzpatrick, P., “Fully pa-rameterizable elliptic curve cryptography processor over GF (2m),” inProceedings of the 12th International Conference on Field ProgrammableLogic and Applications, FPL 2002, Montpellier, France, Sep. 2–4, 2002,p. 750–759.

[145] Knudsen, E.W., “Elliptic scalar multiplication using point halving,” inProceedings of the International Conference on the Theory and Applica-tion of Cryptology and Information Security, Advances in Cryptology —ASIACRYPT ’99, Singapore, Nov. 14–18, 1999, Lecture Notes in Com-puter Science, vol. 1716, Springer, pp. 135–149.

[146] Koblitz, N., “Elliptic curve cryptosystems,” Mathematics of Computation,vol. 48, 1987, pp. 203–209.

[147] ———, “Hyperelliptic cryptosystems,” Journal of Cryptology, vol. 1,no. 3, Oct. 1989, pp. 139–150.

[148] ———, “CM-curves with good cryptographic properties,” in Advances inCryptology — CRYPTO ’91, Santa Barbara, California, USA, Aug. 11–15,1991, Lecture Notes in Computer Science, vol. 576, Springer, pp. 279–287.

[149] Koc, C.K. and Acar, T., “Montgomery multiplication in GF (2k),” De-signs, Codes and Cryptography, vol. 14, no. 1, Apr. 1998, p. 57–69.

[150] Koc, C.K. and Sunar, B., “Low-complexity bit-parallel canonical and nor-mal basis multipliers for a class of finite fields,” IEEE Transactions onComputers, vol. 47, no. 3, Mar. 1998, pp. 353–356.

[151] Kocher, P., Jaffe, J., and Jun, B., “Differential power analysis,” in Proceed-ings of the 19th Annual International Cryptology Conference, Advances inCryptology — CRYPTO ’99, Santa Barbara, California, USA, Aug. 15–19, 1999, Lecture Notes in Computer Science, vol. 1666, Springer, pp.388–397.

[152] Kocher, P.C., “Timing attacks on implementations of Diffie-Hellman,RSA, DSS, and other systems,” in Proceedings of the 16th Annual In-ternational Cryptology Conference, Advances in Cryptology — CRYPTO’96, Santa Barbara, California, USA, Aug. 18–22, 1996, Lecture Notes inComputer Science, vol. 1109, Springer, pp. 104–113.

[153] Kosaraju, N.M., Varanasi, M., and Mohanty, S.P., “A high-performanceVLSI architecture for advanced encryption standard (AES) algorithm,” inProceedings of the 19th International Conference on VLSI Design, VLSID2006, Hyderabad, India, Jan. 3–7, 2006.

[154] Kotturi, D., Yoo, S.M., and Blizzard1, J., “AES crypto chip utilizinghigh-speed parallel pipelined architecture,” in Proceedings of the IEEEInternational Symposium on Circuits and Systems, ISCAS 2005, Kobe,Japan, May 23–26, 2005, vol. 5, pp. 4653–4656.

[155] Kravitz, D.W., “Digital signature algorithm,” United States Patent5,231,668, Jul. 27, 1993.

— 106 —

[156] Kumar, S., Wollinger, T., and Paar, C., “Optimum digit serial GF (2m)multipliers for curve-based cryptography,” IEEE Transactions on Com-puters, vol. 55, no. 10, Oct. 2006, pp. 1306–1311.

[157] Kuo, H. and Verbauwhede, I., “Architectural optimization for a1.82Gbits/sec VLSI implementation of the AES Rijndael algorithm,” inProceedings of the Workshop on Cryptographic Hardware and EmbeddedSystems, CHES 2001, Paris, France, May 14–16, 2001, Lecture Notes inComputer Science, vol. 2162, Springer, pp. 51–64.

[158] Kuon, I. and Rose, J., “Measuring the gap between FPGAs and ASICs,”IEEE Transactions on Computer-Aided Design of Integrated Circuits andSystems, vol. 26, no. 2, Feb. 2007, pp. 203–215.

[159] Kwon, S., Gaj, K., Kim, C.H., and Hong, C.P., “Efficient linear array formultiplication in GF (2m) using a normal basis for elliptic curve cryptog-raphy,” in Proceedings of the Workshop on Cryptographic Hardware andEmbedded Systems, CHES 2004, Cambridge, Massachusetts, USA, Aug.10–13, 2004, Lecture Notes in Computer Science, vol. 3156, Springer, pp.76–91.

[160] LaMacchia, B.A. and Manferdelli, J.L., “New Vistas in elliptic curve cryp-tography,” Information Security Technical Report, vol. 11, no. 4, 2006, pp.186–192.

[161] Lange, T., “A note on Lopez-Dahab coordinates,” Cryptology ePrintArchive, Report 2004/323, 2004, http://eprint.iacr.org/, (Oct. 9, 2008).

[162] ———, “Koblitz curve cryptosystems,” Finite Fields and Their Applica-tions, vol. 11, no. 2, Apr. 2005, pp. 200–229.

[163] Leander, G., Paar, C., Poschmann, A., and Schramm, K., “Newlightweight DES variants,” in Revised Selected Papers of the 14th Inter-national Workshop on Fast Software Encryption, FSE 2007, Luxemburg,Luxemburg, Mar. 26–28, 2007, Lecture Notes in Computer Science, vol.4593, Springer, pp. 196–210.

[164] Lenstra, A.K. and Verheul, E.R., “Selecting cryptographic key sizes,”Journal of Cryptology, vol. 14, no. 4, Dec. 2001, p. 255–293.

[165] Lenstra, H.W., “Factoring integers with elliptic curves,” The Annals ofMathematics, vol. 126, no. 3, Nov. 1987, pp. 649–673.

[166] Leone, M., “A new low complexity parallel multiplier for a class of finitefields,” in Proceedings of the Workshop on Cryptographic Hardware andEmbedded Systems, CHES 2001, Paris, France, May 14–16, 2001, LectureNotes in Computer Science, vol. 2162, Springer, pp. 160–170.

[167] Leong, P.H.W. and Leung, K.H., “A microcoded elliptic curve processorusing FPGA technology,” IEEE Transactions on Very Large Scale Inte-gration (VLSI) Systems, vol. 10, no. 5, Oct. 2002, pp. 550–559.

— 107 —

[168] Leung, K.H., Ma, K.W., Wong, W.K., and Leong, P.H.W., “FPGA im-plementation of a microcoded elliptic curve cryptographic processor,” inProceedings of the 2000 IEEE Symposium on Field-Programmable CustomComputing Machines, FCCM 2000, Napa Valley, California, USA, Apr.17–19, 2000, pp. 68–76.

[169] Liberatori, M., Otero, F., Bonadero, J.C., and Castineira, J., “AES-128cipher. high speed, low cost FPGA implementation,” in Proceedings ofthe 3rd Southern Conference on Programmable Logic, SPL 2007, Mar delPlata, Argentina, Feb. 26–28, 2007, pp. 195–198.

[170] Lim, C.H. and Lee, P.J., “More flexible exponentiation with precompu-tation,” in Proceedings of the 14th Annual International Cryptology Con-ference, Advances in Cryptology — CRYPTO ’94, Santa Barbara, Cali-fornia, USA, Aug. 21–25, 1994, Lecture Notes in Computer Science, vol.839, Springer, pp. 95–107.

[171] Ling, S. and Xing, C., Coding Theory: A First Course, Cambridge Uni-versity Press, 2004.

[172] Lipmaa, H., “AES/Rijndael: speed,” Web page, http://research.cyber.ee/∼lipmaa/research/aes/rijndael.html (Oct. 9, 2008).

[173] Liu, S., Bowen, F., King, B., and Wang, W., “Elliptic curves cryptosystemimplementation based on a look-up table sharing scheme,” in Proceedingsof the 2006 IEEE International Symposium on Circuits and Systems, IS-CAS 2006, Island of Kos, Greece, May 21–24, 2006, pp. 2921–2924.

[174] Lopez, J. and Dahab, R., “Improved algorithms for elliptic curve arith-metic in GF (2n),” in Proceedings of the 5th Annual International Work-shop on Selected Areas in Cryptography, SAC ’98, Kingston, Ontario,Canada, Aug. 17–18, 1998, Lecture Notes in Computer Science, vol. 1556,Springer, pp. 201–212.

[175] ———, “Fast multiplication on elliptic curves over GF (2m) without pre-computation,” in Proceedings of the Workshop on Cryptographic Hardwareand Embedded Systems, CHES 1999, Worcester, Massachusetts, USA,Aug. 12–13, 1999, Lecture Notes in Computer Science, vol. 1717, Springer,pp. 316–317.

[176] Lutz, A.K., Treichler, J., Gurkaynak, F.K., Kaeslin, H., Basler, G., Erni,A., Reichmuth, S., Rommens, P., Oetiker, S., and Fichtner, W., “2Gbit/shardware realizations of RIJNDAEL and SERPENT: A comparative anal-ysis,” in Proceedings of the Workshop on Cryptographic Hardware andEmbedded Systems, CHES 2002, Redwood City, California, USA, Aug.13–15, 2002, Lecture Notes in Computer Science, vol. 2523, Springer, pp.144–158.

[177] Lutz, J., High Performance Elliptic Curve Cryptographic Co-processor,Master’s thesis, University of Waterloo, Canada, 2003.

[178] Lutz, J. and Hasan, A., “High performance FPGA based elliptic curvecryptographic co-processor,” in Proceedings of the International Confer-ence on Information Technology: Coding and Computing, ITCC 2004, LasVegas, Nevada, USA, Apr. 5–7, 2004, vol. 2, pp. 486–492.

— 108 —

[179] Mangard, S., Aigner, M., and Dominikus, S., “A highly regular andscalable AES hardware architecture,” IEEE Transactions on Computers,vol. 52, no. 4, Apr. 2003, pp. 483–491.

[180] Mangard, S., Oswald, E., and Popp, T., Power Analysis Attacks: Reveal-ing the Secrets of Smart Cards, Springer, 2007.

[181] Mastrovito, E.D., “VLSI designs for multiplication over finite fieldsGF (2m),” in Proceedings of the 6th International Conference on AppliedAlgebra, Algebraic Algorithms and Error-Correcting Codes, AAECC-6,Rome, Italy, Jul. 4–8, 1988, Lecture Notes in Computer Science, vol. 357,Springer, pp. 297–309.

[182] ———, VLSI architectures for computations in Galois fields, Ph.D. thesis,Linkoping University, 1991.

[183] Masuda, A.M., Moura, L., Panario, D., and Thomson, D., “Low com-plexity normal elements over finite fields of characteristic two,” IEEETransactions on Computers, vol. 57, no. 7, Jul. 2008, pp. 990–1001.

[184] McIvor, C.J., McLoone, M., and McCanny, J.V., “Hardware elliptic curvecryptographic processor over GF (p),” IEEE Transactions on Circuits andSystems—Part I: Regular Papers, vol. 53, no. 9, Sep. 2006, pp. 1946–1957.

[185] McLoone, M. and McCanny, J.V., “High performance single-chip FPGARijndael algorithm implementations,” in Proceedings of the Workshopon Cryptographic Hardware and Embedded Systems, CHES 2001, Paris,France, May 14–16, 2001, Lecture Notes in Computer Science, vol. 2162,Springer, pp. 65–76.

[186] McLoone, M. and McCanny, J., “Rijndael FPGA implementation utilizinglook-up tables,” in Proceedings of the 2001 IEEE Workshop on SignalProcessing Systems, SIPS 2001, Antwerp, Belgium, Sep. 26–28, 2001, pp.349–360.

[187] Meier, W. and Staffelbach, O., “Efficient multiplication on certain non-supersingular elliptic curves,” in Proceedings of the 12th Annual Interna-tional Cryptology Conference, Advances in Cryptology — CRYPTO ’92,Santa Barbara, California, USA, Aug. 16–20, 1992, Lecture Notes in Com-puter Science, vol. 740, Springer, pp. 333–344.

[188] Menezes, A.J., Okamoto, T., and Vanstone, S.A., “Reducing elliptic curvelogarithms to logarithms in a finite field,” IEEE Transactions on Infor-mation Theory, vol. 39, no. 5, Sep. 1993, pp. 1639–1646.

[189] Menezes, A.J., van Oorschot, P.C., and Vanstone, S.A., Handbook of Ap-plied Cryptography, CRC Press, 1997.

[190] Mentens, N., Batina, L., Preneel, B., and Verbauwhede, I., “A system-atic evaluation of compact hardware implementations for the RijndaelS-box,” in Proceedings of the Cryptographers’ Track at the RSA Confer-ence, Topics in Cryptology — CT-RSA 2005, San Francisco, California,USA, Feb. 14–18, 2005, Lecture Notes in Computer Science, vol. 3376,Springer, pp. 323–333.

— 109 —

[191] Mentens, N., Ors, S.B., and Preneel, B., “An FPGA implementation ofan elliptic curve processor over GF (2m),” in Proceedings of the 14th ACMGreat Lakes symposium on VLSI, GLVLSI 2004, Boston, Massachusetts,USA, Apr. 26-28, 2004, pp. 454–457.

[192] Meurice de Dormale, G., Bulens, P., and Quisquater, J.J., “Collisionsearch for elliptic curve discrete logarithm over GF (2m) with FPGA,” inProceedings of the Workshop on Cryptographic Hardware and EmbeddedSystems, CHES 2007, Vienna, Austria, Sep. 10–13, 2007, Lecture Notesin Computer Science, vol. 4727, Springer, pp. 378–393.

[193] Meurice de Dormale, G. and Quisquater, J.J., “High-speed hardware im-plementations of elliptic curve cryptography: A survey,” Journal of Sys-tems Architecture, vol. 53, no. 2-3, Feb.–Mar. 2007, pp. 72–84.

[194] Miller, V., “Use of elliptic curves in cryptography,” in Advances in Cryp-tology — CRYPTO ’85, Santa Barbara, California, USA, Aug. 18–22,1985, Lecture Notes in Computer Science, vol. 218, Springer, pp. 417–426.

[195] Mishra, P.K. and Sarkar, P., “Application of Montgomery’s trick toscalar multiplication for elliptic and hyperelliptic curves using a fixed basepoint,” in Proceedings of the 7th International Workshop on Theory andPractice in Public Key Cryptography, PKC 2004, Singapore, Mar. 1–4,2004, Lecture Notes in Computer Science, vol. 2947, Springer, pp. 41–54.

[196] Montgomery, P.L., “Modular multiplication without trial division,” Math-ematics of Computation, vol. 44, no. 170, Apr. 1985, pp. 519–521.

[197] ———, “Speeding the Pollard and elliptic curve methods of factorization,”Mathematics of Computation, vol. 48, no. 177, Jan. 1987, pp. 243–264.

[198] Morain, F. and Olivos, J., “Speeding up the computations of an ellipticcurve using addition-subtraction chains,” RAIRO Theoretical Informaticsand Applications, vol. 24, no. 6, 1990, pp. 531–543.

[199] Morales-Sandoval, M. and Feregrino-Uribe, C., “GF (2m) arithmetic mod-ules for elliptic curve cryptography,” in Proceedings of the IEEE Interna-tional Conference on Reconfigurable Computing and FPGA’s, ReConFig2006, San Luis Potosi, Mexico, Sep. 20–22, 2006.

[200] Morioka, S. and Satoh, A., “An optimized S-box circuit architecture forlow power AES design,” in Proceedings of the Workshop on CryptographicHardware and Embedded Systems, CHES 2002, Redwood City, California,USA, Aug. 13–15, 2002, Lecture Notes in Computer Science, vol. 2523,Springer, pp. 172–186.

[201] ———, “A 10-Gbps full-AES crypto design with a twisted BDD S-boxarchitecture,” IEEE Transactions on Very Large Scale Integration (VLSI)Systems, vol. 12, no. 7, Jul. 2004, pp. 686–691.

[202] Mullin, R.C., Onyszchuk, I.M., Vanstone, S.A., and Wilson, R.M., “Op-timal normal bases in GF (pn),” Discrete Applied Mathematics, vol. 22,no. 2, 1989, pp. 149–161.

— 110 —

[203] National Institute of Standards and Technology (NIST), “Data encryptionstandard (DES),” Federal Information Processing Standard, FIPS PUB46, Jan. 15, 1977.

[204] ———, “Data encryption standard (DES),” Federal Information Process-ing Standard, FIPS PUB 46-3, Oct. 25, 1999.

[205] ———, “Digital signature standard (DSS),” Federal Information Process-ing Standard, FIPS PUB 186-2, Jan. 27, 2000.

[206] ———, “Advanced encryption standard (AES),” Federal InformationProcessing Standard, FIPS PUB 197, Nov. 26, 2001.

[207] ———, “Secure hash standard (SHS),” Federal Information ProcessingStandard, FIPS PUB 180-2, Aug. 1, 2002.

[208] ———, “Announcing request for candidate algorithm nominations fora new cryptographic hash algorithm (SHA–3) family,” Federal Register,vol. 72, no. 212, Nov. 2, 2007, pp. 62212–62220.

[209] ———, “Recommendation for block cipher modes of operation: Ga-lois/counter mode (GCM) and GMAC,” Special Publication 800-38D,Nov. 2007.

[210] National Security Agency (NSA), “The case of elliptic curve cryptogra-phy,” Web page, http://www.nsa.gov/ia/industry/crypto elliptic curve.cfm (Oct. 9, 2008).

[211] Nechvatal, J., Barker, E., Bassham, L., Burr, W., Dworkin, M., Foti, J.,and Roback, E., “Report on the development of the Advanced EncryptionStandard (AES),” Techical report, National Institute of Standards andTechnology (NIST), Oct. 2, 2000.

[212] Nguyen, N., Gaj, K., Caliga, D., and El-Ghazawi, T., “Implementationof elliptic curve cryptosystems on a reconfigurable computer,” in Proceed-ings of the 2003 IEEE International Conference on Field-ProgrammableTechnology, FPT 2003, Tokyo, Japan, Dec. 15–17, 2003, pp. 60–67.

[213] Ning, P. and Yin, Y.L., “Efficient software implementation for finite fieldmultiplication in normal basis,” in Proceedings of the 3rd InternationalConference on Information and Communications Security, ICICS 2001,Xian, China, Nov. 13—16, 2001, Lecture Notes in Computer Science, vol.2229, Springer, pp. 177–188.

[214] Okada, S., Torii, N., Itoh, K., and Takenaka, M., “Implementation ofelliptic curve cryptographic coprocessor over GF (2m) on an FPGA,” inProceedings of the Workshop on Cryptographic Hardware and EmbeddedSystems, CHES 2000, Worcester, Massachusetts, USA, Aug. 17–18, 2000,Lecture Notes in Computer Science, vol. 1965, Springer, pp. 25–40.

[215] Okeya, K. and Sakurai, K., “Fast multi-scalar multiplication methods onelliptic curves with precomputation strategy using Montgomery trick,” inProceedings of the Workshop on Cryptographic Hardware and EmbeddedSystems, CHES 2002, Redwood City, California, USA, Aug. 13–15, 2002,Lecture Notes in Computer Science, vol. 2523, Springer, pp. 564–578.

— 111 —

[216] Okeya, K., Takagi, T., and Vuillaume, C., “Short memory scalar multi-plication on Koblitz curves,” in Proceedings of the Workshop on Crypto-graphic Hardware and Embedded Systems, CHES 2005, Edinburgh, Scot-land, UK, Aug. 29–Sep. 1, 2005, Lecture Notes in Computer Science, vol.3659, Springer, pp. 91–105.

[217] Omura, J.K. and Massey, J.L., “Computational method and apparatusfor finite field arithmetic,” United States Patent 4,587,627, Oct. 6, 1986.

[218] OpenSSL Project, “OpenSSL: The open source toolkit for SSL/TLS,”Web page, http://www.openssl.org (Oct. 9, 2008).

[219] Orlando, G. and Paar, C., “A super-serial Galois fields multiplier forFPGAs and its application to public-key algorithms,” in Proceedings ofthe 7th Annual IEEE Symposium on Field-Programmable Custom Com-puting Machines, FCCM 1999, Napa Valley, California, USA, Apr. 21–23,1999, pp. 232–239.

[220] ———, “A high-performance reconfigurable elliptic curve processor forGF (2m),” in Proceedings of the Workshop on Cryptographic Hardware andEmbedded Systems, CHES 2000, Worcester, Massachusetts, USA, Aug.17–18, 2000, Lecture Notes in Computer Science, vol. 1965, Springer, pp.41–56.

[221] ———, “A scalable GF (p) elliptic curve processor architecture for pro-grammable hardware,” in Proceedings of the Workshop on CryptographicHardware and Embedded Systems, CHES 2001, Paris, France, May 14–16, 2001, Lecture Notes in Computer Science, vol. 2162, Springer, pp.348–363.

[222] Ors, S.B., Batina, L., Preneel, B., and Vandewalle, J., “Hardware imple-mentation of an elliptic curve processor over GF (p),” in Proceedings ofthe IEEE International Conference on Application-Specific Systems, Ar-chitectures, and Processors, ASAP 2003, The Hague, The Netherlands,Jun. 24–26, 2003, pp. 433–443.

[223] Ors, S.B., Oswald, E., and Preneel, B., “Power-analysis attacks on anFPGA — first experimental results,” in Proceedings of the Workshop onCryptographic Hardware and Embedded Systems, CHES 2003, Cologne,Germany, Sep. 8–10, 2003, Lecture Notes in Computer Science, vol. 2779,Springer, pp. 35–50.

[224] Ozturk, E., Sunar, B., and Savas, E., “Low-power elliptic curve cryptogra-phy using scaled modular arithmetic,” in Proceedings of the Workshop onCryptographic Hardware and Embedded Systems, CHES 2004, Cambridge,Massachusetts, USA, Aug. 10–13, 2004, Lecture Notes in Computer Sci-ence, vol. 3156, Springer, pp. 92–106.

[225] Peter, S., Langendorfer, P., and Piotrowski, K., “Flexible hardware re-duction for elliptic curve cryptography in GF (2m),” in Proceedings of theConference and Exhibition on Design, Automation and Test in Europe,DATE 2007, Nice, France, Apr. 16–20, 2007, pp. 1259–1264.

— 112 —

[226] Popp, T., Mangard, S., and Oswald, E., “Power analysis attacks andcountermeasures,” IEEE Design and Test of Computers, vol. 24, no. 6,Nov.–Dec. 2007, pp. 535–543.

[227] Potgieter, M.J. and van Dyk, B.J., “Two hardware implementations ofthe group operations necessary for implementing an elliptic curve cryp-tosystem over a characteristic two finite field,” in Proceedings of the 6thIEEE Africon Conference in Africa, Africon 2002, George, South Africa,Oct. 2-4, 2002, vol. 1, pp. 187–192.

[228] Pramstaller, N. and Wolkerstorfer, J., “A universal and efficient AES co-processor for field programmable logic arrays,” in Proceedings of the 14thInternational Conference on Field Programmable Logic and Application,FPL 2004, Antwerp, Belgium, Aug. 30–Sep. 1, 2004, Lecture Notes inComputer Science, vol. 3203, Springer, pp. 565–574.

[229] Preneel, B., “A survey of recent developments in cryptographic algorithmsfor smart cards,” Computer Networks, vol. 51, no. 9, Jun. 2007, pp. 2223–2233.

[230] Proos, J., “Joint sparse forms and generating zero columns when comb-ing,” Techical Report CORR 2003-23, University of Waterloo, Canada,Jul. 7, 2003.

[231] Ravi, S., Raghunathan, A., Kocher, P., and Hattangady, S., “Security inembedded systems: Design challenges,” ACM Transactions on EmbeddedComputing Systems, vol. 3, no. 3, Aug. 2004, pp. 461–491.

[232] Reyhani-Masoleh, A., “Efficient algorithms and architectures for field mul-tiplication using Gaussian normal bases,” IEEE Transactions on Comput-ers, vol. 55, no. 1, Jan. 2006, pp. 34–47.

[233] Reyhani-Masoleh, A. and Hasan, M.A., “On low complexity bit parallelpolynomial basis multipliers,” in Proceedings of the Workshop on Crypto-graphic Hardware and Embedded Systems, CHES 2003, Cologne, Germany,Sep. 8–10, 2003, Lecture Notes in Computer Science, vol. 2779, Springer,pp. 189–202.

[234] ———, “Efficient digit-serial normal basis multipliers over binary exten-sion fields,” ACM Transactions on Embedded Computing Systems, vol. 3,no. 3, Aug. 2004, pp. 575–592.

[235] ———, “Low complexity bit parallel architectures for polynomial basismultiplication over GF (2m),” IEEE Transactions on Computers, vol. 53,no. 8, Aug. 2004, pp. 945–959.

[236] ———, “Low complexity word-level sequential normal basis multipliers,”IEEE Transactions on Computers, vol. 54, no. 2, Feb. 2005, pp. 98–110.

[237] Rijmen, V., “Efficient implementation of the Rijndael S-box,” Online doc-ument, http://citeseer.ist.psu.edu/293912.html (Oct. 9, 2008).

[238] Rivest, R., “The MD5 message-digest algorithm,” Request for Comments,RFC 1321, Apr. 1992, http://www.ietf.org/rfc/rfc1321.txt (Oct. 9, 2008).

— 113 —

[239] Rivest, R.L., Shamir, A., and Adleman, L., “A method for obtainingdigital signatures and public-key cryptosystems,” Communications of theACM, vol. 21, no. 2, Feb. 1978, pp. 120–126.

[240] Rodrıguez-Henrıquez, F., Saqib, N.A., and Dıaz-Perez, A., “4.2 Gbit/ssingle-chip FPGA implementation of AES algorithm,” Electronics Letters,vol. 39, no. 15, Jul. 24, 2003, pp. 1115–1116.

[241] ———, “A fast parallel implementation of elliptic curve point multiplica-tion over GF (2m),” Microprocessors and Microsystems, vol. 28, no. 5–6,Aug. 2004, pp. 329–339.

[242] Rosing, M., Implementing Elliptic Curve Cryptography, Manning Publi-cations Co., 1999.

[243] Rosner, M.C., Elliptic Curve Cryptosystems on Reconfigurable Hardware,Master’s thesis, Worcester Polytechnic Institute, 1998.

[244] Rouvroy, G., Standaert, F.X., Quisquater, J.J., and Legat, J.D., “Efficientuses of FPGAs for implementations of DES and its experimental linearcryptanalysis,” IEEE Transactions on Computers, vol. 52, no. 4, Apr.2003, pp. 473–482.

[245] ———, “Compact and efficient encryption/decryption module for FPGAimplementation of the AES Rijndael very well suited for small embeddedapplications,” in Proceedings of the International Conference on Informa-tion Technology: Coding and Computing, ITCC 2004, Las Vegas, Nevada,USA, Apr. 5–7, 2004, vol. 2, pp. 583–587.

[246] RSA Security, Inc., “The RSA factoring challenge,” Web page, http://www.rsa.com/rsalabs/node.asp?id=2092 (Oct. 9, 2008).

[247] Rudra, A., Dubey, P.K., Jutla, C.S., Kumar, V., Rao, J.R., and Ro-hatgi, P., “Efficient rijndael encryption implementation with compositefield arithmetic,” in Proceedings of the Workshop on Cryptographic Hard-ware and Embedded Systems, CHES 2001, Paris, France, May 14–16, 2001,Lecture Notes in Computer Science, vol. 2162, Springer, pp. 171–184.

[248] Saggese, G.P., Mazzeo, A., Mazzocca, N., and Strollo, A.G.M., “AnFPGA-based performance analysis of the unrolling, tiling, and pipeliningof the AES algorithm,” in Proceedings of the 13th International Confer-ence on Field-Programmable Logic and Applications, FPL 2003, Lisbon,Portugal, Sep. 1–3, 2003, Lecture Notes in Computer Science, vol. 2778,Springer, pp. 292–302.

[249] Sakiyama, K., Batina, L., Preneel, B., and Verbauwhede, I., “Superscalarcoprocessor for high-speed curve-based cryptography,” in Proceedings ofthe Workshop on Cryptographic Hardware and Embedded Systems, CHES2006, Yokohama, Japan, Oct. 10–13, 2006, Lecture Notes in ComputerScience, vol. 4249, Springer, pp. 415–429.

[250] ———, “Multicore curve-based cryptoprocessor with reconfigurable mod-ular arithmetic logic units over GF (2n),” IEEE Transactions on Comput-ers, vol. 56, no. 9, Sep. 2007, pp. 1269–1282.

— 114 —

[251] Sakiyama, K., Mentens, N., Batina, L., Preneel, B., and Verbauwhede,I., “Reconfigurable modular arithmetic logic unit supporting high-performance RSA and ECC over GF (p),” International Journal of Elec-tronics, vol. 94, no. 5, May 2007, pp. 501–514.

[252] Saqib, N.A., Rodrıguez-Henrıquez, F., and Dıaz-Perez, A., “A paral-lel architecture for fast computation of elliptic curve scalar multiplica-tion over GF (2m),” in Proceedings of the 18th International Parallel andDistributed Processing Symposium, IPDPS 2004, Santa Fe, New Mexico,USA, Apr. 26–30, 2004.

[253] Satoh, A., “High-speed hardware architectures for authenticated encryp-tion mode GCM,” in Proceedings of the 2006 IEEE International Sympo-sium on Circuits and Systems, ISCAS 2006, Island of Kos, Greece, May21–24, 2006, pp. 4831–4834.

[254] Satoh, A., Morioka, S., Takano, K., and Munetoh, S., “A compact Ri-jndael hardware architecture with S-box optimization,” in Proceedings ofthe 7th International Conference on the Theory and Application of Cryp-tology and Information Security, Advances in Cryptology — ASIACRYPT2001, Gold Coast, Australia, Dec. 9–13, 2001, Lecture Notes in ComputerScience, vol. 2248, Springer, pp. 239–254.

[255] Satoh, A. and Takano, K., “A scalable dual-field elliptic curve crypto-graphic processor,” IEEE Transactions on Computers, vol. 52, no. 4, Apr.2003, pp. 449–460.

[256] Savas, E. and Koc, C.K., “The Montgomery modular inverse—revisited,”IEEE Transactions on Computers, vol. 49, no. 7, Jul. 2000, pp. 763–766.

[257] Savas, E., Tenca, A.F., and Koc, C.K., “A scalable and unified multi-plier architecture for finite fields GF (p) and GF (2m),” in Proceedings ofthe Workshop on Cryptographic Hardware and Embedded Systems, CHES2000, Worcester, Massachusetts, USA, Aug. 17–18, 2000, Lecture Notesin Computer Science, vol. 1965, Springer, pp. 277–292.

[258] Schneier, B., Applied Cryptography, John Wiley & Sons, Inc., 2nd ed.,1996.

[259] Schroeppel, R., Beaver, C., Gonzales, R., Miller, R., and Draelos, T., “Alow-power design for an elliptic curve digital signature chip,” in Proceed-ings of the Workshop on Cryptographic Hardware and Embedded Systems,CHES 2002, Redwood City, California, USA, Aug. 13–15, 2002, LectureNotes in Computer Science, vol. 2523, Springer, pp. 366–381.

[260] Schroeppel, R., Orman, H., O’Malley, S., and Spatscheck, O., “Fastkey exchange with elliptic curve systems,” in Proceedings of the 15thAnnual International Cryptology Conference, Advances in Cryptology —CRYPTO ’95, Santa Barbara, California, USA, Aug. 27–31, 1995, LectureNotes in Computer Science, vol. 963, Springer, pp. 43–56.

[261] Shacham, H. and Boneh, D., “Improving SSL handshake performancevia batching,” in Proceedings of the Cryptographers’ Track at the RSA

— 115 —

Conference 2001, Topics in Cryptology — CT-RSA 2001, San Francisco,California, USA, Apr. 8–12, 2001, Lecture Notes in Computer Science,vol. 2020, Springer, pp. 28–43.

[262] Shu, C., Gaj, K., and El-Ghazawi, T., “Low latency elliptic curve cryp-tography accelerators for NIST curves over binary fields,” in Proceedingsof the 2005 IEEE International Conference on Field Programmable Tech-nology, FPT 2005, Singapore, Dec. 11–14, 2005, pp. 309–310.

[263] Smart, N.P., “How secure are elliptic curves over composite extensionfields?” in Proceedings of the International Conference on the Theoryand Application of Cryptographic Techniques, Advances in Cryptology —EUROCRYPT 2001, Aarhus, Denmark, May 21–23, 2001, Lecture Notesin Computer Science, vol. 2045, Springer, pp. 30–39.

[264] Solinas, J.A., “An improved algorithm for arithmetic on a family of el-liptic curves,” in Proceedings of the 17th Annual International CryptologyConference, Advances in Cryptology — CRYPTO ’97, Santa Barbara,California, USA, Aug. 17–21, 1997, Lecture Notes in Computer Science,vol. 1294, Springer, pp. 357–371.

[265] ———, “Generalized Mersenne numbers,” Techical Report CORR 99-39,University of Waterloo, Canada, 1999.

[266] ———, “Improved algorithms for arithmetic on anomalous binarycurves,” Techical Report CORR 99-46, University of Waterloo, Canada,1999, corrected and updated version of [264].

[267] ———, “Efficient arithmetic on Koblitz curves,” Designs, Codes andCryptography, vol. 19, no. 2–3, 2000, pp. 195–249.

[268] ———, “Low-weight binary representations for pairs of integers,” TechicalReport CORR 2001-41, University of Waterloo, Canada, 2001.

[269] Song, L. and Parhi, K.K., “Low-energy digit-serial/parallel finite fieldmultipliers,” Journal of VLSI Signal Processing, vol. 19, no. 2, Jul. 1998,pp. 149–166.

[270] Sozzani, F., Bertoni, G., Turcato, S., and Breveglieri, L., “A parallelizeddesign for an elliptic curve cryptosystem coprocessor,” in Proceedingsof the International Conference on Information Technology: Coding andComputing, ITCC 2005, Las Vegas, Nevada, USA, Apr. 4–6, 2005, vol. 1,pp. 626–630.

[271] Standaert, F.X., Peeters, E., Rouvroy, G., and Quisquater, J.J., “Anoverview of power analysis attacks against field programmable gate ar-rays,” Proceedings of the IEEE, vol. 94, no. 2, Feb. 2006, pp. 383–394.

[272] Standaert, F.X., Rouvroy, G., Quisquater, J.J., and Legat, J.D., “Effi-cient implementation of Rijndael encryption in reconfigurable hardware:Improvements and design tradeoffs,” in Proceedings of the Workshop onCryptographic Hardware and Embedded Systems, CHES 2003, Cologne,Germany, Sep. 8–10, 2003, Lecture Notes in Computer Science, vol. 2779,Springer, pp. 334–350.

— 116 —

[273] Stein, J., “Computational problems associated with Racah algebra,” Jour-nal of Computational Physics, vol. 1, no. 3, Feb. 1967, pp. 397–405.

[274] Stinson, D.R., Cryptography Theory and Practice, Chapman & Hall/CRC,2nd ed., 2002.

[275] Su, C.P., Lin, T.F., Huang, C.T., and Wu, C.W., “A high-throughput low-cost AES processor,” IEEE Communications Magazine, vol. 41, no. 12,Dec. 2003, pp. 86–91.

[276] Sun Microsystems, Inc., “UltraSPARC T2 processor datasheet,”Data sheet, 2007, http://www.sun.com/processors/UltraSPARC-T2/datasheet.pdf (Oct. 9, 2008).

[277] Sunar, B. and C. K. Koc, “Mastrovito multiplier for all trinomials,” IEEETransactions on Computers, vol. 48, no. 5, May 1999, pp. 522–527.

[278] ———, “An efficient optimal normal basis type II multiplier,” IEEETransactions on Computers, vol. 50, no. 1, Jan. 2001, pp. 83–87.

[279] Tillich, S. and Großschadl, J., “Instruction set extensions for efficient AESimplementation on 32-bit processors,” in Proceedings of the Workshop onCryptographic Hardware and Embedded Systems, CHES 2006, Yokohama,Japan, Oct. 10–13, 2006, Lecture Notes in Computer Science, vol. 4249,Springer, pp. 270–284.

[280] Todman, T.J., Constantinides, G.A., Wilton, S.J.E., Mencer, O., Luk,W., and Cheung, P.Y.K., “Reconfigurable computing: architectures anddesign methods,” IEE Proceedings: Computers and Digital Techniques,vol. 152, no. 2, Mar. 2005, pp. 193–207.

[281] Tommiska, M., Applications of Reprogrammability in Algorithm Accelera-tion, Ph.D. thesis, Helsinki University of Technology, 2005.

[282] Verbauwhede, I., Schaumont, P., and Kuo, H., “Design and performancetesting of a 2.29-GB/s Rijndael processor,” IEEE Journal of Solid-StateCircuits, vol. 38, no. 3, Mar. 2003, pp. 569–572.

[283] von zur Gathen, J. and Shokrollahi, J., “Efficient FPGA-based Karatsubamultipliers for polynomials over F2,” in Revised Selected Papers of the 12thInternational Workshop on Selected Areas in Cryptography, SAC 2005,Kingston, Ontario, Canada, Aug. 10–12, 2005, Lecture Notes in ComputerScience, vol. 3897, Springer, pp. 359–369.

[284] Vuillaume, C., Okeya, K., and Takagi, T., “Short memory scalar multi-plication for koblitz curves,” IEEE Transactions on Computers, vol. 57,no. 4, 2008, pp. 481–489.

[285] Wang, C.C., Troung, T.K., Shao, H.M., Deutsch, L.J., Omura, J.K., andReed, I.S., “VLSI architectures for computing multiplications and inversesin GF (2m),” IEEE Transactions on Computers, vol. 34, no. 8, Aug. 1985,pp. 709–717.

— 117 —

[286] Wang, X., Yin, Y.L., and Yu, H., “Finding collisions in the full SHA-1,”in Proceedings of the 25th Annual International Cryptology Conference,Advances in Cryptology — CRYPTO 2005, Santa Barbara, California,USA, Aug. 14–18, 2005, Lecture Notes in Computes Science, vol. 3621,pp. 17–36.

[287] Wang, X. and Yu, H., “How to break MD5 and other hash functions,” inProceedings of the 24th Annual International Conference on the Theoryand Applications of Cryptographic Techniques, Advances in Cryptology —EUROCRYPT 2005, Aarhus, Denmark, May 22–26, 2005, Lecture Notesin Computer Science, vol. 3494, Springer, pp. 19–35.

[288] Weeks, B., Bean, M., Rozylowicz, T., and Ficke, C., “Hardware perfor-mance simulations of round 2 advanced encryption standard algorithms,”in Proceedings of the 3rd Advanced Encryption Standard Candidate Con-ference, New York, New York, USA, Apr. 13–14, 2000, pp. 286–304.

[289] Wiles, A., “Modular elliptic curves and Fermat’s last theorem,” The An-nals of Mathematics, vol. 142, no. 3, May 1995, pp. 443–551.

[290] Wolf, W., FPGA-Based System Design, Prectice Hall, 2004.

[291] Wolkerstorfer, J., “Dual-field arithmetic unit for GF (p) and GF (2m),” inProceedings of the Workshop on Cryptographic Hardware and EmbeddedSystems, CHES 2002, Redwood City, California, USA, Aug. 13–15, 2002,Lecture Notes in Computer Science, vol. 2523, Springer, pp. 500–514.

[292] Wolkerstorfer, J., Oswald, E., and Lamberger, M., “An ASIC implemen-tation of the AES Sboxes,” in Proceedings of the Cryptographers’ Track atthe RSA Conference, Topics in Cryptology — CT-RSA 2002, San Jose,California, USA, Feb. 18–22, 2002, Lecture Notes in Computer Science,vol. 2271, Springer, pp. 67–78.

[293] Wollinger, T., Guajardo, J., and Paar, C., “Security on FPGAs: State-of-the-art implementations and attacks,” ACM Transactions on EmbeddedComputing Systems, vol. 3, no. 3, Aug. 2004, pp. 534–574.

[294] Wollinger, T. and Paar, C., “How secure are FPGAs in cryptographicapplications?” in Proceedings of the 13th International Conference onField-Programmable Logic and Applications, FPL 2003, Lisbon, Portugal,Sep. 1–3, 2003, Lecture Notes in Computer Science, vol. 2778, Springer,pp. 91–100.

[295] Wu, H., “Low complexity bit-parallel finite field arithmetic using polyno-mial basis,” in Proceedings of the Workshop on Cryptographic Hardwareand Embedded Systems, CHES 1999, Worcester, Massachusetts, USA,Aug. 12–13, 1999, Lecture Notes in Computer Science, vol. 1717, Springer,pp. 280–291.

[296] ———, “On complexity of polynomial basis squaring in F2m ,” in Pro-ceedings of the 7th Annual International Workshop on Selected Areas inCryptography, SAC 2000, Waterloo, Ontario, Canada, Aug. 14–15, 2000,Lecture Notes in Computer Science, vol. 2012, Springer, pp. 118–129.

— 118 —

[297] ———, “Bit-parallel finite field multiplier and squarer using polynomialbasis,” IEEE Transactions on Computers, vol. 51, no. 7, Jul. 2002, pp.750–758.

[298] ———, “Montgomery multiplier and squarer for a class of finite fields,”IEEE Transactions on Computers, vol. 51, no. 5, May 2002, pp. 521–529.

[299] Xilinx, Inc., “Virtex-E 1.8 V field programmable gate arrays,” Datasheet, Jul. 17, 2002, http://www.xilinx.com/support/documentation/data sheets/ds022.pdf (Oct. 9, 2008).

[300] ———, “MicroBlaze processor reference guide,” Data sheet, Jun.25, 2007, http://www.xilinx.com/support/documentation/sw manuals/edk92i mb ref guide.pdf (Oct. 9, 2008).

[301] ———, “Virtex-4 family overview,” Data sheet, Sep. 28, 2007, http://www.xilinx.com/support/documentation/data sheets/ds112.pdf (Oct. 9,2008).

[302] ———, “Virtex-II platform FPGAs: Complete data sheet,” Datasheet, Nov. 5, 2007, http://www.xilinx.com/support/documentation/data sheets/ds031.pdf (Oct. 9, 2008).

[303] ———, “Virtex-5 family overview,” Data sheet, Sep. 23, 2008, http://www.xilinx.com/support/documentation/data sheets/ds100.pdf (Oct. 9,2008).

[304] Zambreno, J., Nguyen, D., and Choudhary, A., “Exploring area/delaytradeoffs in an AES FPGA implementation,” in Proceedings of the 14thInternational Conference on Field Programmable Logic and Application,FPL 2004, Antwerp, Belgium, Aug. 30–Sep. 1, 2004, Lecture Notes inComputer Science, vol. 3203, Springer, pp. 575–585.

[305] Zhang, T. and Parhi, K.K., “Systematic design of original and modifiedMastrovito multipliers for general irreducible polynomials,” IEEE Trans-actions on Computers, vol. 50, no. 7, Jul. 2001, pp. 734–749.

[306] Zhang, X. and Parhi, K.K., “Implementation approaches for the advancedencryption standard algorithm,” IEEE Circuits and Systems Magazine,vol. 2, no. 4, 2002, pp. 24–46.

[307] ———, “High-speed VLSI architectures for the AES algorithm,” IEEETransactions on Very Large Scale Integration (VLSI) Systems, vol. 12,no. 9, Sep. 2004, pp. 957–967.

[308] ———, “On the optimum constructions of composite field for the AES al-gorithm,” IEEE Transactions on Circuits and Systems—Part II: ExpressBriefs, vol. 53, no. 10, Oct. 2006, pp. 1153–1157.

[309] Zhou, G., Michalik, H., and Hinsenkamp, L., “Efficient and high-throughput implementations of AES-GCM on FPGAs,” in Proceedingsof the 2007 International Conference on Field Programmable Technology,FPT 2007, Kitakyushu, Japan, Dec. 12–14, 2007, pp. 185–192.

— 119 —

[310] Zhou, J.Y., Liu, X.W., and Jiang, X.G., “Implementing elliptic curvecryptography on Nios II processor,” in Proceedings of the 7th InternationalConference on ASIC, ASICON 2007, Guilin, China, Oct. 26–29, 2007, pp.205–208.

— 120 —

Documents

STUDIES ON HIGH-SPEED HARDWARE ...lib.tkk.fi/Diss/2008/isbn9789512295906/isbn9789512295906.pdfHelsinki University of Technology Department of Signal Processing and Acoustics Espoo