92
Stefan Heyse May 31, 2009 Diploma Thesis Ruhr-University Bochum Faculty of Electrical Engineering and Information Technology Chair for Embedded Security Prof. Dr.-Ing. Christof Paar Dr.-Ing. Tim Güneysu

Stefan Heyse May 31, 2009 - Ruhr-Universität Bochum · Stefan Heyse May 31, 2009 Diploma Thesis Ruhr-University Bochum Faculty of Electrical Engineering and Information Technology

  • Upload
    others

  • View
    7

  • Download
    0

Embed Size (px)

Citation preview

  • Code-basedCryptography:Implementing theMcEliece Scheme on Recon�gurableHardwareStefan Heyse

    May 31, 2009

    Diploma Thesis

    Ruhr-University Bochum

    Faculty of Electrical Engineering and Information

    Technology

    Chair for Embedded Security

    Prof. Dr.-Ing. Christof Paar

    Dr.-Ing. Tim Güneysu

  • Gesetzt am October 14, 2009 um 12:09 Uhr.

  • IIIAbstractMost advanced security systems rely on public-key schemes based either on the

    factorization or discrete logarithm problem. Since both problems are known to beclosely related, a major breakthrough in cryptanalysis tackling one of those prob-

    lems could render a large set of cryptosystems completely useless. The McEliecepublic-key scheme is based on the alternative security assumption that decodingunknown linear, binary codes is NP-complete. In this work, we investigate the effi-

    cient implementation of the McEliece scheme on reconfigurable hardware what was– up to date – considered a challenge due to the required storage of its large keys. To

    the best of our knowledge, this is the first time that the McEliece encryption schemeis implemented on a Xilinx Spartan-3 FPGA.ZusammenfassungDie meisten modernen Sicherheitssysteme basieren auf Public-Key-Verfahren dieentweder auf dem Problem der Faktorisierung oder dem diskreten Logarithmuspro-

    blem beruhen. Da die beiden Probleme dafür bekannt sind, eng miteinander verbun-den zu sein, könnte ein wichtiger Durchbruch in der Kryptoanalyse, der eines dieser

    Probleme löst, eine große Anzahl von Kryptosystemen völlig nutzlos machen. DasMcEliece Public-Key-System basiert auf der alternativen Annahme, dass die Deco-dierung unbekannte linearer Codes NP-vollständig ist. In dieser Arbeit untersuchen

    wir die effiziente Umsetzung des McEliece Verfahren auf rekonfigurierbarer Hard-ware, was durch die notwendige Speicherung der grossen Schlüssel - bis heute - eine

    Herausforderung war. Nach besten Wissen ist dies das erste Mal, dass das McElieceVerfahren auf einem Xilinx Spartan-3 FPGA implementiert wird.

  • V

    FOR MANDY AND EMELIE

  • VI

    Erklärung/Statement

    Hiermit versichere ich, dass ich meine Diplomarbeit selbst verfaßt und keine an-

    deren als die angegebenen Quellen und Hilfsmittel benutzt sowie Zitate kenntlichgemacht habe.

    I hereby declare that the work presented in this thesis is my own work and thatto the best of my knowledge it is original except where indicated by references to

    other authors.

    Bochum, den October 14, 2009

    Stefan Heyse

  • Contents1. Introduction 31.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.2. Existing Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    1.3. Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.4. Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52. The McEliece Crypto System 72.1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2. Key Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.3. Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    2.4. Decryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.5. Reducing Memory Requirements . . . . . . . . . . . . . . . . . . . . . . 9

    2.6. Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.6.1. Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102.6.2. Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    2.7. Side Channel Attacks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123. Introduction to Coding Theory 153.1. Codes over Finite Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

    3.2. Goppa Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153.3. Parity Check Matrix of Goppa Codes . . . . . . . . . . . . . . . . . . . 163.4. Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

    3.5. Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.6. Solving the Key Equation . . . . . . . . . . . . . . . . . . . . . . . . . . 19

    3.7. Extracting Roots of the Error Locator Polynomial . . . . . . . . . . . . 204. Recon�gurable Hardware 234.1. Introducing FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

    4.1.1. Interfacing the FPGA . . . . . . . . . . . . . . . . . . . . . . . . . 244.1.2. Buffers in Dedicated Hardware . . . . . . . . . . . . . . . . . . . 244.1.3. Secure Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

    4.2. Field Arithmetic in Hardware . . . . . . . . . . . . . . . . . . . . . . . . 265. Designing for Area-Time-E�ciency 295.1. Extended Euclidean Algorithm . . . . . . . . . . . . . . . . . . . . . . . 29

    5.1.1. Multiplication Component . . . . . . . . . . . . . . . . . . . . . 31

  • Contents 1

    5.1.2. Component for Degree Extraction . . . . . . . . . . . . . . . . . 315.1.3. Component for Coefficient Division . . . . . . . . . . . . . . . . 32

    5.1.4. Component for Polynomial Multiplication and Shifting . . . . 335.2. Computation of Polynomial Square Roots . . . . . . . . . . . . . . . . . 345.3. Computation of Polynomial Squares . . . . . . . . . . . . . . . . . . . . 356. Implementation 376.1. Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

    6.1.1. Reading the Public Key . . . . . . . . . . . . . . . . . . . . . . . 38

    6.1.2. Encrypting a message . . . . . . . . . . . . . . . . . . . . . . . . 396.2. Decryption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

    6.2.1. Inverting the Permutation P . . . . . . . . . . . . . . . . . . . . . 426.3. Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

    6.3.1. Computing the Square Root of T(z)+z . . . . . . . . . . . . . . . 45

    6.3.2. Solving the Key Equation . . . . . . . . . . . . . . . . . . . . . . 466.3.3. Computing the Error Locator Polynomial Sigma . . . . . . . . . 46

    6.3.4. Searching Roots . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476.4. Reverting the Substitution S . . . . . . . . . . . . . . . . . . . . . . . . . 487. Results 518. Discussion 558.1. Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

    8.2. Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 568.3. Outlook for Further Work . . . . . . . . . . . . . . . . . . . . . . . . . . 56A. Tables 57B. Magma Functions 59C. VHDL Code Snippets 65D. Bibliography 69E. List of Figures 73F. List of Tables 75G. Listings 77H. List of Algorithms 79

  • 1. Introduction1.1. MotivationThe advanced properties of public-key cryptosystems are required for many crypto-graphic issues, such as key establishment between parties and digital signatures. In

    this context, RSA, ElGamal, and later ECC have evolved as most popular choices andbuild the foundation for virtually all practical security protocols and implementa-tions with requirements for public-key cryptography. However, these cryptosystems

    rely on two primitive security assumptions, namely the factoring problem (FP) andthe discrete logarithm problem (DLP), which are also known to be closely related.

    With a significant breakthrough in cryptanalysis or a major improvement of thebest known attacks on these problems (i.e., the Number Field Sieve or Index Calculus),a large number of recently employed cryptosystems may turn out to be insecure

    overnight. Already the existence of a quantum computer that can provide compu-tations on a few thousand qubits would render FP and DLP-based cryptography

    useless. Though quantum computers of that dimension have not been reported tobe built yet, we already want to encourage a larger diversification of cryptographicprimitives in future public-key systems. However, to be accepted as real alternatives

    to conventional systems like RSA and ECC, such security primitives need to supportefficient implementations with a comparable level of security on recent computing

    platforms. For example, one promising alternative are public-key schemes basedon Multivariate Quadratic (MQ) polynomials for which hardware implementationswere proposed on CHES 2008 [9]. In this work, we demonstrate the efficient im-

    plementation of another public-key cryptosystem proposed by Robert J. McEliece in1978 that is based on coding theory [35]. The McEliece cryptosystem incorporates

    a linear error-correcting code (namely a Goppa code) which is hidden as a gen-eral linear code. For Goppa codes, fast decoding algorithms exist when the code isknown, but decoding codewords without knowledge of the coding scheme is proven

    NP-complete [3]. Contrary to DLP and FP-based systems, this makes this schemealso suitable for post-quantum era since it will remain unbroken when appropriately

    chosen security parameters are used [6].

    The vast majority1 of today’s computing platforms are embedded systems. Onlya few years ago, most of these devices could only provide a few hundred bytes

    1Already in 2001, 98% of the microprocessors in world-wide production were assembled in embed-ded platforms.

  • 4 Chapter 1 - Introduction

    of RAM and ROM which was a tight restriction for application (and security) de-signers. Thus, the McEliece scheme was regarded impracticable on such small and

    embedded systems due to the large size of the private and public keys.For examplefor a 80 bit security level, public key is 437.75 Kbyte and secret key is 377 Kbyte large.But nowadays, recent families of microcontrollers provide several hundreds of bytes

    of Flash-ROM. Moreover, recent off-the-shelf hardware such as FPGAs also containdedicated memory blocks and Flash memories that support on-chip storage of up to

    a few megabits of data. In particular, these memories could be used, e.g., to storethe keys of the McEliece cryptosystem.

    While a microcontroller implementation already exist [16] this work present thefirst implementation of the McEliece cryptosystem on a Xilinx Spartan-3AN 1400

    FPGA which is a suitable for many embedded system applications. To the bestof our knowledge, no other implementations for the McEliece scheme have been

    proposed targeting embedded platforms. Fundamental operations for McEliece arebased on encoding and decoding binary linear codes in binary extension fields that,in particular, can be implemented very efficiently in dedicated hardware. Unlike FP

    and DLP-based cryptosystems, operations on binary codes do not require compu-tationally expensive multi-precision integer arithmetic what is beneficial for small

    computing platforms.

    On quantum computers algorithms with a polynomial complexity exist whichbreak the security assumption in both cases [41]. McEliece is based on a proven

    NP-complete problem and is therefore not effected by the computational power ofquantum computers. This is the primary reason why McEliece is an interesting

    candidate for post quantum cryptography.

    Finally, the rising market of pervasive computing, in which mobile phones, cars

    and even white goods communicate with each other create a new kind of cryptog-raphy: lightweight cryptography. In the context of lightweight cryptography it is

    important for crypto systems to perform fast on systems with limited memory, CPUresources and battery capacity. Therefore many symmetric and asymmetric algo-rithms like AES, DES, RSA and ECC were ported to microcontrollers and even new

    algorithms like Present [10] were developed especially for embedded systems.

    For these reasons it is particularly interesting how McEliece performs on embed-ded systems in comparison to commonly used public key schemes like RSA and

    ECC.

  • 1.2 Existing Implementations 51.2. Existing ImplementationsThere exist only a few McEliece software implementations [39, 40] for 32 bit archi-

    tectures. Implementation [39] is written in pure i386 assembler. It encrypts with 6KBits/s and decrypts with 1.7 Kbits/s on a 16 MHz i386-CPU. A C-implementation

    for 32-bit architectures exists in [40]. The code is neither formatted nor does it con-tain a single line of source code comment. Nevertheless, it was used for the opensource p2p software freenet and entropy [19, 18]. Due to the poor documentation

    we are unable to give performance results for the specific code. An FPGA imple-mentation of the original McEliece crypto system appears not to exist, only in [8] a

    signature scheme derived from McEliece is implemented on a FPGA.

    To the best of our knowledge, this work presents the first implementation of theMcEliece public key scheme for embedded systems.1.3. GoalsGoal of this thesis is a proof-of-concept implementation of McEliece for reconfig-

    urable hardware. The primary target is to solve the memory problem for the largepublic and private key data and fit all components into a low cost FPGA of the

    Spartan-3 class. Simultaneously we want to achieve a high performance to allowthe McEliece scheme to become a competitor in post quantum crypto systems onembedded systems. For encryption the large public key has to be stored in a fast

    accessible area and for decryption a way has to be found to reduce the private keysize, because the target FPGA has only limited memory available.1.4. OutlineFirst, the growing interest in post quantum cryptography is motivated and whyespecially embedded implementations are necessary. Next, an introduction to the

    McEliece scheme is given. Subsequently the basic concepts of error correcting codes,especially Goppa codes are presented. Afterwards, an introduction to FPGAs is

    given and it is shown how to perform finite field arithmetics with them. Then theencryption implementation is explained with respect to the properties of the targetplatform. Next, the implementation of decryption is described on the second target

    platform. Finally, the results are presented and options for further improvement areoutlined.

  • 2. The McEliece Crypto SystemIn this chapter an introduction to McEliece is given and the algorithms to generatethe key pairs and how to encrypt and decrypt a message are presented. After-

    wards, currently known weaknesses and actual attacks on the McEliece scheme arepresented.2.1. OverviewThe McEliece crypto system consists of three algorithms which will be explained in

    the following sections: A key generation algorithm which produces a public/privatekey pair, an encryption algorithm and a decryption algorithm. The public key is

    a hidden generator matrix G of a binary linear code of length n and dimension kcapable of correcting up to t errors. R. McEliece suggested to use classical binarygoppa codes. The generator matrix of this code is hidden using a random k × kbinary non-singular substitution matrix S and a random n× n permutation matrixP. The matrix Ĝ = S ⋅ G ⋅ P and the error- correcting capability t forms the publickey. G itself together with matrices S and P form the secret key.

    To encrypt a message, it is transformed into a codeword of the underlying code

    and an error e with Hamming weight wH(e) ≤ t is added. Without knowledgeof the specific code used, errors can not be corrected and therefore the originalmessage cannot be recovered. The owner of the secret informations can reverse the

    transformations of G and use the decoding algorithm of the code to correct the errorsand recover the original message.

    c‘ = m ⋅ Ĝ = m ⋅ S ⋅ G ⋅ P→ c = c‘ + e = m ⋅ S ⋅ G ⋅ P + e→ c ⋅ P−1 = m ⋅ S ⋅ G ⋅ P ⋅ P−1 + e ⋅ P−1 = m ⋅ S ⋅ G + e ⋅ P−1

    (2.1)

    The decoding algorithm can still be used to correct the t errors and get m‘ =m ⋅ S because the equation wH(e) = wH(e ⋅ P−1) ≤ t still holds. To finally receivethe original message m a multiplication with S−1 is performed which results inm‘ ⋅ S−1 = m ⋅ S ⋅ S−1 = m.

  • 8 Chapter 2 - The McEliece Crypto System2.2. Key GenerationBasically the parameters of the McEliece crypto system are the parameters of theGoppa code used. After choosing the underlying Galois field GF(2m)and the errorcorrecting capability of the code t, all other parameters depend on these two values.

    The original parameters suggested by McEliece in [35] are m = 10, t = 50, but incitehac, the authors noted that t = 38 is a better choice with regard to the compu-tational complexity of the algorithm while not reducing the security level. Theseparameters nowadays only give 260 bit security [6] compared to a symmetric ciher.To achieve an appropriate level of 280-ḃit security at least parameters n = 211 = 2048,t = 38 and k = n−m ⋅ t = 1751 must be chosen. Then the key generation is:

    Algorithm 1 Key Generation Algorithm for McEliece Scheme

    Input: Security Parameters m, tOutput: Kpub, Ksec

    1: n← 2m, k← n−m ⋅ t2: C← random binary (n, k)-linear code C capable of correcting t errors3: G← k× n generator matrix for the code C4: S← random k× k binary non-singular matrix5: P← random n× n permutation matrix6: Ĝ← k× n matrix S× G× P7: return Public Key(Ĝ, t); Private Key(S, G, P).

    Note that in Section 2.4 only matrices P−1 and S−1 are used. Here exist the pos-sibility to precompute these inverse matrices and then the private (decryption) keyis redefined to (S−1, G, P−1). The permutation matrix P is very sparse. In every rowand columns exactly a single one occurs. This fact is used in the implementation tosave space when storing P.2.3. EncryptionSuppose Bob wishes to send a message m to Alice whose public key is (Ĝ, t):

    McEliece encrypts a k bit message into a n bit ciphertext. With the actual parame-ters, this results in 2048 bit ciphertext for a 1751 bit message. This is an overhead of

    about nk = 1.17.

  • 2.4 Decryption 9

    Algorithm 2 McEliece Message Encryption

    Input: m, Kpub = (Ĝ, t)Output: Ciphertext c

    1: Encode the message m as a binary string of length k

    2: c‘← m ⋅ Ĝ3: Generate a random n-bit error vector z containing at most t ones

    4: c = c‘ + z5: return c2.4. Decryption

    To decrypt the ciphertext and receive the message Alice has to perform the following

    steps:

    Algorithm 3 McEliece Message Decryption

    Input: c, Ksec = (P−1, G, S−1)Output: Plaintext m

    1: ĉ← c ⋅ P−12: Use a decoding algorithm for the code C to decode ĉ to m̂ = m ⋅ S3: m← m̂ ⋅ S−14: return m

    2.5. Reducing Memory RequirementsTo make McEliece-based cryptosystems more practical (i.e., reduce the key sizes),there is an ongoing research to replace the code with one that can be represented in

    a more compact way. Examples for such alternative representations are quasi-cycliccodes [20], or low density parity check codes [37], but have also been broken again []

    Using a naive approach, in which the code is constructed from the set of all ele-ments in Fm2 in lexicographical order and both matrices S, P are totally random, the

    public key Ĝ = S× G× P becomes a random n × k matrix. However, since P is asparse permutation matrix with only a single 1 in each row and column, it is more

    efficient to store only the positions of the 1’s, resulting in an array with n ⋅m bits.Another trick to reduce the public key size is to convert Ĝ to systematic form{Ik ∣ Q}, where Ik is the k× k identity matrix. Then, only the (k × (n− k)) matrixQ is published [17]. But to achieve the same security level the message has to be

  • 10 Chapter 2 - The McEliece Crypto System

    converted and additional operations [17] has to be computed to avoid weaknessesdue to the know structure of Q [17].

    In the last step of code decoding (Algorithm 4), the k message bits out of the n(corrected) ciphertext bits need to be extracted. Usually, this is done by a mapping

    matrix iG with G × iG = Ik. But if G is in systematic form, then this step can beomitted, since the first k bits of the corrected ciphertext corresponds to the messagebits. Unfortunately, G and Ĝ cannot both be systematic at the same time, since then

    Ĝ = {Ik ∣ Q̂} = S × {Ik ∣ Q} × P and S would be the identity matrix which isinappropriate for use as the secret key.

    For reduction of the secret key size, we chose to generate the large scrambling ma-trix S−1 on-the-fly using a cryptographic pseudo random number generator (CPRNG)and a seed. During key generation, it must be ensured that the seed does not gen-

    erate a singular matrix S−1. A random binary matrix is with a probability of about33% invertible (value is empiric, not proven). Depending on the target platform and

    available cryptographic accelerators, there are different options to implement such aCPRNG (e.g. AES in counter mode or a hash-based PRNG) on embedded platforms.However, to prevent the security margin to be reduced, the complexity to attack the

    sequence of the CPRNG should not be significantly lower than to break the consid-ered McEliece system with a static scrambling matrix S. However, the secrecy of

    S−1 is not required for hiding the secret polynomial G(z) [17]. The secret matrix Sindeed has no cryptographic function in hiding the secret goppa polynomial g(z).Today, there is no way to recover H with the knowledge of only G ⋅ P.2.6. SecurityIn the past, many researchers attempted to break the McEliece scheme [33, 32, 2] butnone of them was successful for the general case.2.6.1. WeaknessesAt Crypto’97, Berson [7] shows that McEliece has two weaknesses. It fails when

    encrypting the same message twice and when encrypting a message that has aknown relation to another message. In the first case assume that c1 = m ⋅ Ĝ + e1and c2 = m ⋅ Ĝ + e2 with e1 ∕= e2 are send which leads to c1 + c2 = e1 + e2.

    Now compute two sets L0, L1 where L0 contains all positions l where c1 + c2 iszero and L1 contains the positions l where c1 + c2 is one. Due to that two errors e1, e2are chosen independently and the transmitted messages are identical, it follows that

  • 2.6 Security 11

    for a position l ∈ L0 most probably neither c1(l) nor c2(l) is modified by an error,while for a position l ∈ L1 certainly one out of c1(l) or c2(l) is modified by an error.For the parameters suggested by McEliece n = 1024, k = 524, t = 50 the probabilityfor a l ∈ L0 of e1(l) = e2(l) = 1 is about 0.0024 (or the other way round) for mostl ∈ L0 mainly e1(l) = e2(l) = 0. The probability pi that exactly i positions are for thesame time changed by e1 and e2 is:

    pi = Pr(∣{l : e1(l) = 1} ∩ {l : e2(l) = 1}∣ = i) =( 501025)(

    97450−i)

    (102450 )(2.2)

    Therefore, the cardinality of L1 is:

    E(∣L1 ∣) =50

    ∑i=0

    (100− 2 ⋅ i) ⋅ pi ≈ 95.1 (2.3)

    For example, with the cardinality ∣L1∣ = 94 it follows that ∣L0∣ = 930 and only3 entries of L0 are affected by an error. The probability to select 524 unmodified

    positions from L0 is

    (927525)

    (930524)≈ 0.0828 (2.4)

    Therefore after about 12 trials an unmodified codeword is selected which can bedecoded with the help of the public generator matrix.

    In the second case where the two messages have a known linear relation the sumof the two ciphertexts becomes:

    c1 + c2 = m1 ⋅ Ĝ + m2 ⋅ Ĝ + e1 + e2 (2.5)

    Due to the known relation (m1 + m2) ⋅ Ĝ can be computed and subsequently c1 +c2 + (m1 +m2) ⋅ Ĝ = e1 + e2. Now, proceed like in the first case, using c1 + c2 + (m1 +m2) ⋅ Ĝ = e1 + e2 instead of c1 + c2. Remark that this attack does not reveal the secretkey.

    There are several ways to make McEliece resistant against those weaknesses. Mostof them scramble or randomize the messages to destroy any relationship between

    two dependent messages[17].

  • 12 Chapter 2 - The McEliece Crypto System2.6.2. AttacksAccording to [17], there is no simple rule for choosing t with respect to n. Oneshould try to make an attack as difficult as possible using the best known attacks.

    A recent paper [5] from Bernstein, Lange and Peters introduces an improved at-tack of McEliece with support of Bernsteins list decoding algorithm [4] for binary

    Goppa codes. List decoding can approximately correct n −√

    n ⋅ (n− 2t− 2 errorsin a length-n classical irreducible degree-t binary Goppa code, while the so far thebest known algorithm from Patterson [38] which can correct up to t errors. This

    attack reduces the binary work factor to break the original McEliece scheme with a(1024, 524) Goppa code and t = 50 to 260.55 bit operations. Table 2.1 summarizes theparameters suggested by [5] for specific security levels:

    Table 2.1.: Security of McEliece Depending on Parameters

    Security Level Parameters Size Kpub Size Ksec

    (n, k, t), errors added in KBits (G(z), P, S) in KBits

    Short-term (60 bit) (1024, 644, 38), 38 644 (0.38, 10, 405)

    Mid-term (80 bit) (2048, 1751, 27), 27 3, 502 (0.30, 22, 2994)

    Long-term (256 bit) (6624, 5129, 115), 117 33, 178 (1.47, 104, 25690)

    For keys limited to 216, 217, 218, 219, 220 bytes, the authors propose Goppa codes

    of lengths 1744, 2480, 3408, 4624, 6960 and degrees 35, 45, 67, 95, 119 respectively, with36, 46, 68, 97, 121 errors added by the sender. These codes achieve security levels84.88, 107.41, 147.94191.18, 266.94 against the attack described by the researchers.2.7. Side Channel AttacksThe susceptibility of the McEliece cryptosystem to side channel attacks has not exten-

    sively been studied, yet. This is probably due to the low number of practical systemsemploying the McEliece cryptosystem. However, embedded systems can always be

    subject to passive attacks such as timing analysis [30] and power/EM analysis [34].In [42], a successful timing attack on the Patterson algorithm was demonstrated. Theattack does not recover the key, but reveals the error vector z and hence allows for

    efficient decryption of the message c. This implementations is not be susceptible tothis attack due to unconditional instruction execution, e.g., the implementation will

    not terminate after a certain number of errors have been corrected.

  • 2.7 Side Channel Attacks 13

    For a key recovery attack on McEliece, the adversary needs to recover the secretsubstitution and permutation matrices S and P, and the Goppa code itself, either

    represented by G or by the Goppa polynomial g(z) and the support L. A feasibleSPA is possible to recover ĉ. For each bit set in ĉ, our implementations perform oneiteration of the euclidean algorithm (EEA). Assuming the execution of EEA to be

    visible in the power trace, ĉ can easily be recovered. The attacker is able to recoverthe whole whole permutation matrix P with less than 100 chosen ciphertexts. This

    powerful attack can easily be prevented by processing the bits of ĉ in a random order.Without recovering P, power attacks on inner operations of McEliece are aggravated

    due to an unknown input. Classical timing analysis, as described in [30] seemsmore realistic, as many parts of the algorithm such as EEA,Permutation and moduloreduction exhibit a data-dependent runtime. Yet, no effective timing attack has been

    reported so far, and simple countermeasures are available [31].

    An attacker knowing P could recover the generator polynomial G(z) and therebybreak the implementation [17]. As a countermeasure, we will randomize the execu-

    tion of step 1 of Algorithm 4 due to a possible SPA recovering ĉ, and consequentlyP−1. Differential EM/power attacks and timing attacks are impeded by the permuta-tion and scrambling operations (P and S) obfuscating all internal states, and finally,the large key size. Yet template-like attacks [12] might be feasible if no further pro-tection is applied.

  • 3. Introduction to Coding TheoryFor this section it is assumed that the reader is familiar with the theory of finite

    fields and algebraic coding theory. A good introduction to finite fields for engineerscan be found in [36]. The following definitions are from [45] and define the crucial

    building blocks for Goppa codes.3.1. Codes over Finite FieldsBecause the McEliece scheme is based on the topic of some problems in codingtheorie, a short introduction into this field based on [17] is given.

    Definition 3.1.1 An (n, k)-code C over a finite Field F is a k-dimensional subvectorspace ofthe vector space Fn. We call C an (n, k, d)-code if the minimum distance is d = minx,y∈Cdist(x, y),where dist denotes a distance function, e.g. Hamming distance. The distance of x ∈ Fn tothe null-vector wt(x) := dist(0, x) is called weight of x.

    Definition 3.1.2 The matrix G ∈ Fk×n is a generator matrix for the (n, k)-code C overF, if the rows of G span C over F. The matrix H ∈ F(n−k)×n is called parity check matrixfor the code C if HT is the right kernel of C. The code generated by H is called dual code ofC and denoted by CT.

    By multiplying a message with the generator matrix a codeword is formed, which

    contains redundant information. This information can be used to recover the orig-inal message, even if some errors occurred, for example during transmission of thecodeword over an radio channel.3.2. Goppa CodesIntroduced by V.D. Goppa in [21] Goppa codes are a generalization of BCH- and

    RS-codes. There exist good decoding algorithms e.g. [38] and [44].

    Definition 3.2.1 (Goppa polynomial, Syndrome, binary Goppa Codes). Let m and t be

    positive integers and let

    g(z) =t

    ∑i=0

    gizi ∈ F2m [z] (3.1)

  • 16 Chapter 3 - Introduction to Coding Theory

    be a monic polynomial of degree t called Goppa polynomial and

    ℒ = {α0, ⋅ ⋅ ⋅ , αn−1} αi ∈ F2m (3.2)

    a tuple of n distinct elements, called support, such that

    g(αj) ∕= 0, ∀0 ≤ j ≤ n. (3.3)For any vector c = (c0, ⋅ ⋅ ⋅ , cn−1) ∈ Fn, define the syndrome of c by

    Sc(z) = −n−1∑i=0

    cig(αi)

    g(z)− g(αi)z− αi

    mod g(z) (3.4)

    The binary Goppa code Γ(ℒ, g(z)) over F2 is the set of all c = (c0, ⋅ ⋅ ⋅ , cn−1) ∈ Fn2such that the indentity

    Sc(z) = 0 (3.5)holds in the polynomial ring F2m(z) or equivalent

    Sc(z) =n−1∑i=0

    ciz− αi

    ≡ 0 mod g(z) (3.6)

    If g(z) is irreducible over F2m then Γ(ℒ, g(z)) is called an irreducible binary Goppa code.3.3. Parity Check Matrix of Goppa CodesRecall equation (3.4). From there it follows that every bit of the received codewordis multiplied with

    g(z)− g(αi)g(αi) ⋅ (z− αi)

    (3.7)

    Describing the Goppa polynomial as gs ⋅ zs + gs−1 ⋅ zs−1 ⋅ . . . ⋅+g0 one can con-struct the parity check matrix H as

    H =

    gsg(α0)

    gsg(α1)

    ⋅ ⋅ ⋅ gsg(αn−1)

    gs−1+gs⋅α0g(α0)

    gs−1+gs⋅α0g(α1)

    ⋅ ⋅ ⋅ gs−1+gs⋅α0g(αn−1)

    .... . .

    ...g1+g2⋅α0+⋅⋅⋅+gs⋅αs−10

    g(α0)g1+g2⋅α0+⋅⋅⋅+gs⋅αs−10

    g(α1)⋅ ⋅ ⋅ g1+g2⋅α0+⋅⋅⋅+gs⋅α

    s−10

    g(αn−1)

    (3.8)

  • 3.4 Encoding 17

    This can be simplified to

    H =

    gs 0 ⋅ ⋅ ⋅ 0gs−1 gs ⋅ ⋅ ⋅ 0

    .... . .

    ...

    g1 g2 ⋅ ⋅ ⋅ gs

    1g(α0)

    1g(α1)

    ⋅ ⋅ ⋅ 1g(αn−1)

    α0g(α0)

    α1g(α1)

    ⋅ ⋅ ⋅ αn−1g(αn−1)

    .... . .

    ...α

    s−10

    g(α0)α

    s−11

    g(α1)⋅ ⋅ ⋅ α

    s−1n−1

    g(αn−1)

    (3.9)

    where the first part has a determinate unequal Zero and following the second partĤ is a equivalent parity check matrix, which has a simpler structure. By applyingthe Gaussian algorithm to the second matrix Ĥ one can bring it to systematic form

    (Ik ∣ H), where Ik is the k× k identity matrix. Note that whenever a column swap isperformed, actual also a swap of the elements in the support ℒ is performed.

    From the systematic parity check matrix (Ik ∣ H), now the systematic generatormatrix G can be derived as (In−k ∣ HT).3.4. EncodingTo encode a message m into a codeword c represent the message m as a binary

    string of length k and multiply it with the n× k matrix G. If G is in systematic form(Ik ∣ P), where Ik is the k × k identity matrix, one only have to multiply m withthe (n− k) × k matrix P and append the result to m. This trick can not be used inthe context of McEliece without special actions (see section 2.5) because the publicgenerator matrix Ĝ is generally not in systematic form, due to multiplication with

    two random matrices.3.5. DecodingHowever, decoding such a codeword r on the receiver’s side with a (possibly) ad-ditive error vector e is far more complex. For decoding, we use Patterson’s algo-rithm [38] with improvements from [43].

    Since r = c+ e ≡ e mod G(z) holds, the syndrome Syn(z) of a received codewordcan be obtained from Equation (3.6) by

    Syn(z) = ∑α∈GF(2m)

    rαz− α ≡ ∑

    α∈GF(2m)

    eαz− α mod G(z) (3.10)

  • 18 Chapter 3 - Introduction to Coding Theory

    To finally recover e, we need to solve the key equation σ(z) ⋅ Syn(z) ≡ ω(z)mod G(z), where σ(z) denotes a corresponding error-locator polynomial and ω(z)denotes an error-weight polynomial. Note that it can be shown that ω(z) = σ(z)′ isthe formal derivative of the error locator. By splitting σ(z) into even and odd poly-nomial parts σ(z) = a(z)2 + z ⋅ b(z)2, we finally determine the following equationwhich needs to be solved to determine error positions:

    Syn(z)(a(z)2 + z ⋅ b(z)2) ≡ b(z)2 mod G(z) (3.11)

    To solve Equation (3.11) for a given codeword r, the following steps have to beperformed:

    1. From the received codeword r compute the syndrome Syn(z) according toEquation (3.10). This can also be done using simple table-lookups.

    2. Compute an inverse polynomial T(z) with T(z) ⋅ Syn(z) ≡ 1 mod G(z) (orprovide a corresponding table). It follows that (T(z)+ z)b(z)2 ≡ a(z)2 mod G(z).

    3. There is a simple case if T(z) = z ⇒ a(z) = 0 s.t. b(z)2 ≡ z ⋅ b(z)2 ⋅ Syn(z)mod G(z) ⇒ 1 ≡ z ⋅ Syn(z) mod G(z) what directly leads to σ(z) = z.Contrary, if T(z) ∕= z, compute a square root R(z) for the given polynomialR(z)2 ≡ T(z) + z mod G(z). Based on a observation by Huber [24] we cancompute the square root R(z) by:

    R(z) = T0(z) + w(z) ⋅ T1(z) (3.12)

    where T0(z), T1(z) are odd and even part of T(z) + z satisfying T(z) + z =T0(z)

    2 + z ⋅ T1(z)2 and w(z)2 = z mod g(z) which can be precomputed forevery given code. We can then determine solutions a(z), b(z) satisfying

    a(z) = b(z) ⋅ R(z) mod G(z). (3.13)

    with a modified euclidean algorithm(see section 3.6). Finally, we use the identi-

    fied a(z), b(z) to construct the error-locator polynomial σ(z) = a(z)2 + z ⋅ b(z)2.4. The roots of σ(z) denote the positions of error bits(see section 3.7). If σ(αi) ≡ 0

    mod G(z) with αi being the corresponding bit of a generator in GF(211), there

    was an error in the position i in the received codeword that can be corrected

    by bit-flipping.

    This decoding process, as required in Step 2 of Algorithm 3 for message decryp-

    tion, is finally summarized in Algorithm 4:

  • 3.6 Solving the Key Equation 19

    Algorithm 4 Decoding Goppa Codes

    Input: Received codeword r with up to t errors, inverse generator matrix iG

    Output: Recovered message m̂1: Compute syndrome Syn(z) for codeword r2: T(z) ← Syn(z)−13: if T(z) = z then4: σ(z) ← z5: else

    6: R(z) ←√

    T(z) + z7: Compute a(z) and b(z) with a(z) ≡ b(z) ⋅ R(z) mod G(z)8: σ(z) ← a(z)2 + z ⋅ b(z)29: end if

    10: Determine roots of σ(z) and correct errors in r which results in r̂11: m̂← r̂ ⋅ iG {Map rcor to m̂}12: return m̂3.6. Solving the Key EquationTo solve this equation a(z) = b(z) ⋅ R(z) mod G(z) with the extended euclideanalgorithm observe the following: From σ(z) = a(z)2 + z ⋅ b(z)2 and deg(σ(z)) ≤deg(G(z)) it follows that deg(a(z)) ≤ deg(G(z))2 and deg(b(z)) ≤

    deg(G(z))−12 . During

    the iterations of the extended euclidean algorithm:

    ri−2(z) = 1 ⋅ R(z) + 0 ⋅ G(z)ri−1(z) = 0 ⋅ R(z) + 1 ⋅ G(z)

    ...

    rk = uk(z) ⋅ R(z) + vk(z) ⋅ G(z)...

    ggT(R(z), G(z)) = uL ⋅ R(z) + vL(z) ⋅ G(z)0 = uL+1(z) ⋅ R(z) + vL+1(z) ⋅ G(z)

    the following holds:

    ri = ri−2− ai ⋅ ri−1 where ai =ri−2ri−1

    (3.14)

  • 20 Chapter 3 - Introduction to Coding Theory

    and

    deg(ri) < deg(ri−1) (3.15)

    In addition:

    deg(ui(z)) + deg(ri−1(z)) = deg(G(z)) (3.16)

    It follows that deg(r(z)) is constantly decreasing after starting with deg(G(z)) anddeg(u(z)) increases after starting with zero. Using this, one can see that there is aunique point in the computation of EEA where both polynomials are just below their

    bounds. The EEA can be stopped when for the first time deg(r(z)) <deg(G(z))

    2 and

    deg(u(z)) <deg(G(z)−1)

    2 . These results are the needed polynomials a(z), b(z). Froma(z) and b(z) construct σ(z) = a(z)2 + z ⋅ b(z)2.3.7. Extracting Roots of the Error LocatorPolynomialThe roots of σ(z) indicate the error positions. If σ(αi) ≡ 0 mod G(z)1 there was anerror ei = 1 at ci in the received bit string, which can be corrected by just flippingthe bit. The roots can be found by evaluating σ(ai) for every i. A more sophisticatedmethod is the Chien search [13]. To get all roots αi of σ(z) the following holds for allelements except the zero element:

    σ(αi) = σs(αi)

    s+σs−1(α

    i)s−1

    + . . . +σ1(αi) + σ0

    ∼= λs,i +λs−1,i + . . . +λ1,i + λ0,iσ(αi+1) = σs(α

    i+1)s

    +σs−1(αi+1)

    s−1+ . . . +σ1(α

    i+1) + σ0

    = σs(αi)

    s +σs−1(αi)

    s−1α

    s−1 + . . . +σ1(αi)α + σ0

    = λs,iαs +λs−1,iα

    s−1 + . . . +λ1,iα + λ0,i∼= λs,i+1 +λs−1,i+1 + . . . +λ1,i+1 + λ0,i+1

    In other words, one may define σ(αi) as the sum of a set {λj,i∣0 ≤ j ≤ s}, fromwhich the next set of coefficients may be derived thus:

    λj,i+1 = λj,iαj (3.17)

    1Note that α was a generator of GF(211)

  • 3.7 Extracting Roots of the Error Locator Polynomial 21

    Start at i = 0 with λj,0 = σj and iterate through every value of i up to s. If at anyiteration the sum:

    t

    ∑j=0

    λj,i = 0 (3.18)

    then σ(αi) = 0 and αi is a root. This method is more efficient than the brute forcelike method mentioned before.

    If the generator matrix G is in standard form, just take the first k bits out of the

    n-bit codeword c and retrieve the original message m. To get a generator matrixin this form, reorder the elements in the support of the code until G is in standardform. Note that in the computation of the syndrome and the error correction, the

    i-th element in C and σ(z), respectively, corresponds with the i-th element in thesupport and not with ai. If a matrix not in standard form is used, a mapping matrix

    which selects the right k bits from c, has to be found. This matrix iG must be of theform G ⋅ iG = IDk, where IDk is the k× k identity matrix. To get iG do the following:

    Algorithm 5 Getting a Mapping Matrix from Codewords to Messages

    Input: G

    1: Select randomly k columns of G

    2: Test if the resulting k× k matrix sub is invertible3: If not goto Step 1

    4: Else insert the rows of sub−1 into a k × n matrix at the positions where the krandom columns come from and fill the remaining rows with zeros.

    5: return iG

    For the Magma algorithm computing iG see Appendix B. A multiplication ofa valid codeword c with this matrix iG computes the corresponding message mbecause:

    c ⋅ iG = (m ⋅ G) ⋅ iG = m ⋅ IDk = m (3.19)

    Now all prerequisites to start with implementing the McEliece public key schemeare given.

  • 4. Recon�gurable HardwareThis chapter introduces FPGAs and presents the necessary additional hardware.

    Furthermore, field arithmetic in hardware is introduced.4.1. Introducing FPGAsFPGA stands for Field Programmable Gate Arrays. It consists of a large amount ofLook Up Tables that can generate any logic combination with four inputs and one

    output and basic storage elements based on Flip Flops.

    Figure 4.1.: 4-Input LUT with FF

    Two LUT’s and two FF’s are packed together into an slice and four slices into anConfigurable Logic Block. Between the CLBs is a programmable switch matrix thatcan connect the input and outputs of the CLB’s. How LUTs and CLBs are configured

    is defined in a vendor specific binary file, the bitstream. Most modern FPGAs alsocontain dedicated hardware like multiplier, clock manager, and configurable block

    RAM.

    CLB

    Blo

    ck R

    AM

    Multip

    lier

    DCM

    IOBs

    IOBs

    IOB

    s

    IOB

    s

    DCM

    Blo

    ck R

    AM

    / M

    ultip

    lier

    DCM

    CLBs

    IOBs

    OBs

    DCM

    Figure 4.2.: Simplified Overview over an FPGA [27]

    The process of building a design for a FPGA consist of several steps which are

    depicted in Figure 4.3. After writing the VHDL code in an editor, it is translated

  • 24 Chapter 4 - Reconfigurable Hardware

    Figure 4.3.: VHDL Designflow[28]

    to a net list. This process is called synthesis and for this implementation the toolsXST and Synplify are chosen from the hundred available. Based on the net list the

    correct behavior of the design can be verified by using a simulation tool, which isin our case Modelsim. This both steps are completely hardware independent. The

    next step is mapping and translating the net list into logic resources and specialresources offered by the target platform. Due to this hardware dependency, thoseand the following steps need to know the exact target hardware. The final step

    place-and-route (PAR) then tries to find an optimum placement for the single logicblocks and connects them over the switching matrix. The output of PAR can now be

    converted into a bitstream file and loaded into a flash memory on the FPGA board.The FPGA contains a logic block that can read this flash memory and can configurethe FPGA accordingly. On most FPGA boards this memory is located outside the

    FPGA chip and can therefore be accessed by anyone. To protect the content of thebitstream, which may include intellectual property (IP) cores or, like in our case,

    secret key material, the bitstream can be stored encrypted. The FPGA boot-up logic

  • 4.1 Introducing FPGAs 25

    then has to decrypt the bitstream before configuring the FPGA. Some special FPGAs,for example the Spartan3-AN series, contain large on-die flash memory, which can

    only be accessed by opening the chip physically. For the decryption algorithm thebitstream file has to be protected by one of the two methods mentioned above.4.1.1. Interfacing the FPGAAside from the algorithmic part of the design, a way has to be found to get data into

    the FPGA, and after computation, to read the data back. For this implementation weremain at a standard UART-interface even though interfaces with higher bandwidth

    (Ethernet,USB,PCI-Express) are possible. The used UART is derived from an existingsource for the PicoBlaze published in Xilinx application note 223 [11]. A wrappercomponent was developed, that provide all necessary ports: Input en_16_x_baud

    uart

    C_uart

    rst

    rx_in

    clk

    tx_out

    uart_tx

    C_uart_tx

    uart_rx

    C_uart_rx

    write_buffer

    reset_buffer

    en_16_x_baud

    clk

    [7:0]data_in[7:0]

    serial_out

    buffer_full

    buffer_half_full

    serial_in

    read_buffer

    reset_buffer

    en_16_x_baud

    clk

    buffer_data_present

    buffer_full

    buffer_half_full

    [7:0]data_out[7:0]

    clk

    reset_rx_buffer

    rx_in

    tx_out

    Figure 4.4.: The UART component

    should be pulsed HIGH for one clock cycle duration only and at a rate 16 times (orapproximately 16 times due to oversampling) faster the rate at which the serial datatransmission takes place. The receive and transmit component each contain a 16

    byte buffer. Transmission is started as soon as a byte is written to the buffer. Writingto the buffer if it is full has no effect. When the receive buffer is full, all subsequent

    write attempts from the host system are ignored, which will cause data loss.

  • 26 Chapter 4 - Reconfigurable Hardware4.1.2. Bu�ers in Dedicated HardwareFor some data a structure is required, that hold some values for later processing.For example ciphertext, plaintext, bytes send and received from UART and alsoprecomputed values have to be stored inside the FPGA. Instead of building these

    storage from a huge amount of registers, both target FPGA for this implementationprovide dedicated RAM. This block RAM (BRAM) is organized in blocks of 18Kbit

    and can be accessed with the maximum frequency of the FPGA. Each of the BRAMsis true dual ported, allowing independent read and write access at the same time.With a Xilinx vendor tool CoreGen each block can be configured as RAM or ROM

    in single or dual port configuration. Also one can select the required width anddepth of the BRAM. CoreGen then generates a VHDL instantiation template that can

    be used in the VHDL code. Every type of BRAM can be initialized with predefinedcontent via the CoreGen wizard. It uses a .coe file which can contain the memorycontent in binary, decimal or hex. This is used transfer the precomputed tables and

    constant polynomial w into the FPGA. Goppa polynomial G is not stored in a ROM,because we need to access it in its whole width at once.This polynomial is stored as

    a large register. Because it is constant and only needed in the extended euclideanalgorithm it will be resolved to fixed connections to VCC and GND by the synthesistool. The Spartan3-200 incorporates 12 BRAMs and the Spartan3-2000 incorporates

    40 BRAMs allowing up to 216 or 720 Kbits of data to be stored, respectively.4.1.3. Secure StorageAll data that is required to configure the FPGA and also constant values from VHDLcode, like ROM and RAM initialization values, has to be stored on the FPGA board.

    In normal circumstances the target platform contains a flash memory chip whichholds the bitstream file. During boot-up, the FPGA reads this file and configure

    himself and initialize the BRAM cells accordingly. But this flash has a standardinterface and can therefore be read by anyone. To protect the private key data, thereexist two ways. The first way, which is only possible at newer FPGAs [47], is to store

    the bitstream file encrypted. The FPGA contains a hardware decryption moduleand a user defined secret key. During boot-up, the bitstream file is decrypted inside

    the FPGA and then the normal configuration takes place. Our target FPGAs arenot capable of decrypting bitstream files. But the Spartan3-AN family of FPGAscontain a large on-die flash memory. Assuming that physical opening the chip is

    hard, this memory can be accepted as secure on chip storage. If the designer doesnot connect the internal flash to the outside world, then the flash cannot be read

    from the I/O pins. Additionally, there exist some security features which can be

  • 4.2 Field Arithmetic in Hardware 27

    Table 4.1.: Bitstream Generator Security Level Settings

    Security Level Description

    None Default. Unrestricted access to all configuration andReadback functions.

    Level 1 Disable all Readback functions from both the SelectMAPor JTAG ports (external pins). Readback via the ICAPallowed.

    Level 2 Disable all Readback operations on all ports.

    Level 3 Disable all configuration and Readback functions from allconfiguration and JTAG ports. The only command (in

    terms of Readback and configuration) that can be issuedand executed in Level3 is REBOOT. This erases the con-

    figuration of the device. This has the same function asenabling the PROG_B pin on the device, except it is donefrom within the device.

    configured during bitstream generation. Table 4.1 summarizes the different securitylevels provides by Spartan3 FPGAs [26].4.2. Field Arithmetic in HardwareAnalyzing McEliece encryption and decryption algorithms (cf. Section 3.2), the fol-

    lowing arithmetic components are required supporting computations in GF(2m): amultiplier, a squaring unit, calculation of square roots, and an inverter. Furthermore,a binary matrix multiplier for encryption and a permutation element for step 2 in

    Algorithm 2 are required. Many arithmetic operations in McEliece can be replacedby table lookups to significantly accelerate computations at the cost of additional

    memory. Our primary goal is area and memory efficiency to fit the large keys andrequired lookup-tables into the limited on-chip memories of our embedded targetplatform.

    Arithmetic operations in the underlying field GF(211) can be performed efficientlywith a combination of polynomial and exponential representation. In registers, we

    store the coefficients of a value a ∈ GF(211) using a polynomial basis with naturalorder. Given an a = a10α

    10 + a9α9 + a8α

    8 + ⋅ ⋅ ⋅+ a0α0, the coefficient ai ∈ GF(2) isdetermined by bit i of an 11 bit standard logic vector where bit 0 denotes the least sig-

  • 28 Chapter 4 - Reconfigurable Hardware

    nificant bit. In this representation, addition is fast just by performing an exclusive-oroperation on two 11 bit standard logic vectors. For more complex operations, such

    as multiplication, squaring, inversion and root extraction, an exponential represen-tation is more suitable. Since every element in GF(211) can be written as a power ofsome primitive element α, all elements in the finite field can also be represented by

    αi with i ∈ Z2m−1. Multiplication and squaring can then be performed by adding

    the exponents of the factors over Z2m−1 such as

    c = a ⋅ b = αi ⋅ αj = αi+j ∣ a, b ∈ GF(211), 0 ≤ i, j ≤ 2m − 2. (4.1)

    The inverse of a value d ∈ GF(211) in exponential representation d = αi can beobtained from a single subtraction in the exponent d−1 = α2

    11−1−i with a subsequenttable-lookup. Root extraction, i.e., given a value a = αi to determine r = ai/2 issimple, when i is even and can be performed by a simple right shift on index i.

    For odd values of i, perform m − 1 = 10 left shifts followed each by a reductionwith 211 − 1. To allow for efficient conversion between the two representations, weemploy two precomputed tables (so called log and antilog tables) that allow fastconversion between polynomial and exponential representation. Each table consistsof 2048 11-bit values that are stored in two of the Block Rams Cells (BRAM) in the

    FPGA. For multiplication, squaring, inversion, and root extraction the operands aretransformed on-the-fly to exponential representation and reverted to the polynomial

    basis after finishing the operation. To reduce routing delay every arithmetic unit canaccess their own LUT that is placed closed to the logic block.

  • 5. Designing forArea-Time-E�ciencyMost components of an algorithm can be implemented in a way to finish as fast aspossible but then they need a lot of logic resources. On the other hand a componentcan be implemented resource efficient, but most likely it will then consume more

    time to complete the task. Table 5.1 gives an estimate about how often specific partsof the McEliece algorithms will be executed on average during one encryption or

    decryption with the parameters m = 11, t = 27.

    From this table one can see easily that poly_EEA is the most important, time critical

    and largest component in the whole design and thus is worth a few more details.5.1. Extended Euclidean AlgorithmIn algorithm 6 the extended euclidean algorithm adopted to polynomials overGF(2m)is summarized again.

    Algorithm 6 Extended Euclid over GF(211) with Stop Value

    Input: G(z) irreducible ∈ F211 [z], x ∈ F211 [z] with degree x < GOutput: x−1 mod G(z)

    1: A← G(z), B ← x(z), v ← 1, u← 02: while degree(A) ≥ stop do3: q← lc(A)

    lc(B){ lc(X) is leading coefficient of X}

    4: A← A− q ⋅ zk ⋅ B5: u← u + q ⋅ zk ⋅ v6: if degree(A) ≤ degree(B) then7: A← B B← A { swap A and B}8: u← v v← u { swap u and v}9: end if

    10: end while

    11: return (u(z)A0 )

    We identified the following components:

  • 30 Chapter 5 - Designing for Area-Time-Efficiency

    Table 5.1.: Execution Count for Crucial Parts.

    Part CountE

    ncr

    ypti

    on

    UART receive 219

    UART send 256

    (1751× 2048)Matrix MUL 1(8× 8)Submatrix MUL 56064

    Error Distribution 2

    PRNG* 7

    Dec

    rypt

    ion

    UART receive 256

    UART send 219

    Permutation 1

    Polynomial MUL 1

    Polynomial SQRT 1

    Polynomial SQR 2

    EEA+ 1024 + 2

    -EEA.getDegree 2 ⋅ 27 ⋅ (1024 + 2) ≈ 55, 500-EEA.shiftPoly 2 ⋅ 27 ⋅ (1024 + 2) ≈ 55, 500-EEA.MulCoeff 2 ⋅ 27 ⋅ 27 ⋅ (1024 + 2) ≈ 1, 500, 000-EEA.DivCoeff 2 ⋅ 27 ⋅ (1024 + 2) ≈ 55, 500

    (2048× 2048)Matrix MUL 1(8× 8)Submatrix MUL 65, 536PRNG- 65, 536

    * One PRNG run generates 4 error positions.+ Assuming that half of the ciphertext bits are one.- One PRNG run generates one (8× 8)Submatrix.

    1. get_degree: a component that determine the degree and the leading coefficient(lc) of a polynomial

    2. gf_div: a component that divide two field elements

    3. gf_mul: a component that multiply two field elements

  • 5.1 Extended Euclidean Algorithm 31

    4. poly_mul_shift_add: a component that computes X = X + q ⋅ zk ⋅ YWe decide to implement the extended euclidean algorithm in a full parallel way. Inother words, we are trying to compute nearly every intermediate result in one clock

    cycle, which means that the design is working on a complete polynomial at once.Only in places that occur rarely during a decoding run, we compute the results in aserially, coefficient-wise way to save resources.

    If all computations would be done in a serially, coefficient-wise way, this would

    result in about 30 times more clock cycles. Although a serial design is smaller andshould achieve a higher clock frequency, it can not reach 30 times the frequency

    of the parallel design. A Spartan3 FPGA can run at only 300 MHz and thus theadvantage of parallelism can not be outperformed when the parallel design operatesat least with 10 MHz.5.1.1. Multiplication ComponentDue to the large number of required multiplications we choose the fastest possi-ble design for this operation. Instead of performing the multiplication with the table

    look-up method mentioned in Section 4.2 or implementing school-book or Karazuba-like methods, we completely unroll the multiplication to a hardwired tree of XORs

    and ANDs. This tree is derived from the formal multiplication of two field elementsand includes modulo reduction mod p(α) = α11 + α2 + 1, which is a defining poly-nomial for GF(211). Using this a complete multiplication is finished in one clockcycles. See Appendix C for the complete multiplication code.

    After optimization by the synthesis tool the multiplier consists of an 11 bit registerfor the result and a number of XORs as shown in Table A.2 in Appendix A. Overall

    one multiplier consumes 89 four-input LUT’s, but computes the product in one clockcycle.5.1.2. Component for Degree ExtractionTo allow get_degree to complete as fast as possible, first every coefficient of the input

    polynomial is checked if it is Zero as shown in Listing 5.1.2.

    e n t i t y gf_compare i s PORT(c o e f f _ i n : in STD_LOGIC_VECTOR(MCE_M−1 downto 0 ) ;equal : out s t d _ l o g i c

    ) ;end gf_compare ;

  • 32 Chapter 5 - Designing for Area-Time-Efficiency

    a r c h i t e c t u r e s t r u c t u r a l of gf_compare i sbegin

    equal poly_in ( I ) ,equal=> i n t e r n a l _ d e g r e e ( I ) ) ;

    end generate gen_comp ;

    Listing 5.2: Instantiation of gf_compare

    These components are unclocked and consist only of combinatorial logic. When thestart signal is driven high from this vector, degree and lc are derived in one clock

    cycle.

    i f i n t e r n a l _ d e g r e e ( 2 6 ) = ’1 ’ thendegree

  • 5.1 Extended Euclidean Algorithm 33

5.1.3. Division Component

The division component gf_div converts both operands into their exponent representation via the poly2exp table; the exponents are then subtracted, yielding exp_diff. This difference may become negative. To avoid an extra clock cycle for adding the modulus to bring the difference back into the positive range, we extend the exp2poly table to negative numbers as exp2polymod. Negative numbers are represented in two's-complement form in hardware, so addresses 0 to 2047 contain the standard exponent-to-polynomial mapping. Addresses 2048 and 2049 correspond, due to the two's-complement form, to the exponents −2048 and −2047, which cannot occur because the maximum exponent range is 0…2046; these addresses are filled with a dummy value. The rest of the address space then contains the same values in the same order as at the beginning of the table. In this way the modulo reduction is avoided, saving about 55,600 clock cycles overall. Note that the TLU method is chosen because this design finishes in 3 clock cycles, whereas a dedicated arithmetic unit computing the division would need at least 11 clock cycles, assuming one clock cycle per bit operation. Figure 5.1 shows the complete division component. The look-up tables and the subtractor are highlighted in red. The surrounding signals form the controlling state machine.

Figure 5.1.: The Complete Coefficient Divider
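The address arithmetic of this trick can be summarized in the following behavioral sketch, which regenerates the extended antilog table at elaboration time from p(x) = x^11 + x^2 + 1. In the actual design, poly2exp and exp2polymod are block-RAM look-up tables and the conversion of the operands into exponent form is a separate table access, so only the subtraction and the table addressing are shown here; the entity and signal names are illustrative.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity gf_div_sketch is
      port ( exp_a, exp_b : in  unsigned(10 downto 0);            -- exponents of the operands
             quotient     : out std_logic_vector(10 downto 0) );  -- a/b in polynomial form
    end gf_div_sketch;

    architecture behavioral of gf_div_sketch is
      type rom_t is array (0 to 4095) of std_logic_vector(10 downto 0);

      -- antilog table duplicated into the two's-complement "negative" addresses
      function init_exp2polymod return rom_t is
        variable rom : rom_t := (others => (others => '0'));      -- unreachable entries stay dummy
        variable v   : unsigned(10 downto 0) := to_unsigned(1, 11);
      begin
        for i in 0 to 2046 loop
          rom(i)        := std_logic_vector(v);                   -- address i          : alpha^i
          rom(2049 + i) := std_logic_vector(v);                   -- address (i - 2047) : alpha^i as well
          if v(10) = '1' then                                     -- v := v * alpha mod (x^11 + x^2 + 1)
            v := (v(9 downto 0) & '0') xor to_unsigned(5, 11);
          else
            v := v(9 downto 0) & '0';
          end if;
        end loop;
        return rom;
      end function;

      constant exp2polymod : rom_t := init_exp2polymod;
    begin
      process (exp_a, exp_b)
        variable diff : integer range -2047 to 2047;
      begin
        diff := to_integer(exp_a) - to_integer(exp_b);
        -- a negative difference, read as a 12-bit two's-complement address, falls into
        -- the duplicated upper half of the table, so no +2047 correction is needed
        quotient <= exp2polymod(to_integer(unsigned(to_signed(diff, 12))));
      end process;
    end behavioral;

Handling of zero operands (whose logarithm is undefined) is assumed to happen outside this sketch.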

5.1.4. Component for Polynomial Multiplication and Shifting

The poly_mul_shift_add component is also designed with priority on high performance. Therefore, the multiplication of the input polynomial with the coefficient is completely unrolled. We use 27 multipliers in parallel, each consisting of a tree of ANDs and XORs. Each of these multiplier blocks is connected to the fixed coefficient q and one of the polynomial coefficients, as shown in Figure 5.2.


Figure 5.2.: Overview of the Polynomial Multiplier

After determining the shift value k = deg(A) − deg(B) as the difference of the degrees of the two polynomials A(z) and B(z), the shift by k coefficients is accomplished by a large 297-bit wide 27-to-1 multiplexer to allow minimum runtime.

with k select
    poly_out <= ... "00000000000") when others;

Listing 5.4: k-width Shifter

To allow maximum throughput, we decided to instantiate the poly_mul_shift_add component twice, allowing lines 4 and 5 of Algorithm 6 to run in parallel. For the same reason, the component get_degree is instantiated twice, as required for line 3 of Algorithm 6.
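A behavioral sketch of the operation performed by poly_mul_shift_add, reusing gf_mul and PolyArray_t from the sketch in Section 5.1.1, is given below; the real component additionally contains the start/valid handshake and the 27-to-1 shift multiplexer described above, and the function name is illustrative.

    library ieee;
    use ieee.std_logic_1164.all;
    use work.gf_sketch_pkg.all;

    package poly_mul_shift_add_sketch_pkg is
      -- X + q * z^k * Y over GF(2^11)[z], all polynomials of degree <= 26
      function poly_mul_shift_add(x, y : PolyArray_t;
                                  q    : std_logic_vector(10 downto 0);
                                  k    : natural) return PolyArray_t;
    end package poly_mul_shift_add_sketch_pkg;

    package body poly_mul_shift_add_sketch_pkg is
      function poly_mul_shift_add(x, y : PolyArray_t;
                                  q    : std_logic_vector(10 downto 0);
                                  k    : natural) return PolyArray_t is
        variable r : PolyArray_t := x;
      begin
        for i in 0 to 26 loop
          if i + k <= 26 then
            -- addition in GF(2^11) is a bit-wise XOR of the coefficients
            r(i + k) := x(i + k) xor gf_mul(q, y(i));
          end if;
        end loop;
        return r;
      end function poly_mul_shift_add;
    end package body poly_mul_shift_add_sketch_pkg;

With x = u(z), y = v(z) and q the quotient of the leading coefficients, this corresponds to Equation (5.1) below.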

The final normalization is also handled by the poly_mul_shift_add component. This works because in normal mode poly_mul_shift_add computes

u(z) = u(z) + q ⋅ z^k ⋅ v(z)    (5.1)

Remember that q is the result of the division of the leading coefficients of two polynomials. By setting the first input to 0, q = 1/A0, k = 0 and v(z) = u(z), this equation results in

u(z) = 0 + (1/A0) ⋅ z^0 ⋅ u(z) = u(z)/A0    (5.2)


which is the required normalization. This method avoids the implementation of an extra normalization component and saves slices without increasing the required clock cycles, because this step can in any case be computed after the last computation that involves poly_mul_shift_add.

5.2. Computation of Polynomial Square Roots

The next required computation, according to Algorithm 4 line 6, is taking the square root of T(z) + z. As mentioned in Section 4.2, instead of using a matrix to compute the square roots of the coefficients, we choose the method proposed by K. Huber [24]. For this and a later purpose, we need a component that can split polynomials into their odd and even parts so that T(z) + z = T0(z)^2 + z ⋅ T1(z)^2. Splitting a polynomial means taking the square root of all coefficients and assigning the square roots from odd positions to one result polynomial and those from even positions to the other: all coefficients that originate from even positions form the even part and vice versa. Taking square roots of coefficients occurs only in this component and nowhere else in the whole algorithm. Therefore, we choose a method that does not consume many slices and use the already introduced TLU method.

5.3. Computation of Polynomial Squares

The last required field operation is the computation of a polynomial square, in Algorithm 4 at line eight. Like the square root computation, this operation is required only once during one decryption. Due to this low utilization, we decided to implement it in a space-saving way. The square is computed coefficient by coefficient, thus requiring only a single squaring unit. This squaring unit is simply the unrolled multiplier from Section 5.1.1 with both inputs wired together.
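The operations of Sections 5.2 and 5.3 can be summarized in the following behavioral sketch, again reusing gf_mul and PolyArray_t from the sketch in Section 5.1.1. It computes the coefficient square root by repeated squaring (a^(2^10) in GF(2^11)) instead of the table look-up chosen in the thesis, and it assumes input polynomials of degree at most 26 for the split and at most 13 for the squaring; all names are illustrative.

    library ieee;
    use ieee.std_logic_1164.all;
    use work.gf_sketch_pkg.all;

    package poly_sqrt_sketch_pkg is
      function gf_sqrt(a : std_logic_vector(10 downto 0)) return std_logic_vector;
      procedure poly_split(poly : in  PolyArray_t;
                           t0   : out PolyArray_t;
                           t1   : out PolyArray_t);
      function poly_square(a : PolyArray_t) return PolyArray_t;
    end package poly_sqrt_sketch_pkg;

    package body poly_sqrt_sketch_pkg is

      -- square root in GF(2^11): a^(2^10), i.e. square the element ten times
      function gf_sqrt(a : std_logic_vector(10 downto 0)) return std_logic_vector is
        variable r : std_logic_vector(10 downto 0) := a;
      begin
        for i in 1 to 10 loop
          r := gf_mul(r, r);
        end loop;
        return r;
      end function gf_sqrt;

      -- split T(z) so that T(z) = T0(z)^2 + z * T1(z)^2: square roots of the
      -- even-position coefficients go to T0, those of the odd positions to T1
      procedure poly_split(poly : in  PolyArray_t;
                           t0   : out PolyArray_t;
                           t1   : out PolyArray_t) is
      begin
        t0 := (others => (others => '0'));
        t1 := (others => (others => '0'));
        for i in 0 to 26 loop
          if i mod 2 = 0 then
            t0(i / 2) := gf_sqrt(poly(i));
          else
            t1(i / 2) := gf_sqrt(poly(i));
          end if;
        end loop;
      end procedure poly_split;

      -- coefficient-wise squaring: (sum a_i z^i)^2 = sum a_i^2 z^(2i);
      -- assumes deg(a) <= 13 so that the result still fits into 27 coefficients
      function poly_square(a : PolyArray_t) return PolyArray_t is
        variable r : PolyArray_t := (others => (others => '0'));
      begin
        for i in 0 to 13 loop
          r(2 * i) := gf_mul(a(i), a(i));
        end loop;
        return r;
      end function poly_square;

    end package body poly_sqrt_sketch_pkg;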

6. Implementation

Based on the decisions made in Chapter 4, this chapter discusses the implementation of encryption and decryption.

6.1. Encryption

In this section, the implementation of the matrix multiplication and the generation of the random error plus its distribution among the ciphertext are presented. Remember that encryption is simply c = m ⋅ Ĝ + e. But first, the target platform is presented.

Because of the lower logic requirements for encryption, and since the encryption key is public and does not require confidential storage, the encryption routine is implemented on a low-cost Spartan-3 FPGA, namely an XC3S200. This device is part of a Spartan-3 development board manufactured by Digilent [14]. Aside from the FPGA, this board additionally provides:

    1. On-board 2Mbit Platform Flash (XCF02S)

    2. 8 slide switches, 4 pushbuttons, 9 LEDs, and 4-digit seven-segment display

    3. Serial port, VGA port, and PS/2 mouse/keyboard port

    4. Three 40-pin expansion connectors

    5. Three high-current voltage regulators (3.3V, 2.5V, and 1.2V)

    6. 1Mbyte on-board 10ns SRAM (256Kb x 32)

Figure 6.2 shows a summary of the design. The individual blocks are explained in the following.

Figure 6.1.: Spartan3-200 Development Board ((a) Board, (b) Block Overview)

Figure 6.2.: Block Overview for Encryption

The implementation consists of two parts, which can be selected via two of the slide switches. The first part (selected with sw1 = 1) reads the public key from the UART and stores it in the external SRAM. The second part (selected with sw0 = 1) reads the message from the UART and encrypts it with the public key.

6.1.1. Reading the Public Key

The top-level component, called toplevel_encrypt, controls the UART interface, reads the external switches and writes data to the SRAM. This component also controls the actual encryption component mce_encrypt. When a reset occurs, all buffers are

cleared, the SRAM is disabled by driving SRAMce1, SRAMce2 and SRAMoe high, and the state machine is set to the idle state. The moment the UART indicates a received byte, the FSM switches to the read_in state, the SRAM is enabled and its address bus is set to all zero. Now, a byte is read into the lower 8 bits of a 32-bit register and the register is shifted up by 8 bits. After reading 4 bytes, a 32-bit word is complete and is written to the SRAM. Afterwards, the address counter is incremented. This


procedure is repeated until all 448,512 bytes of Ĝ are written to the SRAM. Then the FSM returns to the idle state, which is indicated by driving an LED on the board. Now another public key could be read or, more usefully, the mode can be switched to encryption by setting sw1 to 0 and sw0 to 1.
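A minimal behavioral sketch of this byte-to-word assembly is shown below; the port and signal names are illustrative, and the SRAM write and address increment are reduced to a single word_valid pulse.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    entity word_assembly_sketch is
      port ( clk        : in  std_logic;
             rst        : in  std_logic;
             byte_ready : in  std_logic;                      -- one UART byte received
             uart_byte  : in  std_logic_vector(7 downto 0);
             word_out   : out std_logic_vector(31 downto 0);  -- assembled 32-bit SRAM word
             word_valid : out std_logic );                    -- pulses when four bytes are in
    end word_assembly_sketch;

    architecture behavioral of word_assembly_sketch is
      signal word_reg : std_logic_vector(31 downto 0) := (others => '0');
      signal byte_cnt : unsigned(1 downto 0) := (others => '0');
    begin
      word_out <= word_reg;

      process (clk)
      begin
        if rising_edge(clk) then
          word_valid <= '0';
          if rst = '1' then
            byte_cnt <= (others => '0');
          elsif byte_ready = '1' then
            -- shift the register up by 8 bits and place the new byte in the low bits
            word_reg <= word_reg(23 downto 0) & uart_byte;
            byte_cnt <= byte_cnt + 1;       -- wraps from 3 back to 0
            if byte_cnt = 3 then            -- the fourth byte completes a word
              word_valid <= '1';
            end if;
          end if;
        end if;
      end process;
    end behavioral;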

To send data from the PC to the FPGA, the open-source terminal program Hterm [23] is used. Hterm can send bytes to the FPGA either from the keyboard or from a file. The structure of the file containing the data is simply one byte in hexadecimal form (without a leading 0x prefix) per line.

During development, debug methods were integrated that are also present in the final design to allow verification of the current state and the behavior of some important signals. The development board contains four seven-segment displays which share a common data bus with signals (a, b, c, d, e, f, g, DP) and are individually selected by four separate anode control lines. The function of the signals can be seen in Figure 6.3.

    Figure 6.3.: 7 Segment Display

The common data bus is connected to a component that converts a four-bit input character to the bit pattern necessary to display the character in hexadecimal form. The conversion procedure is taken from the reference manual of the board and is given in Appendix A.1.

Figure 6.4.: Seven Segment Driver Component

The decimal dot (data bus pin DP) in the display can be


misused to display data larger than 16 bits. For example, the SRAM address is 18 bits wide, and therefore the two leading bits are displayed as dots in two of the segments. Four of these components are connected in parallel to a signal called disp, which itself can be connected to various internal signals. Which signal is displayed can be selected using the sliding switches sw2 to sw4. Table 6.1 shows the available debug modes and how they are selected. Switch sw5 controls whether the status bits of the UART or the control bits for the SRAM are displayed on the seven LEDs led0 to led6. Signal led7 is reserved and hardwired to indicate completion of transferring Ĝ to the SRAM.

Table 6.1.: Function of the Debug Switches

Signal on Seven Segment Display       SW2   SW3   SW4
SRAM address in encrypt state          1     0     x
SRAM address in public key mode        0     0     x
SRAM data low word                     x     0     0
SRAM data high word                    x     1     0

6.1.2. Encrypting a Message

When the encryption mode is entered by setting sw0 = 1, all buffers are cleared, the SRAM is disabled and the state machine is set to the idle state. Now every received

byte from the UART is directly fed through to the mce_encrypt component, which is depicted in Figure 6.5. This component contains the matrix multiplier and the PRNG plus the controlling state machine.

Figure 6.5.: Interface of the mce_encrypt Component


First, every byte is read into a buffer in BRAM. Because this buffer is implemented as a BRAM of 18 Kbit, of which only 1752 bits are needed, we decided to additionally put the 2048-bit cipher buffer into the same BRAM block. The interface of this combined buffer is shown in Figure 6.6.

Figure 6.6.: Buffer for Plaintext and Ciphertext

After all input bytes are buffered, the FSM changes to the multiplication state. First, the row and column counters (RowCnt, ColCnt) and the SRAM address are set to zero. Note that RowCnt and ColCnt actually count blocks of (8 × 8) submatrices instead of single rows and columns. Due to the (1751 × 2048) dimension of the matrix, this leads to a range of 0 . . . 218 for RowCnt and 0 . . . 255 for ColCnt. The partial products from the submatrix multiplications are summed up in a register c_work, which will contain the resulting ciphertext when the multiplication is finished. First, the last temporary ciphertext byte c_work is read from BRAM. In the next step, the plaintext byte for this row is read out of BRAM and stored in a register m_work and, additionally, the 64-bit submatrix of Ĝ is read from SRAM into G_work prior to incrementing

    the SRAM address. The partial product of this block is computed with:

gen_part_prod : for i in 0 to 7 loop
    part_prod(i) ...
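A possible completion of this computation, written as a self-contained function, is sketched below. It assumes that the 64-bit word G_work stores the (8 × 8) submatrix row by row, with row j (belonging to plaintext bit j of this block) occupying bits 8j+7 downto 8j; the actual bit ordering in the implementation may differ.

    library ieee;
    use ieee.std_logic_1164.all;

    package part_prod_sketch_pkg is
      -- partial product of an 8-bit plaintext chunk with one (8 x 8) submatrix of G^
      function part_prod(m_work : std_logic_vector(7 downto 0);
                         g_work : std_logic_vector(63 downto 0)) return std_logic_vector;
    end package part_prod_sketch_pkg;

    package body part_prod_sketch_pkg is
      function part_prod(m_work : std_logic_vector(7 downto 0);
                         g_work : std_logic_vector(63 downto 0)) return std_logic_vector is
        variable result : std_logic_vector(7 downto 0) := (others => '0');
      begin
        for i in 0 to 7 loop            -- ciphertext bit i of this block
          for j in 0 to 7 loop          -- plaintext bit j of this block
            -- assumed layout: bit i of row j sits at position 8*j + i
            result(i) := result(i) xor (m_work(j) and g_work(8*j + i));
          end loop;
        end loop;
        return result;
      end function part_prod;
    end package body part_prod_sketch_pkg;

The eight resulting bits are XORed into the c_work byte of the current block column.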


This is repeated, incrementing ColCnt, until the current row of blocks is finished. ColCnt also addresses the current c_work, which saves an extra address register. When a new row is started, the next m_work is read in. When the multiplication is complete, the FSM changes to the error distribution state.

Generation and Distribution of the Errors

To distribute random errors among the ciphertext, the bit addresses of the error positions are generated on-the-fly by a fast and small PRNG based on the PRESENT block cipher [10]. Figure 6.7 shows how PRESENT is embedded into the PRNG. For real security this should be replaced by a TRNG or at least a CSPRNG, but a hardware design of these types of RNGs is far beyond the scope of this thesis.

Figure 6.7.: Parts of the PRNG ((a) Interface of the PRNG, (b) Interface of PRESENT)

The PRNG is seeded with a fixed 80-bit IV. Bits 0 to 63 are used as plaintext and the whole IV is used as key. The produced ciphertext is the random output and also the plaintext

for the next iteration. The key is scheduled by rotating it by one to the left and XORing a five-bit counter into the lowest bits after the rotation. Figure 6.8 gives an overview of the round function and key derivation. Due to the codeword length of 2048 bits, eleven bits are required to index each bit position. From one 64-bit random word, four error positions can be derived by splitting the word into four 16-bit words and using the lower eleven bits of each as index. Therefore, seven runs of PRESENT are required to generate the 27 errors. Because the ciphertext is stored byte-wise, the upper eight bits of each error position select the byte address of c_work in the buffer and the lower three a bit of this byte. The bit error is induced by toggling (0 → 1, 1 → 0) the bit, and afterwards the whole byte is written back to the buffer. After 27 errors are distributed, mce_encrypt signals completion to the top-level component by driving signal valid high. Now, the content of the ciphertext buffer is fed to the UART, and mce_encrypt and top_level return to the idle state. Finally, the encryption is complete, and either a new message can be encrypted or the FPGA can be switched over to initialization mode and a new matrix Ĝ can be read in.
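The derivation of the error positions from one PRNG word and the bit toggling just described can be summarized in the following sketch; the buffer type and the function name are illustrative, and seven PRESENT runs yield 28 candidate positions for the 27 required errors.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    package error_distribution_sketch_pkg is
      type byte_buffer_t is array (0 to 255) of std_logic_vector(7 downto 0);
      -- toggle the four ciphertext bits addressed by one 64-bit PRNG word
      function add_errors(buf  : byte_buffer_t;
                          rand : std_logic_vector(63 downto 0)) return byte_buffer_t;
    end package error_distribution_sketch_pkg;

    package body error_distribution_sketch_pkg is
      function add_errors(buf  : byte_buffer_t;
                          rand : std_logic_vector(63 downto 0)) return byte_buffer_t is
        variable b        : byte_buffer_t := buf;
        variable pos      : unsigned(10 downto 0);
        variable byte_adr : integer range 0 to 255;
        variable bit_adr  : integer range 0 to 7;
      begin
        for i in 0 to 3 loop
          -- the lower eleven bits of each 16-bit quarter give one bit position
          pos      := unsigned(rand(16*i + 10 downto 16*i));
          byte_adr := to_integer(pos(10 downto 3));   -- upper 8 bits: byte address
          bit_adr  := to_integer(pos(2 downto 0));    -- lower 3 bits: bit within the byte
          b(byte_adr)(bit_adr) := not b(byte_adr)(bit_adr);   -- toggle the bit
        end loop;
        return b;
      end function add_errors;
    end package body error_distribution_sketch_pkg;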


6.2. Decryption

Figure 6.9.: McEliece implementation on Spartan-3 2000 FPGA

Figure 6.10.: The XC3S2000 Development Board

6.2.1. Inverting the Permutation P

The first step in decryption is to revert the permutation P. As mentioned in Section 2.5, the inverse permutation matrix P⁻¹ is not stored as a matrix but as an array of 11-bit indexes into the ciphertext. P⁻¹ is 2048 ⋅ 11 = 22,528 bits large and stored in a ROM. Because one BRAM can only hold 18 Kbit, this ROM is built of two BRAMs with a simple 11-bit address bus. The input and output buses are depicted in Figure 6.13.

From this ROM, the indexes of the permutated ciphertext are now read consecutively as 11-bit values perm_address. The value at every address i indicates from which received ciphertext bit j the permutated bit i is taken. Because the received ciphertext is stored byte-wise, the upper 8 bits select the byte address in YandM_buff and the lower three the corresponding bit of the byte. Listing C in the appendix should make the process of permutation clear.
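Since Listing C is only given in the appendix, a compact behavioral equivalent of a single look-up is sketched here; the types and the function name are illustrative and correspond to the invP ROM and the byte-wise YandM_buff just described.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    package permute_sketch_pkg is
      type index_rom_t   is array (0 to 2047) of unsigned(10 downto 0);
      type byte_buffer_t is array (0 to 255)  of std_logic_vector(7 downto 0);
      -- bit i of the permutated ciphertext, fetched from the received ciphertext
      function permuted_bit(invP : index_rom_t;
                            buf  : byte_buffer_t;
                            i    : integer range 0 to 2047) return std_logic;
    end package permute_sketch_pkg;

    package body permute_sketch_pkg is
      function permuted_bit(invP : index_rom_t;
                            buf  : byte_buffer_t;
                            i    : integer range 0 to 2047) return std_logic is
        variable j        : unsigned(10 downto 0);
        variable byte_adr : integer range 0 to 255;
        variable bit_adr  : integer range 0 to 7;
      begin
        j        := invP(i);                     -- source bit index in the received ciphertext
        byte_adr := to_integer(j(10 downto 3));  -- upper 8 bits: byte address in the buffer
        bit_adr  := to_integer(j(2 downto 0));   -- lower 3 bits: bit within that byte
        return buf(byte_adr)(bit_adr);
      end function permuted_bit;
    end package body permute_sketch_pkg;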

After eight iterations, one byte of the permutated ciphertext has become available and is passed to the goppa_decode component. This component can immediately start computing a partial syndrome while the next permutated byte is generated concurrently. Both components communicate via a simple handshake protocol with two signals, byte_ready and need_byte, which indicate whether mce_decrypt has completed a byte and whether goppa_decode has finished processing the previous byte. After the permutation is complete, the FSM of mce_decrypt waits until goppa_decode has finished decoding the permutated ciphertext. The implementation of decoding is explained in the following, based on Algorithm 4 and the decisions made in Section 5.1.

Figure 6.11.: Overview of mce_decrypt Component

Figure 6.12.: Buffer for Ciphertext and Plaintext

Figure 6.13.: ROM containing the Inverse Permutation Matrix

6.3. Decoding

The goppa_decode component is the crucial part of the McEliece implementation in hardware. It operates on large polynomials of up to 28 ⋅ 11 = 308 bits in length. Operations on these polynomials involve assigning specific values or shifting and manipulating individual coefficients. This requires large multiplexers, shift registers, or massively parallel arithmetic operations. For details on how often each operation is performed, refer to Chapter 5.

Figure 6.14 shows the interface of this component. As mentioned in Section 6.2.1, the ciphertext is read in byte by byte, and in the same way the decoded message is written out byte-wise. Figure 6.15 shows the different parts of the decoding component.

  • 6.3 Decoding 47

Figure 6.14.: Interface of goppa-decode Component

The different steps during decoding of a codeword are controlled by an FSM which enables the involved components with a start signal and awaits completion, indicated by the signal valid going high. These two signals are used in every component except those that require only a single clock cycle. Figure 6.16 shows the FSM of the goppa_decode component. When goppa_decode has taken a byte, it is written to a buffer cipher_BRAM_buff, because the ciphertext is needed again for the error correction step after decoding. Then the current byte is scanned bit-wise. If a 1 is found, the corresponding trailing coefficient of the polynomial that has to be inverted (see Equation (3.6)) is looked up in the SList ROM. This polynomial is then fed into the poly_EEA component (see Section 5.1 for details). We observed that a polynomial of degree 27 occurs only once during each run of the EEA, namely in the first iteration, when the polynomial A(z) is set to the Goppa polynomial. To save the slices occupied by these components, the finite state machine takes care of this special case. By resolving this case with the signal first_run and manually setting the degree of A(z) to 27 and lc(A(z)) = 1 without computing it, all components can be reduced in size and only need to operate on polynomials of maximum degree 26. This saves 11 FFs, a complete multiplier and one multiplexer stage in each (sub)component, and additionally one clock cycle per EEA run. The resulting FSM is depicted in Figure 6.17.

In the control state checkab, the degree and the leading coefficient of both polynomials A and B are computed, and the break condition deg(A) = stopdegree is checked. For the normal EEA, stopdegree is zero, but this component is also used to solve the key equation (see Equation 3.13). In this case, the EEA must be stopped as soon as deg(A) reaches ⌊deg(g(z))/2⌋ = ⌊27/2⌋ = 13. Assigning the correct stopdegree is handled by the finite state machine of the goppa_decode component (see Figure 6.16). In reduce1, the difference k of the degrees of A and B and also lc(A)/lc(B) are computed. State reduce2 then waits for completion of the two poly_mul_shift_add components and returns to checkab. When the break condition deg(A) = stopdegree is true (depending on the actual value of stopdegree), the state switches to normalize if stopdegree = 0 or to ready otherwise. After normalization the ready state is entered, in which the FSM waits until signal start drops to zero. Then, the EEA is finished and returns to the idle state.

Figure 6.15.: Overview of goppa-decode Component

Figure 6.16.: FSM of the goppa_decode Component

When the complete syndrome is computed, Syn(z) is fed again into poly_EEA to compute T(z), and z is added. In the first version of the implementation all polynomials were kept in a std_logic_vector(296 downto 0), but it turned out that the synthesis tool is unable to handle such large vectors well. For example, in a simple XOR, both input coefficients were placed in one corner of the FPGA and the output far away in another corner, leading to a large routing delay (nearly 90% of the overall delay). So the description of polynomials was modified to type PolyArray_t is array(26 downto 0) of std_logic_vector(10 downto 0). Now the synthesis tool seems to be able to identify the regular structure and place corresponding coefficients closer together. This leads to a higher achievable frequency compared to the original version.


Figure 6.17.: FSM of the EEA Component

6.3.1. Computing the Square Root of T(z) + z

The next component handles line 6 of Algorithm 4, which is a polynomial square root. To compute the square root of T(z) + z, first T(z) + z is