Upload
others
View
6
Download
0
Embed Size (px)
Citation preview
IN DEGREE PROJECT INFORMATION AND COMMUNICATION TECHNOLOGY,SECOND CYCLE, 30 CREDITS
, STOCKHOLM SWEDEN 2017
Evaluation of Cryptographic CRC in 65nm CMOS
YANG YU
KTH ROYAL INSTITUTE OF TECHNOLOGYSCHOOL OF INFORMATION AND COMMUNICATION TECHNOLOGY
Evaluation of Cryptographic CRCin 65nm CMOS
YANG YU
Stockholm 2017
Master ThesisSchool of Information and Communication Technology
KTH Royal Institute of Technology
Abstract
With the rapid growth of Internet-of-Things (IoT), billions of devices are expected tobe interconnected to provide various services appealing to users. Many devices willget an access to valuable information which is likely to increase the number ofmalicious attacks on these devices in the future. Therefore, security is considered asone of the most critical challenges in the development of IoT. In order to secureresource-constrained devices such as sensors or radio frequency identification (RFID)tags which form the backbone of IoT, lightweight cryptographic algorithms arerequired. This thesis focuses on the problem of message authentication.
To authenticate a message means to verify that the message: (1) comes from theright sender (i.e. its authenticity), and (2) has not been modified (i.e. its integrity). It ischallenging to use traditional message authentication methods in resource-constraineddevices because typically they can allocate only a few hundred gates for implementingsecurity due to their limited computing, storage and energy resources.
To address these needs, a new message authentication algorithm based on aCryptographic Cyclic Redundancy Check (C-CRC) was developed by KTH incollaboration with Ericsson. In this thesis, we implemented C-CRC and compared itwith KECCAK Message Authentication Code (KMAC) standardized by the NationalInstitute of Standards and Technology (NIST) in 2016.
First, MATLAB and Verilog versions were developed for both algorithms. Thecomparison of these two versions allowed us to verify the correctness of algorithmsfunctionality. After that, the Verilog descriptions were simulated in ModelSim andsynthesized using Synopsys design compiler. Finally, placement and routing wasperformed using Cadence SoC Encounter. The evaluation results show that C-CRCoutperforms KMAC in terms of area, power, throughput per area, and energy per bit.However, C-CRC is worse than KMAC in terms of latency. We have also investigatedseveral different options of implementing C-CRC, including producing more than onebit of output per clock cycle. We found that such a technique improves throughput ofC-CRC with the minimal penalty in area and power consumption.
Keywords— MAC, KMAC, cryptographic CRC, Simulation, Verilog HDL, MATLAB
Acknowledgment
First of all, I would like to thank my examiner, Professor Elena Dubrova, for givingme the chance to join this interesting topic. She also inspired me to keep questioningand studying since I took her course in 2015.
Secondly, I want to express my sincere gratitude to my supervisor Ms. Sha Tao forhelp and kindness at all times. She gave me considerable advices at almost every stepof this thesis.
Thirdly, I am much obliged to Professor Gerald Q Maguire Jr, the course examinerof Research Methodology and Scientific Writing. He not only taught me what to dobut also how to do.
Fourthly, I would like to thank Mr. John Mattsson, from Ericsson, for attendingthe mid-term presentation and giving me many pivotal suggestions.
A big thanks goes to my programme coordinator and friends at the school of ICT,Ms. May-Britt Eklund Larsson.
Last but not the least, I would like to thank my family and my friends in Sweden.
iii
Contents
1 Introduction 11.1 Previous Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31.3 Performance Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . 31.4 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Overview of KMAC and C-CRC 52.1 KMAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 KECCAK[256] Sponge Function . . . . . . . . . . . . . . . . . 62.1.2 KECCAK - f [1600] Permutation Function . . . . . . . . . . . . 7
2.2 C-CRC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Implementation and Evaluation 143.1 MATLAB Implementation . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 C-CRC Implementation in MATLAB . . . . . . . . . . . . . 153.1.2 KMAC128 Implementation in MATLAB . . . . . . . . . . . 15
3.2 ASIC Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173.2.1 Variable Specification . . . . . . . . . . . . . . . . . . . . . 173.2.2 KMAC128 Implementation in HDL . . . . . . . . . . . . . . 183.2.3 C-CRC Implementation in HDL . . . . . . . . . . . . . . . . 203.2.4 Simulation Results and Analysis . . . . . . . . . . . . . . . . 223.2.5 Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4 Discussion and Conclusion 254.1 Implementation of Alternative Specification . . . . . . . . . . . . . . 254.2 Implementation of Special Constraints . . . . . . . . . . . . . . . . . 26
4.2.1 Latency Constraint . . . . . . . . . . . . . . . . . . . . . . . 264.2.2 Power Constraint . . . . . . . . . . . . . . . . . . . . . . . . 28
4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
A Verilog HDL Transcript 32A.1 KMAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32A.2 C-CRC with parallel output . . . . . . . . . . . . . . . . . . . . . . . 44
v
List of Figures
1.1 Increasing connectivity across people and devices. . . . . . . . . . . . 1
2.1 The data flow diagram of KMAC128 . . . . . . . . . . . . . . . . . . 52.2 The sponge construction of KECCAK[256] [1] . . . . . . . . . . . . . . 72.3 Illustration of θ applied to a single bit [1] . . . . . . . . . . . . . . . 82.4 Illustration of ρ for a 200 bit string [1] . . . . . . . . . . . . . . . . . 92.5 Illustration of π applied to a 5-by-5 slice [1] . . . . . . . . . . . . . . 102.6 Illustration of χ applied to a 1-by-5 row [1] . . . . . . . . . . . . . . 102.7 The data flow diagram of C-CRC . . . . . . . . . . . . . . . . . . . . 13
3.1 The project design methodology flow chart . . . . . . . . . . . . . . 143.2 C-CRC structure in MATLAB implementation . . . . . . . . . . . . . 153.3 KMAC structure in MATLAB implementation . . . . . . . . . . . . . 163.4 The adapted KMAC128 algorithm data flow diagram . . . . . . . . . 183.5 Two types of LFSR . . . . . . . . . . . . . . . . . . . . . . . . . . . 203.6 The optimized Galois LFSR with less latency . . . . . . . . . . . . . 213.7 The programmable LFSR with any generator polynomial . . . . . . . 213.8 C-CRC with two designs of the output section . . . . . . . . . . . . . 223.9 Layout of the implementations of KMAC128 and C-CRC . . . . . . . 24
4.1 The sequence diagram of KMAC128 with a longer input message . . 264.2 The data flow diagram of KMAC with round instance of two . . . . . 274.3 Power gating in LFSR of C-CRC . . . . . . . . . . . . . . . . . . . . 28
vi
List of Tables
2.1 The computation of 3-bit CRC . . . . . . . . . . . . . . . . . . . . . 12
3.1 Specifications of the identical variables . . . . . . . . . . . . . . . . . 173.2 Round constant value in little-endian and hexadecimal format . . . . . 193.3 Synthesis Results for The Proposed MAC Functions . . . . . . . . . . 233.4 Comparison of total area of core . . . . . . . . . . . . . . . . . . . . 24
4.1 Comparison in 1024 bits message length and 10MHz Clock Frequency 27
viii
Acronyms
ASIC Application-Specific Integrated CircuitCRC Cyclic Redundancy CheckGE Gate EquivalentGF Galois FieldHMAC Keyed-Hash Message Authentication CodeKAT Known Answer TestKMAC KECCAK Message Authentication CodeLFSR Linear-Feedback Shift RegisterMAC Message Authentication CodeMD Message DigestNIST National Institute of Standards and TechnologyPRF Pseudo-Random FunctionRFID Radio Frequency IDentificationSHA Secure Hash Algorithm
x
Chapter 1
Introduction
Today, we are at a turning point as embedded systems are increasingly found to beriddled with security vulnerabilities. With the trend of the Internet-of-Things (IoT), anincreasing amount of embedded devices that used to have no communicationcapabilities are integrated into large systems and thus connected to the Internet (shownin Figure 1.1). The embedded devices, especially the low-end ones, are initiallydesigned without security concerns and not able to fix the problems through themethods like updating software. The risk is raised that the attackers can gain physicalaccess to these devices and thus perform certain attacks such as side-channel attack.
Cloud
Figure 1.1: Increasing connectivity across people and devices.
Message authentication code (MAC) is a piece of code to authenticate a messagefrom the sender [2]. Compared to the hash functions, the MAC value protects not onlyintegrity but also authenticity of the message. In other words, the receiver verifies thatthe message comes from the right sender and has not been modified. In most cases,the embedded devices like sensors or RFID tags are constrained in energy and areaconsumption, computation power, memory and cost. Therefore, the solutions need tobe secure as well as efficient to implement on these devices. In consideration of theabove, we tried to find a solution to achieve error detection and data integrityprotection simultaneously with the minimal penalty in area and power consumption.
2 Introduction
1.1 Previous Work
In 1996, a new type of MAC, called keyed-hash message authentication code(HMAC) was defined and published [3]. HMAC involves a user-selected hashfunction and a preset secret key. Combined with any hash function, such as MessageDigest 5 (MD5) or Secure Hash Algorithm 1 (SHA-1), HMAC is able to verify dataintegrity and message authentication. In terms of security strength, HMAC relies onthe security strength of three main factors, including the underlying hash function, thehash output length, and the secret key length. In 2008, National Institute of Standardsand Technology (NIST) published the standard for HMAC [4].
Unfortunately, MD5 and SHA-1 are no longer considered secure in today’scomputer security environment. In 2013, a collision attack was announced which canbreak MD5 collision resistance with a regular computer in less than one second [5].On February 23rd, 2017 the first collision attack against SHA-1 was announced byGoogle and CWI Amsterdam [6]. Accordingly, new hash functions as well as newMAC functions should be taken into consideration.
In 2006, NIST started to organize a competition for a new hash standard. InDecember 2010, KECCAK [7] was selected to be the winner against other 50candidates. In 2015, NIST published the standard for the Secure Hash Algorithm-3(SHA-3) function [1], which is the official alias of KECCAK. In 2016, a new MACstandard, called KECCAK Message Authentication Code (KMAC), was published byNIST [8]. The only thing that is missing in the application of low-cost embeddedsystems is the huge area and power consumption with regard to the heavycomputation. Therefore, a light-weight with enough security-satisfaction MACfunction is what we need.
In 1961, Cyclic Redundancy Check (CRC) was published by Wesley Peterson [9].CRC is an error-detecting code which encodes messages by padding a checksum ofcertain length, commonly used within communication networks. CRC are not onlyeasy to implement in hardware but also beneficial in being particularly suitable fordetecting burst errors. In practice, an n-bit CRC applied on an arbitrary size ofmessage block will detect any single to n bits burst error. Because the checksum has afixed length, the CRC computation function can be used as a hash functionoccasionally. Nowadays, various standards of CRC have been published and appliedon billions of devices.
In 2017, a new MAC scheme was proposed by the collaboration of KTH andEricsson on the basis of the cryptographic Cyclic Redundancy Check (CRC) [10](referred as C-CRC in the following). Different from other CRC based MAC, it usesrandom rather than irreducible generator polynomials. In the security analysis of thisnew scheme, the result showed that it is particularly suitable for short messages (up toa few tens of bytes). That paper also showed interest in combining the build-in CRCsection of devices with C-CRC.
1.2 Contribution 3
1.2 ContributionThis thesis presents the implementation of KMAC and C-CRC in 65nm CMOS, andcompares them using performance metrics. MATLAB and Verilog versions areimplemented for both of the algorithms. With the comparison between these twoversions’ simulation, the algorithms’ functionality are validated together with the testvectors provided by NIST. After that, the two functions were synthesized, and thesimulation results are evaluated and analyzed. Finally, to provide a morecomprehensive conclusion, the influence of specification and special constraints ontwo functions were under discussion. With the discussion, the following study couldeasily compare the two functions without actually implementing them.
1.3 Performance MetricsGenerally, area and power is the most significant indexes for embedded devices,especially the low end ones. With concern for the efficiency of area and power,throughput per area and energy per bit are listed besides the common performancemetrics . All the comparison should be performed within the same manufacturingtechnology and environment, as well as the input specification and simulationprocedure.
Area
The metric ”area” stands for the total area of implemented circuit, including the area ofcells and interconnections. Throughout this thesis, micrometer (µm) rather than GateEquivalent (GE) is used as the primary measure for circuit area. In application-specificintegrated circuit (ASIC) design, GE is define as the area of a two-input NAND gateindependently of the manufacturing technology. In our project, 1 GE is equal to 1.44µm2 according to the applied technology databook [11]. The values of area given bythe synthesis tool are presented in GE. To give an intuitive comparison, the values aretransfered into micrometers instead.
Power
”Power” is another important metric especially in embedded devices. The constrainton power consumption results from the limitation from the applied technology and therequest for extra functionality or increased working time. Both the overall static powerconsumption (same in both measurements), and the dynamic power consumption ofthe I/O interface have been subtracted. The ”total power” in this thesis corresponds tothe sum of dynamic power and cell leakage power. To be noticed, the power valuesmay differ under different operating voltages for the same design.
Latency
”Latency” is the duration from the beginning of an input message injected to the endof the corresponding output string obtained. This metric represents the requiredprocessing time of one input message regardless of whether messages can beprocessed at the same time. In this section, each implementation is considered in their
4 Introduction
best performance. Only the previous rounds of the very long messages are taken intoconsideration. The effects of the finalization and communication stages are neglected.
Throughput
The metric ”throughput” is defined as the maximum output bits for eachimplementation can be produced per unit time. This value only counts when theoutput is active. With this value together with latency, we can decide which one is theperformance bottleneck of the implementation. The throughput values are given inmega (106) bits per second (Mbit/s).
Energy per Bit
”Energy per Bit” represents efficiency in power consumption per output string bit.
energy per bit =power
throughput(1.1)
Throughput per Area
”Throughput per Area” represents efficiency in throughput per area consumption.
throughput per area =throughput
area(1.2)
1.4 Thesis OrganizationThe main object of this thesis project is to implement and evaluate C-CRC in 65nmand compare its performance with KMAC. Throughout the thesis, the two MACfunctions were implemented first in MATLAB then in HDL. Next, the simulationresults of two MAC functions were analyzed and compared. Finally, a discussion ofthe implementations of various input parameter specification and special constraintsare presented at the end of this thesis report.
Chapter 2 gives a brief introduction of KMAC and C-CRC algorithm. KMAC isintroduced step by step due to its multi-layer structure. The introduction of C-CRCfocuses on CRC and its computation.
Chapter 3 shows the detailed implementation of KMAC128 and C-CRC, includingMATLAB and HDL versions. After that, their simulation results are presented andcompared with each other. Finally, the layout and the reported total core area areshown.
Chapter 4 presents the discussion of the KMAC and C-CRC alternativeimplementations of various input parameter specification and special constraints. Atlast, the final conclusion are drawn with several suggestions for future work.
Chapter 2
Overview of KMAC and C-CRC
By definition, the output of MAC functions is produced by two independent inputparameters, which are the message and the secret key. Normally, the length of the keyhas certain influence on the security level of MAC. In this chapter, the algorithm-levelintroductions of KMAC and C-CRC are presented in their original definitions. On thebasis of this, the proposed functions are reorganized and implemented in Chapter 3.
2.1 KMACIn FIPS PUB 202, KECCAK was introduced. The KECCAK algorithm is a family of allsponge functions with a KECCAK− f permutation as the underlying function andmulti-rate padding as the padding rule. As the name suggests, KMAC is also builtfrom the KECCAK algorithm. In order to illustrate the definition of KMAC, thefollowing subsections are structured in the top-down construction.
The KMAC function is a pseudo-random function (PRF) and keyed hash functionon the basis of the KECCAK. For different security strength requirements, there are twovariants of KMAC, KMAC128 and KMAC256. Take into account both theapplication area and balanced security strength with C-CRC, KMAC128 is adopted inthis thesis. To give an intuitive review of the algorithm, KMAC128 is transformed intothe data flow diagram in Figure 2.1.
EncryptionKey
Message
T || newM || 00
Customized String
KECCAK[256]
Figure 2.1: The data flow diagram of KMAC128
The following parameters are used in KMAC128 definition:
• K : a key bit string of at least the required security length
6 Overview of KMAC and C-CRC
• M : the input message bit string
• L : an integer representing the required output length in bits
• S : the customization bit encryption string of any length, including null
Algorithm 2.1 KMAC128(K,M,L,S)Input: K,M,L,SOutput: Z, len(Z) = len(L)
newM = bytepad(encode string(K),168)‖M ‖ right encode(L)T = bytepad(encode string(”KMAC”)‖encode string(S),168)Z = KECCAK[256](T ‖newM ‖00,L)
Three internal functions, right encode, encode string and bytepad, are used toencode the intermediate values. right encode(x) encodes the input integer x as a bytestring and inserts the length of the string after (on the right of) the representing stringof x. encode string(S) utilizes a similar function as right encode and inserts theencoding before (on the left of) S. bytepad(X ,w) adds an le f t encode(w) to the stringX , then pads the result with zeros until it is a byte string whose length in bytes is amultiple of w.
2.1.1 KECCAK[256] Sponge FunctionIn the KMAC definition, KECCAK[256] is called to perform the last step. The numberbetween the square brackets is called the capacity, which indicates the fixed length ofthe outputs of the underlying function, KECCAK− f permutation, minus the number ofinput bits processed. In consideration of security strength, the capacity is proved tobe the twice of the collision, which is suitable for KMAC128.The parameters used inKECCAK[256] definition are similar to KMAC128 if they have the same name.
Algorithm 2.2 KECCAK[256](M,L)Input: M,LOutput: Z, len(Z) = L
P = M ‖pad10*1(1344, len(M))n = len(P)/1344for i = 0 to n−1 do
Pi = P(1344∗ i,1344∗ i+1343)end forS = 01600
for i = 0 to n−1 doS = KECCAK - f [1600](S⊕(Pi ‖0256))
end forZ = S(0,1344)if d ≤ |Z| then
Z = Z(0,256)else
S = KECCAK - f [1600](S)Z = Z ‖S(0,1344)
end if
2.1 KMAC 7
KECCAK[256] is constructed by three components, the underlying function,KECCAK - f , the capacity and the padding rule, pad10*1. KECCAK[256] is called asponge function in FIPS PUB 202. The analogy to a sponge is that an arbitrary lengthof input message (len = 1600−256) is ”absorbed” into the underlying function, afterwhich the same length of output string is ”squeezed” out of the underlying function.This denotation is illustrated in Figure 2.2.
Figure 2.2: The sponge construction of KECCAK[256] [1]
2.1.2 KECCAK - f [1600] Permutation Function
The parameter specifies KECCAK - f permutation function, called the width, whichsuggests the maximum length of the strings that can be processed each time. Anomitted parameter, called the round, is defined to be 24 in default. Put differently,every KECCAK - f [1600] function call consists of 24 rounds of permutation. In eachround, a similar routine is performed with a different value, called the round constant.
A conversion from the input string to a 5-by-5-by-64 array, called the state array,acts as the first step of permutation. Next, five functions, called the step mapping, areperformed in order on the state array. At last, a reverse of the first step are executed toachieve the final output.
The first step mapping is called θ . The illustration of applying function θ to asingle bit in the state array shown in Figure 2.3.
8 Overview of KMAC and C-CRC
Figure 2.3: Illustration of θ applied to a single bit [1]
Algorithm 2.3 θ(A)Input: state array AOutput: state array A′
for x = 0 to 4 dofor z = 0 to 63 do
C(x,z) = A(x,0,z)⊕A(x,1,z)⊕A(x,2,z)⊕A(x,3,z)⊕A(x,4,z)end for
end forfor x = 0 to 4 do
for z = 0 to 63 doD(x,z) =C((x−1) mod 5,z)⊕C((x+1) mod 5,(z−1) mod 64)
end forend forfor x = 0 to 4 do
for y = 0 to 4 dofor z = 0 to 63 do
A′(x,y,z) = A(x,y,z)⊕D(x,z)end for
end forend for
2.1 KMAC 9
The second step mapping is called ρ . The illustration of applying function ρ to a200 bit string shown in Figure 2.4.
Figure 2.4: Illustration of ρ for a 200 bit string [1]
Algorithm 2.4 ρ(A)Input: state array AOutput: state array A′
for z = 0 to 63 doA′(0,0,z) = A(0,0,z)
end for(x,y) = (1,0)for t = 0 to 23 do
for z = 0 to 63 doA′(x,y,z) = A(x,y,(z− (t +1)(t +2)/2) mod 64)(x,y) = (y,(2x+3y) mod 5)
end forend for
The third step mapping is called π . The illustration of applying function π to a5-by-5 bit slice of the state array shown in Figure 2.5.
Algorithm 2.5 π(A)Input: state array AOutput: state array A′
for x = 0 to 4 dofor y = 0 to 4 do
for z = 0 to 63 doA′(x,y,z) = A((x+3y) mod 5,x,z)
end forend for
end for
10 Overview of KMAC and C-CRC
Figure 2.5: Illustration of π applied to a 5-by-5 slice [1]
The forth step mapping is called χ . The illustration of applying function π to a1-by-5 bit row of the state array shown in Figure 2.6.
Figure 2.6: Illustration of χ applied to a 1-by-5 row [1]
Algorithm 2.6 χ(A)Input: state array AOutput: state array A′
for x = 0 to 4 dofor y = 0 to 4 do
for z = 0 to 63 doA′(x,y,z) = A(x,y,z)⊕((A((x+1) mod 5,x,z)⊕1) ·A((x+2) mod 5,y,z))
end forend for
end for
2.2 C-CRC 11
The last step mapping is called ι .
Algorithm 2.7 ι(A)Input: state array AOutput: state array A′
for x = 0 to 4 dofor y = 0 to 4 do
for z = 0 to 63 doA′(x,y,z) = A(x,y,z)
end forend for
end forRC = 064for j = 0 to 6 do
RC(2 j−1) = rc( j+7ir)end forfor z = 0 to 63 do
A′(0,0,z) = A′(0,0,z)⊕RC(z)end for
The rc in the Algorithm 2.7 is a function that determines the round constant,denoted by RC.
Algorithm 2.8 rc(t)Input: integer tOutput: bit RC
if t mod 255 = 0 thenRC = 1
end ifR = 10000000for i = 1 to t mod 255 do
R = 0‖RR(0) = R(0)⊕R(8)R(4) = R(4)⊕R(8)R(5) = R(5)⊕R(8)R(6) = R(6)⊕R(8)R = R(0 : 7)
end forRC = R(0)
2.2 C-CRCFor CRC, there is a so-called generator polynomial in the specification. In practice,each bit string is associated with a polynomial over the Galois Field of two elements(GF(2)), where the coefficients stand for the bits. In GF(2), addition andmultiplication operation corresponds to logical XOR and AND respectively. In adivision of polynomials, the message becomes the dividend and the generatorpolynomial becomes the divisor. The quotient of this division is neglected while the
12 Overview of KMAC and C-CRC
remainder is the CRC checksum. The highest index of a polynomial is called thedegree. The degree of the checksum can be therefore determined by the generatorpolynomial degree.
M(x) · xn = Q(x) ·g(x)−R(x) (2.1)
To compute an n-bit binary CRC in the mathematical way, first place the messagebits with n-bit 0 in a row, position the generator polynomial start from the left-hand ofthe message. To divide the message with the generator polynomial, bitwise XOR ofthe upper with the lower bits. The message bits not above the generator polynomialbits remain the same and are duplicated for the next step. Then, the generatorpolynomial is shifted to the right one bit, and bitwise XOR the upper bits. Thisoperation is repeated until the generator polynomial reaches the end of the message.Take message ”11010011101101” and generator polynomial ”1011” as an example,the computation is shown in Table 2.1.
11010011101101 000101101100011101101 0001011
00111011101101 0001011
00010111101101 0001011
00000001101101 0001011
00000000110101 0001011
00000000011001 0001011
00000000001111 0001011
00000000000100 000101 1
00000000000001 1001 011
00000000000000 111
Table 2.1: The computation of 3-bit CRC
In software, the polynomial division can be realized by bitwise XOR the non-zerobit in the message string. The follwing pseudo code gives the details when implement-ing CRC computation in software.
2.2 C-CRC 13
Algorithm 2.9 CRCInput: M(1 : L), g(1 : n+1)Output: R
R = 0for i = 1 to L do
R = R⊕[M(i)‖zeros(n−1)]if R(n−1) = 1 then
R = (R‖0)⊕gelse
R = R‖0end if
end for
The following parameters are used in C-CRC definition:
• M(x) : the input message polynomial
• g(x) : the generator polynomial
• S(x) : the customized polynomial
• L : the required generator polynomial degree
Algorithm 2.10 C-CRCInput: M(x), g(x), S(x), LOutput: Z(x), deg(Z) = deg(g)
Hg(x) = M(x) · xL mod g(x)Z(x) = Hg(x) + S(x)
To be pointed out, g(x) in the definition represents any generator polynomial withnon-zero constant term. To give an intuitive review of the algorithm, C-CRC is trans-formed into the data flow diagram in Figure 2.7.
CRC
ComputationGenerator Polynomial
Message
Encryption String
Encryption
Figure 2.7: The data flow diagram of C-CRC
Chapter 3
Implementation and Evaluation
The implementation of two algorithms consists of two main parts, includingMATLAB implementation and ASIC design. First, the proposed messageauthentication functions are implemented with their original definitions usingMATLAB. Next, a typical ASIC design flow is performed to obtain the circuits andthen validated by the MATLAB version. The overall implementation methodology isillustrated as flow chart in Figure 3.1.
ASIC Design
START
HDL Design
Logic Synthesis
Rou�ng
Placement
END
Func�onal
Simula�on
Matlab
Implementa�on
Func�onal
Valida�on
Figure 3.1: The project design methodology flow chart
3.1 MATLAB Implementation 15
3.1 MATLAB Implementation
There is no available implementations or test vectors of KMAC when this projectbegan. Therefore, a validated implementation of the original algorithms are requiredto prove the correctness of the ASIC design. The reasons why MATLAB was chosenare that intermediate values can be viewed easily when testing, and there are variousbuild-in functions which can be used directly. With less self-written functions, thecodes are tested faster.
3.1.1 C-CRC Implementation in MATLAB
There are numerous available MATLAB codes of CRC implementation on the Internet.For the ordinary CRC calculation, a build-in function, called deconv, is employed toobtain the coefficients of the remainder. The first task of implementing C-CRC is to addthe generator polynomial as an additional input in order to realize the programmablefeature. Then, the coefficients are converted to GF(2) for the final result. The flowchart illustrates the C-CRC implementation in MATLAB shown in Figure 3.2.
START
[M 0length(g)-1] Mx
deconv(Mx,g) rx
r(length(M)+1:end)
mod(|rx|,2) r
END
Figure 3.2: C-CRC structure in MATLAB implementation
3.1.2 KMAC128 Implementation in MATLAB
In the definition, KMAC128 function follows a top-down structure. Correspondingto this, a bottom-up approach was adopted in the MATLAB implementation. Eachfunction from the bottom layer must be tested before building the up-layer functions.In addition, the KECCAK[256] function is validated with the test vectors provided by theKECCAK designers [12]. The flow chart illustrates the KMAC128 implementation inMATLAB shown in Figure 3.3.
16 Implementation and Evaluation
KMAC128
KECCAK[256]
KECCAK-f[1600]
START
bytepad(encode(K)) || M || encode(L) newM
bytepad(encode("KMAC") || encode(S)) T
i < n
n = length(P) / 1344;
i = 0;
S = 01600;
True
pad(NewM || T || 00) P
True
False
S = S ⊕ P(i*1344+1 : (i+1)*1344)
Concert S into state array A
Concert A into string S
j < 24
j = 0
A = �(�(�(�(�(A)))),j)
j = j + 1
i = i + 1
j = 0
False
Z = S(1 : L)
END
Figure 3.3: KMAC structure in MATLAB implementation
3.2 ASIC Design 17
There are two noteworthy things when it comes to KMAC128 implementation inMATLAB. The first one is the endianness of the processing bit strings. Not only theinput and the output string, but also the intermediate string could cause such problem.Take KECCAK[256] as an example, the order of the provided test vector inputs is aspecial format. There are two specific bit-reordering functions for the provided testinput, which can be found in the appendix of [1]. It would be mistaken if the inputsare used directly. Second, different build-in functions have restriction on the matrixdimension and the variable type in MATLAB. The description of functions should bechecked carefully before used.
3.2 ASIC Design
As mentioned before, a typical cell-based ASIC design flow is adopted in this project.This denotes that pre-designed logic cells, called standard cells, which are selecteddirectly from the standard-cell library, construct the outcome circuit. For this project,the software tools that are used in HDL design, functional simulation and logicsynthesis are Verilog HDL, ModelSim and Synopsys respectively. The synthesisresults are based on the 65-nanometer CMOS process technology provided by UMC65LL (Low Leakage).
3.2.1 Variable Specification
Unlike the MATLAB implementation, hardware description language (HDL) treatsvariables in fixed-point rather than floating-point. As a result, it is obligatory tospecify the length of input and output in ASIC design, likewise the intermediatevalues. The application area was taken into account when the lengths were specified.
In this section, only the specifications of input and output are revealed. The lengthof each input and output variable should be equivalent for both KMAC and C-CRC.Otherwise, it would be unfair to compare their performance under this condition.Moreover, the application area is also taken into account to produce a practical design.The specifications of the identical variables in both KMAC and C-CRC are shown inTable 3.1.
Length (bit)Data Transfer Width 128Input Message String 256, 512 or 1024
Secret Key (KMAC128) /Generator Polynomial Degree (C-CRC) 128
Customized String (KMAC128) /Encryption String (C-CRC) 128
Output String 128
Table 3.1: Specifications of the identical variables
18 Implementation and Evaluation
3.2.2 KMAC128 Implementation in HDL
To achieve an accurate and efficient implementation, a similar bottom-up approach isadopted in HDL as the one in MATLAB. On the basis of the algorithm structure inFigure 2.1, the KMAC128 implementation is divided into two main sections,including one section performs KECCAK - f [1600], and the other section that encryptsand controls the input and the round number of the first section.
According to the specification in Table 3.1, the maximum of input message lengthis 1024 bits. Consider the fact that the function bytepad(S,168) keep padding the inputstring until it can be divisible by 168 times 8, the output length should be an integermultiple of 1344. In our case, it should be 1344. Coincidentally, the first few steps ofKECCAK[256] achieve such a function of splitting the input after padding it to an integermultiple of 1344. Therefore, the input to the KECCAK[256] algorithm should be dividedinto three parts of the same length, newK, newS and newM. The adapted algorithm ispresented in Algorithm 3.1. The adapted KMAC128 algorithm is shown in Figure 3.4.
Algorithm 3.1 Adapted KMAC128(K,M,L,S)Input: K,M,L,SOutput: Z, len(Z) = len(L)
newK = bytepad(encode string(K),168)newS = bytepad(encode string(”KMAC”)‖encode string(S),168)newM = M ‖ right encode(L)‖00Z = KECCAK[256](newS‖newK ‖newM,L)
EncryptionKey
Message
newKey
newString
newMessage
Customized String
KECCAK-
f[1600] 24
KECCAK-
f[1600] 24KECCAK-
f[1600] 24
Figure 3.4: The adapted KMAC128 algorithm data flow diagram
On the basis of the adapted version of KMAC128, KECCAK[256] algorithm is alteredas Algorithm 3.2. Instead of spliting and pading, the inputs are used directly from theprevious algorithm. In this way, parallelism is able to take place while reading andprocessing. According to the test vector provided by the designer of KECCAK family, theproposed implementation of KECCAK[256] algorithm is validated through the KnownAnswer Test (KAT) provided by NIST.
3.2 ASIC Design 19
Algorithm 3.2 Optimized KECCAK[256](newS‖newK ‖newM,L)Input: newS‖newK ‖newM,LOutput: Z, len(Z) = L
P0 = newSP1 = newKP2 = newM ‖pad10*1(1344, len(M))S = 01600
for i = 0 to 2 doS = KECCAK - f [1600](S⊕(Pi ‖0256))
end forZ = S(0,256)
In the section of KECCAK - f [1600], one implementation technique is to process thestate array by a combination of bits at a time. For example, in the first step of the stepmapping θ , instead of XOR the bits one by one, the bits of the same lane can beXORed to produce the same result.
Another technique is to do some computations ahead. In Algorithm 2.8, the roundconstant RC is determined by the round index, ir. This should be converted to fixedvalue when implemented in HDL. The value for RC is precomputed using MATLAB.The values in hexadecimal and little-endian format are shown in Table 3.2.
RC[0] 0000000000000001 RC[12] 000000008000808BRC[1] 0000000000008082 RC[13] 800000000000008BRC[2] 800000000000808A RC[14] 8000000000008089RC[3] 8000000080008000 RC[15] 8000000000008003RC[4] 000000000000808B RC[16] 8000000000008002RC[5] 0000000080000001 RC[17] 8000000000000080RC[6] 8000000080008081 RC[18] 000000000000800ARC[7] 8000000000008009 RC[19] 800000008000000ARC[8] 000000000000008A RC[20] 8000000080008081RC[9] 0000000000000088 RC[21] 8000000000008080RC[10] 0000000080008009 RC[22] 0000000080000001RC[11] 000000008000000A RC[23] 8000000080008008
Table 3.2: Round constant value in little-endian and hexadecimal format
The other implementation techniques include to treat the (x− n)mody case asevery bit loop n positions in a string of length y, and to ignore the unchanged value inthe algorithm such as the last step of ι . Bit ordering is noteworthy when processingthe multidimensional array in HDL. For example,wire [array length− 1 : 0] array name is the only legal way to declare the firstdimension of a two-dimension array.
To decrease the redundancy and increase the readability of the transcripts,modules of reusable functions are created. In addition, a synchronization signal isrequired to match the states of the different modules. Furthermore, a counter isrequired to control the number of a certain module operations.
20 Implementation and Evaluation
In Table 3.1, the data transfer width is set to be 128. As a result, the input messageshould be separated and transmitted piece by piece. Accordingly, two input signals areadded to indicate the completeness of the input message, in valid and is last. Thein valid signal remains active until the end of the input. The is last signal turns toactive only when transferring the last piece of the input.
While reading the input message, the KECCAK - f [1600] computation for newKeyand newString can be processed at the same time. Therefore, the separation of readingdoes not delay the overall computation time. The complete HDL transcripts can befound in Appendix A.1.
3.2.3 C-CRC Implementation in HDL
In practice, the computation of CRC in hardware is realized by a linear-feedback shiftregister (LFSR). There are two types of LFSR, Fibonacci and Galois, shown inFigure 3.5. The reason why Galois, known as internal XORs LFSR, is chosen is that ithas shorter propagation delay and is more efficient to implement than the other one.
Q
QSET
CLR
S
R
1
Q
QSET
CLR
S
R
2
Feedback
Q
QSET
CLR
S
R
3
Q
QSET
CLR
S
R
4
Input Output
(a) Galois type LFSR
Q
QSET
CLR
S
R
1
Q
QSET
CLR
S
R
2
Feedback
Q
QSET
CLR
S
R
3
Q
QSET
CLR
S
R
4
Input
Output
(b) Fibonacci type LFSR
Figure 3.5: Two types of LFSR
Furthermore, the feedback part, which the value is connected to every XOR gate inLFSR, is selected to be XORed value from the input of the entire LFSR and the outputof the last shift register (shown in Figure 3.6). In this way, CRC computation is able tofinish within the shifts of the message bits number. As a result, the latency reduces thenumber of generator polynomial degree clock cycle.
3.2 ASIC Design 21
Q
QSET
CLR
S
R
1
Q
QSET
CLR
S
R
2
Feedback
Q
QSET
CLR
S
R
3
Q
QSET
CLR
S
R
4
Input
Output
Figure 3.6: The optimized Galois LFSR with less latency
In order to achieve the compatibility with any generator polynomial, the LFSRhere should be reconfigurable, or so-called programmable. A multiplexer (MUX) iscontroled by a bit of the generator polynomial to choose the input to the next shiftregister between the output of the previous register or the XORed value of the outputand the feedback (shown in Figure 3.7).
Generator Polynomial[1]
Q
QSET
CLR
S
R
1
S1
S2
D
C ENB
MUX1
Feedback
Figure 3.7: The programmable LFSR with any generator polynomial
Due to the property of LFSR, the input message is transmitted bit by bit.Therefore, a similar signal as in KMAC128, m valid, is required to indicate thevalidity of the input message. Instead of the signal is last, a new signal, pad valid, isrequired to indicate the completeness of the message and the beginning of theencryption string. These two signals are complementary to each other, which isrealized in the test-bench transcript.
When it comes to the end of input, there are two ways to implement the output of theLFSR. The illustration of C-CRC with two designs of the output shown in Figure 3.8.The first one is to read the checksum one bit per clock cycle. In this way, the outputof the last register is designed as the output of the LFSR. The CRC checksum will beread bit by bit as the input. The second implementation is to read all the output at thesame time, as soon as the computation is finished. The first design has the advantagein area, while the second design has the advantage in latency. The simulation results oftwo designs will be shown in Chapter 3.2.4.
22 Implementation and Evaluation
Message
Generator Polynomial[1]
Generator Polynomial[2]
Generator Polynomial[127]
Input Valid
Encryption String
OutputQ
QSET
CLR
S
R
1
S1
S2
D
C ENB
MUX1
Q
QSET
CLR
S
R
2
S1
S2
D
C ENB
MUX2
S1
S2
D
C ENB
MUX127
Q
QSET
CLR
S
R
128
CRC Output
Feedback
(a) LFSR produces output one bit per clock cycle
Message
Generator Polynomial[1]
Generator Polynomial[2]
Generator Polynomial[127]
Input Valid
Encryption String[1]
Output[128]
Q
QSET
CLR
S
R
1
S1
S2
D
C ENB
MUX1
Q
QSET
CLR
S
R
2
S1
S2
D
C ENB
MUX2
S1
S2
D
C ENB
MUX127
Q
QSET
CLR
S
R
128
CRC Output
Feedback
Encryption String[2]
Encryption String[128]
Output[2]
Output[1]
(b) LFSR produces all output in one clock cycle
Figure 3.8: C-CRC with two designs of the output section
3.2.4 Simulation Results and Analysis
After implementing the two algorithms in ModelSim, the same test vector is used as theinput to the MATLAB implementation. The HDL implementation is validated since theoutput of the MATLAB version matches the HDL version. Next, the verified HDL codeis synthesized in Synopsys with UMC 65LL library and the specified working condition(25◦C temperature and 1.2 volt operating voltage). The results are shown in Table 3.3.
3.2 ASIC Design 23
(a) Area And Power under Room Temperature and 1.2 Volt Operating Voltage
PerformanceMetrics
ClockFrequency
(MHz)
MAC Function
KMAC128 C-CRC (a) C-CRC (b)
Total Area(µm2)
1 100725.01 2643.61 3176.6410 102209.99 2643.61 3176.64100 102227.85 2643.61 3176.64
1000 120565.73 2643.61 3176.64
Total Power(µW )
1 43.64 1.51 2.0310 384.17 13.76 18.28100 3762.74 136.28 180.75
1000 38723.51 1361.45 1805.52
(b) Computed Results under 10 MHz Clock Frequency
PerformanceMetrics
MessageLength
(bit)
MAC Function
KMAC128 C-CRC (a) C-CRC (b)
Latency(Clock Cycle)
256 75 382 256512 75 638 512
1024 75 1150 1024Throughput
(Mbit/s) - 1280 10 1280
Throughputper Area
(bit/(s∗µm2))- 12523.24 3782.71 402941.48
Energy per Bit(pJ/bit) - 0.296 1.361 0.014
Table 3.3: Synthesis Results for The Proposed MAC Functions
From the table, we can conclude the following: regarding area, power, throughputper area, and energy per bit: C-CRC with parallel output has better performance thanKMAC128; while in terms of latency, KMAC128 is better. KMAC128 and C-CRCwith parallel output have the equal amount of throughput within the samespecification.
3.2.5 LayoutWhen synthesis finished, Cadence SoC Encounter was used to place and route. TheHDL gate file was generated by Synopsys and imported into Encounter afterwards.After specifying the Library Exchange Files and timing libraries, the following stepswere made to create the final layout, including floorplan, power planning, global netsconnection, standard cells placement, route nets and create report files. The final layoutof the implementations of the two algorithms is presented in Figure 3.9.
24 Implementation and Evaluation
MAC function Total area of core (µm2)KMAC128 97113.204
C-CRC 2982.969
Table 3.4: Comparison of total area of core
(a) KMAC128
(b) C-CRC with parallel output
Figure 3.9: Layout of the implementations of KMAC128 and C-CRC
Chapter 4
Discussion and Conclusion
KMAC and C-CRC were designed and implemented within the same specification inChapter 3. Moreover, latency or power constraints were not taken into consideration.As a result, the comparison analysis was conditional and might vary for differentrequirements. In order to compare the two MAC functions in general, morespecifications and implementation with constraints are discussed in this chapter.
4.1 Implementation of Alternative Specification
In this thesis’s application area, i.e. embedded devices, the length of secret key andencryption string will not affect the performance without considering transmissiontime. As mentioned above, pre-computation and transmission is possible to conduct atthe same time. Therefore, the length of input message is the most important factor inthe specification.
For C-CRC, the HDL code for algorithm will not change since it is independent ofthe input message length. The latency depends on both the output and input messagelength (len(M)+L).
For KMAC, the algorithm was adapted in Figure 3.4 in the specification ofTable 3.1. If the length of input message increased to more than1343− 16− 2− 2 = 1323 bits and other input length stays the same, KECCAK[256]will be called more than 3 times depending on r = 2+(len(M)+ 16+ 2+ 2)/1343.The latency will increase to 25∗ r. On the other hand, the total area will not increase alot due to the unchanged core function. The KMAC128 implementation of more than1323 input message bits is illustrated as the sequence diagram in Figure 4.1.
26 Discussion and Conclusion
KMAC128 KECCAK -f [1600]
newKey
newString
f-acknowledge
f-acknowledge
24 clock cycle
1 clock cycle
24 clock cycle
1 clock cyclenewMessage[1]
f-acknowledge
24 clock cycle
1 clock cycle
newMessage[2]
f-acknowledge
24 clock cycle
1 clock cycle
Figure 4.1: The sequence diagram of KMAC128 with a longer input message
4.2 Implementation of Special Constraints
Essentially, ”latency or power constraints” is another way of performing designtrade-off. For example, when the implementation places a higher priority on latency, itmeans that area and other performance metrics can be sacrificed in order to decreaselatency. In this chapter, only constraints that cannot be satisfied by the implementationof Chapter 3 is discussed.
4.2.1 Latency Constraint
Parallelism is one technique to decrease latency for both of the MAC functions. Loopunrolling and data parallel processing are two common way to achieve parallelism.For KMAC, we can increase the round instance in one clock cycle (illustrated inFigure 4.2). For C-CRC, we can process two message at the same time. With otherspecification unchanged, comparison between the results of parallelism of two andserial are shown in Table 4.1.
4.2 Implementation of Special Constraints 27
EncryptionKey
Message
newKey
newString
newMessage
Customized String
KECCAK-
f[1600]
12
KECCAK-
f[1600]
KECCAK-
f[1600]
12
KECCAK-
f[1600]
KECCAK-
f[1600]
12
KECCAK-
f[1600]
Figure 4.2: The data flow diagram of KMAC with round instance of two
PerformanceMetrics
KMAC128(round
instance:1)
KMAC(round
instance:2)
C-CRC (b)(parallelism:1)
C-CRC (b)(parallelism:2)
Total Area (µm2) 102209.99 129448.28 3176.64 6353.28
Total Power(µW ) 384.17 447.17 18.28 36.55
Latency 75 39 1024 1024Throughput
(Mbit/s) 1280 1280 1280 2560
Throughputper Area
(bit/(s∗µm2))12523.24 9888.12 402941.48 402941.48
Energy per Bit(pJ/bit) 0.296 0.343 0.014 0.014
Table 4.1: Comparison in 1024 bits message length and 10MHz Clock Frequency
28 Discussion and Conclusion
From the results we can see that, KMAC128 is influenced by parallelism less thanC-CRC in area, power and throughput. In latency, throughput per area and energy perbit, C-CRC stays the same. For parallelism of more than two, similar result can beconcluded. For comparison between KMAC128 and C-CRC, similar conclusions canbe drawn as in Chapter 3.2.4.
4.2.2 Power ConstraintPower gating is a common technique to decrease power consumption in ASIC design,for instance in LFSR of CRC computation. The concept is to shut off the ”useless”gate when not needed, and leakage power is reduced consequently. This technique canbe illustrated in Figure 4.3. Unfortunately, in the applied technology library, there is nosingle CMOS gate that can be directly used in HDL code. Therefore, a multiplexer willbe replaced to act as the switch of controlled gates, which consumes more power thansingle CMOS gate. Moreover, the implementations with this design fails the criticalpath timing analysis when the clock frequency is more than 10MHz. For C-CRC, theleakage power can be ignored compared to dynamic power. As a result, power gatingis not suitable for both KMAC and C-CRC.
Generator Polynomial[1]
Input Valid
Q
QSET
CLR
S
R
1
S1
S2
D
C ENB
MUX1
Feedback
Figure 4.3: Power gating in LFSR of C-CRC
4.3 ConclusionFor KMAC128, latency is the outstanding merit as well as its security strengths. In theaspect of embedded devices, security strength is not the primary goal. AlthoughKMAC128’s latency increases when input message length increases, it still has lowergrowth rate compared to C-CRC. KMAC128’s latency performance can be improvedby increasing the number of round instances computed in one clock cycle. However,this improvement has a limit where no more permutation can be performed in oneclock cycle.
For C-CRC, it achieves excellent performance in terms of area and power. Withinthe domain of C-CRC, the implementation with parallel output offers much higherthroughput at the cost of a trivial increase in area and power consumption. Thus, it hasthe similar result in throughput per area and energy per bit. Latency is the only bottle
4.4 Future Work 29
neck for C-CRC due to the feature of LFSR. Little improvement in this thesis hasbeen made to decrease latency, including parallel output.
4.4 Future WorkFor both of the implementations, there is still work left to do after floor planning androute. Time and power analysis should be performed to verify the circuit.Additionally, functional verification should be executed on the gate level net-list.When finishing physical synthesis, more accurate power consumption result should bepresented. And the throughput and latency might be affected by the propagation delay.
Since the latency is the only bottleneck for C-CRC, more effort could be put intooptimizing the algorithm or the CRC computation. Also, how to extract entropy froma longer length output would be an interesting topic to investigate.
Bibliography
[1] National Institute of Standards and Technology, FIPS PUB 202: SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions.Gaithersburg, MD 20899-8900: National Institute of Standards and Technology,Aug. 2015. [Online]. Available: http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf
[2] Alfred J. Menezes, Handbook of applied cryptography, ser. The CRC Press serieson discrete mathematics and its applications. Boca Raton: CRC, 1997.
[3] M. Bellare, R. Canetti, and H. Krawczyk, “Keying hash functions for messageauthentication.” Springer-Verlag, 1996, pp. 1–15.
[4] National Institute of Standards and Technology, FIPS PUB 198-1: The Keyed-Hash Message Authentication Code (HMAC). Gaithersburg, MD 20899-8900:National Institute of Standards and Technology, Jul. 2008. [Online]. Available:http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.198-1.pdf
[5] M. Stevens, “Fast collision attack on md5.” IACR Cryptology ePrint Archive,vol. 2006, p. 104, 2006. [Online]. Available: http://dblp.uni-trier.de/db/journals/iacr/iacr2006.html#Stevens06
[6] M. Stevens, E. Bursztein, P. Karpman, A. Albertini, and Y. Markov, “The firstcollision for full sha-1,” Cryptology ePrint Archive, Report 2017/190, 2017, http://eprint.iacr.org/2017/190.
[7] G. Bertoni, J. Daemen, M. Peeters, and G. Van Assche, “The Making ofKECCAK,” Cryptologia, vol. 38, no. 1, p. 26, 2014.
[8] John Kelsey, Shu-jen Chang, and Ray Perlner, NIST Special Publication800-185: SHA-3 Derived Functions: cSHAKE, KMAC, TupleHash andParallelHash. Gaithersburg, MD 20899-8930: National Institute of Standardsand Technology, Dec. 2016. [Online]. Available: http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-185.pdf
[9] W. Peterson and D. Brown, “Cyclic codes for error detection,” Proceedings of theIRE, vol. 49, no. 1, pp. 228–235, January 1961.
[10] E. Dubrova, M. Naslund, G. Selander, and F. Lindqvist, “Message authenticationbased on cryptographically secure crc without polynomial irreducibility test,”Cryptography and Communications, pp. 1–17, 2017. [Online]. Available:http://dx.doi.org/10.1007/s12095-017-0227-8
BIBLIOGRAPHY 31
[11] UMC, “UMK65lscllmvbbr b UMC 65nm Low-K Multi-Voltage Low LeakageRVT Tapless Standard Cell Library Databook,” Nov. 2011.
[12] “KeccakCodePackage.” [Online]. Available: https://github.com/gvanas/KeccakCodePackage
Appendix A
Verilog HDL Transcript
A.1 KMACFile name: kmac128.v
module kmac128 #
(
parameter Key_Length = 128,
parameter Output_Length = Key_Length
)
(
input wire clk,
input wire reset,
input wire [0:127] in_X,
input wire in_ready,
input wire is_last,
input wire [0:Key_Length-1] in_K,
input wire [0:127] in_S,
output wire [0:Output_Length-1] out,
output wire out_ready
);
//bytepad Key = left_encode(168) || left_encode(128) || K || 0
wire [0:1343] new_K;
assign new_K = {16’b1000000000010101,
16’b1000000000000001, in_K, 1184’b0};
genvar gv_B, gv_b;
//generate in_S in big endian order
wire [0:127] S_bigen;
generate
for(gv_B=0; gv_B<128/8; gv_B=gv_B+1)
begin : S_1
for(gv_b=0; gv_b<8; gv_b=gv_b+1)
begin : S_2
assign S_bigen[gv_B*8+gv_b] = in_S[gv_B*8+(7-gv_b)];
A.1 KMAC 33
end
end
endgenerate
//bytepad customized String and "KMAC" = left_encode(168)
//|| left_encode(128) || S || left_encode(32) || "KMAC" || 0
wire [0:1343] new_S;
assign new_S = {16’b1000000000010101, 16’b1000000000000100,
32’b11010010101100101000001011000010, 16’b1000000000000001,
S_bigen, 1136’b0};
//read message and pad after finish reading
reg [0:7] read_count;
reg [0:1343] X_array;
always @(posedge clk)
if (reset)
begin
X_array = 0;
read_count = 0;
end
else
begin
if (in_ready)
begin
X_array = {X_array[128:1343], in_X};
read_count = {read_count[1:7], 1’b1};
end
if (is_last)
begin
case (read_count)
//new_X = X || right_encode(Output_Length) || 00
8’b00000011: X_array = {X_array[1088:1343],
16’b0000000110000000, 2’b0, 1’b1,1068’b0, 1’b1};
8’b00001111: X_array = {X_array[832:1343],
16’b0000000110000000, 2’b0, 1’b1,812’b0, 1’b1};
8’b11111111: X_array = {X_array[320:1343],
16’b0000000110000000, 2’b0, 1’b1,300’b0, 1’b1};
endcase
end
end
wire f_ack;
wire f_out_ready;
//only the first part of message or the previous output
//of permutation ready, f_in_ready = 1
reg f_in_ready;
always @(posedge clk)
if (reset)
f_in_ready = 0;
34 Verilog HDL Transcript
else
begin
if ((in_ready && !read_count[6]) || f_out_ready)
f_in_ready = 1;
if (f_in_ready)
f_in_ready = !f_ack;
end
reg [0:2] f_count;
always @(posedge clk)
if (reset)
f_count = 0;
else
if (f_out_ready && f_in_ready)
f_count = {1’b1, f_count[0:1]};
wire [0:1343] f_in;
wire [0:1599] f_out;
assign f_in = (!f_count[0]) ? new_S : (!f_count[1]) ?
new_K : (!f_count[2]) ? X_array : 0;
f_permutation f_permutation_ (clk, reset, f_in, f_in_ready,
f_ack, f_out, f_out_ready);
assign out_ready = f_count[2];
assign out = out_ready ? f_out[0:Output_Length-1] : 0;
endmodule
A.1 KMAC 35
File name: f permutation.v
module f_permutation
(
input clk, reset,
input [0:1343] in,
input in_ready,
output ack,
output reg [0:1599] out,
output reg out_ready
);
reg [0:22] i; /* select round constant */
wire [0:1599] round_in, round_out;
wire [0:63] rc;
wire update;
wire accept;
reg calc; /* == 1: calculating rounds */
assign accept = in_ready && !calc; // in_ready & (i == 0)
always @ (posedge clk)
if (reset) i <= 0;
else i <= {accept, i[0:21]};
always @ (posedge clk)
if (reset) calc <= 0;
else calc <= (calc & (~ i[22])) | accept;
assign update = calc | accept;
assign ack = accept;
always @ (posedge clk)
if (reset)
out_ready <= 0;
else if (accept)
out_ready <= 0;
else if (i[22]) // only change at the last round
out_ready <= 1;
assign round_in = accept ? {in ^ out[0:1343],
out[1344:1599]} : out;
rconst rconst_ ({accept, i}, rc);
round round_ (round_in, rc, round_out);
always @ (posedge clk)
36 Verilog HDL Transcript
if (reset)
out <= 0;
else if (update)
out <= round_out;
endmodule
File name: rconst.v
module rconst(i, rc);
input wire [0:23] i;
output reg [0:63] rc;
always @ (i)
begin
rc = 0;
rc[0] = i[0] | i[4] | i[5] | i[6] | i[7] |
i[10] | i[12] | i[13] | i[14] | i[15] | i[20] | i[22];
rc[1] = i[1] | i[2] | i[4] | i[8] | i[11] |
i[12] | i[13] | i[15] | i[16] | i[18] | i[19];
rc[3] = i[2] | i[4] | i[7] | i[8] | i[9] | i[10] |
i[11] | i[12] | i[13] | i[14] | i[18] | i[19] | i[23];
rc[7] = i[1] | i[2] | i[4] | i[6] | i[8] | i[9] |
i[12] | i[13] | i[14] | i[17] | i[20] | i[21];
rc[15] = i[1] | i[2] | i[3] | i[4] | i[6] | i[7] | i[10] |
i[12] | i[14] | i[15] | i[16] | i[18] | i[20] | i[21] | i[23];
rc[31] = i[3] | i[5] | i[6] | i[10] | i[11] |
i[12] | i[19] | i[20] | i[22] | i[23];
rc[63] = i[2] | i[3] | i[6] | i[7] | i[13] |
i[14] | i[15] | i[16] | i[17] | i[19] | i[20] | i[21] | i[23];
end
endmodule
A.1 KMAC 37
File name: round.v
‘define low_pos(x,y) 64*(5*y+x)
‘define high_pos(x,y) ‘low_pos(x,y) + 63
‘define add_1(x) (x == 4 ? 0 : x + 1)
‘define add_2(x) (x == 3 ? 0 : x == 4 ? 1 : x + 2)
‘define sub_1(x) (x == 0 ? 4 : x - 1)
‘define rot_up(in, n) {in[63-n:0], in[63:63-n+1]}
‘define rot_up_1(in) {in[62:0], in[63]}
module round(in, round_const, out);
input [0:1599] in;
input [0:63] round_const;
output [0:1599] out;
/* "a ~ g" for round 1 */
wire [63:0] a[4:0][4:0];
wire [63:0] b[4:0];
wire [63:0] c[4:0][4:0], d[4:0][4:0],
e[4:0][4:0], f[4:0][4:0], g[4:0][4:0];
genvar x, y, z;
/* assign "a[x][y][z] == in[w(5y+x)+z]" */
generate
for(y=0; y<5; y=y+1)
begin : pre0
for(x=0; x<5; x=x+1)
begin : pre1
for(z=0; z<64; z=z+1)
begin : pre2
assign a[x][y][z] = in[‘low_pos(x,y)+z];
end
end
end
endgenerate
/* calc "b[x] == a[x][0] ^ a[x][1] ^ ... ^ a[x][4]" */
generate
for(x=0; x<5; x=x+1)
begin : theta0
assign b[x] = a[x][0] ^ a[x][1] ^ a[x][2] ^ a[x][3] ^ a[x][4];
end
endgenerate
/* calc "c == theta(a)" */
generate
for(y=0; y<5; y=y+1)
begin : theta1
38 Verilog HDL Transcript
for(x=0; x<5; x=x+1)
begin : theta2
assign c[x][y] = a[x][y] ^ (b[‘sub_1(x)] ^ ‘rot_up_1(b[‘add_1(x)]));
end
end
endgenerate
/* calc "d == rho(c)" */
assign d[0][0] = c[0][0];
assign d[1][0] = ‘rot_up_1(c[1][0]);
assign d[2][0] = ‘rot_up(c[2][0], 62);
assign d[3][0] = ‘rot_up(c[3][0], 28);
assign d[4][0] = ‘rot_up(c[4][0], 27);
assign d[0][1] = ‘rot_up(c[0][1], 36);
assign d[1][1] = ‘rot_up(c[1][1], 44);
assign d[2][1] = ‘rot_up(c[2][1], 6);
assign d[3][1] = ‘rot_up(c[3][1], 55);
assign d[4][1] = ‘rot_up(c[4][1], 20);
assign d[0][2] = ‘rot_up(c[0][2], 3);
assign d[1][2] = ‘rot_up(c[1][2], 10);
assign d[2][2] = ‘rot_up(c[2][2], 43);
assign d[3][2] = ‘rot_up(c[3][2], 25);
assign d[4][2] = ‘rot_up(c[4][2], 39);
assign d[0][3] = ‘rot_up(c[0][3], 41);
assign d[1][3] = ‘rot_up(c[1][3], 45);
assign d[2][3] = ‘rot_up(c[2][3], 15);
assign d[3][3] = ‘rot_up(c[3][3], 21);
assign d[4][3] = ‘rot_up(c[4][3], 8);
assign d[0][4] = ‘rot_up(c[0][4], 18);
assign d[1][4] = ‘rot_up(c[1][4], 2);
assign d[2][4] = ‘rot_up(c[2][4], 61);
assign d[3][4] = ‘rot_up(c[3][4], 56);
assign d[4][4] = ‘rot_up(c[4][4], 14);
/* calc "e == pi(d)" */
assign e[0][0] = d[0][0];
assign e[0][2] = d[1][0];
assign e[0][4] = d[2][0];
assign e[0][1] = d[3][0];
assign e[0][3] = d[4][0];
assign e[1][3] = d[0][1];
assign e[1][0] = d[1][1];
assign e[1][2] = d[2][1];
assign e[1][4] = d[3][1];
assign e[1][1] = d[4][1];
assign e[2][1] = d[0][2];
assign e[2][3] = d[1][2];
assign e[2][0] = d[2][2];
assign e[2][2] = d[3][2];
assign e[2][4] = d[4][2];
A.1 KMAC 39
assign e[3][4] = d[0][3];
assign e[3][1] = d[1][3];
assign e[3][3] = d[2][3];
assign e[3][0] = d[3][3];
assign e[3][2] = d[4][3];
assign e[4][2] = d[0][4];
assign e[4][4] = d[1][4];
assign e[4][1] = d[2][4];
assign e[4][3] = d[3][4];
assign e[4][0] = d[4][4];
/* calc "f = chi(e)" */
generate
for(y=0; y<5; y=y+1)
begin : chi0
for(x=0; x<5; x=x+1)
begin : chi1
assign f[x][y] = e[x][y] ^ ((~ e[‘add_1(x)][y]) & e[‘add_2(x)][y]);
end
end
endgenerate
/* calc "g = iota(f)" */
generate
for(z=0; z<64; z=z+1)
begin : iota0
if(z==0 || z==1 || z==3 || z==7 || z==15 || z==31 || z==63)
assign g[0][0][z] = f[0][0][z] ^ round_const[z];
else
assign g[0][0][z] = f[0][0][z];
end
endgenerate
generate
for(y=0; y<5; y=y+1)
begin : iota1
for(x=0; x<5; x=x+1)
begin : iota2
if(x!=0 || y!=0)
assign g[x][y] = f[x][y];
end
end
endgenerate
/* assign "out[w(5y+x)+z] == out_var[x][y][z]" */
generate
for(y=0; y<5; y=y+1)
begin : post0
for(x=0; x<5; x=x+1)
begin : post1
40 Verilog HDL Transcript
for(z=0; z<64; z=z+1)
begin : post2
assign out[‘low_pos(x,y)+z] = g[x][y][z];
end
end
end
endgenerate
endmodule
‘undef low_pos
‘undef high_pos
‘undef add_1
‘undef add_2
‘undef sub_1
‘undef rot_up
‘undef rot_up_1
A.1 KMAC 41
File name: testbench kmac128.v
‘timescale 1ns / 1ps
‘define P 20
module test_kmac128;
// Inputs
reg clk;
reg reset;
reg [0:127] in_X;
reg in_ready;
reg is_last;
reg [0:127] in_K;
reg [0:127] in_S;
// Outputs
wire [0:127] out;
wire out_ready;
// Var
integer i;
// Instantiate the Unit Under Test (UUT)
kmac128 uut (
.clk(clk),
.reset(reset),
.in_X(in_X),
.in_ready(in_ready),
.is_last(is_last),
.in_K(in_K),
.in_S(in_S),
.out(out),
.out_ready(out_ready)
);
initial begin
// Initialize Inputs
clk = 0;
reset = 0;
in_X = 0;
in_ready = 0;
is_last = 0;
in_K = 0;
in_S = 0;
// Wait 100 ns for global reset to finish
#50;
in_K = 128’h52A608AB21CCDD8A4457A57EDE782176;
42 Verilog HDL Transcript
in_S = "The test message";
#50;
// Add stimulus here
@ (negedge clk);
reset = 1; #(‘P*4); reset = 0;
in_ready = 1;
in_X = 128’h9F2FCC7C90DE090D6B87CD7E9718C1EA; #(‘P);
in_X = 128’h6CB21118FC2D5DE9F97E5DB6AC1E9C10;
is_last = 1; #(‘P);
in_ready = 0; is_last = 0;
while (out_ready !== 1)
#(‘P);
check(128’h0661EBA1FE4D68A099EA222AB2854C20);
reset = 1; #(‘P*4); reset = 0;
in_ready = 1; is_last = 0;
in_X = 128’hE926AE8B0AF6E53176DBFFCC2A6B88C6; #(‘P);
in_X = 128’hBD765F939D3D178A9BDE9EF3AA131C61; #(‘P);
in_X = 128’hE31C1E42CDFAF4B4DCDE579A37E150EF; #(‘P);
in_X = 128’hBEF5555B4C1CB40439D835A724E2FAE7;
is_last = 1; #(‘P);
in_ready = 0; is_last = 0;
while (out_ready !== 1)
#(‘P);
check(128’h00459CE25CA61F87020D1AF65CEBBB8E);
reset = 1; #(‘P*4); reset = 0;
in_ready = 1; is_last = 0;
in_X = 128’h2B6DB7CED8665EBE9DEB080295218426; #(‘P);
in_X = 128’hBDAA7C6DA9ADD2088932CDFFBAA1C141; #(‘P);
in_X = 128’h29BCCDD70F369EFB149285858D2B1D15; #(‘P);
in_X = 128’h5D14DE2FDB680A8B027284055182A0CA; #(‘P);
in_X = 128’hE275234CC9C92863C1B4AB66F304CF06; #(‘P);
in_X = 128’h21CD54565F5BFF461D3B461BD40DF281; #(‘P);
in_X = 128’h98E3732501B4860EADD503D26D6E6933; #(‘P);
in_X = 128’h8F4E0456E9E9BAF3D827AE685FB1D817;
is_last = 1; #(‘P);
in_ready = 0; is_last = 0;
while (out_ready !== 1)
#(‘P);
check(128’h5E9C40703AEF6D7019A947CF92DF320E);
$display("Good!");
$finish;
end
always #(‘P/2) clk = ~ clk;
A.1 KMAC 43
task error;
begin
$display("E");
$finish;
end
endtask
task check;
input [0:127] wish;
begin
if (out !== wish)
begin
$display("%h %h", out, wish); error;
end
end
endtask
endmodule
‘undef P
44 Verilog HDL Transcript
A.2 C-CRC with parallel output
File name: s crc.v
module s_crc #
(
// degree of generator polynomial
parameter Gen_Degree = 128,
// length of CRC/LFSR output
parameter Output_Length = Gen_Degree
)
(
input wire clk,
//clock signal
input wire in_valid,
//input valid signal for valid input
input wire m_in,
//plain text input message bit by bit in big endian order
input wire [Gen_Degree-1:1] gen_polynomial,
//generator polynomial without LS and MS bits
input wire out_valid,
//output valid sinal for valid output
input wire [Gen_Degree-1:0] pad,
//pad s produced by stream cipher bit by bit
output wire [Gen_Degree-1:0] e_out
//encrypted output
);
wire [Gen_Degree-1:0] ff_state;
//current state(output) of each flipflop
assign e_out = ff_state ^ pad;
wire rst;
//reset signal to each flipflop
//initialize flipflop to all 0’s when no input or no output
assign rst = in_valid || out_valid;
wire feedback;
//feedback wire
assign feedback = m_in ^ ff_state[Output_Length-1];
wire [Gen_Degree-1:1] ori_input;
assign ori_input = {ff_state[Gen_Degree-2:0]};
//original input of each multiplexer is the output of the front flipflop
wire [Gen_Degree-1:1] xor_input;
assign xor_input = ori_input ^ {Gen_Degree-1{feedback}};
wire [Gen_Degree-1:1] xor_enable;
//multiplexer select signal
A.2 C-CRC with parallel output 45
assign xor_enable = gen_polynomial[Gen_Degree-1:1]
& {Gen_Degree-1{in_valid}};
wire [Gen_Degree-1:1] mux_out;
mux M1[Gen_Degree-1:1] (mux_out, xor_enable, xor_input, ori_input);
/*
,----------+------------+-------------------------+------------.
| | | | |
| .----. V .----. V .----. .----. V .----. |
‘->| 0 |-(+)->| 1 |->(+)->| 2 |->...->|n-2 |->(+)->| n-1|->(+)<-DIN (MSB first)
’----’ ’----’ ’----’ ’----’ ’----’
*/
flipflop F1[Gen_Degree-1:0] (ff_state, clk, rst,
{mux_out[Gen_Degree-1:1], feedback});
endmodule
File name: mux.v
/*
* multiplexer to enable xor between each FF
*/
module mux(mux_out, mux_select, a_input, b_input);
input wire mux_select, a_input, b_input;
output wire mux_out;
assign mux_out = (mux_select) ? a_input : b_input;
endmodule
File name: flipflop.v
module flipflop(q, clk, rst, d);
input clk, rst, d;
output reg q;
always @(posedge clk)
begin
if (!rst)
q <= 1’b0;
else
q <= d;
end
endmodule
46 Verilog HDL Transcript
File name: testbench scrc.v
‘timescale 1ns / 1ps
‘define P 20
module s_crc_tb();
// Inputs
reg clk;
reg m_in;
reg [127:1] gen_polynomial;
reg in_valid;
reg out_valid;
reg [0:255] message_1;
reg [0:511] message_2;
reg [0:1023] message_3;
reg [127:0] pad;
// Outputs
wire [127:0] e_out;
// Var
integer i;
// Instantiate the Unit Under Test (UUT)
s_crc utt (
.clk(clk),
.in_valid(in_valid),
.m_in(m_in),
.gen_polynomial(gen_polynomial),
.out_valid(out_valid),
.pad(pad),
.e_out(e_out)
);
initial begin
// Initialize Inputs
clk = 0;
in_valid = 0;
out_valid = 0;
gen_polynomial = 0;
m_in = 0;
pad = 0;
// Wait 100 ns for global reset to finish
#50;
gen_polynomial = 127’h9F2FCC7C90DE090D6B87CD7E9718C1E;
message_1 = 256’h9F2FCC7C90DE090D6B87CD7E9718C1EA6CB2
1118FC2D5DE9F97E5DB6AC1E9C10;
A.2 C-CRC with parallel output 47
message_2 = 512’hE926AE8B0AF6E53176DBFFCC2A6B88C6BD76
5F939D3D178A9BDE9EF3AA131C61E31C1E42CDFA
F4B4DCDE579A37E150EFBEF5555B4C1CB40439D835A724E2FAE7;
message_3 = 1024’h2B6DB7CED8665EBE9DEB080295218426BDA
A7C6DA9ADD2088932CDFFBAA1C14129BCCDD70F3
69EFB149285858D2B1D155D14DE2FDB680A8B027284055182A0C
AE275234CC9C92863C1B4AB66F304CF0621CD54565F5BFF461D3B461
BD40DF28198E3732501B4860EADD503D26D6E69338F4E0456E9E9BAF3D827AE685FB1D817;
pad = 128’h9F2FCC7C90DE090D6B87CD7E9718C1EA;
#50;
// Add stimulus here
@ (negedge clk);
in_valid = 1;
for (i = 0; i < 255; i = i + 1)
begin
m_in = message_1[i]; #(‘P);
end
m_in = message_1[255]; #(‘P);
in_valid = 0;
out_valid = 1; #(‘P);
out_valid = 0; #(‘P);
//check
@ (negedge clk);
in_valid = 1;
for (i = 0; i < 511; i = i + 1)
begin
m_in = message_2[i]; #(‘P);
end
m_in = message_2[511]; #(‘P);
in_valid = 0;
out_valid = 1; #(‘P);
out_valid = 0; #(‘P);
//check
@ (negedge clk);
in_valid = 1;
for (i = 0; i < 1023; i = i + 1)
begin
m_in = message_3[i]; #(‘P);
end
m_in = message_3[1023]; #(‘P);
in_valid = 0;
48 Verilog HDL Transcript
out_valid = 1; #(‘P);
out_valid = 0; #(‘P);
//check
$display("Good!");
$finish;
end
always #(‘P/2) clk = ~ clk;
task error;
begin
$display("E");
$finish;
end
endtask
task check;
input [127:0] wish;
begin
if (e_out !== wish)
begin
$display("%h %h", e_out, wish); error;
end
end
endtask
endmodule
‘undef P