[IEEE 2006 13th IEEE International Conference on Electronics, Circuits and Systems - Nice, France (2006.12.10-2006.12.13)] 2006 13th IEEE International Conference on Electronics, Circuits

Applying Low Power Techniques in AES MixColumn/InvMixColumn Transformations

George N. Selimis, Apostolos P. Fournaris, and Odysseas Koufopavlou

VLSI Lab, Department of Electrical & Computers Engineering, University of Patras, Patras, Greece

{gselimis, apofour, odysseas}@ee.upatras.gr

Abstract. In low-power, resource-constrained environments with increased security needs, such as smart cards or RFID tags, power consumption plays a crucial role in system efficiency. Since the AES algorithm is widely used in these applications, a power-efficient design of this algorithm is essential. However, few researchers have studied this issue extensively; most focus instead on high-throughput designs. In this paper, the low power techniques of Resource Sharing and Power Management are applied in a 32-bit architecture for the MixColumn/InvMixColumn transformation of the Advanced Encryption Standard. The proposed architecture performs multiplication in the $GF(2^8)$ field of a byte $S_{i,j}$ with specific constants, using a common data path. Low power consumption is also achieved by deactivating the unused parts of the data path when the MixColumn transformation is performed. The proposed architecture achieves low power consumption and low area resources compared to other designs.

I. INTRODUCTION

Cryptography plays an important role in the security of data transmission. NIST selected Rijndael as the AES algorithm [1] in October 2000. The AES algorithm has broad applications, including smart cards, cellular phones, WWW servers, and automated teller machines (ATMs). Compared to software implementations, hardware implementations of the AES algorithm provide more physical security as well as higher speed [6].

The existing implementations of AES do not focus on the problem of power consumption but instead present high-throughput architectures [6]. In addition, most of the existing implementations of AES treat MixColumn and InvMixColumn independently. However, in low-resource environments, power and area are the main efficiency factors. In this paper, we analyze the basic operations used in the MixColumn and InvMixColumn transformations of AES and, by applying power management and resource sharing techniques, propose a low power architecture for these transformations. The proposed architecture compares favorably with other known designs in terms of power consumption and area resources.

In this paper, the mathematical background of Galois Fields is presented in Section II. The basic structure of the standard AES is given in Section III. In Section IV, we analyze the MixColumn/InvMixColumn transformations. In Section V, the proposed system is presented in detail. Comparisons with other works are given in Section VI and the paper is concluded in Section VII.

II. MATHEMATICAL BACKGROUND

This article uses the same notations and conventions as in the AES specifications [1].

Bytes. The basic data unit of AES is the byte: $a = \{a_7, a_6, a_5, a_4, a_3, a_2, a_1, a_0\}$. A byte can represent an element of the Galois Field $GF(2^8)$ in polynomial representation:

$$a(x) = \sum_{i=0}^{7} a_i x^i = a_7 x^7 + a_6 x^6 + a_5 x^5 + \dots + a_1 x + a_0$$

defined over the irreducible polynomial $p(x) = x^8 + x^4 + x^3 + x + 1$. For example, the binary value {01100011} is {63} in hexadecimal notation and represents the polynomial $x^6 + x^5 + x + 1$.

Addition: The addition of two bytes representing polynomials $a(x), b(x) \in GF(2^8)$ is achieved by adding their corresponding coefficients modulo 2, which is an XOR operation usually denoted by $\oplus$:

$$a(x) \oplus b(x) = \sum_{i=0}^{7} (a_i \oplus b_i) x^i.$$

The additive inverse of a byte is the byte itself: $-b(x) = b(x)$. Due to this, subtraction is identical to addition:

$$a(x) - b(x) = a(x) + b(x)$$
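As a small worked instance of the addition rule, the bytes {57} and {83} (values chosen here purely for illustration, following the example used in the AES specification [1]) can be added coefficient by coefficient:

```latex
\begin{aligned}
\{57\} &= 01010111 \;\Rightarrow\; x^6 + x^4 + x^2 + x + 1 \\
\{83\} &= 10000011 \;\Rightarrow\; x^7 + x + 1 \\
\{57\} \oplus \{83\} &= x^7 + x^6 + x^4 + x^2 \;\Rightarrow\; 11010100 = \{d4\}
\end{aligned}
```

The terms $x$ and $1$ appear in both operands and therefore cancel modulo 2.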

Multiplication: The multiplication of $a(x), b(x) \in GF(2^8)$, denoted as $a(x) \bullet b(x)$, uses the irreducible polynomial p(x) of degree 8 defining the Galois Field. The multiplication $c(x) = a(x) \bullet b(x)$ in $GF(2^8)$ is done by multiplying the polynomials $a(x), b(x)$, which yields a polynomial $t(x)$ with degree less than 15. This step is followed by a modular reduction step $c(x) = t(x) \bmod p(x)$ to ensure that the result is an element of $GF(2^8)$.
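The two steps described above (polynomial product, then reduction modulo p(x)) can be sketched in software as follows. This is only an illustrative model, not part of the proposed hardware; the function names `poly_mul`, `poly_mod`, and `gf_mul` are hypothetical, and bytes are represented as Python integers whose bit k holds the coefficient of $x^k$.

```python
AES_POLY = 0x11B  # p(x) = x^8 + x^4 + x^3 + x + 1


def poly_mul(a: int, b: int) -> int:
    """Carry-less product t(x) = a(x) * b(x) with coefficients in GF(2)."""
    t = 0
    while b:
        if b & 1:      # current coefficient of b is 1: add shifted copy of a
            t ^= a
        a <<= 1        # multiply a(x) by x
        b >>= 1
    return t


def poly_mod(t: int, p: int = AES_POLY) -> int:
    """Remainder of t(x) divided by p(x) with coefficients in GF(2)."""
    deg_p = p.bit_length() - 1
    while t.bit_length() - 1 >= deg_p:
        # cancel the leading term of t(x) with a shifted copy of p(x)
        t ^= p << (t.bit_length() - 1 - deg_p)
    return t


def gf_mul(a: int, b: int) -> int:
    """c(x) = a(x) * b(x) mod p(x) in GF(2^8)."""
    return poly_mod(poly_mul(a, b))
```

For two byte operands, `poly_mul` returns a value of at most 15 bits (degree at most 14), which matches the degree bound stated above.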

III. ADVANCED ENCRYPTION STANDARD

The AES algorithm is a symmetric-key cipher, in which the sender and the receiver use the same key for encryption and decryption. The data block length is fixed at 128 bits, while the key length can be 128, 192, or 256 bits. In addition, the AES algorithm is an iterative algorithm. Each iteration is called a round, and the total number of rounds, $N_r$, is 10, 12, or 14 when the key length is 128, 192, or 256 bits, respectively. The 128-bit data block is divided into 16 bytes. These bytes are mapped to a $4 \times 4$ array called the State, and all the internal operations of the AES algorithm are performed on the State. Each byte in the State is denoted by $S_{i,j}$ $(0 \le i, j < 4)$ and is considered as an element of $GF(2^8)$. The irreducible polynomial used in the AES algorithm to construct the $GF(2^8)$ field is $p(x) = x^8 + x^4 + x^3 + x + 1$.

In the encryption of the AES algorithm, each round except the final round consists of four transformations: SubBytes, ShiftRows, MixColumns, and AddRoundKey. The final round does not have the MixColumns transformation. These Cipher transformations can be inverted and then implemented in reverse order to produce a straightforward Inverse Cipher for the AES algorithm. The individual transformations used in the Inverse Cipher are InvShiftRows, InvSubBytes, InvMixColumns, and AddRoundKey.

IV. MIXCOLUMN/INVMIXCOLUMN TRANSFORMATIONS

The MixColumn transformation operates on the State column-by-column, treating each column as a four-term polynomial (Fig. 1).

Fig. 1. The MixColumn Transformation.

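The columns are considered as polynomials over $GF(2^8)$ and multiplied modulo $(x^4 + 1)$ with a fixed polynomial

$$a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}.$$

Suppose $s'(x) = a(x) \otimes s(x)$. As a result of this multiplication, the four bytes in a column are replaced by the following:

$$S'_{0,c} = (\{02\} \bullet S_{0,c}) \oplus (\{03\} \bullet S_{1,c}) \oplus S_{2,c} \oplus S_{3,c}$$

$$S'_{1,c} = (\{02\} \bullet S_{1,c}) \oplus (\{03\} \bullet S_{2,c}) \oplus S_{0,c} \oplus S_{3,c}$$

$$S'_{2,c} = (\{02\} \bullet S_{2,c}) \oplus (\{03\} \bullet S_{3,c}) \oplus S_{0,c} \oplus S_{1,c}$$

$$S'_{3,c} = (\{02\} \bullet S_{3,c}) \oplus (\{03\} \bullet S_{0,c}) \oplus S_{1,c} \oplus S_{2,c}$$

In the InvMixColumn transformation, the columns are considered as polynomials over $GF(2^8)$ and multiplied modulo $(x^4 + 1)$ with a fixed polynomial

$$a^{-1}(x) = \{0b\}x^3 + \{0d\}x^2 + \{09\}x + \{0e\}.$$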
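The column equations above can be checked with a short software model. The sketch below is illustrative only and is not the proposed hardware: it assumes that multiplication modulo $(x^4 + 1)$ is written as a circulant matrix whose first row is ({02}, {03}, {01}, {01}) for MixColumn and ({0e}, {0b}, {0d}, {09}) for InvMixColumn, as in the AES specification [1]. The names `xtime`, `gf_mul`, and `mix_column` are hypothetical.

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8), reducing by x^8+x^4+x^3+x+1."""
    b <<= 1
    return (b ^ 0x11B) if b & 0x100 else b


def gf_mul(a: int, b: int) -> int:
    """Shift-and-add multiplication of two bytes in GF(2^8)."""
    c = 0
    while b:
        if b & 1:
            c ^= a
        a = xtime(a)
        b >>= 1
    return c


MIX_ROW0 = (0x02, 0x03, 0x01, 0x01)      # first matrix row for MixColumn
INV_MIX_ROW0 = (0x0E, 0x0B, 0x0D, 0x09)  # first matrix row for InvMixColumn


def mix_column(col, row0):
    """Multiply one 4-byte State column by the circulant matrix built from row0."""
    out = []
    for r in range(4):
        acc = 0
        for k in range(4):
            # entry (r, k) of a circulant matrix is row0[(k - r) mod 4]
            acc ^= gf_mul(col[k], row0[(k - r) % 4])
        out.append(acc)
    return out
```

Applying `mix_column(col, MIX_ROW0)` followed by `mix_column(..., INV_MIX_ROW0)` returns the original column, since the two polynomials are inverses modulo $(x^4 + 1)$.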

V. PROPOSED ARCHITECTURE

A. Top Level Architecture

The proposed system has a 32-bit input. Each input stream is a column of the AES State. The 1-bit signal en/dec determines the encryption/decryption mode. The system includes four Multiplier blocks. The Multiplier block multiplies, in the $GF(2^8)$ field, the 32-bit stream by the constants {01}, {01}, {02}, and {03} in encryption mode and by the constants {09}, {0B}, {0D}, and {0E} in decryption mode. Therefore, the output of the Multiplier is four 8-bit streams (a 32-bit value). With the wiring block, the 8-bit products follow the appropriate XOR tree determined by the matrix multiplication for encryption and decryption of Section IV. At the end of the process, the 32-bit output stream is the i-th column of the State after the MixColumn/InvMixColumn operation, where $0 \le i < 4$. Figure 2 presents the top level architecture of the proposed system.

Fig. 2. The top level architecture of the proposed system

B. $GF(2^m)$ Multiplication Preliminaries

In this subsection, the design of a hardware circuit for multiplication in a $GF(2^m)$ field is discussed. Algorithm 1 presents the multiplication of $a, b \in GF(2^m)$. In this algorithm the bits of b are processed from left (most significant) to right (least significant). The resulting multiplier is called a most significant bit first (MSB) multiplier. The MixColumn transformation is a multiplication of $GF(2^8)$ field elements defined over the irreducible polynomial $p(x) = x^8 + x^4 + x^3 + x + 1$. Taking into consideration the MixColumn/InvMixColumn transformation specification, Algorithm 1 is reformulated as Algorithm 2. The input is an 8-bit signal S that is multiplied by one of several constant values (signal Con). The multiplication process is concluded after 8 rounds.


Algorithm 1. Most significant bit first (MSB) multiplier for $GF(2^m)$

INPUT: $a = (a_{m-1}, \dots, a_1, a_0) \in GF(2^m)$, $b = (b_{m-1}, \dots, b_1, b_0) \in GF(2^m)$, and reduction polynomial $p(x) = x^m + r(x)$.

OUTPUT: $c = a \bullet b$

1. Set $c \leftarrow 0$
2. For i from m − 1 downto 0 do
   2.1 $c \leftarrow \mathrm{leftshift}(c) + c_{m-1} r$
   2.2 $c \leftarrow c + b_i a$
3. Return (c)
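A direct software transcription of Algorithm 1 may help when checking the hardware against known products. The sketch below is a behavioural model only; the function name `msb_first_mul` and the convention that `r` holds the low m bits of p(x) (for AES, `r = 0x1B`) are assumptions made for this example.

```python
def msb_first_mul(a: int, b: int, m: int, r: int) -> int:
    """Return c = a * b in GF(2^m), where p(x) = x^m + r(x).

    Field elements are integers whose bit k is the coefficient of x^k.
    """
    mask = (1 << m) - 1
    c = 0
    for i in range(m - 1, -1, -1):
        # Step 2.1: c <- leftshift(c) + c_{m-1} * r
        top = (c >> (m - 1)) & 1      # c_{m-1} before the shift
        c = (c << 1) & mask
        if top:
            c ^= r
        # Step 2.2: c <- c + b_i * a
        if (b >> i) & 1:
            c ^= a
    return c
```

For example, `msb_first_mul(0x57, 0x13, 8, 0x1B)` models the AES field, while `msb_first_mul(0x8, 0x2, 4, 0x3)` models $GF(2^4)$ with $p(x) = x^4 + x + 1$.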

Algorithm 2. MixColumn/InvMixColumn MSB multiplication in $GF(2^8)$

INPUT: $S = (s_7, \dots, s_1, s_0) \in GF(2^8)$, $Con = (con_7, \dots, con_1, con_0) \in GF(2^8)$, and reduction polynomial $p(x) = x^8 + x^4 + x^3 + x + 1 = x^8 + r(x)$.

OUTPUT: $c = S \bullet Con$

1. Set $c \leftarrow 0$
2. For i from 7 downto 0 do
   2.1 $c \leftarrow \mathrm{leftshift}(c) + c_7 r$
   2.2 $c \leftarrow c + con_i S$
3. Return (c)
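To see how the intermediate value c evolves over the 8 rounds of Algorithm 2, the following sketch records c after every round. It is a behavioural model written for illustration; the name `mixcol_mul_trace` is hypothetical, and the input byte {57} used in the usage note is an arbitrary choice.

```python
R = 0x1B  # r(x) = x^4 + x^3 + x + 1, the low 8 bits of p(x)


def mixcol_mul_trace(s: int, con: int) -> list:
    """Run Algorithm 2 and return the value of c after each of the 8 rounds."""
    c = 0
    trace = []
    for i in range(7, -1, -1):
        c7 = (c >> 7) & 1            # most significant bit before the shift
        c = (c << 1) & 0xFF          # step 2.1: leftshift(c) ...
        if c7:
            c ^= R                   # ... + c_7 * r
        if (con >> i) & 1:
            c ^= s                   # step 2.2: c + con_i * S
        trace.append(c)
    return trace
```

For `mixcol_mul_trace(0x57, 0x0B)` the first four recorded values are zero, because $con_7$ to $con_4$ are zero for this constant; the product appears as the last element of the trace.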

C. Resource Sharing Hardware Technique

Resource sharing can be employed in order to speed up the calculations and reduce hardware area and power consumption. Observing Algorithm 2, it can be noted that until the first nonzero $con_i$ is used, in round i, the intermediate value c does not change; it remains zero. Therefore, if the index i of the first nonzero $con_i$ is known, then (7 − i) rounds of Algorithm 2 can be omitted. As shown in Table I, for all the constant values (Con values) used in MixColumn/InvMixColumn, bits $con_7$ to $con_4$ are zero. It can be concluded that at least (7 − 3) = 4 rounds can be omitted in each multiplication, so calculating the multiplication product requires at most 4 rounds. The required number of rounds for multiplying the S input by each constant value Con is also shown in Table I.

TABLE I. BINARY REPRESENTATION OF CONSTANT VALUES AND REQUIRED ROUNDS FOR EACH MULTIPLICATION

Constant Value    Req. Rounds    Constant Value    Req. Rounds
{01}: 00000001    1              {0B}: 00001011    4
{02}: 00000010    2              {0D}: 00001101    4
{03}: 00000011    2              {0E}: 00001110    4
{09}: 00001001    4

Each round of Algorithm 2 can be represented as a recursive equation, varying according to the $con_i$ and r inputs. However, the Con and r values are known, so a recursive equation can be specified for every possible input. There are three types of such equations.

A Type I equation corresponds to finding the first nonzero $con_i$ value and has the form $c \leftarrow c + con_i S$, which, since c = 0 and $con_i = 1$, becomes $c \leftarrow S$. A Type II equation corresponds to the occurrence of $con_i = 0$ after the first nonzero bit of Con has been found, and has the form $c \leftarrow \mathrm{leftshift}(c) + c_7 r$. A Type III equation corresponds to the occurrence of $con_i = 1$ after the first nonzero bit of Con has been found, and has the form $c \leftarrow \mathrm{leftshift}(c) + c_7 r + S$.
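The three equation types and the omission of leading zero rounds can be modelled together in a few lines. This is a minimal sketch under the assumption that the required number of rounds equals the position of the first nonzero bit of Con plus one (the bit length of Con); the names `type2`, `type3`, `required_rounds`, and `mul_reduced` are hypothetical.

```python
R = 0x1B  # r(x) = x^4 + x^3 + x + 1


def type2(c: int) -> int:
    """Type II: c <- leftshift(c) + c_7 * r."""
    c7 = (c >> 7) & 1
    return ((c << 1) & 0xFF) ^ (R if c7 else 0)


def type3(c: int, s: int) -> int:
    """Type III: c <- leftshift(c) + c_7 * r + S."""
    return type2(c) ^ s


def required_rounds(con: int) -> int:
    """Rounds left once the zero rounds before the first nonzero con_i are skipped."""
    return con.bit_length()


def mul_reduced(s: int, con: int) -> int:
    """Multiply S by a nonzero constant using only the required rounds."""
    n = required_rounds(con)
    c = s                                   # Type I: c <- S
    for i in range(n - 2, -1, -1):          # remaining bits below the leading 1
        c = type3(c, s) if (con >> i) & 1 else type2(c)
    return c
```

With this model, `required_rounds` returns 1 for {01}, 2 for {02} and {03}, and 4 for {09}, {0B}, {0D}, and {0E}.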

Fig. 3. The tree structure of the multiplication process.

The Constant values, along with the number of rounds for the multiplication of a column S of the State by each Constant value, are known. Using the above remark, the whole multiplication process can be represented by the tree structure of Fig. 3. Every level of this tree corresponds to one round of Algorithm 2. Each node of the tree corresponds to a Type I, II, or III equation depending on the current value of $con_i$. The root of the tree (Level 1) represents $con_3$ and one multiplication round. Level 2 represents $con_2$ and two multiplication rounds. Level 3 represents $con_1$ and three multiplication rounds. Level 4 represents $con_0$ and four multiplication rounds.
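One way to read the tree is that each node is identified by the bit path leading to it, so that constants sharing leading bits share nodes. The sketch below follows that reading, which is an interpretation made for illustration rather than a description of Fig. 3 itself; the names `shared_products`, `reference_mul`, and `CONSTANTS` are hypothetical.

```python
R = 0x1B
CONSTANTS = (0x01, 0x02, 0x03, 0x09, 0x0B, 0x0D, 0x0E)


def type2(c):
    """c <- leftshift(c) + c_7 * r."""
    return ((c << 1) & 0xFF) ^ (R if (c >> 7) & 1 else 0)


def type3(c, s):
    """c <- leftshift(c) + c_7 * r + S."""
    return type2(c) ^ s


def shared_products(s):
    """Compute S * Con for every constant, reusing nodes with a common bit prefix."""
    nodes = {"1": s}                       # root: Type I, c <- S
    for con in CONSTANTS:
        path = format(con, "b")            # significant bits, e.g. 0x09 -> "1001"
        for k in range(2, len(path) + 1):
            prefix = path[:k]
            if prefix not in nodes:        # build each node only once
                parent = nodes[prefix[:-1]]
                nodes[prefix] = type3(parent, s) if prefix[-1] == "1" else type2(parent)
    products = {con: nodes[format(con, "b")] for con in CONSTANTS}
    return products, nodes


def reference_mul(s, con):
    """Independent shift-and-add check, not part of the tree model."""
    c = 0
    while con:
        if con & 1:
            c ^= s
        s = type2(s)
        con >>= 1
    return c
```

In this model the product for {02} is read at the node "10", which is also the parent of the nodes used for {09} and {0B}; this is the sharing that the tree structure is meant to exploit.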

Fig. 4. Proposed Type II multiplier slices

Fig. 5. Proposed Type III multiplier slices


Each product is taken at the appropriate level according to the required number of rounds for each multiplication, shown in Table I. Using the above tree, all the required results can be computed. For example, to obtain the result $S_{i,j} \bullet \{09\}$ we follow the path 1-0-0-1, while for the result $S_{i,j} \bullet \{0D\}$ we follow the path 1-1-0-1.

Each equation can be modelled by an 8-bit hardware slice. Such slices are presented in Figures 4 and 5 for Type II and Type III equations, respectively. The Type I slice is a rearrangement of wires that requires no gates. The Type II slice, implementing the equation $c \leftarrow \mathrm{leftshift}(c) + c_7 r$, uses only three XOR gates, since the value r is a known constant (r = {11011} in binary) and the least significant bit of c after left shifting is always zero, as shown in Fig. 4. The Type III slice, implementing $c \leftarrow \mathrm{leftshift}(c) + c_7 r + S$, utilizes 11 XOR gates, as shown in Fig. 5.
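The gate counts quoted above can be made concrete with a bit-level model of the two slices. The sketch below is an illustration derived from the equations and from r = {11011}, not a transcription of Fig. 4 or Fig. 5; the names `type2_word`, `type2_slice`, and `type3_slice` are hypothetical.

```python
def type2_word(c: int) -> int:
    """Word-level reference: c <- leftshift(c) + c_7 * r, with r = 0x1B."""
    c7 = (c >> 7) & 1
    return ((c << 1) & 0xFF) ^ (0x1B if c7 else 0)


def type2_slice(c: int) -> int:
    """Bit-level Type II slice using exactly three XOR operations."""
    b = [(c >> k) & 1 for k in range(8)]
    out = [
        b[7],          # bit 0: shifted-in 0 plus c_7 is just a wire
        b[0] ^ b[7],   # bit 1: XOR gate 1
        b[1],          # bit 2: wire
        b[2] ^ b[7],   # bit 3: XOR gate 2
        b[3] ^ b[7],   # bit 4: XOR gate 3
        b[4],          # bit 5: wire
        b[5],          # bit 6: wire
        b[6],          # bit 7: wire
    ]
    return sum(bit << k for k, bit in enumerate(out))


def type3_slice(c: int, s: int) -> int:
    """Type III slice: the Type II result XORed bitwise with S (8 more XORs)."""
    return type2_slice(c) ^ s
```

Only bits 1, 3, and 4 need a gate in the Type II slice because r has ones at bit positions 0, 1, 3, and 4 and bit 0 of the shifted value is always zero; adding the byte S in the Type III slice costs one XOR per bit, giving 3 + 8 = 11.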

D. Power Management Technique

Switching activity [5] is the major cause of energy dissipation in most CMOS digital systems. The switching activity of area resources that do not contribute to a specific operation at a given time can be reduced. The basic principle is to identify logical conditions under which some inputs of a logic circuit do not affect the output. When the system operates in encryption mode, some parts of the Multiplier block do not contribute to the result. We can shut down Levels 3 and 4 of the proposed tree (Fig. 3) by introducing AND gates that stop the propagation of the S(i) and C(i) signals (Fig. 4, 5). With this methodology, only the required parts (Level 1 and Level 2) are operational in encryption mode. Figure 6 presents the active and inactive parts of a Multiplier block during the MixColumn operation, controlled by the En/Dec signal.
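The effect of the AND gates on switching activity can be illustrated with a simple behavioural simulation. The sketch below is a software analogy under several assumptions: the tree nodes are named by bit paths, the gating is placed at the inputs of Level 3, and switching activity is approximated by counting output bit flips between consecutive inputs. None of these details are taken from Fig. 6, and all function names are hypothetical.

```python
R = 0x1B


def type2(c):
    return ((c << 1) & 0xFF) ^ (R if (c >> 7) & 1 else 0)


def type3(c, s):
    return type2(c) ^ s


def and_gate(x, enable):
    """Bank of AND gates: pass x when enabled, force all zeros otherwise."""
    return x if enable else 0


def deep_levels(s, dec):
    """Outputs of the Level 3 and Level 4 nodes for input byte s.

    dec=True models decryption (levels active), dec=False models encryption.
    """
    n10, n11 = type2(s), type3(s, s)                      # Level 2, always active
    g10, g11, gs = and_gate(n10, dec), and_gate(n11, dec), and_gate(s, dec)
    n100, n101 = type2(g10), type3(g10, gs)               # Level 3
    n110, n111 = type2(g11), type3(g11, gs)
    return {
        "100": n100, "101": n101, "110": n110, "111": n111,
        "1001": type3(n100, gs), "1011": type3(n101, gs),  # Level 4
        "1101": type3(n110, gs), "1110": type2(n111),
    }


def toggles(inputs, dec):
    """Total number of bit flips at the Level 3 and 4 outputs over an input sequence."""
    total = 0
    prev = deep_levels(inputs[0], dec)
    for s in inputs[1:]:
        cur = deep_levels(s, dec)
        total += sum(bin(prev[k] ^ cur[k]).count("1") for k in cur)
        prev = cur
    return total
```

With the gates closed (`dec=False`) the deep nodes see constant zero inputs, so `toggles` reports no bit flips regardless of the data, while the Level 2 products for {02} and {03} remain available.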

Fig. 6. The proposed Power Management technique

VI. COMPARISONS WITH OTHER WORKS

When the power management technique is applied at the system level, data paths that do not contribute to the required results of the system are deactivated. In this case, a part of the system is inactive and does not contribute to the total power consumption. As a result, a large number of gates is inactive and the power savings are significant.

In general, AES implementations with low power characteristics have rarely been proposed, with the exception of [2], where low power implementations of the SubBytes transformation are presented. We compare the proposed system with two detailed architectures for the MixColumn transformation. In [4], no resource sharing is used and a different multiplication architecture is proposed for each constant. This technique achieves about the same power consumption as our proposed design but occupies about 35% more area resources. The design of [3] has area resources similar to those of our proposed design. However, during the MixColumn transformation, the design of [3] has about 170% more active gates than our proposed design, and hence higher power consumption. Comparison results are shown in Table II. In order to achieve fair comparisons, the 8-bit architectures of [3], [4] are normalized to 32 bits in Table II.

TABLE II. COMPARISONS WITH OTHER WORKS

Implementations        [4]                  [3]                  Proposed
Bit Length             32-bit               32-bit               32-bit
Area Resources         592 XOR              424 XOR + 128 AND    432 XOR + 104 AND
Throughput             4 bytes/clock cycle  4 bytes/clock cycle  4 bytes/clock cycle
Resource Sharing       no                   yes                  yes
Active Gates (Power Consumption)
  MixColumn oper.      152 XOR              424 XOR              152 XOR
  InvMixColumn oper.   440 XOR              424 XOR              432 XOR
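As a rough cross-check of the percentages quoted in the comparison, the XOR counts of Table II can be compared directly. This assumes that only XOR gates are counted and that AND gates are ignored, which the text does not state explicitly:

```latex
\frac{592}{432} \approx 1.37
\qquad\qquad
\frac{424}{152} \approx 2.79
```

Under that assumption, [4] uses roughly 37% more XOR gates in total than the proposed design, and [3] keeps roughly 2.8 times as many XOR gates active during the MixColumn operation, which is broadly in line with the figures given in the text.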

VII. CONCLUSIONS

In this paper, by applying the resource sharing technique, we identify common data paths between the desired operations in order to limit the area resources of the system. In addition, by applying the power management technique, data paths that do not contribute to the required results of the system are deactivated. The resulting architecture, which combines both techniques, achieves efficient results in terms of power consumption and area when compared with other known designs.

REFERENCES

[1] FIPS 197: Advanced Encryption Standard, 2001.

[2] S. Tillich, M. Feldhofer, and J. Großschädl, “Area, Delay, and Power Characteristics of Standard-Cell Implementations of the AES S-Box”, in Embedded Computer Systems: Architectures, Modeling, and Simulation, vol. 4017 of Lecture Notes in Computer Science, pp. 457–466. Springer Verlag, 2006.

[3] J. Wolkerstorfer, “An ASIC implementation of the AES MixColumn operation”, in Proc. Austrochip 2001, Vienna, Austria, Oct. 12, 2001, pp. 129-132.

[4] P. Noo-Intara, S. Chantarawong, and S. Choomchuay, “Architectures for MixColumn Transform for the AES”, Proc. of (ICEP2004), University (Phuket Campus), January 2004.

[5] P.J.M. Havinga, “Mobile Multimedia Systems”, Ph.D. thesis, University of Twente, February 2000.

[6] X. Zhang and K. Parhi, “High-speed VLSI architectures for the AES algorithm”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 12, Issue 9 (September 2004).
