Modular Multiplication: C = A * B mod M where A, B < M

M. M.

Interleaving

Montgomery

High-Radix

Comparison

Improvement

Adders

CLA

CSK

Comparison

Conclusion

Improving Cryptographic Architectures by Adopting Efficient Adders in their Modular Multiplication Hardware

Adnan Gutub, Hassan Tahhan, Computer Engineering Department, King Fahd University of Petroleum & Minerals


M. M.

Modular Multiplication Operation

In many public-key encryption schemes (e.g., RSA, ElGamal and ECC), modular multiplication is a basic arithmetic operation that is heavily used.

Modular Multiplication: C = A * B mod M, where A, B < M

Straightforward Method: multiplication followed by modulus division.

A secure system needs a very large operand size, which makes this too expensive.

[Figure: M. M. block with inputs A, B and M and output C]


Interleaving

Interleaving Multiplication and Reduction

In 1983, Blakley:

P_i = 2 P_{i-1} + b_i A + q M

The literature contains several proposals to solve the magnitude comparison problem.

Koc's implementation is based on carry-save adders. Partial products are represented as sum-carry pairs. The 5 MSBs of the pair are tested for sign estimation.

P = 0
for i = n-1 to 0
{
    P = 2 * P
    if ( P ≥ M ) P = P - M
    if ( b_i = 1 )
    {
        P = P + A
        if ( P ≥ M ) P = P - M
    }
}
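The interleaving loop above can be sketched in Python as follows. This is only an illustrative software model, not the hardware described in the slides; the function name `interleaved_modmul` and the choice of taking the operand width n from the bit length of M are assumptions made for the example.

```python
def interleaved_modmul(a: int, b: int, m: int) -> int:
    """Return (a * b) mod m by interleaving shift-add steps with reduction."""
    if not (0 <= a < m and 0 <= b < m):
        raise ValueError("operands must satisfy 0 <= a, b < m")
    n = m.bit_length()               # assumed operand width in bits
    p = 0
    for i in range(n - 1, -1, -1):   # scan the bits of b from MSB to LSB
        p <<= 1                      # P = 2 * P
        if p >= m:                   # full magnitude comparison
            p -= m
        if (b >> i) & 1:             # b_i = 1
            p += a                   # P = P + A
            if p >= m:               # second magnitude comparison
                p -= m
    return p
```

Each comparison `p >= m` stands for a full-length magnitude comparison, which is the step that is expensive in hardware and that the sign-estimation technique tries to avoid.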


Montgomery

Montgomery’s Method

In 1985, Montgomery:

P_i = (P_{i-1} + b_i A + q M) / 2

No full magnitude comparison is required.

The correction step can be easily removed.

However, pre- and post-calculations are needed to obtain the required result.

As in the interleaving method, implementations based on carry-save adders are the most effective solutions.

P' = 0
for i = 0 to n-1
{
    P' = P' + a'_i * B'
    if ( p'_0 = 1 ) P' = P' + M
    P' = P' / 2
}
if ( P' ≥ M ) P' = P' - M
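A minimal Python sketch of the loop above, together with the pre- and post-calculations, is given here for illustration. The function names and the choice n = bit length of M are assumptions for the example; the transformation into Montgomery's domain is done with an ordinary modular multiplication purely to keep the sketch short.

```python
def montgomery_product(a: int, b: int, m: int, n: int) -> int:
    """Return a * b * 2**(-n) mod m for odd m, with 0 <= a, b < m < 2**n."""
    p = 0
    for i in range(n):               # scan the bits of a from LSB to MSB
        if (a >> i) & 1:
            p += b                   # P' = P' + a'_i * B'
        if p & 1:                    # p'_0 = 1: adding M makes P' even
            p += m
        p >>= 1                      # exact division by 2
    if p >= m:                       # single correction after the loop
        p -= m
    return p


def montgomery_modmul(a: int, b: int, m: int) -> int:
    """Return (a * b) mod m using Montgomery's method (illustrative only)."""
    if m % 2 == 0:
        raise ValueError("Montgomery's method needs gcd(m, 2) = 1")
    n = m.bit_length()
    r = 1 << n
    a_dom = (a * r) % m              # pre-calculation: into Montgomery's domain
    b_dom = (b * r) % m
    p_dom = montgomery_product(a_dom, b_dom, m, n)   # equals a * b * r mod m
    return montgomery_product(p_dom, 1, m, n)        # post-calculation: back out
```

Inside the loop only the least significant bit of P' is inspected, which is why no full magnitude comparison is needed per iteration.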


High-Radix

High-Radix Method

Speeds up the modular multiplier by requiring fewer cycles. The area and the cycle time will increase.

The reduction step will be the crucial operation. As the radix increases, it becomes more complex.

Walter shows that there is a direct trade-off between the required space and the overall computation time. The AT (area-time) factor is independent of the choice of the radix. The factor is expected to improve for radices that are not much larger than radix-2.
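To make the cycle-count argument concrete, here is a hedged sketch of a radix-2^k Montgomery product in Python. It is one common high-radix formulation chosen for illustration, not necessarily the scheme analysed by Walter; the function name, the parameter `k`, and the returned iteration count are assumptions of the example. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
def montgomery_product_radix(a: int, b: int, m: int, k: int):
    """Return (a * b * r**(-d) mod m, d) with r = 2**k and d digit iterations.

    Assumes m is odd and 0 <= a, b < m.
    """
    r = 1 << k
    n = m.bit_length()
    digits = -(-n // k)                      # ceil(n / k) iterations instead of n
    m_neg_inv = pow((-m) % r, -1, r)         # -m^(-1) mod r, precomputed once
    p = 0
    for i in range(digits):
        a_i = (a >> (k * i)) & (r - 1)       # i-th radix-2**k digit of a
        p += a_i * b                         # digit-by-operand product
        q = ((p & (r - 1)) * m_neg_inv) & (r - 1)   # reduction digit
        p = (p + q * m) >> k                 # low k bits are zero by construction
    if p >= m:
        p -= m
    return p, digits
```

With k = 1 this reduces to the bit-serial loop; larger k lowers the number of iterations to ceil(n / k), while the computation of `q` and the digit products grow more complex, which mirrors the trade-off stated above.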


Comparison

Comparison Between [6] and [18]

Description | Koc [6] | Montgomery [18]
Equation | (S, C) = 2S + 2C + a_i B + q M, with q ∈ {1, 0, -1} | (S, C) = S + C + a_i B, then (S, C) = (S + C + s_0 M) / 2
Hardware | Two (n+4)-bit CSAs with sum and carry registers and sign-estimate logic on the 5 MSBs; inputs A, B, M and MC; output P = S + C, and if P < 0 then P = P + M | Two n-bit CSAs with sum and carry registers; inputs A, B and M; output P = S + C

[Figure: block diagrams of the two architectures, showing the CSAs, Register 1 and Register 2, the left-shift and right-shift paths, and the clk input.]
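The two Montgomery equations in the table can be modelled with a carry-save adder as in the sketch below. This is an illustrative software model under stated assumptions: it uses n iterations followed by one correction rather than the n + 2 iterations of the implementation in [18], and the function names are made up for the example.

```python
def carry_save_add(x: int, y: int, z: int):
    """3:2 compressor: return (sum, carry) with sum + carry == x + y + z."""
    s = x ^ y ^ z                                # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1       # majority bits, shifted left
    return s, c


def montgomery_product_csa(a: int, b: int, m: int, n: int) -> int:
    """Return a * b * 2**(-n) mod m, keeping P as a sum-carry pair (S, C)."""
    s, c = 0, 0
    for i in range(n):
        a_i = (a >> i) & 1
        s, c = carry_save_add(s, c, b if a_i else 0)   # (S, C) = S + C + a_i B
        s0 = s & 1                                     # LSB of S + C, since C is even
        s, c = carry_save_add(s, c, m if s0 else 0)    # (S, C) = S + C + s_0 M
        s >>= 1                                        # both halves are even here,
        c >>= 1                                        # so the division by 2 is exact
    p = s + c                                          # single full-length addition
    if p >= m:
        p -= m
    return p
```

No carry propagates inside the loop; the only full-length addition is the final `s + c`, which is where the choice of binary adder matters.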



Algorithmic Analysis

Description | Koc [6] | Montgomery [18]
Pre-calculations | The two's complement of the modulus needs to be computed | Transformation of operands into Montgomery's domain
Inter-calculations | n + 3 iterations | n + 2 iterations
Post-calculations | There is a correction step in addition to the final summation of the sum-carry pair | Summation of the sum-carry pair needs to be transformed back to the ordinary domain
Restrictions | If M is represented using n bits, then |M| ≥ 2^(n-1) | GCD(M, 2) = 1



Hardware Analysis

Description | Koc [6] | Montgomery [18]
Logic | Two (n+4)-bit carry-save adders plus 5-bit carry-lookahead logic | Two n-bit carry-save adders
Registers | 6 | 5

Synthesis Analysis

Description | Koc [6] | Montgomery [18]
Clock period | 6.468 ns | 6.342 ns


Improvement

Improvements on [6]

Pipelining:

Due to data dependency, pipelining will not improve the throughput. However, the pipeline can be used to compute two separate operations simultaneously.

[Figure: pipelined version of the architecture in [6], with two (n+4)-bit CSAs, sign-estimate logic, Registers 1 to 5, and a mux selecting between the operand sets (a1_i, B1, M1, M1C) and (a2_i, B2, M2, M2C).]


Improvement

Improvements on [6]

Parallelism:

The correction step at the end of the algorithm increases the algorithm complexity. At the hardware level, the correction step can be implemented using two options.

By computing the two possible results in parallel, time will be saved.

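[Figure: two options for the correction step. One uses a single fast-speed adder with muxes, a register and a sel signal, taking S, C and M and producing P. The other uses two fast-speed adders and a carry-save adder on S, C and M, with a mux driven by the msb producing P.]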
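The difference between the two options can be illustrated with a small Python model. This is a sketch under assumptions: it follows the correction rule P = S + C, and if P < 0 then P = P + M, and it uses Python's signed integers in place of fixed-width two's complement hardware. The function names are hypothetical.

```python
def carry_save_add(x: int, y: int, z: int):
    """3:2 compressor: return (sum, carry) with sum + carry == x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def correct_sequential(s: int, c: int, m: int) -> int:
    """One adder used twice: the second pass waits for the sign of the first."""
    p = s + c
    if p < 0:
        p = p + m
    return p


def correct_parallel(s: int, c: int, m: int) -> int:
    """Both candidates are formed independently; the sign only drives a mux."""
    p_plain = s + c                      # first fast adder: S + C
    cs, cc = carry_save_add(s, c, m)     # carry-save adder folds in M
    p_plus_m = cs + cc                   # second fast adder: S + C + M
    return p_plus_m if p_plain < 0 else p_plain   # mux selected by the sign bit
```

Both functions return the same value; the parallel form trades an extra adder for removing the second addition from the critical path.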


Adders

Binary Adders

The last stage in both algorithms does full-length addition on the carry-sum pair, which can be performed in hardware through binary adders.

Statistics showed that 72% of the instructions perform additions in the data path of a prototypical RISC machine.

The carry-lookahead adder and the carry-skip adder were compared in terms of time, area and power.


CLA

Carry-Lookahead Adder

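[Figure: carry-lookahead adder. Each bit position i from 0 to n-1 takes a_i and b_i, sends p_i and g_i to the carry-lookahead logic, and receives c_i to form s_i; the logic also produces the carry-out c_n.]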

The total delay of the carry-lookahead adder is O(log n). There is a penalty paid for this gain: the area increases. The carry-lookahead adders require O(n log n) area.
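The logarithmic depth can be seen in the following Python sketch, which computes all carries from the generate and propagate signals with a parallel-prefix (Kogge-Stone style) network. The prefix structure is an assumption chosen for illustration, since the slides do not specify how the carry-lookahead logic is organised; the function name and the returned level count are likewise hypothetical.

```python
def carry_lookahead_add(a: int, b: int, n: int):
    """Add two n-bit numbers; return (sum mod 2**n, carry_out, prefix_levels)."""
    mask = (1 << n) - 1
    g = a & b & mask                 # generate:  g_i = a_i AND b_i
    p = (a ^ b) & mask               # propagate: p_i = a_i XOR b_i
    half_sum = p
    d, levels = 1, 0
    while d < n:                     # about log2(n) combining levels
        g = (g | (p & (g << d))) & mask   # group generate over a span of 2*d bits
        p = (p & (p << d)) & mask         # group propagate over the same span
        d <<= 1
        levels += 1
    carries = (g << 1) & mask        # c_(i+1) is the group generate of bits 0..i
    total = half_sum ^ carries       # s_i = p_i XOR c_i, with c_0 = 0
    carry_out = (g >> (n - 1)) & 1   # c_n
    return total, carry_out, levels
```

Every level updates all bit positions at once, so the number of levels grows with log n while the amount of logic per level grows with n, consistent with the delay and area figures quoted above.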


CSK

Carry-Skip Adder

The carry-skip adder has a simple and regular structure that requires an area on the order of O(n), which is hardly larger than the area required by the ripple-carry adder. The time complexity of the carry-skip adder is bounded between O(n^(1/2)) and O(log n).

An equal-block-size one-level carry-skip adder will have a time complexity of O(n^(1/2)). However, a more optimized multi-level carry-skip adder will have a time complexity of O(log n).

[Figure: carry-skip adder built from full adders (FA) with inputs a_i, b_i and a_{i+1}, b_{i+1}, sums s_i and s_{i+1}, carries c_i and c_{i+1}, propagate signals p_i and p_{i+1}, and a skip signal that bypasses the carry.]
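A functional model of a one-level, equal-block-size carry-skip adder is sketched below in Python. It is an assumption-laden illustration: the default block size of about sqrt(n) and the function name are choices made for the example, and a software model can only show the structure, not the timing benefit.

```python
import math


def carry_skip_add(a: int, b: int, n: int, block: int = 0):
    """Add two n-bit numbers with a one-level carry-skip structure.

    Returns (sum mod 2**n, carry_out). block = 0 selects about sqrt(n) bits.
    """
    if block <= 0:
        block = max(1, math.isqrt(n))
    carry = 0
    total = 0
    for lo in range(0, n, block):
        width = min(block, n - lo)
        mask = (1 << width) - 1
        a_blk = (a >> lo) & mask
        b_blk = (b >> lo) & mask
        carry_in = carry
        s_blk, c = 0, carry_in
        for i in range(width):                 # ripple-carry chain inside the block
            ai = (a_blk >> i) & 1
            bi = (b_blk >> i) & 1
            s_blk |= (ai ^ bi ^ c) << i
            c = (ai & bi) | (c & (ai ^ bi))
        block_propagate = (a_blk ^ b_blk) == mask   # every p_i in the block is 1
        # Skip mux: if the whole block propagates, forward carry_in directly
        # instead of waiting for the ripple chain (same value, shorter path).
        carry = carry_in if block_propagate else c
        total |= s_blk << lo
    return total, carry
```

With blocks of about sqrt(n) bits, the worst-case carry path ripples through one block, skips over the middle blocks, and ripples through the last one, which is where the O(n^(1/2)) bound for the one-level design comes from.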


Comparison

CLA versus CSK

Using 32-bit operands, a multi-level carry-skip adder was 14% faster than the carry-lookahead adder and its power dissipation was 58% of that of the carry-lookahead adder.

Using 64-bit operands, a one-level carry-skip adder was 38% slower than the carry-lookahead adder and its power consumption was 68% of that of the carry-lookahead adder.


Conclusion

This work studied the modular multiplication problem over large operand sizes. Based on a survey, two implementations of modular multiplication algorithms were modeled using VHDL and synthesized. A time-area analysis of both implementations showed that Koc's implementation has the potential to be an effective solution in terms of time and hardware requirements. This implementation was improved further.

Carry-save adders give the maximum speedup in computing the partial products. However, full-length addition on the sum-carry pair needs to be carried out at the last iteration through a dedicated binary adder. Two binary adders were studied: the CLA and the CSK. Although the two adders can be of comparable speed, the CSK requires a smaller area and consumes much less power than the CLA.


Thank you

The End
