Efficient FPGA Modular Multiplication and Exponentiation 1
Efficient FPGA Modular
Multiplication and
Exponentiation Architectures
using Digit Serial Computation
Gustavo Sutter, Jean-Pierre Deschamps, José Luis Imaña
gustavo.sutter@uam.es, jeanpierre.deschamps@urv.cat, jluimana@dacya.ucm.es
Agenda
• Introduction
– Modular exponentiation
• Background
– Montgomery multiplication and exponentiation
• The proposed architecture
– Precomputing q, digit serial and carry save adder
• FPGA Results
– multiplication and exponentiation
• Result comparison
– For multiplication and exponentiation
• Conclusions
Introduction
• Modular exponentiation is the core operation of many public-key cryptosystems.
• Montgomery's modular multiplication algorithm is normally used, since no trial division is necessary and the critical path can be shortened with carry-save addition (CSA).
• In this paper, the Montgomery multiplication is optimized, and architectures are proposed for both the Least-Significant-Bit (LSB) first and the Most-Significant-Bit (MSB) first exponentiation algorithms.
Introduction (II)
• The architecture presented here has the following distinctive characteristics:
– A digit-serial approach to Montgomery multiplication.
– Conversion of the carry-save (CSA) representation of intermediate products using carry-skip addition, which shortens the critical path with a small area-speed penalty.
– Precomputation of the quotient value in each Montgomery iteration in order to increase the operating frequency.
Background: Montgomery’s
algorithm
• The Montgomery product computes $Z = X \cdot Y \cdot R^{-1} \bmod M$ instead of $Z = X \cdot Y \bmod M$. The drawback is the need to convert operands into and out of Montgomery’s domain, a cost that is almost negligible in applications such as exponentiation, where many products share a single conversion.
Algorithm 1 – modified Montgomery product
p := 0;
for i in 0 .. k-1 loop
   q(i) := (p(0) + x(i)*y(0)) mod 2;
   p := (p + x(i)*y + q(i)*m)/2;
end loop;
if p >= m then z := p-m; else z := p; end if;
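As a rough illustration, Algorithm 1 can be transcribed into Python, whose built-in integers stand in for the k-bit registers. The function name `mont_product` is hypothetical, and the sketch assumes that m is odd and that x, y < m < 2^k, so that R = 2^k.

```python
def mont_product(x, y, m, k):
    """Bit-serial Montgomery product: x * y * 2**(-k) mod m (m odd, x, y < m)."""
    p = 0
    for i in range(k):
        xi = (x >> i) & 1                  # bit i of x, LSB first
        q = ((p & 1) + xi * (y & 1)) & 1   # chosen so that p + xi*y + q*m is even
        p = (p + xi * y + q * m) >> 1      # exact division by 2
    return p - m if p >= m else p          # single final correction


# Small check against the definition, with R = 2**k
m, k = 239, 8
x, y = 100, 200
r_inv = pow(2, -k, m)                      # modular inverse of R (Python 3.8+)
assert mont_product(x, y, m, k) == (x * y * r_inv) % m
```

Because m is odd, adding q*m only flips the least significant bit when needed, which is why no trial division appears in the loop.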
Background: Montgomery’s
algorithm (II)
• In the previous algorithm, the main contribution to the delay is the carry propagation in the very long operand additions. This can be avoided by using Carry Save Adders (CSA).
Algorithm 2 – Montgomery product, carry-save addition
pc := 0; ps := 0;
for i in 0 .. k-1 loop
   q := (pc(0) + ps(0) + x(i)*y(0)) mod 2;
   (pc, ps) := (pc + ps + x(i)*y + q*m)/2;
end loop;
p := pc + ps;
if p >= m then z := p-m; else z := p; end if;
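The carry-save variant can be sketched in Python by modelling a 3:2 compressor with bitwise operations. The helper `csa` and the two-level arrangement used to add four operands are assumptions made for illustration; the slides do not specify how the adder tree is organized.

```python
def csa(a, b, c):
    """3:2 carry-save compressor: returns (carry, sum) with carry + sum == a + b + c."""
    s = a ^ b ^ c                              # bitwise sum, no carry propagation
    cy = ((a & b) | (a & c) | (b & c)) << 1    # carries, moved to their weight
    return cy, s


def mont_product_csa(x, y, m, k):
    """Montgomery product with the accumulator kept as a (pc, ps) pair."""
    pc = ps = 0
    for i in range(k):
        xi = (x >> i) & 1
        q = ((pc & 1) + (ps & 1) + xi * (y & 1)) & 1
        c1, s1 = csa(pc, ps, xi * y)           # first level: pc + ps + x(i)*y
        c2, s2 = csa(c1, s1, q * m)            # second level: ... + q*m
        # The total is even and c2 has a zero LSB, so s2 is even as well.
        pc, ps = c2 >> 1, s2 >> 1
    p = pc + ps                                # the only carry-propagate addition
    return p - m if p >= m else p
```

Only the final `pc + ps` needs a full-width carry chain, which is the step the slides later assign to a carry-skip adder.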
Background: The Exponentiation
• Modular exponentiation ($Y^X \bmod M$) is usually done with repeated modular multiplications, scanning the exponent MSB first or LSB first.
• If the operands are kept in Montgomery’s domain, additional pre- and post-processing steps are needed.
Algorithm 4 – base 2 mod m exponentiation, LSB-first using Montgomery product
e := exp_k;
ty := mp(y, exp_2k);
for i in 0 .. ke-1 loop
   if x(i) = 1 then e := mp(e, ty);
   end if;
   ty := mp(ty, ty);
end loop;
z := mp(e, 1);
Algorithm 3 – base 2 mod m exponentiation, MSB-first using Montgomery product
e := exp_k;
ty := mp(y, exp_2k);
for i in 1 .. ke loop
   e := mp(e, e);
   if x(ke-i) = 1 then e := mp(e, ty);
   end if;
end loop;
z := mp(e, 1);
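Both exponentiation loops can be sketched in Python as follows. The sketch assumes that exp_k = 2^k mod m (the Montgomery form of 1) and exp_2k = 2^(2k) mod m, which the slides use but do not define; the function names are hypothetical, and `mp` repeats the bit-serial recurrence of Algorithm 1.

```python
def mp(a, b, m, k):
    """Montgomery product a * b * 2**(-k) mod m (m odd, a, b < m)."""
    p = 0
    for i in range(k):
        ai = (a >> i) & 1
        q = ((p & 1) + ai * (b & 1)) & 1
        p = (p + ai * b + q * m) >> 1
    return p - m if p >= m else p


def mod_exp_msb(y, x, m, k, ke):
    """MSB-first y**x mod m, with x a ke-bit exponent."""
    e = pow(2, k, m)                       # exp_k (assumed): 1 in Montgomery form
    ty = mp(y, pow(2, 2 * k, m), m, k)     # exp_2k (assumed) moves y into the domain
    for i in range(1, ke + 1):
        e = mp(e, e, m, k)                 # square for every exponent bit
        if (x >> (ke - i)) & 1:
            e = mp(e, ty, m, k)            # multiply only when the bit is 1
    return mp(e, 1, m, k)                  # leave Montgomery's domain


def mod_exp_lsb(y, x, m, k, ke):
    """LSB-first y**x mod m, with x a ke-bit exponent."""
    e = pow(2, k, m)
    ty = mp(y, pow(2, 2 * k, m), m, k)
    for i in range(ke):
        if (x >> i) & 1:
            e = mp(e, ty, m, k)            # does not depend on the squaring below
        ty = mp(ty, ty, m, k)              # running square of y
    return mp(e, 1, m, k)                  # the result accumulates in e, not in ty


assert mod_exp_msb(7, 123, 239, 8, 8) == pow(7, 123, 239)
assert mod_exp_lsb(7, 123, 239, 8, 8) == pow(7, 123, 239)
```

In the LSB-first loop the two products of one iteration read the same `ty` and write different variables, so they have no data dependence on each other within the iteration.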
THE PROPOSED ARCHITECTURE:
Modular Multiplication
• To speed up the Montgomery product, the next quotient bit q(i+1) is precomputed during iteration i, and Carry Save Adders are used.
Algorithm 6 – modified Montgomery product, carry-save addition, q precomputed
pc := 0; ps := 0;
q := x(0)*y(0);
for i in 0 .. k-1 loop
   qn := ((pc(1:0) + ps(1:0) + x(i)*y(1:0)
          + q*m(1:0))/2 + x(i+1)*y(0)) mod 2;
   (pc, ps) := (pc + ps + x(i)*y + q*m)/2;
   q := qn;
end loop;
p := pc + ps;
if p >= m then z := p-m; else z := p; end if;
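A Python sketch of the precomputed-quotient loop is given below. It assumes x < 2^k, so that x(k) reads as 0 in the last iteration, and it models the carry-save adders with a hypothetical `csa` helper; neither detail is spelled out in the slides.

```python
def csa(a, b, c):
    """3:2 carry-save compressor: returns (carry, sum) with carry + sum == a + b + c."""
    return ((a & b) | (a & c) | (b & c)) << 1, a ^ b ^ c


def mont_product_q_pre(x, y, m, k):
    """Carry-save Montgomery product with q(i+1) computed during iteration i."""
    pc = ps = 0
    q = (x & 1) * (y & 1)                  # q(0): the accumulator starts at 0
    for i in range(k):
        xi = (x >> i) & 1
        xn = (x >> (i + 1)) & 1            # x(i+1); reads 0 past the top bit
        # Only the two low bits of each operand decide the next accumulator LSB.
        low = (pc & 3) + (ps & 3) + xi * (y & 3) + q * (m & 3)
        qn = ((low >> 1) + xn * (y & 1)) & 1
        c1, s1 = csa(pc, ps, xi * y)
        c2, s2 = csa(c1, s1, q * m)
        pc, ps = c2 >> 1, s2 >> 1
        q = qn                             # already available for the next iteration
    p = pc + ps
    return p - m if p >= m else p
```

The next quotient bit depends on a handful of low-order bits only, so it can be produced in parallel with the wide carry-save addition instead of in front of it.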
[Figure: bit-level datapath of the Montgomery cell. Rows of half adders (HA) and full adders (FA) combine y(k-1..0), m(k-1..0) with m0 = 1, pc and ps into intermediate bits bc and bs, and then into new_pc(k..0) and new_ps(k..0). A separate "next q computation" block derives q(i+1) from x(i), x(i+1), q(i), y(1..0), m1, ps(1..0) and pc(1..0) using XOR gates and one FA.]
THE PROPOSED ARCHITECTURE:
Modular Multiplication (II)
• To further optimize
– Use digit serial computation.
– Use carry-skip adder for final addition
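One way to picture digit-serial operation is to consume d bits of x per clock cycle, so that the main loop takes k/d cycles. The Python sketch below models the d-digit cell as d chained bit-level steps; this chaining, the function name and the requirement that d divides k are assumptions made for illustration, not a description of the actual cell.

```python
def mont_product_digit_serial(x, y, m, k, d):
    """Montgomery product consuming d bits of x per cycle; returns (z, cycles)."""
    assert k % d == 0, "sketch assumes that d divides k"
    p = 0
    cycles = 0
    for base in range(0, k, d):            # one outer iteration = one clock cycle
        for j in range(d):                 # d bit-level steps chained combinationally
            xi = (x >> (base + j)) & 1
            q = ((p & 1) + xi * (y & 1)) & 1
            p = (p + xi * y + q * m) >> 1
        cycles += 1
    z = p - m if p >= m else p
    return z, cycles
```

Under this model a larger d lowers the cycle count but lengthens the combinational path of each cycle, which is the trade-off the implementation tables quantify.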
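[Figure: block diagram of the digit-serial multiplier. A k-bit shift-d register (load, shift) supplies the digits x(d.(i+1)+1 .. d.i) to the d-digit Montgomery cell, which also receives y, m, pc, ps and q(i) and produces new_pc(k..0), new_ps(k..0) and q(i+1). Two (k+1)-bit registers and a one-bit register (clear, ce) hold the results; a "final additions" block (load, ce_p) computes p from pc, ps and m.]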
THE PROPOSED ARCHITECTURE:
Modular Multiplication (III)
• The carry-skip adder is much faster than a carry-propagate adder, but its delay can still exceed the clock period of the main datapath. The solution used is to wait w = T/ad cycles for this final step to finish.
TABLE I. DELAY IN NS AND AREA IN LUTS FOR CARRY SKIP COMPARED AGAINST RIPPLE CARRY ADDERS IN VIRTEX 5

        ripple-carry     S=32            S=64
Bits    Delay   Area     Delay   Area    Delay   Area    SpeedUp   Area Overhead
512     11.8    512      4.4     716     5.5     644     267%      40%
1024    26.8    1024     5.3     1452    5.9     1332    505%      42%
2048    56.5    2048     6.6     2924    6.3     2708    896%      32%
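The SpeedUp and Area Overhead columns appear to be ratios against the ripple-carry adder, taken for the faster of the two carry-skip configurations in each row. The Python sketch below recomputes them under that assumption; because the tabulated delays are rounded, the results match the table only approximately.

```python
def speedup_percent(t_ripple_ns, t_skip_ns):
    """Ripple-carry delay relative to carry-skip delay, in percent."""
    return 100.0 * t_ripple_ns / t_skip_ns


def area_overhead_percent(a_ripple_luts, a_skip_luts):
    """Extra LUTs of the carry-skip adder relative to ripple-carry, in percent."""
    return 100.0 * (a_skip_luts - a_ripple_luts) / a_ripple_luts


# bits: (ripple delay, ripple area, faster carry-skip delay, its area), from Table I
rows = {
    512: (11.8, 512, 4.4, 716),
    1024: (26.8, 1024, 5.3, 1452),
    2048: (56.5, 2048, 6.3, 2708),
}
for bits, (t_rc, a_rc, t_cs, a_cs) in rows.items():
    print(bits, round(speedup_percent(t_rc, t_cs)),
          round(area_overhead_percent(a_rc, a_cs)))
```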
THE PROPOSED ARCHITECTURE:
Modular Exponentiation
• We have used the traditional MSB-first and LSB-first algorithms.
– MSB first performs around 1.5 Montgomery products (MP) per exponent bit on average, and 2 in the worst case.
– LSB first, in turn, includes at most two Montgomery products per exponent bit. In this case both products can be executed in parallel, so the computation time per bit is that of a single product.
– The values exp_k and exp_2k required by both algorithms are computed using an SRT reducer.
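The product counts quoted above can be checked with a small Python sketch. It counts only the products inside the main loop, ignoring pre- and post-processing; the function names are hypothetical.

```python
def msb_first_products(x, ke):
    """Sequential products in the MSB-first loop: ke squarings plus one
    multiplication for every 1 bit of the ke-bit exponent x."""
    return ke + bin(x).count("1")


def lsb_first_slots(ke):
    """Product time slots in the LSB-first loop when the multiplication and
    the squaring of each iteration run on two multipliers in parallel."""
    return ke


ke = 8
x = 0b10110010                              # four of the eight bits are 1
print(msb_first_products(x, ke) / ke)       # products per exponent bit, MSB first
print(lsb_first_slots(ke) / ke)             # time slots per exponent bit, LSB first
```

An exponent with half of its bits set gives 1.5 products per bit in the MSB-first loop, and an all-ones exponent gives the worst case of 2.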
FPGA Implementation Results
• The design entry is behavioral VHDL, except for the carry-skip adder.
TABLE II. VIRTEX 5 IMPLEMENTATION RESULTS OF PROPOSED DIGIT SERIAL MONTGOMERY’S MULTIPLIERS

k      d   FF      6-LUTs   Main cycles   w cycles   Period (ns)   Total time (ns)
512    1   2581    4130     512           4          1.7           920.5
512    2   2583    6178     256           3          2.6           663.8
512    4   2584    10276    128           2          4.5           585.0
512    8   2584    18494    64            1          8.4           549.3
1024   1   5142    8227     1024          4          1.8           1936.8
1024   2   5144    12323    512           3          2.6           1319.9
1024   4   5145    20527    256           2          4.5           1161.0
1024   8   5145    36937    128           1          8.5           1090.1
2048   1   10263   16417    2048          5          1.8           3867.9
2048   2   10265   24613    1024          4          2.5           2634.8
2048   4   10266   41007    512           2          4.5           2313.0
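The total time in Table II is consistent with (main cycles + w) × period, where the main loop takes k/d cycles. The Python sketch below shows that relation; the formula is inferred from the tabulated values rather than stated in the slides, and rows whose period is heavily rounded reproduce the table only approximately.

```python
def total_time_ns(k, d, w, period_ns):
    """Latency of one multiplication: k/d main cycles plus w wait cycles
    for the final carry-skip addition, times the clock period."""
    main_cycles = k // d
    return (main_cycles + w) * period_ns


print(total_time_ns(512, 4, 2, 4.5))    # compare with the k = 512, d = 4 row
print(total_time_ns(1024, 4, 2, 4.5))   # compare with the k = 1024, d = 4 row
```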
FPGA Implementation Results
TABLE III. VIRTEX 5 IMPLEMENTATION OF EXPONENTIATIONS

k = ke   Meth   d   FF      LUTs    Period (ns)   avg T (ms)   Throughput (kb/s)
512      MSB    1   4144    5696    1.8           0.72         713.6
512      MSB    2   4145    7745    2.5           0.50         1023.6
512      MSB    4   4145    11845   4.5           0.45         1133.0
512      MSB    8   4145    20041   8.5           0.43         1199.6
512      LSB    2   6728    13923   2.5           0.33         1535.4
1024     MSB    1   8242    11330   1.9           2.98         343.2
1024     MSB    2   8243    15427   2.6           2.03         503.6
1024     MSB    4   8243    23623   4.5           1.79         572.5
1024     MSB    8   8243    40011   8.4           1.68         608.7
1024     LSB    2   13387   27750   2.6           1.38         744.6
2048     MSB    1   16436   22595   1.9           12.00        170.7
2048     MSB    2   16437   30790   2.5           7.91         259.0
2048     MSB    4   16437   47176   4.5           7.12         287.8
2048     LSB    1   26699   39012   2.5           10.53        194.6
Performance Comparison:
Modular Multipliers
• For comparison, the proposed multipliers were reimplemented in Virtex 2 devices using Xilinx ISE 10.1.03.
TABLE V. COMPARISON OF MODULAR MULTIPLIERS IN FPGAS

k     Circuit        Device     Slices   T (ns)   Time (µs)   Throughput (Mb/s)   AxD
512   [9]            Virtex E   2972     10.5     16.17       31.7                48.1
512   [3] (5 to 2)   Virtex 2   5170     7.9      4.06        126.2               21.0
512   [3] (4 to 2)   Virtex 2   5782     8.2      4.21        121.6               24.4
512   [6]            Virtex 2   2902     8.2      4.26        120.3               12.3
512   [4]            Virtex 2   4029     4.5      2.33        220.2               9.4
512   Prop D=1       Virtex 2   2469     3.6      1.89        270.5               4.7
512   Prop D=2       Virtex 2   3497     4.8      1.25        409.3               4.4
512   Prop D=4       Virtex 2   5538     8.6      1.13        452.2               6.3
512   Prop D=8       Virtex 2   9446     15.6     1.03        497.4               9.7
512   Prop D=4       Virtex 5   2936     4.5      0.59        862.0               -
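The throughput (Thrg) and AxD columns can be related to the other columns with the Python sketch below. Both formulas are inferred from the tabulated values rather than stated in the slides: throughput as operand bits divided by multiplication time, and AxD as slices times multiplication time, scaled by 1/1000.

```python
def throughput_mbps(k_bits, time_us):
    """Operand bits processed per microsecond, which equals Mb/s."""
    return k_bits / time_us


def area_delay(slices, time_us):
    """Area-delay product in slices x ms (assumed scaling of the AxD column)."""
    return slices * time_us / 1000.0


# First two rows of Table V (k = 512): circuits [9] and [3] (5 to 2)
print(round(throughput_mbps(512, 16.17), 1), round(area_delay(2972, 16.17), 1))
print(round(throughput_mbps(512, 4.06), 1), round(area_delay(5170, 4.06), 1))
```

Rows whose time is heavily rounded in the table reproduce the throughput only to within about one percent.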
Performance Comparison:
Modular Multipliers (II)
TABLE V. COMPARISON OF MODULAR MULTIPLIERS IN FPGAS (continued)

k      Circuit        Device     Slices   T (ns)   Time (µs)   Throughput (Mb/s)   AxD
1024   [9]            Virtex E   5706     10.5     32.17       31.8                183.6
1024   [3] (5 to 2)   Virtex 2   10332    9.8      10.09       101.5               104.2
1024   [3] (4 to 2)   Virtex 2   11520    9.0      9.22        111.1               106.2
1024   [6]            Virtex 2   4512     8.8      9.03        113.4               40.7
1024   [4]            Virtex 2   8000     4.5      4.63        221.1               37.1
1024   Prop D=1       Virtex 2   4923     3.7      3.88        262.7               19.2
1024   Prop D=2       Virtex 2   6982     4.8      2.48        410.8               17.4
1024   Prop D=4       Virtex 2   11079    8.4      2.19        471.7               24.1
1024   Prop D=8       Virtex 2   19247    15.5     2.02        508.2               38.8
1024   Prop D=4       Virtex 5   5702     4.5      1.18        868.5               -
2048   [3] (5 to 2)   Virtex 2   20986    11.1     22.76       90.0                477.5
2048   [3] (4 to 2)   Virtex 2   23108    11.0     22.59       90.6                522.1
2048   Prop D=1       Virtex 2   9831     3.8      7.79        263.0               76.6
2048   Prop D=2       Virtex 2   13954    4.8      4.94        414.8               68.9
2048   Prop D=4       Virtex 2   22201    8.4      4.34        471.3               95.7
2048   Prop D=2       Virtex 5   6837     2.56     2.63        777.3               -
Performance Comparison:
Modular Exponentiators
TABLE VII. COMPARISON FOR 1024 BITS EXPONENTIATORS.

Ref            Meth   FPGA       Area (slices)   Period (ns)   w   avg C (x1000)   avg T (ms)   Throughput (kb/s)
[10] (r2)      LSB    XC4K       4865            19.2          -   2122            40.74        25.1
[10] (r16)     LSB    XC4K       6683            21.9          -   546             11.95        85.7
[3] (4 to 2)   MSB    Virtex 2   26136           10.3          -   1054            10.85        94.3
[4]            LSB    Virtex 2   12537           6.6           -   1579            10.35        98.9
Prop D=2       LSB    Virtex 2   9298            4.8           6   798             3.83         267.3
Prop D=4       LSB    Virtex 2   13346           8.4           3   399             3.35         305.5
Prop D=2       MSB    Virtex 2   16280           4.8           6   532             2.55         401.0
Prop D=4       LSB    Virtex 5   6217            4.5           2   397             1.79         572.5
Prop D=2       MSB    Virtex 5   7303            2.6           3   529             1.38         744.6
Conclusions
• The key to efficient exponentiation is an efficient multiplication. Montgomery's multiplication is widely used since it avoids trial division.
• The distinctive characteristics of the present work are:
– Precomputation of the quotient value (q) in each Montgomery iteration in order to increase the operating frequency.
– Use of a digit-serial computation approach for Montgomery's multiplication.
– Intermediate exponentiation values are kept in binary format instead of carry-save.
– Final conversion of the carry-save representation of the intermediate MP using carry-skip addition.
Conclusions
• The comparisons show that, to the authors’ knowledge, the proposed architecture outperforms all previously published results in terms of both throughput and area-delay product.
• For 512-, 1024- and 2048-bit multipliers, the proposed design doubles the throughput of the fastest reported result. For 1024-bit exponentiation in FPGA, the comparison also shows a factor-of-two improvement for similar or less area.
Questions…