Designing stream ciphers with scalable data-widths: a case study with HC-128

J Cryptogr EngDOI 10.1007/s13389-014-0071-0

REGULAR PAPER

Designing stream ciphers with scalable data-widths: a case studywith HC-128

Goutam Paul · Anupam Chattopadhyay

Received: 12 May 2013 / Accepted: 1 January 2014© Springer-Verlag Berlin Heidelberg 2014

Abstract For optimal utilization of the resources, a streamcipher running on a w-bit machine should ideally produce w

bits in every round of keystream generation. With increasingdata-widths of processors, scalability of stream ciphers is apertinent issue which is often addressed with ad hoc cipher-specific designs. In this paper, we propose a generic frame-work for designing stream ciphers with scalable data-widthsby combining multiple instances of a single stream cipherfollowing certain principles. We demonstrate using a casestudy on the fastest software stream cipher HC-128 in theeSTREAM final portfolio to show that the proposed designincreases the performance without decreasing the security ofthe cipher.

Keywords Data-width · eSTREAM · Keystream ·Random number generation · Scalable · Stream cipher

1 Introduction

Stream cipher is an important cryptographic primitive, thatis essentially a pseudo-random number generator (PRNG)seeded by a secret key shared between the sender and thereceiver. The sequence of pseudo-random bits available asthe output of a stream cipher is called keystream. Thissequence is bitwise XOR-ed with the plaintext at the sender

G. Paul (B)R. C. Bose Centre for Cryptology and Security, Indian StatisticalInstitute, Kolkata 700 108, Indiae-mail: [email protected]

A. ChattopadhyayInstitute for Communication Technologies and Embedded Systems,RWTH Aachen University, 52074 Aachen, Germanye-mail: [email protected]

end to produce the ciphertext, which is again bitwise XOR-ed with the same keystream (generated from the commonshared secret key) at the receiver end to reproduce the plain-text.

In 2004, a 4-year European research initiative waslaunched under the name European Network of Excellencefor Cryptology (ECRYPT) [4]. Under ECRYPT, a projectcalled eSTREAM [5] started with an aim of identifying “newstream ciphers that might become suitable for widespreadadoption”. Two profiles, one software and another hardware,were created for the project. The software profile called for“stream ciphers for software applications with high through-put requirements” and the hardware profile sought “hard-ware applications with restricted resources such as limitedstorage, gate count, or power consumption”. This multi-yeareffort running from 2004 to 2008 has identified a portfolioof promising new stream ciphers. Out of 34 initial submis-sions, only seven survived the rigorous analysis by world-experts, giving a final portfolio having a software profile offour candidates and a hardware category of three candidates.The stream cipher HC-128 [23] is the fastest candidate in thesoftware profile.

HC-128 is a lighter version of HC-256 [24] stream cipher.HC-256 had a 256-bit secret key and HC-128 was developedto meet the 128-bit key requirement imposed in the com-petition. At every round of keystream generation, HC-128outputs a keystream word of 32-bit. In recent years, sev-eral cryptanalytic attempts have been performed on HC-128[10,12,13,18,20]. However, none of them breaks the ciphercompletely and the cipher can be safely used in practice.

1.1 Motivation and contribution

With increasing demand of parallelism, the data-widthsof processors are also increasing monotonically. The era

123

J Cryptogr Eng

of 32-bit processors is already paving the way for 64-bit machines and beyond [1,2]. To satisfy the shrinkingenergy constraints, dedicated accelerators and customizedinstruction-sets are also commonly found in modern proces-sors [8] and heterogeneous multiprocessor System-on-Chips(SoCs).

Along the same direction, recent years have witnessedseveral attempts in hardware accelerator designs of softwareciphers [3,11,14,19,22]. We look into the problem froma completely different perspective. For optimal utilizationof the available hardware resources, a stream cipher run-ning on a w-bit processor should ideally produce w bitsin every round of keystream generation. Thus, it is impor-tant to investigate the design issues of 64-bit variants of theexisting 32-bit ciphers. There exist a few individual designattempts along the same line of consideration. For example,GGHN [7] gave a scheme for extending the byte-orientedRC4 to RC4(n, m), where n is the word-size of each inter-nal state elements and m is the output word-size. How-ever, all such designs are ad-hoc and cipher specific andprovide no guaranty on the security margin. In the cur-rent work, we study a general approach of securely scalingstream ciphers to higher data-widths, using HC-128 as a casestudy.

It is worth noting that the same idea can be pursuedfor hardware stream ciphers as well. However, the goal inthe hardware stream cipher design is to achieve minimumarea and therefore, scalability is not of prime importance.Also, the problem is not so interesting for block ciphersfor two reasons: (i) they do not usually operate as pseudo-random bit-stream generators (except in counter mode), and(ii) scalability may be achieved simply by extending the datatypes.

Below, we summarize the main contributions of our work.

– We propose a generic framework for combining minstances of an n-bit stream cipher to produce mn-bit out-put word at every round.

– The combining method ensures that the keystream bits ofthe new design have good randomness properties.

– The combining method ensures that the security is notdecreased by the combination; it is either increased or atleast remains the same as the base stream cipher.

– The proposed approach can scale to future processors ofincreasing data-widths.

– We demonstrate our scheme by combining two HC-128instances to produce a 64-bit cipher called BiHC-128. Wealso report the speed, throughout and area of the designand compare with relevant state-of-the-art schemes.

– Our study reveals that not all stream ciphers have the prop-erty of data-width extendibility and emphasizes that newstream ciphers must be designed with this feature to meetthe demand of increasing data-width of processors.

1.2 Paper outline

In Sect. 2, we present our generic framework for combiningmultiple stream cipher instances towards a scalable design.In Sect. 3, we give a design case study with the stream cipherHC-128 including the security analysis in Sect. 3.3. Next,in Sect. 4, we provide detailed performance evaluation andcomparison with state-of-the-art implementations. We con-clude with a brief sketch on possible future extensions inSect. 5.

2 A framework for building stream ciphers withscalable data-widths

A stream cipher has two components. The first componentis called a key scheduling algorithm (KSA) that spreads therandomness of the secret key inside the internal state of thecipher. The second component is called the pseudo-randomgeneration algorithm (PRGA) that updates the internal stateat every step and after each update produces a keystreamword.

We take m instances of an n-bit stream cipher, calledthe base cipher having secret keys k1, . . . km . Let the inter-nal state be denoted by s1, . . . sm and the correspondingkeystreams by z1, . . . zm . After the KSA is over, we can writesi = KSA (ki ), i = 1, . . . , m.

Let u be the state update function, and v be the keystreamgeneration function. For each instance i , 1 ≤ i ≤ m, we canwrite

st+1i = u(st

i ) and zt+1i = v(st+1

i ).

The superscript t denotes the PRGA round number. After theKSA is over, PRGA begins with t = 0.

We combine the internal states s1, . . . , sm into a singlegiant state S and use a combined state update function U onS. The new keystream generation function V works with Sand produces a mn-bit word at every stage.

St+1 = U (St ), Zt+1 = V (St+1).

In particular, when each si is an array of length l, each entrybeing an n-bit element, S can be constructed as an array oflength l, with each entry of S being the concatenation of thecorresponding elements of s1, . . . , sm , i.e., for 0 ≤ j ≤ l−1,

S[ j] = s1[ j]||s2[ j]|| . . . ||sm[ j]. (1)

The functions U and V would now work with the mn-bitelements of S to produce the next state and the output respec-tively.

The exact designs of U and V would depend on the partic-ular stream cipher. However we propose the following gen-

123

J Cryptogr Eng

eral criteria for their designs. First, the combining methodshould keep the

– element access pattern,– the structure of the state update function, and– the structure of the keystream generation function

as much similar to those of the individual instances as pos-sible. By structure of a function, we mean (a) the number ofarguments, (b) the number of steps, (c) the number and (d)nature of mathematical operators used. Secondly, if one sub-stitutes m = 1 in the combined design, then it should reduceto the base cipher.

If the above guidelines are followed, then the extendedcipher designed using the proposed framework is likely to beno less secure than the original base cipher. Suppose thereis a generic attack on the combined cipher. Take m = 1 andapply the attack. With m = 1, the combined state reducesto the state of the base cipher and the state update and thekeystream update functions also coincide with those of thebase cipher. Hence, the base cipher is also attacked. On theother hand, if there is an attack on the base cipher, that neednot necessarily imply that it would hold for the combinedcipher.

Note that not all ciphers would be extendible in the aboveframework. It depends on the internal structure and opera-tions of the cipher. We call the ciphers that are extendiblein the above framework as data-width extendible streamciphers. For example, RC4 [17], that has been the most pop-ular and the most widely deployed software stream ciphersince more than 20 years, is not data-width extendible. Thereason is as follows. RC4 is byte-oriented and its state is apermutation over Z28 . In spite of several cryptanalytic resultson RC4, it is quite secure if the IVs are used carefully andfew hundred initial keystream bytes are thrown away. If wecombine m RC4 instances using Eq. (1), then the new statewould no longer be a permutation over Z28m . Hence, theremay be some new weakness in the combined cipher, that wasnot originally present in normal RC4. In Sect. 3, We demon-strate that HC-128 is such a stream cipher by extending it toa 64-bit cipher BiHC-128.

2.1 Key scheduling issues

In the present work, we consider how to combine the m cipherinstances after the KSA of the individual ciphers are over.One could, in principle, consider how to design a combinedKSA for the combined internal state evolution. However,KSA is only a one-time effort and is performed for cipherinitialization. Therefore, whether the KSA is run individu-ally or run in a combined manner does not have an impact onthe cipher performance in the long run. One may also ques-tion about the key length issue for the combined scheme.

We would like to point out that we do not mandate that mdifferent secret keys have to be used. One can use a singlesecret key, as long as that of the base cipher, and then generatem different sub-keys out of it. There are several choices forthe sub-key generation algorithms. One may use round-keygeneration method as in block ciphers like AES [6]. Anotheroption is to use m different initialization vectors (IV’s) andcombine with the secret key to produce the m sub-keys. Athird and stronger method would be to run a single instanceof the base cipher with the single secret key and then takem non-overlapping segments (each being as long as the key)from its keystream.

3 Design case study with HC-128

We illustrate our framework discussed in Sect. 2 by applyingthe design guidelines on HC-128 and thereby combine twoHC-128 instances to produce a 64-bit variant of the cipher,called BiHC-128.

3.1 Description of HC-128

We begin with the operators first and then explain the datastructure and the functions used. All the operands are 32-bitnumbers. The following four binary operators are used inHC-128.

+: addition (mod 232), �: subtraction (mod 512),⊕: bit-wise exclusive OR, ‖: bit-string concatenation.

There are four unary operators, namely

�: right shift operator, �: left shift operator,≫: right rotation operator, ≪: left rotation operator.

Several arrays are used as data structures. All of them have32-bit words as elements. The internal state consists of two512 size arrays P and Q. The 128-bit secret key is storedin an array of length 4, denoted by K [0, . . . , 3]. For keyreuse across multiple sessions, a 128-bit initialization vectorI V [0, . . . , 3] is used. We denote the keystream word gener-ated at the t-th step by st , t = 0, 1, 2, . . ..

All the functions have 32-bit elements as arguments andoutputs. For key and IV loading, the following two functionsare used.

f1(x) = (x ≫ 7) ⊕ (x ≫ 18) ⊕ (x � 3).

f2(x) = (x ≫ 17) ⊕ (x ≫ 19) ⊕ (x � 10).

There are two state-update functions, described below.

g1(x, y, z) = ((x ≫ 10) ⊕ (z ≫ 23)) + (y ≫ 8).

g2(x, y, z) = ((x ≪ 10) ⊕ (z ≪ 23)) + (y ≪ 8).

Two additional functions are used for keystream generation.

123

J Cryptogr Eng

h1(x) = Q[x (0)] + Q[256 + x (2)].h2(x) = P[x (0)] + P[256 + x (2)].

Here, x (0), x (1), x (2) and x (3) denote the four bytes from rightto left of the 32-bit word x .

The KSA of HC-128 recursively loads the P and Q arrayto form the expanded key and IV. It then runs the cipher for1,024 steps to use the outputs to replace the table elements.There are a total of four main steps in the KSA as follows:

1. Key and IV are repeated twice. Let K [i + 4] = K [i] andI V [i + 4] = I V [i] for 0 ≤ i ≤ 3.

2. The key and IV are expanded into an array of 1,280 words,denoted by W [0, . . . , 1279]. W [i] is assigned the value ofK [i], for 0 ≤ i ≤ 7,I V [i − 8], for 8 ≤ i ≤ 15, andf2(W [i −2])+W [i −7]+ f1(W [i −15])+W [i −16]+i ,for 16 ≤ i ≤ 1279.

3. Update the tables P and Q with the array W as follows.P[i] = W [i + 256], for 0 ≤ i ≤ 511,Q[i] = W [i + 768], for 0 ≤ i ≤ 511.

4. Run the cipher 1,024 steps and use the outputs to replacethe table elements as follows. For i = 0 to 511,P[i] = (P[i] + g1(P[i � 3], P[i � 10], P[i � 511])) ⊕h1(P[i � 12])and for i = 0 to 511,Q[i] = (Q[i] + g2(Q[i � 3], Q[i � 10], Q[i � 511])) ⊕h2(Q[i � 12]).

The keystream is generated using the following algorithm.

3.2 Proposed design for 64-bit HC-128

Since HC-128 is an array-based cipher, the combined state isgiven by Eq. (1). Given two pairs of 32-bit state arrays fromtwo different initializations of HC-128, i.e., P1[0 . . . 511],Q1[0 . . . 511] from one instance and P2[0 . . . 511], Q2[0. . . 511] from another instance, the combined state (P, Q)is formed as follows. For i = 0, . . . , 511,

P[i] = (P1[i] � 32) | (P2[i]),

Q[i] = (Q1[i] � 32) | (Q2[i]).We modify the functions g1, g2, h1, h2 that operate on 32-bitwords to form respectively the new functions G1, G2, H1, H2

that operate on 64-bit words. The new g functions are givenby

G1(x, y, z) = ((x ≫ 10) ⊕ (z ≫ 23)) + (y ≫ 8),

G2(x, y, z) = ((x ≪ 10) ⊕ (z ≪ 23)) + (y ≪ 8).

We propose the following generic method of modifying theh functions for a combination of m HC-128 instances:

H1(x) = Q[x (0)] + Q[256 + x (4(m−1)+2)],

H2(x) = P[x (0)] + P[256 + x (4(m−1)+2].Note that the each entry of the new state-array is a 4m-byteword and 4(m − 1) + 2 < 4m. In our case study, m = 2and hence in H1 (or H2), the second index into the Q (or P)array is given by 256 + x (6).

We present the method of keystream generation for thenew cipher in Algorithm 1.

Note that the original paper of HC-128 [23] referred tosome primitives in SHA-256 for the design of the functionsf1, f2. These functions are used in the KSA and we keep thesame parameters for these functions in the combined design.We change only the other two functions (used in the PRGA ofHC-128), whose parameters were not justified theoretically.

The data-width extendibility of HC-128 can be explored toconstruct QuadHC-128 by merging four HC-128 instancesin a similar fashion. However, how far a given data-widthextendible cipher can be extended is a deep theoretical ques-tion and is beyond the scope of the current work. In this paper,we introduce the novel concept of data-width extendibilityand highlight this feature from the architectural view-point.Designers of stream ciphers often tend to create an ad-hoccipher which leads to under-utilization when the processorword-width is expanded. Though HC-128 possess the fea-ture of data-width extendibility, no documented evidenceexists of such extendibility-aware design being targeted dur-ing the genesis of HC-128. We emphasize that the data-widthextendibility feature should be considered systematically fordesigning stream ciphers with better longevity.

123

J Cryptogr Eng

3.3 Security analysis

In spite of a few cryptanalytic attempts [10,12,13,18,20] onHC-128, the cipher is practically secure. We have followedthe design principles of Sect. 2, and hence we claim that the64-bit BiHC-128 is no less secure than HC-128 and none ofcryptanalysis on HC-128 can pose a threat to BiHC-128.

Additionally, we perform an independent empirical eval-uation of the randomness of the BiHC-128 keystream. For astream cipher, if there is a keystream-based event that occurswith a probability that is away from that of the same event in auniformly random stream, the event is called a distinguisherand finding such an event is called a distinguishing attack. ForHC-128, the best known distinguishing attack [20] requiresa data complexity of 2152.537 keystream words. Since onecan form infinite number of events based on the keystream,resistance against distinguishing attack cannot be proved; theonly method of establishing good randomness properties fora new stream cipher is to run NIST-recommended standardstatistical tests for randomness [16].

To evaluate the randomness of BiHC-128 keystream, weran the NIST test suite on 256 different keystreams, eachgenerating 4,096 rounds of 64-bit output. The individual HC-128 states were generated from randomly chosen keys andIVs. This gives a total of 262,144 bits per keystream. Each testthat we performed satisfied the minimum criteria for pass-rate.

4 Performance analysis

Though HC-128 is primarily presented as a software streamcipher, its actual deployment may vary depending on theapplication scenario. Researchers have presented HC-128implementation results on general purpose processors,embedded processors as well as dedicated accelerators. Thisis natural given the blurring of the distinction between generalpurpose application and embedded application domains. Onthe one hand, there is a drive in general purpose processordesign community to make application-specific or domainspecific instruction support, e.g. Intel AES-NI [8]. On theother hand, embedded processors are bulking up to supportincreasing application parallelism requirements. For exam-ple ARM, a leading embedded processor designer, haveannounced 64-bit architecture [1], with support for crypto-graphic instructions. Recently, a detailed evaluation of theperformance of HC–128 on GPGPU architectures have beenreported at [9].

To make a fair comparison across the complete imple-mentation landscape, the performance analysis is done fordesktop processors, a customizable embedded processor anda dedicated hardware accelerator.

4.1 Experiment with desktop processors

We benchmark the algorithm presented in this paper on two64-bit desktop processors-an AMD Phenom™II X6 1100T

123

J Cryptogr Eng

Table 1 Performance benchmarking on general purpose processors

Algorithm Keystream generation Message encryption Machine Frequency(GHz)

Cycles perbyte

Throughput(Gbps)

Cycles perbyte

Throughput(Gbps)

HC-128 3.9 6.77 4.5 5.87 AMD PhenomTM 3.3

BiHC-128 2.2 12.0 2.6 10.15 II X6 1100T

HC-128 3.5 7.54 4.1 6.44 Intel(R) Xeon(R) 3.3

BiHC-128 2.7 9.77 3.0 8.8 CPU X5680

HC-128 [23] − − 3.1 4.2 Intel Pentium M 1.6

HC-128 [5] − − 4.43 6.07 Intel Pentium M 1.7

HC-128 [5] − − 3.76 5.96 Intel Pentium 4 2.8

HC-128 [5] − − 2.86 6.15 AMD AthlonTM64X2 4200+ 2.2

HC-128 [5] − − 3.9 1.03 Alpha EV6 0.5

HC-128 [5] − − 2.75 2.55 HP 9000/785 0.875

and an Intel(R) Xeon(R) CPU X5680-both running at 3.3GHz clock and having Ubuntu 12.04 OS. The time is mea-sured by running the clock() function. The code is compiledwith standard gcc compiler version 4.4.6, with -O3 opti-mization mode. The default HC-128 [23] implementation isdownloaded from [25] and modified to return the keystreamgeneration throughput only. The results are presented in theTable 1.

It can be observed from Table 1 that for comparableprocessing power, BiHC-128 delivers approximately dou-ble the throughput of HC-128. Note that this throughputimprovement comes for free as 64-bit support is common-place for all modern general purpose processors.

4.2 Experiment with embedded processors

The performance of HC-128 is relatively less studied onembedded processors [15]. One of the key reason is the largememory space requirement of HC-128, which is uncom-mon in low-end microcontrollers. On the other hand, for thiscurrent work, we require both 32-bit and 64-bit variant ofan embedded processor. To solve this issue, a customizableprocessor design framework [21] is selected for our experi-ments.

Customizable application-specific processors are increas-ingly used in modern SoCs for balancing performanceand flexibility constraints. Synopsys Processor Designer(PD) [21] (http://www.synopsys.com/systems/blockdesign/processordev/pages/default.aspx) provides several starterdesigns for modeling such processors. We selected a sim-plescalar RISC processor, termed PD_RISC. It contains sixpipeline stages, sixteen 32-bit registers, fully bypassed arith-metic logic unit. The instruction-set supports a wide vari-ety of logical, arithmetic, control and load-store instructions.Both the program memory and data memory are accessedsynchronously. The tool-suite, which can be generated from

the processor description, includes a C compiler. Synthe-sizable RTL code can be generated from the description,too. Importantly, the C compiler supports 64-bit operations.This is enabled by a compiler flag -flong-long. This directsthe compiler to emulate 64-bit operations via 32-bit oper-ations. Additionally, user may customize the datapath toadd 64-bit operations. For quick design exploration, thekeystream generation latency is measured via the cycle-accurate instruction-set simulator, which also models syn-chronous RAM. For the final performance estimation of areaand clock frequency, automatic hardware generation flowfrom PD is used.

The compilation of HC-128 on PD_RISC resulted in akeystream generation throughput of 37.17 cycles/byte. Thiscould be improved further by adding multiple, partitionedmemory structures for P , Q array. As this does not affect ourrelative benchmarking, this experiment is skipped.The initial compilation of BiHC-128 on PD_RISC resultedin a keystream generation throughput of 96 cycles/byte. Thisis expected given the 64-bit operations are emulated by 32-bitones. To improve the throughput few customizations are per-formed. First, 64-bit addition, XOR and shift operations areadded to the datapath. The operations are performed over aset of special-purpose 64-bit registers, which are also addedto the processor resources. The amount of memory accessoperations for 64-bit P , Q memory are emulated via a setof memory movement instructions, which resulted in poorperformance. To improve that, the memories are split intotwo partitions containing the upper and lower 32-bit words.This potentially allows parallel memory load-stores, thoughit is not exploited by the C compiler in PD_RISC. The afore-mentioned improvements resulted in a keystream generationthroughput of 29.36 cycles/byte.

To compare the accurate performance improvement, boththe basic PD_RISC and the customized model are processedthrough PD to generate synthesizable Verilog model. The

123

http://www.synopsys.com/systems/blockdesign/processordev/pages/default.aspx

http://www.synopsys.com/systems/blockdesign/processordev/pages/default.aspx

J Cryptogr Eng

Table 2 HC-128, BiHC-128 on customizable embedded processor

Algorithm Core area Throughput Throughput/area(KGates) (Gbps) (Mbps/KGates)

HC-128 27.64 0.22 7.96

BiHC-128 29.57 0.27 9.13

Fig. 1 Pipeline structure for HC-128 keystream generation

hardware description is synthesized with Synopsys DesignCompiler, (dc_shell version G-2012.06), topographical modewith compile_ultra) for 65nm CMOS technology. The syn-thesis results are presented in Table 2. For the additional P ,Q arrays extra 4 KB of SRAM is needed. Both the designsreported a highest achievable throughput of 1.0 GHz.

As can be observed from the Table 2, both throughput andarea is increased while migrating from HC-128 to BiHC-128.A fine comparison of the benefit can be done by consider-ing the metric of Throughput/Area, showing minor overallimprovement. It is very likely that with the support of paral-lel multiple memory load-store and 64-bit instruction width,this improvement will be much stronger.

4.3 Hardware accelerator implementation of BiHC-128

An efficient hardware accelerator for HC-128 is recently pre-sented in [3]. We exploited the same basic pipeline structureas proposed there and extended the architecture to supportBiHC-128. The pipeline architecture is shown in Fig. 1. Themodulo minus (�) operations are indicated with italics.

The key challenge to efficient accelerator implementa-tion is the efficient distribution of computation and storageaccesses across the pipeline to improve the throughput andsimultaneous sharing of resources to minimize the area. Tokeep the area overhead low, block RAMs are used for Pand Q array. The RAMs are dual-port and synchronous innature, with 1 cycle latency. This requires the addressing ofRAM and data access to be split by one pipeline stage. Thisis achieved in the 4-stage pipeline distribution.

To minimize the critical path, the computation is evenlydistributed in tandem with the accessed data. The abovepipeline distribution allows one iteration of keystream gen-eration in the pipeline, resulting in a throughput of 1 wordper 4 cycles or 1 byte/cycle. This architecture is extendedtowards BiHC-128 with the following modifications:

– The storage is partitioned in P_high, P_low and Q_high,Q_low arrays since the array words are always simultane-ously accessed.

– For the 64-bit computation, the operands are merged and64-bit addition, XOR and shift operations are invoked.This turned out to generate better performance than doing32-bit operations and accounting for additional shifts/carrypropagation.

– The initialization of BiHC-128 is similar to HC-128 exceptfor the duplicate operations. This is integrated in the samepipeline structure.

Both the designs are synthesized with Synopsys Design Com-piler, (dc_shell version G-2012.06), topographical mode withcompile_ultra) for 65nm CMOS technology. The results arepresented in the Table 3.

4.4 Summary of performance benchmarking

The performance of the proposed BiHC-128 algorithm canbe best analyzed by comparing the area efficiency (through-put/area) results for different platforms, presented in the Fig.2. Since the area for general purpose processors remainsthe same for HC-128 and BiHC-128, an unit area forarea efficiency computation is assumed. For general pur-pose processors, the datapath remains underutilized for HC-

Table 3 HC-128, BiHC-128 hardware implementation

Algorithm Core area Frequency(GHz)

Keystream generation Throughput/area(Gbps/KGates)

Comb.(KGates)

Seq.(KGates)

RAM(KByte)

(Cycles/byte) (Gbps)

HC-128 6.43 1.76 9 1.67 1 13.36 1.6

BiHC-128 10.19 4.46 13 1.2 0.5 19.2 1.3

123

J Cryptogr Eng

Sca

led

Th

rou

gh

pu

t/A

rea

Fig. 2 Comparing area efficiency for different platforms

128 computation. Therefore, the area efficiency improve-ment is maximum for BiHC-128. For customizable proces-sors, adding specialized datapaths can be highly benefi-cial for BiHC-128. For those, there is still an area effi-ciency improvement though, much less compared to thegeneral purpose processors. Finally, for dedicated hard-ware accelerators, the area efficiency decreases slightly.This is accounting for the additional logic required for thekeystream generation, which is different from the initializa-tion phase. In this case, from the perspective of pure perfor-mance, multiple identical HC-128 accelerators will be ben-eficial.

5 Conclusion and future work

Increasing data-width of modern processors provides anopportunity for increased throughput of cryptographic prim-itives. This paper presents an early work in this directionby proposing a novel framework for combining multipleinstances of a single stream cipher to produce the keystreamexploiting the complete available data-width.

The concept is validated with an experiment on HC-128, the fastest software stream cipher in the repertoire ofeSTREAM finalists. We show that significant performanceimprovement is possible for general purpose processors. Itis also observed that the approach gives diminishing returnsas one moves to more and more customized processing plat-form. Our method can be easily scaled to architectures offer-ing higher bit-width instruction-sets such as [2].

Finally, our work emphasizes the significance of design-ing new stream ciphers with the HC-128 like data-widthextendible property. As part of our future work, we would liketo identify and analyze other data-width extendible streamciphers amongst the existing ones. We also plan to study thecriteria for designing a new stream cipher with scalable data-widths from scratch. An interesting open question in thiscontext is how to characterize the randomness properties of

the resulting combined architecture from a theoretical pointof view.

Acknowledgments We sincerely thank the anonymous reviewers fortheir feedback and suggestions that helped in improving the technical aswell as the editorial quality our paper. We also express our gratitude tothe Centre of Excellence in Cryptology (CoEC), Indian Statistical Insti-tute, Kolkata, funded by the Government of India, for partial supporttowards this project.

References

1. ARM 64-bit Processor. http://www.computerworld.com/s/article/9223894

2. Advanced vector extensions (AVX). http://software.intel.com/en-us/avx

3. Chattopadhyay, A., Khalid, A., Maitra, S., Raizada, S.: Designinghigh-throughput hardware accelerator for stream cipher HC-128.In: IEEE ISCAS, pp. 1448–1451 (2012)

4. ECRYPT—network of excellence in cryptology. IST-2002-507932. http://www.ecrypt.eu.org/ecrypt1

5. eSTREAM: the ECRYPT stream cipher project. http://www.ecrypt.eu.org/stream. Accessed on Nov 2013

6. Federal information processing standards (FIPS) publication197. Advanced encryption standard (AES). http://csrc.nist.gov/publications/PubsFIPS.html. Accessed 26 Nov 2001

7. Gong, G., Gupta, K.C., Hell, M., Nawaz, Y.: Towards a generalRC4-like keystream generator. In: CISC. Lecture Notes in Com-puter Science, vol. 3822, pp. 162–174. Springer (2005)

8. Intel AES instruction set. http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-aes-instructions-set/

9. Khalid, A., Bagchi, D., Paul, G., Chattopadhyay, A.: OptimizedGPU implementation and performance analysis of HC series ofstream ciphers. In: Proceedings of the 15th International Confer-ence on Information Security and Cryptology (ICISC), Nov 28–30,Seoul, Korea. Lecture Notes in Computer Science (LNCS), vol.7839, pp. 293–308. Springer (2012)

10. Kircanski, A., Youssef, A.M.: Differential fault analysis of HC-128.In AFRICACRYPT 2010. Lecture Notes in Computer Science, vol.6055, pp. 360–377. Springer

11. Kitsos, P., Kostopoulos, G., Sklavos, N., Koufopavlou, O.: Hard-ware implementation of the RC4 stream cipher. In: Proceedingsof 46th IEEE Midwest Symposium on Circuits & Systems, pp.1363–1366. Cairo, Egypt (2003)

12. Liu, Y., Qin, T.: The key and IV setup of the stream ciphers HC-256and HC-128. In: International Conference on Networks Security,Wireless Communications and Trusted Computing, pp. 430–433.(2009)

13. Maitra, S., Paul, G., Raizada,S., Sen, S., Sengupta, R.: Some obser-vations on HC-128. In: Designs, Codes and Cryptography. vol. 59,no. 1–3 pp. 231–245. (2011)

14. Matthews Jr. D.P.: Methods and apparatus for accelerating ARC4processing. US Patent Number 7403615, Morgan Hill, CA. http://www.freepatentsonline.com/7403615.html. Accessed July 2008

15. Meiser, G., Eisenbarth, T., Lemke-Rust, K., Christof Paar.: Effi-cient implementation of eSTREAM ciphers on 8-bit AVR micro-controllers. In: SIES. pp. 58–66. (2008)

16. NIST special publication 800–22rev1a. A statistical testsuite for the validation of random number generators andpseudo random number generators for cryptographic applica-tions. http://csrc.nist.gov/groups/ST/toolkit/rng/documentation_software.html. Accessed April 2010

17. Paul, G., Maitra, S.: RC4 Stream Cipher and Its Variants. CRCPress (2011)

123

http://www.computerworld.com/s/article/9223894

http://www.computerworld.com/s/article/9223894

http://software.intel.com/en-us/avx

http://software.intel.com/en-us/avx

http://www.ecrypt.eu.org/ecrypt1

http://www.ecrypt.eu.org/stream

http://www.ecrypt.eu.org/stream

http://csrc.nist.gov/publications/PubsFIPS.html

http://csrc.nist.gov/publications/PubsFIPS.html

http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-aes-instructions-set/

http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-aes-instructions-set/

http://www.freepatentsonline.com/7403615.html

http://www.freepatentsonline.com/7403615.html

http://csrc.nist.gov/groups/ST/toolkit/rng/documentation_software.html

http://csrc.nist.gov/groups/ST/toolkit/rng/documentation_software.html

J Cryptogr Eng

18. Paul, G., Maitra, S., Raizada, S.: A theoretical analysis of the struc-ture of HC-128. In: IWSEC 2011. Lecture Notes in Computer Sci-ence, vol. 7038, pp. 161–177. Springer

19. Sen Gupta, S., Chattopadhyay, A., Sinha, K., Maitra, S., Sinha,B.P.: High performance hardware implementation for RC4 streamcipher. In: IEEE Transactions on Computers. doi:10.1109/TC.2012.19 (2012)

20. Stankovski, P., Ruj, S., Hell, M., Johansson, T.: Improved distin-guishers for HC-128. In: Designs, Codes and Cryptography. vol.63, no. 2. pp. 225–240 (2012)

21. Chattopadhyay, A., Meyr, H., Leupers, R.: LISA: a uniform ADLfor embedded processor modelling, implementation and softwaretoolsuite generation. In: Mishra, P., Dutt, N. (eds.) ProcessorDescription Languages, vol. 1, (Systems on Silicon), pp. 95–130,Morgan Kaufmann (2008)

22. Tran, T.H., Lanante, L., Nagao, Y., Kurosaki, M., Ochi, H.: Hard-ware implementation of high throughput RC4 algorithm. In: IEEEISCAS. pp. 77–80 (2012)

23. Wu, H.: The stream cipher HC-128. http://www.ecrypt.eu.org/stream/hcp3.html

24. Wu, H.: A new stream cipher HC-256. In: FSE. Lecture Notes inComputer Science, vol. 3017, pp. 226–244. Springer (2004)

25. Wu, H.: The stream cipher HC-128. http://www3.ntu.edu.sg/home/wuhj/research/hc/index.html

123

http://dx.doi.org/10.1109/TC.2012.19

http://dx.doi.org/10.1109/TC.2012.19

http://www.ecrypt.eu.org/stream/hcp3.html

http://www.ecrypt.eu.org/stream/hcp3.html

http://www3.ntu.edu.sg/home/wuhj/research/hc/index.html

http://www3.ntu.edu.sg/home/wuhj/research/hc/index.html

Documents

Designing stream ciphers with scalable data-widths: a case study with HC-128