
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2019

Deep-Learning Side-Channel Attacks on AES

MARTIN BRISFORS

SEBASTIAN FORSMARK

KTH SKOLAN FÖR ELEKTROTEKNIK OCH DATAVETENSKAP


Sammanfattning

Nyligen har stora framsteg gjorts i att tillämpa djupinlärning på sidokanalattacker. Detta medför ett hot mot säkerheten för implementationer av kryptografiska algoritmer. Konceptuellt är tanken att övervaka ett chip medan det kör kryptering för informationsläckage av ett visst slag, t.ex. energiförbrukning. Man använder då kunskap om den underliggande krypteringsalgoritmen för att träna en modell för att känna igen nyckeln som används för kryptering. Modellen appliceras sedan på mätningar som samlats in från ett chip under attack för att återskapa krypteringsnyckeln.

Vi försökte förbättra modeller från ett tidigare arbete som kan finna en byte av en 16-bytes krypteringsnyckel för Advanced Encryption Standard (AES)-128 från över 250 mätningar. Vår modell kan finna en byte av nyckeln från en enda mätning. Vi har även tränat ytterligare modeller som kan finna inte bara en enda nyckelbyte, utan hela nyckeln. Vi uppnådde detta genom att ställa in vissa parametrar för bättre modellprecision. Vi samlade vår egen träningsdata genom att fånga en stor mängd strömmätningar från ett Xmega 128D4-mikrokontrollerchip. Vi samlade också mätningar från ett annat chip - som vi inte tränade på - för att fungera som en opartisk referens för testning. När vi uppnådde förbättrad precision märkte vi också ett intressant fenomen: vissa labels var mycket enklare att identifiera än andra. Vi fann också en stor varians i modellprecision och undersökte dess orsak.

Nyckelord

AES, maskinlärning, sidokanalattack, kryptering


Abstract

Recently, substantial progress has been made in applying deep learning to side channel attacks. This poses a threat to the security of implementations of cryptographic algorithms. Conceptually, the idea is to monitor a chip while it is running encryption for information leakage of a certain kind, e.g. power consumption. One then uses knowledge of the underlying encryption algorithm to train a model to recognize the key used for encryption. The model is then applied to traces gathered from a victim chip in order to recover the encryption key.

We sought to improve upon models from previous work that can recover one byte of the 16-byte encryption key of Advanced Encryption Standard (AES)-128 from over 250 traces. Our model can recover one byte of the key from a single trace. We also trained additional models that can recover not only a single keybyte, but the entire key. We accomplished this by tuning certain parameters for better model accuracy. We gathered our own training data by capturing a large number of power traces from an Xmega 128D4 microcontroller chip. We also gathered traces from a second chip - which we did not train on - to serve as an unbiased set for testing. Upon achieving improved accuracy we also noticed an interesting phenomenon: certain labels were much easier to identify than others. We also found large variance in model accuracy and investigated its cause.

Keywords

AES, Machine Learning, Side Channel Attack, Encryption


Contents

1 Introduction
   1.1 Existing research
   1.2 Purpose and goal

2 Background
   2.1 AES
   2.2 Side channel attacks
   2.3 Machine learning

3 Methodology
   3.1 Scope
   3.2 Hardware
   3.3 Training
   3.4 Testing

4 Results
   4.1 Xmega 1 first key byte recovery
   4.2 Xmega 1 full key recovery
   4.3 Xmega 2 first key byte recovery
   4.4 Xmega 2 full key recovery
   4.5 Sbox variability impact
   4.6 Model variance
   4.7 Chip variability

5 Conclusion

6 Future Work

Appendices


1 Introduction

The cryptographic strength of the Advanced Encryption Standard (AES) is well documented. Based on its perceived dependability, a lot of contemporary security applications rely on AES. While the algorithm itself can be shown to be very resilient, the computational steps need to be implemented in hardware. Implementations of the AES algorithm have been shown to leak information about the encryption process when the algorithm runs.

Attacking the implementation rather than the algorithm itself is called a Side Channel Attack (SCA), and it has been used to successfully recover the encryption key from a device running AES [19]. With Machine Learning (ML) becoming more prevalent recently, it has been suggested that machine learning can be used as a tool for SCA. Results showing successful partial recovery of the encryption key have been presented [2]. However, commonly the results presented used the same chip for attacking as the one used for training, or attacked a circuit board of identical design as the one trained on. The similarity of the device used for training versus the device attacked is a big factor in how easy it is to train an accurate model. Furthermore, using machine learning to attack the same device as trained on is sometimes unnecessary: Correlation Power Analysis (CPA) can recover the entire key of unprotected implementations of AES on a microcontroller in 50 traces, whereas the training data for the machine learned model usually has to be in the thousands.

1.1 Existing research

Applying ML to SCA is a developing field. Supervised learning techniques have been applied to compromise different implementations of cryptographic algorithms. Studies have successfully used support vector machines [7, 1, 8, 13, 14], random forests [13, 14], self-organizing maps [13] and neural nets [4].

Deep learning is a subcategory of neural nets in which several layers are used to find more complex correlations. Maghrebi et al. were the first to study the application of deep learning to implementations of cryptographic algorithms [16]. They laid the groundwork for much of the theoretical understanding in the field. They published experimental results in which they applied deep learning models such as Multi Layer Perceptrons (MLP) and Convolutional Neural Networks (CNN) to SCA. The results show that these more complex models are more effective than simpler ML models and template attacks.

Benadjila et al. [2] published ”Study of Deep Learning Techniques for Side-Channel Analysis and Introduction to ASCAD Database”, a more concrete evaluation of hyper-parametrization and model performance. Up until that point, researchers had often kept their parameter data secret. The fact that this paper published its data as well as its code is what enabled our work, since they made their models reproducible and their parameters known. These are the two papers upon which the foundation of our work is based.


1.2 Purpose and goal

The possibility of successfully attacking an AES implementation with ease is cause for concern, due to the reliance on AES for encryption and authentication. The goal of this thesis is to use machine learning to train a model that can successfully recover the AES encryption key. We then investigate the possibility of successfully attacking the same kind of chip mounted on a different Printed Circuit Board (PCB). Specifically, we will focus on the possibility of recovering the key with very few attack traces. We present our findings in order to help shed light on potential security risks and provide a starting point for designing countermeasures.

2 Background

2.1 AES

AES, also known by its original name Rijndael after its creators Vincent Rijmen and Joan Daemen, is a symmetric block cipher. It replaced the Data Encryption Standard (DES) when it was standardized by NIST in November 2001 [17]. NIST chose a 128 bit block version of Rijndael with three different key sizes: 128, 192, or 256 bits. For these experiments we are using AES-128, which uses a 128 bit key divided into 16 8-bit keybytes [3]. On a microcontroller chip, each keybyte is operated upon in sequence, which is why there will be 16 repeating patterns in our traces (presented later in Fig. 5).

The AES-128 algorithm repeats a pattern of transformations for ten rounds. For each of the ten rounds except the last one, the process is: byte substitution, row shifting, column mixing and finally the key addition for that round. The details of these steps are not relevant to this paper, but in essence the steps provide diffusion and ensure that the data is pseudo-randomly arranged while being encrypted.

The byte substitution transformation 'ByteSub', implementing the AES Sbox, is however relevant to our work. We are measuring information leakage from this operation during the first round. The Sbox is a non-linear byte substitution: you input an 8-bit value and receive a new one, and as it is a bijective mapping it is deterministic. The keybyte and the plaintext byte are XORed with each other, creating a new 8-bit value which is then passed through the Sbox. The result of this operation is referred to in this thesis as the Sbox output.
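As a minimal illustration of how a trace gets its label, a short Python sketch (our own illustration; it assumes Sbox is the standard 256-entry AES S-box lookup table defined elsewhere, and the byte values are made up):

# The label for a trace is the Sbox output of the first-round
# AddRoundKey result: Sbox(plaintext_byte XOR key_byte).
plaintext_byte = 0x3A                     # hypothetical plaintext byte
key_byte = 0x7F                           # hypothetical keybyte
label = Sbox[plaintext_byte ^ key_byte]   # an 8-bit class label in [0, 255]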

2.2 Side channel attacks

Side channel attacks were pioneered by Paul Kocher. He first thought of them in the context of timing how long it took to perform private key operations as part of the Diffie-Hellman key exchange [12]. By using time measurements as an information leakage point, obfuscated pieces of information could be inferred. As the field developed, several new avenues of information leakage have emerged, such as electromagnetic radiation and the power consumption of the hardware.


CPA measures the correlation between the processed data and the power consumption. It has been proven successful as a tool for breaking implementations of AES [18] since 2004. In our own CPA analyses of our microcontroller implementations of AES, usually no more than 50 traces are needed in order to recover the key. This number of traces is necessary in order to filter out irregular measurements and determine the correct correlation.
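For reference, a simplified sketch of how such a CPA attack on one keybyte can be implemented (our own illustration, not the analysis code used for the numbers above; it assumes numpy arrays of traces and plaintext bytes, a 256-entry S-box table as a numpy array, and the common Hamming-weight leakage model):

import numpy as np

HW = np.array([bin(x).count('1') for x in range(256)])  # Hamming-weight lookup table

def cpa_keybyte(traces, plaintexts, sbox):
    # For every key guess, correlate the hypothetical leakage (Hamming weight
    # of the first-round Sbox output) with every sample point of the traces,
    # and return the guess with the strongest absolute correlation peak.
    centered = traces - traces.mean(axis=0)
    best_guess, best_corr = 0, 0.0
    for guess in range(256):
        leakage = HW[sbox[plaintexts ^ guess]].astype(float)
        leakage -= leakage.mean()
        num = centered.T @ leakage
        den = np.sqrt((centered ** 2).sum(axis=0) * (leakage ** 2).sum()) + 1e-20
        peak = np.abs(num / den).max()
        if peak > best_corr:
            best_guess, best_corr = guess, peak
    return best_guess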

2.3 Machine learning

Machine learning, and the idea of making a computer that can learn new things, dates back to the 1950s. Computer scientists like Alan Turing theorized about artificial intelligence, and Arthur Samuel coined the term Machine Learning [22]. It wasn't until the 1990s, however, that machine learning as we understand it today was being successfully implemented. These early implementations, however, tended to be small in scope. Ever since then, with the production of cheaper and more powerful computational hardware, there has been an increase in the complexity of what machine learning can accomplish [23, 5].

Machine learning can be used for data classification, wherein an input datum d_i ∈ R^k with k input features is mapped to a set of all possible classes C, where d_i is an instance of the full data set D. A way of representing the members c_j of C is to use a one-hot encoded truth vector t, such that

t_j = 1 if c_j is the true class, and t_j = 0 otherwise, for 0 ≤ j < |C|.

By doing so we can let our model make predictions on new data, so that each prediction is a probability (expressed as a percentage) of c_j being the class of the data according to the model.
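A minimal sketch of what such truth vectors look like in practice, assuming the Keras utility to_categorical (the thesis trains with Keras) and some hypothetical Sbox-output labels:

import numpy as np
from keras.utils import to_categorical

labels = np.array([219, 0, 57])                  # hypothetical Sbox-output classes
truth = to_categorical(labels, num_classes=256)  # one truth vector per datum, |C| = 256
print(truth.shape)        # (3, 256)
print(truth[0].argmax())  # 219: the single position where t_j = 1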

A lot of different ways of achieving machine learning have been conceived, and broadly speaking we can recognize four categories of approaches: supervised, unsupervised, reinforcement and evolutionary learning. This paper focuses on supervised learning, but the other categories are mentioned for completeness and further reading.

Even within the broader category of supervised learning we can identify several distinct implementations, e.g. support vector machines, decision trees and artificial neural networks. Training a computer with such algorithms lets it identify a decision boundary between data points in the input space, delimiting a perimeter for distinct classes of data (see Fig. 1).

In the case of a neural net, the first step in determining this boundary is achieved by defining a loss function. We then minimize the loss function, by calculating its gradient as a function of the internal parameters of the network, using an optimization function.

A basic way of structuring a neural net is by using the MLP algorithm. The principal idea of MLP is that neurons form a lattice-like graph divided into discrete layers of neurons, such that the value of each neuron is defined by

n_{i,j} = f( Σ_k w_k · n_{i−1,k} + b_i ),

where w represents the weight variables for the connections between nodes of the graph, b_i is a bias variable for layer i, and f is an activation function. Thus each neuron (enumerated by j) in layer i is a function of the values of the connected neurons in the previous layer. Popular activation functions include ReLU, Sigmoid, Tanh and Softmax.

Figure 1: Decision boundary (an example decision boundary drawn using a support vector machine).

Figure 2: Example MLP neural net.

The optimizer function's job is to find a minimum for the output of the loss function. It is common to make use of the algorithm Stochastic Gradient Descent, or any of its more advanced adaptations such as RMSProp. The mechanism through which gradient descent works is to compute the gradient of the loss function with respect to all of the internal variables and perform what is called back propagation. The internal variables, usually the weight variables, are adjusted in the direction of the steepest descent of the surface spanned by the loss function [21].
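The core update is simple; a toy sketch of a single plain gradient-descent step (our own illustration with made-up numbers, not the RMSProp variant actually used for training):

import numpy as np

def sgd_step(weights, gradient, learning_rate=0.01):
    # Move the weights a small step against the gradient of the loss,
    # i.e. in the direction of steepest descent.
    return weights - learning_rate * gradient

w = np.array([0.5, -0.3])
grad = np.array([0.2, -0.1])   # hypothetical dLoss/dw obtained via back propagation
w = sgd_step(w, grad)          # -> [0.498, -0.299]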

The loss function has the purpose of quantifying the error of a prediction vector compared to a target vector. Ostensibly, this can be defined in whatever way is desirable for the optimization problem, but a number of commonly adopted loss functions exist. These include Mean Squared Error (MSE) and similar methods such as Mean Squared Logarithmic Error, as well as Mean Absolute Error and Cross Entropy. We are using Categorical Cross Entropy, which is defined as

CE = −log( e^{s_p} / Σ_j^{C} e^{s_j} ),

where s_j denotes the classifier score (prediction) for each class in C and s_p denotes the score for the positive class, reflecting our use of a one-hot encoding in which only one class is positive for each input datum [6].
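A small numerical sketch of this definition (our own example with made-up scores, using numpy rather than the Keras built-in loss):

import numpy as np

def categorical_cross_entropy(scores, true_class):
    # CE = -log( e^{s_p} / sum_j e^{s_j} ), where s_p is the score of the true class
    exp_scores = np.exp(scores - scores.max())   # subtract the max for numerical stability
    softmax = exp_scores / exp_scores.sum()
    return -np.log(softmax[true_class])

scores = np.array([2.0, 0.5, -1.0])              # hypothetical scores for three classes
print(categorical_cross_entropy(scores, 0))      # ~0.24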

Machine learning is a very broad topic, and this section barely scratches the surface. For further reading the reader is directed to [5].

3 Methodology

3.1 Scope

Our work focuses on the application of a deep learning attack, using MLP, on AES encryption implemented on Xmega microcontrollers. There are other deep learning models which could be used, such as Convolutional Neural Networks (CNN) or Recurrent Neural Networks (RNN). There are also other encryption schemes and other hardware types such as FPGAs; however, these are outside the scope of this thesis.

Our focus stems from the fact that, in the paper by Benadjila et al. [2], MLP performed approximately as well as CNN for Sbox classification when there was no trace desynchronization. MLPs can also be trained significantly faster due to having a fraction of the variables. We believe that trace synchronization is a trivial task for any dedicated attacker. As such, the advantage of CNNs in this respect was deemed irrelevant for our purposes.

3.2 Hardware

For training and testing we required a large quantity of traces so that the network could filter out noise. These traces were gathered using ChipWhisperer, a toolchain for embedded hardware security research [10].

We refer to one ATxmega128D [9] as Xmega 1 and another as Xmega 2 to signify which PCB the microcontroller was mounted on, see Fig. 3 and 4. The Xmega mounted on the PCB that is included with ChipWhisperer is referred to as Xmega 1 and will be used for training models in an attempt to attack Xmega 2.

Figure 3: ChipWhisperer connected to Xmega 1

Figure 4: ChipWhisperer connected to Xmega 2

3.3 Training

The power traces were contained within a discrete measurement of 3000 time steps (Fig. 5). Using knowledge of AES it was relatively easy to pinpoint the approximate positions of the different Sbox rounds. The portion of Fig. 6 marked between the arrows contains the entire first Sbox calculation and its leakage point. Since we assumed that the power leakage measured during these rounds was correlated with the Sbox output, we could classify these traces using Sbox outputs as labels. Each trace would then have a value between 0 and 255 (8 bits) to which it should map in order to evaluate as correct.

The processing center for our training was KTH PDC (Parallelldatorcentrum) [20]. The code was written as Python scripts using the Keras library [11], which acts as a high-level interface on top of TensorFlow.

Figure 5: A full AES power trace in ECB mode (all 3000 time steps, captured with ChipWhisperer).

Figure 6: Close-up of the relevant part of the trace (the first 200 time steps, including the first Sbox leakage).

Training was conducted in an iterative way: each day we discussed potential improvements and created a series of models which evaluated a variable, or an implementation variation, at regular intervals. Our primary variables to adjust were

• Learning rate

• Number of epochs

• Trace size

• Number of layers

We also adjusted several other variables in order to exclude them from consideration. The variables listed above were the ones we found to have a positive effect on model accuracy when tuned.

Before we began using our own traces we attempted to improve upon Benadjila et al.'s MLP model, which trained and tested on the ASCAD database. The two most significant improvements we found were learning rate and the number of layers. The model's defined learning rate was quite low (1/100 of the Keras default value), and we also wanted to investigate if a more complex model with more layers would perform better. As seen in Fig. 7 we already achieved significant performance improvements from tuning these parameters. These particular improvements likely indicate the same underlying problem: increasing the complexity (number of layers) and increasing the rate at which the network learns both counteract underfitting, so it would appear that the benchmark model was slightly underfit to the data.

Figure 7: Improvements compared to the benchmark model (rank versus number of traces for the benchmark MLP model, a model with 2 additional layers, and a model with a higher learning rate).

Since these variables showed promise we decided to try them on our own data set collected using ChipWhisperer. We also decided to use validation accuracy as our monitor, so that we select the trained models which perform best in our validation step. Our increased number of traces compared to the original allowed us to use 30% of our data for validation.

X8. Our parameter adjustments gave different results when applied to our own data than they had on the ASCAD data, but learning rate still held up as one of the prime avenues for improvement. Our tests with learning rate led to our first model X8 (named for its 8 times higher learning rate). It was later adapted into two additional trained models, referred to in this paper as T57 and AllBytes.

T57. Our first model was trained on a rather large span of 150 time points. The length of the Sbox portion divided by 16 is 96. Therefore, T57 was instead trained using an indexing of [57:152] in order to avoid training on incorrect data. However, we later discovered that this is not an inclusive span, so the model was in fact trained on 95 points. The model performed the best out of our trained models either way, and we do not believe the extra time step would alter the results dramatically. As mentioned earlier we iteratively probe variables; in the case of T57 we trained models for each time interval, from the point we considered the latest possible to the earliest. We found that the interval from time step 57 to 152 gave optimal results for our attack target Xmega 2.

AllBytes. AllBytes was based on the fact that a model trained on the first key byte position gave poor results for the other 15, most likely due to slight inherent differences between each sequential Sbox calculation. In order to predict the entire key, a single model was trained on all key byte positions. We did this by splitting the traces into subtraces, each containing one keybyte's Sbox calculation, and combined them into one data set. This way each trace gave us 16 sections of 95 time points to train upon:

[57 + 96 ∗ i : 152 + 96 ∗ i] for 0 ≤ i ≤ 15

It should be noted, however, that the model which trained on 200k of these sections (12.5k traces originally) performed best.
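A sketch of how this splitting can be done with numpy (our own reconstruction of the idea; the array layouts and names are assumptions, with sbox being the 256-entry AES S-box as a numpy array):

import numpy as np

def split_into_sbox_segments(traces, plaintexts, keys, sbox):
    # Cut each full trace into 16 windows of 95 samples, one per keybyte,
    # following [57 + 96*i : 152 + 96*i], and label every window with the
    # Sbox output of its own keybyte position.
    segments, labels = [], []
    for i in range(16):
        start = 57 + 96 * i
        segments.append(traces[:, start:start + 95])
        labels.append(sbox[plaintexts[:, i] ^ keys[:, i]])
    return np.concatenate(segments), np.concatenate(labels)

With 12.5k traces this yields 200k labelled sections, matching the figure quoted above.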


3.4 Testing

We adapted the rank test presented by Benadjila et al., except we changed the calculation to be on a per-trace basis rather than per 10 traces. We believe this gives much more indicative graphs of the result, even if they are less smooth. The test works by having the model predict a set of input data. It is then processed sequentially by sorting the logarithm of the model predictions by size. The placement of the actual Sbox output is denoted by its "rank"; its indexed placement in the predictions. For each sequential test we keep a cumulative sum of previous log-probabilities that we add to the new prediction.
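One way to implement such a per-trace rank test (a sketch following the structure described above and the approach of Benadjila et al.; the names and the exact accumulation details are our assumptions, with sbox the 256-entry AES S-box as a numpy array):

import numpy as np

def key_rank_per_trace(predictions, plaintexts, true_key, sbox):
    # predictions: (n_traces, 256) model output probabilities.
    # For every keybyte hypothesis k we accumulate log P(Sbox(p_i XOR k)) over
    # the traces seen so far; the rank is the position of the true keybyte in
    # the sorted cumulative scores (rank 0 = best).
    scores = np.zeros(256)
    ranks = []
    eps = 1e-40                                  # avoid log(0)
    for pred, p in zip(predictions, plaintexts):
        for k in range(256):
            scores[k] += np.log(pred[sbox[p ^ k]] + eps)
        order = np.argsort(scores)[::-1]
        ranks.append(int(np.where(order == true_key)[0][0]))
    return ranks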

When our models started performing so well that we could no longer discern any difference between results, variance started being the determining factor for how the graph looked. Therefore we developed new tests for more granular results. We had tried randomizing the order of the testing data, and the amount of variance we observed was larger than expected. Thus, one of the simple modifications we made was to randomize the input for the rank test and run it many times to average the results. The downside of randomizing the input for a test is that it is less reproducible, but the mean result should be the same given a large number of tests.

The first trace success test measures the rate of successfully recovering a keybyte upon seeing only a single trace. We let the model predict the Sbox output value for 300k traces. For each trace we compare the top prediction with the expected correct output, calculated as Sbox(p_i ⊕ k_i) from the plaintext byte (p_i) and keybyte (k_i) corresponding to that trace. We note the number of occurrences as well as the number of successes for each keybyte value. This was later changed to instead show the spread over Sbox output values, see Fig. 16 or 17. Given this data we can plot the total success rate for each keybyte or Sbox output on the first try. Using this test we can also calculate the total first try success rate by summing the correctly classified Sbox outputs and dividing by the sum of all Sbox occurrences.

Another test calculates the number of traces needed to recover the full 16 byte key by finding which keybyte took the longest to determine. This was done by calculating the rank of the correct Sbox output (calculated from keybyte and plaintext) after each trace and determining how long it takes until the rank has been 0 for 3 traces in a row. The number 3 was chosen arbitrarily, though it is based on us never having seen our successful models diverge after converging to a correct answer for 3 traces in a row on Xmega 1. After calculating the number of traces needed for convergence to rank 0 for each keybyte position, we conclude that the worst performer amongst them will be the limiting factor. This determines the number of traces needed for full recovery. We plotted a cumulative distribution function of the results, see Fig. 10.
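A sketch of this convergence criterion in Python (our own formulation; the exact counting convention in the original script may differ):

def traces_for_full_key(ranks_per_byte, streak=3):
    # ranks_per_byte: 16 rank sequences, one per keybyte position.
    # A byte counts as recovered once its rank has been 0 for `streak`
    # consecutive traces; the slowest byte limits full-key recovery.
    needed = []
    for ranks in ranks_per_byte:
        run, hit = 0, None
        for i, r in enumerate(ranks, start=1):
            run = run + 1 if r == 0 else 0
            if run == streak:
                hit = i - streak + 1   # trace count at the start of the streak
                break
        if hit is None:
            return None                # some byte never converged in the test data
        needed.append(hit)
    return max(needed)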

Other tests were created for evaluation during iterative training; their results are not presented here since they do not directly correspond to metrics related to breaking AES.


4 Results

4.1 Xmega 1 first key byte recovery

The first key byte recovery on Xmega 1 using model X8 was in most cases immediate. Fig. 8 shows the recovery rate per key value for a single trace. The average accuracy for the model was 96.1%, signifying that as long as the data trained on and the data tested on were from the same PCB, our X8 model had an easy time detecting and classifying the Sbox values.

Figure 8: Success rate for model X8 (single trace first byte recovery success rate on Xmega 1, per keybyte value).

Figure 9: Success rate for model T57 (single trace first byte recovery success rate on Xmega 1, per keybyte value).

For model T57 the results on Xmega 1 were slightly worse, with an accuracy of 88.5%. This is still a very high level of accuracy from a single trace, and we would once again expect to recover the keybyte from a single trace in most cases, see Fig. 9.

4.2 Xmega 1 full key recovery

We tested models X8 and T57 for their ability to recover keybytes in positions other than the one they were trained for. While model X8 could not recover the entire key using Xmega 1 traces, model T57 could do so. T57 could not reliably recover the entire key using fewer traces than CPA, however, requiring more than 50 traces on average.

AllBytes was tested for full key recovery using the test defined above. As can be seen in the graph in Fig. 10, the model can recover the whole key from Xmega 1 traces within 4 traces in over 95% of cases. On average it takes only 2 traces to recover the entire key.

4.3 Xmega 2 first key byte recovery

Figure 10: A cumulative distribution function for the number of traces needed to recover the full key of Xmega 1.

For model X8, Fig. 11 shows the first test result we obtained, from running only a single test. As described in section 3.4, we found that there was a large variance in test results if the test data was shuffled and randomly sampled before the test was run; Fig. 12 shows a more indicative result obtained by running the test 1000 times and averaging the outcomes. For model T57 the averaged results are presented in Fig. 13. The first trace success rates were 2.25% for X8 and 14.14% for T57.

4.4 Xmega 2 full key recovery

We tested the ability to recover the entire key for both model X8 and T57, and neither of them could successfully recover the entire Xmega 2 key. While there was some promise in the T57 results, they will not be presented here since it still failed to reliably recover the entire key faster than CPA. As for the model AllBytes, however, we did successfully recover the entire key from Xmega 2 traces. As can be seen from Fig. 14 and 15, we did so on average faster than CPA. From 1000 tests we recovered the entire key in 51% of cases using 28 traces. Given 80 traces we can recover the entire key in 95% of cases. The test cutoff point was 200 traces, and the rate of failure to recover the entire key within 200 traces was 0.3%.


Figure 11: Our first test results, showing misleadingly good results for X8 (first byte performance of X8 on Xmega 2, rank versus traces).

Figure 12: Running X8 1000 times with shuffled input yields a more accurate result (average first byte performance of X8 on Xmega 2).

Figure 13: Average results of 1000 tests for model T57 (average first byte performance of T57 on Xmega 2).

4.5 Sbox variability impact

Figure 14: The mean ranking of the correct key value for each keybyte position, for AllBytes.

Figure 15: A cumulative distribution function for the number of traces needed to recover the full key of Xmega 2 using AllBytes; the 50% line is shown dashed and the 95% line dotted.

During our fine-tuning of variables, Hyanyu Wang, a fellow student working in parallel to extend the work of Benadjila et al., came to us with an interesting realization. As previously mentioned, our testing code verifies the correct key against the model's deduced key after we apply the AES functions backwards. However, upon automating that process we (and Benadjila et al. before us) obfuscate part of the process for ease of testing, and we missed a crucial piece of information.

When testing our model against traces collected on the second PCB, Xmega 2, our model was able to retrieve the key value given enough traces, but there existed only a very small subset of Sbox values that made recovery possible. When the model was asked to classify traces whose correct Sbox classification was one of these values, it would easily, with up to 100% accuracy, identify the Sbox value and thus, since we know the plaintext, the key. When the true Sbox value was one of the many others, however, as the comparison between Fig. 16 and 17 shows, it would in a majority of cases be unable to classify it at all.

Figure 16: The success rates for different Sbox values on Xmega 1 (single trace recovery success rate per Sbox output value), model: X8.

Figure 17: The success rates on a different chip, Xmega 2 (single trace recovery success rate per Sbox output value), model: X8.


Figure 18: Success rate with T57 on Xmega 1 (single trace recovery success rate per Sbox output value).

Figure 19: Success rate with T57 on Xmega 2 (single trace recovery success rate per Sbox output value).

For Xmega 1 many Sbox values have a classification accuracy of 1, or 100%, and the biggest variation is instead among certain outlying underperformers. In the case of Xmega 2 the results show that a single Sbox value (219) is classified correctly in 100% of cases and certain others are quite successful. The reason for this is that whenever the model does not recognize the trace as having an Sbox output of one of the few it can recognize, it makes a default guess, e.g. 219. Between models we trained there is an overlap in their sets of Sbox values with high classification success rates, as well as in their sets of Sbox values which are never classified correctly.

Our results for the same test when optimizing the trace time interval also show an interesting development. Upon testing the T57 model, the Sbox classifications were much more evenly distributed and more values show non-zero results (Fig. 19). There is also a clear reduction in accuracy for Xmega 1 traces (Fig. 18). For Xmega 2, however, this increased spread of correct classification means it can identify a wider range of traces correctly, although it still seems to default to certain values when uncertain about the correct classification.

We had the idea to try to capitalize on the ability to learn certain Sbox values more easily by training models specialized in recognizing certain subsets of traces. For example: a model that recognizes Sbox value 255 traces and for any other trace answers "256", a stand-in for "not my trace". This showed promise in the rate of classification, however it is easy to see how this can be misleading. A model that always answers its value, 255 in this case, would correctly classify a 255 trace in 100% of cases, but it would also always misclassify all other traces. After some testing we found that the rate of misclassification was very high for all specialized models that had a high classification success rate. We concluded that, while this may be a possible avenue for making a better model, it takes an immense amount of time to train 256 separate models. Furthermore, we found that the same training parameters did not yield similar results across all specialized models. Thus each of these models would also need to be optimized separately.


4.6 Model variance

During training, one of our most significant results was the indirect discovery of high variance among our models. T57, for example, when trained on random subsets of the same data and with identical hyperparameters, has a first byte classification accuracy ranging from 4.06% to 14.14% (Table 1). This variance makes model evaluation harder, as it is quite easy to mistake a bad result for poor parameterization rather than a result of the variance.

We thought that a much larger set of training data, combined with a reduced number of epochs to avoid overfitting, could resolve this. However, after increasing our training set from 255k traces to 1.4M traces, we saw approximately the same amount of variance after training 13 models with the same new set of traces and hyperparameters. These new models were marginally more accurate on average, however.

Model number    Success rate (%)    Absolute variance (%-units)
1               14.14               7.37
2                6.45               0.32
3                4.06               2.71
4                5.79               0.98
5                4.32               2.45
6                8.64               1.87
7                6.85               0.08
8                7.00               0.23
9                6.19               0.58
10               4.22               2.55
11               6.86               0.09
12               9.14               2.37
13               4.36               2.41

Table 1: Total first try success rate for 13 T57 models

This variance was found when training on Xmega 1 traces and attacking Xmega 2. We do not have data for the variance in results when attacking Xmega 1, since our successful models could easily classify Xmega 1 traces, often within a single trace, and also since the Xmega 1 performance tended to be proportional to the validation accuracy of the training data.

4.7 Chip variability

We did some preliminary testing on inter-chip variability to verify that the big difference in performance between Xmega 1 and Xmega 2 was indeed due to PCB differences and not due to manufacturing process variability [15]. We tested model T57 against traces captured from two additional chips, mounted on an Xmega 1 and an Xmega 2 PCB respectively. We call these Xmega 1B and Xmega 2B. We ran an average of 1000 rank tests on both to compare the results to the original Xmega 1 and Xmega 2, see Fig. 20, 21, 22, 23. While there is some difference between chips for Xmega 2, it does not seem to be the cause of the large discrepancy in performance between the original Xmega 1 and Xmega 2. Thus we believe that the difference in PCB is a more important factor for the ease of generalization for our MLP model. We can, however, see from Fig. 20 and 21 that some amount of difference between chips mounted on a PCB of the same design does exist. This is to be expected.

Figure 20: Average rank for T57 against Xmega 1 traces.

Figure 21: Average rank for T57 against Xmega 1B traces.

Figure 22: Average rank for T57 against Xmega 2 traces.

Figure 23: Average rank for T57 against Xmega 2B traces.


5 Conclusion

Though AES is a deterministic algorithm with a known set of calculations, it has great variability in the appearance of power traces depending on a multitude of factors, including byte position and hardware. Our results, and probably more importantly, our 290 poorly performing models show that there is a fine line between excellent and mediocre results. Because of the limited time for this thesis work there remains room for improvement in our existing models. Given enough computer processing time and patience we believe that an MLP model can be trained to very effectively compromise microcontroller implementations of AES. Due to the Sbox variability problem, however, there appears to be a potential limiting factor when using our training methodology. Any set of training parameters may lead to certain labels being overfit or underfit. Thus, a real challenge is finding a way to train a model that can recover the key regardless of the Sbox output.

Our results on Sbox variability indicate that a poorly performing model has a much higher incidence of Sbox inequality. While any model we train has certain Sbox outputs that are classified much more easily than others, they are far more evenly distributed for accurate models. These easily classified Sbox values also do not appear to be random. They favour a certain subset of numbers which are often repeated between models, even when trained with slight variations. The potential conclusion is that there is an underlying correlation: the model is able to learn from some commonality for these values. We have been unable to discern any simple relation between them, if one exists. More research is needed to determine if this phenomenon manifests for other PCBs.

Since an attacker would not have our luxury of identifying whether a model performs well before performing the attack, the variability of the results is a serious challenge. From our models we have seen performance jumps of 350% when running identical training code but on a different subset of data. One solution to this problem is a majority polling system, in which several models are trained. When the models are all asked to classify a trace, the attacker can add the probabilities together for each Sbox label. This cumulative prediction is proportional to the average prediction.
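A sketch of such a polling scheme (our own illustration; models is assumed to be a list of independently trained Keras models):

import numpy as np

def polled_prediction(models, traces):
    # Sum the per-class probabilities of all models; the sum is proportional
    # to the average prediction, so the ranking of Sbox labels is the same.
    total = np.zeros((traces.shape[0], 256))
    for model in models:
        total += model.predict(traces)
    return total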

The difference in our results between Xmega 1 and Xmega 2 shows that hardware variability is a substantial factor in performance. Training on a different PCB from the attacked target reduces prediction accuracy enormously. The attempts at recovering the full key also show inherent differences between positions. We have shown that training on all byte positions is a viable way to overcome the inherent keybyte position variability. In addition, we hypothesize that training on traces captured from different hardware, for example different chips or different PCBs, will help generalize the model.

However, an attacker could also train models uniquely suited for each attacked subset of data. By using a series of 16 models in sequence, one can eliminate byte position variability. Our Xmega 1 results show excellent performance when attacking traces captured from the same device as trained on, eliminating that variability as well.


6 Future Work

Our extensive work within a narrow scope has led us to conclude that there are unanswered questions worth pursuing. Here we present the ones we believe are most important.

Sbox performance discrepancy. Our results show that the difference in ability to classify different Sbox outputs correctly when attacking Xmega 2 is significant. We did not test if similar results would be seen when attacking a third PCB with an ATxmega128D (Xmega 3). Would the same set of Sbox labels be more accurately predicted when attacking Xmega 3? We do not know if the manifestation of this behaviour is intrinsic to the design of the PCB. We believe more testing is necessary to determine how a difference in PCBs affects the accuracy of labels.

Model variance. Since the variance in the models we trained was high, we did some preliminary testing to counteract model variance. We did not succeed in reducing the variance in our models. The large variance was surprising to us, and we believe that investigating its cause is worthwhile.

Model threat assessment. It is noteworthy that the average of the rank graphs for a model that converges to rank 0 has a distinct shape. It seems to be bounded by a decaying exponential. If more research into this topic is to be done, it may be worth considering a standardized notation. Exploring the possibility of assessing a model's threat to a system in terms of the function it is bounded by might make it easy to parameterize risk assessment. For instance, perhaps C·e^{−kt} + ε could represent the level of risk one is willing to take, and all models bounded by that function are deemed compromising. Other ideas include plotting on a logarithmic scale, or perhaps even describing model threat in terms of a differential equation with an initial state.

Neural net topology. Our research focused on using MLP, but throughout the process the idea of different network topologies was discussed. We decided to focus on optimizing MLP networks, but it is plausible that a different topology is more suitable.


References

[1] Timo Bartkewitz and Kerstin Lemke-Rust. 2013. Efficient Template Attacks Based on Probabilistic Multi-class Support Vector Machines. In Smart Card Research and Advanced Applications, Stefan Mangard (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 263–276.

[2] Ryad Benadjila, Emmanuel Prouff, Remi Strullu, Eleonora Cagli, and Cecile Dumas. 2018. Study of Deep Learning Techniques for Side-Channel Analysis and Introduction to ASCAD Database. (2018), 45.

[3] Joan Daemen and Vincent Rijmen. 1999. The Rijndael Block Cipher. (09 March 1999), 45.

[4] R. Gilmore, N. Hanley, and M. O'Neill. 2015. Neural network based attack on a masked implementation of AES. In 2015 IEEE International Symposium on Hardware Oriented Security and Trust (HOST). 106–111.

[5] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org

[6] Raul Gomez. 2019. Understanding Categorical Cross-Entropy Loss, Binary Cross-Entropy Loss, Softmax Loss, Logistic Loss, Focal Loss and all those confusing names. (23 April 2019). https://gombru.github.io/2018/05/23/cross_entropy_loss/

[7] Annelie Heuser and Michael Zohner. 2012. Intelligent Machine Homicide. In Constructive Side-Channel Analysis and Secure Design, Werner Schindler and Sorin A. Huss (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 249–264.

[8] Gabriel Hospodar, Benedikt Gierlichs, Elke De Mulder, Ingrid Verbauwhede, and Joos Vandewalle. 2011. Machine learning in side-channel analysis: a first study. Journal of Cryptographic Engineering 1, 4 (27 Oct 2011), 293.

[9] Microchip Technology Inc. 2019. ATxmega128D4 - 8-bit AVR Microcontrollers. (20 April 2019). https://www.microchip.com/wwwproducts/en/ATxmega128D4

[10] NewAE Technology Inc. 2019. ChipWhisperer®. (29 April 2019). https://newae.com/tools/chipwhisperer/

[11] Keras. 2019. Home - Keras Documentation. (02 May 2019). https://keras.io/

[12] Paul C. Kocher. 1996. Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems. (1996), 10.

[13] Liran Lerman, Gianluca Bontempi, and Olivier Markowitch. 2014. Power analysis attack: An approach based on machine learning. International Journal of Applied Cryptography 3 (2014).

[14] Liran Lerman, Gianluca Bontempi, and Olivier Markowitch. 2015. A machine learning approach against a masked AES. Journal of Cryptographic Engineering 5, 2 (01 Jun 2015), 123–139.

[15] Roel Maes. 2013. Physically unclonable functions: constructions, properties and applications (1st ed.). Springer. https://link.springer.com/content/pdf/10.1007%2F978-3-642-41395-7.pdf

[16] Houssem Maghrebi, Thibault Portigliatti, and Emmanuel Prouff. 2016. Breaking Cryptographic Implementations Using Deep Learning Techniques. In Security, Privacy, and Applied Cryptography Engineering, Claude Carlet, M. Anwar Hasan, and Vishal Saraswat (Eds.). Vol. 10076. Springer International Publishing, 3–26. https://doi.org/10.1007/978-3-319-49445-6_1

[17] National Institute of Standards and Technology. [n. d.]. FIPS 197, Advanced Encryption Standard (AES). ([n. d.]), 51.

[18] S. B. Ors, F. Gurkaynak, E. Oswald, and B. Preneel. 2004. Power-analysis attack on an ASIC AES implementation. In International Conference on Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. IEEE, Vol. 2, 546–552. https://doi.org/10.1109/ITCC.2004.1286711

[19] S. B. Ors, F. Gurkaynak, E. Oswald, and B. Preneel. 2004. Power-analysis attack on an ASIC AES implementation. In International Conference on Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004., Vol. 2, 546–552. https://doi.org/10.1109/ITCC.2004.1286711

[20] PDC. 2019. (02 May 2019). https://www.pdc.kth.se/

[21] Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. (19 January 2016). http://ruder.io/optimizing-gradient-descent/

[22] A. L. Samuel. 1959. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development 3, 3 (July 1959), 210–229. https://doi.org/10.1147/rd.33.0210

[23] Juergen Schmidhuber. 2015. Deep Learning in Neural Networks: An Overview. Neural Networks 61 (January 2015), 85–117. https://doi.org/10.1016/j.neunet.2014.09.003 arXiv:1404.7828


Appendices

A MLP architecture summaries

The three models share some fundamental parameters, since T57 and AllBytes are based off of X8. They each have six layers, using the activation function ReLU for the first five layers and Softmax for the sixth. The optimizer is RMSprop with a learning rate of 0.0008. The model's learning metric is accuracy and the monitor is validation accuracy, with a validation split of 0.3. Remaining parameters are noted in Table A1.
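For concreteness, a minimal Keras sketch of a model with these parameters (a reconstruction from the summary above, not the original training script; older Keras versions name the learning-rate argument lr instead of learning_rate):

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop

def build_model(input_len):
    # Six Dense layers: five ReLU layers of 200 nodes and a 256-way Softmax output.
    model = Sequential()
    model.add(Dense(200, activation='relu', input_shape=(input_len,)))
    for _ in range(4):
        model.add(Dense(200, activation='relu'))
    model.add(Dense(256, activation='softmax'))
    model.compile(optimizer=RMSprop(learning_rate=0.0008),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# X8 uses 150 input samples ([50:200]); T57 and AllBytes use 95 ([57:152]).
# model = build_model(95)
# model.fit(data, truth_vectors, epochs=200, batch_size=100, validation_split=0.3)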


Hyperparameter                 Value
Sample interval (X8)           [50:200]
Sample interval (T57)          [57:152]
# of traces (X8 & T57)         255k
Sample interval (AllBytes)     T57, repeating¹
# of traces (AllBytes)         12.5k²
Loss function                  Categorical Crossentropy
# of nodes                     200
# of epochs                    200
Batch size                     100

Table A1: Model specifics

Layer Type        Output Shape    Parameter #
Input (Dense)     (None, 200)     30200
Dense 1           (None, 200)     40200
Dense 2           (None, 200)     40200
Dense 3           (None, 200)     40200
Dense 4           (None, 200)     40200
Output (Dense)    (None, 256)     51456

Total Parameters: 242,456

Table A2: X8 architecture summary

Layer Type        Output Shape    Parameter #
Input (Dense)     (None, 200)     19200
Dense 1           (None, 200)     40200
Dense 2           (None, 200)     40200
Dense 3           (None, 200)     40200
Dense 4           (None, 200)     40200
Output (Dense)    (None, 256)     51456

Total Parameters: 231,456

Table A3: T57 and AllBytes architecture summary

¹ [57 + 96·i : 152 + 96·i] for i between 0 and 15.
² See section 3.3.


B Our code

Most of our code was adapted from code published by Benadjila et al. Their code can be found on their GitHub³. While we made a lot of changes to it, most of those changes were related to code readability. We also introduced the ability to do batch testing of models from the command line. These changes do not in essence alter the functionality of the training or testing, so reproducing our experiments can be done without those alterations.

We wrote some test code ourselves, but most of those test results are not presented in our paper. This was detailed in section 3.4. The first trace success rate test mentioned there was written by us and the code is presented below:

import numpy as np

# Sbox is assumed to be the 256-entry AES S-box lookup table, defined elsewhere.
# Create a (256, 2) shaped matrix with "number of checks for each Sbox output"
# as [:, 0] and "number of successes" as [:, 1].
def keytest(model):
    folder = 'traces/X2attack/'                              # 1
    results = np.zeros((256, 2))                             # 2
                                                             # 3
    traces = np.load(folder + 'Traces.npy')[:, 57:152]       # 4
    plaintext = np.load(folder + 'Plaintext.npy')[:, 0]      # 5
    keys = np.load(folder + 'Keys.npy')[:, 0]                # 6
                                                             # 7
    predictions = model.predict(traces)                      # 8
    maxindices = np.argmax(predictions, axis=1)              # 9
    for i in range(traces.shape[0]):                         # 10
        if Sbox[plaintext[i] ^ keys[i]] == maxindices[i]:    # 11
            results[maxindices[i], 1] += 1                   # 12
        results[Sbox[plaintext[i] ^ keys[i]], 0] += 1        # 13
                                                             # 14
    return results                                           # 15

This is the code for the first attempt success test. Some file names were changed to make the code fit the page. Lines 1-7 initialize variables. On line 4 we load the time span [57:152] for all traces. On lines 5 and 6 we load plaintext byte 0 and keybyte 0.

On line 8 we have the model make predictions on the data. Line 9 finds the top prediction for each trace. Lines 10-14 compare each top prediction with the actual correct answer. On line 12 we store correct predictions in one column. On line 13 we store Sbox occurrences. This lets us calculate a success-to-occurrence ratio in another part of the code, which gives us the success rate.

³ https://github.com/ANSSI-FR/ASCAD.git


TRITA-EECS-EX-2019:110

www.kth.se