
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2019

Stochastic Gradient Descent in Machine Learning

CHRISTIAN L. THUNBERG

NIKLAS MANNERSKOG

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES


Abstract

Some tasks, like recognizing digits and spoken words, are simple for humans to complete yet hard to solve for computer programs. For instance, the human intuition behind recognizing the number eight, "8", is to identify two loops on top of each other, and it turns out this is not easy to represent as an algorithm. With machine learning one can tackle the problem in a new, easier way, where the computer program learns to recognize patterns and draw conclusions from them. In this bachelor thesis a digit recognizing program is implemented and the parameters of the stochastic gradient descent optimizing algorithm are analyzed based on their effect on the computation speed and accuracy. These parameters are the learning rate ∆t and the batch size N. The implemented digit recognizing program yielded an accuracy of around 95 % when tested, and the time per iteration stayed constant during the training session and increased linearly with batch size. Low learning rates yielded a slower rate of convergence while larger ones yielded faster but more unstable convergence. Larger batch sizes also improved the convergence but at the cost of more computational power.

Keywords: Stochastic Gradient Descent, Machine Learning, Neural Networks, Learning Rate, Batch Size, MNIST


Sammanfattning

Some problems that are simple for humans to solve, for example recognizing digits and spoken words, are hard to implement in computer programs. For instance, the human intuition for recognizing the digit eight, "8", is to note two loops on top of each other, and this turns out to be hard to represent as an algorithm. With machine learning it is possible to attack the problem in a new, easier way, where the computer program is taught to recognize patterns from which it draws conclusions. In this bachelor thesis a digit recognizing program is implemented and the parameters of stochastic gradient descent are analyzed with respect to their effect on the program's computation speed and accuracy. These parameters are the learning rate ∆t and the batch size N. The implemented digit recognizing program had an accuracy of around 95 % when tested, and the time per iteration was constant during the training of the program, while it increased linearly with increased batch size. Low learning rates resulted in slow but steady convergence, whereas larger ones resulted in faster but more unstable convergence. Larger batch sizes improved the convergence but at the cost of longer computation time.

Keywords: Stochastic Gradient Descent, Machine Learning, Neural Networks, Learning Rate, Batch Size, MNIST


Acknowledgements

This bachelor thesis was written as part of the course SA114X, Degree Project in Engineering Physics, First Cycle, at the Department of Numerical Analysis at the Royal Institute of Technology. We would like to thank our supervisors Anna Persson and Patrick Henning for their support and feedback throughout the entire project.

Christian L. Thunberg, [email protected]
Niklas Mannerskog, [email protected]

Stockholm, 2019


Contents

1 Introduction
    1.1 Purpose
    1.2 Background and target group
    1.3 Delimitation

2 Theory
    2.1 Artificial intelligence
    2.2 Machine learning
    2.3 Artificial neural network

3 Training Neural Networks
    3.1 Standard Gradient Descent
    3.2 Stochastic Gradient Descent
    3.3 Empirical and Expected Values

4 Implementation
    4.1 MNIST Problem and Model
    4.2 Code and measurements

5 Parameter study
    5.1 Time dependency on batch size
        5.1.1 Method
        5.1.2 Result
    5.2 Accuracy and learning rate
        5.2.1 Method
        5.2.2 Result
    5.3 Accuracy and batch size
        5.3.1 Method
        5.3.2 Result
    5.4 Performance
        5.4.1 Method
        5.4.2 Result

6 Analysis and Conclusions

7 Appendix
    7.1 Digit recognizing program


1 Introduction

1.1 Purpose

This bachelor thesis will cover two separate topics:

1. A literature study about numerical stochastic gradient descent and other iterative methods used in machine learning to find a local minimum of a function.

2. How to make a digit recognizing computer program using a neural network that recognizes handwritten digits, and then an analysis of some parameters' impact on the result and computation speed.

1.2 Background and target group

Interest in machine learning has increased during the past years [1]. It has the potential to shape the future more than any other technical concept in 2019 [2], and the interesting applications available today justify the increased interest, for example:

1. IBM's Deep Blue chess-playing system, which defeated the world champion in chess in 1997 [3].

2. An AI developed by Google's DeepMind, which in February 2019 defeated the best StarCraft II players [4].

3. Alexa, Amazon's virtual assistant, which became good at speech recognition and natural language understanding with the help of machine learning [5].

This bachelor thesis will first explain the important concepts within machine learning. After the concepts are explained, machine learning's "hello world" program will be implemented and explained. After the implementation, some parameters will be analyzed to see how they affect the accuracy of the program.

Anyone with basic knowledge of programming and mathematics who is new to machine learning can read this thesis as an introduction to the subject.

1.3 Delimitation

The thesis will only cover supervised learning, which is one of many ways a computer program can be trained, and only a handful of variables will be analyzed. The variables that will be analyzed are: how the batch size, N, affects the computation time and accuracy of the digit recognizing program, and how the learning rate, ∆t, affects the accuracy. With regard to the learning rate, there are algorithms with adaptive learning rates for training neural networks; in this thesis, however, we only consider constant learning rates and constant batch sizes. Note also that designing a neural network that achieves the maximum possible accuracy is not within the scope of this thesis.

2 Theory

2.1 Artificial intelligence

Artificial intelligence has no unambiguous definition but can be summarized as the science and engineering of making intelligent entities, which includes the most important topic: making intelligent computer programs [6].

An intelligent entity could be a mechanical arm that sorts papers with handwritten digits, or a self-driving car that reacts to people on the road in front of it.


2.2 Machine learning

Machine learning was defined as the "field of study that gives computers the ability to learn without being explicitly programmed" by a pioneer in the field, Arthur Samuel, in 1959.

The field of machine learning can be understood as the scientific study of computational methods that use experience to improve performance or to make better predictions [7].

The entity's software could learn to recognize handwritten digits by analyzing a large set of handwritten digits, or by being run in a simulator where feedback is given to the software.

2.3 Artificial neural network

An artificial neural network, or from now on "neural network", is a framework for a set of machine learning algorithms inspired by the human brain. The neural network used in this thesis consists of different layers: two that are visible to humans and one or more hidden layers. The layers visible to humans are the first layer, where the data is fed in, and the last layer, which is the neural network's output. Each layer consists of an arbitrary number of neurons, connected to all neurons in the layers in front and behind; no connections exist between neurons within a layer. In figure 1 the neural network described above is illustrated.

Figure 1 – Illustration of a neural network with I input neurons, H hidden layers with A, B or C neurons, and O output neurons.

Each neuron contains a value, which is calculated by creating a weighted sum. The weighted sum is calculated by multiplying each neuron value in the previous layer with a weight, summing all of the resulting products, and then adding a bias to the sum. The weighted sum is then mapped to an interval I, chosen by the author, by a function σ : R → I; the mapped value is the neuron's value. The function σ is called the "activation function". The activation function used in this thesis is the sigmoid function, which is a frequently used activation function in machine learning [8]:

\[
\text{sigmoid function:} \quad \sigma(x) = \frac{1}{1 + e^{-x}}. \tag{1}
\]

Let Σ_bh be the weighted sum of N_bh and n_bh be the value of N_bh in figure 1. The weighted sum with the bias is then calculated through

\[
\Sigma_{bh} = \sum_{a=1}^{A} \omega_{ab} \cdot n_{a1} + b_h \tag{2}
\]

where ω_ab is the weight, b_h is the bias and n_a1 is a neuron's value in the previous layer. The index ab means from neuron a in the previous layer to neuron b in the current layer. The weights and biases are real values and can be thought of as a description of how a neuron depends on a neuron in the previous layer. The value of the neuron, n_bh, is then calculated by inserting the weighted sum, Σ_bh, into the activation function. If so, the value of the neuron is

2

Page 9: Stochastic Gradient Descent in Machine Learningkth.diva-portal.org/smash/get/diva2:1335380/FULLTEXT01.pdf · 1.A literature study about Numerical stochastic gradient descent and other

\[
n_{bh} = \sigma(\Sigma_{bh}). \tag{3}
\]

In figure 2 the above described neuron is visually illustrated.

Figure 2 – Illustration of how the j'th neuron in layer h in a neural network is calculated. The value of a neuron in the previous layer is labeled n_{a,h−1}, where a ∈ {1, 2, ..., A} and h − 1 is the previous layer's index. From now on a will be used instead of a, h − 1. Each n_a is associated with a weight ω_a ∈ (−∞, ∞) and bias b_h ∈ (−∞, ∞) which describe how this neuron depends on n_a. The value of the neuron of interest, n_{j,h}, is then calculated by inserting the weighted sum $\Sigma = \sum_{a=1}^{A} \omega_a \cdot n_a + b_h$ into the activation function σ, hence n_{j,h} = σ(Σ). The value of a neuron is then transmitted to all neurons in the next layer.
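
To make the forward computation in equations (2) and (3) concrete, the following is a minimal NumPy sketch of how the values of one layer could be computed from the previous layer; the array shapes and the example numbers are illustrative assumptions, not taken from the thesis code.

    import numpy as np

    def sigmoid(x):
        # Activation function from equation (1).
        return 1.0 / (1.0 + np.exp(-x))

    def layer_forward(n_prev, weights, biases):
        # n_prev:  values of the A neurons in the previous layer, shape (A,)
        # weights: one weight omega_ab per connection, shape (B, A)
        # biases:  one bias per neuron in the current layer, shape (B,)
        weighted_sums = weights @ n_prev + biases   # equation (2), for all B neurons at once
        return sigmoid(weighted_sums)               # equation (3)

    # Illustrative example: a layer of 2 neurons fed by a previous layer of 3 neurons.
    rng = np.random.default_rng(0)
    previous_layer = rng.random(3)
    print(layer_forward(previous_layer, rng.normal(size=(2, 3)), np.zeros(2)))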

In this thesis the inputs and outputs of the model can be interpreted as vectors, X = (X_1, ..., X_I) and α = (α_1, ..., α_O), where the components X_i and α_o are real numbers. The output layer is mapped to a vector T(α) whose components, T_o, describe the probability of an event. The largest component in T, T_max = max(T), is the neural network's conclusion. E.g. if the neural network can distinguish pictures of cats from dogs, the mapped output layer could be T = [T_1, T_2] = [P(Dog), P(Cat)]. If the output array for a picture is T = [0.52, 0.89], the largest component is T_max = 0.89, hence the neural network will conclude "this is a picture of a cat". Before the neural network can make a good conclusion it needs to be trained, or taught, by "looking" at input data with labels that contain the correct output layer vector

\[
\mathbf{y} = (y_1, ..., y_o, ..., y_O) \quad \text{where} \quad y_o =
\begin{cases}
1 & \text{if the } o\text{'th component is the wanted output} \\
0 & \text{else.}
\end{cases} \tag{4}
\]

The correct output layer vector y, or the one-hot vector when y is defined as above, is then compared with the neural network's mapped output vector, T, for an error estimation; let us denote the comparison as T_Compared. The problem is now to minimize f(all variables in the neural network) = T_Compared in order to minimize the error. In general this is not easily done and certain methods are used to solve the minimization problem, see section 3 for further reading. E.g. a picture of a cat is given to a neural network in the training phase, the correct output array is y = [0, 1] = [y_1, y_2] and the network's output is T = [T_1, T_2]. The comparison could be the sum of the components' squared differences, hence

\[
T_{\text{Compared}} = \sum_{o=1}^{O} (T_o - y_o)^2. \tag{5}
\]
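
As a worked example, for the cat picture above with y = [0, 1] and T = [0.52, 0.89], equation (5) gives T_Compared = (0.52 − 0)² + (0.89 − 1)² = 0.2704 + 0.0121 = 0.2825; a perfect prediction T = [0, 1] would drive this value down to 0.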

The goal is now to minimize [9]

\[
E[f(\text{all variables in the neural network})]. \tag{6}
\]

To simplify the notation from now on: θ will be used to represent the model's parameters, x is the input, y is the one-hot vector and α will be the neural network function (taking x and θ as inputs). Hence the function to minimize (with respect to θ) can be written as:

\[
E[f(\text{all variables})] = E[f(\alpha(\mathbf{x}, \theta), \mathbf{y})]. \tag{7}
\]


3 Training Neural Networks

3.1 Standard Gradient Descent

As previously stated, the machine learning problem of training a neural network is reduced to finding θ that solves

\[
\min_{\theta} \; E[f(\alpha(\mathbf{x}, \theta), \mathbf{y})] \tag{8}
\]

where we specify a cost function f(α, y) and feed data points (x_i, y_i) from the training data to compute the optimal neural network parameters θ. Here α is the neural network function taking input vectors x and parameters θ. We can solve the problem approximately using the following algorithm:

• Make an initial guess θ_0 for θ

• Create a set of labeled training data $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N_{\text{training}}}$

• Choose suitable learning rate ∆t > 0

• for m ∈ {1, 2, ..., M} do

• Compute $\theta_{m+1} = \theta_m - \Delta t \, \frac{1}{N_{\text{training}}} \nabla_\theta \sum_{j=1}^{N_{\text{training}}} f(\alpha(\mathbf{x}_j, \theta_m), \mathbf{y}_j)$

which is typically referred to as "standard" gradient descent, or GD for short. Naturally, GD actually tries to minimize $\frac{1}{N_{\text{training}}} \sum_{j=1}^{N_{\text{training}}} f(\alpha(\mathbf{x}_j, \theta), \mathbf{y}_j)$ [10].

3.2 Stochastic Gradient Descent

The most computationally taxing part of the algorithm described above is calculating the gradient, which is done in an exact manner using the entire training data set. However, the gradient can instead be approximated using a smaller sample (mini batch) of the data set, thus reducing the computation needed. The algorithm then becomes

• Make an initial guess θ_0 for θ

• Create a set of labeled training data $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N_{\text{training}}}$

• Choose suitable batch size N ≤ N_training

• Choose suitable learning rate ∆t > 0

• for m ∈ {1, 2, ..., M} do

• Choose random $\{j_k\}_{k=1}^{N}$ from {1, 2, ..., N_training}

• Compute $\theta_{m+1} = \theta_m - \Delta t \, \frac{1}{N} \nabla_\theta \sum_{k=1}^{N} f(\alpha(\mathbf{x}_{j_k}, \theta_m), \mathbf{y}_{j_k})$

which we call stochastic gradient descent (SGD) [10]. If N = N_training we get standard GD instead. Naturally this method will be less exact due to its random nature, but will hopefully be faster in solving (8) by using less computation.

In conclusion we must specify at least two parameters, the learning rate ∆t and the batch size N, in order to use SGD as an optimizer. In this most simple version we keep ∆t constant, but it could also be varied according to a schedule $\{\Delta t_i\}_{i=1}^{M}$ in more advanced versions of the algorithm.
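
As an illustration of the loop above, the following is a minimal NumPy sketch of the SGD update, assuming a user-supplied function grad_f that returns the gradient of the mini-batch-averaged cost with respect to θ; the toy problem at the end is purely illustrative. Setting batch_size = N_training recovers standard GD.

    import numpy as np

    def sgd(grad_f, theta0, data_x, data_y, learning_rate, batch_size, num_iters, seed=0):
        # grad_f(theta, x_batch, y_batch) returns the gradient of the averaged
        # mini-batch cost with respect to theta (the last step of the algorithm).
        rng = np.random.default_rng(seed)
        theta = np.array(theta0, dtype=float)
        n_training = len(data_x)
        for m in range(num_iters):
            batch = rng.integers(0, n_training, size=batch_size)   # random indices {j_k}
            theta = theta - learning_rate * grad_f(theta, data_x[batch], data_y[batch])
        return theta

    # Toy usage: find the theta minimizing the mean of (theta - x)^2 over the data,
    # whose exact minimizer is the mean of x (about 3.0 here).
    x = np.random.default_rng(1).normal(loc=3.0, size=1000)
    grad = lambda theta, xb, yb: 2.0 * np.mean(theta - xb)
    print(sgd(grad, theta0=0.0, data_x=x, data_y=x, learning_rate=0.1, batch_size=16, num_iters=500))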


3.3 Empirical and Expected Values

As mentioned in 3.1, the actual function attempted to be minimized in both GD and SGD is $\frac{1}{N_{\text{training}}} \sum_{j=1}^{N_{\text{training}}} f(\alpha(\mathbf{x}_j, \theta), \mathbf{y}_j)$, which is based on a training sample from the distribution of the data x. This only yields an approximate solution that depends on the training data used and is therefore referred to as the empirical loss function. To make an estimation of the expected loss, E[f(α(x, θ), y)], we can feed the network data points not from the training set (but from the test set), thus not previously seen by the model.

As both data sets are sampled from the same distribution, minimizing the empirical loss function is hoped to also minimize the expected loss function. However, if M is too high, training may fit the model to noise in the training data and thus no longer decrease the expected loss. This is called over-fitting.

4 Implementation

4.1 MNIST Problem and Model

To make a numerical analysis of the stochastic gradient descent optimizer we use a standard handwritten digit recognition problem and the MNIST data set. The data set consists of labeled 28 × 28 (pixels) images, picturing different handwritten digits 0-9. Each label comes in the form of a 1 × 10 one-hot label [11]. The goal is to create a neural network that achieves a high accuracy in identifying which digit is displayed in such a picture.

Figure 3 – An MNIST 4. Figure 4 – An MNIST 5.

Figure 5 – An MNIST 7. Figure 6 – Another MNIST 7.

The basic model used in this report to solve the problem is a three layer neural network. As each image has 28 × 28 = 784 pixels, we have 784 neurons in the input layer. Hidden layers increase the complexity of the model and allow for a higher accuracy but also increase the time needed to train it, thus only one hidden layer with K = 500 neurons and a sigmoid activation function is used. The output layer consists of 10 output neurons (one for each digit). Note that the purpose here is not to achieve the maximal possible accuracy for the problem, but to observe the SGD optimization process.

Furthermore, the output of the neural network is normalized by passing it through the softmax function. With z = (z_1, ..., z_n)^T, we define the softmax function S(z) as follows [12]:


\[
S(\mathbf{z}) = \left( \frac{e^{z_1}}{\sum_{i=1}^{n} e^{z_i}}, \; ..., \; \frac{e^{z_n}}{\sum_{i=1}^{n} e^{z_i}} \right). \tag{9}
\]

This forces each component of the output into the interval 0 < S(z)_i < 1, as e^x > 0. It also satisfies $\sum_{i=1}^{n} S(\mathbf{z})_i = 1$. The cost function used is called the cross-entropy function [13]. In the discrete case the cross-entropy function H(y, z) is defined as follows (y = (y_1, ..., y_n)^T):

\[
H(\mathbf{y}, \mathbf{z}) = -\sum_{i=1}^{n} y_i \log(z_i) \tag{10}
\]

In this particular problem, y is the label of the image and z is the output from the model through the softmax function. The value of H(y, z) is clearly defined for all outputs of the neural network passed through the softmax function, as these are all greater than 0 by (9). We also have n = 10 and the concrete problem to solve becomes

\[
\min_{\theta} \; E[H(\mathbf{y}, S(\alpha(\mathbf{x}, \theta)))] \tag{11}
\]

where x is the pixel data from the images, y is the corresponding label and α(x, θ) is the function of the neural network, using the same notation as in (8). This will be solved using the stochastic gradient descent algorithm described in section 3.2, with measurements taken according to the next section.
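
A minimal NumPy sketch of the two functions in equations (9) and (10) is shown below; subtracting max(z) before exponentiating is a standard numerical-stability trick and is an addition here, not something taken from the thesis.

    import numpy as np

    def softmax(z):
        # Equation (9); subtracting max(z) avoids overflow without changing the result.
        e = np.exp(z - np.max(z))
        return e / np.sum(e)

    def cross_entropy(y, z):
        # Equation (10); y is a one-hot label and z is the softmax output.
        return -np.sum(y * np.log(z))

    # Illustrative example: a one-hot label for the digit 3 and some raw network output.
    label = np.zeros(10)
    label[3] = 1.0
    raw_output = np.random.default_rng(0).normal(size=10)
    print(cross_entropy(label, softmax(raw_output)))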

4.2 Code and measurements

The actual code used to train the model can be found in the appendix. In short we have used a TensorFlow [14] implementation in Python, inspired by tutorials from [15]. The MNIST data set $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N_{\text{tot}}}$ comes divided into a training set $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N_{\text{split}}}$ and a test set $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=N_{\text{split}}+1}^{N_{\text{tot}}}$, which we use for training and verification respectively. For each set of parameters the network was trained for 7500 iterations, recording the following measurements every 25'th iteration m:

• Empirical cost, defined as $\frac{1}{N_{\text{split}}} \sum_{j=1}^{N_{\text{split}}} H(\mathbf{y}_j, \alpha(\mathbf{x}_j, \theta_m))$

• Expected cost, defined as $\frac{1}{N_{\text{tot}} - N_{\text{split}}} \sum_{j=N_{\text{split}}+1}^{N_{\text{tot}}} H(\mathbf{y}_j, \alpha(\mathbf{x}_j, \theta_m))$

• Empirical accuracy, the ratio of images correctly identified on the training data set

• Expected accuracy, the ratio of images correctly identified on the test data set

This way we obtained data on how the SGD optimization affects the performance on the training and test data sets while solving (11) over the training loop described in section 3.2. Also, as the SGD algorithm is random in nature, we train the network 10 times for each set of parameters in order to obtain mean and variance measurements for the values collected above. In addition to these, we also ran separate training sessions where we collected the time it takes to train the network for different batch sizes.
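
The accuracy measurements above count a prediction as correct when the largest component of the network output coincides with the 1 in the one-hot label. A small NumPy sketch of that ratio, with purely illustrative arrays standing in for real outputs and labels, could look as follows.

    import numpy as np

    def accuracy(outputs, labels):
        # outputs: network outputs, shape (num_images, 10); labels: one-hot labels, same shape.
        # A prediction is correct when the index of the largest output matches the label.
        return np.mean(np.argmax(outputs, axis=1) == np.argmax(labels, axis=1))

    # Illustrative random data standing in for real network outputs and MNIST labels.
    rng = np.random.default_rng(0)
    fake_outputs = rng.random((100, 10))
    fake_labels = np.eye(10)[rng.integers(0, 10, size=100)]
    print(accuracy(fake_outputs, fake_labels))   # close to 0.1 for random outputs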

The parameter sets tested were combinations of the following

• Learning rates used are 0.01, 0.1, 0.5, 1, 2, 5, 10, 20

• Batch sizes used are 1, 2, 4, 8, 16, 32, 64, 128, 256, 512

for a total of 8 × 10 × 10 = 800 training sessions for the varying batch sizes and learning rates, in addition to 10 × 30 = 300 training sessions for recording time data.
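
The sweep over these combinations can be organized as a pair of nested loops; the sketch below is purely illustrative and uses a placeholder train_network in place of the TensorFlow routine in the appendix.

    # Sketch of the parameter sweep; train_network is a placeholder, not the appendix code.
    def train_network(learning_rate, batch_size, iterations, record_every):
        # Stand-in that would run one training session and return its recorded measurements.
        return {"learning_rate": learning_rate, "batch_size": batch_size, "measurements": []}

    learning_rates = [0.01, 0.1, 0.5, 1, 2, 5, 10, 20]
    batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
    runs_per_setting = 10          # 8 x 10 x 10 = 800 sessions in total

    results = {}
    for learning_rate in learning_rates:
        for batch_size in batch_sizes:
            results[(learning_rate, batch_size)] = [
                train_network(learning_rate, batch_size, iterations=7500, record_every=25)
                for _ in range(runs_per_setting)
            ]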

5 Parameter study

5.1 Time dependency on batch size

5.1.1 Method

In order to analyze how the batch size affects the time efficiency, the digit recognizing program was run with K = 500, M = 7500 and ∆t = 1 as fixed parameters while the batch size was varied; the batch sizes (N) used were 1, 2, 4, 8, 16, 32, 64, 128, 256 and 512.


The digit recognizing program was run with each of the previously chosen batch sizes. The start time was saved when the calculations for a new batch size started, and for every 25'th iteration the elapsed time (t_now − t_start) was saved. This procedure was repeated 30 times and generated 30 time series for each batch size.

The first two plots below show the averages of all 30 time series over the iterations and the corresponding standard deviation of each data point.

A linear function was fitted to each time series, where the slope, k, represents the time per iteration for that time series. The third plot below shows the average of the 30 slopes for each batch size, with the standard deviation of each average slope.
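
A minimal sketch of this measurement and line fit is shown below; do_one_training_step is a stand-in for one SGD update of the network and is only there to make the example runnable.

    import time
    import numpy as np

    def do_one_training_step():
        # Stand-in for one SGD update of the digit recognizing program.
        time.sleep(0.0001)

    iterations, elapsed = [], []
    t_start = time.perf_counter()
    for m in range(1, 7501):
        do_one_training_step()
        if m % 25 == 0:                       # record the elapsed time every 25'th iteration
            iterations.append(m)
            elapsed.append(time.perf_counter() - t_start)

    k, intercept = np.polyfit(iterations, elapsed, deg=1)   # slope k = time per iteration
    print(f"time per iteration: {k:.6f} s")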

5.1.2 Result

Figure 7 – The mean of the 30 different time series of the elapsed time for the different batch sizes.

Figure 7 shows that the time the digit recognizing program needs to make m steps grows linearly with the number of iterations made. It is also possible to see from this figure that the time needed to perform m steps grows faster when the batch size is increased. This means that the time per iteration is longer for larger batch sizes.


Figure 8 – The uncertainty for time passed for different batch sizes.

Figure 8 shows the uncertainty in the elapsed time between different time series. The general trend is that the uncertainty of the elapsed time grows linearly with the number of iterations. This is logical: since there is a difference in time per iteration between the different time series, this difference grows larger as the number of iterations increases. Somewhat surprisingly, for N = 16 and some other values of N, patterns can be seen in the data, which probably has to do with the computer the digit recognizing program ran on.

Figure 9 – Time per iteration for different batch sizes.

Figure 9 illustrates how the time per iteration increases seemingly linearly for large batch sizes (N ≥ 8), though this does not seem to hold for the smaller batch sizes. Note that figure 9's x-axis is logarithmic, hence the exponential-looking curve corresponds to a linear relationship. When the error bars are compared with figure 8 one can see, as expected, that small error bars correlate with low slopes.

The slope of this line is specific to the computer the training sessions were run on. A supercomputer could make 2 × 10^17 calculations per second [16], while home desktop computers can only make billions (10^9) of calculations per second. Hence the supercomputer is expected to perform the same calculations faster than the home desktop computer [17].


5.2 Accuracy and learning rate

5.2.1 Method

We measured empirical and expected losses/accuracies in accordance with the model described in 4.2 for the learning rates ∆t = 0.01, 0.1, 0.5, 1, 2, 5, 10, 20. Keeping the batch size constant at N = 128, we ran 10 training sessions per learning rate, taking measurements every 25'th iteration over a total of 7500 iterations. As the optimization is stochastic in nature, the plots below show the averages over these ten sessions for the iterations measured, in order to better capture the converging behaviour.

5.2.2 Result

Figure 10 – Empirical loss for different learning rates, N = 128.

Figure 10 describes the empirical losses for the different learning rates. Unsurprisingly the lower ∆t's exhibit slower declining rates, increasing up until ∆t = 2, which seems to yield the lowest cost at m = 7500. For ∆t = 20 the effect is reversed, declining similarly to ∆t = 0.01 but with a higher variance, suggesting the algorithm has become unstable. This relationship between higher variance and higher learning rates seems to be prevalent for all tested ∆t. Note that figure 10's y-axis is logarithmic.


Figure 11 – Expected loss for different learning rates, N = 128.

In contrast, the expected losses described in figure 11 suggest convergence instead of a constant decline. Again, the lower ∆t's show a slower rate of decline up until ∆t = 10, while ∆t = 20 again exhibits behaviour similar to that of ∆t = 0.01 apart from a higher variance. This higher variance is also present for ∆t = 10, which initially converges more slowly than ∆t = 5 and seems more sporadic, but ultimately converges to a lower value. Note that figure 11's y-axis is logarithmic.

Figure 12 – Empirical accuracies for different learning rates, N = 128.

Figure 12 shows the empirical accuracies for the different learning rates. Clearly noticeable is the fact that the accuracy has not completely converged yet for ∆t = 0.01, which, in accordance with the previous figures, exhibits the slowest growth. The growth increases until ∆t = 5, after which it begins to decline and instead increase in variance. Similarly to previous figures, ∆t = 20 converges at a rate similar to that of ∆t = 0.1 but with higher variance. Most of the other models converge very close to 1, suggesting potential over-fitting at higher iterations.


Figure 13 – Expected accuracies for different learning rates, N = 128.

Qualitatively the behavior in figure 13 is very similar to that in figure 12, apart from the final converged accuracy being lower for seemingly all learning rates but 0.01, 0.1 and 20. The accuracies also flatten out faster than in the previous figure, indicating that most of the latter training iterations for these learning rates are unnecessary.

5.3 Accuracy and batch size

5.3.1 Method

We measured empirical and expected losses/accuracies in accordance with the model described in 4.2 for the batch sizes N = 1, 2, 4, 8, 16, 32, 64, 128, 256, 512. Keeping the learning rate constant at ∆t = 1, we ran 10 training sessions for each batch size, taking measurements every 25'th iteration over a total of 7500 iterations. As the optimization is stochastic in nature, the plots below show the averages over these ten sessions for the iterations measured, in order to better capture the converging behaviour.


5.3.2 Result

Figure 14 – Empirical loss for different batch sizes, ∆t = 1.

Figure 14 describes the empirical losses for the different batch sizes. Unsurprisingly the loss exhibits faster decline rates with increased batch size up to N = 128. For iterations lower than 6000, a batch size of 512 seems to have the fastest rate of decline, but it is then overtaken by the lower batch sizes of 128 and 256, which is an anomaly considering the more general patterns of the plot. Note that figure 14's y-axis is logarithmic.

Figure 15 – Expected loss for different batch sizes, ∆t = 1.

Figure 15 describes the expected losses for the different batch sizes. In contrast to figure 14, which shows that the loss for N = 128, 256 or 512 declines fastest, figure 15 shows that N = 64 declines fastest after around 2000 iterations, which suggests that the model has been over-fitted. Note that figure 15's y-axis is logarithmic.


Figure 16 – Empirical accuracy for different batch sizes, ∆t = 1.

Figure 16 describes the empirical accuracy for the different batch sizes. Clearly noticeable is the fact that the accuracy has converged for all batch sizes. Another fact is that N = 128, 256 and 512 achieve an accuracy of 1, which could mean that the model has been over-fitted.

Figure 17 – Expected accuracy for different batch sizes, ∆t = 1.

Qualitatively the behavior in figure 17 is very similar to that in figure 16, apart from the final converged accuracy being lower for seemingly all batch sizes but N = 1, 2, 4 and 8. For these lower N's the empirical and expected accuracies seem all but equal. For the remaining batch sizes, the accuracies flatten out faster than in the previous figure, indicating that most of the latter training iterations for these batch sizes are unnecessary.

5.4 Performance

5.4.1 Method

Performance of the different parameter sets was computed in two ways. The final expected accuracy Acc_ex was taken as the average of the last 500 iterations over the 10 training sessions (20 × 10 data points). The accuracy at iteration m, Acc_m, was said to have converged to this value if Acc_ex − Acc_m ≤ 0.005. Below we plot these measurements against batch size for the different learning rates.
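
A sketch of these two measures is given below, assuming an array acc of shape (10, 300) holding the expected accuracy of each of the 10 sessions at every 25'th of the 7500 iterations; the synthetic curve is only there to make the example runnable.

    import numpy as np

    # Synthetic accuracy curves standing in for the recorded expected accuracies:
    # 10 sessions x 300 recordings (every 25'th of 7500 iterations).
    rng = np.random.default_rng(0)
    acc = 0.95 - 0.5 * np.exp(-np.arange(300) / 40.0) + rng.normal(0.0, 0.002, size=(10, 300))

    acc_ex = acc[:, -20:].mean()       # average over the last 500 iterations (20 x 10 data points)
    mean_acc = acc.mean(axis=0)        # mean accuracy curve over the 10 sessions

    # First recorded iteration where the accuracy is within 0.005 of the final value.
    converged_at = (np.argmax(acc_ex - mean_acc <= 0.005) + 1) * 25
    print(acc_ex, converged_at)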

5.4.2 Result

Figure 18 – Final expected accuracy against batch size for different learning rates ∆t.

Figure 18 describes how the final accuracy depends on the parameter sets. Most learning rates, apart from ∆t = 0.01, seem to converge to higher values with an increasing batch size, with lower payoffs at very large and very small batch sizes. Overall, high batch sizes with a learning rate of 10 seem to yield the highest final expected accuracy. In addition, higher learning rates shift the plots rightwards, suggesting a larger batch size is required to achieve high accuracy. Note that figure 18's x-axis is logarithmic.


Figure 19 – Iterations until convergence against batch size for various learning rates ∆t.

Similarly, figure 19 describes the dependency of the convergence rate on the batch size for the different learning rates. The overall behaviour observed is a somewhat concave shape for most ∆t's, suggesting that very small and very large batch sizes yield faster convergence, iteration wise, compared to medium size batches. Note that figure 19's x-axis is logarithmic.

6 Analysis and Conclusions

In this thesis the parameters of stochastic gradient descent have been analyzed with regard to their effect on the neural network optimization, in particular the learning rate ∆t and the batch size N. This was done by examining the empirical and expected costs and accuracies during the training processes, in addition to timing the training sessions.

With regard to the time per iteration, the duration of each iteration was constant, with a small deviation between the different time series. The constant nature of the time per iteration remained during the whole training session. When analyzing the time elapsed for different batch sizes, one could conclude that the time per iteration increased linearly with increased batch size. The computer the training session was run on determines how the slope changes for different batch sizes. The linear increase of the slope with increased batch size was expected, as the algorithm that solves equation (8) scales linearly with the batch size (N), as described in section 3. Since this was shown empirically, one can also conclude that the TensorFlow gradient computing algorithm scales linearly with the batch size.

Moreover, a low learning rate unsurprisingly yielded a slower rate of convergence. By observing the algorithm in section 3.2, this is explained by the smaller iterative steps taken between each update of the network, which thus also yields higher stability of the optimization. In contrast, the highest learning rates consistently yield lower final accuracies and decreased stability, which suggests there is an optimal learning rate for each batch size. As mentioned in section 3.2, more sophisticated algorithms make use of varied learning rates, which could combine the quick initial convergence of a high learning rate with the stability and precision of a lower one for higher iterations. It is also worth making note of the discrepancy between the empirical and expected loss (figure 10 and figure 11), which increases with larger step sizes for the higher iterations. Thus higher learning rates seem more prone to over-fit the model, by continuing to optimize the empirical training function while having little effect on the expected one after the initial iterations.

Furthermore, the higher and more stable convergence of higher batch sizes illustrated in figure 14 and figure 15 can be explained by the more accurate approximations of the gradient. Higher batch sizes also yield quicker and higher convergence, suggesting that using a high batch size is preferable. The lower batch sizes also converge quickly but are unstable and yield lower final convergence accuracies, thus explaining this phenomenon. Figure 19 implies that medium size batches converge the slowest (iteration wise). This is possibly due to them being able to approximate the gradient well enough to yield an acceptable final accuracy, but still badly enough not to be able to compete with the higher batch sizes.

As the higher batch sizes clearly do not yield higher performance in a linear manner (figure 18, figure 19), while the cost of using them does grow linearly (figure 9), it is doubtful that using the highest batch sizes is preferable to simply iterating the optimization loop more times. As described in the results, while the medium batch sizes may require more iterations to converge, it is not always double the amount, thus using a lower batch size might be preferable in this regard. Overall, the different batch sizes and learning rates have different properties making them useful for different situations. Using constant batch sizes and learning rates, as done in this thesis, is clearly not optimal, and these could be varied to increase the performance of the networks. Using a medium initial batch size with a higher learning rate initially, and then increasing the batch size while decreasing the learning rate later, should be a better solution, given the results in the previous section.

Further studies could be made into such adaptive optimization variants of SGD, as a constant learning rate clearly is not optimal. Varying the batch size during training sessions is also an interesting area. In this thesis, the sizes of the training set and test set were also held constant and taken directly from the MNIST database. How the ratio of training and test set sizes affects the expected and empirical losses and accuracies was not explored in this thesis and is therefore something that could be investigated further.


References

[1] Google Trends. (2019) Machine learning — Google Trends. [Online]. Available: https://trends.google.com/trends/explore?date=all&q=Machine%20learning [Accessed: 2019-04-24].

[2] V. Maini and S. Sabri, "Machine learning for humans," Medium, 2017. [Online]. Available: https://medium.com/machine-learning-for-humans [Accessed: 2019-04-24].

[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, ch. 1, p. 2. [Online]. Available: http://www.deeplearningbook.org [Accessed: 2019-05-03].

[4] J. Vincent, "DeepMind's AI agents conquer human pros at StarCraft II," The Verge, Jan. 2019. [Online]. Available: https://www.theverge.com/2019/1/24/18196135/google-deepmind-ai-starcraft-2-victory [Accessed: 2019-04-24].

[5] Day One Staff. (2018, Mar.) "How our scientists are making Alexa smarter". [Online]. Available: https://blog.aboutamazon.com/amazon-ai/how-our-scientists-are-making-alexa-smarter [Accessed: 2019-04-24].

[6] J. McCarthy, "What is artificial intelligence?" p. 1, 2007. [Online]. Available: https://web.archive.org/web/20151118212404/http://www-formal.stanford.edu/jmc/whatisai/node1.html [Accessed: 2019-04-28].

[7] M. Esposito, K. Bheemaiah, and T. Tse, "What is machine learning?" The Conversation, 2017. [Online]. Available: http://theconversation.com/what-is-machine-learning-76759 [Accessed: 2019-04-28].

[8] M. A. Nielsen, Neural Networks and Deep Learning. Determination Press, 2015, ch. 1, Sigmoid neurons. [Online]. Available: http://neuralnetworksanddeeplearning.com/chap1.html#sigmoid_neurons [Accessed: 2019-05-02].

[9] ——, Neural Networks and Deep Learning. Determination Press, 2015, ch. 1. [Online]. Available: http://neuralnetworksanddeeplearning.com/chap1.html [Accessed: 2019-04-28].

[10] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org

[11] Y. LeCun, C. Cortes, and C. Burges. The MNIST database of handwritten digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/ [Accessed: 2019-04-24].

[12] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, ch. 4, p. 79. [Online]. Available: http://www.deeplearningbook.org [Accessed: 2019-05-03].

[13] ——, Deep Learning. MIT Press, 2016, ch. 5, p. 130. [Online]. Available: http://www.deeplearningbook.org [Accessed: 2019-05-03].

[14] TensorFlow. [Online]. Available: https://www.tensorflow.org/ [Accessed: 2019-04-27].

[15] Harrison Kinsley ("Sentdex"). PythonProgramming.net. [Online]. Available: https://pythonprogramming.net [Accessed: 2019-04-27].

[16] J. Bryner, "This supercomputer can calculate in 1 second what would take you 6 billion years," Live Science, 2018. [Online]. Available: https://www.livescience.com/62827-fastest-supercomputer.html [Accessed: 2019-04-29].

[17] J. Strickland, "What is computing power?" HowStuffWorks.com, 2019. [Online]. Available: https://computer.howstuffworks.com/computing-power.htm [Accessed: 2019-04-29].


7 Appendix

7.1 Digit recognizing program

The code used to train the digit recognizing program was, as previously stated, inspired by code examples from https://pythonprogramming.net/tensorflow-neural-network-session-machine-learning-tutorial/.
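
As an illustration, a minimal TensorFlow 1.x training loop in the spirit of that tutorial could look as follows; the network sizes and parameter values follow section 4, while the weight initialization, data path and printed measurements are assumptions rather than the original listing.

    # Illustrative TensorFlow 1.x sketch, not the original appendix listing.
    import tensorflow as tf
    from tensorflow.examples.tutorials.mnist import input_data

    mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)   # data path is an assumption

    learning_rate = 1.0
    batch_size = 128
    iterations = 7500

    x = tf.placeholder(tf.float32, [None, 784])   # input layer: 28 x 28 pixels
    y = tf.placeholder(tf.float32, [None, 10])    # one-hot labels

    # One hidden layer with K = 500 neurons and a sigmoid activation.
    W1 = tf.Variable(tf.random_normal([784, 500]))
    b1 = tf.Variable(tf.random_normal([500]))
    hidden = tf.nn.sigmoid(tf.matmul(x, W1) + b1)

    W2 = tf.Variable(tf.random_normal([500, 10]))
    b2 = tf.Variable(tf.random_normal([10]))
    logits = tf.matmul(hidden, W2) + b2

    # Softmax + cross-entropy cost, minimized with plain SGD.
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
    train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

    correct = tf.equal(tf.argmax(logits, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for m in range(1, iterations + 1):
            batch_x, batch_y = mnist.train.next_batch(batch_size)
            sess.run(train_step, feed_dict={x: batch_x, y: batch_y})
            if m % 25 == 0:   # record measurements every 25'th iteration
                train_acc = sess.run(accuracy, feed_dict={x: mnist.train.images, y: mnist.train.labels})
                test_acc = sess.run(accuracy, feed_dict={x: mnist.test.images, y: mnist.test.labels})
                print(m, train_acc, test_acc)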



www.kth.se