Multiresolution Backpropagation Learning
Ricardo Jorge Ferreira Ponciano
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisor: Prof. Andreas Miroslaus Wichert
Examination Committee
Chairperson: Prof. Mário Jorge Costa Gaspar da Silva
Supervisor: Prof. Andreas Miroslaus Wichert
Member of the Committee: Prof. João Carlos Serrenho Dias Pereira
June 2018
Acknowledgments
First I would like to thank my thesis advisor, Prof. Andreas Wichert, for providing all the necessary support in a gradual and timely manner. I want to express my gratitude for his sharing of knowledge about the deep learning field. It was a very positive experience and I found in him not just a thesis advisor, but also a true mentor.
I also want to thank my family for all the support during the course of this work.
Resumo
Training high dimensional data requires minimizing complicated error surfaces. We propose a multiresolution approach with incremental backpropagation-based training to improve generalization. The Gaussian pyramid, generated from an initial pattern of images, is the input to feedforward neural networks that learn from low to high resolution. After the initial training, the preceding values initialize the following neural network. We applied this method to the MNIST dataset for pattern recognition. Multiresolution Backpropagation Learning generalized better than simple backpropagation-based training, with faster convergence. We verified empirically that we can get close to a global minimum, avoiding local minima.
Keywords: Multiresolution, Neural networks, Generalization, Local minima, Deep learning
Abstract
High dimensional data training requires minimizing complicated error surfaces. We propose a multiresolution approach with incremental backpropagation-based training to improve generalization. A Gaussian pyramid, generated from an initial pattern of images, is the input to feedforward neural networks that learn from lower to higher resolution. After the initial training, the preceding values initialize the following neural network. We applied this method to the MNIST dataset for pattern recognition. Multiresolution Backpropagation Learning generalized better than simple backpropagation-based training, with faster convergence. We verify empirically that we can reach near a global minimum, avoiding local minima.
Keywords: Multiresolution, Neural networks, Generalization, Local minima, Deep learning
Contents
Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Thesis Outline
2 Background
2.1 Artificial neuron
2.2 Perceptron
2.3 Activation Functions
2.4 Gradient descent
2.5 Stochastic Gradient Descent
2.6 Stochastic Gradient Descent Optimizations
2.7 Deep Learning
2.8 Deep Feedforward Networks
2.9 Backpropagation
2.10 LeCun Convolutional Neural Networks
2.11 Deep Learning Problems
2.12 Overfitting
2.13 Regularization
2.14 Multiresolution Processing
2.15 Subspace Tree
3 Multiresolution Backpropagation Learning
3.1 Gaussian Pyramid Generation
3.2 Artificial Neural Networks
3.3 Weight replication
3.4 Multiresolution Backpropagation Learning Architecture
4 Empirical Experiments
4.1 Dataset
4.2 Preprocessing
4.3 Performance measure
4.4 Preliminary Experiments
4.4.1 Preliminary CIFAR-10 Experiments
4.5 Experiments
4.5.1 MNIST Gaussian Pyramid
4.5.2 Networks Training Parameters
4.5.3 Results
5 Conclusions
5.1 Achievements
5.2 Future Work
Bibliography
List of Tables
4.1 Image classification error rate (%) of preliminary experiments on MNIST.
4.2 Image classification accuracy (%) of preliminary experiments on CIFAR-10.
4.3 Summary of MrBL and BL evaluation training parameters.
4.4 Image classification error rate of MrBL and BL evaluation on MNIST.
4.5 Image classification error rate of BL evaluation on MNIST with 100 epoch training.
List of Figures
2.1 A simplified mathematical model of McCulloch and Pitts artificial neuron.
2.2 Graphical representation of activation functions.
2.3 Surface generated by the training error of a linear neuron with two input weights. Also known as the error surface.
2.4 Illustration of a saddle point example [10].
2.5 A simplified model of an artificial neural network.
2.6 Convolution step [25].
2.7 Generic convolutional neural network architecture. The convolution step occurs in convolutional layers, max pooling occurs in pooling layers and the fully connected layers correspond to multilayer artificial neural networks [25].
2.8 Illustration of a local minimum example, given by the red dot.
2.9 RGB raster image example showing individual pixels as squares and colour components as values.
2.10 Schematic representation of a Gaussian pyramid with five levels.
3.1 Schematic representation of three images at different resolution as input to three feedforward networks with backpropagation-based training.
3.2 Hierarchy of weight replication.
3.3 Schematic representation of three images at different resolution as input to three feedforward networks with backpropagation-based training, showing the direction of the image resolution reduction and the direction of the training procedure.
4.1 MNIST Gaussian Pyramid sample images with σ = 1.
4.2 MNIST Gaussian Pyramid sample images with σ = 2 (two levels).
4.3 MNIST Gaussian Pyramid sample images with σ = 3 (two levels).
4.4 Error rate (%) obtained using different standard deviation values with MrBL.
4.5 Training set convergence properties of MrBL networks and BL evaluation, with scaled epochs.
4.6 Training and test set convergence properties of MrBL and BL evaluation.
4.7 Test set convergence properties of MrBL and BL evaluation, with vertical bars representing the standard deviation.
Chapter 1
Introduction
The human brain is a complex system, with about 86 billion neurons interacting through approximately 150 trillion synapses, allowing humans to perform an enormous diversity of tasks; these capabilities are desirable in artificial models [1]. Studied by neuroscience, the brain's structure and operation inspired the development of artificial neural networks, intended to simulate the brain's learning capacity. To this day, it is the best known learning device.
Deep learning relies on specific architectures of artificial neural networks that learn from datasets [2]. Deep learning methods are used in many domains, especially in pattern recognition tasks such as image or speech recognition. Their recent popularity has several causes, mainly the availability of larger amounts of data and the increase in computational power.
One of the main challenges with learning algorithms is the difficulty of obtaining a model that generalizes properly to new data [3]. Nowadays, the abundance of high dimensional data, such as image databases, spans fields like information technology, astronomy and bioinformatics, and it can lead to performance problems [4]. Deep learning offers the possibility to overcome such limitations; however, its architectures are challenging to optimize and are not yet fully understood [5, 6].
Empirical studies have tried to explain certain phenomena that occur during training, focusing on specific types of critical points in the shape of the loss function [7, 5]. When training involves minimizing the loss or error function with gradient descent based algorithms, the derivative of that function may equal zero, meaning that no information is given about the direction in which to move. Such points are called critical points. There are three types of critical points: local maxima, local minima and saddle points. The latter two are more concerning, since they typically occur during optimization and can lead to non-optimal solutions [3].
We propose an optimization method to overcome the problem of local minima using a multiresolution approach and feedforward neural networks with backpropagation-based training. By obtaining a sequence of low dimensional subspaces and training incrementally, we verify empirically that we can reach near a global minimum, avoiding local minima [8]. This approach tries to deal with the undesired effects of high dimensional data, leading to better generalization and avoiding overfitting. It gives an alternative way to explore deep learning for pattern recognition tasks.
1.1 Motivation
Deep learning works very well [9]. An artificial neural network with several hidden layers, optimized with the stochastic gradient descent algorithm, tends not to get trapped in a local minimum [10]. It has been shown empirically that having many layers results in more saddle points than local minima, and that the high dimensional error surface in such situations becomes a more flattened landscape [7, 5]. However, studies demonstrated that local minima tend to occur towards the bottom of the landscape and are expected to be found mostly near the end of training [5].
Motivated by these theories, we developed a novel regularization method that resorts to a hierarchy of different and gradually less complicated landscapes to improve generalization. We created a sequence of subspaces represented by images at different resolutions and trained feedforward networks with stochastic gradient descent computed with backpropagation, from lower to higher resolution.
This way, we do not start with such a complicated landscape, and we use the preceding network to initialize the following one without reaching the bottom of each landscape, where more local minima lie. Instead of using many hidden layers in the feedforward networks we used one, because many hidden layers result in more saddle points than local minima.
1.2 Objectives
The main goals of this thesis are to compare the Multiresolution Backpropagation Learning method with the conventional backpropagation procedure and to verify empirically whether the proposed method offers an advantage over traditional backpropagation, using feedforward neural networks.
1.3 Thesis Outline
This work is divided into five chapters. After the present chapter, chapter 2 introduces the basic concepts of deep learning, also covering their major problems and solutions. Then we give an overview of multiresolution processing and the subspace tree, paving the way for the experiments carried out later.
Chapter 3 describes the components of the Multiresolution Backpropagation Learning method, showing how it was done and focusing on the description of the algorithm behind it.
Chapter 4 presents the empirical experiments conducted using the proposed method. We start with the description of the dataset and the preprocessing, then describe the preliminary experiments made before the main experiments. In the main experiments we explain the method's training process in more detail and then present the results obtained.
Chapter 5 presents the conclusions of the work done, the main achievements and ideas for future studies.
Chapter 2
Background
2.1 Artificial neuron
The search for artificial models that mimic the human brain started around 1940, when the first electronic computers were developed.
The linear model introduced by McCulloch and Pitts in 1943 was perhaps the first artificial neuron,
presenting important features which can be found in many artificial neural networks [11].
The basic components of an artificial neuron are shown in Figure 2.1: the weighted sum of the input signals is compared to a threshold (activation function) to determine the output [12]. A neural network is no more than a collection of such nodes or units (Figure 2.1) connected by links.
Figure 2.1: A simplified mathematical model of McCulloch and Pitts artificial neuron.
2.2 Perceptron
Proposed by Frank Rosenblatt in 1957, the perceptron was built around the McCulloch and Pitts model
of a neuron to solve pattern recognition problems [13]. The author introduced a learning rule for training
perceptron networks that converges to the correct network weights [12].
Consider a simple perceptron, also known as a one-layer feedforward network, with a set of n inputs and one output. Each input contributes the product of a weight value (w_i) and the corresponding input vector value (x_i), and the neuron's input is given by their summation (net). The bias term (w_0) is an extra weight constant [14, 15]:

y_k = \phi(net) = \phi\left(\sum_{i=1}^{n} w_i \cdot x_i + w_0\right)    (2.1)
Obtaining the perceptron's output requires the application of the activation function \phi(), which can be either linear or nonlinear [15]. In this case \phi() is a hard threshold, represented by the nonlinear sign function:

\phi(net) := \mathrm{sgn}(net) = \begin{cases} 1 & \text{if } net \geq 0 \\ -1 & \text{otherwise.} \end{cases}    (2.2)
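As a concrete illustration, Equations (2.1) and (2.2) reduce to a few lines of NumPy; the AND weights and variable names below are purely illustrative and not part of the original formulation:

import numpy as np

def perceptron_output(x, w, w0):
    # Weighted sum of the inputs plus the bias term (Eq. 2.1),
    # passed through the hard-threshold sign function (Eq. 2.2).
    net = np.dot(w, x) + w0
    return 1 if net >= 0 else -1

# Illustrative weights implementing the boolean AND over inputs coded as +1 / -1.
w, w0 = np.array([1.0, 1.0]), -1.5
print(perceptron_output(np.array([1, 1]), w, w0))    # 1
print(perceptron_output(np.array([1, -1]), w, w0))   # -1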
The XOR Problem
A simple perceptron can only deal with linearly separable inputs, a limitation identified long ago by Minsky and Papert, 1969 [3]. A one-layer perceptron represents a hyperplane in n-dimensional space that divides linearly separable inputs, and for this reason it can only deal with the boolean functions AND, OR, NAND and NOR. For XOR, no hyperplane exists that can classify the input patterns under these conditions [16].
2.3 Activation Functions
Whether in a simple perceptron or in the units of a neural network, some kind of activation function is usually applied [15]. The design of activation functions for training the neural networks used in deep learning is currently an area of active research [17]. It is desirable that they present some general characteristics, such as being nonlinear, continuous and differentiable, in order to facilitate the application of gradient-based methods [14].
There are several types of activation functions that can be used, such as the sigmoid or logistic function (Figure 2.2(a)), the hyperbolic tangent (Tanh), the hard Tanh, the rectified linear unit (ReLU) (Figure 2.2(b)) and its variants, the SoftPlus, the Softmax, the Maxout or the Radial basis function [3, 14].
Figure 2.2: Graphical representation of activation functions: (a) Sigmoid; (b) ReLU.
2.4 Gradient descent
Since a simple perceptron can fail to classify inputs that are not linearly separable, optimization is required. Gradient descent, also called steepest descent, is an optimization algorithm that gradually changes vectors to find a local minimum of a function [18]. For a better understanding, consider an unthresholded perceptron or linear unit:

o = \sum_{i=0}^{n} w_i \cdot x_i = net.    (2.3)
Now we need to define the training error or loss function E(w) for N training examples, based on the target output values (t_k) minus the output values of the linear unit (o_k). This function determines the shape of the surface that will try to fit the training samples [15] (Figure 2.3):

E(w) = \frac{1}{2} \cdot \sum_{k=1}^{N} (t_k - o_k)^2.    (2.4)
Figure 2.3: Surface generated by the training error of a linear neuron with two input weights. Also known as the error surface.
Here w refers to the weight vector w = (w_0, w_1) on which o depends. In order to know the direction of steepest descent at each point of the training error function, we compute the derivative of E with respect to each component of the weight vector and move in the downward sloping direction:

\Delta w_i = -\eta \cdot \frac{\partial E}{\partial w_i}.    (2.5)
The constant η is the learning rate, indicating the step size of the gradient descent algorithm. With these considerations, we can obtain a more practical algorithm by differentiating equation (2.4) and substituting the result into equation (2.5), which yields the weight update rule for gradient descent:

\Delta w_i = \eta \cdot \sum_{k=1}^{N} (t_k - o_k) \cdot x_{k,i}.    (2.6)
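A minimal NumPy sketch of the weight update rule (2.6), assuming the training inputs are stacked in a matrix X with a leading column of ones for the bias weight w_0 (the function name and shapes are illustrative, not part of the original text):

import numpy as np

def gradient_descent_epoch(X, t, w, eta):
    # X: (N, n+1) inputs with a leading column of ones, t: (N,) targets, w: (n+1,) weights.
    o = X @ w                        # outputs of the linear unit for the whole training set
    return w + eta * X.T @ (t - o)   # accumulate (t_k - o_k) * x_{k,i} over all N samples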
2.5 Stochastic Gradient Descent
Stochastic gradient descent, also known as LMS (least mean-square) or the delta rule, is an extension of the previous algorithm in which the weight updates occur for each training sample instead of the complete training set N:

\Delta w_i = \eta \cdot (t_k - o_k) \cdot x_{k,i}.    (2.7)
By performing the weight updates iterating over one training sample at a time, we can evaluate the gradient in a computationally more efficient and faster way than with the gradient descent rule (equation (2.6)). Still, stochastic gradient descent performs updates with higher variance, meaning that it can skip some local minima when there are several [19].
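The per-sample counterpart of the previous sketch, following Equation (2.7), could look as follows (again an illustrative sketch rather than the implementation used later in the thesis):

def sgd_update(x_k, t_k, w, eta):
    # Delta-rule step: adjust the weights immediately after seeing a single sample.
    o_k = x_k @ w
    return w + eta * (t_k - o_k) * x_k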
This rule was applied in 1960 with Adaline, an "adaptive pattern classification machine", and has since been used widely by other authors [20, 3].
2.6 Stochastic Gradient Descent Optimizations
Despite the high popularity of gradient descent techniques, especially stochastic gradient descent, there are some disadvantages associated with them.
Choosing a proper learning rate (η) is a challenge, since this task is often carried out by trial and error. If η is too small, many iterations are needed to get near the best values, making the process slow. On the other hand, if η is too big, the algorithm can skip or diverge from the optimal solution due to large oscillations of the function [19, 3].
Another main disadvantage is the possibility that the function gets trapped in a local minimum, skipping the desired value [19]. In this case, the importance of saddle points should be taken into consideration: a saddle point is a stationary point (Figure 2.4) that can be mistaken for a local minimum, also making the algorithm produce wrong results [10]. Several improvements to stochastic gradient descent have been proposed, with methods such as Momentum, Nesterov Momentum, Adagrad, Adadelta, RMSProp, Adam and others [21, 22, 23, 24].

Figure 2.4: Illustration of a saddle point example [10].
2.7 Deep Learning
Since a perceptron can only classify linearly separable inputs, and knowing that stochastic gradient descent based on a simple perceptron (thresholded or unthresholded) does not exploit the full potential of this algorithm, other approaches resort to more units and more layers [3, 16]. The usage of many layers, hidden layers included, defines the artificial neural networks (Figure 2.5) used in deep learning, able to build complex concepts from simpler ones. There are several different deep learning methods and architectures [25]. This overview will focus on deep feedforward networks, the backpropagation procedure and LeCun convolutional neural networks.
Figure 2.5: A simplified model of an artificial neural network.
2.8 Deep Feedforward Networks
The introduction of deep feedforward networks, also known as multilayer perceptrons [3], allowed the definition of the first artificial neural networks in which the main limitation of the simple perceptron can be overcome. This kind of artificial neural network is a layered structure through which information flows, starting in the input units layer, then through the hidden units layers, until it reaches the output units layer, which produces the network outputs. Each unit in a layer is connected to every unit in the following layer, meaning that the network is fully connected [16]. Concerning hidden layers, their output is not visible and error correction is only done indirectly [3].
We can use simple perceptrons as the units of a deep feedforward network to classify non linearly separable inputs, solving the XOR problem, but we then remain limited only to linear functions. Alternatively, we can use continuous and differentiable nonlinear activation functions, which allow a better understanding of the interaction between input variables. So, in a deep feedforward network, units typically have a continuous nonlinear activation function, which can be chosen according to the dataset's nature [17, 26, 27, 28].
2.9 Backpropagation
Applied in the 1980s by Rumelhart, Hinton and Williams and by other groups of authors, backpropagation is a procedure to adjust the weights of a function and is commonly used as the learning method in multilayer artificial neural networks [3, 29, 30, 27]. It is an iterative procedure with two main phases, a propagate-inputs-forward phase and a propagate-errors-backward phase, which for simplicity will be designated phase one and phase two, respectively.
In order to understand the backpropagation procedure, we first need to consider some fundamentals behind it, starting with phase one. As referred to in the deep feedforward networks section, the network units typically have a nonlinear activation function, and it is precisely this type of activation that will be used here. Among the different types of nonlinear activation function (σ()), the sigmoid function will be considered due to its popularity, although others are also used [14]:

\sigma(x) = \frac{1}{1 + e^{-\alpha \cdot x}}    (2.8)
The term α is a positive constant that indicates the steepness of the function. The activation function
is applied to each unit of the network, giving the generic output:
o_k = \sigma\left(\sum_{i=0}^{n} w_i \cdot x_{k,i}\right)    (2.9)
Next, based on equation (2.4) and considering that we now have multiple units, the training error must be redefined with respect to all weights of the network, with N training patterns and l outputs:

E(w) = \frac{1}{2} \cdot \sum_{k=1}^{N} \sum_{i=1}^{l} (t_{ki} - o_{ki})^2    (2.10)
Having obtained E(w), we know at this point the resulting error between the desired and the computed values at the output units. This training error should be as low as possible, and to achieve that we need its derivative, using gradient descent for the output units and the chain rule of calculus for the hidden units. Thus begins phase two of the backpropagation procedure. The error is propagated backward from the output units to the hidden units and the network updates its weights [15]. Generically, the rule to update the hidden units' weights, determined from the inputs to the unit (x_{k,j}) and the error term (δ_{ki}) from the unit outputs, is given by:

\Delta w_{ij} = \eta \sum_{k=1}^{N} \delta_{ki} \cdot x_{k,j}.    (2.11)
Both phases of the algorithm typically run several times, one after the other, and the stopping condi-
tion of this iterative process can be specified, for example, through a predetermined number of iterations
or through a predetermined threshold applied to the error.
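To make the two phases concrete, the following is a minimal NumPy sketch of one backpropagation step for a single pattern in a one-hidden-layer network; it assumes the sigmoid of Equation (2.8) with α = 1, squared error and no bias terms, which are simplifications on our part:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W1, W2, eta):
    # Phase one: propagate the input forward through the hidden and output layers.
    h = sigmoid(W1 @ x)                            # hidden activations
    o = sigmoid(W2 @ h)                            # output activations
    # Phase two: propagate the error backward and update the weights (Eq. 2.11, one sample).
    delta_out = (t - o) * o * (1 - o)              # error terms at the output units
    delta_hid = (W2.T @ delta_out) * h * (1 - h)   # error terms at the hidden units (chain rule)
    W2 += eta * np.outer(delta_out, h)
    W1 += eta * np.outer(delta_hid, x)
    return W1, W2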
Momentum
The momentum method can be added to backpropagation in order to accelerate the learning process by changing the weight update [31]. With this method, at each iteration the weight update depends on the weight update of the previous iteration [32]. It is based on equation (2.11), but with a momentum term added to speed up convergence and reduce oscillation:

\Delta w_{ij}(n) = \eta \sum_{k=1}^{N} \delta_{ki}(n) \cdot x_{k,j}(n) + p \, \Delta w_{ij}(n-1),    (2.12)

where n is the iteration number and 0 ≤ p < 1 is the momentum parameter. For example, in the presence of a steeply sloped valley on the error surface of the neural network, the system oscillates almost horizontally backwards and forwards, causing a very slow descent along that valley; adding a momentum term can help to overcome this situation [32]. Also, by using the momentum method with a high learning rate (η), the large oscillations that the learning rate could cause are reduced, so a minimum is reached more quickly. Without the momentum method, that minimum might not even be reached [15].
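In code, the momentum term of Equation (2.12) only requires remembering the previous weight change; a small sketch, where the value p = 0.9 is a common choice rather than one taken from this thesis:

def momentum_step(W, grad, prev_delta, eta, p=0.9):
    # grad holds the summed delta * input terms of Eq. (2.11) for the current iteration n;
    # prev_delta is the weight change applied at iteration n - 1.
    delta = eta * grad + p * prev_delta
    return W + delta, delta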
2.10 LeCun Convolutional Neural Networks
Convolutional neural networks are similar to the artificial neural networks already described above, but present a different architecture. Yann LeCun pioneered the first convolutional neural networks, having begun his investigations before 1989 and culminating in the LeNet5 architecture, a convolutional neural network with several layers representing different operations such as convolution, pooling, nonlinearity and classification [33]. Its development was inspired by the neocognitron of Fukushima [34]. A convolutional neural network works well in pattern recognition tasks, namely image recognition, using a "grid-like topology" [3]. Generically, it contains one or several convolutional layers followed by a multilayer artificial neural network. Next, the main operations of a convolutional neural network are described.
Convolution Step
In the convolution step (Figure 2.6), we want to extract features from an input, typically an image. Since an image is basically a matrix of pixels, another sliding matrix can be applied over that input. This matrix is called a filter, kernel or feature detector; as it moves over the input, it multiplies the overlapping elements and adds up the results. The resulting matrix is called a feature map or convolved feature. If the values of the kernel matrix are changed, different characteristics of the input image can be detected (e.g. curves) [25]. All these multidimensional matrices are also called tensors [3].
It is also worth noting that each unit of a convolutional neural network layer depends only on one region of the input, designated its receptive field [35].
Figure 2.6: Convolution step [25].
The nonlinear activation function ReLU is usually applied in order to replace the negative values of the feature map with zero. This function allows the artificial neural network to learn faster [36].
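A plain NumPy sketch of the convolution step followed by the ReLU, as described above (the 3 × 3 vertical-edge kernel is only an example; real networks learn their kernel values):

import numpy as np

def convolve2d_valid(image, kernel):
    # Slide the kernel over the image, multiply the overlapping elements and sum them up.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)
kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])   # detects vertical edges
feature_map = np.maximum(convolve2d_valid(image, kernel), 0.0)     # ReLU: negatives become zero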
Pooling
The pooling, sub-sampling or down-sampling operation is intended to reduce the dimensions of the feature map, retaining the most relevant information by applying a function such as Max, Sum or Average, among others. These functions define the type of pooling strategy, with average pooling and max pooling being the most used [25]. This operation also uses a sliding window (filter) with a predefined dimension, which advances in a predefined step (stride). By reducing the size of the input representation, the pooling operation helps to control overfitting.
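Max pooling with a 2 × 2 window and stride 2, the most common setting, can be sketched as follows (window size and stride are illustrative parameters):

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Keep only the maximum of each size x size window, moving the window by `stride`.
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size].max()
    return out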
Fully Connected Layer
After the convolution, pooling and nonlinearity comes a fully connected layer, completing the basic elements of a generic convolutional neural network architecture (Figure 2.7). It is no more than a multilayer artificial neural network placed after the last pooling layer, with the latter acting as input to the former. The multilayer artificial neural network then performs a typical classification task. There are several methods based on convolutional neural networks, like AlexNet, GoogLeNet, VGGNet or DenseNet [36, 37, 38, 39].
2.11 Deep Learning Problems
The usage of deep neural network architectures on different datasets has shown very good results, but despite these achievements, using more layers does not always produce a better learning artificial network [40]. The following subsections cover the main problems related to typical deep neural networks: local minima, vanishing gradients and overfitting.
Figure 2.7: Generic convolutional neural network architecture. The convolution step occurs in convolutional layers, max pooling occurs in pooling layers and the fully connected layers correspond to multilayer artificial neural networks [25].
Local Minima
Since the backpropagation procedure uses gradient descent to reduce the training error, the algorithm may get trapped in one of the several local minima of a neural network error surface (Figure 2.8) [7, 41].
Figure 2.8: Illustration of a local minimum example, given by the red dot.
This means that backpropagation has difficulty in finding the lowest training error value and therefore does not converge to the global minimum. It is assumed that an artificial neural network with several hidden layers is less likely to be stuck in a local minimum and that it is easier to find the right parameters, as demonstrated by empirical experiments [27].
Vanishing Gradients
The vanishing gradients problem is a phenomenon that may occur when training deep neural networks, where the backpropagated error decreases rapidly, tending to zero as it approaches the input layer [42]. When classical activation functions are used, like the sigmoid or the hyperbolic tangent with their finite activation ranges, (0, 1) and (−1, 1) respectively, the error output is limited. So, the error is backpropagated over the hidden layers with increasingly smaller values, meaning that weight updates become more and more residual. Some solutions have been proposed to deal with the vanishing gradients problem, but the recommended one is the usage of the rectified linear unit (ReLU), an activation function defined as f(x) = max(0, x) [42, 3, 43, 44]. The opposite phenomenon can also occur, with backpropagated errors suffering a large increase, designated exploding gradients [42, 45].
2.12 Overfitting
Overfitting is one of the main challenges in machine learning: a learning algorithm (model) performs well on the training data but poorly on new data. In order to obtain a correct description of the data, we estimate the minimum training error [8]. During this process, the model adapts very well to the training data, which usually contains noise. Memorization occurs, instead of a smoother and more generalized adaptation [15].
If the learning algorithm is tightly fitted to the training data, it will act more poorly on previously unseen data, like a test set, used to assess classification performance, or a validation set, used for parameter tuning. In the end, we want to find a model in which the difference between the training error and the test error is as small as possible.
The ability to classify new unseen inputs defines the model performance, also known as generalization [3]. Underfitting may also happen, due to a poor performance of the model in finding a good minimum training error.
2.13 Regularization
In order to deal with the overfitting problem, several authors have proposed different regularization methods. Regularization is defined as "any modification" made to "a learning algorithm that is intended to reduce its generalization error but not its training error", and preventing overfitting in this way is one of the main concerns when designing a machine learning architecture [3].
This section introduces some regularization techniques from the wide range of options available, namely Data augmentation, Early stopping, Bagging and Dropout, the L2 and L1 weight penalties, and others.
Data augmentation
Data augmentation has been used by several authors and consists of generating additional data for the training datasets in order to obtain a machine learning model with better generalization [46, 47, 36, 48].
Early Stopping
When training certain large models with a marked tendency to overfit, the training error decreases over time but the validation set error starts to increase at a given moment [3]. Early stopping is an efficient capacity control approach based on monitoring the performance on the validation set during training in order to return the parameters with the lowest validation set error, rather than the latest parameters [3, 49]. A recent work proposes a novel early stopping criterion which removes the need for a held-out validation set [50].
Bagging and Dropout
Bagging (an acronym for "bootstrap aggregating") is another regularization procedure; it reduces the generalization error by combining several models [51, 3]. These are trained separately and then vote on the output for test examples, based on the assumption that different models will not all make the same errors on the same test set. This general strategy is called model averaging, and the techniques that employ it are also known as ensemble methods [3].
Dropout is a strategy proposed by [52] and is a variant of the ensemble method in which different neural network topologies are combined by randomly dropping out nodes during the training phase, in order to prevent complex co-adaptations and to enhance the generalization performance of the network [53]. Dropout has been used by several authors, whether with in-depth explanations or with improvements such as the "standout" method, fast dropout training and others [54, 55, 56, 57, 58, 59, 60]. DropConnect is a generalization of Dropout which randomly drops the weights instead of the activations [61].
Weight penalty L2 and L1
Parameter norm penalties are a regularization approach based on limiting the model's capacity by adding penalties on its parameters [3, 62, 53]. For neural network models, a parameter norm penalty that penalizes only the weights, like L2 or L1, is typically selected. L2 regularization is also called weight decay or Tikhonov regularization and is the most common form of parameter regularization, encouraging near-zero weights [63, 64]. L1 regularization results in a sparser solution compared to L2, meaning that some parameters have an optimal value of zero, which is useful as a feature selection mechanism [3]. LASSO (least absolute shrinkage and selection operator) is a well known model based on the L1 penalty, proposed by [65] and with recent adaptations [3, 66].
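In practice, both penalties amount to adding a term on the weights to the unregularized loss; a small sketch, where the regularization strength lam is a hyperparameter we introduce only for illustration:

import numpy as np

def penalized_loss(data_loss, weights, lam=1e-4, norm="l2"):
    # L2 (weight decay) pushes weights towards zero; L1 drives some of them exactly to zero.
    if norm == "l2":
        return data_loss + lam * np.sum(weights ** 2)
    return data_loss + lam * np.sum(np.abs(weights))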
Others
Multi-task learning is a mechanism whose main goal is to improve generalization by training tasks in parallel using a shared representation. When applied to artificial neural networks, it uses a shared hidden layer trained in parallel on all tasks, benefiting the overall learning [67, 68]. This method has been applied with success in areas as diverse as natural language processing, video games and biomedical science [69, 70, 71].
Sparse representation is achieved by penalizing the activations of the units in a neural network so that their activations become sparse [3, 72]. Although this method performs well, it has difficulties dealing with low dimensional data; still, [73] proposed an effective method to overcome this situation.
Parameter tying is another technique that allows models to learn good representations of the input data by reducing the number of learnable parameters in Convolutional Neural Networks, which makes it possible to train these models with fewer examples [2, 74].
2.14 Multiresolution Processing
Multiresolution processing and analysis refers to the decomposition of a signal into more than one scale
or resolution [75, 76]. A signal can be defined as ”a function that conveys information about the behaviour
of a system or attributes of some phenomenon” that can be processed into images, sound, and others
[77].
The basic idea behind multiresolution theory is not recent. In the beginning of the 1800s, Joseph Fourier proposed essential theories about frequency analysis, using superpositions of sines and cosines to represent signals, which allowed the development of new approaches later on [78]. One of the most interesting later discoveries concerned wavelets. Wavelets are small wave-like oscillations with diverse frequencies and limited duration that can be used as a mathematical tool to extract information from signals [78, 75]. In this sense, the work of Stéphane Mallat and Yves Meyer (after 1980) introduced the wavelet representation as a significantly new approach to image processing and analysis, called multiresolution theory. This theory incorporates techniques from different fields, namely signal processing, digital speech recognition and pyramidal image processing, in which a given signal is decomposed into different scales or resolutions and then reconstructed from the elements of its decomposition [75, 79, 78].
Multiresolution processing and analysis is a very useful technique that is applied to the field of image
processing and computer vision. We can find applications of this technique in object detection and visual
recognition [80, 81, 82], robotic grasping detection [83], alignment and tracking [84], machine learning
[76, 85] and others [76].
In this section, we give an overview of the multiresolution technique applied to the field of image processing and computer vision, with a focus on Image Pyramids.
Digital Images
A digital image can be represented either by vector graphics, based on mathematical formulas that define geometrical primitives (e.g. polygons, lines), or by raster graphics, represented by pixels. The term "digital image" usually refers to the raster graphics image, which is typically a two dimensional array of values called pixels, its smallest elements. Each pixel is defined by a certain number of bits, within a range of intensity values, indicating the colour components it can represent. This concept is called bit depth or pixel depth [8, 75].
Among the colour encoding models available (e.g. YUV, CIELAB), the RGB colour model is a popular method used in computing (Figure 2.9). The acronym means that images in the red (R), green (G) and blue (B) colour space are defined by three numbers, one for each colour. Each component can be represented by a range of values depending on the bit depth. For example, a 24-bit colour image typically uses 8 bits for each of the R, G and B components, giving more than 16 million (2^24) colour variations. An 8-bit component can have 256 possible values (2^8), from 0 to 255. RGB digital images may have an additional component that can create partial or full transparency, called the alpha channel [86].

Figure 2.9: RGB raster image example showing individual pixels as squares and colour components as values.
In the case of black and white digital images, the intensity varies between the different grey levels,
from the darkest to the lightest grey. They have a single 8-bit component per pixel, resulting in 256
different grey levels.
Image Pyramids
An image pyramid is a structure that provides multiresolution image representations [8, 87, 88]. This kind of representation is somewhat similar to human visual encoding. The human visual system is very effective in object recognition and in the representation of pictorial information, but has difficulties evaluating distances and areas and accurately distinguishing grey scales [89]. When we analyse a given image with objects and features of many sizes, large and high contrast objects are viewed coarsely, while the remaining objects usually need to be at a higher resolution for a proper examination [75]. Studying images at different resolutions is the main motivation behind the concept of image pyramids.
An image pyramid is a simple and computationally effective structure in which the base of the pyramid contains a high-resolution image, followed by a collection of decreasing resolution images up to the apex, which contains a low-resolution approximation of the image. Moving towards the apex, image size and resolution decrease. Considering an image at a base level J with size 2^J × 2^J or N × N, where J = \log_2 N, there are J + 1 resolution levels in the pyramid, from 2^J × 2^J to 2^0 × 2^0, with 0 ≤ j ≤ J. Nevertheless, most pyramids are truncated to P + 1 levels, where 1 ≤ P ≤ J, since going to a very reduced resolution of a big original image may not add relevant information [75].
To generate an image pyramid, the original image can be decomposed into a set of lowpass filtered copies via a Gaussian pyramid, or into a set of bandpass filtered copies via a Laplacian pyramid [88]. In a Gaussian pyramid (Figure 2.10), the lowpass filtering is done by smoothing the image with the appropriate filter and then downsampling (subsampling) the smoothed image in an iterative fashion. In a Laplacian pyramid, the bandpass filtering is done by subtracting each Gaussian pyramid level from the next lower level and then performing an image interpolation between adjacent levels [90]. Other filter operations can also be employed [8].
Figure 2.10: Schematic representation of a Gaussian pyramid with five levels.
The smoothing operation used in a Gaussian pyramid is a Gaussian filter (Gaussian blur), applied to transform each pixel of the original image. The Gaussian filter in two dimensions is given by a Gaussian function G(x, y):

G(x, y) = \frac{1}{2 \cdot \pi \cdot \sigma^2} \cdot e^{-\frac{x^2 + y^2}{2 \cdot \sigma^2}}    (2.13)

where σ is the standard deviation of the Gaussian distribution. The Gaussian function expresses the normal distribution, an important concept in statistics used to represent random variables with a large variety of distributions [91]. Visually, this formula produces the shape obtained from the Gaussian "bell curve" rotated around the vertical axis [92].
Since the Gaussian function extends to infinity, we must truncate it, taking advantage of its near zero values at more than 3σ from the mean. As a solution, we can use a simple rectangular window function, with values from the truncated normal distribution, to build a convolution matrix. This matrix is applied to the image, setting new values for its pixels. In other words, the Gaussian filtering process involves the convolution of the image with the convolution matrix [8].
The blurred image is then downsampled by a factor of 2. The Gaussian filter and downsample steps are repeated to generate the typical P + 1 levels of the pyramid. These operations ensure that the sampling theorem is respected, meaning that we get no distortions of the signal (image) from sampling. Thus, the size reduction goes together with an appropriate smoothing, ensuring a properly downsampled image [89].
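The whole blur-and-downsample cycle is short in code; the sketch below uses SciPy's Gaussian filter (the σ = 2 value matches the one later used in the preliminary experiments, and the random image is just a stand-in):

import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, levels=3, sigma=2.0):
    # Repeatedly smooth with the Gaussian filter of Eq. (2.13) and keep every second row/column.
    pyramid = [image]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyramid[-1], sigma=sigma)
        pyramid.append(blurred[::2, ::2])
    return pyramid

pyr = gaussian_pyramid(np.random.rand(28, 28))
print([level.shape for level in pyr])   # [(28, 28), (14, 14), (7, 7)]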
2.15 Subspace Tree
The subspace tree is an efficient hierarchical structure, described by a tree, that can deal with the negative effects of high dimensional data [93, 94].
The advances in hardware technology and the exponential production, storage and retrieval of digital content have been a challenge for computer scientists. Image databases, as sets of multimedia objects, can traditionally be accessed by file name or keyword, for example, or by their content, such as colour, texture, shape and others [95, 96]. Content-based image retrieval is a set of techniques for searching images in large databases, given a content query expressed as a weighted combination of features [94]. The similarity between an image and a content query is given by the distance between their feature vectors in the high dimensional space. These vectors can support efficient indexing methods [97, 96].
High dimensional data, plentiful today in the form of image databases and the like, can be identified when the number of features is larger than the number of samples [4]. However, dealing with a large number of dimensions can lead to query performance problems. When the number of dimensions grows, the performance tends to worsen, running into the "curse of dimensionality" problem [8].
The subspace tree can tackle this problem [97, 96]. By dividing a high dimensional space into a sequence of low dimensional subspaces, a subspace hierarchy is obtained. A distance function then measures the difference between corresponding multimedia objects in a space and a subspace. The process starts in the lowest dimension subspace and continues to the next higher dimension subspace. With this approach the "curse of dimensionality" problem does not arise, due to the mapping of the multimedia objects into a low dimensional space [98].
More formally, suppose a sequence of subspaces U_0 ⊃ U_1 ⊃ U_2 ⊃ ... ⊃ U_t with dim(U_0) > dim(U_1) > dim(U_2) > ... > dim(U_t). V is a vector space, with V = U_0. Here dim(U_r) is the dimension of the subspace U_r, represented graphically by the number of nodes in the tree. A family of projections by which multimedia objects are mapped to subspaces can be defined as a subspace sequence:

P_1 : U_0 \mapsto U_1; \quad P_2 : U_1 \mapsto U_2; \quad \ldots; \quad P_t : U_{t-1} \mapsto U_t.    (2.14)

If an orthogonal projection is applied, the subspaces obtained correspond to the multiresolution image representations of the image pyramid [88].
In order to obtain an efficient indexing structure, the ratio d between consecutive dimensions should satisfy d ≤ 16, with the relation between spaces defined as

\frac{\dim(U_0)}{\dim(U_1)} \leq d, \quad \frac{\dim(U_1)}{\dim(U_2)} \leq d, \quad \ldots, \quad \frac{\dim(U_{t-1})}{\dim(U_t)} \leq d.    (2.15)
The computational costs of this subspace method, given a query vector, can be determined as

costs = \sum_{i=1}^{t} \sigma_i \cdot \dim(U_{i-1}) + s \cdot \dim(U_t),    (2.16)

given the number of points σ_i below a given bound ε for the sequence of subspaces U_i and a dataset of size s. The costs tend to decrease until they reach a minimum value as the number of subspaces increases [8, 93, 94].
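For reference, Equation (2.16) translates directly into a short cost computation (the list layout of the arguments is our own convention):

def subspace_costs(sigmas, dims, s):
    # sigmas = [sigma_1, ..., sigma_t]; dims = [dim(U_0), ..., dim(U_t)]; s is the dataset size.
    t = len(dims) - 1
    return sum(sigmas[i] * dims[i] for i in range(t)) + s * dims[t]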
Chapter 3
Multiresolution Backpropagation
Learning
Multiresolution Backpropagation Learning is related to the LeCun Convolutional Neural Networks approach; the main difference is that no receptive fields are used [33, 99]. We propose a method that combines concepts from multiresolution image processing and from deep learning in order to obtain good generalization, avoiding the problem of overfitting. Multiresolution Backpropagation Learning can be described by three main components:
1. Generation of Gaussian pyramids from an initial pattern;
2. Artificial neural networks training on each resolution of the pattern;
3. Weights replication to initialize the following artificial neural network, from the lower to the higher
resolution of the pattern.
For the sake of clarity, we first describe each component of the proposed method individually before presenting its overall architecture.
3.1 Gaussian Pyramid Generation
The generation of the Gaussian pyramid is the first stage of Multiresolution Backpropagation Learning. Proposed by Burt and Adelson (1983), the pyramid is a multiresolution structure representing successive images that are filtered and scaled down. The base level contains the original image and is the starting point of the pyramid construction process [88].
Consider an image dataset D = \{(I_1, c_1), \ldots, (I_n, c_n) : n \in \mathbb{N}^+\}, where I is a two-dimensional image, also denoted I(x, y), and c is the associated class or label. The Gaussian pyramid is defined on the original image I as:

G_0(x, y) = I(x, y), \quad \text{for level } l = 0    (3.1)
and then an averaging process, carried out by a REDUCE function, produces the following pyramid levels:

G_l(x, y) = REDUCE(G_{l-1}(x, y)), \quad \text{otherwise.}    (3.2)

The REDUCE function involves the convolution of each image with a Gaussian filter G(x, y) (Equation (2.13)) and a downsampling operation by a factor of 2, resulting in the following level of the pyramid [90]. Thus, starting with an initial image G_0 of size N pixel columns × N pixel rows, the image G_1 of size N/2 × N/2 is created. Repeating REDUCE, an image G_2 of size N/4 × N/4 is obtained, resulting in a three-level pyramid structure.
The process described is applied to all images that compose the dataset D. The initial same-resolution images in D give rise to two new lower resolution image datasets. The dataset corresponding to the N/2 × N/2 images is represented by D' = \{(I'_1, c_1), \ldots, (I'_n, c_n) : n \in \mathbb{N}^+\} and the dataset of the N/4 × N/4 images is denoted by D'' = \{(I''_1, c_1), \ldots, (I''_n, c_n) : n \in \mathbb{N}^+\}.
3.2 Artificial Neural Networks
Inputs
Images are the input to the artificial neural networks (Figure 3.1). Each input image is represented as a two-dimensional grayscale array. Since we have three image datasets, we also need three separate networks. The lowest resolution images from dataset D'', level l = 2 of the Gaussian pyramid, are the input to the first artificial neural network (NN1); the medium resolution images from dataset D', level l = 1 of the pyramid, are the input to the second neural network (NN2); and the highest resolution images from dataset D, level l = 0 of the pyramid, are the input to the third neural network (NN3).
Training
We use feedforward networks with backpropagation-based training as the artificial neural network architecture, based on [33]. The remaining configurations are not based on any specific feedforward network architecture, but during the experimental phase we found other contributions relevant [100].
The three networks (Figure 3.1) have input layers with different numbers of units, depending only on the input image resolution. They also have one hidden layer each and an output layer with 10 units. The activation function used in the hidden layers is the typical hyperbolic tangent:

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},    (3.3)

with output values in the range (−1, 1), and in the output layers a softmax function:

\sigma(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{J} e^{x_j}} \quad \text{for } i = 1, \cdots, J,    (3.4)

giving output values between (0, 1). Then a loss function is applied, the cross-entropy, a measure of the dissimilarity between the true labels and the predicted labels. It is typically used in training when the models have softmax outputs [3].
Figure 3.1: Schematic representation of three images at different resolution as input to three feedforward networks with backpropagation-based training.
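The forward computation of each network can be summarized in a few NumPy lines; this is a simplified sketch of Equations (3.3) and (3.4) together with the cross-entropy loss, not the TensorFlow code used in the experiments:

import numpy as np

def forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)          # hidden layer, Eq. (3.3)
    z = W2 @ h + b2
    e = np.exp(z - z.max())           # subtract the maximum for numerical stability
    return e / e.sum()                # softmax over the 10 output units, Eq. (3.4)

def cross_entropy(p, t_onehot):
    # Dissimilarity between the predicted distribution p and the one-hot true label.
    return -np.sum(t_onehot * np.log(p + 1e-12))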
For NN1, the weights and biases are initialized randomly from a normal distribution with mean 0 and standard deviation 1. For the remaining artificial neural networks, we resort to weight replication, explained in section 3.3. The biases initialize the following network as they are.
The architectures of NN1, NN2 and NN3 are similar. The main difference occurs in the training and
in the number of input units in the input layer. We applied early stopping during the training phase of
each network, thus resulting in a different number of training epochs. After initialization, the results are
continually improved by training, from the lower to the higher resolution.
3.3 Weight replication
Since we have three artificial neural networks training images at different resolutions, some interconnection must be made between them in order to generate relevant results. This is where the replication of weights between networks takes place.
After the training of NN1, we have the resulting weights of the process, represented as a matrix of values. The NN1 weights then initialize NN2, and subsequently the NN2 weights initialize NN3.
In order to replicate the weights between networks, we resort to the Kronecker product of two matrices, denoted by:

A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix},    (3.5)

where A is an m × n matrix of weight values and B is a 2 × 2 matrix of ones.
This process is always repeated between consecutive artificial neural networks, from the lowest resolution to the following higher resolution. Figure 3.2 illustrates the weight replication along resolutions.
Figure 3.2: Hierarchy of weight replication.
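Using NumPy, the replication of Equation (3.5) is a single Kronecker product; the shapes below are illustrative and only show that every trained weight value is copied into a 2 × 2 block of the larger matrix:

import numpy as np

def replicate_weights(A):
    # Kronecker product with a 2 x 2 matrix of ones, doubling both dimensions of A (Eq. 3.5).
    return np.kron(A, np.ones((2, 2)))

A = np.random.randn(3, 4)             # stand-in for weights trained at the lower resolution
print(replicate_weights(A).shape)     # (6, 8)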
3.4 Multiresolution Backpropagation Learning Architecture
Having described the components of the proposed method, we can now formulate the overall architecture (Figure 3.3). We start with the convolution of each image of dataset D with a Gaussian filter and a downsampling by a factor of 2, originating a new dataset D'. The process is repeated in order to obtain the dataset D'', as stated in Algorithm 1. Performing these steps corresponds to the generation of the Gaussian pyramid.

Figure 3.3: Schematic representation of three images at different resolution as input to three feedforward networks with backpropagation-based training, showing the direction of the image resolution reduction and the direction of the training procedure.
Algorithm 1: Prepare the dataset.
1 foreach image G_0(x, y) = I_n ∈ D for level l = 0 do
2     G_1(x, y) = REDUCE(G_0(x, y));        // Apply REDUCE
3     Save image G_1(x, y) = I'_n;          // Build dataset D' for level l = 1
4     G_2(x, y) = REDUCE(G_1(x, y));
5     Save image G_2(x, y) = I''_n;         // Build dataset D'' for level l = 2
6 end
Each dataset represents a level in the pyramid. The training starts with level l = 2 as input to NN1 (Algorithm 2). After the training of the first network, we apply weight replication to initialize NN2, which is trained with level l = 1 as input. We repeat the same procedure between NN2 and NN3, the latter taking level l = 0 as input. Since level l = 0 is the base level of the pyramid, NN3 is the last network to be trained and dispenses with the weight replication component.
Algorithm 2: Training.
1 foreach level l = 2, l = 1, l = 0 do
2 if level l = 2 then
3 Initialize a feedforward neural network randomly;
4 Train with backpropagation;
5 Apply early stopping;
6 else
7 Initialize a feedforward neural network from the preceding resolution network;
8 Train with backpropagation;
9 Apply early stopping;
10 end
11 end
Chapter 4
Empirical Experiments
This chapter presents the experiments conducted using Multiresolution Backpropagation Learning (MrBL) and the MNIST image dataset (section 4.1). We describe the main steps performed during development until the final results were obtained.
We resorted to standard backpropagation-based training since it is a simple and efficient procedure, frequently used with feedforward networks, and it was used in the LeCun Convolutional Neural Networks [33, 28].
All the experiments were developed using the Python programming language, version 3.5, and the TensorFlow software library, version 1.2, a machine learning framework used to build neural network models [101, 102]. Another library used was NumPy, version 1.13, a package for scientific computing that supports multidimensional array operations [103].
4.1 Dataset
Experiments were carried out using the MNIST dataset [99]. The acronym stands for Modified National Institute of Standards and Technology, and it is a widely used dataset of handwritten digits suited for pattern recognition methods [42, 104]. It contains normalized grayscale images of size 28 × 28 pixels, split into a training set of 60,000 images and a test set of 10,000 images. Each image has a label from 0 to 9, representing the digit it depicts.
4.2 Preprocessing
MNIST is a dataset balanced across classes, already shuffled and with images of normalized size. Since the pixel values of each image vary in the range (0, 255), normalization to the range (0, 1) was carried out. This is a typical procedure in computer vision, since scaling the images makes their values more evenly distributed for training [3].
Additionally, the training and test labels were one-hot encoded. Instead of using the original label for the digit class, we used binary variables, where 0 means that the example does not belong to a class and 1 marks the class it belongs to. The one-hot encoding avoids imposing a spurious ordering on the classes when feeding data into the model.
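A minimal NumPy sketch of this preprocessing is shown below; the function name and the stand-in arrays are illustrative only.

```python
import numpy as np

def preprocess(images, labels, n_classes=10):
    """Scale pixel values from (0, 255) to (0, 1) and one-hot encode the labels."""
    images = images.astype(np.float32) / 255.0
    one_hot = np.eye(n_classes, dtype=np.float32)[labels]
    return images, one_hot

# Illustration with stand-ins for MNIST arrays.
x = np.random.randint(0, 256, size=(5, 28, 28))
y = np.array([3, 0, 7, 7, 1])
x_scaled, y_one_hot = preprocess(x, y)
print(float(x_scaled.max()) <= 1.0, y_one_hot.shape)   # True (5, 10)
```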
4.3 Performance measure
All the tested models were evaluated with respect to their performance. In classification tasks, it is usual to measure accuracy, which gives the proportion of correct outputs of a model.
The error rate is an equivalent measure of performance that gives the proportion of incorrect outputs of the model [3]. Most of the existing methods applied to MNIST report their results as an error rate, which may better reflect the behaviour of interest. Since accuracy and error rate are equivalent (the error rate is 1 minus the accuracy), the latter was selected as the performance measure.
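For concreteness, a short sketch of how the error rate can be computed from the network outputs and the one-hot labels (function and variable names are illustrative):

```python
import numpy as np

def error_rate(outputs, one_hot_labels):
    """Proportion of incorrectly classified examples, i.e. 1 - accuracy."""
    predictions = np.argmax(outputs, axis=1)
    targets = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predictions != targets))
```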
4.4 Preliminary Experiments
Before the development of the final architecture, we performed some preliminary experiments. The main purpose was to test whether the predictions about the generalization ability of the method were promising. The preliminary experiments were run on the CPU of a laptop with an Intel Core i3-2310M processor at 2.10 GHz and 8 GB of RAM.
We started by performing experiments with a different number of hidden units in each artificial neural network. We prepared the MNIST dataset in order to generate the three-level Gaussian pyramid (Algorithm 1). The standard deviation of the Gaussian filter was σ = 2 (Equation (2.13)).
The lowest resolution images, of size 7 × 7 pixels, were input to NN1, containing one hidden layer with 2 units. The medium resolution images, of size 14 × 14 pixels, were input to NN2 with 4 hidden units. The highest resolution images, corresponding to the original MNIST images, were input to NN3 with 8 hidden units. We followed Algorithm 2 and confirmed that the loss was reduced, but no early stopping was formally applied. The three networks were trained on the three complete training datasets for 200 epochs (iterations over the dataset) [3]. A learning rate of η = 0.3 was used for all artificial neural networks. The MNIST test set was used to assess the classification performance of the model.
To evaluate the model, we used the original MNIST images as input to a feedforward neural network with backpropagation-based training and random initialization. The other network settings were the same as those of the NN3 of the evaluated models.
The first results were not successful (Table 4.1). Comparing the output classification error of the proposed model, given by NN3, with that of the evaluation model, we can see that the percentage of incorrectly recognized test digits was better (lower) for the evaluation model. This preliminary test indicated that using an artificial neural network with fewer hidden units to initialize the following one with more hidden units may not work.
So, we modified our model. Instead of considering a different number of hidden units, we used the same number of hidden units, performing an experiment with one hidden layer of 8 units in each of the networks (Table 4.1). The results were better for the proposed model than for the evaluation model. Even though the classification error values were high, the results obtained were promising.
Model                              NN1    NN2    NN3    Evaluation
Different number of hidden units   89.9   88.6   89.7   87.8
Same number of hidden units        90.6   85.8   81.4   87.8

Table 4.1: Image classification error rate (%) of preliminary experiments on MNIST.
However, in order to achieve a model with good generalization, we need to scale it up and perform adjustments.
4.4.1 Preliminary CIFAR-10 Experiments
We performed some preliminary experiments on the CIFAR-10 dataset in order to test whether the proposed method performs well on different data [100]. The dataset consists of 60,000 colour images of size 32 × 32, representing 10 classes (e.g. airplane, cat, dog, among others). It is split into a training set of 50,000 images and a test set of 10,000 images.
Before applying the proposed method, we converted the dataset from colour to grey level to reduce the number of channels from three to one, simplifying the process (a conversion sketch follows Table 4.2). Briefly, we adjusted NN1, NN2 and NN3 to have one hidden layer with 10,000 units each and a learning rate of η = 0.01, and they were trained over 25, 90 and 30 epochs, respectively. We evaluated against a randomly initialized feedforward neural network with settings similar to NN3, trained over 30 epochs. We empirically verified that the proposed method worked on the CIFAR-10 dataset and improved the results, but it probably needs much more computing power to improve them further (Table 4.2). Accuracy is the performance measure typically used on this dataset; higher accuracy represents better results.
Model                         NN1    NN2    NN3    Evaluation
Same number of hidden units   27.2   29.2   28.6   26.7

Table 4.2: Image classification accuracy (%) of preliminary experiments on CIFAR-10.
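The thesis does not detail the exact colour-to-grey conversion used; the sketch below shows one common possibility based on luminance weighting, applied to a batch of CIFAR-10-shaped images (the weights and names are assumptions, not taken from the original implementation).

```python
import numpy as np

def to_grayscale(rgb_images):
    """Convert a batch of RGB images (N, H, W, 3) to grey level (N, H, W).

    Uses a common luminance weighting; a plain channel average would also
    reduce the three colour channels to one.
    """
    weights = np.array([0.299, 0.587, 0.114])      # assumed weighting
    return np.tensordot(rgb_images, weights, axes=([-1], [0]))

# Illustration with stand-ins for CIFAR-10 images.
batch = np.random.rand(4, 32, 32, 3)
print(to_grayscale(batch).shape)                   # (4, 32, 32)
```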
4.5 Experiments
The components and the overall architecture of Multiresolution Backpropagation Learning were described in Chapter 3. From the preliminary experiments to the final model, several experiments were carried out in order to obtain the best results. This section presents the preprocessing and experimental settings of MrBL as well as the most relevant results obtained. These experiments were performed on the CPU of a server with an Intel Xeon E5-1620 processor at 3.60 GHz and 64 GB of RAM.
4.5.1 MNIST Gaussian Pyramid
We used the MNIST dataset to generate the three-level Gaussian pyramid (Algorithm 1). Images from level l = 2, with size 7 × 7, formed an input vector of 7 × 7 × 1 = 49 values for NN1. Images from level l = 1 formed an input vector of 14 × 14 × 1 = 196 values for NN2. Images from level l = 0 formed an input vector of 28 × 28 × 1 = 784 values for NN3.
The Gaussian filtering process was performed and tested with different convolution matrix settings. We tested common values of the standard deviation to understand its influence on MrBL behaviour and to select the best parameter (Equation (2.13)). The values tested were σ = 1, σ = 2 and σ = 3, corresponding to the 68–95–99.7 statistical rule. Figures 4.1, 4.2 and 4.3 show their visual effect. The window size was set to 5 × 5, which produces an appropriate filtering and is computationally less costly [88]. A sketch of the corresponding convolution matrices is shown after the figures.
(a) Images from level l = 0.
(b) Images from level l = 1. (c) Images from level l = 2.
Figure 4.1: MNIST Gaussian Pyramid sample images with σ = 1.
(a) Images from level l = 1. (b) Images from level l = 2.
Figure 4.2: MNIST Gaussian Pyramid sample images with σ = 2 (two levels).
(a) Images from level l = 1. (b) Images from level l = 2.
Figure 4.3: MNIST Gaussian Pyramid sample images with σ = 3 (two levels).
Quality of Gaussian pyramid
After setting up the entire model, we tested different Gaussian filters to understand whether they affected the error rate. After testing the proposed model with σ = 1, σ = 2 and σ = 3, we found that the best results were obtained with σ = 1 (Figure 4.4). Since the centre of the convolution matrix holds the highest value of the Gaussian distribution, larger σ values produce a wider "bell curve" shape. Within the fixed window, higher values also produce "sharp edges", with undesired results.
Figure 4.4: Error rate (%) obtained using different standard deviation values with MrBL.
4.5.2 Networks Training Parameters
Several adjustments were made to the MrBL network parameters (Table 4.3). We selected random batches of 100 images at each iteration as input to each artificial neural network; more specifically, parameter updating was performed with mini-batch stochastic gradient descent. NN1 had 49 units in the input layer, NN2 had 196 input units and NN3 had 784, corresponding to the size of each input vector. Each network had one hidden layer with 9000 units and 10 units in the output layer (section 3.2).
We resorted to the softmax cross-entropy computation implemented by TensorFlow to obtain the model loss. The training process was carried out with a gradient descent optimizer, also from TensorFlow, with the learning rate set to η = 0.01. This optimizer relies on automatic differentiation to implement backpropagation [102].
Early stopping was applied to control the number of epochs during training. With this technique, we were not only interested in obtaining good performance on each individual artificial neural network by stopping at the point with the lowest test error; we were also interested in the overall generalization ability of the MrBL method. Thereby, NN1 was trained for 20 epochs, NN2 for 50 epochs and NN3 for 30 epochs.
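A minimal TensorFlow 1.x sketch of one such network with the settings above is given below. The graph construction and variable names are illustrative, not a reproduction of the original code; in the MrBL setting, the weights of NN2 and NN3 would be initialized from the replicated weights of the preceding network instead of tf.random_normal.

```python
import tensorflow as tf

# Sizes for NN1 (49 inputs); NN2 and NN3 would use 196 and 784 inputs.
n_in, n_hidden, n_out = 49, 9000, 10

x = tf.placeholder(tf.float32, [None, n_in])
y = tf.placeholder(tf.float32, [None, n_out])

# Random initialization with mean 0 and standard deviation 1.
w1 = tf.Variable(tf.random_normal([n_in, n_hidden], mean=0.0, stddev=1.0))
b1 = tf.Variable(tf.zeros([n_hidden]))
w2 = tf.Variable(tf.random_normal([n_hidden, n_out], mean=0.0, stddev=1.0))
b2 = tf.Variable(tf.zeros([n_out]))

hidden = tf.tanh(tf.matmul(x, w1) + b1)        # hyperbolic tangent activation
logits = tf.matmul(hidden, w2) + b2            # softmax is applied inside the loss

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
```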
We evaluated the model using the original MNIST dataset as input to a feedforward neural network with 784 input units, 9000 hidden units and 10 output units. We used backpropagation-based training over 30 epochs with random initialization (mean 0 and standard deviation 1). The remaining settings are equal to those of NN3. We refer to it as the "BL" model.
4.5.3 Results
To achieve efficient training, several optimizations were carried out (section 4.5.2). We tried to keep the model as simple as possible in order to demonstrate its performance.
We chose a random batch size between 1 and a few hundred to improve the training time and the convergence of the algorithm [105].
With stochastic gradient descent, the error surface landscape changes between image batches, probably with different local minima or saddle points [5]. This technique improved the training process relative to the preliminary experiments.

Parameter        NN1      NN2              NN3              BL
Input units      49       196              784              784
Hidden units     9000     9000             9000             9000
Output units     10       10               10               10
Learning rate    0.01     0.01             0.01             0.01
Epochs           20       50               30               30
Initialization   Random   From preceding   From preceding   Random

Table 4.3: Summary of MrBL and BL evaluation training parameters.
A fixed learning rate was the most suitable solution for MrBL. It helped to obtain both a proper convergence of NN1 and NN2 and an adequate time to reduce the NN3 loss.
The wider hidden layers used in MrBL helped to optimize the results and were inspired by [100]. Even though a single wider hidden layer may have a more flattened landscape, our multiresolution approach suggests less flattened landscapes with more local minima [10].
The hyperbolic tangent was chosen as activation function due to its better performance. It converged faster than the sigmoid, since it outputs values in the range (-1, 1) instead of (0, 1), avoiding gradient bias [28]. It also performed better than the ReLU, which in our experiments suffered from the vanishing gradient problem. The softmax in the output layer units is a typical choice for multi-class classification tasks [26].
The weights and biases were initialized randomly from a normal distribution (section 3.2) in order to maximize the generalization ability of MrBL. Setting the standard deviation to 1 avoided overfitting in NN3, even though regularization concerns suggest smaller values [3]. In order to preserve the information from the preceding, lower resolution landscape, we used the weight replication process (section 3.3) among the MrBL networks, stopping the training before reaching the bottom of each landscape. The weight replication followed the increase in image resolution to preserve information and to improve generalization. The number of epochs chosen was variable (section 4.5.2). NN1 obtained a better convergence by stopping the training earlier. NN2 presented more convergence and low overfitting, so we stopped its training later. We stopped the NN3 training when no relevant error rate improvement was obtained. Figure 4.5 shows the training convergence of MrBL and of the BL evaluation model. The variable number of epochs was scaled for better visualization. We can observe that MrBL converged faster than the BL evaluation model.
The larger gap between the training and test loss curves of the BL evaluation model indicates more overfitting than in MrBL (Figure 4.6). The fact that MrBL starts the training process from a lower loss value suggests that it does not reach the bottom of each landscape, instead using nearly optimal lower values along the hierarchy of landscapes.

Figure 4.5: Training set convergence properties of MrBL networks and BL evaluation, with scaled epochs.

Figure 4.6: Training and test set convergence properties of MrBL and BL evaluation.

We obtained a better result in the percentage of incorrectly recognized test digits, meaning that the output error rate of MrBL was on average lower than that of the BL evaluation. We performed three runs for each method and the mean and standard deviation of the values are presented in Table 4.4. Since the value intervals do not overlap, the results are statistically significant, which indicates that the MrBL method gives an advantage over the simple BL method.

                 MrBL
                 NN1           NN2           NN3           BL
Error rate (%)   7.32 ± 0.54   5.83 ± 0.20   8.24 ± 0.33   10.92 ± 0.14

Table 4.4: Image classification error rate of MrBL and BL evaluation on MNIST.

Figure 4.7 shows the test set convergence of both methods and the dispersion of the values during the training process. It shows a better generalization ability towards new data.

Figure 4.7: Test set convergence properties of MrBL and BL evaluation, with vertical bars representing the standard deviation.

We also verified that training BL for the same number of epochs as the sum of all epochs in MrBL did not show improvement (Table 4.5). We performed three runs and the results obtained were quite similar to those of BL over 30 epochs, but revealed a stronger tendency to overfit.

                 BL
Error rate (%)   11.07 ± 0.45

Table 4.5: Image classification error rate of BL evaluation on MNIST with 100-epoch training.
Chapter 5
Conclusions
The Multiresolution Backpropagation Learning method obtained better overall results than simple backpropagation-based training. The proposed method gives faster training and the possibility of overcoming local minima. By using a sequence of subspaces, represented by images at different resolutions, as input to feedforward networks with backpropagation-based training, we most probably managed to reach nearly optimal lower values along the hierarchy of landscapes.
In developing the aforementioned method, we did not intend to obtain the best results on MNIST, but to demonstrate that it works. It is a novel alternative method for regularization, avoiding overfitting and avoiding getting stuck in local minima.
5.1 Achievements
• We compared MrBL to the conventional BL and verified that it converged faster with less overfitting;
• We empirically verified that MrBL gives an advantage, by reaching a point near a global minimum and avoiding local minima;
• We observed that MrBL gives statistically significantly better results in the MNIST digit recognition task.
5.2 Future Work
In the future, the algorithm should be optimized so that it represents the error surface better while requiring less computational power. Additionally, a deeper exploration of the loss surface properties should be carried out. Another interesting direction would be to explore different methods complemented with a multiresolution approach, since it could bring advantages.
Bibliography
[1] X. Liao, A. V. Vasilakos, and Y. He. Small-world human brain networks: Perspectives and chal-
lenges. Neuroscience & Biobehavioral Reviews, 2017.
[2] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew. Deep learning for visual understand-
ing. Neurocomput., 2016.
[3] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[4] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and
Applications. Springer Berlin Heidelberg, 2011.
[5] S. Dube. High dimensional spaces, deep learning and adversarial examples. CoRR, 2018.
[6] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimiza-
tion problems. CoRR, 2015.
[7] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surface of multilayer
networks. CoRR, 2015.
[8] A. Wichert. Intelligent Big Multimedia Databases. World Scientific, 2015.
[9] H. W. Lin, M. Tegmark, and D. Rolnick. Why Does Deep and Cheap Learning Work So Well?
Journal of Statistical Physics, 2017.
[10] Y. Dauphin, R. Pascanu, Ç. Gülçehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking
the saddle point problem in high-dimensional non-convex optimization. CoRR, 2014.
[11] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The
bulletin of mathematical biophysics, 1943.
[12] M. T. Hagan, H. B. Demuth, and M. Beale. Neural Network Design. PWS Publishing Co., 1996.
[13] F. Rosenblatt. The Perceptron, a Perceiving and Recognizing Automaton Project Para. Cornell
Aeronautical Laboratory, 1957.
[14] R. Rojas. Neural Networks: A Systematic Introduction. Springer-Verlag, 1996.
[15] B. Kröse and P. van der Smagt. An introduction to Neural Networks. The University of Amsterdam,
8th edition, 1996.
[16] S. Haykin and S. Haykin. Neural Networks and Learning Machines. Prentice Hall, 2009.
[17] F. Agostinelli, M. D. Hoffman, P. J. Sadowski, and P. Baldi. Learning activation functions to improve
deep neural networks. CoRR, 2014.
[18] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Inc., 1995.
[19] S. Ruder. An overview of gradient descent optimization algorithms. CoRR, 2016.
[20] B. Widrow and M. E. Hoff. Adaptive switching circuits. 1960 IRE WESCON Convention Record,
1960.
[21] C. De Sa, K. Olukotun, and C. Ré. Global convergence of stochastic gradient descent for some
non-convex matrix problems. arXiv preprint arXiv:1411.1134, 2014.
[22] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating Stochastic Gradient
Descent. ArXiv e-prints, 2017.
[23] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for
deep learning. In Proceedings of the 28th International Conference on International Conference
on Machine Learning, 2011.
[24] S.-Y. Zhao and W.-J. Li. Fast asynchronous parallel stochastic gradient descent: A lock-free
approach with convergence guarantee. In AAAI, 2016.
[25] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew. Deep learning for visual understand-
ing: A review. Neurocomputing, 2016.
[26] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics).
Springer-Verlag, 2006.
[27] Y. Lecun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
[28] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of
the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, 1998.
[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Neurocomputing: Foundations of research.
chapter Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, Cam-
bridge, MA, USA, 1988. ISBN 0-262-01097-6. URL http://dl.acm.org/citation.cfm?id=
65669.104451.
[30] A. Prieto, B. Prieto, E. M. Ortigosa, E. Ros, F. Pelayo, J. Ortega, and I. Rojas. Neural networks:
An overview of early research, current frameworks and new challenges. Neurocomputing, pages
242–268, 2016.
[31] G. E. Dahl, T. N. Sainath, and G. E. Hinton. Improving deep neural networks for lvcsr using
rectified linear units and dropout. 2013 IEEE International Conference on Acoustics, Speech and
Signal Processing, 2013.
[32] N. Qian. On the momentum term in gradient descent learning algorithms. Neural Netw., 1999.
[33] Y. Lecun. Generalization and network design strategies. Elsevier, 1989.
[34] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern
recognition unaffected by shift in position. Biological Cybernetics, 1980.
[35] W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective receptive field in deep
convolutional neural networks. In Advances in Neural Information Processing Systems 29. Curran
Associates, Inc., 2016.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional
neural networks. In Advances in Neural Information Processing Systems 25. Curran Associates,
Inc., 2012.
[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabi-
novich, et al. Going deeper with convolutions. In CVPR, 2015.
[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recogni-
tion. arXiv preprint arXiv:1409.1556, 2014.
[39] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional
networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[40] S. Wu, S. Zhong, and Y. Liu. Deep residual learning for image steganalysis. Multimedia Tools and
Applications, 2017.
[41] N. A. Hamid, N. M. Nawi, R. Ghazali, and M. N. M. Salleh. Solving local minima problem in back
propagation algorithm using adaptive gain, adaptive momentum and adaptive learning rate on
classification problems. In International Journal of Modern Physics: Conference Series, 2012.
[42] J. Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 2015.
[43] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks.
In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics,
2010.
[44] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the
Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
[45] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In
Proceedings of the 30th International Conference on International Conference on Machine Learn-
ing - Volume 28, 2013.
[46] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classi-
fication. In Proceedings of the 25th ieee conference on computer vision and pattern recognition
(cvpr 2012), 2012.
[47] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for
visual recognition. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision –
ECCV 2014, 2014.
[48] E. A. Smirnov, D. M. Timoshenko, and S. N. Andrianov. Comparison of regularization methods for
imagenet classification with deep convolutional neural networks. Aasri Procedia, 2014.
[49] Y. Bengio et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning,
2009.
[50] M. Mahsereci, L. Balles, C. Lassner, and P. Hennig. Early stopping without a validation set. CoRR,
2017.
[51] L. Breiman. Bagging predictors. Mach. Learn., 1996.
[52] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural
networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[53] W. Sun and F. Su. Regularization of deep neural networks using a novel companion objective
function. In Image Processing (ICIP), 2015 IEEE International Conference on, 2015.
[54] P. Baldi and P. J. Sadowski. Understanding dropout. In Advances in Neural Information Processing
Systems 26. Curran Associates, Inc., 2013.
[55] J. Ba and B. Frey. Adaptive dropout for training deep neural networks. In Advances in Neural
Information Processing Systems 26. Curran Associates, Inc., 2013.
[56] S. Wang and C. Manning. Fast dropout training. In Proceedings of the 30th International Confer-
ence on Machine Learning, 2013.
[57] D. A. McAllester. A pac-bayesian tutorial with A dropout bound. CoRR, 2013.
[58] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple
way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
[59] S. Wager, S. Wang, and P. S. Liang. Dropout training as adaptive regularization. In Advances in
Neural Information Processing