Multiresolution Backpropagation Learning
Ricardo Jorge Ferreira Ponciano
Thesis to obtain the Master of Science Degree in
Information Systems and Computer Engineering
Supervisor: Prof. Andreas Miroslaus Wichert
Examination Committee
Chairperson: Prof. Mário Jorge Costa Gaspar da Silva
Supervisor: Prof. Andreas Miroslaus Wichert
Member of the Committee: Prof. João Carlos Serrenho Dias Pereira
June 2018
Acknowledgments
First I would like to thank my thesis advisor, Prof. Andreas Wichert, for providing all the necessary support in a gradual and timely manner. I want to express my gratitude for his sharing of knowledge about the deep learning field. It was a very positive experience and I found in him not just a thesis advisor, but also a true mentor.
I also want to thank my family for all the support during the course of this work.
Resumo
Training high dimensional data requires minimizing complicated error surfaces. We propose a multiresolution approach with incremental backpropagation-based training to improve generalization. The Gaussian pyramid, generated from an initial pattern of images, is the input to feedforward neural networks that learn from low to high resolution. After the initial training, the preceding values initialize the following neural network. We applied this method to the MNIST dataset for pattern recognition. Multiresolution Backpropagation Learning generalized better than simple backpropagation-based training, with faster convergence. We verified empirically that we can get close to a global minimum, avoiding local minima.
Keywords: Multiresolution, Neural networks, Generalization, Local minima, Deep learning
Abstract
High dimensional data training requires minimizing complicated error surfaces. We propose a multiresolution approach with incremental backpropagation-based training to improve generalization. A Gaussian pyramid, generated from an initial pattern of images, is the input to feedforward neural networks that learn from lower to higher resolution. After the initial training, the preceding values initialize the following neural network. We applied this method to the MNIST dataset for pattern recognition. Multiresolution Backpropagation Learning generalized better than simple backpropagation-based training, with faster convergence. We verify empirically that we can reach near a global minimum, avoiding local minima.
Keywords: Multiresolution, Neural networks, Generalization, Local minima, Deep learning
Contents
Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
1 Introduction
1.1 Motivation
1.2 Objectives
1.3 Thesis Outline
2 Background
2.1 Artificial neuron
2.2 Perceptron
2.3 Activation Functions
2.4 Gradient descent
2.5 Stochastic Gradient Descent
2.6 Stochastic Gradient Descent Optimizations
2.7 Deep Learning
2.8 Deep Feedforward Networks
2.9 Backpropagation
2.10 LeCun Convolutional Neural Networks
2.11 Deep Learning Problems
2.12 Overfitting
2.13 Regularization
2.14 Multiresolution Processing
2.15 Subspace Tree
3 Multiresolution Backpropagation Learning
3.1 Gaussian Pyramid Generation
3.2 Artificial Neural Networks
3.3 Weight replication
3.4 Multiresolution Backpropagation Learning Architecture
4 Empirical Experiments
4.1 Dataset
4.2 Preprocessing
4.3 Performance measure
4.4 Preliminary Experiments
4.4.1 Preliminary CIFAR-10 Experiments
4.5 Experiments
4.5.1 MNIST Gaussian Pyramid
4.5.2 Networks Training Parameters
4.5.3 Results
5 Conclusions
5.1 Achievements
5.2 Future Work
Bibliography
List of Tables
4.1 Image classification error rate (%) of preliminary experiments on MNIST.
4.2 Image classification accuracy (%) of preliminary experiments on CIFAR-10.
4.3 Summary of MrBL and BL evaluation training parameters.
4.4 Image classification error rate of MrBL and BL evaluation on MNIST.
4.5 Image classification error rate of BL evaluation on MNIST with 100 epoch training.
List of Figures
2.1 A simplified mathematical model of McCulloch and Pitts artificial neuron.
2.2 Graphical representation of activation functions.
2.3 Surface generated by the training error of a linear neuron with two input weights. Also known as the error surface.
2.4 Illustration of a saddle point example [10].
2.5 A simplified model of an artificial neural network.
2.6 Convolution step [25].
2.7 Generic convolutional neural network architecture. The convolution step occurs in convolutional layers, max pooling occurs in pooling layers and the fully connected layers correspond to multilayer artificial neural networks [25].
2.8 Illustration of a local minimum example, given by the red dot.
2.9 RGB raster image example showing individual pixels as squares and colour components as values.
2.10 Schematic representation of a Gaussian pyramid with five levels.
3.1 Schematic representation of three images at different resolution as input to three feedforward networks with backpropagation-based training.
3.2 Hierarchy of weight replication.
3.3 Schematic representation of three images at different resolution as input to three feedforward networks with backpropagation-based training, showing the direction of the image resolution reduction and the direction of the training procedure.
4.1 MNIST Gaussian Pyramid sample images with σ = 1.
4.2 MNIST Gaussian Pyramid sample images with σ = 2 (two levels).
4.3 MNIST Gaussian Pyramid sample images with σ = 3 (two levels).
4.4 Error rate (%) obtained using different standard deviation values with MrBL.
4.5 Training set convergence properties of MrBL networks and BL evaluation, with scaled epochs.
4.6 Training and test set convergence properties of MrBL and BL evaluation.
4.7 Test set convergence properties of MrBL and BL evaluation, with vertical bars representing the standard deviation.
Chapter 1
Introduction
The human brain is a complex system, with about 86 billion neurons interacting through approximately 150 trillion synapses, allowing humans to perform an enormous diversity of tasks; these capabilities are desirable in artificial models [1]. Studied by neuroscience, the brain's structure and operation inspired the development of artificial neural networks, intended to simulate the brain's learning capacity. To this day, it is the best known learning device.
Deep learning relies on specific architectures of artificial neural networks that learn from datasets [2]. Deep learning methods are used in many domains, especially in pattern recognition tasks such as image or speech recognition. Their recent popularity has several causes, mainly the availability of larger amounts of data and the increase in computational power.
One of the main challenges with learning algorithms is the difficulty of obtaining a model that generalizes properly to new data [3]. Nowadays, the abundance of high dimensional data, such as image databases, spans fields like information technology, astronomy and bioinformatics, and it can lead to performance problems [4]. Deep learning offers the possibility to overcome such limitations; however, its architectures are challenging to optimize and are not yet fully understood [5, 6].
Empirical studies have tried to explain certain phenomena that occur during training, focusing on specific types of critical points in the shape of the loss function [7, 5]. When training involves minimizing the loss or error function with gradient descent based algorithms, the derivative of that function may equal zero, meaning that no information is given about the direction in which to move. Such points are called critical points. There are three types of critical points: local maxima, local minima and saddle points. The latter two are more concerning, since they typically occur during optimization and can lead to non-optimal solutions [3].
We propose an optimization method to overcome the problem of local minima using a multiresolution approach and feedforward neural networks with backpropagation-based training. By obtaining a sequence of low dimensional subspaces and training incrementally, we verify empirically that we can reach near a global minimum, avoiding local minima [8]. This approach tries to deal with the undesired effects of high dimensional data, leading to better generalization and avoiding overfitting. It gives an alternative way to explore deep learning for pattern recognition tasks.
1.1 Motivation
Deep learning works very well [9]. An artificial neural network with several hidden layers, optimized with the stochastic gradient descent algorithm, tends not to get trapped in a local minimum [10]. It has been shown empirically that having many layers results in more saddle points than local minima, and that the high dimensional error surface in such situations becomes a more flattened landscape [7, 5]. However, studies demonstrated that local minima tend to occur towards the bottom of the landscape and are expected to be found mostly near the end of training [5].
Motivated by these theories, we developed a novel regularization method that resorts to a hierarchy of different and gradually less complicated landscapes to improve generalization. We created a sequence of subspaces represented by images at different resolutions and trained feedforward networks with stochastic gradient descent computed with backpropagation, from lower to higher resolution.
This way, we do not start with such a complicated landscape, and we use the preceding network to initialize the following one without reaching the bottom of each landscape, where more local minima lie. Instead of using many hidden layers in the feedforward networks we used one, because many hidden layers result in more saddle points than local minima.
1.2 Objectives
The main goals of this thesis are to compare the Multiresolution Backpropagation Learning method with the conventional backpropagation procedure and to verify empirically whether the proposed method offers an advantage over traditional backpropagation, using feedforward neural networks.
1.3 Thesis Outline
This work is divided into five chapters. After the present chapter, chapter 2 introduces the basic concepts of deep learning, also covering their major problems and solutions. Then we give an overview of multiresolution processing and the subspace tree, paving the way for the experiments carried out later.
Chapter 3 describes the components of the Multiresolution Backpropagation Learning method, showing how it was done and focusing on the description of the algorithm behind it.
Chapter 4 presents the empirical experiments conducted using the proposed method. We start with the description of the dataset and the preprocessing, then describe the preliminary experiments made before the main experiments. In the main experiments we explain the method's training process in more detail and then present the results obtained.
Chapter 5 presents the conclusions of the work done, the main achievements and ideas for future studies.
Chapter 2
Background
2.1 Artificial neuron
The search for artificial models that mimic the human brain started around 1940, when the first electronic computers were developed.
The linear model introduced by McCulloch and Pitts in 1943 was perhaps the first artificial neuron,
presenting important features which can be found in many artificial neural networks [11].
The basic components of an artificial neuron are shown in Figure 2.1: the weighted sum of the input signals is compared to a threshold (activation function) to determine the output [12]. A neural network is no more than a collection of such nodes or units (Figure 2.1) connected by links.
Figure 2.1: A simplified mathematical model of McCulloch and Pitts artificial neuron.
2.2 Perceptron
Proposed by Frank Rosenblatt in 1957, the perceptron was built around the McCulloch and Pitts model
of a neuron to solve pattern recognition problems [13]. The author introduced a learning rule for training
perceptron networks that converges to the correct network weights [12].
Consider a simple perceptron, also known as a one-layer feedforward network, with a set of n inputs and one output. Each input contributes the product of a weight value (w_i) and the corresponding input vector value (x_i), and the neuron's input is given by their summation (net). The bias term (w_0) is an extra weight constant [14, 15]:

y_k = \phi(net) = \phi\left(\sum_{i=1}^{n} w_i \cdot x_i + w_0\right)    (2.1)
Obtaining the perceptron's output requires the application of the activation function \phi(), which can be either linear or nonlinear [15]. In this case \phi() is a hard threshold, represented by the nonlinear sign function:

\phi(net) := \mathrm{sgn}(net) = \begin{cases} 1 & \text{if } net \geq 0 \\ -1 & \text{otherwise.} \end{cases}    (2.2)
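As a concrete illustration, Equations (2.1) and (2.2) reduce to a few lines of NumPy; the AND weights and variable names below are purely illustrative and not part of the original formulation:

import numpy as np

def perceptron_output(x, w, w0):
    # Weighted sum of the inputs plus the bias term (Eq. 2.1),
    # passed through the hard-threshold sign function (Eq. 2.2).
    net = np.dot(w, x) + w0
    return 1 if net >= 0 else -1

# Illustrative weights implementing the boolean AND over inputs coded as +1 / -1.
w, w0 = np.array([1.0, 1.0]), -1.5
print(perceptron_output(np.array([1, 1]), w, w0))    # 1
print(perceptron_output(np.array([1, -1]), w, w0))   # -1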
The XOR Problem
A simple perceptron can only deal with linearly separable inputs, a limitation identified long ago by Minsky and Papert, 1969 [3]. A one-layer perceptron represents a hyperplane in n-dimensional space that divides linearly separable inputs, and for this reason it can only deal with the boolean functions AND, OR, NAND and NOR. For XOR, no hyperplane exists that can classify the input patterns under these conditions [16].
2.3 Activation Functions
Whether in a simple perceptron or in the units of a neural network, some kind of activation function is usually applied [15]. The design of activation functions for training the neural networks used in deep learning is currently an area of active research [17]. It is desirable that they present some general characteristics, such as being nonlinear, continuous and differentiable, in order to facilitate the application of gradient-based methods [14].
There are several types of activation functions that can be used, such as the sigmoid or logistic function (Figure 2.2(a)), the hyperbolic tangent (Tanh), the hard Tanh, the rectified linear unit (ReLU) (Figure 2.2(b)) and its variants, the SoftPlus, the Softmax, the Maxout or the Radial basis function [3, 14].
Figure 2.2: Graphical representation of activation functions: (a) Sigmoid; (b) ReLU.
2.4 Gradient descent
Since a simple perceptron can fail to classify inputs that are not linearly separable, optimization is required. Gradient descent, also called steepest descent, is an optimization algorithm that gradually changes vectors to find a local minimum of a function [18]. For a better understanding, consider an unthresholded perceptron or linear unit:

o = \sum_{i=0}^{n} w_i \cdot x_i = net.    (2.3)
Now we need to define the training error or loss function E(w) for N training examples, based on the target output values (t_k) minus the output values of the linear unit (o_k). This function determines the shape of the surface that will try to fit the training samples [15] (Figure 2.3):

E(w) = \frac{1}{2} \cdot \sum_{k=1}^{N} (t_k - o_k)^2.    (2.4)
Figure 2.3: Surface generated by the training error of a linear neuron with two input weights. Also known as the error surface.
Here w refers to the weight vector w = (w_0, w_1) on which o depends. In order to know the direction of steepest descent at each point of the training error function, we compute the derivative of E with respect to each component of the weight vector and move in the downward sloping direction:

\Delta w_i = -\eta \cdot \frac{\partial E}{\partial w_i}.    (2.5)
The constant η is the learning rate, indicating the step size of the gradient descent algorithm. With these considerations, we can obtain a more practical algorithm by differentiating equation (2.4) and substituting the result into equation (2.5), which yields the weight update rule for gradient descent:

\Delta w_i = \eta \cdot \sum_{k=1}^{N} (t_k - o_k) \cdot x_{k,i}.    (2.6)
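A minimal NumPy sketch of the weight update rule (2.6), assuming the training inputs are stacked in a matrix X with a leading column of ones for the bias weight w_0 (the function name and shapes are illustrative, not part of the original text):

import numpy as np

def gradient_descent_epoch(X, t, w, eta):
    # X: (N, n+1) inputs with a leading column of ones, t: (N,) targets, w: (n+1,) weights.
    o = X @ w                        # outputs of the linear unit for the whole training set
    return w + eta * X.T @ (t - o)   # accumulate (t_k - o_k) * x_{k,i} over all N samples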
2.5 Stochastic Gradient Descent
Stochastic gradient descent, also known as LMS (least mean-square) or the delta rule, is an extension of the previous algorithm in which the weight updates occur for each training sample instead of the complete training set N:

\Delta w_i = \eta \cdot (t_k - o_k) \cdot x_{k,i}.    (2.7)
By performing the weight updates iterating over one training sample at a time, we can evaluate the gradient in a computationally more efficient and faster way than with the gradient descent rule (equation (2.6)). Still, stochastic gradient descent performs updates with higher variance, meaning that it can skip some local minima when there are several [19].
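The per-sample counterpart of the previous sketch, following Equation (2.7), could look as follows (again an illustrative sketch rather than the implementation used later in the thesis):

def sgd_update(x_k, t_k, w, eta):
    # Delta-rule step: adjust the weights immediately after seeing a single sample.
    o_k = x_k @ w
    return w + eta * (t_k - o_k) * x_k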
This rule was applied in 1960 with Adaline, an "adaptive pattern classification machine", and has since been used widely by other authors [20, 3].
2.6 Stochastic Gradient Descent Optimizations
Despite the high popularity of gradient descent techniques, especially stochastic gradient descent, there are some disadvantages associated with them.
Choosing a proper learning rate (η) is a challenge, since this task is often carried out by trial and error. If η is too small, many iterations are needed to get near the best values, making the process slow. On the other hand, if η is too big, the algorithm can skip or diverge from the optimal solution due to large oscillations of the function [19, 3].
Another main disadvantage is the possibility that the function gets trapped in a local minimum, skipping the desired value [19]. In this case, the importance of saddle points should be taken into consideration: a saddle point is a stationary point (Figure 2.4) that can be mistaken for a local minimum, also making the algorithm produce wrong results [10]. Several improvements to stochastic gradient descent have been proposed, with methods such as Momentum, Nesterov Momentum, Adagrad, Adadelta, RMSProp, Adam and others [21, 22, 23, 24].

Figure 2.4: Illustration of a saddle point example [10].
2.7 Deep Learning
Since a perceptron can only classify linearly separable inputs, and knowing that stochastic gradient descent based on a simple perceptron (thresholded or unthresholded) does not exploit the full potential of this algorithm, other approaches resort to more units and more layers [3, 16]. The usage of many layers, hidden layers included, defines the artificial neural networks (Figure 2.5) used in deep learning, able to build complex concepts from simpler ones. There are several different deep learning methods and architectures [25]. This overview will focus on deep feedforward networks, the backpropagation procedure and LeCun convolutional neural networks.
Figure 2.5: A simplified model of an artificial neural network.
2.8 Deep Feedforward Networks
The introduction of deep feedforward networks, also known as multilayer perceptrons [3], allowed the definition of the first artificial neural networks in which the main limitation of the simple perceptron can be overcome. This kind of artificial neural network is a layered structure through which information flows, starting in the input units layer, then through the hidden units layers, until it reaches the output units layer, which produces the network outputs. Each unit in a layer is connected to every unit in the following layer, meaning that the network is fully connected [16]. Concerning hidden layers, their output is not visible and error correction is only done indirectly [3].
We can use simple perceptrons as the units of a deep feedforward network to classify non linearly separable inputs, solving the XOR problem, but we then remain limited only to linear functions. Alternatively, we can use continuous and differentiable nonlinear activation functions, which allow a better understanding of the interaction between input variables. So, in a deep feedforward network, units typically have a continuous nonlinear activation function, which can be chosen according to the dataset's nature [17, 26, 27, 28].
2.9 Backpropagation
Applied in the 1980s by Rumelhart, Hinton and Williams and by other groups of authors, backpropagation is a procedure to adjust the weights of a function and is commonly used as the learning method in multilayer artificial neural networks [3, 29, 30, 27]. It is an iterative procedure with two main phases, a propagate-inputs-forward phase and a propagate-errors-backward phase, which for simplicity will be designated phase one and phase two, respectively.
In order to understand the backpropagation procedure, we first need to consider some fundamentals behind it, starting with phase one. As referred to in the deep feedforward networks section, the network units typically have a nonlinear activation function, and it is precisely this type of activation that will be used here. Among the different types of nonlinear activation function (σ()), the sigmoid function will be considered due to its popularity, although others are also used [14]:

\sigma(x) = \frac{1}{1 + e^{-\alpha \cdot x}}    (2.8)
The term α is a positive constant that indicates the steepness of the function. The activation function
is applied to each unit of the network, giving the generic output:
o_k = \sigma\left(\sum_{i=0}^{n} w_i \cdot x_{k,i}\right)    (2.9)
Next, based on equation (2.4) and considering that we now have multiple units, the training error must be redefined with respect to all weights of the network, with N training patterns and l outputs:

E(w) = \frac{1}{2} \cdot \sum_{k=1}^{N} \sum_{i=1}^{l} (t_{ki} - o_{ki})^2    (2.10)
Having obtained E(w), we know at this point the resulting error between the desired and the computed values at the output units. This training error should be as low as possible, and to achieve that we need its derivative, using gradient descent for the output units and the chain rule of calculus for the hidden units. Thus begins phase two of the backpropagation procedure. The error is propagated backward from the output units to the hidden units and the network updates its weights [15]. Generically, the rule to update the hidden units' weights, determined from the inputs to the unit (x_{k,j}) and the error term (δ_{ki}) from the unit outputs, is given by:

\Delta w_{ij} = \eta \sum_{k=1}^{N} \delta_{ki} \cdot x_{k,j}.    (2.11)
Both phases of the algorithm typically run several times, one after the other, and the stopping condi-
tion of this iterative process can be specified, for example, through a predetermined number of iterations
or through a predetermined threshold applied to the error.
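To make the two phases concrete, the following is a minimal NumPy sketch of one backpropagation step for a single pattern in a one-hidden-layer network; it assumes the sigmoid of Equation (2.8) with α = 1, squared error and no bias terms, which are simplifications on our part:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W1, W2, eta):
    # Phase one: propagate the input forward through the hidden and output layers.
    h = sigmoid(W1 @ x)                            # hidden activations
    o = sigmoid(W2 @ h)                            # output activations
    # Phase two: propagate the error backward and update the weights (Eq. 2.11, one sample).
    delta_out = (t - o) * o * (1 - o)              # error terms at the output units
    delta_hid = (W2.T @ delta_out) * h * (1 - h)   # error terms at the hidden units (chain rule)
    W2 += eta * np.outer(delta_out, h)
    W1 += eta * np.outer(delta_hid, x)
    return W1, W2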
Momentum
The momentum method can be added to backpropagation in order to accelerate the learning process by changing the weight update [31]. With this method, at each iteration the weight update depends on the weight update of the previous iteration [32]. It is based on equation (2.11), but with a momentum term added to speed up convergence and reduce oscillation:

\Delta w_{ij}(n) = \eta \sum_{k=1}^{N} \delta_{ki}(n) \cdot x_{k,j}(n) + p \, \Delta w_{ij}(n-1),    (2.12)

where n is the iteration number and 0 ≤ p < 1 is the momentum parameter. For example, in the presence of a steeply sloped valley on the error surface of the neural network, the system oscillates almost horizontally backwards and forwards, causing a very slow descent along that valley; adding a momentum term can help to overcome this situation [32]. Also, by using the momentum method with a high learning rate (η), the large oscillations that the learning rate could cause are reduced, so a minimum is reached more quickly. Without the momentum method, that minimum might not even be reached [15].
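In code, the momentum term of Equation (2.12) only requires remembering the previous weight change; a small sketch, where the value p = 0.9 is a common choice rather than one taken from this thesis:

def momentum_step(W, grad, prev_delta, eta, p=0.9):
    # grad holds the summed delta * input terms of Eq. (2.11) for the current iteration n;
    # prev_delta is the weight change applied at iteration n - 1.
    delta = eta * grad + p * prev_delta
    return W + delta, delta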
2.10 LeCun Convolutional Neural Networks
Convolutional neural networks are similar to the artificial neural networks already described above, but present a different architecture. Yann LeCun pioneered the first convolutional neural networks, having begun his investigations before 1989 and culminating in the LeNet5 architecture, a convolutional neural network with several layers representing different operations such as convolution, pooling, nonlinearity and classification [33]. Its development was inspired by the neocognitron of Fukushima [34]. A convolutional neural network works well in pattern recognition tasks, namely image recognition, using a "grid-like topology" [3]. Generically, it contains one or several convolutional layers followed by a multilayer artificial neural network. Next, the main operations of a convolutional neural network are described.
Convolution Step
In the convolution step (Figure 2.6), we want to extract features from an input, typically an image. Since an image is basically a matrix of pixels, another sliding matrix can be applied over that input. This matrix is called a filter, kernel or feature detector; as it moves over the input, it multiplies the overlapping elements and adds up the results. The resulting matrix is called a feature map or convolved feature. If the values of the kernel matrix are changed, different characteristics of the input image can be detected (e.g. curves) [25]. All these multidimensional matrices are also called tensors [3].
It is also worth noting that each unit of a convolutional neural network layer depends only on one region of the input, designated its receptive field [35].
Figure 2.6: Convolution step [25].
The nonlinear activation function ReLU is usually applied in order to replace the negative values of the feature map with zero. This function allows the artificial neural network to learn faster [36].
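A plain NumPy sketch of the convolution step followed by the ReLU, as described above (the 3 × 3 vertical-edge kernel is only an example; real networks learn their kernel values):

import numpy as np

def convolve2d_valid(image, kernel):
    # Slide the kernel over the image, multiply the overlapping elements and sum them up.
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(28, 28)
kernel = np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]])   # detects vertical edges
feature_map = np.maximum(convolve2d_valid(image, kernel), 0.0)     # ReLU: negatives become zero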
Pooling
The pooling, sub-sampling or down-sampling operation is intended to reduce the dimensions of the feature map, retaining the most relevant information by applying a function such as Max, Sum or Average, among others. These functions define the type of pooling strategy, with average pooling and max pooling being the most used [25]. This operation also uses a sliding window (filter) with a predefined dimension, which advances in a predefined step (stride). By reducing the size of the input representation, the pooling operation helps to control overfitting.
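Max pooling with a 2 × 2 window and stride 2, the most common setting, can be sketched as follows (window size and stride are illustrative parameters):

import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Keep only the maximum of each size x size window, moving the window by `stride`.
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size].max()
    return out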
Fully Connected Layer
After the convolution, pooling and nonlinearity comes a fully connected layer, completing the basic elements of a generic convolutional neural network architecture (Figure 2.7). It is no more than a multilayer artificial neural network placed after the last pooling layer, with the latter acting as input to the former. The multilayer artificial neural network then performs a typical classification task. There are several methods based on convolutional neural networks, like AlexNet, GoogLeNet, VGGNet or DenseNet [36, 37, 38, 39].
2.11 Deep Learning Problems
The usage of deep neural network architectures on different datasets has shown very good results, but despite these achievements, using more layers does not always produce a better learning artificial network [40]. The following subsections cover the main problems related to typical deep neural networks: local minima, vanishing gradients and overfitting.
Figure 2.7: Generic convolutional neural network architecture. The convolution step occurs in convolutional layers, max pooling occurs in pooling layers and the fully connected layers correspond to multilayer artificial neural networks [25].
Local Minima
Since the backpropagation procedure uses gradient descent to reduce the training error, the algorithm may get trapped in one of the several local minima of a neural network error surface (Figure 2.8) [7, 41].
Figure 2.8: Illustration of a local minimum example, given by the red dot.
This means that backpropagation has difficulty in finding the lowest training error value and therefore does not converge to the global minimum. It is assumed that an artificial neural network with several hidden layers is less likely to be stuck in a local minimum and that it is easier to find the right parameters, as demonstrated by empirical experiments [27].
Vanishing Gradients
The vanishing gradients problem is a phenomenon that may occur when training deep neural networks, where the backpropagated error decreases rapidly, tending to zero as it approaches the input layer [42]. When classical activation functions are used, like the sigmoid or the hyperbolic tangent with their finite activation ranges, (0, 1) and (−1, 1) respectively, the error output is limited. So, the error is backpropagated over the hidden layers with increasingly smaller values, meaning that weight updates become more and more residual. Some solutions have been proposed to deal with the vanishing gradients problem, but the recommended one is the usage of the rectified linear unit (ReLU), an activation function defined as f(x) = max(0, x) [42, 3, 43, 44]. The opposite phenomenon can also occur, with backpropagated errors suffering a large increase, designated exploding gradients [42, 45].
2.12 Overfitting
Overfitting is one of the main challenges in machine learning: a learning algorithm (model) performs well on the training data but poorly on new data. In order to obtain a correct description of the data, we estimate the minimum training error [8]. During this process, the model adapts very well to the training data, which usually contains noise. Memorization occurs, instead of a smoother and more generalized adaptation [15].
If the learning algorithm is tightly fitted to the training data, it will act more poorly on previously unseen data, like a test set, used to assess classification performance, or a validation set, used for parameter tuning. In the end, we want to find a model in which the difference between the training error and the test error is as small as possible.
The ability to classify new unseen inputs defines the model performance, also known as generalization [3]. Underfitting may also happen, due to a poor performance of the model in finding a good minimum training error.
2.13 Regularization
In order to deal with the overfitting problem, several authors have proposed different regularization methods. Regularization is defined as "any modification" made to "a learning algorithm that is intended to reduce its generalization error but not its training error", and preventing overfitting in this way is one of the main concerns when designing a machine learning architecture [3].
This section introduces some regularization techniques from the wide range of options available, namely Data augmentation, Early stopping, Bagging and Dropout, the L2 and L1 weight penalties, and others.
Data augmentation
Data augmentation has been used by several authors and consists of generating additional data for the training datasets in order to obtain a machine learning model with better generalization [46, 47, 36, 48].
Early Stopping
When training certain large models with a marked tendency to overfit, the training error decreases over time but the validation set error starts to increase at a given moment [3]. Early stopping is an efficient capacity control approach based on monitoring the performance on the validation set during training in order to return the parameters with the lowest validation set error, rather than the latest parameters [3, 49]. A recent work proposes a novel early stopping criterion which removes the need for a held-out validation set [50].
Bagging and Dropout
Bagging (an acronym for "bootstrap aggregating") is another regularization procedure; it reduces the generalization error by combining several models [51, 3]. These are trained separately and then vote on the output for test examples, based on the assumption that different models will not all make the same errors on the same test set. This general strategy is called model averaging, and the techniques that employ it are also known as ensemble methods [3].
Dropout is a strategy proposed by [52] and is a variant of the ensemble method in which different neural network topologies are combined by randomly dropping out nodes during the training phase, in order to prevent complex co-adaptations and to enhance the generalization performance of the network [53]. Dropout has been used by several authors, whether with in-depth explanations or with improvements such as the "standout" method, fast dropout training and others [54, 55, 56, 57, 58, 59, 60]. DropConnect is a generalization of Dropout which randomly drops the weights instead of the activations [61].
Weight penalty L2 and L1
Parameter norm penalties are a regularization approach based on limiting the model's capacity by adding penalties on its parameters [3, 62, 53]. For neural network models, a parameter norm penalty that penalizes only the weights, like L2 or L1, is typically selected. L2 regularization is also called weight decay or Tikhonov regularization and is the most common form of parameter regularization, encouraging near-zero weights [63, 64]. L1 regularization results in a sparser solution compared to L2, meaning that some parameters have an optimal value of zero, which is useful as a feature selection mechanism [3]. LASSO (least absolute shrinkage and selection operator) is a well known model based on the L1 penalty, proposed by [65] and with recent adaptations [3, 66].
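In practice, both penalties amount to adding a term on the weights to the unregularized loss; a small sketch, where the regularization strength lam is a hyperparameter we introduce only for illustration:

import numpy as np

def penalized_loss(data_loss, weights, lam=1e-4, norm="l2"):
    # L2 (weight decay) pushes weights towards zero; L1 drives some of them exactly to zero.
    if norm == "l2":
        return data_loss + lam * np.sum(weights ** 2)
    return data_loss + lam * np.sum(np.abs(weights))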
Others
Multi-task learning is a mechanism whose main goal is to improve generalization by training tasks in parallel using a shared representation. When applied to artificial neural networks, it uses a shared hidden layer trained in parallel on all tasks, benefiting the overall learning [67, 68]. This method has been applied with success in areas as diverse as natural language processing, video games and biomedical science [69, 70, 71].
Sparse representation is achieved by penalizing the activations of the units in a neural network so that their activations become sparse [3, 72]. Although this method performs well, it has difficulties dealing with low dimensional data; still, [73] proposed an effective method to overcome this situation.
Parameter tying is another technique that allows models to learn good representations of the input data by reducing the number of learnable parameters in Convolutional Neural Networks, which makes it possible to train these models with fewer examples [2, 74].
2.14 Multiresolution Processing
Multiresolution processing and analysis refers to the decomposition of a signal into more than one scale
or resolution [75, 76]. A signal can be defined as ”a function that conveys information about the behaviour
of a system or attributes of some phenomenon” that can be processed into images, sound, and others
[77].
The basic idea behind multiresolution theory is not recent. In the beginning of the 1800s, Joseph Fourier proposed essential theories about frequency analysis, using superpositions of sines and cosines to represent signals, which allowed the development of new approaches later on [78]. One of the most interesting later discoveries concerned wavelets. Wavelets are small wave-like oscillations with diverse frequencies and limited duration that can be used as a mathematical tool to extract information from signals [78, 75]. In this sense, the work of Stéphane Mallat and Yves Meyer (after 1980) introduced the wavelet representation as a significantly new approach to image processing and analysis, called multiresolution theory. This theory incorporates techniques from different fields, namely signal processing, digital speech recognition and pyramidal image processing, in which a given signal is decomposed into different scales or resolutions and then reconstructed from the elements of its decomposition [75, 79, 78].
Multiresolution processing and analysis is a very useful technique that is applied to the field of image
processing and computer vision. We can find applications of this technique in object detection and visual
recognition [80, 81, 82], robotic grasping detection [83], alignment and tracking [84], machine learning
[76, 85] and others [76].
In this section, we give an overview of the multiresolution technique applied to the field of image processing and computer vision, with a focus on Image Pyramids.
Digital Images
A digital image can be represented either by vector graphics, based on mathematical formulas that define geometrical primitives (e.g. polygons, lines), or by raster graphics, represented by pixels. The term "digital image" usually refers to the raster graphics image, which is typically a two dimensional array of values called pixels, its smallest elements. Each pixel is defined by a certain number of bits, within a range of intensity values, indicating the colour components it can represent. This concept is called bit depth or pixel depth [8, 75].
Among the colour encoding models available (e.g. YUV, CIELAB), the RGB colour model is a popular method used in computing (Figure 2.9). The acronym means that images in the red (R), green (G) and blue (B) colour space are defined by three numbers, one for each colour. Each component can be represented by a range of values depending on the bit depth. For example, a 24-bit colour image typically uses 8 bits for each of the R, G and B components, giving more than 16 million (2^24) colour variations. An 8-bit component can have 256 possible values (2^8), from 0 to 255. RGB digital images may have an additional component that can create partial or full transparency, called the alpha channel [86].

Figure 2.9: RGB raster image example showing individual pixels as squares and colour components as values.
In the case of black and white digital images, the intensity varies between the different grey levels,
from the darkest to the lightest grey. They have a single 8-bit component per pixel, resulting in 256
different grey levels.
Image Pyramids
An image pyramid is a structure that provides multiresolution image representations [8, 87, 88]. This kind of representation is somewhat similar to human visual encoding. The human visual system is very effective in object recognition and in the representation of pictorial information, but has difficulties evaluating distances and areas and accurately distinguishing grey scales [89]. When we analyse a given image with objects and features of many sizes, large and high contrast objects are viewed coarsely, while the remaining objects usually need to be at a higher resolution for a proper examination [75]. Studying images at different resolutions is the main motivation behind the concept of image pyramids.
An image pyramid is a simple and computationally effective structure in which the base of the pyramid contains a high-resolution image, followed by a collection of decreasing resolution images up to the apex, which contains a low-resolution approximation of the image. Moving towards the apex, image size and resolution decrease. Considering an image at a base level J with size 2^J × 2^J or N × N, where J = \log_2 N, there are J + 1 resolution levels in the pyramid, from 2^J × 2^J to 2^0 × 2^0, with 0 ≤ j ≤ J. Nevertheless, most pyramids are truncated to P + 1 levels, where 1 ≤ P ≤ J, since going to a very reduced resolution of a big original image may not add relevant information [75].
To generate an image pyramid, the original image can be decomposed into a set of lowpass filtered copies via a Gaussian pyramid, or into a set of bandpass filtered copies via a Laplacian pyramid [88]. In a Gaussian pyramid (Figure 2.10), the lowpass filtering is done by smoothing the image with the appropriate filter and then downsampling (subsampling) the smoothed image in an iterative fashion. In a Laplacian pyramid, the bandpass filtering is done by subtracting each Gaussian pyramid level from the next lower level and then performing an image interpolation between adjacent levels [90]. Other filter operations can also be employed [8].
Figure 2.10: Schematic representation of a Gaussian pyramid with five levels.
The smoothing operation used in a Gaussian pyramid is a Gaussian filter (Gaussian blur), applied to transform each pixel of the original image. The Gaussian filter in two dimensions is given by a Gaussian function G(x, y):

G(x, y) = \frac{1}{2 \cdot \pi \cdot \sigma^2} \cdot e^{-\frac{x^2 + y^2}{2 \cdot \sigma^2}}    (2.13)

where σ is the standard deviation of the Gaussian distribution. The Gaussian function expresses the normal distribution, an important concept in statistics used to represent random variables with a large variety of distributions [91]. Visually, this formula produces the shape obtained from the Gaussian "bell curve" rotated around the vertical axis [92].
Since the Gaussian function extends to infinity, we must truncate it, taking advantage of its near zero values at more than 3σ from the mean. As a solution, we can use a simple rectangular window function, with values from the truncated normal distribution, to build a convolution matrix. This matrix is applied to the image, setting new values for its pixels. In other words, the Gaussian filtering process involves the convolution of the image with the convolution matrix [8].
The blurred image is then downsampled by a factor of 2. The Gaussian filter and downsample steps are repeated to generate the typical P + 1 levels of the pyramid. These operations ensure that the sampling theorem is respected, meaning that we get no distortions of the signal (image) from sampling. Thus, the size reduction goes together with an appropriate smoothing, ensuring a properly downsampled image [89].
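The whole blur-and-downsample cycle is short in code; the sketch below uses SciPy's Gaussian filter (the σ = 2 value matches the one later used in the preliminary experiments, and the random image is just a stand-in):

import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, levels=3, sigma=2.0):
    # Repeatedly smooth with the Gaussian filter of Eq. (2.13) and keep every second row/column.
    pyramid = [image]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyramid[-1], sigma=sigma)
        pyramid.append(blurred[::2, ::2])
    return pyramid

pyr = gaussian_pyramid(np.random.rand(28, 28))
print([level.shape for level in pyr])   # [(28, 28), (14, 14), (7, 7)]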
2.15 Subspace Tree
The subspace tree is an efficient hierarchical structure, described by a tree, that can deal with the negative effects of high dimensional data [93, 94].
The advances in hardware technology and the exponential production, storage and retrieval of digital content have been a challenge for computer scientists. Image databases, as sets of multimedia objects, can traditionally be accessed by file name or keyword, for example, or by their content, such as colour, texture, shape and others [95, 96]. Content-based image retrieval is a set of techniques for searching images in large databases, given a content query expressed as a weighted combination of features [94]. The similarity between an image and a content query is given by the distance between their feature vectors in the high dimensional space. These vectors can support efficient indexing methods [97, 96].
High dimensional data, plentiful today in the form of image databases and the like, can be identified when the number of features is larger than the number of samples [4]. However, dealing with a large number of dimensions can lead to query performance problems. When the number of dimensions grows, the performance tends to worsen, running into the "curse of dimensionality" problem [8].
The subspace tree can tackle this problem [97, 96]. By dividing a high dimensional space into a sequence of low dimensional subspaces, a subspace hierarchy is obtained. A distance function then measures the difference between corresponding multimedia objects in a space and a subspace. The process starts in the lowest dimension subspace and continues to the next higher dimension subspace. With this approach the "curse of dimensionality" problem does not arise, due to the mapping of the multimedia objects into a low dimensional space [98].
More formally, suppose a sequence of subspaces U_0 ⊃ U_1 ⊃ U_2 ⊃ ... ⊃ U_t with dim(U_0) > dim(U_1) > dim(U_2) > ... > dim(U_t). V is a vector space, with V = U_0. Here dim(U_r) is the dimension of the subspace U_r, represented graphically by the number of nodes in the tree. A family of projections by which multimedia objects are mapped to subspaces can be defined as a subspace sequence:

P_1 : U_0 \mapsto U_1; \quad P_2 : U_1 \mapsto U_2; \quad \ldots; \quad P_t : U_{t-1} \mapsto U_t.    (2.14)

If an orthogonal projection is applied, the subspaces obtained correspond to the multiresolution image representations of the image pyramid [88].
In order to obtain an efficient indexing structure, the ratio d between consecutive dimensions should satisfy d ≤ 16, with the relation between spaces defined as

\frac{\dim(U_0)}{\dim(U_1)} \leq d, \quad \frac{\dim(U_1)}{\dim(U_2)} \leq d, \quad \ldots, \quad \frac{\dim(U_{t-1})}{\dim(U_t)} \leq d.    (2.15)
The computational costs of this subspace method, given a query vector, can be determined as

costs = \sum_{i=1}^{t} \sigma_i \cdot \dim(U_{i-1}) + s \cdot \dim(U_t),    (2.16)

given the number of points σ_i below a given bound ε for the sequence of subspaces U_i and a dataset of size s. The costs tend to decrease until they reach a minimum value as the number of subspaces increases [8, 93, 94].
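For reference, Equation (2.16) translates directly into a short cost computation (the list layout of the arguments is our own convention):

def subspace_costs(sigmas, dims, s):
    # sigmas = [sigma_1, ..., sigma_t]; dims = [dim(U_0), ..., dim(U_t)]; s is the dataset size.
    t = len(dims) - 1
    return sum(sigmas[i] * dims[i] for i in range(t)) + s * dims[t]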
Chapter 3
Multiresolution Backpropagation
Learning
Multiresolution Backpropagation Learning is related to the LeCun Convolutional Neural Networks approach; the main difference is that no receptive fields are used [33, 99]. We propose a method that combines concepts from multiresolution image processing and from deep learning in order to obtain good generalization, avoiding the problem of overfitting. Multiresolution Backpropagation Learning can be described by three main components:
1. Generation of Gaussian pyramids from an initial pattern;
2. Artificial neural networks training on each resolution of the pattern;
3. Weights replication to initialize the following artificial neural network, from the lower to the higher
resolution of the pattern.
For the sake of clarity, we first describe each component of the proposed method individually before presenting its overall architecture.
3.1 Gaussian Pyramid Generation
The generation of the Gaussian pyramid is the first stage of Multiresolution Backpropagation Learning. Proposed by Burt and Adelson (1983), the pyramid is a multiresolution structure representing successive images that are filtered and scaled down. The base level contains the original image and is the starting point of the pyramid construction process [88].
Consider an image dataset D = \{(I_1, c_1), \ldots, (I_n, c_n) : n \in \mathbb{N}^+\}, where I is a two-dimensional image, also denoted I(x, y), and c is the associated class or label. The Gaussian pyramid is defined on the original image I as:

G_0(x, y) = I(x, y), \quad \text{for level } l = 0    (3.1)
and then an averaging process, carried out by a REDUCE function, produces the following pyramid levels:

G_l(x, y) = REDUCE(G_{l-1}(x, y)), \quad \text{otherwise.}    (3.2)

The REDUCE function involves the convolution of each image with a Gaussian filter G(x, y) (Equation (2.13)) and a downsampling operation by a factor of 2, resulting in the following level of the pyramid [90]. Thus, starting with an initial image G_0 of size N pixel columns × N pixel rows, the image G_1 of size N/2 × N/2 is created. Repeating REDUCE, an image G_2 of size N/4 × N/4 is obtained, resulting in a three-level pyramid structure.
The process described is applied to all images that compose the dataset D. The initial same-resolution images in D give rise to two new lower resolution image datasets. The dataset corresponding to the N/2 × N/2 images is represented by D' = \{(I'_1, c_1), \ldots, (I'_n, c_n) : n \in \mathbb{N}^+\} and the dataset of the N/4 × N/4 images is denoted by D'' = \{(I''_1, c_1), \ldots, (I''_n, c_n) : n \in \mathbb{N}^+\}.
3.2 Artificial Neural Networks
Inputs
Images are the input to the artificial neural networks (Figure 3.1). Each input image is represented as a two-dimensional grayscale array. Since we have three image datasets, we also need three separate networks. The lowest resolution images from dataset D'', level l = 2 of the Gaussian pyramid, are the input to the first artificial neural network (NN1); the medium resolution images from dataset D', level l = 1 of the pyramid, are the input to the second neural network (NN2); and the highest resolution images from dataset D, level l = 0 of the pyramid, are the input to the third neural network (NN3).
Training
We use feedforward networks with backpropagation-based training as the artificial neural network architecture, based on [33]. The remaining configurations are not based on any specific feedforward network architecture, but during the experimental phase we found other contributions relevant [100].
The three networks (Figure 3.1) have input layers with different numbers of units, depending only on the input image resolution. They also have one hidden layer each and an output layer with 10 units. The activation function used in the hidden layers is the typical hyperbolic tangent:

\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},    (3.3)

with output values in the range (−1, 1), and in the output layers a softmax function:

\sigma(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{J} e^{x_j}} \quad \text{for } i = 1, \cdots, J,    (3.4)

giving output values between (0, 1). Then a loss function is applied, the cross-entropy, a measure of the dissimilarity between the true labels and the predicted labels. It is typically used in training when the models have softmax outputs [3].
Figure 3.1: Schematic representation of three images at different resolution as input to three feedforward networks with backpropagation-based training.
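The forward computation of each network can be summarized in a few NumPy lines; this is a simplified sketch of Equations (3.3) and (3.4) together with the cross-entropy loss, not the TensorFlow code used in the experiments:

import numpy as np

def forward(x, W1, b1, W2, b2):
    h = np.tanh(W1 @ x + b1)          # hidden layer, Eq. (3.3)
    z = W2 @ h + b2
    e = np.exp(z - z.max())           # subtract the maximum for numerical stability
    return e / e.sum()                # softmax over the 10 output units, Eq. (3.4)

def cross_entropy(p, t_onehot):
    # Dissimilarity between the predicted distribution p and the one-hot true label.
    return -np.sum(t_onehot * np.log(p + 1e-12))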
For NN1, the weights and biases are initialized randomly from a normal distribution with mean 0 and standard deviation 1. For the remaining artificial neural networks, we resort to weight replication, explained in section 3.3. The biases initialize the following network as they are.
The architectures of NN1, NN2 and NN3 are similar. The main difference occurs in the training and
in the number of input units in the input layer. We applied early stopping during the training phase of
each network, thus resulting in a different number of training epochs. After initialization, the results are
continually improved by training, from the lower to the higher resolution.
3.3 Weight replication
Since we have three artificial neural networks training images at different resolutions, some interconnection must be made between them in order to generate relevant results. This is where the replication of weights between networks takes place.
After the training of NN1, we have the resulting weights of the process, represented as a matrix of values. The NN1 weights then initialize NN2, and subsequently the NN2 weights initialize NN3.
In order to replicate the weights between networks, we resort to the Kronecker product of two matrices, denoted by:

A \otimes B = \begin{bmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{bmatrix},    (3.5)

where A is an m × n matrix of weight values and B is a 2 × 2 matrix of ones.
This process is always repeated between consecutive artificial neural networks, from the lowest resolution to the following higher resolution. Figure 3.2 illustrates the weight replication along resolutions.
Figure 3.2: Hierarchy of weight replication.
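Using NumPy, the replication of Equation (3.5) is a single Kronecker product; the shapes below are illustrative and only show that every trained weight value is copied into a 2 × 2 block of the larger matrix:

import numpy as np

def replicate_weights(A):
    # Kronecker product with a 2 x 2 matrix of ones, doubling both dimensions of A (Eq. 3.5).
    return np.kron(A, np.ones((2, 2)))

A = np.random.randn(3, 4)             # stand-in for weights trained at the lower resolution
print(replicate_weights(A).shape)     # (6, 8)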
3.4 Multiresolution Backpropagation Learning Architecture
Having described the components of the proposed method, we can now formulate the overall architecture (Figure 3.3). We start with the convolution of each image of dataset D with a Gaussian filter and a downsampling by a factor of 2, originating a new dataset D'. The process is repeated in order to obtain the dataset D'', as stated in Algorithm 1. Performing these steps corresponds to the generation of the Gaussian pyramid.

Figure 3.3: Schematic representation of three images at different resolution as input to three feedforward networks with backpropagation-based training, showing the direction of the image resolution reduction and the direction of the training procedure.
Algorithm 1: Prepare the dataset.
1 foreach image G_0(x, y) = I_n ∈ D for level l = 0 do
2     G_1(x, y) = REDUCE(G_0(x, y));        // Apply REDUCE
3     Save image G_1(x, y) = I'_n;          // Build dataset D' for level l = 1
4     G_2(x, y) = REDUCE(G_1(x, y));
5     Save image G_2(x, y) = I''_n;         // Build dataset D'' for level l = 2
6 end
Each dataset represents a level in the pyramid. The training starts with level l = 2 as input to NN1 (Algorithm 2). After the training of the first network, we apply weight replication to initialize NN2, which is trained with level l = 1 as input. We repeat the same procedure between NN2 and NN3, the latter taking level l = 0 as input. Since level l = 0 is the base level of the pyramid, NN3 is the last network to be trained and dispenses with the weight replication component.
Algorithm 2: Training.
1 foreach level l = 2, l = 1, l = 0 do
2 if level l = 2 then
3 Initialize a feedforward neural network randomly;
4 Train with backpropagation;
5 Apply early stopping;
6 else
7 Initialize a feedforward neural network from the preceding resolution network;
8 Train with backpropagation;
9 Apply early stopping;
10 end
11 end
Chapter 4
Empirical Experiments
This chapter presents the experiments conducted using Multiresolution Backpropagation Learning (MrBL) and the MNIST image dataset (section 4.1). We describe the main steps performed during development until the final results were obtained.
We resorted to standard backpropagation-based training since it is a simple and efficient procedure, frequently used with feedforward networks, and it was used in the LeCun Convolutional Neural Networks [33, 28].
All the experiments were developed using the Python programming language, version 3.5, and the TensorFlow software library, version 1.2, a machine learning framework used to build neural network models [101, 102]. Another library used was NumPy, version 1.13, a package for scientific computing that supports multidimensional array operations [103].
4.1 Dataset
Experiments were carried out using the MNIST dataset [99]. The acronym stands for Modified National Institute of Standards and Technology, and it is a widely used dataset of handwritten digits suited for pattern recognition methods [42, 104]. It contains normalized grayscale images of size 28 × 28 pixels, split into a training set of 60,000 images and a test set of 10,000 images. Each image has a label from 0 to 9, representing the digit it depicts.
4.2 Preprocessing
MNIST is a dataset balanced across classes, already shuffled and with images of normalized size. Since the pixel values of each image vary in the range (0, 255), normalization to the range (0, 1) was carried out. This is a typical procedure in computer vision, since scaling the images makes their values more evenly distributed for training [3].
Additionally, the training and test labels were one-hot encoded. Instead of using the original label for the digit class, we used binary variables, where 0 means that the example does not belong to a class and 1 marks the class it belongs to. The one-hot encoding avoids imposing a spurious ordering on the classes when feeding data into the model.
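A minimal NumPy sketch of this preprocessing is shown below; the function name and the stand-in arrays are illustrative only.

```python
import numpy as np

def preprocess(images, labels, n_classes=10):
    """Scale pixel values from (0, 255) to (0, 1) and one-hot encode the labels."""
    images = images.astype(np.float32) / 255.0
    one_hot = np.eye(n_classes, dtype=np.float32)[labels]
    return images, one_hot

# Illustration with stand-ins for MNIST arrays.
x = np.random.randint(0, 256, size=(5, 28, 28))
y = np.array([3, 0, 7, 7, 1])
x_scaled, y_one_hot = preprocess(x, y)
print(float(x_scaled.max()) <= 1.0, y_one_hot.shape)   # True (5, 10)
```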
4.3 Performance measure
All the tested models were evaluated with respect to their performance. In classification tasks, it is usual to measure accuracy, which gives the proportion of correct outputs of a model.
The error rate is an equivalent measure of performance that gives the proportion of incorrect outputs of the model [3]. Most of the existing methods applied to MNIST report their results as an error rate, which may better reflect the behaviour of interest. Since accuracy and error rate are equivalent (the error rate is 1 minus the accuracy), the latter was selected as the performance measure.
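For concreteness, a short sketch of how the error rate can be computed from the network outputs and the one-hot labels (function and variable names are illustrative):

```python
import numpy as np

def error_rate(outputs, one_hot_labels):
    """Proportion of incorrectly classified examples, i.e. 1 - accuracy."""
    predictions = np.argmax(outputs, axis=1)
    targets = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predictions != targets))
```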
4.4 Preliminary Experiments
Before the development of the final architecture, we performed some preliminary experiments. The main purpose was to test whether the predictions about the generalization ability of the method were promising. The preliminary experiments were run on the CPU of a laptop with an Intel Core i3-2310M processor at 2.10 GHz and 8 GB of RAM.
We started by performing experiments with a different number of hidden units in each artificial neural network. We prepared the MNIST dataset in order to generate the three-level Gaussian pyramid (Algorithm 1). The standard deviation of the Gaussian filter was σ = 2 (Equation (2.13)).
The lowest resolution images, of size 7 × 7 pixels, were input to NN1, containing one hidden layer with 2 units. The medium resolution images, of size 14 × 14 pixels, were input to NN2 with 4 hidden units. The highest resolution images, corresponding to the original MNIST images, were input to NN3 with 8 hidden units. We followed Algorithm 2 and confirmed that the loss was reduced, but no early stopping was formally applied. The three networks were trained on the three complete training datasets for 200 epochs (iterations over the dataset) [3]. A learning rate of η = 0.3 was used for all artificial neural networks. The MNIST test set was used to assess the classification performance of the model.
To evaluate the model, we used the original MNIST images as input to a feedforward neural network with backpropagation-based training and random initialization. The other network settings were the same as those of the NN3 of the evaluated models.
The first results were not successful (Table 4.1). Comparing the output classification error of the proposed model, given by NN3, with that of the evaluation model, we can see that the percentage of incorrectly recognized test digits was better (lower) for the evaluation model. This preliminary test indicated that using an artificial neural network with fewer hidden units to initialize the following one with more hidden units may not work.
So, we modified our model. Instead of considering a different number of hidden units, we used the same number of hidden units, performing an experiment with one hidden layer of 8 units in each of the networks (Table 4.1). The results were better for the proposed model than for the evaluation model. Even though the classification error values were high, the results obtained were promising.
Model                              NN1    NN2    NN3    Evaluation
Different number of hidden units   89.9   88.6   89.7   87.8
Same number of hidden units        90.6   85.8   81.4   87.8

Table 4.1: Image classification error rate (%) of preliminary experiments on MNIST.
However, in order to achieve a model with good generalization, we need to scale it up and perform adjustments.
4.4.1 Preliminary CIFAR-10 Experiments
We performed some preliminary experiments on the CIFAR-10 dataset in order to test whether the proposed method performs well on different data [100]. The dataset consists of 60,000 colour images of size 32 × 32, representing 10 classes (e.g. airplane, cat, dog, among others). It is split into a training set of 50,000 images and a test set of 10,000 images.
Before applying the proposed method, we converted the dataset from colour to grey level to reduce the number of channels from three to one, simplifying the process (a conversion sketch follows Table 4.2). Briefly, we adjusted NN1, NN2 and NN3 to have one hidden layer with 10,000 units each and a learning rate of η = 0.01, and they were trained over 25, 90 and 30 epochs, respectively. We evaluated against a randomly initialized feedforward neural network with settings similar to NN3, trained over 30 epochs. We empirically verified that the proposed method worked on the CIFAR-10 dataset and improved the results, but it probably needs much more computing power to improve them further (Table 4.2). Accuracy is the performance measure typically used on this dataset; higher accuracy represents better results.
Model                         NN1    NN2    NN3    Evaluation
Same number of hidden units   27.2   29.2   28.6   26.7

Table 4.2: Image classification accuracy (%) of preliminary experiments on CIFAR-10.
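The thesis does not detail the exact colour-to-grey conversion used; the sketch below shows one common possibility based on luminance weighting, applied to a batch of CIFAR-10-shaped images (the weights and names are assumptions, not taken from the original implementation).

```python
import numpy as np

def to_grayscale(rgb_images):
    """Convert a batch of RGB images (N, H, W, 3) to grey level (N, H, W).

    Uses a common luminance weighting; a plain channel average would also
    reduce the three colour channels to one.
    """
    weights = np.array([0.299, 0.587, 0.114])      # assumed weighting
    return np.tensordot(rgb_images, weights, axes=([-1], [0]))

# Illustration with stand-ins for CIFAR-10 images.
batch = np.random.rand(4, 32, 32, 3)
print(to_grayscale(batch).shape)                   # (4, 32, 32)
```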
4.5 Experiments
The components and the overall architecture of Multiresolution Backpropagation Learning were described in Chapter 3. From the preliminary experiments to the final model, several experiments were carried out in order to obtain the best results. This section presents the preprocessing and experimental settings of MrBL as well as the most relevant results obtained. These experiments were performed on the CPU of a server with an Intel Xeon E5-1620 processor at 3.60 GHz and 64 GB of RAM.
4.5.1 MNIST Gaussian Pyramid
We used the MNIST dataset to generate the three-level Gaussian pyramid (Algorithm 1). Images from level l = 2, with size 7 × 7, formed an input vector of 7 × 7 × 1 = 49 values for NN1. Images from level l = 1 formed an input vector of 14 × 14 × 1 = 196 values for NN2. Images from level l = 0 formed an input vector of 28 × 28 × 1 = 784 values for NN3.
The Gaussian filtering process was performed and tested with different convolution matrix settings. We tested common values of the standard deviation to understand its influence on MrBL behaviour and to select the best parameter (Equation (2.13)). The values tested were σ = 1, σ = 2 and σ = 3, corresponding to the 68–95–99.7 statistical rule. Figures 4.1, 4.2 and 4.3 show their visual effect. The window size was set to 5 × 5, which produces an appropriate filtering and is computationally less costly [88]. A sketch of the corresponding convolution matrices is shown after the figures.
(a) Images from level l = 0.
(b) Images from level l = 1. (c) Images from level l = 2.
Figure 4.1: MNIST Gaussian Pyramid sample images with σ = 1.
(a) Images from level l = 1. (b) Images from level l = 2.
Figure 4.2: MNIST Gaussian Pyramid sample images with σ = 2 (two levels).
(a) Images from level l = 1. (b) Images from level l = 2.
Figure 4.3: MNIST Gaussian Pyramid sample images with σ = 3 (two levels).
Quality of Gaussian pyramid
After setting up the entire model, we tested different Gaussian filters to understand whether they affected the error rate. After testing the proposed model with σ = 1, σ = 2 and σ = 3, we found that the best results were obtained with σ = 1 (Figure 4.4). Since the centre of the convolution matrix holds the highest value of the Gaussian distribution, larger σ values produce a wider "bell curve" shape. Within the fixed window, higher values also produce "sharp edges", with undesired results.
Figure 4.4: Error rate (%) obtained using different standard deviation values with MrBL.
4.5.2 Networks Training Parameters
Several adjustments were made to the MrBL network parameters (Table 4.3). We selected random batches of 100 images at each iteration as input to each artificial neural network; more specifically, parameter updating was performed with mini-batch stochastic gradient descent. NN1 had 49 units in the input layer, NN2 had 196 input units and NN3 had 784, corresponding to the size of each input vector. Each network had one hidden layer with 9000 units and 10 units in the output layer (section 3.2).
We resorted to the softmax cross-entropy computation implemented by TensorFlow to obtain the model loss. The training process was carried out with a gradient descent optimizer, also from TensorFlow, with the learning rate set to η = 0.01. This optimizer relies on automatic differentiation to implement backpropagation [102].
Early stopping was applied to control the number of epochs during training. With this technique, we were not only interested in obtaining good performance on each individual artificial neural network by stopping at the point with the lowest test error; we were also interested in the overall generalization ability of the MrBL method. Thereby, NN1 was trained for 20 epochs, NN2 for 50 epochs and NN3 for 30 epochs.
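A minimal TensorFlow 1.x sketch of one such network with the settings above is given below. The graph construction and variable names are illustrative, not a reproduction of the original code; in the MrBL setting, the weights of NN2 and NN3 would be initialized from the replicated weights of the preceding network instead of tf.random_normal.

```python
import tensorflow as tf

# Sizes for NN1 (49 inputs); NN2 and NN3 would use 196 and 784 inputs.
n_in, n_hidden, n_out = 49, 9000, 10

x = tf.placeholder(tf.float32, [None, n_in])
y = tf.placeholder(tf.float32, [None, n_out])

# Random initialization with mean 0 and standard deviation 1.
w1 = tf.Variable(tf.random_normal([n_in, n_hidden], mean=0.0, stddev=1.0))
b1 = tf.Variable(tf.zeros([n_hidden]))
w2 = tf.Variable(tf.random_normal([n_hidden, n_out], mean=0.0, stddev=1.0))
b2 = tf.Variable(tf.zeros([n_out]))

hidden = tf.tanh(tf.matmul(x, w1) + b1)        # hyperbolic tangent activation
logits = tf.matmul(hidden, w2) + b2            # softmax is applied inside the loss

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
```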
We evaluated the model using the original MNIST dataset as input to a feedforward neural network with 784 input units, 9000 hidden units and 10 output units. We used backpropagation-based training over 30 epochs with random initialization (mean 0 and standard deviation 1). The remaining settings are equal to those of NN3. We refer to it as the "BL" model.
4.5.3 Results
To achieve efficient training, several optimizations were carried out (section 4.5.2). We tried to keep the model as simple as possible in order to demonstrate its performance.
We chose a random batch size between 1 and a few hundred to improve the training time and the convergence of the algorithm [105].
With stochastic gradient descent, the error surface landscape changes between image batches, probably with different local minima or saddle points [5]. This technique improved the training process relative to the preliminary experiments.

Parameter        NN1      NN2              NN3              BL
Input units      49       196              784              784
Hidden units     9000     9000             9000             9000
Output units     10       10               10               10
Learning rate    0.01     0.01             0.01             0.01
Epochs           20       50               30               30
Initialization   Random   From preceding   From preceding   Random

Table 4.3: Summary of MrBL and BL evaluation training parameters.
A fixed learning rate was the most suitable solution for MrBL. It helped to obtain both a proper convergence of NN1 and NN2 and an adequate time to reduce the NN3 loss.
The wider hidden layers used in MrBL helped to optimize the results and were inspired by [100]. Even though a single wider hidden layer may have a more flattened landscape, our multiresolution approach suggests less flattened landscapes with more local minima [10].
The hyperbolic tangent was chosen as activation function due to its better performance. It converged faster than the sigmoid, since it outputs values in the range (-1, 1) instead of (0, 1), avoiding gradient bias [28]. It also performed better than the ReLU, which in our experiments suffered from the vanishing gradient problem. The softmax in the output layer units is a typical choice for multi-class classification tasks [26].
The weights and biases were initialized randomly from a normal distribution (section 3.2) in order to maximize the generalization ability of MrBL. Setting the standard deviation to 1 avoided overfitting in NN3, even though regularization concerns suggest smaller values [3]. In order to preserve the information from the preceding, lower resolution landscape, we used the weight replication process (section 3.3) among the MrBL networks, stopping the training before reaching the bottom of each landscape. The weight replication followed the increase in image resolution to preserve information and to improve generalization. The number of epochs chosen was variable (section 4.5.2). NN1 obtained a better convergence by stopping the training earlier. NN2 presented more convergence and low overfitting, so we stopped its training later. We stopped the NN3 training when no relevant error rate improvement was obtained. Figure 4.5 shows the training convergence of MrBL and of the BL evaluation model. The variable number of epochs was scaled for better visualization. We can observe that MrBL converged faster than the BL evaluation model.
The larger gap between the training and test loss curves of the BL evaluation model indicates more overfitting than in MrBL (Figure 4.6). The fact that MrBL starts the training process from a lower loss value suggests that it does not reach the bottom of each landscape, instead using nearly optimal lower values along the hierarchy of landscapes.

Figure 4.5: Training set convergence properties of MrBL networks and BL evaluation, with scaled epochs.

Figure 4.6: Training and test set convergence properties of MrBL and BL evaluation.

We obtained a better result in the percentage of incorrectly recognized test digits, meaning that the output error rate of MrBL was on average lower than that of the BL evaluation. We performed three runs for each method and the mean and standard deviation of the values are presented in Table 4.4. Since the value intervals do not overlap, the results are statistically significant, which indicates that the MrBL method gives an advantage over the simple BL method.

                 MrBL
                 NN1           NN2           NN3           BL
Error rate (%)   7.32 ± 0.54   5.83 ± 0.20   8.24 ± 0.33   10.92 ± 0.14

Table 4.4: Image classification error rate of MrBL and BL evaluation on MNIST.

Figure 4.7 shows the test set convergence of both methods and the dispersion of the values during the training process. It shows a better generalization ability towards new data.

Figure 4.7: Test set convergence properties of MrBL and BL evaluation, with vertical bars representing the standard deviation.

We also verified that training BL for the same number of epochs as the sum of all epochs in MrBL did not show improvement (Table 4.5). We performed three runs and the results obtained were quite similar to those of BL over 30 epochs, but revealed a stronger tendency to overfit.

                 BL
Error rate (%)   11.07 ± 0.45

Table 4.5: Image classification error rate of BL evaluation on MNIST with 100-epoch training.
Chapter 5
Conclusions
The Multiresolution Backpropagation Learning method obtained better overall results than simple backpropagation-based training. The proposed method gives faster training and the possibility of overcoming local minima. By using a sequence of subspaces, represented by images at different resolutions, as input to feedforward networks with backpropagation-based training, we most probably managed to reach nearly optimal lower values along the hierarchy of landscapes.
In developing the aforementioned method, we did not intend to obtain the best results on MNIST, but to demonstrate that it works. It is a novel alternative method for regularization, avoiding overfitting and avoiding getting stuck in local minima.
5.1 Achievements
• We compared MrBL to the conventional BL and verified that it converged faster with less overfitting;
• We empirically verified that MrBL gives an advantage, by reaching a point near a global minimum and avoiding local minima;
• We observed that MrBL gives statistically significantly better results in the MNIST digit recognition task.
5.2 Future Work
In the future, the algorithm should be optimized so that it represents the error surface better while requiring less computational power. Additionally, a deeper exploration of the loss surface properties should be carried out. Another interesting direction would be to explore different methods complemented with a multiresolution approach, since it could bring advantages.
Bibliography
[1] X. Liao, A. V. Vasilakos, and Y. He. Small-world human brain networks: Perspectives and chal-
lenges. Neuroscience & Biobehavioral Reviews, 2017.
[2] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew. Deep learning for visual understand-
ing. Neurocomput., 2016.
[3] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
[4] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and
Applications. Springer Berlin Heidelberg, 2011.
[5] S. Dube. High dimensional spaces, deep learning and adversarial examples. CoRR, 2018.
[6] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimiza-
tion problems. CoRR, 2015.
[7] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surface of multilayer
networks. CoRR, 2015.
[8] A. Wichert. Intelligent Big Multimedia Databases. World Scientific, 2015.
[9] H. W. Lin, M. Tegmark, and D. Rolnick. Why Does Deep and Cheap Learning Work So Well?
Journal of Statistical Physics, 2017.
[10] Y. Dauphin, R. Pascanu, Ç. Gülçehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking
the saddle point problem in high-dimensional non-convex optimization. CoRR, 2014.
[11] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The
bulletin of mathematical biophysics, 1943.
[12] M. T. Hagan, H. B. Demuth, and M. Beale. Neural Network Design. PWS Publishing Co., 1996.
[13] F. Rosenblatt. The Perceptron, a Perceiving and Recognizing Automaton Project Para. Cornell
Aeronautical Laboratory, 1957.
[14] R. Rojas. Neural Networks: A Systematic Introduction. Springer-Verlag, 1996.
[15] B. Kröse and P. van der Smagt. An introduction to Neural Networks. The University of Amsterdam,
8th edition, 1996.
[16] S. Haykin and S. Haykin. Neural Networks and Learning Machines. Prentice Hall, 2009.
[17] F. Agostinelli, M. D. Hoffman, P. J. Sadowski, and P. Baldi. Learning activation functions to improve
deep neural networks. CoRR, 2014.
[18] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Inc., 1995.
[19] S. Ruder. An overview of gradient descent optimization algorithms. CoRR, 2016.
[20] B. Widrow and M. E. Hoff. Adaptive switching circuits. 1960 IRE WESCON Convention Record,
1960.
[21] C. De Sa, K. Olukotun, and C. Ré. Global convergence of stochastic gradient descent for some
non-convex matrix problems. arXiv preprint arXiv:1411.1134, 2014.
[22] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating Stochastic Gradient
Descent. ArXiv e-prints, 2017.
[23] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for
deep learning. In Proceedings of the 28th International Conference on International Conference
on Machine Learning, 2011.
[24] S.-Y. Zhao and W.-J. Li. Fast asynchronous parallel stochastic gradient descent: A lock-free
approach with convergence guarantee. In AAAI, 2016.
[25] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew. Deep learning for visual understand-
ing: A review. Neurocomputing, 2016.
[26] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics).
Springer-Verlag, 2006.
[27] Y. Lecun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
[28] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of
the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, 1998.
[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Neurocomputing: Foundations of research.
chapter Learning Representations by Back-propagating Errors, pages 696–699. MIT Press, Cam-
bridge, MA, USA, 1988. ISBN 0-262-01097-6. URL http://dl.acm.org/citation.cfm?id=
65669.104451.
[30] A. Prieto, B. Prieto, E. M. Ortigosa, E. Ros, F. Pelayo, J. Ortega, and I. Rojas. Neural networks:
An overview of early research, current frameworks and new challenges. Neurocomputing, pages
242–268, 2016.
[31] G. E. Dahl, T. N. Sainath, and G. E. Hinton. Improving deep neural networks for lvcsr using
rectified linear units and dropout. 2013 IEEE International Conference on Acoustics, Speech and
Signal Processing, 2013.
[32] N. Qian. On the momentum term in gradient descent learning algorithms. Neural Netw., 1999.
[33] Y. Lecun. Generalization and network design strategies. Elsevier, 1989.
[34] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern
recognition unaffected by shift in position. Biological Cybernetics, 1980.
[35] W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective receptive field in deep
convolutional neural networks. In Advances in Neural Information Processing Systems 29. Curran
Associates, Inc., 2016.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional
neural networks. In Advances in Neural Information Processing Systems 25. Curran Associates,
Inc., 2012.
[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabi-
novich, et al. Going deeper with convolutions. In CVPR, 2015.
[38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recogni-
tion. arXiv preprint arXiv:1409.1556, 2014.
[39] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional
networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[40] S. Wu, S. Zhong, and Y. Liu. Deep residual learning for image steganalysis. Multimedia Tools and
Applications, 2017.
[41] N. A. Hamid, N. M. Nawi, R. Ghazali, and M. N. M. Salleh. Solving local minima problem in back
propagation algorithm using adaptive gain, adaptive momentum and adaptive learning rate on
classification problems. In International Journal of Modern Physics: Conference Series, 2012.
[42] J. Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 2015.
[43] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks.
In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics,
2010.
[44] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the
Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.
[45] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In
Proceedings of the 30th International Conference on International Conference on Machine Learn-
ing - Volume 28, 2013.
[46] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classi-
fication. In Proceedings of the 25th ieee conference on computer vision and pattern recognition
(cvpr 2012), 2012.
[47] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for
visual recognition. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision –
ECCV 2014, 2014.
[48] E. A. Smirnov, D. M. Timoshenko, and S. N. Andrianov. Comparison of regularization methods for
imagenet classification with deep convolutional neural networks. Aasri Procedia, 2014.
[49] Y. Bengio et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning,
2009.
[50] M. Mahsereci, L. Balles, C. Lassner, and P. Hennig. Early stopping without a validation set. CoRR,
2017.
[51] L. Breiman. Bagging predictors. Mach. Learn., 1996.
[52] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural
networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[53] W. Sun and F. Su. Regularization of deep neural networks using a novel companion objective
function. In Image Processing (ICIP), 2015 IEEE International Conference on, 2015.
[54] P. Baldi and P. J. Sadowski. Understanding dropout. In Advances in Neural Information Processing
Systems 26. Curran Associates, Inc., 2013.
[55] J. Ba and B. Frey. Adaptive dropout for training deep neural networks. In Advances in Neural
Information Processing Systems 26. Curran Associates, Inc., 2013.
[56] S. Wang and C. Manning. Fast dropout training. In Proceedings of the 30th International Confer-
ence on Machine Learning, 2013.
[57] D. A. McAllester. A pac-bayesian tutorial with A dropout bound. CoRR, 2013.
[58] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple
way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
[59] S. Wager, S. Wang, and P. S. Liang. Dropout training as adaptive regularization. In Advances in
Neural Information Processing