Multiresolution Backpropagation Learning

    Ricardo Jorge Ferreira Ponciano

    Thesis to obtain the Master of Science Degree in

    Information Systems and Computer Engineering

    Supervisor: Prof. Andreas Miroslaus Wichert

    Examination Committee

Chairperson: Prof. Mário Jorge Costa Gaspar da Silva
Supervisor: Prof. Andreas Miroslaus Wichert

    Member of the Committee: Prof. João Carlos Serrenho Dias Pereira

    June 2018

Acknowledgments

First I would like to thank my thesis advisor, Prof. Andreas Wichert, for all the necessary support, provided in a gradual and timely manner. I want to express my gratitude for his knowledge sharing about the deep learning field. It was a very positive experience and I found him not just a thesis advisor, but also a true mentor.

I also want to thank my family for all the support during the course of this work.

Resumo

Treinar dados com elevada dimensão requer a minimização de superfícies de erro complicadas. Propomos uma abordagem multiresolução com treino incremental baseado em retropropagação para melhorar a generalização. A pirâmide Gaussiana, gerada a partir de um padrão inicial de imagens, é a entrada para redes neuronais de alimentação direta que aprendem desde reduzida até elevada resolução. Após a inicialização do treino, os valores precedentes inicializam a rede neuronal seguinte. Aplicámos este método ao conjunto de dados MNIST para reconhecimento de padrões. A Aprendizagem Retropropagada Multiresolução generalizou melhor do que um treino baseado em simples retropropagação, com convergência mais rápida. Verificámos empiricamente que podemos chegar próximo de um mínimo global, evitando mínimos locais.

Palavras-chave: Multiresolução, Redes neuronais, Generalização, Mínimos locais, Aprendizagem profunda

Abstract

High dimensional data training requires minimizing complicated error surfaces. We propose a multiresolution approach with incremental backpropagation-based training to improve generalization. A Gaussian pyramid, generated from an initial pattern of images, is the input to feedforward neural networks that learn from lower to higher resolution. After training initialization, the preceding values initialize the following neural network. We applied this method to the MNIST dataset for pattern recognition. Multiresolution Backpropagation Learning generalized better than simple backpropagation-based training, with faster convergence. We verify empirically that we can reach near a global minimum, avoiding local minima.

Keywords: Multiresolution, Neural networks, Generalization, Local minima, Deep learning

Contents

    Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

    Resumo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

    Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

    List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

    List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii

    1 Introduction 1

    1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    1.3 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

    2 Background 5

    2.1 Artificial neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    2.2 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

    2.3 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

    2.4 Gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    2.5 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    2.6 Stochastic Gradient Descent Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    2.7 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    2.8 Deep Feedforward Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

    2.9 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

    2.10 LeCun Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

    2.11 Deep Learning Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

    2.12 Overfitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    2.13 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

    2.14 Multiresolution Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

    2.15 Subspace Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

    3 Multiresolution Backpropagation Learning 21

    3.1 Gaussian Pyramid Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

    3.2 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

    3.3 Weight replication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.4 Multiresolution Backpropagation Learning Architecture . . . . . . . . . . . . . . . . . . . . 25

    4 Empirical Experiments 27

    4.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    4.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

    4.3 Performance measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

    4.4 Preliminary Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

    4.4.1 Preliminary CIFAR-10 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    4.5.1 MNIST Gaussian Pyramid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

    4.5.2 Networks Training Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    4.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

    5 Conclusions 35

    5.1 Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

    Bibliography 37

List of Tables

    4.1 Image classification error rate (%) of preliminary experiments on MNIST. . . . . . . . . . . 29

    4.2 Image classification accuracy (%) of preliminary experiments on CIFAR-10. . . . . . . . . 29

    4.3 Summary of MrBL and BL evaluation training parameters. . . . . . . . . . . . . . . . . . . 32

    4.4 Image classification error rate of MrBL and BL evaluation on MNIST. . . . . . . . . . . . . 34

    4.5 Image classification error rate of BL evaluation on MNIST with 100 epoch training. . . . . 34

List of Figures

    2.1 A simplified mathematical model of McCulloch and Pitts artificial neuron. . . . . . . . . . . 5

    2.2 Graphical representation of activation functions. . . . . . . . . . . . . . . . . . . . . . . . . 6

    2.3 Surface generated by the training error of a linear neuron with two input weights. Also

    known as error surface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

    2.4 Illustration of a saddle point example [10]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

    2.5 A simplified model of an artificial neural network. . . . . . . . . . . . . . . . . . . . . . . . 9

    2.6 Convolution step [25]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.7 Generic convolutional neural network architecture. The convolution step occurs in convolutional layers, max pooling occurs in pooling layers and the fully connected layers correspond to multilayer artificial neural networks [25]. . . . . . . . . . . . . . . . . . . . 13

2.8 Illustration of a local minimum example, given by the red dot. . . . . . . . . . . . . . . . 13

2.9 RGB raster image example showing individual pixels as squares and colour components as values. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.10 Schematic representation of a Gaussian pyramid with five levels. . . . . . . . . . . . . . 18

    3.1 Schematic representation of three images at different resolution as input to three feedfor-

    ward networks with backpropagation-based training. . . . . . . . . . . . . . . . . . . . . . 23

    3.2 Hierarchy of weight replication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

    3.3 Schematic representation of three images at different resolution as input to three feedfor-

    ward networks with backpropagation-based training, showing the direction of the image

    resolution reduction and the direction of the training procedure. . . . . . . . . . . . . . . . 25

    4.1 MNIST Gaussian Pyramid sample images with σ = 1. . . . . . . . . . . . . . . . . . . . . 30

    4.2 MNIST Gaussian Pyramid sample images with σ = 2 (two levels). . . . . . . . . . . . . . . 30

    4.3 MNIST Gaussian Pyramid sample images with σ = 3 (two levels). . . . . . . . . . . . . . . 30

    4.4 Error rate (%) obtained using different standard deviation values with MrBL. . . . . . . . . 31

    4.5 Training set convergence properties of MrBL networks and BL evaluation, with scaled

    epochs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

4.6 Training and test set convergence properties of MrBL and BL evaluation. . . . . . . . . . 33

    4.7 Test set convergence properties of MrBL and BL evaluation, with vertical bars represent-

    ing the standard deviation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

Chapter 1

    Introduction

The human brain is a complex system, with about 86 billion neurons interacting through approximately 150 trillion synapses, allowing humans to perform an enormous diversity of tasks; its features are desirable in artificial models [1]. Studied by neuroscience, the brain's structure and operation inspired the development of artificial neural networks, in order to simulate the brain's learning capacity. Until now, it is the best known learning device.

Deep learning relies on specific architectures of artificial neural networks that allow learning from datasets [2]. Deep learning methods are used in many domains, especially in pattern recognition tasks, such as image or speech recognition. Its recent popularity is multi-factorial, mainly related to the availability of larger amounts of data and also due to an increase in computational power.

One of the main challenges of learning algorithms is the difficulty of obtaining a model that generalizes properly to new data [3]. Nowadays, the abundance of high dimensional data, like image databases, spans fields such as information technology, astronomy, bioinformatics and others, and it can lead to performance problems [4]. Deep learning offers the possibility to overcome such limitations; however, its architectures are challenging to optimize and not yet fully understood [5, 6].

Empirical studies have tried to explain certain phenomena that occur during training, focusing on specific types of critical points that occur on the loss function shape [7, 5]. When training involves minimizing the loss or error function using gradient descent based algorithms, the derivative of those functions might equal zero, meaning that no information is given regarding the direction in which the function decreases. Points where this happens are called critical points. There are three types of such points: local maxima, local minima and saddle points. The latter two are more concerning, since they typically occur during the optimization and can lead to non-optimal solutions [3].

We propose an optimization method to overcome the problem of local minima using a multiresolution approach and feedforward neural networks with backpropagation-based training. By obtaining a sequence of low dimensional subspaces and training incrementally, we verify empirically that we can reach near a global minimum, avoiding local minima [8]. This approach tries to deal with the undesired effects of high dimensional data, leading to better generalization and avoiding overfitting. It gives an alternative method to explore deep learning for pattern recognition tasks.

    1.1 Motivation

Deep learning works very well [9]. Having an artificial neural network with several hidden layers optimized with the stochastic gradient descent algorithm prevents it from getting trapped in a local minimum [10]. It has been empirically shown that having many layers results in more saddle points than local minima, and that in those situations the high dimensional error surface becomes a more flattened landscape [7, 5]. However, studies demonstrated that local minima tend to occur towards the bottom of the landscape and are expected to be found more often near the end of training [5].

Motivated by these theories, we developed a novel regularization method that resorts to a hierarchy of different and gradually less complicated landscapes to improve generalization. We created a sequence of subspaces, represented by images at diverse resolutions, and performed learning for feedforward networks with stochastic gradient descent computed with backpropagation, from lower to higher resolution.

This way, we do not start with such a complicated landscape, and we use the preceding network to initialize the following one without reaching each of the landscape bottoms, where more local minima lie. Instead of using many hidden layers in the feedforward networks we used one, because many hidden layers result in more saddle points than local minima.

    1.2 Objectives

The main goals of this thesis are to compare the Multiresolution Backpropagation Learning method to the conventional backpropagation procedure and to verify empirically whether the proposed method gives an advantage over traditional backpropagation-based training, using feedforward neural networks.

    1.3 Thesis Outline

This work is divided into five chapters. After the present chapter, chapter 2 introduces the basic concepts of deep learning, also referring to their major problems and solutions. Then, we give an overview of multiresolution processing and the subspace tree, paving the way for the experiments carried out later.

Chapter 3 describes the components of the Multiresolution Backpropagation Learning method, showing how it was done and focusing on the description of the algorithm behind it.

Chapter 4 presents the empirical experiments conducted with the proposed method. We start with the dataset description and the preprocessing; preliminary experiments were then made before the main experiments. In the experiments we explain the method's training process in more detail and then present the results obtained.

Chapter 5 presents the conclusions of the work done, the main achievements and ideas for future studies.

Chapter 2

    Background

    2.1 Artificial neuron

The search for artificial models that mimic the human brain started around 1940, when the first electronic computers were developed.

The linear model introduced by McCulloch and Pitts in 1943 was perhaps the first artificial neuron, presenting important features which can be found in many artificial neural networks [11].

The basic components of an artificial neuron are shown in Figure 2.1, related in such a way that the weighted sum of input signals is compared to a threshold (activation function) to determine the output [12]. A neural network is no more than a collection of nodes or units (Figure 2.1) connected by links.

    Figure 2.1: A simplified mathematical model of McCulloch and Pitts artificial neuron.

    2.2 Perceptron

    Proposed by Frank Rosenblatt in 1957, the perceptron was built around the McCulloch and Pitts model

    of a neuron to solve pattern recognition problems [13]. The author introduced a learning rule for training

    perceptron networks, that will converge to the correct network weights [12].

Consider a simple perceptron, also known as a one-layer feedforward network, with a set of n inputs and one output. The neuron's input (net) is given by the dot product between the weight vector (with components w_i) and the input vector (with components x_i), i.e. by their weighted summation. The bias term (w_0) is an extra weight constant [14, 15]:

y_k = \phi(net) = \phi\left(\sum_{i=1}^{n} w_i \cdot x_i + w_0\right)    (2.1)

Obtaining the perceptron's output requires the application of the activation function φ(), which can be either linear or nonlinear [15]. In this case φ() is a hard threshold, represented by the nonlinear sign function:

\phi(net) := \operatorname{sgn}(net) = \begin{cases} 1 & \text{if } net \geq 0 \\ -1 & \text{otherwise.} \end{cases}    (2.2)

    The XOR Problem

A simple perceptron can only deal with linearly separable inputs, a limitation identified long ago by Minsky and Papert, 1969 [3]. A one-layer perceptron represents a hyperplane in n-dimensional space able to divide linearly separable inputs, and for this reason it can only deal with the boolean functions AND, OR, NAND and NOR. Regarding XOR, there is no hyperplane that can classify input patterns under these conditions [16].

    2.3 Activation Functions

Whether in a simple perceptron or in the units of a neural network, some kind of activation function is usually applied [15]. The design of activation functions for training the neural networks used in deep learning is currently an area of great research [17]. It is desirable that they present some general characteristics, such as being non-linear, continuous and differentiable, in order to facilitate the application of gradient-based methods [14].

There are several types of activation functions that can be used, like the sigmoid or logistic (Figure 2.2(a)), the hyperbolic tangent (Tanh), the hard Tanh, the rectified linear unit (ReLU) (Figure 2.2(b)) and its variants, the SoftPlus, the Softmax, the Maxout or the radial basis function [3, 14].

Figure 2.2: Graphical representation of activation functions: (a) Sigmoid; (b) ReLU.
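For reference, a small NumPy sketch (not from the thesis) of three of the activation functions named above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # logistic function, outputs in (0, 1)

def tanh(x):
    return np.tanh(x)                  # hyperbolic tangent, outputs in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)          # rectified linear unit, f(x) = max(0, x)

x = np.linspace(-3.0, 3.0, 7)
print(sigmoid(x))
print(tanh(x))
print(relu(x))
```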

2.4 Gradient descent

Since a simple perceptron can fail to classify inputs that are not linearly separable, optimization is required. Gradient descent, also called steepest descent, is an optimization algorithm that gradually changes vectors to find a local minimum of a function [18]. For a better understanding, consider an unthresholded perceptron or linear unit:

o = \sum_{i=0}^{n} w_i \cdot x_i = net.    (2.3)

Now we need to define the training error or loss function (E(w)) for N training examples, based on the target output values (t_k) minus the output values of the linear unit (o_k). This function determines the shape of the surface which will try to fit the training samples [15] (Figure 2.3):

E(w) = \frac{1}{2} \cdot \sum_{k=1}^{N} (t_k - o_k)^2.    (2.4)

Figure 2.3: Surface generated by the training error of a linear neuron with two input weights. Also known as error surface.

Here w refers to the weight vector w = (w_0, w_1) on which o depends. In order to know the direction of steepest descent at each point of the training error function, we need to compute the derivative of E with respect to each component of the weight vector and move in the downward sloping direction:

\Delta w_i = -\eta \cdot \frac{\partial E}{\partial w_i}.    (2.5)

The constant η is the learning rate, indicating the step size in the gradient descent algorithm. With these considerations, we can now obtain a more practical algorithm by differentiating equation (2.4) and substituting the result into equation (2.5), which yields the weight update rule for gradient descent:

\Delta w_i = \eta \cdot \sum_{k=1}^{N} (t_k - o_k) \cdot x_{k,i}.    (2.6)
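A minimal NumPy sketch of the update rule (2.6) for a linear unit on synthetic data; the data, learning rate and epoch count are illustrative assumptions, not values from the thesis:

```python
import numpy as np

def train_linear_unit(X, t, eta=0.01, epochs=100):
    """Batch gradient descent on E(w) = 1/2 * sum_k (t_k - o_k)^2, using update rule (2.6)."""
    N, n = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])   # prepend a constant input x_0 = 1 for the bias weight w_0
    w = np.zeros(n + 1)
    for _ in range(epochs):
        o = Xb @ w                          # outputs o_k of the linear unit for all N samples
        w += eta * Xb.T @ (t - o)           # delta_w_i = eta * sum_k (t_k - o_k) * x_{k,i}
    return w

# Toy example: recover t = 2*x + 1 from 50 noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
t = 2.0 * X[:, 0] + 1.0 + 0.05 * rng.standard_normal(50)
print(train_linear_unit(X, t))   # weights approach [1.0, 2.0]
```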

2.5 Stochastic Gradient Descent

    The stochastic gradient descent, also known as LMS (least mean-square) or delta rule, is an extension of

    the previous algorithm where the weight updates occur for each training sample instead of the complete

    training set N :

\Delta w_i = \eta \cdot (t_k - o_k) \cdot x_{k,i}.    (2.7)

By performing the weight updates iterating over one training sample at a time, we can evaluate the gradient in a computationally more efficient and faster way than with the gradient descent rule (equation (2.6)). Still, stochastic gradient descent performs updates with higher variance, meaning that it can skip some local minima when there are multiple [19].
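For comparison with the batch rule above, a sketch (again with illustrative, assumed data and hyperparameters) of the per-sample update (2.7):

```python
import numpy as np

def train_linear_unit_sgd(X, t, eta=0.05, epochs=20, seed=0):
    """Stochastic gradient descent: apply delta_w_i = eta * (t_k - o_k) * x_{k,i} one sample at a time."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])   # constant input x_0 = 1 for the bias weight
    w = np.zeros(n + 1)
    for _ in range(epochs):
        for k in rng.permutation(N):       # visit the training samples in random order
            o_k = Xb[k] @ w                # output for the single sample k
            w += eta * (t[k] - o_k) * Xb[k]
    return w

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(50, 1))
t = 2.0 * X[:, 0] + 1.0
print(train_linear_unit_sgd(X, t))   # weights approach [1.0, 2.0]
```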

This rule was applied in 1960 with Adaline, an "adaptive pattern classification machine", and has been widely used by other authors [20, 3].

    2.6 Stochastic Gradient Descent Optimizations

Despite the high popularity of gradient descent techniques, especially stochastic gradient descent, there are some disadvantages associated with them.

    Choosing a proper learning rate (η) is a challenge, since this task is often carried out by trial and

    error. If η is too small, many iterations are needed to get near the best values making this process a

    slow task. On the other hand, if η is too big, it can skip or diverge from the optimal solution due to large

    oscillations of the function [19, 3].

Another main disadvantage is the possibility of the function getting trapped in a local minimum, missing the desired value [19]. In this case, the importance of saddle points should be taken into consideration: a saddle point is a stationary point (Figure 2.4) which can be mistaken for a local minimum, also making the algorithm produce wrong results [10]. Several improvements to stochastic gradient descent have been proposed, with methods such as Momentum, Nesterov Momentum, Adagrad, Adadelta, RMSProp, Adam and others [21, 22, 23, 24].

Figure 2.4: Illustration of a saddle point example [10].

2.7 Deep Learning

Since a perceptron can only classify linearly separable inputs, and knowing that stochastic gradient descent based on a simple perceptron (thresholded or unthresholded) does not exploit the full potential of this algorithm, other approaches are used, resorting to more units and more layers [3, 16]. The usage of many layers – hidden layers included – defines an artificial neural network (Figure 2.5) used in deep learning, able to build complex concepts from simpler ones. There are several different deep learning methods and architectures [25]. This overview will focus on deep feedforward networks, the backpropagation procedure and LeCun convolutional neural networks.

    Figure 2.5: A simplified model of an artificial neural network.

    2.8 Deep Feedforward Networks

The introduction of the deep feedforward network concept, also known as multilayer perceptrons [3], allowed the definition of the first artificial neural networks in which the main limitation of the simple perceptron can be overcome. This kind of artificial neural network is a layered structure through which the information flows, starting in the input units layer, then through the hidden units layers, until it reaches the output units layer, which produces the network outputs. Each unit in a layer is connected to every unit in the following one, meaning that the network is fully connected [16]. Concerning hidden layers, their output is not visible and error correction is only done indirectly [3].

We can use simple perceptrons as the units of a deep feedforward network to classify non-linearly separable inputs, solving the XOR problem, but we then stay limited to linear functions. Alternatively, we can use continuous and differentiable nonlinear activation functions, which allow a better understanding of the interaction of the input variables. So, in a deep feedforward network, units typically have a continuous nonlinear activation function, which can be chosen according to the dataset's nature [17, 26, 27, 28].

2.9 Backpropagation

Applied in the 1980s by Rumelhart, Hinton and Williams and by other groups of authors, backpropagation is a procedure to adjust the weights of a function and is commonly used as the learning method in multilayer artificial neural networks [3, 29, 30, 27]. This is an iterative procedure with two main phases, a propagate-inputs-forward phase and a propagate-errors-backward phase, which for simplicity will be designated phase one and phase two, respectively.

In order to understand the backpropagation procedure, we first need to consider some fundamentals behind it, starting with phase one. As referred to in the deep feedforward networks section, the network units typically have a nonlinear activation function, and it is precisely this type of activation that will be used here. Among the different types of nonlinear activation function (σ()), the sigmoid function will be considered due to its popularity, although others are also used [14]:

\sigma(x) = \frac{1}{1 + e^{-\alpha \cdot x}}    (2.8)

    The term α is a positive constant that indicates the steepness of the function. The activation function

    is applied to each unit of the network, giving the generic output:

o_k = \sigma\left(\sum_{i=0}^{n} w_i \cdot x_{k,i}\right)    (2.9)

    Next, based on equation (2.4) and considering the fact that we have multiple units, the redefinition

    of the training error with respect to all weights of the network is needed, with N training patterns and l

    outputs:

E(w) = \frac{1}{2} \cdot \sum_{k=1}^{N} \sum_{i=1}^{l} (t_{ki} - o_{ki})^2    (2.10)

Having obtained E(w), we know at this point the resulting error between the desired and the computed values at the output units. This training error should be as low as possible, and to achieve it we need to find its derivative, using gradient descent for the output units and the chain rule of calculus for the hidden units. Thus begins phase two of the backpropagation procedure. The error is propagated backward from the output units to the hidden units and the network updates its weights [15]. In a generic way, the rule to update the hidden units' weights, determined from the inputs (x_{k,j}) to the unit and the error term (δ_{ki}) from the unit outputs, is given by:

\Delta w_{ij} = \eta \sum_{k=1}^{N} \delta_{ki} \cdot x_{k,j}.    (2.11)

Both phases of the algorithm typically run several times, one after the other, and the stopping condition of this iterative process can be specified, for example, through a predetermined number of iterations or through a predetermined threshold applied to the error.
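The two phases can be made concrete with a small NumPy sketch (an illustration, not the thesis implementation): a network with one hidden layer of sigmoid units trained on the XOR patterns; the hidden-layer size, learning rate and epoch count are assumptions.

```python
import numpy as np

def sigmoid(x, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * x))       # activation function (2.8)

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # XOR targets
W1, b1 = rng.standard_normal((2, 4)), np.zeros(4)             # input-to-hidden weights and biases
W2, b2 = rng.standard_normal((4, 1)), np.zeros(1)             # hidden-to-output weights and biases
eta = 0.5

for epoch in range(10000):
    # Phase one: propagate the inputs forward
    h = sigmoid(X @ W1 + b1)                      # hidden-unit outputs
    o = sigmoid(h @ W2 + b2)                      # output-unit outputs, as in equation (2.9)
    # Phase two: propagate the errors backward and update the weights
    delta_o = (T - o) * o * (1.0 - o)             # error terms at the output units
    delta_h = (delta_o @ W2.T) * h * (1.0 - h)    # error terms at the hidden units (chain rule)
    W2 += eta * h.T @ delta_o; b2 += eta * delta_o.sum(axis=0)
    W1 += eta * X.T @ delta_h; b1 += eta * delta_h.sum(axis=0)

print(np.round(o, 2))   # outputs should approach the XOR targets [0, 1, 1, 0]
```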

Momentum

The momentum method can be added to backpropagation in order to accelerate the learning process by changing the weight update [31]. With this method, at each iteration the weight update depends on the weight update of the previous iteration [32]. It is based on equation (2.11), but with a momentum term added to improve the convergence and reduce the oscillation:

\Delta w_{ij}(n) = \eta \sum_{k=1}^{N} \delta_{ki}(n) \cdot x_{k,j}(n) + p \, \Delta w_{ij}(n-1),    (2.12)

where n is the iteration number and 0 ≤ p < 1 is the momentum parameter. For example, in the presence of a steep valley on the error surface of the neural network, the system oscillates almost horizontally backwards and forwards, causing a very slow descent along that valley; adding a momentum term can help to overcome this situation [32]. Also, by using the momentum method with a high learning rate (η), the large oscillations that it could cause are reduced, so a minimum will be reached more quickly. Without the momentum method, that minimum might not even be reached [15].
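A sketch of the effect of the momentum term on a generic update; the quadratic objective and the values of the learning rate and of p are illustrative assumptions:

```python
import numpy as np

def momentum_step(w, grad_w, prev_update, eta=0.1, p=0.9):
    """One update with momentum: delta_w(n) = eta * (-grad) + p * delta_w(n-1), as in (2.12)."""
    update = eta * (-grad_w) + p * prev_update   # keep a fraction p of the previous weight update
    return w + update, update

# Minimise the simple quadratic f(w) = 0.5 * w^2, whose gradient is w
w, update = np.array([5.0]), np.zeros(1)
for _ in range(200):
    w, update = momentum_step(w, grad_w=w, prev_update=update)
print(w)   # approaches the minimum at 0
```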

    2.10 LeCun Convolutional Neural Networks

Convolutional neural networks are similar to the artificial neural networks already described above, but present a different architecture. Yann LeCun pioneered the first convolutional neural networks, having begun his investigations before 1989, culminating in the LeNet5 architecture, a convolutional neural network with several layers representing different operations such as convolution, pooling, non-linearity and classification [33]. Its development was inspired by the neocognitron of Fukushima [34]. A convolutional neural network works well in pattern recognition tasks, namely image recognition, using a "grid-like topology" [3]. Generically, it contains one or several convolutional layers, followed by a multilayer artificial neural network. Next, the main operations of a convolutional neural network are described.

    Convolution Step

In the convolution step (Figure 2.6), we want to extract features from an input, typically an image. Since an image is basically a matrix of pixels, another, sliding matrix can be applied over that input. This matrix is called a filter, kernel or feature detector and, as it moves through the input, it multiplies the matrix elements and adds up their result. The resulting matrix is called a feature map or convolved feature. If the values of the kernel matrix are changed, different characteristics of the input image (e.g. curves) can be detected [25]. All these multidimensional matrices are also called tensors [3].

It is also worth noting that each unit of a convolutional neural network layer depends only on one region of the input, designated the receptive field [35].

Figure 2.6: Convolution step [25].

The nonlinear activation function ReLU is usually applied in order to replace the negative values of the feature map by zero. This function allows the artificial neural network to learn faster [36].
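A minimal sketch of the sliding multiply-and-sum operation followed by ReLU; the 5×5 input and the kernel values are illustrative assumptions:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image; at each position multiply element-wise and sum the result."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))      # the feature map (convolved feature)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(0.0, x)                      # replace negative feature-map values by zero

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)       # responds to a left-to-right intensity increase
print(relu(convolve2d_valid(image, kernel)))
```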

    Pooling

The pooling, sub-sampling or down-sampling operation is intended to reduce the dimensions of the feature map while keeping the most relevant information, by applying a function like Max, Sum or Average, among others. These functions define the pooling strategy type, with average pooling and max pooling being the most used [25]. In this operation a sliding window (filter) with a predefined dimension is also used, which advances by a predefined step (stride). By reducing the size of the input representation, the pooling operation helps to control overfitting.
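A sketch of max pooling with a 2×2 window and stride 2; the feature-map values are illustrative:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Keep the maximum of each size x size window, moving the window by `stride` pixels."""
    H, W = feature_map.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

fm = np.array([[1, 3, 2, 0],
               [4, 6, 1, 2],
               [7, 2, 9, 3],
               [0, 1, 4, 5]], dtype=float)
print(max_pool(fm))   # [[6. 2.] [7. 9.]]
```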

    Fully Connected Layer

After the convolution, pooling and non-linearity, a fully connected layer follows, completing the basic elements of a generic convolutional neural network architecture (Figure 2.7). It is no more than a multilayer artificial neural network placed after the last pooling layer, whose output acts as its input. The multilayer artificial neural network then performs a typical classification task. There are several methods based on convolutional neural networks, like AlexNet, GoogLeNet, VGGNet or DenseNet [36, 37, 38, 39].

    2.11 Deep Learning Problems

The usage of deep neural network architectures on different datasets has shown very good results but, despite these achievements, using more layers does not always yield a better learning artificial network [40]. The following subsections refer to the main problems related to typical deep neural networks: local minima, vanishing gradients and overfitting.

Figure 2.7: Generic convolutional neural network architecture. The convolution step occurs in convolutional layers, max pooling occurs in pooling layers and the fully connected layers correspond to multilayer artificial neural networks [25].

    Local Minima

Since the backpropagation procedure uses gradient descent to reduce the training error, it may happen that the algorithm gets trapped in one of the several local minima of a neural network error surface (Figure 2.8) [7, 41].

Figure 2.8: Illustration of a local minimum example, given by the red dot.

This means that with backpropagation there is difficulty in finding the lowest training error value, therefore not converging to the global minimum. It is assumed that an artificial neural network with several hidden layers is less likely to get stuck in a local minimum and that it is easier to find the right parameters, as demonstrated by empirical experiments [27].

    Vanishing Gradients

The vanishing gradients problem is a phenomenon that may occur when training deep neural networks, where the backpropagated error decreases rapidly, tending to zero as it approaches the input layer [42]. By using classical activation functions, like the sigmoid or hyperbolic tangent with a finite activation range, (0, 1) and (−1, 1) respectively, the error output is limited. So, the error is backpropagated over the hidden layers with increasingly smaller values, meaning that weight updates become more and more residual. Some solutions have been proposed to deal with the vanishing gradients problem, and the usage of the rectified linear unit (ReLU), an activation function defined as f(x) = max(0, x), is recommended [42, 3, 43, 44]. The opposite phenomenon can also occur, with backpropagated errors suffering a large increase, designated exploding gradients [42, 45].

    2.12 Overfitting

Overfitting is one of the main challenges in machine learning, in which a learning algorithm (model) has a good performance on the training data, but a poor performance against new data. In order to obtain a correct description of the data, we estimate the minimum training error [8]. During this process, the model adapts very well to the training data, which usually contains noise. Memorization occurs, instead of a smoother and more generalized adaptation [15].

If the learning algorithm is too fitted to the training data, it will perform more poorly on previously unseen data, like a test set, used to assess the classification performance, or a validation set, used for parameter tuning. In the end, we want to find a model in which the difference between the training error and the test error is as small as possible.

The ability to classify new unseen inputs defines the model performance, also known as generalization [3]. Underfitting may also happen, due to a poor performance of the model in finding a good minimum training error.

    2.13 Regularization

In order to deal with the overfitting problem, several authors proposed different regularization methods. Regularization is defined as "any modification" made to "a learning algorithm that is intended to reduce its generalization error but not its training error", and it is one of the main concerns when designing a machine learning architecture [3].

This section will introduce some regularization techniques, among the wide range of options available, namely data augmentation, early stopping, bagging and dropout, the weight penalties L2 and L1, and others.

    Data augmentation

    Data augmentation has been used by several authors and it consists in generating additional data in the

    training datasets in order to obtain a machine learning model with a better generalization [46, 47, 36, 48].

    Early Stopping

When training certain large models with a marked tendency to overfit, the training error decreases over time but the validation set error starts to increase at a given moment [3]. Early stopping is an efficient capacity control approach based on monitoring the performance on the validation set during training in order to return the parameters with the lowest validation set error, rather than the latest parameters [3, 49]. A recent work proposes a novel early stopping criterion which removes the need for a held-out validation set [50].

    Bagging and Dropout

Bagging (an acronym of "bootstrap aggregating") is another regularization procedure; it reduces the generalization error by resorting to the combination of several models [51, 3]. These are trained separately and then vote on the output for test examples, based on the assumption that different models will certainly not all make the same errors on the same test set. This general strategy is called model averaging, and the techniques that employ it are also known as ensemble methods [3].

Dropout is a strategy proposed by [52] and is a variant of the ensemble method where different neural network topologies are combined and their nodes are randomly dropped out in the training phase, in order to prevent complex co-adaptations and to enhance the generalization performance of the network [53]. Dropout has been used by several authors, whether with in-depth explanations or with improvements, like the "standout" method, fast dropout training and others [54, 55, 56, 57, 58, 59, 60]. DropConnect is a generalization derived from Dropout which randomly drops the weights instead of the activations [61].

Weight penalty L2 and L1

Parameter norm penalties are a regularization approach based on limiting the model's capacity by adding some parameter penalties [3, 62, 53]. For neural network models, a parameter norm penalty that penalizes only the weights, like L2 or L1, is typically selected. L2 regularization is also called weight decay or Tikhonov regularization and is the most common form of parameter regularization, encouraging near-zero weights [63, 64]. L1 regularization results in a sparser solution compared to L2, meaning that some parameters have an optimal value of zero, which is useful as a feature selection mechanism [3]. LASSO (least absolute shrinkage and selection operator) is a well known model based on the L1 penalty, proposed by [65], with recent adaptations [3, 66].

    Others

    Multi-task learning is a mechanism whose main goal is to improve generalization by training tasks in

    parallel using a shared representation. When applied to artificial neural networks it uses a shared

    hidden layer trained in parallel on all tasks benefiting the overall learning [67, 68]. This method has been

    applied with success in areas as diverse as natural language processing, video games and biomedical

    science [69, 70, 71].

Sparse representation is achieved by penalizing the activations of the units in neural networks so that their activations become sparse [3, 72]. Although this method has a good performance, it has difficulties dealing with low dimensional data; still, [73] proposed an effective method to overcome this situation.

Parameter tying is another technique that allows models to learn good representations of the input data by reducing the number of learnable parameters in convolutional neural networks, which makes it possible to train these models with fewer examples [2, 74].

    2.14 Multiresolution Processing

    Multiresolution processing and analysis refers to the decomposition of a signal into more than one scale

    or resolution [75, 76]. A signal can be defined as ”a function that conveys information about the behaviour

    of a system or attributes of some phenomenon” that can be processed into images, sound, and others

    [77].

The basic idea behind multiresolution theory is not recent. In the beginning of the 1800s, Joseph Fourier proposed essential theories about frequency analysis, using superpositions of sines and cosines to represent signals, which allowed the development of new approaches later on [78]. One of the most interesting later discoveries was about wavelets. Wavelets are small wave-like oscillations with diverse frequencies and limited duration that can be used as a mathematical tool to extract information from signals [78, 75]. In this sense, the work of Stéphane Mallat and Yves Meyer (after 1980) introduced the wavelet representation as a significantly new approach to image processing and analysis, called multiresolution theory. This theory incorporates techniques from different fields, namely signal processing, digital speech recognition and pyramidal image processing, in which a given signal is decomposed into different scales or resolutions and then reconstructed from the elements of its decomposition [75, 79, 78].

    Multiresolution processing and analysis is a very useful technique that is applied to the field of image

    processing and computer vision. We can find applications of this technique in object detection and visual

    recognition [80, 81, 82], robotic grasping detection [83], alignment and tracking [84], machine learning

    [76, 85] and others [76].

    In this section, we will give an overview of the multiresolution technique applied to the field of image

    processing and computer vision, with focus on Image Pyramids.

    Digital Images

A digital image can be represented either by vector graphics, based on mathematical formulas that define geometrical primitives (e.g. polygons, lines), or by raster graphics, represented by pixels. The term "digital image" usually refers to the raster graphics image, which typically is a two dimensional array of values called pixels, its smallest elements. Each pixel is defined by a certain number of bits, within a range of intensity values, indicating the colour components it can represent. This concept is called bit depth or pixel depth [8, 75].

Among the colour encoding models available (e.g. YUV, CIELAB), the RGB colour model is a popular method used in computing (Figure 2.9). The acronym means that the images in the red (R), green (G) and blue (B) colour space are defined by three numbers, one for each colour. Each component can be represented by a range of values depending on the bit depth. For example, a 24-bit colour image typically uses 8 bits for each of the R, G and B components, giving more than 16 million (2^24) colour variations. An 8-bit component can have 256 possible values (2^8), from 0 to 255. RGB digital images may have an additional component that can create partial or full transparency, called the alpha channel [86].

Figure 2.9: RGB raster image example showing individual pixels as squares and colour components as values.

    In the case of black and white digital images, the intensity varies between the different grey levels,

    from the darkest to the lightest grey. They have a single 8-bit component per pixel, resulting in 256

    different grey levels.

    Image Pyramids

An image pyramid is a structure that corresponds to multiresolution image representations [8, 87, 88]. This kind of representation is somewhat similar to human visual encoding. The human visual system is very effective in object recognition and in the representation of pictorial information, but has difficulties evaluating distances and areas and accurately distinguishing gray scales [89]. When we analyse a given image with objects and features of many sizes, large and high contrast objects are viewed coarsely, and the remaining objects usually need to be at a higher resolution for a proper examination [75]. Studying images at different resolutions is the main motivation behind the concept of image pyramids.

The aforementioned concept is a simple and computationally effective structure where the base of the pyramid contains a high-resolution image, followed by a collection of decreasing resolution images until the apex, which contains a low-resolution approximation of the image. When moving towards the apex, image size and resolution decrease. Considering an image at a base level J with size 2^J × 2^J or N × N, where J = log_2 N, there are J + 1 resolution levels in a pyramid, from 2^J × 2^J to 2^0 × 2^0, with 0 ≤ j ≤ J. Nevertheless, most pyramids are truncated to P + 1 levels, where 1 ≤ P ≤ J, meaning that going to a very reduced resolution of a bigger original image may not add relevant information [75].

To generate an image pyramid, the original image can be decomposed into a set of lowpass filtered copies via a Gaussian pyramid or into a set of bandpass filtered copies via a Laplacian pyramid [88]. In a Gaussian pyramid (Figure 2.10), the lowpass filtering is done by smoothing an image with the appropriate filter and then downsampling (subsampling) the smoothed image, in an iterative fashion. In a Laplacian pyramid, the bandpass filtering is done by subtracting each Gaussian pyramid level from the next lower level and then performing an image interpolation between adjacent levels [90]. Other filter operations can also be employed [8].

Figure 2.10: Schematic representation of a Gaussian pyramid with five levels.

    The smoothing operation used in a Gaussian pyramid is a Gaussian filter (Gaussian blur) that is firstly

    applied to transform each pixel of the original image. The Gaussian filter in two dimensions is given by

    a Gaussian function G(x, y):

G(x, y) = \frac{1}{2 \cdot \pi \cdot \sigma^2} \cdot e^{-\frac{x^2 + y^2}{2 \cdot \sigma^2}}    (2.13)

    where σ is the standard deviation of the Gaussian distribution. A Gaussian function expresses the

    normal distribution, an important statistics concept used to represent random variables with a large

    variety of distributions [91]. Visually, this formula produces a shape obtained from the Gaussian ”bell

    curve”, rotated around the vertical axis [92].

    Since the Gaussian function extends to infinity we must truncate it, due to the presence of near zero

    values at more than 3σ from the mean. As a solution, we can use a simple rectangular window function,

    with values from the truncated normal distribution to build a convolution matrix. This matrix is applied to

    the image, setting new values to its pixels. In other words, the Gaussian filtering process involves the

    convolution of the image with the convolution matrix [8].

Then the blurred image is downsampled by a factor of 2. The Gaussian filter/downsample steps are repeated to generate the typical P + 1 levels of the pyramid. These operations ensure that the sampling theorem is respected, meaning that we get no distortions of the signal (image) by sampling. Thus, the size reduction goes together with an appropriate smoothing, ensuring a properly downsampled image [89].
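The filter/downsample step can be sketched as follows; a 5×5 truncated kernel and edge padding are assumptions made for illustration, not choices stated by the thesis:

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Truncated, normalised 2-D Gaussian built from G(x, y) in equation (2.13)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()                                 # weights sum to 1

def reduce_level(image, sigma=1.0):
    """Smooth the image with the Gaussian filter, then downsample it by a factor of 2."""
    k = gaussian_kernel(5, sigma)
    padded = np.pad(image, 2, mode="edge")             # pad so the blurred image keeps the input size
    blurred = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            blurred[i, j] = np.sum(padded[i:i + 5, j:j + 5] * k)
    return blurred[::2, ::2]                           # keep every second row and column

image = np.random.default_rng(0).random((28, 28))      # stand-in for a 28x28 image
print(reduce_level(image).shape)                       # (14, 14)
```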

2.15 Subspace Tree

    The subspace tree is an efficient hierarchical structure described by a tree that can deal with the negative

    effects of high dimensional data [93, 94].

The advances in hardware technology and the exponential production, storage and retrieval of digital contents have been a challenge for computer scientists. Image databases, as sets of multimedia objects, can traditionally be accessed by file name or keyword, for example, or by content, such as colour, texture, shape and others [95, 96]. Content-based image retrieval is a set of techniques for searching images in large databases, given a content query as a weighted combination of features [94]. The similarity between an image and a content query is given by the distance between their feature vectors in the high dimensional space. These vectors can support efficient indexing methods [97, 96].

High dimensional data, plentiful today in the form of image databases and similar collections, can be identified when the number of features is larger than the number of samples [4]. However, dealing with a large number of dimensions can lead to query performance problems. When the number of dimensions grows, the performance tends to worsen, running into the "curse of dimensionality" problem [8].

The subspace tree can tackle this problem [97, 96]. By dividing a high dimensional space into a sequence of low dimensional subspaces, a subspace hierarchy is obtained. Then, a distance function measures the difference among corresponding multimedia objects in a space and a subspace. The process starts in the lowest dimension subspace and continues to the following higher dimension subspace. With this approach the "curse of dimensionality" problem does not arise, due to the mapping of multimedia objects into a low dimensional space [98].

More formally, suppose a sequence of subspaces U_0 ⊃ U_1 ⊃ U_2 ⊃ ... ⊃ U_t with dim(U_0) > dim(U_1) > dim(U_2) > ... > dim(U_t). V is a vector space, where V = U_0. The dim(U_r) is the dimension of the subspace U_r, represented graphically by the number of nodes in the tree. A family of projections by which multimedia objects are mapped to subspaces can be defined as a subspace sequence:

P_1 : U_0 \mapsto U_1; \quad P_2 : U_1 \mapsto U_2; \quad \ldots; \quad P_t : U_{t-1} \mapsto U_t.    (2.14)

If an orthogonal projection is applied, the subspaces obtained correspond to the multiresolution image representations of the image pyramid [88].

In order to obtain an efficient indexing structure, the value d should satisfy d ≤ 16 and the relation between spaces is defined as

\frac{\dim(U_0)}{\dim(U_1)} \leq d, \quad \frac{\dim(U_1)}{\dim(U_2)} \leq d, \quad \ldots, \quad \frac{\dim(U_{t-1})}{\dim(U_t)} \leq d.    (2.15)

The computing costs of this subspace method, given a query vector, can be determined as

costs = \sum_{i=1}^{t} \sigma_i \cdot \dim(U_{i-1}) + s \cdot \dim(U_t),    (2.16)

given a number of points σ_i below a given bound ε for the sequence of subspaces U_i and a dataset of size s. The costs tend to decrease as the number of subspaces increases, until they reach a minimum value [8, 93, 94].

Chapter 3

Multiresolution Backpropagation Learning

Multiresolution Backpropagation Learning is related to the LeCun Convolutional Neural Networks approach, the main difference being that no receptive fields are used [33, 99]. We propose a method that combines different concepts from multiresolution image processing and from deep learning in order to obtain a good generalization, avoiding the problem of overfitting. Multiresolution Backpropagation Learning can be described in three main components:

    1. Generation of Gaussian pyramids from an initial pattern;

    2. Artificial neural networks training on each resolution of the pattern;

    3. Weights replication to initialize the following artificial neural network, from the lower to the higher

    resolution of the pattern.

For the sake of clarity, we first describe each component of the proposed method individually, before presenting its overall architecture.

    3.1 Gaussian Pyramid Generation

The generation of the Gaussian pyramid is the first stage of Multiresolution Backpropagation Learning. Proposed by Burt and Adelson (1983), the pyramid is a multiresolution structure representing subsequent images that are filtered and scaled down. The base level contains the original image and is the starting point of the pyramid construction process [88].

Given an image dataset D = {(I_1, c_1), ..., (I_n, c_n) : n ∈ N^+}, where I is a two-dimensional image, also denoted by I(x, y), and c is the associated class or label, the Gaussian pyramid is defined on the original image I as:

G_0(x, y) = I(x, y), \quad \text{for level } l = 0    (3.1)

and then an averaging process is carried out by a REDUCE function in the following pyramid levels as:

G_l(x, y) = \text{REDUCE}(G_{l-1}(x, y)), \quad \text{otherwise.}    (3.2)

This means that the REDUCE function involves the convolution of each initial image with a Gaussian filter G(x, y) (Equation (2.13)) and a downsampling operation by a factor of 2, resulting in the following level of the pyramid [90]. Thus, starting with an initial image G_0 of size N pixel columns × N pixel rows, the image G_1 of size N/2 × N/2 is created. Repeating the REDUCE, an image G_2 with size N/4 × N/4 is obtained, resulting in a three-level pyramid structure.

The process described is applied to all images that compose the dataset D. The initial, same-resolution images in D will originate two new lower resolution image datasets. The dataset corresponding to the N/2 × N/2 images is represented by D' = {(I'_1, c_1), ..., (I'_n, c_n) : n ∈ N^+} and the dataset representing the N/4 × N/4 images is denoted by D'' = {(I''_1, c_1), ..., (I''_n, c_n) : n ∈ N^+}.
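A sketch of this dataset preparation follows; the use of scipy.ndimage.gaussian_filter for the smoothing and the random stand-in images are assumptions, not details taken from the thesis:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def reduce_image(image, sigma=1.0):
    """REDUCE: Gaussian smoothing followed by downsampling by a factor of 2 (equation (3.2))."""
    return gaussian_filter(image, sigma=sigma)[::2, ::2]

def build_pyramid_datasets(images, labels, sigma=1.0):
    """From dataset D, build D' (N/2 x N/2 images) and D'' (N/4 x N/4 images), keeping the labels."""
    D1 = [(reduce_image(I, sigma), c) for I, c in zip(images, labels)]
    D2 = [(reduce_image(I1, sigma), c) for I1, c in D1]
    return D1, D2

# Example with random 28x28 arrays standing in for MNIST digits
rng = np.random.default_rng(0)
images = rng.random((10, 28, 28))
labels = rng.integers(0, 10, size=10)
D1, D2 = build_pyramid_datasets(images, labels)
print(D1[0][0].shape, D2[0][0].shape)   # (14, 14) (7, 7)
```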

3.2 Artificial Neural Networks

    Inputs

Images are the input to the artificial neural networks (Figure 3.1). Each input image is represented as a two-dimensional grayscale array. Since we have three image datasets, we also need three separate networks. The lowest resolution images from dataset D'', level l = 2 of the Gaussian pyramid, are the input to the first artificial neural network (NN1); the medium resolution images from dataset D', level l = 1 of the pyramid, are the input to the second neural network (NN2); and the highest resolution images from dataset D, level l = 0 of the pyramid, are the input to the third neural network (NN3).

    Training

We resort to feedforward networks with backpropagation-based training as the artificial neural network architecture, based on [33]. The remaining configurations are not based on any specific feedforward network architecture, but during the experiments phase we found other contributions relevant [100].

The three networks (Figure 3.1) have an input layer with a different number of units, depending only on the input image resolution. They also have one hidden layer each and an output layer with 10 units. The activation function used in the hidden layers is the typical hyperbolic tangent:

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}},    (3.3)

with output values in the range (−1, 1), and in the output layers, a softmax function:

\sigma(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{J} e^{x_j}} \quad \text{for } i = 1, \cdots, J,    (3.4)

giving output values between (0, 1). Then a loss function is applied, the cross-entropy, a measure of dissimilarity between the true labels and the predicted labels. It is typically used in training when the models have softmax outputs [3].
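A sketch of one forward pass through such a network, with tanh hidden units (3.3), softmax outputs (3.4) and the cross-entropy loss; the hidden-layer size of 30 and the random inputs are assumptions made only for illustration:

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """One-hidden-layer feedforward pass: tanh hidden units, softmax output units."""
    h = np.tanh(X @ W1 + b1)                                # hidden activations, equation (3.3)
    z = h @ W2 + b2
    z = z - z.max(axis=1, keepdims=True)                    # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)    # softmax, equation (3.4)
    return p

def cross_entropy(p, labels):
    """Mean cross-entropy between the softmax outputs and the true class labels."""
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

rng = np.random.default_rng(0)
X = rng.random((5, 49))                              # five 7x7 images, flattened (lowest-resolution level)
labels = rng.integers(0, 10, size=5)
W1, b1 = rng.standard_normal((49, 30)), np.zeros(30)
W2, b2 = rng.standard_normal((30, 10)), np.zeros(10)
p = forward(X, W1, b1, W2, b2)
print(cross_entropy(p, labels))
```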

Figure 3.1: Schematic representation of three images at different resolutions as input to three feedforward networks with backpropagation-based training.

For NN1, the weights and biases are initialized randomly from a normal distribution of values, with the mean set to 0 and the standard deviation equal to 1. In the remaining artificial neural networks, we resort to weight replication, explained in section 3.3. As for the biases, they initialize the following network as they are.

The architectures of NN1, NN2 and NN3 are similar. The main differences are in the training and in the number of units in the input layer. We applied early stopping during the training phase of each network, resulting in a different number of training epochs for each. After initialization, the results are continually improved by training, from the lowest to the highest resolution.

    3.3 Weight replication

Since we have three artificial neural networks trained on images at different resolutions, some interconnection must be made between them in order to generate relevant results. This is where the replication of weights between networks comes into play.

After the training of NN1, we have the resulting weights of the process, represented as a matrix of values. Then the NN1 weights initialize NN2 and, subsequently, the NN2 weights initialize NN3.

In order to replicate the weights between networks, we resort to the Kronecker product of two matrices, denoted by:

A ⊗ B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix}, (3.5)

where A is an m × n matrix of weight values and B is a 2 × 2 matrix of ones.

This process is always repeated between artificial neural networks, from the lowest resolution to the next higher resolution. Figure 3.2 illustrates the weight replication along resolutions.
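The sketch below shows one possible NumPy reading of this step, using np.kron with a 2 × 2 matrix of ones as in Equation (3.5). Because the hidden layer keeps the same number of units across networks while each input image doubles in height and width, we assume here that each hidden unit's incoming weights are viewed on the image grid before being expanded; the text leaves the exact reshaping implicit, so the helper names and shape handling are ours.

import numpy as np

def kron_replicate(a):
    """Kronecker product with a 2 x 2 matrix of ones (Equation (3.5)):
    every entry of A is copied into a 2 x 2 block."""
    return np.kron(a, np.ones((2, 2)))

def replicate_input_weights(w, side):
    """Illustrative assumption: view each hidden unit's incoming weights as a
    (side x side) image, expand it with kron_replicate, and flatten it back."""
    expanded = [kron_replicate(w[:, j].reshape(side, side)).ravel()
                for j in range(w.shape[1])]
    return np.stack(expanded, axis=1)

# Example: NN1 input weights of shape (49, n_hidden) become (196, n_hidden),
# which then initialize NN2.
w_nn1 = np.random.randn(7 * 7, 8)
w_nn2_init = replicate_input_weights(w_nn1, side=7)  # shape (196, 8)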

    Figure 3.2: Hierarchy of weight replication.

3.4 Multiresolution Backpropagation Learning Architecture

After the component description of the proposed method, we can now formulate the overall architecture (Figure 3.3). We start with the convolution of each image of dataset D with a Gaussian filter and a downsampling by a factor of 2, originating a new dataset D′. The process is repeated in order to obtain the dataset D′′, as stated in Algorithm 1. Performing these steps corresponds to the generation of the Gaussian pyramid.

Figure 3.3: Schematic representation of three images at different resolution as input to three feedforward networks with backpropagation-based training, showing the direction of the image resolution reduction and the direction of the training procedure.

Algorithm 1: Prepare the dataset.
1 foreach image G0(x, y) = In ∈ D for level l = 0 do  // Apply REDUCE
2   G1(x, y) = REDUCE(G0(x, y));
3   Save image G1(x, y) = I′n;  // Build dataset D′ for level l = 1
4   G2(x, y) = REDUCE(G1(x, y));
5   Save image G2(x, y) = I′′n;  // Build dataset D′′ for level l = 2
6 end
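A direct Python transcription of Algorithm 1 might look as follows, reusing the reduce_level helper sketched earlier; the function name prepare_datasets and the list-of-pairs representation of D are our assumptions.

def prepare_datasets(dataset, sigma=1.0):
    """Algorithm 1 sketch: build D' (level l = 1) and D'' (level l = 2) from D,
    where `dataset` is a list of (image, label) pairs."""
    d_prime, d_double_prime = [], []
    for image, label in dataset:
        g1 = reduce_level(image, sigma)       # line 2: REDUCE to level l = 1
        d_prime.append((g1, label))           # line 3: save into D'
        g2 = reduce_level(g1, sigma)          # line 4: REDUCE to level l = 2
        d_double_prime.append((g2, label))    # line 5: save into D''
    return d_prime, d_double_prime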

Each dataset represents a level in the pyramid. The training starts with level l = 2 as input to NN1 (Algorithm 2). After the training of the first network, we apply weight replication to initialize NN2, which is trained with level l = 1 as input. We repeat the same procedure between NN2 and NN3, but the latter has level l = 0 as input. Since level l = 0 is the base level of the pyramid, NN3 is the last network to be trained, so no further weight replication is needed.

Algorithm 2: Training.
1 foreach level l = 2, l = 1, l = 0 do
2   if level l = 2 then
3     Initialize a feedforward neural network randomly;
4     Train with backpropagation;
5     Apply early stopping;
6   else
7     Initialize a feedforward neural network from the preceding resolution network;
8     Train with backpropagation;
9     Apply early stopping;
10  end
11 end

Chapter 4

    Empirical Experiments

This chapter presents the experiments conducted using Multiresolution Backpropagation Learning (MrBL) and the MNIST image dataset (section 4.1). We describe the main steps performed during development until the final results were obtained.

We resorted to standard backpropagation-based training since it is a simple and efficient procedure, frequently used with feedforward networks, and it was used in LeCun's Convolutional Neural Networks [33, 28].

All the experiments were developed using the Python programming language, version 3.5, and the TensorFlow software library, version 1.2, a machine learning framework used to build neural network models [101, 102]. Another library used was NumPy, version 1.13, a package for scientific computing that supports multidimensional array operations [103].

    4.1 Dataset

Experiments were carried out using the MNIST dataset [99]. The acronym stands for Modified National Institute of Standards and Technology, and it is a widely used dataset of handwritten digits suited for pattern recognition methods [42, 104]. It contains normalized grayscale images of size 28 × 28 pixels, split into a training set of 60,000 images and a test set of 10,000 images. Each image has a label value from 0 to 9, representing the digit on it.

    4.2 Preprocessing

MNIST is balanced across classes, already shuffled and composed of images of normalized size. Since the pixel values of each image vary in the range (0, 255), normalization to the range (0, 1) was carried out. This is a typical procedure in computer vision, since scaling images makes their values more evenly distributed for training [3].

Additionally, the train and test labels were one-hot encoded. Instead of using the original label for the digit class, we used binary variables, where 0 means that the example does not belong to a class and 1 marks the class it belongs to. One-hot encoding avoids introducing a spurious ordering among the classes when feeding data into the model.
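A short sketch of this preprocessing step is shown below; the function name and the NumPy-based one-hot construction are ours.

import numpy as np

def preprocess(images, labels, n_classes=10):
    """Scale pixel values from the range (0, 255) to (0, 1) and one-hot encode
    the integer labels."""
    x = images.astype(np.float32) / 255.0
    y = np.eye(n_classes, dtype=np.float32)[labels]
    return x, y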

    4.3 Performance measure

All the tested models were evaluated with respect to their performance. In classification tasks, it is usual to measure accuracy, which gives the proportion of correct outputs of a model.

Most of the existing methods applied to MNIST express their results as an error rate, which may better reflect the behaviour of interest. The error rate is an equivalent measure of performance that gives the proportion of incorrect outputs of the model [3]. Since accuracy and error rate are equivalent measures, the latter was selected as the performance measure.
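In code, the error rate is simply the complement of accuracy; the helper below (name ours) assumes predictions and labels are given as class indices.

import numpy as np

def error_rate(predictions, labels):
    """Proportion of incorrectly classified examples, i.e. 1 - accuracy."""
    return float(np.mean(predictions != labels))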

    4.4 Preliminary Experiments

Before the development of the final architecture, we performed some preliminary experiments. The main purpose was to test whether the expected generalization ability of the method was promising. The preliminary experiments were run on the CPU of a laptop with an Intel Core i3-2310M processor at 2.10 GHz and 8 GB of RAM.

We started by performing the experiments using a different number of hidden units in each artificial neural network. We prepared the MNIST dataset in order to generate the three-level Gaussian pyramid (Algorithm 1). The standard deviation of the Gaussian filter was σ = 2 (Equation (2.13)).

The lowest-resolution images, with size 7 × 7 pixels, were the input to NN1, containing one hidden layer with 2 units. The medium-resolution images, with size 14 × 14 pixels, were the input to NN2, with 4 hidden units. The highest-resolution images, corresponding to the original MNIST images, were the input to NN3, with 8 hidden units. We followed Algorithm 2 and confirmed that the loss was reduced, but no early stopping was formally applied. The three networks were trained with the three complete training datasets for 200 epochs (an epoch being one iteration over the dataset) [3]. A learning rate of η = 0.3 was defined for all artificial neural networks. The MNIST test set was used to assess the classification performance of the model.

    To evaluate the model we resorted to the original MNIST images as input to a feedforward neural

    network with backpropagation-based training and with random initialization. The other network settings

    were the same as the NN3 of the models evaluated.

The first results were not successful (Table 4.1). Comparing the output classification error of the proposed model, given by NN3, with the evaluation model, we can see that the percentage of incorrectly recognized test digits was better (lower) for the evaluation model. This preliminary test indicated that using an artificial neural network with fewer hidden units to initialize the following one with more hidden units may not work.

So, we modified our model. Instead of using a different number of hidden units, we used the same number of hidden units, performing an experiment with one hidden layer of 8 units in each of the networks (Table 4.1). The results were better for the proposed model than for the evaluation model. Even though the classification error values were high, the results obtained were promising.

Model                               NN1    NN2    NN3    Evaluation
Different number of hidden units    89.9   88.6   89.7   87.8
Same number of hidden units         90.6   85.8   81.4   87.8

Table 4.1: Image classification error rate (%) of preliminary experiments on MNIST.

However, in order to achieve a model with good generalization, we need to scale it up and perform adjustments.

    4.4.1 Preliminary CIFAR-10 Experiments

We performed some preliminary experiments on the CIFAR-10 dataset with the purpose of testing whether the proposed method performs well on different data [100]. The dataset consists of 60,000 colour images of size 32 × 32, representing 10 classes (e.g. airplane, cat, dog, among others). It is split into a training set of 50,000 images and a test set of 10,000 images.

Before applying the proposed method, we converted the dataset from colour to grey level to reduce its number of channels from three to one, simplifying the process. Briefly, we adjusted NN1, NN2 and NN3 to have one hidden layer with 10000 units each and a learning rate of η = 0.01, and they were trained over 25, 90 and 30 epochs, respectively. We evaluated against a randomly initialized feedforward neural network with settings similar to NN3, trained over 30 epochs. We empirically verified that the proposed method worked on the CIFAR-10 dataset and improved the results, but it probably needs considerably more computing power to improve them further (Table 4.2). Accuracy is the typical performance measure used on this dataset; higher accuracy represents better results.

Model                          NN1    NN2    NN3    Evaluation
Same number of hidden units    27.2   29.2   28.6   26.7

Table 4.2: Image classification accuracy (%) of preliminary experiments on CIFAR-10.

    4.5 Experiments

The components and the overall architecture of Multiresolution Backpropagation Learning were described in Chapter 3. From the preliminary experiments to the final model, several experiments were carried out in order to obtain the best results. This section presents the preprocessing and the experimental settings of MrBL, as well as the most relevant results obtained. These experiments were performed on the CPU of a server with an Intel Xeon E5-1620 processor at 3.60 GHz and 64 GB of RAM.

    4.5.1 MNIST Gaussian Pyramid

We resorted to the MNIST dataset to generate the three-level Gaussian pyramid (Algorithm 1). Images from level l = 2, with size 7 × 7, formed an input vector of 7 × 7 × 1 = 49 values to NN1. Images from level l = 1 formed an input vector of 14 × 14 × 1 = 196 values to NN2. Images from level l = 0 formed an input vector of 28 × 28 × 1 = 784 values to NN3.

The Gaussian filtering process was performed and tested with different convolution matrix settings. We tested common values of the standard deviation to understand its influence on the behaviour of MrBL and then select the best parameter (Equation (2.13)). The values tested were σ = 1, σ = 2 and σ = 3, corresponding to the 68–95–99.7 statistical rule. Figures 4.1, 4.2 and 4.3 show their visual effect. The window size was set to 5 × 5; this size produces an appropriate filtering and is computationally less costly [88].
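The convolution matrix can be built by sampling the two-dimensional Gaussian of Equation (2.13) on a 5 × 5 grid; a small sketch (helper name ours) is shown below. The kernel is normalized to sum to one, so the constant factor of the Gaussian cancels out.

import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Sample a (size x size) Gaussian convolution matrix and normalize it."""
    ax = np.arange(size) - size // 2      # e.g. [-2, -1, 0, 1, 2] for size 5
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()          # normalization removes the 1/(2*pi*sigma^2) factor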

Figure 4.1: MNIST Gaussian Pyramid sample images with σ = 1. (a) Images from level l = 0; (b) images from level l = 1; (c) images from level l = 2.

Figure 4.2: MNIST Gaussian Pyramid sample images with σ = 2 (two levels). (a) Images from level l = 1; (b) images from level l = 2.

Figure 4.3: MNIST Gaussian Pyramid sample images with σ = 3 (two levels). (a) Images from level l = 1; (b) images from level l = 2.

    Quality of Gaussian pyramid

After setting up the entire model, we tested different Gaussian filters to understand whether they affected the error rate. After testing the proposed model with σ = 1, σ = 2 and σ = 3, we found that the best results were obtained with σ = 1 (Figure 4.4). Since the centre of the convolution matrix holds the highest value of the Gaussian distribution, larger σ values produce a wider "bell curve" shape. Higher values also produce "sharp edges", with undesired results.

Figure 4.4: Error rate (%) obtained using different standard deviation values with MrBL.

    4.5.2 Networks Training Parameters

Several adjustments were carried out in the MrBL network parameters (Table 4.3). We selected random batches of 100 images on each iteration as input to each artificial neural network; more specifically, mini-batch stochastic gradient descent performed the parameter updates. NN1 had 49 units in the input layer, NN2 had 196 input units and NN3 had 784, corresponding to the size of each input vector. Each network had one hidden layer with 9000 units and 10 units in the output layer (section 3.2). We resorted to a softmax cross-entropy computation implemented by TensorFlow to obtain the model loss. The training process was carried out with a gradient descent optimizer, also from TensorFlow, with the learning rate set to η = 0.01. This optimizer performs automatic differentiation to implement backpropagation [102].
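A minimal TensorFlow 1.x sketch consistent with these settings is shown below; the graph-construction code is not reproduced in the text, so the structure and variable names here are ours. For NN2 and NN3 the initial value of w1 would come from the replicated weights of the preceding network instead of tf.random_normal.

import tensorflow as tf  # TensorFlow 1.x, as used in the thesis

def build_network(n_inputs, n_hidden=9000, n_outputs=10, lr=0.01):
    """One MrBL network: tanh hidden layer, softmax cross-entropy loss,
    plain gradient descent with a fixed learning rate."""
    x = tf.placeholder(tf.float32, [None, n_inputs])
    y = tf.placeholder(tf.float32, [None, n_outputs])
    w1 = tf.Variable(tf.random_normal([n_inputs, n_hidden], mean=0.0, stddev=1.0))
    b1 = tf.Variable(tf.zeros([n_hidden]))
    w2 = tf.Variable(tf.random_normal([n_hidden, n_outputs], mean=0.0, stddev=1.0))
    b2 = tf.Variable(tf.zeros([n_outputs]))
    hidden = tf.nn.tanh(tf.matmul(x, w1) + b1)
    logits = tf.matmul(hidden, w2) + b2
    loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=logits))
    train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)
    return x, y, w1, loss, train_op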

Early stopping was applied to control the number of epochs during training. By using this technique, we were interested not only in obtaining good performance on each individual artificial neural network by stopping at the point with the lowest test error, but also in the overall generalization ability of the MrBL method. Thus, NN1 was trained for 20 epochs, NN2 for 50 epochs and NN3 for 30 epochs.

We evaluated the model using the original MNIST dataset as input to a feedforward neural network with 784 input units, 9000 hidden units and 10 output units. We resorted to backpropagation-based training over 30 epochs and with random initialization (mean set to 0 and standard deviation set to 1). The remaining settings are the same as those of NN3. We called it the "BL" model.

    4.5.3 Results

To achieve efficient training, several optimizations were carried out (section 4.5.2). We tried to keep the model as simple as possible in order to demonstrate its performance.

We chose a random batch size between 1 and a few hundred to improve the training time and the convergence of the algorithm [105].

The usage of stochastic gradient descent means that the error surface landscape changes between image batches, probably with different local minima or saddle points [5]. This technique improved the training process relative to the preliminary experiments.

Parameter         NN1       NN2              NN3              BL
Input units       49        196              784              784
Hidden units      9000      9000             9000             9000
Output units      10        10               10               10
Learning rate     0.01      0.01             0.01             0.01
Epochs            20        50               30               30
Initialization    Random    From preceding   From preceding   Random

Table 4.3: Summary of MrBL and BL evaluation training parameters.

A fixed learning rate was the most suitable solution for MrBL. It helped to obtain both a proper convergence of NN1 and NN2 and enough time to reduce the NN3 loss.

The wider hidden layers used in MrBL helped to optimize the results and were inspired by [100]. Even though a single wider hidden layer may have a more flattened landscape, our multiresolution approach suggests less flattened landscapes with more local minima [10].

The choice of the hyperbolic tangent as activation function was due to its better performance. It converged faster than the sigmoid because it outputs values in the range (−1, 1) instead of (0, 1), avoiding gradient bias [28]. It also performed better than ReLU due to the vanishing gradient problem. The softmax in the output layer is a typical choice for multi-class classification tasks [26].

The weights and biases were initialized randomly from a normal distribution (section 3.2) in order to maximize the generalization ability of MrBL. Setting the standard deviation to 1 avoided overfitting in NN3, even though regularization considerations suggest smaller values [3]. In order to preserve the information from the preceding lower-resolution landscape, we used the weight replication process (section 3.3) among the MrBL networks, stopping the training before reaching the bottom of each landscape. The weight replication followed the increase in image resolution to preserve information and to improve generalization. The number of epochs chosen was variable (section 4.5.2). NN1 obtained a better convergence by stopping the training earlier. NN2 showed further convergence and low overfitting, so we stopped its training later. We stopped the NN3 training when no relevant improvement in the error rate was obtained. Figure 4.5 shows the training convergence of the MrBL networks and the BL evaluation model; the variable number of epochs was scaled for better visualization. We can observe that MrBL converged faster than the BL evaluation model.

The larger gap between the training and test loss curves of the BL evaluation model indicates more overfitting than in MrBL (Figure 4.6). Starting the training process at a lower loss value suggests that MrBL does not reach the landscape bottoms, instead using nearly optimal low values across the hierarchy of landscapes.

We obtained a better result in the percentage of incorrectly recognized test digits, meaning that the output error rate of MrBL was on average lower than that of the BL evaluation. We performed three runs for each method, and the mean and standard deviation of the values are presented in Table 4.4. Since the value intervals do not overlap, the results are statistically significant, which indicates that the MrBL method gives an advantage over the simple BL method. Figure 4.7 shows the test set convergence of both methods and the dispersion of the values during the training process; it shows a better generalization ability towards new data.

Figure 4.5: Training set convergence properties of MrBL networks and BL evaluation, with scaled epochs.

Figure 4.6: Training and test set convergence properties of MrBL and BL evaluation.

                     MrBL
                 NN1            NN2            NN3            BL
Error rate (%)   7.32 ± 0.54    5.83 ± 0.20    8.24 ± 0.33    10.92 ± 0.14

Table 4.4: Image classification error rate of MrBL and BL evaluation on MNIST.

Figure 4.7: Test set convergence properties of MrBL and BL evaluation, with vertical bars representing the standard deviation.

We also verified that using the same number of training epochs in BL as the sum of all epochs in MrBL did not show any improvement (Table 4.5). We performed three runs and the results obtained were quite similar to those of BL trained for 30 epochs, but revealed a stronger overfitting tendency.

                 BL
Error rate (%)   11.07 ± 0.45

Table 4.5: Image classification error rate of BL evaluation on MNIST with 100 training epochs.

Chapter 5

    Conclusions

The Multiresolution Backpropagation Learning method obtained better overall results than simple backpropagation-based training. The proposed method gives faster training and the possibility to overcome local minima. By using a sequence of subspaces, represented by images at different resolutions, as input to feedforward networks with backpropagation-based training, we most probably managed to reach nearly optimal low values across the hierarchy of landscapes.

In developing the aforementioned method, we did not intend to obtain the best results on MNIST, but to demonstrate that it works. It is a novel alternative method for regularization, avoiding overfitting and avoiding getting stuck in local minima.

    5.1 Achievements

    • We compared MrBL to the conventional BL and verified that it converged faster with less overfitting;

    • We empirically verified that MrBL gives an advantage, by reaching near a global minimum, avoiding

    local minima;

• We observed that MrBL gives statistically significantly better results in the MNIST digit recognition task.

    5.2 Future Work

In the future, the algorithm should be optimized to better represent the error surface while requiring less computational power. Additionally, a deeper exploration of the loss surface properties should be carried out. Another interesting future approach would be to explore different methods complemented with a multiresolution approach, since it could bring advantages.

Bibliography

    [1] X. Liao, A. V. Vasilakos, and Y. He. Small-world human brain networks: Perspectives and chal-

    lenges. Neuroscience & Biobehavioral Reviews, 2017.

    [2] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew. Deep learning for visual understand-

    ing. Neurocomput., 2016.

    [3] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

    [4] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and

    Applications. Springer Berlin Heidelberg, 2011.

    [5] S. Dube. High dimensional spaces, deep learning and adversarial examples. CoRR, 2018.

    [6] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimiza-

    tion problems. CoRR, 2015.

    [7] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surface of multilayer

    networks. CoRR, 2015.

    [8] A. Wichert. Intelligent Big Multimedia Databases. World Scientific, 2015.

    [9] H. W. Lin, M. Tegmark, and D. Rolnick. Why Does Deep and Cheap Learning Work So Well?

    Journal of Statistical Physics, 2017.

    [10] Y. Dauphin, R. Pascanu, Ç. Gülçehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking

    the saddle point problem in high-dimensional non-convex optimization. CoRR, 2014.

    [11] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The

    bulletin of mathematical biophysics, 1943.

    [12] M. T. Hagan, H. B. Demuth, and M. Beale. Neural Network Design. PWS Publishing Co., 1996.

    [13] F. Rosenblatt. The Perceptron, a Perceiving and Recognizing Automaton Project Para. Cornell

    Aeronautical Laboratory, 1957.

    [14] R. Rojas. Neural Networks: A Systematic Introduction. Springer-Verlag, 1996.

    [15] B. Kröse and P. van der Smagt. An introduction to Neural Networks. The University of Amsterdam,

    8th edition, 1996.

[16] S. Haykin. Neural Networks and Learning Machines. Prentice Hall, 2009.

    [17] F. Agostinelli, M. D. Hoffman, P. J. Sadowski, and P. Baldi. Learning activation functions to improve

    deep neural networks. CoRR, 2014.

    [18] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Inc., 1995.

    [19] S. Ruder. An overview of gradient descent optimization algorithms. CoRR, 2016.

    [20] B. Widrow and M. E. Hoff. Adaptive switching circuits. 1960 IRE WESCON Convention Record,

    1960.

    [21] C. De Sa, K. Olukotun, and C. Ré. Global convergence of stochastic gradient descent for some

    non-convex matrix problems. arXiv preprint arXiv:1411.1134, 2014.

    [22] P. Jain, S. M. Kakade, R. Kidambi, P. Netrapalli, and A. Sidford. Accelerating Stochastic Gradient

    Descent. ArXiv e-prints, 2017.

    [23] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. On optimization methods for

    deep learning. In Proceedings of the 28th International Conference on International Conference

    on Machine Learning, 2011.

    [24] S.-Y. Zhao and W.-J. Li. Fast asynchronous parallel stochastic gradient descent: A lock-free

    approach with convergence guarantee. In AAAI, 2016.

    [25] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew. Deep learning for visual understand-

    ing: A review. Neurocomputing, 2016.

    [26] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics).

    Springer-Verlag, 2006.

    [27] Y. Lecun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.

    [28] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of

    the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop, 1998.

[29] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. In Neurocomputing: Foundations of Research, pages 696–699. MIT Press, Cambridge, MA, USA, 1988. URL http://dl.acm.org/citation.cfm?id=65669.104451.

    [30] A. Prieto, B. Prieto, E. M. Ortigosa, E. Ros, F. Pelayo, J. Ortega, and I. Rojas. Neural networks:

    An overview of early research, current frameworks and new challenges. Neurocomputing, pages

    242–268, 2016.

    [31] G. E. Dahl, T. N. Sainath, and G. E. Hinton. Improving deep neural networks for lvcsr using

    rectified linear units and dropout. 2013 IEEE International Conference on Acoustics, Speech and

    Signal Processing, 2013.

[32] N. Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 1999.

    [33] Y. Lecun. Generalization and network design strategies. Elsevier, 1989.

    [34] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern

    recognition unaffected by shift in position. Biological Cybernetics, 1980.

    [35] W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective receptive field in deep

    convolutional neural networks. In Advances in Neural Information Processing Systems 29. Curran

    Associates, Inc., 2016.

    [36] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional

    neural networks. In Advances in Neural Information Processing Systems 25. Curran Associates,

    Inc., 2012.

[37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. CVPR, 2015.

    [38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recogni-

    tion. arXiv preprint arXiv:1409.1556, 2014.

    [39] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional

    networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

    [40] S. Wu, S. Zhong, and Y. Liu. Deep residual learning for image steganalysis. Multimedia Tools and

    Applications, 2017.

    [41] N. A. Hamid, N. M. Nawi, R. Ghazali, and M. N. M. Salleh. Solving local minima problem in back

    propagation algorithm using adaptive gain, adaptive momentum and adaptive learning rate on

    classification problems. In International Journal of Modern Physics: Conference Series, 2012.

    [42] J. Schmidhuber. Deep learning in neural networks: An overview. Neural networks, 2015.

    [43] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks.

    In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics,

    2010.

    [44] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In Proceedings of the

    Fourteenth International Conference on Artificial Intelligence and Statistics, 2011.

    [45] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In

    Proceedings of the 30th International Conference on International Conference on Machine Learn-

    ing - Volume 28, 2013.

    [46] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classi-

    fication. In Proceedings of the 25th ieee conference on computer vision and pattern recognition

    (cvpr 2012), 2012.

[47] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for

    visual recognition. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision –

    ECCV 2014, 2014.

    [48] E. A. Smirnov, D. M. Timoshenko, and S. N. Andrianov. Comparison of regularization methods for

    imagenet classification with deep convolutional neural networks. Aasri Procedia, 2014.

[49] Y. Bengio et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.

    [50] M. Mahsereci, L. Balles, C. Lassner, and P. Hennig. Early stopping without a validation set. CoRR,

    2017.

    [51] L. Breiman. Bagging predictors. Mach. Learn., 1996.

    [52] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural

    networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

    [53] W. Sun and F. Su. Regularization of deep neural networks using a novel companion objective

    function. In Image Processing (ICIP), 2015 IEEE International Conference on, 2015.

    [54] P. Baldi and P. J. Sadowski. Understanding dropout. In Advances in Neural Information Processing

    Systems 26. Curran Associates, Inc., 2013.

    [55] J. Ba and B. Frey. Adaptive dropout for training deep neural networks. In Advances in Neural

    Information Processing Systems 26. Curran Associates, Inc., 2013.

    [56] S. Wang and C. Manning. Fast dropout training. In Proceedings of the 30th International Confer-

    ence on Machine Learning, 2013.

    [57] D. A. McAllester. A pac-bayesian tutorial with A dropout bound. CoRR, 2013.

    [58] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple

    way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.

    [59] S. Wager, S. Wang, and P. S. Liang. Dropout training as adaptive regularization. In Advances in

    Neural Information Processing