

Convolutional Neural Networks for Object Detection: A visual

cortex perspective

Ioannis Chalkiadakis

December 19, 2016

Abstract

The use of deep neural network architectures has seen an enormous increase in recent years, which has been facilitated by improvements in training algorithms and hardware breakthroughs. However, not much is yet known about how exactly such networks operate, and why they often constitute the state of the art in a task. One of the most intriguing and least explained architectures of neural networks is the Convolutional Neural Network, which has been particularly successful in image-related tasks. In this work we aspire to gain an intuition into the operation of the Convolutional Neural Network. We do so by examining it in the context of object recognition, and by relating its function to the human visual cortex.

We hope that this intuition will be of value to the research community by providing some initial guidelines and principles for the design of a Convolutional Neural Network.

1 Introduction

Detection of an object of interest is one of the major tasks of computer vision, and has attracted a lot of attention from the research community. Recent work and advances in convolutional neural networks (CNN) and deep learning have led to remarkable results and detection scores on common object detection datasets. However, much controversy still exists regarding the use of convolutional neural networks and the way they operate on the data. This project aspires to shed some light onto this special type of neural network architecture, and to develop some intuition into why it achieves such good results.

We will start by formulating the object detection problem and presenting some common current approaches to tackling it. We will then proceed to introduce neural networks, and specifically convolutional neural networks, as well as give a brief overview of the human visual system. We will finally try to connect the presented knowledge, in an attempt to give some insight into the way convolutional neural networks work, and draw useful conclusions as to how one can practically use that insight to design better CNN systems.

2 Object detection

2.1 Problem definition

Object detection deals with finding and identifying semantic objects of a particular class in a static image or in a video sequence [1].

The number of object detection applications and the broad scope of the application areas have made the computer vision research community turn its attention towards discovering accurate, fast and efficient object detection methods. Common uses of object detection include:


• Assistive and security applications involving face or vehicle detection

• Industrial applications and quality control

• Object tracking and/or counting

• Optical character recognition

• Robot localization

The challenges of the object detection task are mostly due to the high variability both of the input image and of the object class itself. Changes in illumination and camera position, as well as peculiarities of the scene depicted in the image (clutter, partial view of the object, shadows), make it difficult to detect the object and acquire enough characteristics to identify its class. Furthermore, an object detection system should be invariant to the intra-class variability of the objects it should recognize; an object should be identified regardless of its position and orientation in the image, in any of the forms it could possibly take (for instance, a bike should be identified irrespective of its color, frame, size, or viewing angle).

2.2 Current approaches

In this section we will provide a brief introduction to the general categories into which we could divide object detection methods:

• appearance-based methods

• feature-based methods

• part-based models

• neural networks

2.2.1 Appearance-based methods

Appearance-based methods in essence create a small database of templates (different examples of the object), and find the template that best matches the object in question [2]. The templates describe different appearance features of the object (edges, color, homogeneous regions), as they vary under different viewing, scale or illumination conditions.

Matching based on edges could rely on edge overlap between the template and the query image, or on closeness of edges. Region-growing matching methods are also commonly used, where, starting from seed points, one connects neighboring pixels that have a degree of homogeneity with respect to color, shape, texture or the image gradient. Furthermore, efficient matching can be performed by storing the eigenfaces of the template images, i.e. the eigenvectors of the template set.

2.2.2 Feature-based methods

Feature-based methods store features of an example image and, at detection time, try to match them with features extracted from the query image. Extracted features include class-specific features, invariant features such as area, elongation, compactness or higher-order image moments, as well as those derived from the SIFT ([3]) and SURF ([4]) transforms. Furthermore, bag-of-words representations are often used, where one creates a codebook based on groups of features and represents the image as a histogram of the codewords.


2.2.3 Part-based models

This category of object models employs algorithms that try to detect an object of interest through sub-parts of the image.

One example of such a model is the constellation model ([5], [6]), which seeks to detect features or distinct parts of the object class, and then draw a conclusion about the object as a whole. For instance, if the object of interest is a face, a system based on the constellation model would try to identify the eyes, the nose, the mouth etc. in order to detect the whole face.

A different part-based model that has gained popularity is the deformable parts model ([7]), which hypothesizes the location of a low-resolution, root object template in various parts of the image and tries to find the best match. It then tries to detect local properties of the object by matching several high-resolution templates of parts of the object in the region of the root template.

2.2.4 Neural Networks

Object detection with neural networks can be formulated both as a classification and as a regression problem. The network is trained on a vast dataset of images, which contain various poses of the desired object under multiple viewing conditions. Depending on the task, the output of the network will be the class of the recognized object (classification), the coordinates of a bounding box containing the object (regression), or both. The following section provides more information about the (convolutional) neural network approach to object detection.

3 Convolutional Neural Networks

3.1 Neural Networks

Neural networks in general can be thought of as computational machines, consisting of multiple sets (layers) of computational units (neurons), that are able to approximate any function. They typically achieve this by receiving (input, label) pairs and undergoing a training procedure, during which the parameters of the network are tuned according to a training criterion. Different types of neural networks can be constructed by altering the computation layers and the interconnections between them.

The most basic approach to designing a neural network is to connect all neurons of a layer with all neurons of the previous layer, that is, to build a fully connected network. This is the first architecture that was proposed, and it has been extensively researched; however, it is not the most suitable for tasks involving image processing, which is our focus. First, given that images are large and consist of a huge number of pixels, such a network would need a large number of parameters in order to fulfil the required task. This requirement would not only make training the network very computationally demanding, but would also require vast amounts of training data. The latter implies a huge effort in annotating the dataset, i.e. assigning labels to the input examples, which still has to be done manually for many tasks. Furthermore, networks without any special structure, such as the fully connected one mentioned earlier, fail to take advantage of the topology of the image; images have a strong local structure, which means that spatially close pixels are also highly correlated. Taking advantage of the local structure of the image could provide the model with invariance to translation, rotation and scaling distortions. Such information is not trivial to extract with some form of preprocessing of the data before feeding them into the model, and consequently it is a great benefit if the network can extract such information independently during training.
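To make the parameter argument concrete, a quick back-of-the-envelope comparison may help; the image size, layer width and kernel size below are hypothetical values chosen only for illustration:

```python
# Hypothetical sizes, chosen only for illustration.
height, width, hidden_units = 224, 224, 1000

# Fully connected: every hidden unit connects to every input pixel.
fc_params = height * width * hidden_units        # weight count, biases ignored

# Convolutional: one shared 5x5 weight matrix per plane (feature map).
kernel_size, feature_maps = 5, 64
conv_params = kernel_size * kernel_size * feature_maps

print(fc_params)    # 50176000
print(conv_params)  # 1600
```

Even this rough count shows a difference of more than four orders of magnitude in favor of weight sharing.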

3

Page 4: Convolutional Neural Networks for Object Detection: A visual …ic14/IChalkiadakis_ASR.pdf · 2018-01-09 · Convolutional Neural Networks for Object Detection: A visual cortex perspective

On the other hand, convolutional neural networks employ a certain type of architecture which allows them to be very successful in vision tasks. The basic features of their architecture can be summarized as follows:

• Locality of receptive fields. As mentioned above, neural networks are organized in layers of computation, each stacked on top of the previous one. The receptive field is the area of a layer that provides the input to a neuron in the next layer. In fully connected architectures all neurons of a layer are connected to each neuron in the next; in convolutional networks, however, each neuron receives its input from a small subset of the neurons of the previous layer, i.e. it has a local receptive field.

• Sharing weights. Each convolution layer of a CNN comprises planes (sets) of units which share the same weight matrix. The units on each plane, however, have different receptive fields that span the whole image, which means that the output of each plane is in essence the result of the convolution of the weight matrix with the whole image.

• Sub-sampling. The output of the aforementioned planes is subsequently subsampled, by means of an extra operation such as pooling (e.g. max-pooling) or averaging.

• Feature hierarchy. The multiple layers of computation of the CNN ensure that image features will be extracted in a hierarchical fashion, starting from low-level features, such as edges, in the first layers, up to higher-level, class-specific features in the higher levels of the network.

3.2 Architecture of a CNN

The architecture of a convolutional neural network can be seen in figure 1. The network depicted has two hidden layers (layer 2 and part of layer 1 in the image), and the sizes of the layers and weight matrices (convolution kernels) are indicative. Notice that it is convenient to think of a layer as the combination of a convolution component, a non-linearity (e.g. a rectifying unit) and a subsampling component, although the three are separate and each has its own training parameters. Mathematically, the operation of a convolutional layer of the network can be expressed as follows ([8]):

y[r, c] = ∑_u ∑_v x[u, v] w[r + u, c + v]    (1)
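As a sketch, equation (1) can be implemented directly; the version below is the equivalent form in which the input window is shifted instead of the kernel, and the array sizes are arbitrary:

```python
import numpy as np

def conv_layer(x, w):
    """Valid-mode version of equation (1): each output y[r, c] is a sum,
    over the kernel support, of input values times shared weights."""
    kh, kw = w.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    y = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            y[r, c] = np.sum(x[r:r + kh, c:c + kw] * w)
    return y

x = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
w = np.array([[1., 0.],
              [0., 1.]])      # picks up the main diagonal of each 2x2 window
print(conv_layer(x, w))
# [[ 6.  8.]
#  [12. 14.]]
```

Because the same weight matrix w is applied at every position, the weight-sharing property described above is automatic.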

Regarding the sub-sampling layer, it is constructed by introducing a non-linear operation such as max-pooling, where each neuron's output is the maximum response of its receptive field. By introducing sub-sampling, we manage to keep the number of connections low, and at the same time reduce the resolution of the feature map and consequently reduce sensitivity to distortions of the input.
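A minimal sketch of non-overlapping max-pooling; the 2x2 window size is a common choice, assumed here for illustration:

```python
import numpy as np

def max_pool(x, size=2):
    """Non-overlapping max-pooling: each output is the maximum
    response within a size x size receptive field."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]          # crop to a multiple of size
    return x.reshape(x.shape[0] // size, size,
                     x.shape[1] // size, size).max(axis=(1, 3))

fmap = np.array([[1., 5., 2., 0.],
                 [3., 4., 1., 1.],
                 [0., 0., 9., 2.],
                 [1., 2., 3., 4.]])
print(max_pool(fmap))
# [[5. 2.]
#  [2. 9.]]
```

Note how the output resolution is halved in each dimension while the strongest responses survive.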

After the initial convolutional layers, responsible for the feature extraction in the network, we introduce a number of fully connected layers which fuse the features and forward the complete representation to the classifier or regressor layer. In the case of a classifying network, the training criterion is the categorical cross-entropy function, whereas in the case of a regression task, the mean square error between the target and the network's output is computed. The training follows the common back-propagation algorithm ([9]).
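The two training criteria can be sketched as follows; the toy probabilities and bounding-box coordinates are made up purely for illustration:

```python
import math

def categorical_cross_entropy(p, target_class):
    """p: predicted class probabilities; target_class: index of the true class."""
    return -math.log(p[target_class])

def mean_square_error(pred, target):
    """pred/target: e.g. bounding-box coordinates (x, y, w, h)."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)

print(categorical_cross_entropy([0.1, 0.7, 0.2], 1))        # ~0.357
print(mean_square_error([10, 10, 50, 80], [12, 8, 50, 80])) # 2.0
```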

3.3 CNN in Object Recognition

Although there are a number of successful approaches to object recognition using CNNs, two are particularly important for the goal of the current project. In this section we will give a brief introduction to both and present the main notion behind them.


Figure 1: Main components of a convolutional neural network.

3.3.1 CNN and region proposals

The main idea behind this approach is to use a region-proposal algorithm to hypothesize the object's locations in the image, and then use the convolutional network to extract features for each region [10].

Regarding the region proposal algorithm [11], there are a number of options available: objectness ([12]), selective search ([13]), constrained parametric min-cuts ([14]), and even methods employing a different CNN to propose candidate regions ([15]). In the original paper referred to above, the authors use selective search to identify potential regions containing the object, and from each proposal they extract a 4096-dimensional feature vector, which is used to tune a linear SVM per class. Training is divided in two phases: a supervised pre-training phase, where the network is trained discriminatively on a large complementary dataset with image-level labels, i.e. whether or not the object of interest is present in the image; and a fine-tuning phase using a domain-specific dataset, where the positive or negative decision about the object is taken according to the overlap of the region proposal with a ground-truth bounding box.
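The overlap criterion for labeling proposals can be sketched as an intersection-over-union computation; the corner-coordinate box format and the 0.5 threshold mentioned below are common conventions assumed for illustration, not details taken from the paper:

```python
def overlap(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union else 0.0

print(overlap((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```

A proposal would then be labeled positive when its overlap with a ground-truth box exceeds a threshold such as 0.5.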

Using the method presented above, the authors achieved a 53.5% mean average precision (mAP) on the PASCAL VOC 2012 dataset (compared to 35.1% using a bag-of-words model with region proposals, and to 33.4% achieved by the deformable parts model).

3.3.2 Weakly-supervised CNN

The second most notable approach is a weakly-supervised approach which manages to detect and approximately locate the object, although not its extent, i.e. how much of the image it covers.

In their work ([16]), Oquab et al. introduce three changes to the successful CNN architecture:

• Handle the fully connected layers as convolutions, that is, as convolutional layers with kernel size (size of the weight matrix) equal to the input size

• Employ a max-pooling operation to hypothesize the possible location of the object in the image

• Use a weakly-supervised strategy for training, i.e. use only image-level labels to train the network
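The first change rests on the fact that a fully connected unit is equivalent to a convolution whose kernel is exactly as large as its input, so the "valid" output is a single value. A toy check, with arbitrary sizes and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4))   # feature map entering the layer
w = rng.standard_normal((4, 4))   # one fully connected unit's weights

# Fully connected unit: dot product over the flattened input.
fc_out = float(x.ravel() @ w.ravel())

# The same unit as a convolution with kernel size equal to the
# input size: the valid-mode output is a single number.
conv_out = float(np.sum(x * w))

print(np.isclose(fc_out, conv_out))  # True
```

Treating the layer as a convolution lets the same weights slide over larger inputs, producing a spatial map of class scores rather than a single score.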


Figure 2: Human visual system

The weakly-supervised model, although it is essentially trained for classification, achieves a detection mean average precision of 74.5% on PASCAL VOC 2012, compared to 74.8% for a Region-CNN approach (modified for a fair comparison), which is designed specifically for object detection.

4 Human vision system

The following sections attempt to give a short overview of the human visual system and its basic features. Figure 2 shows the flow of information in the brain: visual stimuli coming in from the eye go through the retina and the Lateral Geniculate Nucleus (LGN) to the primary visual cortex, and then follow two different pathways to the higher areas of the visual cortex.

4.1 Primary Visual Cortex - V1

The primary visual cortex (V1) consists of a vast number of neurons (≈ 140 million in each V1) which have small receptive fields. In V1 we can identify two types of cells: simple and complex cells.

The simple cells receive their input from the LGN cells, and their function can be described by a linear filter followed by a non-linearity. They are considered to act as edge detectors, and some researchers simulate their operation with Gabor filters. Furthermore, simple cells show selectivity to the orientation and the exact spatial location of the stimulus.
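A minimal Gabor filter of the kind used in such simulations might look as follows; the parameter values are arbitrary defaults for illustration, not estimates of real V1 tuning:

```python
import numpy as np

def gabor(size=9, wavelength=4.0, theta=0.0, sigma=2.0):
    """A simple Gabor filter: a sinusoid windowed by a Gaussian,
    selective to orientation theta and spatial frequency 1/wavelength."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate to orientation theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + yr ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / wavelength)
    return envelope * carrier

g = gabor()
print(g.shape)  # (9, 9)
```

Convolving an image with a bank of such filters at several orientations gives an edge-detector-like response pattern, which is roughly the behavior attributed to simple cells.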

The complex cells have a larger receptive field than simple cells, and are sensitive to motion with a certain velocity and orientation. Hubel and Wiesel ([17]) suggest that complex cells receive their input through a pooling operation on the simple cells.

Moreover, the representation of the stimulus learnt in V1 is overcomplete and retinotopic: there are more cells in V1 than in the LGN (hence overcomplete), and neighboring cells have neighboring receptive fields (retinotopic).

After V1 we can distinguish two different flows of information:

• the ventral stream or ‘what’ pathway, which is responsible for form recognition and representation of objects

• the dorsal stream or ‘where/how’ pathway, which is responsible for locating the object and controlling the motion of the eyes based on visual input


4.2 Higher-level Cortex

The precise organization and function of the higher-level cortex areas are not completely known; however, we can roughly identify the following levels:

• V2, which is important for visual memory, and responds to the orientation, color and spatial frequency of the stimulus

• V4, which also responds to orientation, color and spatial frequency, but also identifies features of intermediate complexity, such as geometric shapes

• V5, which is tuned for responding to complex features, for instance line ends or corners

Regarding the representation of the stimulus extracted in the higher-level cortex, it is sparse and distributed, possibly to minimize the consumption of energy in the brain.

5 Convolutional Neural Networks under a visual cortex perspective

Having seen the fundamentals of convolutional networks and of the human visual system, we can now acquire some insight into how to draw correspondences between the two, and how we could use that knowledge in the design of a convolutional network.

We saw that V1 cells are retinotopic, have a small receptive field and act as low-level feature extractors. Correspondingly, in the first layers of a CNN each neuron is connected to a small subset of neurons in the previous layer (locality of the receptive field), and these layers are responsible for extracting basic features such as edges. Furthermore, by taking advantage of the convolution operation, we manage to achieve invariance to shift distortions in the network.

A form of the sub-sampling operation is also present in the visual cortex. We have already seen that complex cells receive their input by pooling neighborhoods of simple cells. According to the work presented in [18], max-pooling is present in higher-level areas of the cortex as well. Experiments suggest that the Inferior Temporal (IT) cortex works like a max-pooling layer, since its response to two simultaneous stimuli is dominated by the stimulus that causes the bigger response when the same stimuli are presented one at a time. A max-pooling operation is more robust to clutter and to the size of the object, especially compared to a summing operation, which was also assumed for the IT cortex; as a result, it is considered especially important for object detection.
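The distinction between the two accounts can be illustrated with made-up response values:

```python
# Hypothetical responses to each stimulus presented alone.
resp_a, resp_b = 0.9, 0.3

# Max-pooling account: the joint response is dominated by the
# stronger individual response, matching the IT experiments.
max_model = max(resp_a, resp_b)   # 0.9

# Summing account: the joint response exceeds either individual
# response, which the experiments argue against.
sum_model = resp_a + resp_b       # 1.2

print(max_model, sum_model)
```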

Furthermore, spatial filtering is also applied in the retina, perhaps because the throughput of the neurons is not high enough to process the total amount of visual information.

In addition, the distributed representation and feature hierarchy present in the cortex are also maintained in a convolutional network. Zeiler and Fergus [19] demonstrate the feature hierarchy in the CNN (figure 3): the first one or two layers capture low-level features, e.g. edges, while preserving a fuzzier representation of the input ([20]), and as we move up the hierarchy more complex and abstract features are identified: texture (layer three), object class-specific variations (layer four), and pose variations or objects as a whole (layer five). Regarding the distribution of information, it is surprising to see ([21]) that the pattern of activity of the units is more important than their actual values, and that more information is carried by the top five activation units beyond the first than by the dominant unit which defines the classification decision.


5.1 Designing a CNN

We use the term design of a convolutional network to refer to the choice of the number of layers, the succession of the types of layers, the choice of non-linearities applied, and the size of the convolution kernel. All these parameters are certainly dataset-dependent; however, we can use some of the theory presented above to make our decisions more guided, especially for object detection applications.

For instance, given that the first layers capture low-level features, they should receive more tuning when we need to process a dataset in which edges are crucial for recognition.

The depth of the network should also be decided according to the complexity and variability of the objects present in the dataset. If an object exhibits strong variability, a more abstract representation is required in order to capture features that are invariant to the object's distortions.

Regarding the fully connected layers towards the end of the CNN, one should consider what is necessary for the task: do we need to simply fuse the extracted features and feed them into the classifier, or do we need to fuse features for sub-regions of the image and take local decisions, in which case fully connected layers implemented as convolutions should be employed? Furthermore, the width of the fully connected layers is not as crucial as the overall depth of the network.

Finally, the choice of the end layer should be decided according to the task: is it identifying an object's class or localizing the object? If it is the latter, do we need to know its extent, in which case the training labels should be the coordinates of the bounding box of the object, or just an approximation of its location in the image, in which case we could rely on image-level labels?

Lastly, perhaps the most difficult, and thus most trial-requiring, parameter to choose is the size of the convolution kernel (the weight matrix). To do so, one would need to take into account the noise introduced by any type of filter, whose amount is proportional to the filter size, as well as computational considerations, given that convolution dominates the training procedure of a CNN.


Figure 3: Feature hierarchy in the CNN.


A Lua - Torch

During the progress of the current work, an effort was made to implement the weakly-supervised architecture ([16]) in the Lua programming language. Lua ([22]) was developed at the Pontifical Catholic University of Rio de Janeiro. It is a scripting language supporting many different programming paradigms, such as object-oriented, functional and procedural programming. It is fast and has a simple syntax, which makes it suitable for rapid prototyping.

Lua is particularly suited for machine learning applications, given its strong support of machine learning libraries, which are used in leading research labs such as Facebook, Google, Twitter, NYU and IDIAP. The framework used in the current project is Torch ([23]), which allows for easy construction of neural networks and seamless execution of code on both CPU and GPU. Torch has an interactive MATLAB-like environment, and creating a network can be done easily by creating a container (e.g. nn.Sequential) and then adding the appropriate layers.


References

[1] Yali Amit and Pedro Felzenszwalb, “Object detection.”

[2] Peter M. Roth and Martin Winter, “Survey of appearance-based methods for object recognition.”

[3] David G. Lowe, “Distinctive image features from scale-invariant keypoints,” Int. J. Comput. Vision, vol. 60, no. 2, pp. 91–110, Nov. 2004.

[4] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool, “Speeded-up robust features (SURF),” Comput. Vis. Image Underst., vol. 110, no. 3, pp. 346–359, June 2008.

[5] Dana Sharon and Michiel van de Panne, “Constellation models for sketch recognition,” in Eurographics Workshop on Sketch-Based Interfaces and Modeling, Thomas Stahovich and Mario Costa Sousa, Eds., The Eurographics Association, 2006.

[6] Markus Weber, “Unsupervised learning of models for object recognition.”

[7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.

[8] Abhineet Saxena, “Convolutional neural networks: An illustrated explanation.”

[9] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, “Neurocomputing: Foundations of research,” chapter Learning Representations by Back-propagating Errors, pp. 696–699, MIT Press, Cambridge, MA, USA, 1988.

[10] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Computer Vision and Pattern Recognition, 2014.

[11] Chunhui Gu, J. J. Lim, P. Arbelaez, and J. Malik, “Recognition using regions,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 1030–1037.

[12] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari, “Measuring the objectness of image windows,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2189–2202, Nov. 2012.

[13] J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, and A.W.M. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision, 2013.

[14] Joao Carreira et al., “Constrained parametric min-cuts for automatic object segmentation,” 2010.

[15] Dan C. Ciresan, Alessandro Giusti, Luca M. Gambardella, and Jurgen Schmidhuber, Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks, pp. 411–418, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013.

[16] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free? Weakly-supervised learning with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.


[17] David H. Hubel and Torsten N. Wiesel, “Receptive fields and functional architecture of monkey striate cortex,” Journal of Physiology (London), vol. 195, pp. 215–243, 1968.

[18] Maximilian Riesenhuber and Tomaso Poggio, “Hierarchical models of object recognition in cortex,” Nature Neuroscience, vol. 2, pp. 1019–1025, 1999.

[19] Matthew D. Zeiler and Rob Fergus, “Visualizing and understanding convolutional networks,” CoRR, vol. abs/1311.2901, 2013.

[20] Aravindh Mahendran and Andrea Vedaldi, “Understanding deep image representations by inverting them,” CoRR, vol. abs/1412.0035, 2014.

[21] Alexey Dosovitskiy and Thomas Brox, “Inverting convolutional networks with convolutional networks,” CoRR, vol. abs/1506.02753, 2015.

[22] Pontifical Catholic University of Rio de Janeiro, “The Lua programming language.”

[23] “Torch, a scientific computing framework for LuaJIT.”
