CNN in Image Processing: Classification, Detection and Retrieval
Ali Ahmadi
18th October, 2017
K.N.Toosi University of Technology
1
Definitions of Deep Learning
Classes of Deep Learning Networks
Various architectures in Deep Convolutional Neural
Networks
RCNN approach to object detection
DNN-based regression approach to object detection
Our proposed CBIR using a deep CNN (GoogLeNet)
2
Deep Structured Learning,
or Hierarchical Learning,
or Deep Machine Learning,
or, most commonly, Deep Learning.
Since 2006, it has emerged as a new area of
machine learning research.
3
Deep Learning is a branch of Machine Learning based on a set of
algorithms that attempt to model high-level abstractions in data by
using a deep graph with multiple processing layers, composed of
multiple linear and non-linear transformations.
A sub-field within machine learning that is based on algorithms for
learning multiple levels of representation in order to model
complex relationships among data.
◦ Higher-level features and concepts are defined in terms of lower-level
ones, and such a hierarchy of features is called a deep architecture.
◦ Most of these models are based on unsupervised learning of
representations.
4
It typically uses artificial neural networks.
Higher-level concepts are defined from lower-level
ones.
The same lower-level concepts can be reused to help define
different higher-level concepts.
5
Drastically increased chip processing power
Significantly increased size of data used for
training
Recent advances in machine learning and
signal/information processing research
6
Image: object recognition, object detection, image de-noising
Audio: speech recognition, music retrieval
Text: parsing, sentiment analysis, machine translation
7
Deep Neural Networks have recently shown great performance
on image classification.
We take another step and present some proposed methods for
object detection and semantic segmentation which can be
used for CBIR.
We propose an approach based on a well-known deep CNN
architecture, GoogLeNet.
8
Classes of Deep Learning Networks:
◦ Deep networks for supervised learning
◦ Deep networks for unsupervised or generative learning
◦ Hybrid deep networks
Also called Discriminative Deep Networks.
Target label data are always available in direct or
indirect forms for such supervised learning.
Examples:
◦ Deep Neural Network (DNN)
◦ Convolutional Neural Network (CNN)
12
Also called Generative Deep Networks.
Used when no information about target class labels is available.
Capture high-order correlations of the observed or visible data
for pattern analysis.
Examples:
◦ Restricted Boltzmann Machines (RBM)
◦ Deep Boltzmann Machines (DBM)
◦ Deep Belief Networks (DBN)
13
Deep architecture that either comprises or makes use of both
generative and discriminative model components.
◦ This can be accomplished by better optimization and/or
regularization of supervised deep networks.
◦ The generative component is mostly exploited to help with
discrimination, which is the final goal of the hybrid architecture.
14
Deep Boltzmann Machine (DBM)
Deep Belief Networks
Deep Neural Networks
AutoEncoders
Convolutional Deep Neural Networks
15
This architecture allows CNNs to take advantage of the 2D
structure of input data.
In comparison with other deep architectures, convolutional
neural networks have shown superior results in both image
and speech applications.
They can also be trained with standard back-propagation.
CNNs are easier to train than other regular, deep, feed-forward
neural networks and have many fewer parameters to estimate,
making them a highly attractive architecture to use.
The most recent studies on supervised learning for computer
vision show that the deep CNN architecture is not only
successful for object/image classification but also
for object detection in whole images.
16
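As a concrete illustration of the points above, here is a minimal sketch of a small convolutional network in PyTorch (the framework choice is an assumption of this write-up, not something stated on the slide). It shows the convolution/pooling/fully-connected structure and that the whole model can be trained end-to-end with standard back-propagation; it is not one of the named architectures discussed later.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Illustrative toy CNN: two conv+pool stages followed by a linear classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # exploits the 2D structure of the input
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 input images

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = SmallCNN()
out = model(torch.randn(1, 3, 32, 32))   # trainable with standard back-propagation
```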
Image Search
19
Learning effective feature representations and similarity
measures is critical to the performance of a CBIR system.
Although various techniques have been proposed, this remains
one of the most challenging problems in CBIR, mainly due to
the "semantic gap" that exists between the low-level image
pixels captured by machines and the high-level semantic
concepts perceived by humans.
20
One of the most important advances in machine learning is
known as "deep learning", which attempts to model high-level
abstractions in data by employing deep architectures
composed of multiple non-linear transformations.
We can improve CBIR using the state-of-the-art deep learning
techniques for learning feature representations and similarity
measures.
21
A deep convolutional neural network model pre-trained on a
large-scale dataset can be used directly for feature extraction
in new CBIR tasks and is able to capture high-level semantic
information from the raw pixels.
The features extracted by a pre-trained CNN model may or may
not be better than traditional hand-crafted features, but
with proper feature refining schemes the deep learning feature
representations consistently outperform conventional hand-
crafted features on all datasets.
22
When applied for feature representation in a new
domain, similarity learning can further boost the retrieval
performance of the direct feature output of pre-trained deep
models.
By retraining the deep models with a classification or similarity
learning objective on the new domain, the retrieval
performance can be boosted considerably, much more
than the improvements made by shallow similarity
learning.
23
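A minimal sketch of retraining a pre-trained model on the new domain with a classification objective, as described above. It assumes a torchvision-style GoogLeNet; the class count and optimizer settings are illustrative values, not figures from the slides.

```python
import torch
import torch.nn as nn
from torchvision import models

num_new_classes = 20                             # hypothetical number of categories in the new domain
model = models.googlenet(pretrained=True)        # start from ImageNet-pre-trained weights
model.fc = nn.Linear(1024, num_new_classes)      # replace the 1000-way classifier for the new domain
# (torchvision's GoogLeNet also has auxiliary classifier heads; if used during
#  training they would need the same replacement)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()
# ...followed by a standard supervised training loop over the new-domain images and labels...
```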
The deep learning framework for CBIR includes two stages:
◦ Training a deep learning model from a large collection of training data.
◦ Learning feature representations for CBIR tasks in a new domain by using the trained deep model.
24
Features extracted from the last fully connected layers of a deep
CNN-based model can be used as the feature representations
for any task such as classification, detection, and CBIR.
In CBIR, we do not consider features from the lower
convolutional layers of the network, since the lower layers
lack rich semantic representations.
The features extracted from the last convolutional layer and the
fully connected layers are significant features, and we can make use
of them for training tasks such as object localization,
object detection, and especially image retrieval.
25
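A minimal sketch of using the pooled output of the last convolutional stage of a pre-trained network as a retrieval descriptor. It assumes torchvision's GoogLeNet re-implementation; the preprocessing constants are the usual ImageNet statistics, and none of this is the authors' original Caffe code.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

model = models.googlenet(pretrained=True)
model.fc = nn.Identity()          # drop the 1000-way classifier, keep the pooled conv features
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_feature(path):
    """Return an L2-normalized 1024-d descriptor for the image at `path`."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        feat = model(img)                       # output of the last pooling stage
    return nn.functional.normalize(feat, dim=1)
```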
AlexNet (2012)
ZF Net (2013)
26
SPP (2014)
VGG (2014)
27
GoogLeNet (2014)
28
AlexNet was proposed in the 2012 paper titled "ImageNet Classification with
Deep Convolutional Neural Networks" [13]; this paper has been cited a
total of 6184 times.
AlexNet is a deep convolutional neural network that classifies the 1.2
million high-resolution images of the ImageNet ILSVRC-2010 contest
into 1000 different classes.
ILSVRC: ImageNet Large Scale Visual Recognition Challenge
On the test data, AlexNet achieved top-1 and top-5 error rates of
37.5% and 15.4%, which is considerably better than the previous
state of the art.
AlexNet has 60 million parameters and 650,000 neurons.
AlexNet layers:
◦ Five convolutional layers, max-pooling layers, dropout layers, three
fully connected layers
29
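The 60-million-parameter figure can be checked against torchvision's AlexNet re-implementation (an assumption here, not the authors' original code):

```python
from torchvision import models

alexnet = models.alexnet()                                  # 5 conv layers + 3 fully connected layers
n_params = sum(p.numel() for p in alexnet.parameters())
print(f"{n_params / 1e6:.1f}M parameters")                  # roughly 61M, close to the figure above
```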
Trained the network on ImageNet data, which contained over
15 million annotated images from a total of over 22,000
categories.
Used ReLU for the nonlinearity functions (found to decrease
training time, as ReLUs are several times faster than the
conventional tanh function).
Used data augmentation techniques that consisted of image
translations, horizontal reflections, and patch extractions.
Implemented dropout layers in order to combat the problem of
overfitting to the training data.
Trained the model using batch stochastic gradient descent,
with specific values for momentum and weight decay.
Trained on two GTX 580 GPUs for five to six days.
32
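A sketch of those training ingredients in PyTorch. The crop size, dropout rate, and SGD hyperparameters are the commonly cited AlexNet settings, assumed here rather than taken from these slides.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

model = models.alexnet(num_classes=1000)

# Data augmentation: patch extraction / translations and horizontal reflections
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Dropout in the fully connected layers combats overfitting
# (already built into torchvision's AlexNet classifier)
dropout = nn.Dropout(p=0.5)

# Batch stochastic gradient descent with momentum and weight decay
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
```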
The Zeiler-Fergus Net (ZF Net) was the winner of the
ILSVRC competition in 2013.
ZF Net achieved an 11.7% top-5 error rate.
This architecture was more of a fine-tuning of
the previous AlexNet structure, but it still
developed some very key ideas about
improving performance.
34
As the network grows, we also see a rise in the number of
filters used.
Used ReLUs for their activation functions, cross-entropy loss
for the error function, and trained using batch stochastic
gradient descent.
Trained on a GTX 580 GPU for twelve days.
36
VGG Net was proposed by the Visual Geometry Group,
Department of Engineering Science, University of
Oxford, in ILSVRC 2014.
Its key characteristics are simplicity and depth.
It was not the winner of ILSVRC 2014.
VGG Net achieved a 7.3% top-5 error rate.
37
Worked well on both image classification
and localization tasks.
Built model with Caffe toolbox.
Used ReLU layers after each conv layer
and trained with batch gradient descent.
Trained on 4 Nvidia Titan Black GPUs
for two to three weeks.
40
Two well-known object detection approaches based on deep
convolutional neural networks:
◦ RCNN (Regions with CNN features)
◦ DNN-based regression
43
RCNN includes five stages:
◦ Stage 1: Determining object proposals without considering the category of the image.
◦ Stage 2: Extracting a fixed-length feature vector from each warped proposal using the CNN.
◦ Stage 3: Training a set of linear SVM classifiers.
◦ Stage 4: Ranking the proposals and using Non-Maximum Suppression to get the bounding boxes (see the NMS sketch below).
◦ Stage 5: Using bounding box regression to improve localization
performance.
44
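Stage 4 relies on Non-Maximum Suppression. A minimal NumPy sketch of greedy NMS is given below; the box format (x1, y1, x2, y2) and the IoU threshold are assumed values, not details specified in the slides.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = np.argsort(scores)[::-1]          # box indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of the current top box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]   # discard overlapping, lower-scored boxes
    return keep
```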
DeepID-Net, an RCNN-based method, improves the results of
the RCNN framework by using deformation models of object
parts and multi-stage training.
The stages of DeepID-Net are illustrated in [20]; its main parts are listed below.
46
DeepID-Net consists of four parts:
◦ Part 1: the baseline deep model.
◦ Part 2: the layers with multi-stage training.
◦ Part 3: the layers with variable filter sizes and the def-pooling layer.
◦ Part 4: the deep model for obtaining 1000-class image classification scores.
47
DNN-based regression for object detection formulates detection as a regression
problem that produces object bounding box masks.
Methods that use the DNN-based regression approach for object detection
define a multi-scale inference procedure which is able to produce high-
resolution object detections.
This regression uses the architecture of a deep CNN and changes the last fully
connected layer, or both the last fully connected and the last convolutional layers.
48
Overfeat is another algorithm that uses DNN-based regression for
classification, localization, and detection.
This integrated framework was the winner of the localization task of
ILSVRC 2013 and obtained very competitive results for the detection and
classification tasks.
In this algorithm, a multi-scale, sliding-window approach is efficiently
implemented by DNN-based regression.
Overfeat accumulates bounding boxes in order to increase detection
confidence.
50
Overfeat explores the entire image by densely running the
network at each location and at multiple scales.
This approach yields significantly more views for voting,
which increases robustness and efficiency.
The result of convolving a ConvNet over an image of arbitrary
size is a spatial map of C-dimensional vectors at each scale.
Overfeat uses six scales of input, which result in unpooled layer-5
maps of varying resolution.
51
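A sketch of that multi-scale inference: run the same network on several rescaled copies of the image and collect one spatial map of C-dimensional vectors per scale. It assumes a fully convolutional `net` (so the output is a spatial map); the scale factors are illustrative, not OverFeat's exact six scales.

```python
import torch
import torch.nn.functional as F

def multi_scale_maps(net, image, scales=(0.5, 0.75, 1.0, 1.25, 1.5, 2.0)):
    """image: (1, 3, H, W) tensor; returns one (1, C, H_s, W_s) spatial map per scale."""
    maps = []
    for s in scales:
        h, w = int(image.shape[-2] * s), int(image.shape[-1] * s)
        resized = F.interpolate(image, size=(h, w), mode="bilinear", align_corners=False)
        with torch.no_grad():
            maps.append(net(resized))   # spatial map of C-dimensional vectors at this scale
    return maps
```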
We can change Overfeat from a detection task to a
CBIR task by changing the definition of the training-data
labels, changing the last layers in Overfeat, and training
the modified architecture for CBIR.
In this case, similar to Overfeat for detection, we can
get a spatial map of C-dimensional vectors at each
scale and then combine them to perform the CBIR task.
52
Differences:
◦ In the RCNN-based approach, we classify images using shallow methods
such as linear SVMs in order to enhance the classification and reduce
object localization error. In contrast, there is no shallow classifier in the
DNN-based regression approach.
◦ In the RCNN-based approach, the input of the deep CNN is a set of object
proposals, while in the DNN-based regression approach the input of the deep
CNN is the entire image and densely sliding windows are applied to the
image. Using object proposal algorithms in the RCNN-based approach
increases the speed of inference, while using densely sliding windows in the
DNN-based regression approach increases its precision.
53
Similarities:
◦ Both approaches use the features of the pool5 layer and the fully connected
layers for detection and semantic segmentation. They may feed other
networks or classifiers with these features.
◦ Both approaches may modify the last layer and adapt it to the detection
and semantic segmentation tasks. In fact, the classifier layers are
replaced by a regression network and trained to predict object bounding
boxes.
◦ Both approaches make use of multi-scale images in order to increase
detection and semantic segmentation precision. The better aligned the
network window and the object, the stronger the confidence of the
network response.
54
Our approach for CBIR is based on the GoogLeNet [17]
architecture.
In our proposed CBIR, we compare images based on
the features extracted from a pre-trained GoogLeNet.
We choose GoogLeNet to increase the performance of our
proposed CBIR, because it allows us to extract deeper
feature maps.
We consider the output of the last convolutional layer
as the image features and find similar images based on
these feature maps.
55
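Given descriptors like the ones sketched earlier (L2-normalized feature vectors from the last pooling stage), retrieval reduces to ranking by cosine similarity. A minimal sketch, where the database matrix `db_feats` is assumed to be precomputed:

```python
import torch

def retrieve(query_feat, db_feats, top_k=10):
    """query_feat: (1, D) normalized descriptor; db_feats: (N, D) normalized descriptors."""
    sims = db_feats @ query_feat.squeeze(0)                    # cosine similarity, since vectors are unit norm
    scores, idx = torch.topk(sims, k=min(top_k, db_feats.shape[0]))
    return idx.tolist(), scores.tolist()                       # indices and scores of the most similar images
```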
We make use of Caffe to implement our system and to extract the last
convolutional layer's feature maps from the pre-trained
GoogLeNet.
In our proposed CBIR, GoogLeNet receives input
images via a RappidMQ queue cluster. Then the
feature representations extracted from GoogLeNet
are placed in the target queue. We implement our CBIR
on a GeForce GTX 1080 GPU.
56
Stage 1: Read the encoded input images from the RappidMQ
queue.
Stage 2: Decode the images to a structure readable by Caffe.
Stage 3: Feed the images forward through the pre-trained GoogLeNet
in Caffe.
Stage 4: Get the output of the last pooling layer as the feature
vector.
Stage 5: Encode the feature vectors into a proper format for the
queue.
Stage 6: Put the final encoded vectors into the target queue.
(A worker-loop sketch of these stages follows below.)
57
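A high-level sketch of Stages 1–6 as a worker loop. The queue objects are placeholders (the slides name a RappidMQ cluster but not its API), the byte encoding is an assumed convention, and `model` is assumed to return the pooled feature vector, as in the earlier extraction sketch.

```python
import base64
import io

import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def serve(queue_in, queue_out, model):
    """Worker loop over Stages 1-6; queue_in/queue_out are placeholder queue clients."""
    while True:
        msg = queue_in.get()                                     # Stage 1: read an encoded image
        img = Image.open(io.BytesIO(base64.b64decode(msg)))      # Stage 2: decode it for the framework
        x = preprocess(img.convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            feat = model(x)                                      # Stages 3-4: forward pass, pooled feature vector
        payload = base64.b64encode(feat.squeeze(0).numpy().tobytes())
        queue_out.put(payload)                                   # Stages 5-6: encode and put into the target queue
```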
We reviewed some CNN architectures for image classification.
We presented some object detection and semantic segmentation
algorithms based on deep convolutional neural networks, which can
be used for CBIR.
We can change a detection task into a CBIR task by changing the definition
of the training-data labels and the last layers in the deep CNN,
and training the modified architecture for CBIR.
There are two well-known semantic segmentation and object
detection approaches based on deep convolutional neural networks:
the RCNN-based approach and the DNN-based regression approach. These
approaches have some similarities and differences.
In our proposed CBIR, we compare images based on the features
extracted from a pre-trained GoogLeNet. In fact, GoogLeNet receives
input images via a RappidMQ queue cluster, and the feature
representations extracted from GoogLeNet are then placed in the target
queue.
58
[1] Ying Liu, Dengsheng Zhang, Guojun Lu, and Wei-Ying Ma, "A survey of content-based image retrieval
with high-level semantics" Pattern Recogn, 40(1): 262-283, January 2007.
[2] Y. Cao, C. Wang, L. Zhang, and L. Zhang, "Edge index for large scale sketch-based image search", in:
IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 761-768.
[3] J. Xie, Y. Fang, F. Zhu, and E. Wong, "Deepshape: Deep learned shape descriptor for 3d shape matching
and retrieval", in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1275-1283.
[4] F. Wang, L. Kang, and Y. Li, "Sketch-based 3d shape retrieval using convolutional neural networks", in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1875-1883.
[5] S. Bai, X. Bai, Z. Zhou, Z. Zhang, and L. Jan Latecki, "Gift: A real-time and scalable 3d shape search
engine", in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5023-5032.
[6] M. Park, J. S. Jin, and L. S. Wilson, "Fast content-based image retrieval using quasi-gabor filter and reduction of image feature dimension", in: IEEE Southwest Symposium on Image Analysis and Interpretation, IEEE, 2002, pp. 178-182.
[7] X.-Y. Wang, B.-B. Zhang, and H.-Y. Yang, "Content-based image retrieval by integrating color and texture features", Multimedia Tools and Applications (MTA), vol. 68, no. 3, pp. 545-569, 2014.
[8] J. Wang and X.-S. Hua, "Interactive image search by color map", ACM Transactions on Intelligent Systems and Technology (TIST), vol. 3, no. 1, p. 12, 2011.
[9] C. Wengert, M. Douze, and H. Jegou, "Bag-of-colors for improved image search", in: ACM International Conference on Multimedia, ACM, 2011, pp. 1437-1440.
[10] B. Wang, Z. Li, M. Li, and W.-Y. Ma, "Large-scale duplicate detection for web image search", in:
IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2006, pp. 353-356.
59
[11] J. Wan, D. Wang, S. Hoi, et al., "Deep learning for content-based image retrieval: a comprehensive study", in: Proceedings of the ACM International Conference on Multimedia, 2014.
[12] Ji. Wan, D. Wang, S.C.H. HOI, P. Wu, J. Zhu, "Deep learning for content-based image retrieval: A
comprehensive study", in: Proceedings of the 22nd ACM International Conference on Multimedia, 2014.
[13] A. Krizhevsky, I. Sutskever, G.E. Hinton, "Imagenet classification with deep convolutional neural
networks", in: Proceedings of the NIPS, 2012.
[14] M.D. Zeiler, R. Fergus, "Visualizing and understanding convolutional neural networks", in:
Proceedings of the ECCV, 2014.
[15] K. He, X. Zhang, S. Ren, et al., "Spatial pyramid pooling in deep convolutional networks for visual
recognition", in: Proceedings of the ECCV, 2014.
[16] K. Simonyan, A. Zisserman, "Very deep convolutional networks for large-scale image recognition",
in: Proceedings of the ICLR, 2015.
[17] C. Szegedy, W. Liu, Y. Jia, et al., "Going deeper with convolutions", in: Proceedings of the CVPR,
2015.
[18] O. Russakovsky, J. Deng, H. Su, et al., "Imagenet large scale visual recognition challenge", Int. J.
Comput. Vis. 115 (3) (2015) 211-252.
[19] R. Girshick, J. Donahue, T. Darrell, et al., "Rich feature hierarchies for accurate object detection and
semantic segmentation", in: Proceedings of the CVPR, 2014.
[20] W. Ouyang, P. Luo, X. Zeng, et al., "DeepID-Net: multi-stage and deformable deep convolutional
neural networks for object detection", in: Proceedings of the CVPR, 2015.
60
[21] R. Girshick, "Fast R-CNN", in: Proceedings of the ICCV, 2015.
[22] S. Ren, K. He, R. Girshick, et al., "Faster R-CNN: towards real-time object detection with region proposal
networks", in: Proceedings of the NIPS, 2015.
[23] Y. Zhu, R. Salakhutdinov, et al., "segDeepM: exploiting segmentation and context in deep neural networks for
object detection", in: Proceedings of the CVPR, 2015.
[24] S. Gidaris, N. Komodakis, "object detection via a multi-region and semantic segmentation-aware CNN model",
in: Proceedings of the ICCV, 2015.
[25] C. Szegedy, A. Toshev, D. Erhan, "Deep neural networks for object detection", in: Proceedings of the NIPS,
2013.
[26] P. Sermanet, D. Eigen, X. Zhang, et al., "Overfeat: integrated recognition, localization and detection using
convolutional networks", in: Proceedings of the ICLR, 2014.
[27] D. Erhan, C. Szegedy, A. Toshev, et al., "Scalable object detection using deep neural networks", in:
Proceedings of the CVPR, 2014.
[28] B. Alexe, T. Deselaers, V. Ferrari, "Measuring the objectness of image windows", IEEE Trans. Pattern
Anal. Mach. Intell. 34 (11) (2012) 2189-2202.
[29] J.R.R Uijlings, K.E.A van de Sande, T. Gevers, et al., "Selective search for object recognition", Int. J. Comput.
Vis. 104 (2) (2013) 154-171.
[30] I. Endres, D. Hoiem, "Category independent object proposals", in: Proceedings of the ECCV, 2010.
[31] M.M. Cheng, Z. Zhang, W.Y. Lin, et al., "BING: binarized normed gradients for objectness estimation at
300fps", in: Proceedings of the CVPR, 2014.
[32] C.L. Zitnick, P. Dollar, "Edge boxes: locating object proposals from edges", in: Proceedings of the ECCV,
2014.
[33] J. Hosang, R. Benenson, B. Schiele, "How good are detection proposals, really?", in: Proceedings of the
BMVC, 2014.
61
Thank you so much.
Any questions?
For object localization, Overfeat replaces the classifier layers
by a regression network and trains it to predict bounding boxes
at each spatial location and scale.
It then combines the regression predictions together, along
with the classification results at each location.
Overfeat simultaneously runs the classifier and regressor
networks across all locations and scales.
The output of the final softmax layer for a class c at each
location provides a score of confidence that an object of class
c is present in the corresponding field of view. So it is possible
to assign a confidence to each bounding box.
63