
Bachelor of Science in Computer Science
May 2021

A Deep Learning Application for Traffic Sign Classification

Pramod Sai Kondamari
Anudeep Itha

Faculty of Computing, Blekinge Institute of Technology, 371 79 Karlskrona, Sweden

This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfilment of the requirements for the degree of Bachelor of Science in Computer Science. The thesis is equivalent to 10 weeks of full-time studies.

The authors declare that they are the sole authors of this thesis and that they have not used any sources other than those listed in the bibliography and identified as references. They further declare that they have not submitted this thesis at any other institution to obtain a degree.

Contact Information:
Author(s):
Pramod Sai Kondamari
E-mail: [email protected]

Anudeep Itha
E-mail: [email protected]

University advisor:
Shahrooz Abghari
Department of Computer Science

Faculty of Computing, Blekinge Institute of Technology, SE-371 79 Karlskrona, Sweden
Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57

Abstract

Background: Traffic Sign Recognition (TSR) is particularly useful for novice drivers and self-driving cars. Driver Assistance Systems (DAS) involve automatic traffic sign recognition, and efficient classification of traffic signs is required in DAS and unmanned vehicles for safe navigation. Convolutional Neural Networks (CNN) are known for achieving promising results in image classification, which inspired us to employ this technique in our thesis. Computer vision is the process of understanding images and retrieving data from them; OpenCV is an open-source computer vision library used here to detect traffic sign images in real time.
Objectives: This study presents an experiment to build a CNN model that can effectively classify traffic signs in real time using OpenCV. The model is built with a low computational cost. The study also includes an experiment in which various combinations of parameters are tuned to improve the model's performance.
Methods: The experimentation method involves building a CNN model based on a modified LeNet architecture with four convolutional layers, two max-pooling layers and two dense layers. The model is trained and tested with the German Traffic Sign Recognition Benchmark (GTSRB) dataset. Parameter tuning with different combinations of learning rate and epochs is performed to improve the model's performance. This model is then used to classify images introduced to a camera in real time.
Results: Graphs depicting the accuracy and loss of the model before and after parameter tuning are presented. An experiment is conducted to classify traffic sign images introduced to the camera using the CNN model. The high probability scores achieved during the process are presented.
Conclusions: The results show that the proposed model achieved 95% accuracy with an optimal number of epochs, i.e., 30, and the default optimal learning rate, i.e., 0.001. High probabilities, i.e., above 75%, were achieved when the model was tested on new real-time data.

Keywords: Image Processing, Deep Learning Algorithms, Convolutional Neural Network (CNN), OpenCV, Supervised Learning.


Acknowledgments

We are grateful to our supervisor, Shahrooz Abghari, PhD, for mentoring us throughout our thesis. We could not have completed this work without the help of our family and friends, who sincerely believed in our work and gave us the time to carry out this thesis. We want to thank everyone who inspired us to complete it.

Authors:
Pramod Sai Kondamari
Anudeep Itha


Contents

Abstract

Acknowledgments

1 Introduction
  1.1 Aim and Objectives
  1.2 Research Question
  1.3 Overview

2 Background
  2.1 Deep Learning
  2.2 Supervised Learning
  2.3 Neural Networks
  2.4 Convolutional Neural Network
  2.5 Activation Functions
  2.6 Optimization Algorithm
  2.7 Loss Function
  2.8 Computer Vision
  2.9 Image Processing
  2.10 Image Augmentation
  2.11 Performance Evaluation Metrics

3 Related Work

4 Method
  4.1 Experimentation Environment
  4.2 Data Overview
  4.3 Processing The Dataset Images
  4.4 Image Augmentation
  4.5 Building The CNN Model
  4.6 Training and Testing The Model
  4.7 Parameter Tuning
  4.8 Testing The Model In Real-time

5 Results and Analysis
  5.1 Experiment-1: Training The Model
    5.1.1 The model trained with default parameter values
    5.1.2 The model trained after parameter tuning with various combinations
  5.2 Experiment-2: Using The Model In Real-time

6 Discussion

7 Conclusions and Future Work
  7.1 Conclusion
  7.2 Future work

Bibliography

A Model code snippet


List of Figures

2.1 In the figure a convolution filter (Sobel Gx) is applied on each sub-array of the source pixels to obtain the corresponding destination pixel (the figure is borrowed from [2])

4.1 Steps involved in training, testing the proposed model (Phase 1) and testing the model in real-time (Phase 2)

4.2 Sample of the GTSRB dataset
4.3 Sample Pre-processed Image
4.4 Augmented Images
4.5 Our proposed CNN model Architecture
4.6 Distribution of the training dataset

5.1 Model accuracy with default parameter values
5.2 Loss occurred
5.3 Accuracy
5.4 Loss
5.5 Accuracy
5.6 Loss
5.7 Accuracy
5.8 Loss occurred
5.9 Accuracy
5.10 Loss occurred
5.11 Accuracy
5.12 Loss occurred
5.13 Accuracy
5.14 Loss occurred
5.15 Traffic sign classification in real time using webcam

6.1 Graph plot reflecting the model learning patterns with decreasing number of epochs (the figure is borrowed from [3])


List of Tables

4.1 Distribution of images across different datasets
4.2 Architecture of CNN
4.3 Various combinations of parameter tuning

5.1 Probability scores achieved when testing the model with new images in real time

6.1 Accuracy and loss of model for train and test data by tuning various combinations of learning rate and epochs


Chapter 1

Introduction

Traffic signs are indicators placed beside or above roads to provide information about road conditions and directions. They are generally pictorial, using symbols and pictures to convey information to vehicle drivers. Traffic signs fall into various categories according to the information they convey. There are primarily seven kinds of traffic signs [5]: 1) warning signs, 2) priority signs, 3) restrictive signs, 4) mandatory signs, 5) special regulation signs, 6) informative signs, and 7) directional signs and additional panels.

Among these signs, warning signs are the most important. They indicate a potential hazard, obstacle or road condition that the driver may not immediately see. This category includes pedestrian crossing signs, caution signs, crosswalk alerts, traffic light alerts, etc. Priority signs indicate the course a vehicle should take to prevent a collision; this category includes stop signs, intersection signs, one-way traffic signs, etc. Restrictive signs are used to restrict certain manoeuvres and to prohibit certain vehicles on roads; this includes no-entry, no-motorcycle and no-pedestrian signs. Mandatory signs are the opposite of restrictive signs, telling drivers what they must do; permitted directions and vehicle allowances belong to this category. Special regulation and informative signs indicate a regulation or provide information about using something; one-way signs and home zone indications fall into this category. Directional signs provide information about the possible destinations from a location; turn-right and turn-left signs, for instance, come under this category.

In recent years, several attempts have been made to detect traffic signs and inform the vehicle driver. Traffic Sign Recognition recognizes traffic signs and informs the driver about the detected sign for efficient navigation, ensuring safety. A smart real-time automatic traffic sign detection system can support the driver or, in the case of self-driving cars, provide efficient navigation. Automatic traffic sign recognition has practical applications in Driver Assistance Systems [22]. DAS involves automatic traffic sign recognition in which the traffic sign is detected and classified in real time. Traffic sign recognition is also an important module in unmanned driving technology.

Traffic sign recognition systems usually have two phases [19, 32]. The first phase detects the traffic sign in a video sequence or image using image processing. The second phase classifies the detected image using an artificial neural network model. Detecting traffic signs in varied settings and recognizing them reliably remains a challenging problem.



According to the World Health Organization (WHO), enforcing speed limits on traffic helps minimize road accidents [29]. These limits are often neglected by drivers who drive past them. Road accidents can be caused by various factors, such as low-quality traffic signs, driver negligence, or obstacles near the traffic sign that cause visual hindrance. The mentioned causes involve human error. Automated traffic sign detection may minimize human-induced errors and can detect traffic signs effectively.

In recent years, deep learning has proven effective in many fields such as speech recognition, natural language processing and image processing [17]. In particular, deep learning has performed well in visual recognition tasks, achieving better results than humans [18].

In this thesis, we propose a CNN model that is cost-sensitive in terms of computational resources, such as the number of convolution layers. Our model is based on a modified LeNet architecture with four convolution layers that can classify traffic signs. The GTSRB dataset [4] is used to train and test the model. Later, we use this model to classify images introduced to the model via a camera; here, OpenCV is used to detect and extract the traffic sign picture placed in front of the camera.

1.1 Aim and Objectives

This thesis aims to build a cost-sensitive CNN model trained and tested using the GTSRB dataset. The model is then used to classify images placed in front of a camera and detected using OpenCV.

Objectives:

• The first objective is to split the dataset, which contains images, into training and test data.

• To pre-process the images in the dataset using grayscale and histogram equalization methods.

• To build a cost-sensitive CNN model with a modified LeNet model.

• To train and test the model with the corresponding datasets.

• To test the hypothesis of whether the accuracy increases with parameter tuning and draw a suitable conclusion.

• To test the model with the images detected using OpenCV.

1.2 Research Question

RQ1:

Considering the initial parameter setup of the proposed model, to what extent can parameter tuning improve the model's accuracy?

Motivation:


Parameter tuning searches for the combination of parameter values that yields the best accuracy by evaluating all parameter combinations. It needs to be done to obtain better accuracy and minimal loss; an optimal model can be achieved through parameter tuning.

RQ2:

How can a traffic sign image be detected and extracted effectively when introduced to the live camera using OpenCV?

Motivation:

The built model has to be tested in real time. We send an image extracted from the live camera using OpenCV as input to the model for classification. Data pre-processing needs to be done to recognize the image effectively; grayscale conversion and histogram equalization are suitable techniques for the given dataset.

1.3 Overview

This section outlines the structure of the thesis. The background chapter introduces the various concepts used in our work, with a brief explanation of each choice. Related work discusses previous research on neural networks and computer vision for traffic sign recognition. The method chapter describes the procedure of our project, from the experimentation environment to the evaluation metrics used for the model. The results obtained from the method are explained in detail in the results chapter and examined in the discussion chapter. Finally, the thesis ends with conclusions drawn from the results, and future work is briefly discussed.

Chapter 2

Background

This chapter provides a brief explanation of the theoretical background necessary to understand the project report. We discuss the various concepts and tools used in our thesis.

2.1 Deep Learning

Deep learning is a subfield of artificial intelligence. It deals with creating neural network models that can make data-driven decisions by themselves [15]. Deep learning enables systems to learn from experience and understand the world in terms of a hierarchy of concepts [24]. No human intervention is needed to specify the knowledge, as the computer is capable of gathering the information itself. Feature extraction is one of the characteristics that make deep learning special: properties are extracted from the input data provided to the model. In this way, image recognition can be achieved without any explicit coding.

Every image comprises pixels, and these pixels are correlated with their neighbours. Deep learning techniques exploit this by extracting features from small sections of a larger image. Such methods can learn high-level features, so the task of developing a feature extractor for every new problem is eliminated. As a deep learning algorithm has more parameters than a classical machine learning algorithm, it usually takes more time to train a deep learning model.

2.2 Supervised Learning

Supervised learning is one of the common branches of machine learning. It requires learning a mapping between input and output data that is then used to predict the outputs for unseen data [13]. In supervised learning, the model is trained with labelled data, i.e., images with class labels in image classification. During the training phase, features are extracted from the training data and the patterns of the dataset are identified. During the testing phase, test data is sent as input to the model and classified according to the knowledge gained during training. The main aim of supervised learning is to accurately predict the classification of the test data based on the prior knowledge gained by analyzing the training data. Artificial neural networks, Bayesian statistics and analytical learning are a few approaches and algorithms used in supervised learning [8].



2.3 Neural Networks

A neural network is an information processing system, inspired by biological neurons, that implements learning and memory. A neural network built from artificial neurons, with software and hardware components to solve Artificial Intelligence (AI) problems, is called an Artificial Neural Network (ANN). The network consists of several layers, including activation layers, and associated algorithms. In this thesis, we use a Convolutional Neural Network.

2.4 Convolutional Neural Network

Convolutional networks (also known as CNNs or ConvNets) belong to the family of neural networks that can process data with a known grid-like topology [10]; in the case of images, a 2D grid of pixels. A CNN takes an input image and assigns importance, via weights and biases, to various aspects/objects in the image, differentiating one image from another [7]. A CNN simply employs a mathematical operation called convolution instead of general matrix multiplication in its layers. It consists of an input layer and an output layer with multiple convolutional, pooling or fully connected layers in between [27]. The convolution operation can be mathematically expressed as:

s(t) = \int x(a) \, w(t - a) \, da \qquad (2.1)

This can further be simplified to the following:

s(t) = (x * w)(t) \qquad (2.2)

Here x is the input variable, i.e., the image pixels, and w is the kernel or filter applied to the pixels. A kernel is a matrix smaller in size than the image. The operation applies the kernel to the corresponding neighbourhood of image pixels to produce an output called a feature map. The output of one layer is sent as input to the next layer. In this way, convolutions are applied to the image pixels, and the corresponding feature maps are generated, from which the model extracts features.

Convolutional layer

A CNN consists of several convolutional layers used to extract features from the input images. When the input image pixels pass through a convolutional layer, kernels are applied to neighbourhoods of pixels: cross-correlation is performed between the kernel matrix and the image pixels. The weights in the kernel are adjusted during training but remain constant during the calculations of a forward pass. Figure 2.1 shows the application of a kernel to the image pixels. After the convolution mapping, the output is passed to a non-linear activation function to handle the gradient problem. This is called feature mapping.


Figure 2.1: In the figure a convolution filter (Sobel Gx) is applied on each sub-array of the source pixels to obtain the corresponding destination pixel (the figure is borrowed from [2])
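To make the sliding-window operation concrete, here is a minimal NumPy sketch of the cross-correlation described above (our own illustration, not the thesis code; the function name and the random test image are hypothetical):

    import numpy as np

    def feature_map(image, kernel):
        """Valid (no padding) sliding-window cross-correlation of a 2D image."""
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.empty((oh, ow))
        for i in range(oh):
            for j in range(ow):
                # Element-wise product of the kernel and one image patch, summed
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    # Sobel Gx kernel, as in Figure 2.1
    sobel_gx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
    img = np.random.rand(32, 32)             # stand-in for a 32x32 grayscale sign
    print(feature_map(img, sobel_gx).shape)  # (30, 30)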

Pooling

Pooling is a sliding-window technique in which statistical functions are applied to the values within a window. Max pooling is a commonly used variant that takes the maximum value within the window. Pooling performs a downsampling operation; it is mainly used to reduce the number of parameters and to make feature detection more robust.

Fully connected layer

A fully connected or dense layer performs the classification of objects based on the output obtained from the pooling layers. The input to the dense layer must first be flattened into a single N x 1 tensor.

Dropout layer

The dropout layer reduces the overfitting problem by randomly setting inputs to zero with a configurable probability. The idea of dropout is to randomly drop units, along with their connections, during the training of the neural network. Dropout improves the performance of neural networks [35].

2.5 Activation Functions

Activation functions are used in an ANN to convert the input data to output data, which is sent as input to the next layer. A neural network model's prediction accuracy depends on the type of activation function used [31]. If activation functions


were not used, the output would be a simple linear polynomial function, and the model would not be able to learn and recognize complex mappings from the data. The neural network would then not be effective for modelling complicated types of data such as images, speech, text, video, audio, etc. ReLU and Softmax are the two activation functions used in this thesis.

1. ReLU

ReLU stands for rectified linear unit and belongs to the non-linear activation functions widely used in neural networks. The ReLU activation function has made training deep neural networks easier by reducing the burden of weight initialization and vanishing gradients [6]. ReLU is more efficient than other activation functions because only a few neurons are activated at a time rather than all of them. ReLU is a piecewise linear function whose positive part is the identity function and whose negative part is zero [28]. The ReLU function can be mathematically represented as:

f(x) = \max(0, x) \qquad (2.3)

2. Softmax

The softmax activation function is a combination of multiple sigmoid activation functions. The output values of this function lie in the range 0 to 1 and can be treated as per-class probabilities. Unlike the sigmoid function used for binary classification, the softmax function can be used for multi-class classification [6]. Models built using the softmax activation function have an output layer with the same number of neurons as the number of classes. The softmax function can be mathematically expressed as:

\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \ldots, K \qquad (2.4)

Here, an exponential function is applied to each element z_j of the input vector z, and these values are normalized by dividing by the sum of all the exponentials. The components of the output vector \sigma(z) sum to 1.
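As an illustration, a minimal NumPy sketch of both activation functions (our own example, not the thesis code; subtracting the maximum inside the softmax is a standard numerical-stability trick that leaves the result unchanged):

    import numpy as np

    def relu(x):
        # Positive part is the identity, negative part is zero
        return np.maximum(0.0, x)

    def softmax(z):
        # Stable softmax: shift by the max before exponentiating
        e = np.exp(z - np.max(z))
        return e / np.sum(e)

    print(relu(np.array([-3.0, 0.5])))  # [0.  0.5]
    p = softmax(np.array([2.0, 1.0, -1.0]))
    print(p, p.sum())                   # probabilities summing to 1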

2.6 Optimization Algorithm

Optimization algorithms are used to reduce the randomness in neural networks. Using optimization algorithms, losses are reduced by changing the attributes of the neural network; with reduced losses, the accuracy of the results increases.

Adam

Adaptive Moment Estimation (also known as Adam) is a gradient-based optimization method that, like Adadelta and RMSprop, accumulates past squared gradients, and in addition stores a decaying mean of past gradients [9]. This optimization method is straightforward to implement and requires little memory. It is an efficient algorithm for gradient-based optimization of stochastic objective functions.


2.7 Loss Function

The loss function quantifies the failure of the model to predict the correct classification. While training and testing the model, the general goal is to minimize the loss to increase the model's accuracy. The loss function used in our model is categorical cross-entropy. For activations x, where x_i is the activation for class i and y is the index of the correct label, the loss is computed mathematically as:

L(x, y) = -\log \left( \frac{e^{x_y}}{\sum_j e^{x_j}} \right) \qquad (2.5)
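A small NumPy sketch of Equation 2.5 for a single sample (our own illustration; the max-shift is again only for numerical stability):

    import numpy as np

    def categorical_cross_entropy(logits, y):
        """L(x, y) = -log(softmax(x)[y]) for one sample with true class index y."""
        e = np.exp(logits - np.max(logits))
        probs = e / e.sum()
        return float(-np.log(probs[y]))

    # Correct class dominates the logits, so the loss is small
    print(categorical_cross_entropy(np.array([3.0, 0.5, -1.0]), y=0))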

2.8 Computer Vision

Computer vision is the transformation of data from a still or video camera into either a decision or a new representation. It is a process that helps in understanding how images and videos are stored and how to manipulate and retrieve data from them. It is mainly used for detecting and recognizing the required data in a video or image. OpenCV is an open-source computer vision library that is helpful in detecting and extracting images [12].

2.9 Image Processing

Image processing is a method of performing operations on images to enhance them in order to extract useful information. Images are sent as input, and their features or characteristics are extracted. Image processing typically has three steps:

1. Importing the image via image acquisition tools.

2. Processing the images through various processing techniques.

3. Outputting an altered image or a report based on the image analysis.

The image processing techniques used in our thesis are grayscale conversion and histogram equalization, which are briefly explained below.

Grayscale

In digital images, grayscale pixel values represent the intensity of light. Grayscale images display only black and white colours with varying intensities. A gray image is usually represented with 8 bits per pixel, so pixel values range from 0 to 255. To avoid the complexity of processing colour images, all images in the dataset are grayscaled.


Image Enhancement

Image enhancement is a primary step in image processing. The quality of the image is significantly increased by methods such as contrast adjustment, detailing, noise removal and sharpening. Image enhancement is a process of manipulating the pixel intensities of the input image so that its interpretability, or the information contained in it, is improved [20]. The technique used in this thesis is histogram equalization, a common image enhancement technique: the gray levels of the image are remapped based on the probability distribution of the input image, and the image's histogram is flattened and stretched, resulting in contrast enhancement [25].

2.10 Image Augmentation

Image augmentation is a data-space solution to the problem of overfitting in neural networks. Overfitting is a phenomenon in which a performance difference can be noticed between the training and testing of a model; it occurs when the model is trained on an excess of similar data and ends up memorizing it. Image augmentation decreases the validation and training error, reducing the problem of overfitting [33]. It creates artificial images through various processing methods such as zooming, shifting, random rotation and shearing. This is a useful technique to increase the size of the training dataset without acquiring new images.

2.11 Performance Evaluation Metrics

To identify the efficiency of a model, one needs to select proper evaluation metrics. The accuracy score can evaluate the performance of a model trained for multi-class classification. In our thesis, the accuracy score of the model is checked before and after parameter tuning. Given the classwise accuracy A_i for class i, the average accuracy over n classes is calculated as:

A_c = \frac{1}{n} \sum_{i=1}^{n} A_i \qquad (2.6)
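A small NumPy sketch of Equation 2.6 (our own illustration; in practice y_true and y_pred would come from the test labels and the model's predictions):

    import numpy as np

    def average_class_accuracy(y_true, y_pred, n_classes):
        """Mean of the per-class accuracies A_i, as in Equation 2.6."""
        accs = []
        for i in range(n_classes):
            mask = y_true == i
            if mask.any():
                accs.append(np.mean(y_pred[mask] == i))
        return float(np.mean(accs))

    y_true = np.array([0, 0, 1, 1, 2])
    y_pred = np.array([0, 1, 1, 1, 2])
    print(average_class_accuracy(y_true, y_pred, n_classes=3))  # (0.5 + 1 + 1) / 3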

Chapter 3

Related Work

Previously, various classifiers and object-classification algorithms were used for the recognition of traffic signs: support vector machines, random forests, neural networks, deep learning models and decision fusion [5, 23, 38]. Among these, deep learning models have proven effective in many fields, including speech recognition, Natural Language Processing (NLP) and image classification [17]. There are many deep learning classifiers for image classification, such as Convolutional Neural Networks (CNN) and Deep Boltzmann Machines (DBM) [16].

Zhongqin et al. [11] performed traffic sign classification using an improved VGG-16 architecture and attained 99% model accuracy on the German Traffic Sign Recognition Benchmark (GTSRB) [4] dataset. In their proposed model, redundant layers are removed from the VGG-16 architecture, and batch normalization and pooling layers are added to greatly reduce the parameters and improve the model's accuracy.

Shangzheng [30] proposed a fast traffic sign recognition method based on an improved CNN model, combining image processing and machine vision technology with deep learning for target classification. The proposal uses HOG features to detect the traffic signs and an improved CNN model to classify them. The proposed model accurately locates the traffic sign with improved accuracy and a reduced false detection rate.

Tsoi and Wheelus [37] performed TSR with a cost-sensitive deep learning CNN model and attained 86.5% model accuracy. Their study built a model to classify traffic signs with cost-sensitive techniques such as cost-sensitive rejection sampling and cost-sensitive loss functions. Their baseline model consists of 3 convolution layers, 2 max-pooling layers, 3 fully connected layers, 2 dropout layers and a softmax layer.

Shabarinath and Muralidhar [29] propose an optimized CNN architecture based on VGGNet, along with image processing techniques, for traffic sign classification. They utilized the German Traffic Sign Detection Benchmark (GTSDB) for training and testing their model. With Google's TensorFlow running in the background, they achieved 99% validation accuracy and 100% test accuracy on novel input inference. The authors applied pruning and post-training quantization to the trained model to obtain a smaller memory footprint for the CNN.

Dhar et al. [14] performed TSR in real time with image acquisition techniques using colour cues from the HSV colour model. The desired area in the image



is cropped using region properties and shape signature, and the extracted image is classified using a CNN model with two convolutional layers. The authors achieved 97% accuracy on Bangladeshi traffic signs. They also performed the classification using various other classifiers, such as SVM (cubic and quadratic), ANN, decision trees, KNN and ensembles, and achieved the highest accuracy with the CNN classifier.

Islam [21] used the LeNet architecture for the classification of traffic signs. The author proposes two CNN-based models, one for sign classification and the other for shape classification. This model was not built for real-time implementation but can be helpful.

Shopa, Sumitha and Patra [32] presented a study to recognize traffic sign patterns using OpenCV techniques. In their work, they used the HSV algorithm for image detection and a Gaussian filter for image extraction. Image recognition is done by implementing k-NN methods. Image pre-processing, feature extraction and classification are the three modules used to recognize the traffic sign.

In the above-mentioned works, the authors use either CNNs for traffic sign recognition or OpenCV for traffic sign detection and extraction. Our work proposes a CNN model to classify traffic signs and tests this model on images extracted in real time using OpenCV. The proposed CNN model is based on the LeNet model but has 4 convolution layers, 2 pooling layers and 2 dropout layers, followed by an output layer.

Chapter 4

Method

To achieve the aim of the thesis and to address the research questions, we employed the experimentation method. A CNN model based on a modified LeNet architecture was built to conduct the experiments. The GTSRB dataset is used to train and test the model to achieve good accuracy. One experiment runs the model after parameter tuning to find whether it makes a significant change to the model's accuracy. In another experiment, the built model is used to recognize and classify traffic sign images extracted in real time using OpenCV methods. The procedure of this experimentation is as follows:

• Splitting the images into training and test datasets.

• Pre-processing the images in the respective datasets to enhance image quality and interpretability.

• Building a cost-sensitive CNN model to classify the images.

• Tuning the parameters to find how significantly the model's accuracy is improved.

• Utilizing this model to classify images extracted in real-time using OpenCV.

• Presenting the experiment results.

Figure 4.1 summarizes the steps conducted in the experiment to recognize traffic signs. The steps are classified into two phases.

Phase 1: Training and testing the built CNN model with default parameters and then performing parameter tuning.

Phase 2: Testing the model in real-time using OpenCV.



Figure 4.1: Steps involved in training, testing the proposed model (Phase 1) andtesting the model in real-time (Phase 2).

4.1 Experimentation Environment

Python (https://www.python.org/)

Python is an interpreted, high-level, general-purpose, easily readable programming language. It is dynamically typed and garbage-collected, and supports procedural, functional and object-oriented programming. Python has a huge collection of libraries that provide tools suited to many tasks, and it is popular for data science, machine learning and artificial intelligence. It is also popular for web development, as it offers good frameworks such as Flask and Django. The Python libraries used in this thesis are described below.

NumPy (https://numpy.org/)

NumPy (also called Numerical Python) is a Python library used for scientific computing. It contains multi-dimensional array and matrix data



structures and can be used to perform a number of mathematical operations on arrays, such as trigonometric, statistical and algebraic routines.

Matplotlib (https://matplotlib.org/)

Matplotlib is a cross-platform data visualization and plotting library for Python, used for producing attractive graphs. NumPy is a required dependency of Matplotlib. Matplotlib scripts are structured so that, in many cases, just a few lines of code are enough to generate a visual data plot. It also provides a UI menu for customizing plots. Matplotlib is widely known and makes visualization easier than many other data-representation libraries.

Keras (https://keras.io/)

Keras is a Google-developed high-level deep learning API for building neural networks. It makes the implementation of neural networks easy and supports multiple back-ends for neural network computation. Keras is easy to learn and use, as it provides a Python front-end with a high level of abstraction while offering multiple back-end options for computation.

OpenCV (https://opencv.org/)

OpenCV is an open-source library for image processing and computer vision, used to detect and recognize entities in real time. When integrated with other libraries such as NumPy, images detected with OpenCV can be pre-processed as NumPy arrays for analysis. It supports C++, Java and Python interfaces.

Pandas (https://pandas.pydata.org/)

Pandas is an open-source package for data analysis and manipulation, released under the BSD license. It is built on top of NumPy. Pandas saves a lot of time by removing time-consuming and repetitive tasks associated with data handling.

Scikit-learn (https://scikit-learn.org/stable/)

Scikit-learn is a Python library that provides many supervised and unsupervised learning algorithms. It is built on packages such as NumPy, Pandas and Matplotlib.


Figure 4.2: Sample of the GTSRB dataset

4.2 Data Overview

The German Traffic Sign Recognition Benchmark (GTSRB) is the dataset used in this experimentation process. It was introduced at the International Joint Conference on Neural Networks (IJCNN) 2011 [36]. It is a multi-class benchmark dataset for TSR used for single-image classification. The dataset was used in the IJCNN competition, where high classification accuracy was achieved by CNN models as well as by manual classification. The dataset contains over 35,000 images of different traffic signs, divided into 43 classes; these class labels correspond to traffic signs common in most countries. A sample image of each class is shown in Figure 4.2. The dataset is quite varied: some classes have many images, while others have few. The size of the dataset is around 300 MB [1]. The dataset is divided into training and test sets; 70% of the images of every class are used for training and the remainder for testing and validation. Step 1 of phase 1 in Figure 4.1 is done here. Table 4.1 shows the number of images in each dataset (training, testing and validation). A CSV file containing the label and class ID for each class accompanies the dataset.

Category         Total Images   Images for Training   Images for Testing   Images for Validation
GTSRB dataset    34,799         24,360                10,440               10,440

Table 4.1: Distribution of images across different datasets



4.3 Processing The Dataset Images

Step 2 of phase 1 in Figure 4.1 is done here. After splitting the dataset into training and testing sets, the pre-processing techniques are applied to both by iterating through the images. Initially, each image is converted into an array of pixels. The array of pixels is sent as input to the grayscale method, which converts a colour image to a monochrome grayscale image. The conversion from an RGB image to gray is done with cvtColor(src, CV_RGB2GRAY).

• cvtColor - An OpenCV method that converts an image from one color space to another.

• src - Input image

• CV_RGB2GRAY - Transforms RGB colours to gray using the following formula:

Y = 0.299 R + 0.587 G + 0.114 B (4.1)

where R, G and B are the red, green and blue channels. Using these proportions of red, green and blue,

the grayscale method transforms the image into a monochrome gray image. The grayscaled array is then sent as input to the histogram equalization

method. Histogram equalization improves the contrast of an image by spreading out the most frequent intensity values. Initially, the pixel values are higher in some areas and lower in others; histogram equalization stretches out the pixel range by distributing the values uniformly. It uses the Cumulative Distribution Function (CDF) to remap the grayscale distribution to the equalized distribution:

H_1(i) = \sum_{0 < j < i} H(j) \qquad (4.2)

After the histogram equalization method, the array of pixels is divided by 255 so that the pixel values lie between 0 and 1. Figure 4.3 shows sample pre-processed images.
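A minimal sketch of this pre-processing pipeline with OpenCV's Python bindings (our own illustration, not the thesis code; note that cv2.imread returns BGR channel order, so the Python constant is cv2.COLOR_BGR2GRAY, and the file path is hypothetical):

    import cv2
    import numpy as np

    def preprocess(img_bgr):
        """Grayscale, equalize the histogram, and scale pixels to [0, 1]."""
        gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)  # BGR is OpenCV's default
        equalized = cv2.equalizeHist(gray)                # spread out frequent intensities
        return equalized.astype(np.float32) / 255.0

    img = cv2.imread("sample_sign.png")  # hypothetical path to one dataset image
    x = preprocess(img)
    print(x.min(), x.max())              # values now lie in [0, 1]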

4.4 Image Augmentation

Image augmentation is a useful method when building the CNN model, as it increases the number of dataset images (step 3 of phase 1 in Figure 4.1). It can create many images from a single image so that the CNN model has more images for training. The idea is simple: multiple near-duplicate images are created from a single image, each with some variation. Keras provides a class called ImageDataGenerator for performing image augmentation. It has parameters (such as zoom, rotation etc.) based on which the multiple images are created. In our model, we used the following parameters for generating multiple images.


Figure 4.3: Sample Pre-processed Image

• width_shift_range: shifts the image left or right (horizontal shift). If the value is a float <= 1, it is interpreted as a fraction of the total width. We used 0.1, i.e., up to 10% of the total width.

• height_shift_range: works the same as width_shift_range but shifts vertically (up or down).

• zoom_range: zooms the image by up to the given fraction. We used 0.2.

• shear_range: shear means the image is distorted along an axis; this parameter augments images in a way that mimics how humans see things from different angles. The value is the shear intensity. We used 0.1.

• rotation_range: rotates the image by up to the given angle. We used 10 degrees.

With the help of the above parameters, up to 20 images are generated per source image (the limit is set to 20). These generated images are also used for training the CNN model. Figure 4.4 shows sample augmented images, and a minimal configuration sketch is given below.

Figure 4.4: Augmented Images
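A minimal configuration sketch of the generator with the parameter values listed above (our own illustration; X_train and y_train are assumed to hold the pre-processed training images and their one-hot labels):

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(
        width_shift_range=0.1,   # horizontal shift of up to 10% of the width
        height_shift_range=0.1,  # vertical shift of up to 10% of the height
        zoom_range=0.2,          # zoom by up to 20%
        shear_range=0.1,         # shear distortion
        rotation_range=10,       # rotation of up to 10 degrees
    )
    # X_train: (N, 32, 32, 1) pre-processed images; y_train: one-hot labels (assumed)
    batches = datagen.flow(X_train, y_train, batch_size=20)
    X_batch, y_batch = next(batches)  # one batch of augmented images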


Type of layer                   Output shape         Parameters
conv2d (Conv2D)                 (None, 28, 28, 60)   1560
conv2d_1 (Conv2D)               (None, 24, 24, 60)   90060
max_pooling2d (MaxPooling2D)    (None, 12, 12, 60)   0
conv2d_2 (Conv2D)               (None, 10, 10, 30)   16230
conv2d_3 (Conv2D)               (None, 8, 8, 30)     8130
max_pooling2d_1 (MaxPooling2D)  (None, 4, 4, 30)     0
dropout (Dropout)               (None, 4, 4, 30)     0
flatten (Flatten)               (None, 480)          0
dense (Dense)                   (None, 500)          240500
dropout_1 (Dropout)             (None, 500)          0
dense_1 (Dense)                 (None, 43)           21543

Table 4.2: Architecture of CNN

4.5 Building The CNN Model

In our experiment, we took the LeNet model based on [26] and made some changes to it. The LeNet-5 CNN architecture is made up of 7 layers: 3 convolutional layers, 2 subsampling layers and 2 fully connected layers. We added 1 more convolutional layer and 2 dropout layers to the LeNet model. So, in our experiment, we used 4 convolution layers, 2 max-pooling layers, 2 dropout layers, 2 fully connected layers and 1 flatten operation (shown in Figure 4.5). We used 60 kernels of size 5x5 in the first and second convolution layers with ReLU activation. For the 3rd and 4th convolution layers, 30 kernels of size 3x3 are used. Both max-pooling layers use a 2x2 pool size. The 2 dropout layers use a rate of 0.5, i.e., 50% of the neurons are shut down at each training step by zeroing out their values. Of the 2 dense layers, one uses ReLU activation and the other softmax.

Figure 4.5: Our proposed CNN model Architecture

1. In the conv2d layer, element-wise multiplication is performed between the pre-processed image array and a kernel (filter) of size 5x5; the products are summed


and the sum is written to the corresponding location in the output. In the same way, every output value is computed. 60 filters are applied in this layer, after which the spatial size changes to 28x28. The ReLU layer then applies the activation function max(0, x) to all pixel values of the output image.

2. The same operation is performed in the conv2d_1 layer with ReLU activation. The size is reduced to 24x24.

3. Max pooling is then applied, which takes the maximum element of each region of the feature map covered by the filter. The main purpose of max pooling is to retain the most prominent features of the previous feature map. Here, a filter size of 2x2 is used.

4. After max pooling, 2 conv2d layers are applied one after another, with 30 filters of size 3x3 and ReLU activation.

5. Max pooling is applied again with a 2x2 filter size. A dropout of 0.5 is then used to reduce the active connections by half, and the flatten operation converts the feature maps into a single dimension.

6. Finally, 2 fully connected layers are used (one with ReLU activation and the other with softmax) with one dropout layer (0.5) in between, producing the final output array.

After building, the model is compiled with the Adam optimizer, the loss function and an accuracy metric.
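The following sketch reproduces the architecture of Table 4.2 in Keras (our own reconstruction from the table and the text, not the authors' exact code); calling model.summary() should match the table's output shapes and parameter counts:

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
    from tensorflow.keras.optimizers import Adam

    def build_model(num_classes=43):
        model = Sequential([
            Conv2D(60, (5, 5), activation="relu", input_shape=(32, 32, 1)),
            Conv2D(60, (5, 5), activation="relu"),
            MaxPooling2D(pool_size=(2, 2)),
            Conv2D(30, (3, 3), activation="relu"),
            Conv2D(30, (3, 3), activation="relu"),
            MaxPooling2D(pool_size=(2, 2)),
            Dropout(0.5),
            Flatten(),
            Dense(500, activation="relu"),
            Dropout(0.5),
            Dense(num_classes, activation="softmax"),
        ])
        model.compile(optimizer=Adam(learning_rate=0.001),
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    model = build_model()
    model.summary()  # compare with Table 4.2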

4.6 Training and Testing The Model

Training of the model is done using the Adam optimizer and categorical cross-entropy as the loss function. Step 3 of phase 1 in Figure 4.1 is done here. As shown in Figure 4.6, the number of images is small in some classes, so data augmentation is used while training the model to generate multiple images from each single image. The fit method is used to train the model and is given the following parameters:

• Augmented images generated by the flow method of the ImageDataGenerator class for each image in the training dataset.

• steps_per_epoch: len(X_train) / batch_size

• epochs: initially 10 epochs are used; this is later changed for parameter tuning.

• validation_data: validation is performed alongside training, using the validation dataset.

After training, graphs are plotted for loss and accuracy over increasing epochs for the training and validation datasets. Once training is completed, testing is performed with the test dataset; these are images not used for training, i.e., unseen traffic sign images. Finally, the model is saved using the save method in the '.h5' format. A minimal training sketch follows.
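A sketch of the training step under the assumptions above (model, datagen, X_train, y_train, X_val and y_val come from the earlier sketches; the batch size is not stated in the thesis, so 50 is a placeholder):

    import matplotlib.pyplot as plt

    batch_size = 50  # hypothetical; not stated in the text
    history = model.fit(
        datagen.flow(X_train, y_train, batch_size=batch_size),
        steps_per_epoch=len(X_train) // batch_size,
        epochs=10,                       # default value; changed later for tuning
        validation_data=(X_val, y_val),
    )
    model.save("traffic_sign_model.h5")  # saved in '.h5' format, as described

    # Accuracy curves over epochs for training and validation data
    plt.plot(history.history["accuracy"], label="train")
    plt.plot(history.history["val_accuracy"], label="validation")
    plt.legend()
    plt.show()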


Figure 4.6: Distribution of the training dataset

4.7 Parameter Tuning

After testing the model, we performed parameter tuning, a systematic test of multiple combinations of epochs and learning rate. A deep learning model has many hyper-parameters; among these, epochs and learning rate significantly affect the model's accuracy. The learning rate controls how much the model is changed when its weights are updated, and is considered one of the most important hyper-parameters affecting the model's accuracy [34]. An epoch is one full pass over the dataset, and tuning the number of epochs may significantly affect the accuracy. Hence, both parameters were used for tuning. The combinations are summarized in Table 4.3 and can be run systematically as in the sketch after the table.

Learning rate   Epochs
0.001           10
0.001           20
0.002           10
0.002           20
0.001           30
0.0005          20
0.001           5

Table 4.3: Various combinations of parameter tuning
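A sketch of how the combinations in Table 4.3 can be evaluated (our own illustration; build_model, datagen and batch_size come from the earlier sketches, and each run starts from fresh weights so the combinations are comparable):

    from tensorflow.keras.optimizers import Adam

    combinations = [(0.001, 10), (0.001, 20), (0.002, 10), (0.002, 20),
                    (0.001, 30), (0.0005, 20), (0.001, 5)]  # Table 4.3

    results = {}
    for lr, epochs in combinations:
        model = build_model()  # fresh weights for every combination
        model.compile(optimizer=Adam(learning_rate=lr),
                      loss="categorical_crossentropy", metrics=["accuracy"])
        history = model.fit(datagen.flow(X_train, y_train, batch_size=batch_size),
                            steps_per_epoch=len(X_train) // batch_size,
                            epochs=epochs, validation_data=(X_val, y_val))
        results[(lr, epochs)] = history.history["val_accuracy"][-1]
    print(results)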

4.8 Testing The Model In Real-time

Later, OpenCV is used to set up a live camera so that images can be introduced to it as input. Each image is extracted using OpenCV and converted to an array of pixels


(Figure 4.1). The array is then resized to 32x32 using OpenCV's resize method. The converted array is pre-processed using the grayscale method (step 1 of phase 2 of Figure 4.1) and the histogram equalization method (step 2 of phase 2 of Figure 4.1), and sent as input to the saved CNN model, which was trained and tested previously (step 3 of phase 2 of Figure 4.1). Here, we test our model on real-time video captured with a webcam or any other external feed. After we access the web camera, every frame of the live video is processed, and the saved model predicts the probability for the processed image. If the probability of the prediction is higher than the threshold value, i.e., 0.75, the image is recognised as a traffic sign, and the corresponding class name and prediction probability are displayed (step 4 of phase 2 of Figure 4.1). The probability value depends on the size of the training dataset and on the architecture of the model: if the dataset for the corresponding class is large, we get a high prediction probability for images of that class. A minimal sketch of this loop follows.
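A sketch of the real-time loop (our own illustration, not the thesis code; preprocess is the Section 4.3 sketch, the model file comes from the Section 4.6 sketch, and "labels.csv" with a "Name" column is a hypothetical stand-in for the class-name CSV shipped with the dataset):

    import cv2
    import numpy as np
    import pandas as pd
    from tensorflow.keras.models import load_model

    model = load_model("traffic_sign_model.h5")
    labels = pd.read_csv("labels.csv")["Name"].tolist()  # hypothetical file/column
    THRESHOLD = 0.75

    cap = cv2.VideoCapture(0)  # default webcam
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        img = preprocess(cv2.resize(frame, (32, 32)))  # grayscale + equalize + scale
        probs = model.predict(img.reshape(1, 32, 32, 1))[0]
        class_id = int(np.argmax(probs))
        if probs[class_id] > THRESHOLD:  # display only confident predictions
            text = f"{labels[class_id]}: {probs[class_id]:.0%}"
            cv2.putText(frame, text, (20, 35), cv2.FONT_HERSHEY_SIMPLEX,
                        0.75, (0, 0, 255), 2)
        cv2.imshow("Traffic sign classification", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to quit
            break
    cap.release()
    cv2.destroyAllWindows()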

Chapter 5

Results and Analysis

This chapter presents the results obtained with the experimentation method, including the evaluation of the model before and after parameter tuning. The evaluation metric used in the thesis is accuracy. A sample classification of an image introduced to a real-time camera is also presented with its class name and prediction probability.

5.1 Experiment-1: Training The Model

In this experiment, a CNN model based on a modified LeNet architecture is built with 4 convolution layers, 2 max-pooling layers, 2 dropout layers, 2 dense layers and 1 flatten layer.

5.1.1 The model trained with default parameter values

The model runs with the default parameter values, i.e., a learning rate of 0.001 and 10 epochs. Figure 5.1 shows the accuracy of the model for the training and test data, and Figure 5.2 shows the loss.

Figure 5.1: Model accuracy with default parameter values



Figure 5.2: Loss occurred

5.1.2 The model trained after parameter tuning with various combinations

In this experiment, we tune the learning rate and the number of epochs, which leads to a significant change in the model's accuracy. We implemented various combinations of these parameters to achieve a significant model accuracy. The graphs below depict the accuracy and loss of the model after parameter tuning. In Figures 5.3 to 5.14, the blue line indicates the training data and the orange line the test data.


For learning rate = 0.002 and epochs = 10

Figures 5.3 and 5.4 represent the accuracy and loss of the model for training and test data when the learning rate is tuned to 0.002 and the epochs to 10.

Figure 5.3: Accuracy

Figure 5.4: Loss


For learning rate = 0.001 and epochs = 20

Figures 5.5 and 5.6 represent the accuracy and loss of the model for train and testdata when the learning rate is tuned to 0.001 and epochs to 20.

Figure 5.5: Accuracy

Figure 5.6: Loss


For learning rate = 0.002 and epochs = 20

Figures 5.7 and 5.8 represent the accuracy and loss of the model for train and testdata when the learning rate is tuned to 0.002 and epochs to 20.

Figure 5.7: Accuracy

Figure 5.8: Loss occurred


For learning rate = 0.0005 and epochs = 20

Figures 5.9 and 5.10 represent the accuracy and loss of the model for train and testdata when the learning rate is tuned to 0.0005 and epochs to 20.

Figure 5.9: Accuracy

Figure 5.10: Loss occurred


For learning rate = 0.001 and epochs = 5

Figures 5.11 and 5.12 represent the accuracy and loss of the model for train and testdata when the learning rate is tuned to 0.001 and epochs to 5.

Figure 5.11: Accuracy

Figure 5.12: Loss occurred


For learning rate = 0.001 and epochs = 30

Figures 5.13 and 5.14 represent the accuracy and loss of the model for train and testdata when the learning rate is tuned to 0.001 and epochs to 30.

Figure 5.13: Accuracy

Figure 5.14: Loss occurred


5.2 Experiment-2: Using The Model In Real-time

In this experiment, we tested the model on real-time video using a webcam or an external live feed. The traffic sign images introduced to the camera are new images and come from neither the training nor the test data; new images from each class are used for testing. We process every frame of the video to check for a traffic sign. If the probability value of the prediction is higher than the threshold value, the class label is displayed along with the probability. Figure 5.15 displays an example where a stop sign image is placed in front of the webcam; the class name is displayed along with the prediction probability. Each traffic sign of the 43 classes was tested in real time. While some show probability scores greater than 90%, others fluctuate; for example, when a 'Speed limit (20km/h)' image is placed in front of the camera, the probability score fluctuates. We tested the model with 43 different traffic signs, one image per class, and noted the probabilities.

Figure 5.15: Traffic sign classification in real time using webcam

The probabilities for each class label are shown in Table 5.1.


Class Id   Traffic sign                                         Probability
0          Speed limit (20km/h)                                 80-90%
1          Speed limit (30km/h)                                 100%
2          Speed limit (50km/h)                                 98%
3          Speed limit (60km/h)                                 99%
4          Speed limit (70km/h)                                 100%
5          Speed limit (80km/h)                                 100%
6          End of speed limit (80km/h)                          95-98%
7          Speed limit (100km/h)                                99%
8          Speed limit (120km/h)                                98%
9          No passing                                           100%
10         No passing for vehicles over 3.5 metric tons         99%
11         Right-of-way at the next intersection                100%
12         Priority road                                        97-99%
13         Yield                                                99%
14         Stop                                                 92-95%
15         No vehicles                                          91-95%
16         Vehicles over 3.5 metric tons prohibited             93-96%
17         No entry                                             98-99%
18         General caution                                      99%
19         Dangerous curve to the left                          80-90%
20         Dangerous curve to the right                         85-90%
21         Double curve                                         89-93%
22         Bumpy road                                           86-91%
23         Slippery road                                        87-90%
24         Road narrows on the right                            90-93%
25         Road work                                            98-99%
26         Traffic signals                                      94-96%
27         Pedestrians                                          86-90%
28         Children crossing                                    89-94%
29         Bicycles crossing                                    85-90%
30         Beware of ice/snow                                   91-95%
31         Wild animals crossing                                96-98%
32         End of all speed and passing limits                  89-93%
33         Turn right ahead                                     95-97%
34         Turn left ahead                                      88-91%
35         Ahead only                                           99-100%
36         Go straight or right                                 89-92%
37         Go straight or left                                  85-88%
38         Keep right                                           100%
39         Keep left                                            85-88%
40         Roundabout mandatory                                 86-89%
41         End of no passing                                    90-92%
42         End of no passing by vehicles over 3.5 metric tons   87-91%

Table 5.1: Probability scores achieved when testing the model with new images in real time.

Chapter 6

Discussion

We have proposed a CNN model whose architecture is based on the LeNet model and that can classify traffic signs using the GTSRB dataset. The research questions are answered by drawing conclusions from the obtained results.

RQ1: Considering the initial parameter setup of the proposed model, to what extent can parameter tuning improve the model's accuracy?

Answer:

Figures 5.1 to 5.14 in Chapter 5 show the accuracy and loss for the training and test data, obtained by tuning the epochs and learning rate in various combinations. Every combination of parameter tuning showed a considerable change in the model's accuracy. Table 6.1 summarizes the values obtained from Experiment-1 in Chapter 5.

Learning rate   Epochs   Train accuracy   Train loss   Test accuracy   Test loss
0.001           10       0.8883           0.3659       0.9843          0.0669
0.002           10       0.8549           0.4684       0.9697          0.0996
0.001           20       0.9262           0.2352       0.9857          0.0477
0.002           20       0.9052           0.3155       0.9883          0.0469
0.0005          20       0.9222           0.2508       0.9870          0.0407
0.001           5        0.7026           0.9544       0.9274          0.2580
0.001           30       0.9510           0.1664       0.9931          0.0252

Table 6.1: Accuracy and loss of the model for training and test data when tuning various combinations of learning rate and epochs.

Some interesting trends in the model's accuracy are found when the learning rate is tuned. The learning rate, or step size, controls how much the weights are updated while training the model. It is a configurable hyperparameter whose value typically lies between 0.0 and 1.0. From Table 6.1, we can see that changing the learning rate away from the default optimum, i.e., 0.001, decreases the model's accuracy. This is because the learning rate controls how quickly the model adapts to the problem: a large learning rate makes the model learn faster but results in a sub-optimal set of weights, and loss on the training dataset caused by divergent weights decreases the accuracy. A lower learning rate allows the model to learn a more optimal set of weights but may take much longer, and the model may get stuck in a sub-optimal



solution. Hence, changing the learning rate from the default value decreases the model's accuracy.

On the other hand, Table 6.1 clearly shows that increasing the number of epochs results in higher accuracy. However, increasing the number of epochs more than necessary may lead to overfitting: when the number of training epochs is increased too far, the model learns the patterns of the sample data to such an extent that it becomes less effective on the test data. Due to overfitting, the model shows high accuracy on the training data but fails to do so on the test data. Too few epochs result in underfitting, which also yields low accuracy. Figure 6.1 illustrates the model learning patterns for different numbers of epochs.

Figure 6.1: Graph plot reflecting the model learning patterns with decreasing number of epochs. (The figure is borrowed from [3])
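A common remedy for this trade-off, not used in the thesis but standard in Keras, is early stopping, which halts training once the validation loss stops improving instead of fixing the number of epochs in advance. A minimal sketch, again assuming the myModel function from Appendix A and hypothetical pre-split arrays:

from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss has not improved for 5 consecutive epochs,
# and restore the weights from the best epoch seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)

model = myModel()
model.fit(X_train, y_train, epochs=100,  # upper bound; training usually stops earlier
          validation_data=(X_test, y_test),
          callbacks=[early_stop])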

From the above discussion, increasing the number of epochs significantly increases the model's accuracy, up to a certain extent. With the optimum learning rate of 0.001 and 30 epochs, 95% accuracy and a minimum loss of 0.1664 are achieved.

RQ2: How can a traffic sign image be detected and extracted effectively when introduced to the live camera using OpenCV?

Answer:

In Experiment 2 in Chapter 5, traffic sign images are introduced to the model via a web camera. The images used for testing the model are new images; they do not come from the training or test datasets. Every frame of the video is checked for a traffic sign: each frame captured from the camera is pre-processed (grayscaled and histogram equalized), and the CNN model is used to classify the digital image. The probability score of the prediction is computed and compared with the threshold value of 0.75; that is, if the probability of the predicted traffic sign exceeds 0.75, the corresponding class label is displayed. For a typical binary classification problem the default threshold is 0.5, and for a multi-class classification problem the value can be set anywhere between 0 and 1. The default threshold may not represent the optimal interpretation of the predicted probabilities, so a threshold reasonably higher than the default is employed.
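The following is a minimal sketch of this loop, not the thesis' exact script. The preprocess helper (grayscaling plus histogram equalization, as described above) and the trained model are assumed to exist, and a 32x32 input size is assumed.

import cv2
import numpy as np

THRESHOLD = 0.75  # minimum probability required to display a class label

cap = cv2.VideoCapture(0)  # open the web camera
while True:
    ret, frame = cap.read()
    if not ret:
        break
    img = preprocess(frame)          # assumed helper: grayscale + histogram equalization
    img = img.reshape(1, 32, 32, 1)  # assumed input size of the CNN
    predictions = model.predict(img)
    class_id = int(np.argmax(predictions))
    probability = float(np.max(predictions))
    if probability > THRESHOLD:      # display the label only for confident predictions
        cv2.putText(frame, f"{class_id}: {probability:.2f}", (20, 40),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow("Traffic sign classification", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()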

While conducting Experiment 2, some traffic signs, when introduced to the camera, showed fluctuating probability scores, as shown in Table 5.1. This is due to the unbalanced data across the classes. If a class contains an optimum number of images, i.e., up to 500 images, a high probability score is achieved. Images can be augmented into the classes that have too few images, or a dataset with balanced classes can be used to train the model, in order to achieve high probability scores for the traffic sign images. This answers the second research question.
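As an illustration of the augmentation remedy (the thesis does not prescribe a specific pipeline), an under-represented class could be grown with Keras' ImageDataGenerator. X_minority is a hypothetical array holding one class's images with shape (n, 32, 32, 1).

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Mild geometric transformations that keep traffic signs recognizable
datagen = ImageDataGenerator(rotation_range=10,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             zoom_range=0.2,
                             shear_range=0.1)

target = 500  # optimum class size noted in the discussion above
augmented = []
flow = datagen.flow(X_minority, batch_size=32, shuffle=True)
while len(X_minority) + sum(len(b) for b in augmented) < target:
    augmented.append(next(flow))  # one batch of randomly transformed copies
X_balanced = np.concatenate([X_minority] + augmented)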

Chapter 7

Conclusions and Future Work

7.1 Conclusion

Traffic Sign Recognition helps novice drivers through Driver Assistance Systems and provides safe, efficient navigation in the case of self-driving vehicles. In this thesis, we proposed a CNN model that is cost-sensitive in terms of computational cost and can classify the traffic sign images of the GTSRB dataset with 95% accuracy. The proposed architecture has four convolution layers and therefore a lower computational cost than various state-of-the-art architectures such as VGG-16, which has 16 weight layers. The learning rate and the number of epochs were tuned in different combinations to achieve the highest accuracy possible. The model was then used to predict, with reasonably high probability, the class labels of traffic sign images introduced to the web camera.

From the results, we can conclude that tuning the number of epochs significantly impacts the model's accuracy, while changing the learning rate away from the default decreases it. Hence, with a learning rate of 0.001 and 30 epochs, the model achieved its highest accuracy, i.e., 95.1%.

7.2 Future work

As for future work, more parameters can be tuned, such as the batch size and the dropout rate, to further improve the model's accuracy. It would be of great interest to find out which parameters increase the model's accuracy when tuned. To address long training times, Amazon Web Services (AWS) Deep Learning Amazon Machine Images (AMIs) can be used to launch auto-scaled GPU clusters for large-scale training. One can quickly launch an Amazon EC2 instance that comes pre-installed with popular deep learning frameworks and interfaces to train the model [1].

To improve the prediction probability in real time, a larger dataset with more images in each class can be used to train the model. A larger dataset may yield better predictions because more features of each class label can be learned. Oversampling the minority classes can also help increase the probability. The simplest way to perform oversampling is to duplicate the existing images in the minority classes, as sketched below, although the model then learns no extra information from the duplicates. Data augmentation can also be applied to the minority classes to increase the dataset size.
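A minimal sketch of oversampling by duplication follows; the images and labels arrays and the helper name are illustrative, not from the thesis code.

import numpy as np

def oversample_class(images, labels, class_id, target_size):
    # Duplicate randomly chosen images of one class until it reaches target_size.
    # Note: the duplicates add no new information; augmentation is preferable.
    class_images = images[labels == class_id]
    n_missing = target_size - len(class_images)
    if n_missing <= 0:
        return images, labels
    idx = np.random.choice(len(class_images), size=n_missing, replace=True)
    images = np.concatenate([images, class_images[idx]])
    labels = np.concatenate([labels, np.full(n_missing, class_id)])
    return images, labels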


The model can be improved further for real-time use. For example, it could be employed in a Driver Assistance System with the output delivered as a voice prompt.

Bibliography

[1] “Amazon Deep Learning AMIs.” [Online]. Available: https://aws.amazon.com/machine-learning/amis/

[2] “deep learning - Why convolutions always use odd-numbers as filter_size.”[Online]. Available: https://datascience.stackexchange.com/questions/23183/why-convolutions-always-use-odd-numbers-as-filter-size

[3] “Does increasing epochs increase accuracy? - Quora.” [Online]. Available:https://www.quora.com/Does-increasing-epochs-increase-accuracy

[4] “Public Archive: daaeac0d7ce1152aea9b61d9f1e19370.” [Online]. Available:https://sid.erda.dk/public/archives/daaeac0d7ce1152aea9b61d9f1e19370/published-archive.html

[5] A. Adam and C. Ioannidis, "Automatic road-sign detection and classification based on support vector machines and HOG descriptors," ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, vol. 2, no. 5, 2014.

[6] F. Agostinelli, M. Hoffman, P. Sadowski, and P. Baldi, "Learning activation functions to improve deep neural networks," arXiv preprint arXiv:1412.6830, 2014.

[7] S. Albawi, T. A. Mohammed, and S. Al-Zawi, "Understanding of a convolutional neural network," in 2017 International Conference on Engineering and Technology (ICET). IEEE, 2017, pp. 1–6.

[8] J. C. Ang, A. Mirzal, H. Haron, and H. N. A. Hamed, "Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 13, no. 5, pp. 971–989, 2015.

[9] P. Bahar, T. Alkhouli, J.-T. Peter, C. J.-S. Brix, and H. Ney, "Empirical investigation of optimization algorithms in neural machine translation," The Prague Bulletin of Mathematical Linguistics, vol. 108, no. 1, p. 13, 2017.

[10] Y. Bengio, I. Goodfellow, and A. Courville, Deep learning. MIT Press, Massachusetts, USA, 2017, vol. 1.

[11] Z. Bi, L. Yu, H. Gao, P. Zhou, and H. Yao, "Improved VGG model-based efficient traffic sign recognition for safe driving in 5G scenarios," International Journal of Machine Learning and Cybernetics, Aug. 2020. [Online]. Available: https://doi.org/10.1007/s13042-020-01185-5

[12] G. Bradski and A. Kaehler, Learning OpenCV: Computer vision with the OpenCV library. O'Reilly Media, Inc., 2008.


[13] P. Cunningham, M. Cord, and S. J. Delany, "Supervised learning," in Machine learning techniques for multimedia. Springer, 2008, pp. 21–49.

[14] P. Dhar, M. Z. Abedin, T. Biswas, and A. Datta, "Traffic sign detection — a new approach and recognition using convolution neural network," in 2017 IEEE Region 10 Humanitarian Technology Conference (R10-HTC), 2017, pp. 416–419.

[15] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning. MIT Press, Cambridge, 2016, vol. 1, no. 2.

[16] H. Guan, W. Yan, Y. Yu, L. Zhong, and D. Li, "Robust Traffic-Sign Detection and Classification Using Mobile LiDAR Data With Digital Images," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 5, pp. 1715–1724, May 2018.

[17] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, "Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 6, pp. 3325–3337, 2014.

[18] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.

[19] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel, "Detection of traffic signs in real-world images: The German traffic sign detection benchmark," in The 2013 international joint conference on neural networks (IJCNN). IEEE, 2013, pp. 1–8.

[20] H. Ibrahim and N. S. P. Kong, "Brightness preserving dynamic histogram equalization for image contrast enhancement," IEEE Transactions on Consumer Electronics, vol. 53, no. 4, pp. 1752–1758, 2007.

[21] M. T. Islam, "Traffic sign detection and recognition based on convolutional neural networks," in 2019 International Conference on Advances in Computing, Communication and Control (ICAC3), 2019, pp. 1–6.

[22] M. B. Jensen, M. P. Philipsen, A. Møgelmose, T. B. Moeslund, and M. M. Trivedi, "Vision for looking at traffic lights: Issues, survey, and perspectives," IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 7, pp. 1800–1815, 2016.

[23] J. Jin, K. Fu, and C. Zhang, "Traffic sign recognition with hinge loss trained convolutional neural networks," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 5, pp. 1991–2000, 2014.

[24] J. D. Kelleher, Deep learning. MIT Press, 2019.

[25] Y.-T. Kim, "Contrast enhancement using brightness preserving bi-histogram equalization," IEEE Transactions on Consumer Electronics, vol. 43, no. 1, pp. 1–8, 1997.

[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.


[27] M. Matsugu, K. Mori, Y. Mitari, and Y. Kaneda, "Subject independent facial expression recognition with robust face detection using a convolutional neural network," Neural Networks, vol. 16, no. 5-6, pp. 555–559, 2003.

[28] S. Qian, H. Liu, C. Liu, S. Wu, and H. San Wong, "Adaptive activation functions in convolutional neural networks," Neurocomputing, vol. 272, pp. 204–212, 2018.

[29] B. B. Shabarinath and P. Muralidhar, "Convolutional Neural Network based Traffic-Sign Classifier Optimized for Edge Inference," in 2020 IEEE Region 10 Conference (TENCON), Nov. 2020, pp. 420–425, ISSN: 2159-3450.

[30] L. Shangzheng, "A traffic sign image recognition and classification approach based on convolutional neural network," in 2019 11th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA). IEEE, 2019, pp. 408–411.

[31] S. Sharma and S. Sharma, "Activation functions in neural networks," Towards Data Science, vol. 6, no. 12, pp. 310–316, 2017.

[32] P. Shopa, N. Sumitha, and P. S. K. Patra, "Traffic sign detection and recognition using OpenCV," in International Conference on Information Communication and Embedded Systems (ICICES2014), Feb. 2014, pp. 1–6.

[33] C. Shorten and T. M. Khoshgoftaar, "A survey on image data augmentation for deep learning," Journal of Big Data, vol. 6, no. 1, pp. 1–48, 2019.

[34] L. N. Smith, "A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay," arXiv preprint arXiv:1803.09820, 2018.

[35] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[36] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, "Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition," Neural Networks, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608012000457

[37] T. S. Tsoi and C. Wheelus, "Traffic Signal Classification with Cost-Sensitive Deep Learning Models," in 2020 IEEE International Conference on Knowledge Graph (ICKG), Aug. 2020, pp. 586–592.

[38] F. Zaklouta and B. Stanciulescu, "Real-time traffic-sign recognition using tree classifiers," IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 4, pp. 1507–1514, 2012.

Appendix A

Model code snippet

The following code snippet builds the CNN model that takes a pre-processed image and classifies it into a corresponding label.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from tensorflow.keras.optimizers import Adam

def myModel():
    no_Of_Filters = 60
    size_of_Filter = (5, 5)   # kernel size of the first two convolutional layers
    size_of_Filter2 = (3, 3)  # kernel size of the last two convolutional layers
    size_of_pool = (2, 2)
    no_Of_Nodes = 500         # NO. OF NODES IN HIDDEN LAYERS

    model = Sequential()
    # Two 5x5 convolutional layers followed by max pooling;
    # imageDimesions is a global defined elsewhere in the thesis code
    model.add(Conv2D(no_Of_Filters, size_of_Filter,
                     input_shape=(imageDimesions[0], imageDimesions[1], 1),
                     activation='relu'))
    model.add(Conv2D(no_Of_Filters, size_of_Filter, activation='relu'))
    model.add(MaxPooling2D(pool_size=size_of_pool))

    # Two 3x3 convolutional layers with half the filters, then pooling and dropout
    model.add(Conv2D(no_Of_Filters // 2, size_of_Filter2, activation='relu'))
    model.add(Conv2D(no_Of_Filters // 2, size_of_Filter2, activation='relu'))
    model.add(MaxPooling2D(pool_size=size_of_pool))
    model.add(Dropout(0.5))

    # Fully connected classifier head; noOfClasses is a global defined elsewhere
    model.add(Flatten())
    model.add(Dense(no_Of_Nodes, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(noOfClasses, activation='softmax'))

    # COMPILE MODEL
    model.compile(Adam(lr=0.001), loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
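For completeness, a hypothetical usage sketch follows; imageDimesions and noOfClasses are globals that must be defined before calling myModel, and the values below are assumptions rather than taken from the thesis code.

imageDimesions = (32, 32, 3)  # assumed image size; only height and width are used above
noOfClasses = 43              # number of classes in the GTSRB dataset

model = myModel()
model.summary()  # prints the layer structure and parameter counts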

