10
1 Monitoring COVID-19 social distancing with person detection and tracking via fine-tuned YOLO v3 and Deepsort techniques Narinder Singh Punn, Sanjay Kumar Sonbhadra, Sonali Agarwal and Gaurav Rai Abstract—The rampant coronavirus disease 2019 (COVID-19) has brought global crisis with its deadly spread to more than 180 countries, and about 3,519,901 confirmed cases along with 247,630 deaths globally as on May 4, 2020. The absence of any ac- tive therapeutic agents and the lack of immunity against COVID- 19 increases the vulnerability of the population. Since there are no vaccines available, social distancing is the only feasible approach to fight against this pandemic. Motivated by this notion, this article proposes a deep learning based framework for automating the task of monitoring social distancing using surveillance video. The proposed framework utilizes the YOLO v3 object detection model to segregate humans from the background and Deepsort approach to track the identified people with the help of bounding boxes and assigned IDs. The results of the YOLO v3 model are further compared with other popular state-of-the-art models, e.g. faster region-based CNN (convolution neural network) and single shot detector (SSD) in terms of mean average precision (mAP), frames per second (FPS) and loss values defined by object classification and localization. Later, the pairwise vectorized L2 norm is computed based on the three-dimensional feature space obtained by using the centroid coordinates and dimensions of the bounding box. The violation index term is proposed to quantize the non adoption of social distancing protocol. From the experimental analysis, it is observed that the YOLO v3 with Deepsort tracking scheme displayed best results with balanced mAP and FPS score to monitor the social distancing in real-time. Index Terms—COVID-19, Video surveillance, Social distanc- ing, Object detection, Object tracking. I. I NTRODUCTION C OVID-19 belongs to the family of coronavirus caused diseases, initially reported at Wuhan, China, during late December 2020. On March 11, it spread over 114 countries with 118,000 active cases and 4000 deaths, WHO declared this a pandemic [1], [2]. On May 4, 2020, over 3,519,901 cases and 247,630 deaths had been reported worldwide. Several healthcare organizations, medical experts and scientists are trying to develop proper medicines and vaccines for this deadly virus, but till date, no success is reported. This situation forces the global community to look for alternate ways to stop the spread of this infectious virus. Social distancing is claimed as the best spread stopper in the present scenario, and all affected countries are locked-down to implement social distancing. This research is aimed to support and mitigate the coronavirus N. S. Punn, S. K. Sonbhadra, S. Agarwal, Indian Institute of Informa- tion Technology Allahabad, Jhalwa, Prayagraj, Uttar Pradesh, India; emails: {pse2017002, rsi2017502, sonali}@iiita.ac.in. G. Rai, Data Analytics, Comptroller and Auditor General of India; email: [email protected]. Fig. 1: An outcome of social distancing as the reduced peak of the epidemic and matching with available healthcare capacity. pandemic along with minimum loss of economic endeavours, and propose a solution to detect the social distancing among people gathered at any public place. The word “social distancing” is best practice in the direction of efforts through a variety of means, aiming to minimize or interrupt the transmission of COVID-19. 
It aims at reducing the physical contact between possibly infected individuals and healthy persons. As per the WHO norms [3] it is prescribed that people should maintain at least 6 feet of distance among each other in order to follow social distancing. A recent study indicates that social distancing is an im- portant containment measure and essential to prevent SARS- CoV-2, because people with mild or no symptoms may for- tuitously carry corona infection and can infect others [4]. Fig. 1 indicates that proper social distancing is the best way to reduce infectious physical contact, hence reduces the infection rate [5], [6]. This reduced peak may surely match with the available healthcare infrastructure and help to offer better facil- ities to the patients battling against the coronavirus pandemic. Epidemiology is the study of factors and reasons for the spread of infectious diseases. To study epidemiological phenomena, mathematical models are always the most preferred choice. Almost all models descend from the classical SIR model of Kermack and McKendrick established in 1927 [7]. Various research works have been done on the SIR model and its extensions by the deterministic system [8], and consequently, many researchers studied stochastic biological systems and epidemic models [9]. Respiratory diseases are infectious where the rate and mode of transmission of the causing virus are the most critical factors to be considered for the treatment or ways to stop the spread of the virus in the community. Several medicine organizations and pandemic researchers are trying to develop vaccines for COVID-19, but still, there is no well-known arXiv:2005.01385v4 [cs.CV] 27 Apr 2021

Monitoring COVID-19 social distancing with person

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

1

Monitoring COVID-19 social distancing withperson detection and tracking via fine-tuned YOLO

v3 and Deepsort techniquesNarinder Singh Punn Sanjay Kumar Sonbhadra Sonali Agarwal and Gaurav Rai

AbstractmdashThe rampant coronavirus disease 2019 (COVID-19)has brought global crisis with its deadly spread to more than180 countries and about 3519901 confirmed cases along with247630 deaths globally as on May 4 2020 The absence of any ac-tive therapeutic agents and the lack of immunity against COVID-19 increases the vulnerability of the population Since there are novaccines available social distancing is the only feasible approachto fight against this pandemic Motivated by this notion thisarticle proposes a deep learning based framework for automatingthe task of monitoring social distancing using surveillance videoThe proposed framework utilizes the YOLO v3 object detectionmodel to segregate humans from the background and Deepsortapproach to track the identified people with the help of boundingboxes and assigned IDs The results of the YOLO v3 modelare further compared with other popular state-of-the-art modelseg faster region-based CNN (convolution neural network) andsingle shot detector (SSD) in terms of mean average precision(mAP) frames per second (FPS) and loss values defined by objectclassification and localization Later the pairwise vectorized L2norm is computed based on the three-dimensional feature spaceobtained by using the centroid coordinates and dimensions ofthe bounding box The violation index term is proposed toquantize the non adoption of social distancing protocol Fromthe experimental analysis it is observed that the YOLO v3 withDeepsort tracking scheme displayed best results with balancedmAP and FPS score to monitor the social distancing in real-time

Index TermsmdashCOVID-19 Video surveillance Social distanc-ing Object detection Object tracking

I INTRODUCTION

COVID-19 belongs to the family of coronavirus causeddiseases initially reported at Wuhan China during late

December 2020 On March 11 it spread over 114 countrieswith 118000 active cases and 4000 deaths WHO declared thisa pandemic [1] [2] On May 4 2020 over 3519901 casesand 247630 deaths had been reported worldwide Severalhealthcare organizations medical experts and scientists aretrying to develop proper medicines and vaccines for this deadlyvirus but till date no success is reported This situation forcesthe global community to look for alternate ways to stop thespread of this infectious virus Social distancing is claimed asthe best spread stopper in the present scenario and all affectedcountries are locked-down to implement social distancingThis research is aimed to support and mitigate the coronavirus

N S Punn S K Sonbhadra S Agarwal Indian Institute of Informa-tion Technology Allahabad Jhalwa Prayagraj Uttar Pradesh India emailspse2017002 rsi2017502 sonaliiiitaacin

G Rai Data Analytics Comptroller and Auditor General of India emailraigcaggovin

Fig 1 An outcome of social distancing as the reduced peak ofthe epidemic and matching with available healthcare capacity

pandemic along with minimum loss of economic endeavoursand propose a solution to detect the social distancing amongpeople gathered at any public place

The word ldquosocial distancingrdquo is best practice in the directionof efforts through a variety of means aiming to minimize orinterrupt the transmission of COVID-19 It aims at reducingthe physical contact between possibly infected individuals andhealthy persons As per the WHO norms [3] it is prescribedthat people should maintain at least 6 feet of distance amongeach other in order to follow social distancing

A recent study indicates that social distancing is an im-portant containment measure and essential to prevent SARS-CoV-2 because people with mild or no symptoms may for-tuitously carry corona infection and can infect others [4]Fig 1 indicates that proper social distancing is the best way toreduce infectious physical contact hence reduces the infectionrate [5] [6] This reduced peak may surely match with theavailable healthcare infrastructure and help to offer better facil-ities to the patients battling against the coronavirus pandemicEpidemiology is the study of factors and reasons for the spreadof infectious diseases To study epidemiological phenomenamathematical models are always the most preferred choiceAlmost all models descend from the classical SIR model ofKermack and McKendrick established in 1927 [7] Variousresearch works have been done on the SIR model and itsextensions by the deterministic system [8] and consequentlymany researchers studied stochastic biological systems andepidemic models [9]

Respiratory diseases are infectious where the rate and modeof transmission of the causing virus are the most criticalfactors to be considered for the treatment or ways to stopthe spread of the virus in the community Several medicineorganizations and pandemic researchers are trying to developvaccines for COVID-19 but still there is no well-known

arX

iv2

005

0138

5v4

[cs

CV

] 2

7 A

pr 2

021

2

medicine available for treatment Hence precautionary stepsare taken by the whole world to restrict the spread of infectionRecently Eksin et al [8] proposed a modified SIR model withthe inclusion of a social distancing parameter a(IR) whichcan be determined with the help of the number of infected andrecovered persons represented as I and R respectively

dS

dt= minusβS I

Na(IN)

dI

dt= minusδI + βI

I

Na(IN)

dR

dt= δI

(1)

where β represents the infection rate and δ represents recoveryrate The population size is computed as N = S + I + RHere the social distancing term (a(IR) R2 ε [0 1]) mapsthe transition rate from a susceptible state (S) to an infectedstate (I) which is calculated by aβSI

N The social distancing models are of two types where the

first model is known as ldquolong-term awarenessrdquo in which theoccurrence of interaction of an individual with other is reducedproportionally with the cumulative percentage of affected(infectious and recovered) individuals (Eq 2)

a =

(1minus I +R

N

)k(2)

Meanwhile the second model is known as ldquoshort-term aware-nessrdquo where the reduction in interaction is directly propor-tional to the proportion of infectious individuals at a giveninstance (Eq 3)

a =

(1minus I

N

)k(3)

where k is behavior parameter defined as k ge 0 Higher valueof k implies that individuals are becoming sensitive to thedisease prevalence

In the similar background on April 16 2020 a companyLanding AI [10] under the leadership of most recognizablenames in AI Dr Andrew Ng [11] announced the creationof an AI tool to monitor social distancing at the workplaceIn a brief article the company claimed that the upcomingtool could detect if people are maintaining the safe physicaldistance from each other by analyzing real-time video streamsfrom the camera It is also claimed that this tool can easily getintegrated with existing security cameras available at differentworkplaces to maintain a safe distance among all workers Abrief demo was released that shows three steps calibrationdetection and measurement to monitor the social distancingOn April 21 2020 Gartner Inc identified Landing AI as CoolVendors in AI Core Technologies to appreciate their timelyinitiative in this revolutionary area to support the fight againstthe COVID -19 [12]

Motivated by this in this present work authors are at-tempting to check and compare the performance of popularobject detection and tracking schemes in monitoring the socialdistancing Rest of the paper structure is organized as followsSection II presents the recent work proposed in this fieldof study followed by the state-of-the-art object detectionand tracking models in Section III Later in Section IV the

deep learning based framework is proposed to monitor socialdistancing In Section V experimentation and the correspond-ing results are discussed accompanied by the outcome inSection VI In Section VII the future scope and challengesare discussed and lastly Section VIII presents the conclusionof the present research work

II BACKGROUND STUDY AND RELATED WORK

Social distancing is surely the most trustworthy techniqueto stop the spreading of infectious disease with this belief inthe background of December 2019 when COVID-19 emergedin Wuhan China it was opted as an unprecedented measureon January 23 2020 [13] Within one month the outbreakin China gained a peak in the first week of February with2000 to 4000 new confirmed cases per day Later for thefirst time after this outbreak there have been a sign of reliefwith no new confirmed cases for five consecutive days up to23 March 2020 [14] This is evident that social distancingmeasures enacted in China initially adopted worldwide laterto control COVID-19

Prem et al [15] aimed to study the effects of socialdistancing measures on the spread of the COVID-19 epi-demic Authors used synthetic location-specific contact pat-terns to simulate the ongoing trajectory of the outbreak usingsusceptible-exposed-infected-removed (SEIR) models It wasalso suggested that premature and sudden lifting of socialdistancing could lead to an earlier secondary peak whichcould be flattened by relaxing the interventions gradually [15]As we all understand social distancing though essential buteconomically painful measures to flatten the infection curveAdolph et al [16] highlighted the situation of the United Statesof America where due to lack of common consent among allpolicymakers it could not be adopted at an early stage which isresulting into on-going harm to public health Although socialdistancing impacted economic productivity many researchersare trying hard to overcome the loss Following from thiscontext Kylie et al [17] studied the correlation betweenthe strictness of social distancing and the economic statusof the region The study indicated that intermediate levelsof activities could be permitted while avoiding a massiveoutbreak

Since the novel coronavirus pandemic began many coun-tries have been taking the help of technology based solutionsin different capacities to contain the outbreak [18] [19] [20]Many developed countries including India and South Koreafor instance utilising GPS to track the movements of thesuspected or infected persons to monitor any possibility oftheir exposure among healthy people In India the governmentis using the Arogya Setu App which worked with the helpof GPS and bluetooth to locate the presence of COVID-19patients in the vicinity area It also helps others to keep asafe distance from the infected person [21] On the otherhand some law enforcement departments have been usingdrones and other surveillance cameras to detect mass gath-erings of people and taking regulatory actions to disperse thecrowd [22] [23] Such manual intervention in these criticalsituations might help flatten the curve but it also brings a

3

unique set of threats to the public and is challenging to theworkforce

Human detection using visual surveillance system is anestablished area of research which is relying upon manualmethods of identifying unusual activities however it haslimited capabilities [24] In this direction recent advancementsadvocate the need for intelligent systems to detect and capturehuman activities Although human detection is an ambitiousgoal due to a variety of constraints such as low-resolutionvideo varying articulated pose clothing lighting and back-ground complexities and limited machine vision capabilitieswherein prior knowledge on these challenges can improve thedetection performance [25]

Detecting an object which is in motion incorporates twostages object detection [26] and object classification [27] Theprimary stage of object detection could be achieved by usingbackground subtraction [28] optical flow [29] and spatio-temporal filtering techniques [30] In the background subtrac-tion method [31] the difference between the current frameand a background frame (first frame) at pixel or block level iscomputed Adaptive Gaussian mixture temporal differencinghierarchical background models warping background andnon-parametric background are the most popular approachesof background subtraction [32] In optical flow-based objectdetection technique [29] flow vectors associated with theobjectrsquos motion are characterised over a time span in order toidentify regions in motion for a given sequence of images [33]Researchers reported that optical flow based techniques consistof computational overheads and are sensitive to various motionrelated outliers such as noise colour and lighting etc [34]In another method of motion detection Aslani et al [30]proposed spatio-temporal filter based approach in which themotion parameters are identified by using three-dimensional(3D) spatio-temporal features of the person in motion in theimage sequence These methods are advantageous due to itssimplicity and less computational complexity however showslimited performance because of noise and uncertainties onmoving patterns [35]

Object detection problems have been efficiently addressedby recently developed advanced techniques In the last decadeconvolutional neural networks (CNN) region-based CNN [36]and faster region-based CNN [37] used region proposal tech-niques to generate the objectness score prior to its classifica-tion and later generates the bounding boxes around the objectof interest for visualization and other statistical analysis [38]Although these methods are efficient but suffer in termsof larger training time requirements [39] Since all theseCNN based approaches utilize classification another approachYOLO considers a regression based method to dimension-ally separate the bounding boxes and interpret their classprobabilities [40] In this method the designed frameworkefficiently divides the image into several portions representingbounding boxes along with the class probability scores foreach portion to consider as an object This approach offersexcellent improvements in terms of speed while trading thegained speed with the efficiency The detector module exhibitspowerful generalization capabilities of representing an entireimage [41]

Based on the above concepts many research findings havebeen reported in the last few years Crowd counting emerged asa promising area of research with many societal applicationsEshel et al [42] focused on crowd detection and personcount by proposing multiple height homographies for headtop detection and solved the occlusions problem associatedwith video surveillance related applications Chen et al [43]developed an electronic advertising application based on theconcept of crowd counting In similar application Chih-Wenet al [44] proposed a vision-based people counting modelFollowing this Yao et al [45] generated inputs from stationarycameras to perform background subtraction to train the modelfor the appearance and the foreground shape of the crowd invideos

Once an object is detected classification techniques canbe applied to identify a human on the basis of shape tex-ture or motion-based features In shape-based methods theshape related information of moving regions such as pointsboxes and blobs are determined to identify the human Thismethod performs poorly due to certain limitations in stan-dard template-matching schemes [46] [47] which is furtherenhanced by applying part-based template matching [48] ap-proach In another research Dalal et al [49] proposed texture-based schemes such as histograms of oriented gradient (HOG)which utilises high dimensional features based on edges alongwith the support vector machine (SVM) to detect humans

According to recent research further identification of aperson through video surveillance can be done by usingface [50] [51] and gait recognition [52] techniques Howeverdetection and tracking of people under crowd are difficultsometimes due to partial or full occlusion problems Leibeet al [53] proposed trajectory estimation based solution whileAndriluka et al [54] proposed a solution to detect partiallyoccluded people using tracklet-based detectors Many othertracking techniques including a variety of object and motionrepresentations are reviewed by Yilmaz et al [55]

A large number of studies are available in the area of videosurveillance Among many publically available datasets KTHhuman motion dataset [56] shows six categories of activitieswhereas INRIA XMAS multi-view dataset [57] and Weizmannhuman action dataset [58] contain 11 and 10 categories ofactions respectively Another dataset named as performanceevaluation of tracking and surveillance (PETS) is proposedby a group of researchers at university of Oxford [59] Thisdataset is available for vision based research comprising alarge number of datasets for varying tasks in the field ofcomputer vision In the present research in order to fine-tunethe object detection and tracking models for identifying theperson open images datasets [60] are considered It is a col-lection of 19957 classes out of which the models are trainedfor the identification of a person The images are annotatedwith image-level labels and corresponding coordinates of thebounding boxes representing the person Furthermore the finetuned proposed framework is simulated on the Oxford towncenter surveillance footage [23] to monitor social distancing

We believe that having a single dataset with unified an-notations for image classification object detection visualrelationship detection instance segmentation and multimodal

4

Fig 2 Performance overview of the most popular objectdetection models on PASCAL-VOC and MS-COCO datasets

image descriptions will enable us to study and perform ob-ject detection tasks efficiently and stimulate progress towardsgenuine understanding of the scene All explored literatureand related research work clearly establishes a picture thatthe application of human detection can easily get extended tomany applications to cater the situation that arises presentlysuch as to check prescribed standards for hygiene socialdistancing work practices etc

III OBJECT DETECTION AND TRACKING MODELS

As observed from Fig 2 the successful object detectionmodels like RCNN [61] fast RCNN [62] faster RCNN [38]SSD [63] YOLO v1 [40] YOLO v2 [64] and YOLO v3 [65]tested on PASCAL-VOC [66] and MS-COCO [67] datasetsundergo trade-off between speed and accuracy of the detec-tion which is dependent on various factors like backbonearchitecture (feature extraction network eg VGG-16 [68]ResNet-101 [69] Inception v2 [70] etc) input sizes modeldepth varying software and hardware environment A featureextractor tends to encode the modelrsquos input into certain featurerepresentation which aids in learning and discovering thepatterns associated with the desired objects In order to identifymultiple objects of varying scale or size it also uses predefinedboxes covering an entire image termed as anchor boxes Table Idescribes the performance in terms of accuracy for each ofthese popular and powerful feature extraction networks onILSVRC ImageNet challenge [71] along with the numberof trainable parameters which have a direct impact on thetraining speed and time As highlighted in Table I the ratio ofaccuracy to the number of parameters is highest for Inceptionv2 model indicating that Inception v2 achieved adequateclassification accuracy with minimal trainable parameters incontrast to other models and hence is utilized as a backbonearchitecture for faster and efficient computations in the fasterRCNN and SSD object detection models whereas YOLO v3uses different architecture Darknet-53 as proposed by Redmonet al [65]

A Anchor boxes

With the exhaustive literature survey it is observed thatevery popular object detection model utilizes the concept ofanchor boxes to detect multiple objects in the scene [36] Theseboxes are overlaid on the input image over various spatial

TABLE I Performance of the feature extraction network onImageNet challenge

Backbone model Accuracy (a) Parameters (p) Ratio (a100p)VGG-16 [68] 071 15 M 473ResNet-101 [69] 076 425 M 178Inception v2 [70] 074 10 M 740Inception v3 [72] 078 22 M 358Resnet v2 [72] 080 54 M 148

TABLE II Hyperparameters for generating the anchor boxes

Detectionmodel

Size vector(p)

Aspect ratio(r)

Anchorboxes

IoU thfor NMS

FasterRCNN [025 05 10] [05 10 20] 9 07

SSD [02 057 095] [03 05 10] 9 06YOLO v3 [025 05 10] [05 10 20] 9 07

locations (per filter) with varying sizes and aspect ratio Inthis article for an image of dimension breadth (b) times height(h) the anchor boxes are generated in the following mannerConsider the parameters size as p ε (0 1] and aspect ratioas r gt 0 then the anchor boxes for a certain location in animage can be constructed with dimensions as bp

radicr times hp

radicr

Table II shows the values of p and r configured for each modelLater the object detection model is trained to predict for eachgenerated anchor box to belong to a certain class and an offsetto adjust the dimensions of the anchor box to better fit theground-truth of the object while using the classification andregression loss Since there are many anchor boxes for a spatiallocation the object can get associated with more than oneanchor box This problem is dealt with non-max suppression(NMS) by computing intersection over union (IoU) parameterthat limits the anchor boxes association with the object ofinterest by calculating the score as the ratio of overlappingregions between the assigned anchor box and the ground-truthto the union of regions of the anchor box and the ground-truth The score value is then compared with the set thresholdhyperparameter to return the best bounding box for an object

1) Loss Function With each step of model training pre-dicted anchor box lsquoarsquo is assigned a label as positive (1) ornegative (0) based on its associativity with the object ofinterest having ground-truth box lsquogrsquo The positive anchor boxis then assigned a class label yo ε c1 c2 cn here cnindicates the category of the nth object while also generatingthe encoding vector for box lsquogrsquo with respect to lsquoarsquo as f(ga|a)where yo = 0 for negative anchor boxes Consider an imageI for some anchor lsquoarsquo model with trained parameters ωpredicted the object class as Ycls(I|aω) and the correspond-ing box offset as Yreg(I|aω) then the loss for a singleanchor prediction can be computed (Lcls) and bounding boxregression loss (Lreg) as given by the Eq 4

L(a|Iω) = α1obja Lreg(f(ga|a)minus Yreg(I|aω))+βLcls(ya Ycls(I|aω))

(4)

where 1obja is 1 if lsquoarsquo is a positive anchor α and β arethe weights associated with the regression and classificationloss Later the overall loss of the model can be computed as

5

Fig 3 Schematic representation of faster RCNN architecture

the average of the L(a|Iw) over the predictions for all theanchors

B Faster RCNN

Proposed by Ren et al [38] the faster RCNN is derivedfrom its predecessors RCNN [61] and fast RCNN [62] whichrely on external region proposal approach based on selectivesearch (SS) [73] Many researchers [74] [75] [76] observedthat instead of using the SS it is recommended to utilizethe advantages of convolution layers for better and fasterlocalization of the objects Hence Ren et al proposed theRegion Proposal Network (RPN) which uses CNN modelseg VGGNet ResNet etc to generate the region proposalsthat made faster RCNN 10 times faster than fast RCNNFig 3 shows the schematic representation of faster RCNNarchitecture where RPN module performs binary classificationof an object or not an object (background) while classificationmodule assigns categories for each detected object (multi-class classification) by using the region of interest (RoI)pooling [38] on the extracted feature maps with projectedregions

1) Loss function The faster RCNN is the combinationof two modules RPN and fast RCNN detector The overallmulti-task loss function is composed of classification loss andbounding box regression loss as defined in Eq 4 with Lclsand Lreg functions defined in Eq 5

Lcls(pi plowasti ) = minusplowasti log(pi)minus (1minus plowasti ) log(1minus pi)

Lreg(tu v) =

sumxεxywh

Lsmooth1 (tui minus v)

Lsmooth1 (q) =

05q2 if | q |lt 1

| q | minus05 otherwise

(5)

where tu is the predicted corrections of the bounding box tu =tux tuy tuw tuh Here u is a true class label (x y) correspondsto the top-left coordinates of the bounding box with height hand width w v is a ground-truth bounding box plowasti is thepredicted class and pi is the actual class

C Single Shot Detector (SSD)

In this research single shot detector (SSD) [63] is also usedas another object identification method to detect people in real-time video surveillance system As discussed earlier fasterR-CNN works on region proposals to create boundary boxesto indicate objects shows better accuracy but has slow pro-cessing of frames per second (FPS) For real-time processingSSD further improves the accuracy and FPS by using multi-scale features and default boxes in a single process It follows

Fig 4 Schematic representation of SSD architecture

the principle of the feed-forward convolution network whichgenerates bounding boxes of fixed sizes along with a scorebased on the presence of object class instances in those boxesfollowed by NMS step to produce the final detections Thusit consists of two steps extracting feature maps and applyingconvolution filters to detect objects by using an architecturehaving three main parts First part is a base pretrained networkto extract feature maps whereas in the second part multi-scale feature layers are used in which series of convolutionfilters are cascaded after the base network The last part isa non-maximum suppression unit for eliminating overlappingboxes and one object only per box The architecture of SSDis shown in Fig 4

1) Loss function Similar to the above discussed fasterRCNN model the overall loss function of the SSD modelis equal to the sum of multi-class classification loss (Lcls)and bounding box regression loss (localization loss Lreg) asshown in Eq 4 where Lreg and Lcls is defined by Eq 6 and 7

Lreg(x l g) =

Nsumiεpos

summεcxcywh

xkijsmoothL1(lmi minus gmj )

gcxj =(gcxj minus a

cxi )

awi g

cyj =

(gcyj minus a

cyi )

ahi

gwj = log

(gwjawi

) ghj = log

(ghjahi

)

xpij =

1 if IoU gt 05

0 otherwise

(6)

where l is the predicted box g is the ground truth box xpij isan indicator that matches the ith anchor box to the jth groundtruth box cx and cy are offsets to the anchor box a

Lcls(x c) = minusNsum

iεPos

xpij log(cpi )minus

sumiεNeg

log(coi ) (7)

where cpi =exp cpisump exp cpi

and N is the number of default matchedboxes

D YOLO

For object detection another competitor of SSD isYOLO [40] This method can predict the type and location ofan object by looking only once at the image YOLO considersthe object detection problem as a regression task insteadof classification to assign class probabilities to the anchorboxes A single convolutional network simultaneously predicts

6

Fig 5 Schematic representation of YOLO v3 architecture

multiple bounding boxes and class probabilities Majorly thereare three versions of YOLO v1 v2 and v3 YOLO v1 isinspired by GoogleNet (Inception network) which is designedfor object classification in an image This network consistsof 24 convolutional layers and 2 fully connected layersInstead of the Inception modules used by GoogLeNet YOLOv1 simply uses a reduction layer followed by convolutionallayers Later YOLO v2 [64] is proposed with the objectiveof improving the accuracy significantly while making it fasterYOLO v2 uses Darknet-19 as a backbone network consistingof 19 convolution layers along with 5 max pooling layersand an output softmax layer for object classification YOLOv2 outperformed its predecessor (YOLO v1) with significantimprovements in mAP FPS and object classification scoreIn contrast YOLO v3 performs multi-label classification withthe help of logistic classifiers instead of using softmax asin case of YOLO v1 and v2 In YOLO v3 Redmon et alproposed Darknet-53 as a backbone architecture that extractsfeatures maps for classification In contrast to Darknet-19Darknet-53 consists of residual blocks (short connections)along with the upsampling layers for concatenation and addeddepth to the network YOLO v3 generates three predictionsfor each spatial location at different scales in an image whicheliminates the problem of not being able to detect small objectsefficiently [77] Each prediction is monitored by computingobjectness boundary box regressor and classification scoresIn Fig 5 a schematic description of the YOLOv3 architectureis presented

1) Loss function The overall loss function of YOLO v3consists of localization loss (bounding box regressor) crossentropy and confidence loss for classification score definedas follows

λcoord

S2sumi=0

Bsumj=0

1objij ((tx minus tx)2+ (ty minus ty)

2+ (tw minus tw)

2+

(th minus th)2)

+

S2sumi=0

Bsumj=0

1objij (minus log(σ(to)) +

Csumk=1

BCE(yk σ(sk)))

+λnoobj

S2sumi=0

Bsumj=0

1noobjij (minus log(1minus σ(to))

(8)where λcoord indicates the weight of the coordinate errorS2 indicates the number of grids in the image and B isthe number of generated bounding boxes per grid 1objij = 1

describes that object confines in the jth bounding box in grid

i otherwise it is 0

E Deepsort

Deepsort is a deep learning based approach to track customobjects in a video [78] In the present research Deepsort isutilized to track individuals present in the surveillance footageIt makes use of patterns learned via detected objects in theimages which is later combined with the temporal informationfor predicting associated trajectories of the objects of interestIt keeps track of each object under consideration by mappingunique identifiers for further statistical analysis Deepsort isalso useful to handle associated challenges such as occlusionmultiple viewpoints non-stationary cameras and annotatingtraining data For effective tracking the Kalman filter andthe Hungarian algorithm are used Kalman filter is recursivelyused for better association and it can predict future positionsbased on the current position Hungarian algorithm is used forassociation and id attribution that identifies if an object in thecurrent frame is the same as the one in the previous frameInitially a Faster RCNN is trained for person identification andfor tracking a linear constant velocity model [79] is utilized todescribe each target with eight dimensional space as follows

x = [u v λ h x y λ h]T (9)

where (u v) is the centroid of the bounding box a is the aspectratio and h is the height of the image The other variables arethe respective velocities of the variables Later the standardKalman filter is used with constant velocity motion and linearobservation model where the bounding coordinates (u v λ h)are taken as direct observations of the object state

For each track k starting from the last successful measure-ment association ak the total number of frames are calculatedWith positive prediction from the Kalman filter the counteris incremented and later when the track gets associated witha measurement it resets its value to 0 Furthermore if theidentified tracks exceed a predefined maximum age thenthose objects are considered to have left the scene and thecorresponding track gets removed from the track set Andif there are no tracks available for some detected objectsthen new track hypotheses are initiated for each unidentifiedtrack of novel detected objects that cannot be mapped to theexisting tracks For the first three frames the new tracks areclassified as indefinite until a successful measurement map-ping is computed If the tracks are not successfully mappedwith measurement then it gets deleted from the track setHungarian algorithm is then utilized in order to solve themapping problem between the newly arrived measurementsand the predicted Kalman states by considering the motion andappearance information with the help of Mahalanobis distancecomputed between them as defined in Eq 10

d(1)(i j) = (dj minus yi)TSminus1i (dj minus yi) (10)

where the projection of the ith track distribution into measure-ment space is represented by (yi Si) and the jth boundingbox detection by dj The Mahalanobis distance considers thisuncertainty by estimating the count of standard deviationsthe detection is away from the mean track location Further

7

using this metric it is possible to exclude unlikely associationsby thresholding the Mahalanobis distance This decision isdenoted with an indicator that evaluates to 1 if the associationbetween the ith track and jth detection is admissible (Eq 11)

b(1)ij = 1[d(1)(i j) lt t(1)] (11)

Though Mahalanobis distance performs efficiently but failsin the environment where camera motion is possible therebyanother metric is introduced for the assignment problem Thissecond metric measures the smallest cosine distance betweenthe ith track and jth detection in appearance space as follows

d(2)(i j) = min1minus rjT rk(i) | rk(i) ε R2 (12)

Again a binary variable is introduced to indicate if an asso-ciation is admissible according to the following metric

b(1)ij = 1[d(2)(i j) lt t(2)] (13)

and a suitable threshold is measured for this indicator on aseparate training dataset To build the association problemboth metrics are combined using a weighted sum

cij = λd(1)(i j) + (1minus λ)d(2)(i j) (14)

where an association is admissible if it is within the gatingregion of both metrics

bij =prodm=1

2b(m)ij (15)

The influence of each metric on the combined association costcan be controlled through hyperparameter λ

IV PROPOSED APPROACH

The emergence of deep learning has brought the best per-forming techniques for a wide variety of tasks and challengesincluding medical diagnosis [74] machine translation [75]speech recognition [76] and a lot more [80] Most of thesetasks are centred around object classification detection seg-mentation tracking and recognition [81] [82] In recent yearsthe convolution neural network (CNN) based architectureshave shown significant performance improvements that areleading towards the high quality of object detection as shownin Fig 2 which presents the performance of such modelsin terms of mAP and FPS on standard benchmark datasetsPASCAL-VOC [66] and MS-COCO [67] and similar hard-ware resources

In the present article a deep learning based framework isproposed that utilizes object detection and tracking modelsto aid in the social distancing remedy for dealing with theescalation of COVID-19 cases In order to maintain thebalance of speed and accuracy YOLO v3 [65] alongside theDeepsort [78] are utilized as object detection and trackingapproaches while surrounding each detected object with thebounding boxes Later these bounding boxes are utilized tocompute the pairwise L2 norm with computationally efficientvectorized representation for identifying the clusters of peoplenot obeying the order of social distancing Furthermore tovisualize the clusters in the live stream each bounding box

is color-coded based on its association with the group wherepeople belonging to the same group are represented withthe same color Each surveillance frame is also accompaniedwith the streamline plot depicting the statistical count of thenumber of social groups and an index term (violation index)representing the ratio of the number of people to the numberof groups Furthermore estimated violations can be computedby multiplying the violation index with the total number ofsocial groups

A Workflow

This section includes the necessary steps undertaken tocompose a framework for monitoring social distancing

1 Fine-tune the trained object detection model to identifyand track the person in a footage

2 The trained model is feeded with the surveillance footageThe model generates a set of bounding boxes and an IDfor each identified person

3 Each individual is associated with three-dimensional fea-ture space (x y d) where (x y) corresponds to thecentroid coordinates of the bounding box and d definesthe depth of the individual as observed from the camera

d = ((2 lowast 314 lowast 180)(w + h lowast 360) lowast 1000 + 3) (16)

where w is the width of the bounding box and h is theheight of the bounding box [83]

4 For the set of bounding boxes pairwise L2 norm iscomputed as given by the following equation

||D||2 =

radicradicradicradic nsumi=1

(qi minus pi)2 (17)

where in this work n = 35 The dense matrix of L2 norm is then utilized to assign the

neighbors for each individual that satisfies the closenesssensitivity With extensive trials the closeness thresholdis updated dynamically based on the spatial location ofthe person in a given frame ranging between (90 170)pixels

6 Any individual that meets the closeness property isassigned a neighbour or neighbours forming a grouprepresented in a different color coding in contrast to otherpeople

7 The formation of groups indicates the violation of thepractice of social distancing which is quantified with helpof the following

ndash Consider ng as number of groups or clusters identi-fied and np as total number of people found in closeproximity

ndash vi = npng where vi is the violation index

V EXPERIMENTS AND RESULTS

The above discussed object detection models are fine tunedfor binary classification (person or not a person) with Inceptionv2 as a backbone network on the Nvidia GTX 1060 GPUusing the dataset acquired from the open image dataset (OID)

8

Fig 6 Data samples showing (a) true samples and (b) falsesamples of a ldquoPersonrdquo class from the open image dataset

Fig 7 Losses per iteration of the object detection modelsduring the training phase on the OID validation set fordetecting the person in an image

repository [73] maintained by the Google open source com-munity The diverse images with a class label as ldquoPersonrdquo aredownloaded via OIDv4 toolkit [84] along with the annotationsFig 6 shows the sample images of the obtained dataset con-sisting of 800 images which is obtained by manually filteringto only contain the true samples The dataset is then dividedinto training and testing sets in 82 ratio In order to make thetesting robust the testing set is also accompanied by the framesof surveillance footage of the Oxford town center [23] Laterthis footage is also utilized to simulate the overall approach formonitoring the social distancing In case of faster RCNN theimages are resized to P pixels on the shorter edge with 600and 1024 for low and high resolution while in SSD and YOLOthe images are scaled to the fixed dimension P times P with Pvalue as 416 During the training phase the performance of themodels is continuously monitored using the mAP along withthe localization classification and overall loss in the detectionof the person as indicated in Fig 7 Table III summarizes theresults of each model obtained at the end of the training phasewith the training time (TT) number of iterations (NoI) mAPand total loss (TL) value It is observed that the faster RCNNmodel achieved minimal loss with maximum mAP howeverhas the lowest FPS which makes it not suitable for real-timeapplications Furthermore as compared to SSD YOLO v3achieved better results with balanced mAP training time andFPS score The trained YOLO v3 model is then utilized formonitoring the social distancing on the surveillance video

TABLE III Performance comparison of the object detectionmodels

Model TT (in sec) NoI mAP TL FPSFaster RCNN 9651 12135 0969 002 3SSD 2124 1200 0691 022 10YOLO v3 5659 7560 0846 087 23

Fig 8 Sample output of the proposed framework for monitor-ing social distancing on surveillance footage of Oxford TownCenter

VI OUTPUT

The proposed framework outputs (as shown in Fig 8) theprocessed frame with the identified people confined in thebounding boxes while also simulating the statistical analysisshowing the total number of social groups displayed by samecolor encoding and a violation index term computed as theratio of the number of people to the number of groups Theframes shown in Fig 8 displays violation index as 3 2 2 and233 The frames with detected violations are recorded withthe timestamp for future analysis

VII FUTURE SCOPE AND CHALLENGES

Since this application is intended to be used in any workingenvironment accuracy and precision are highly desired toserve the purpose Higher number of false positive may raisediscomfort and panic situation among people being observedThere may also be genuinely raised concerns about privacy andindividual rights which can be addressed with some additionalmeasures such as prior consents for such working environ-ments hiding a personrsquos identity in general and maintainingtransparency about its fair uses within limited stakeholders

VIII CONCLUSION

The article proposes an efficient real-time deep learningbased framework to automate the process of monitoring thesocial distancing via object detection and tracking approacheswhere each individual is identified in the real-time with the

9

help of bounding boxes The generated bounding boxes aidin identifying the clusters or groups of people satisfyingthe closeness property computed with the help of pairwisevectorized approach The number of violations are confirmedby computing the number of groups formed and violationindex term computed as the ratio of the number of peopleto the number of groups The extensive trials were conductedwith popular state-of-the-art object detection models FasterRCNN SSD and YOLO v3 where YOLO v3 illustratedthe efficient performance with balanced FPS and mAP scoreSince this approach is highly sensitive to the spatial locationof the camera the same approach can be fine tuned to betteradjust with the corresponding field of view

ACKNOWLEDGMENT

The authors gratefully acknowledge the helpful commentsand suggestions of colleagues Authors are also indebted toInterdisciplinary Cyber Physical Systems (ICPS) ProgrammeDepartment of Science and Technology (DST) Governmentof India (GoI) vide Reference No244 for their financialsupport to carry out the background research which helpedsignificantly for the implementation of present research work

REFERENCES

[1] W H Organization ldquoWHO corona-viruses (COVID-19)rdquo httpswwwwhointemergenciesdiseasesnovel-corona-virus-2019 2020 [Onlineaccessed May 02 2020]

[2] WHO ldquoWho director-generalrsquos opening remarks at the media briefingon covid-19-11 march 2020rdquo httpswwwwhointdgspeechesdetail2020 [Online accessed March 12 2020]

[3] L Hensley ldquoSocial distancing is out physical distancing is inmdashherersquoshow to do itrdquo Global NewsndashCanada (27 March 2020) 2020

[4] ECDPC ldquoConsiderations relating to social distancing measures inresponse to COVID-19 ndash second updaterdquo httpswwwecdceuropaeuenpublications-dataconsiderations 2020 [Online accessed March 232020]

[5] M W Fong H Gao J Y Wong J Xiao E Y Shiu S Ryu andB J Cowling ldquoNonpharmaceutical measures for pandemic influenza innonhealthcare settingsmdashsocial distancing measuresrdquo 2020

[6] F Ahmed N Zviedrite and A Uzicanin ldquoEffectiveness of workplacesocial distancing measures in reducing influenza transmission a system-atic reviewrdquo BMC public health vol 18 no 1 p 518 2018

[7] W O Kermack and A G McKendrick ldquoContributions to the mathe-matical theory of epidemicsndashi 1927rdquo 1991

[8] C Eksin K Paarporn and J S Weitz ldquoSystematic biases in diseaseforecastingndashthe role of behavior changerdquo Epidemics vol 27 pp 96ndash105 2019

[9] M Zhao and H Zhao ldquoAsymptotic behavior of global positive solutionto a stochastic sir model incorporating media coveragerdquo Advances inDifference Equations vol 2016 no 1 pp 1ndash17 2016

[10] P Alto ldquoLanding AI Named an April 2020 Cool Vendor in the Gart-ner Cool Vendors in AI Core Technologiesrdquo httpswwwyahoocomlifestylelanding-ai-named-april-2020-152100532html 2020 [Onlineaccessed April 21 2020]

[11] A Y Ng ldquoCurriculum Vitaerdquo httpsaistanfordedusimangcurriculum-vitaepdf

[12] L AI ldquoLanding AI Named an April 2020 Cool Vendor in the GartnerCool Vendors in AI Core Technologiesrdquo httpswwwprnewswirecomnews-releases 2020 [Online accessed April 22 2020]

[13] B News ldquoChina coronavirus Lockdown measures rise across Hubeiprovincerdquo httpswwwbbccouknewsworld-asia-china512174552020 [Online accessed January 23 2020]

[14] N H C of the Peoplersquos Republic of China ldquoDaily briefing on novelcoronavirus cases in Chinardquo httpennhcgovcn2020-0320c 78006htm 2020 [Online accessed March 20 2020]

[15] K Prem Y Liu T W Russell A J Kucharski R M Eggo N DaviesS Flasche S Clifford C A Pearson J D Munday et al ldquoThe effectof control strategies to reduce social mixing on outcomes of the covid-19 epidemic in wuhan china a modelling studyrdquo The Lancet PublicHealth 2020

[16] C Adolph K Amano B Bang-Jensen N Fullman and J WilkersonldquoPandemic politics Timing state-level social distancing responses tocovid-19rdquo medRxiv 2020

[17] K E Ainslie C E Walters H Fu S Bhatia H Wang X XiM Baguelin S Bhatt A Boonyasiri O Boyd et al ldquoEvidence ofinitial success for china exiting covid-19 social distancing policy afterachieving containmentrdquo Wellcome Open Research vol 5 no 81 p 812020

[18] S K Sonbhadra S Agarwal and P Nagabhushan ldquoTarget specificmining of covid-19 scholarly articles using one-class approachrdquo 2020

[19] N S Punn and S Agarwal ldquoAutomated diagnosis of covid-19 withlimited posteroanterior chest x-ray images using fine-tuned deep neuralnetworksrdquo 2020

[20] N S Punn S K Sonbhadra and S Agarwal ldquoCovid-19 epidemicanalysis using machine learning and deep learning algorithmsrdquomedRxiv 2020 [Online] Available httpswwwmedrxivorgcontentearly202004112020040820057679

[21] O website of Indian Government ldquoDistribution of the novelcoronavirus-infected pneumoni Aarogya Setu Mobile Apprdquo httpswwwmygovinaarogya-setu-app 2020

[22] M Robakowska A Tyranska-Fobke J Nowak D Slezak P ZuratynskiP Robakowski K Nadolny and J R Ładny ldquoThe use of drones duringmass eventsrdquo Disaster and Emergency Medicine Journal vol 2 no 3pp 129ndash134 2017

[23] J Harvey Adam LaPlace (2019) Megapixelscc Origins ethicsand privacy implications of publicly available face recognition imagedatasets [Online] Available httpsmegapixelscc

[24] N Sulman T Sanocki D Goldgof and R Kasturi ldquoHow effectiveis human video surveillance performancerdquo in 2008 19th InternationalConference on Pattern Recognition IEEE 2008 pp 1ndash3

[25] X Wang ldquoIntelligent multi-camera video surveillance A reviewrdquo Pat-tern recognition letters vol 34 no 1 pp 3ndash19 2013

[26] K A Joshi and D G Thakore ldquoA survey on moving object detectionand tracking in video surveillance systemrdquo International Journal of SoftComputing and Engineering vol 2 no 3 pp 44ndash48 2012

[27] O Javed and M Shah ldquoTracking and object classification for automatedsurveillancerdquo in European Conference on Computer Vision Springer2002 pp 343ndash357

[28] S Brutzer B Hoferlin and G Heidemann ldquoEvaluation of backgroundsubtraction techniques for video surveillancerdquo in CVPR 2011 IEEE2011 pp 1937ndash1944

[29] S Aslani and H Mahdavi-Nasab ldquoOptical flow based moving objectdetection and tracking for traffic surveillancerdquo International Journal ofElectrical Computer Energetic Electronic and Communication Engi-neering vol 7 no 9 pp 1252ndash1256 2013

[30] P Dollar V Rabaud G Cottrell and S Belongie ldquoBehavior recognitionvia sparse spatio-temporal featuresrdquo in 2005 IEEE International Work-shop on Visual Surveillance and Performance Evaluation of Trackingand Surveillance IEEE 2005 pp 65ndash72

[31] M Piccardi ldquoBackground subtraction techniques a reviewrdquo in 2004IEEE International Conference on Systems Man and Cybernetics (IEEECat No 04CH37583) vol 4 IEEE 2004 pp 3099ndash3104

[32] Y Xu J Dong B Zhang and D Xu ldquoBackground modeling methodsin video analysis A review and comparative evaluationrdquo CAAI Trans-actions on Intelligence Technology vol 1 no 1 pp 43ndash60 2016

[33] H Tsutsui J Miura and Y Shirai ldquoOptical flow-based person trackingby multiple camerasrdquo in Conference Documentation International Con-ference on Multisensor Fusion and Integration for Intelligent SystemsMFI 2001 (Cat No 01TH8590) IEEE 2001 pp 91ndash96

[34] A Agarwal S Gupta and D K Singh ldquoReview of optical flowtechnique for moving object detectionrdquo in 2016 2nd InternationalConference on Contemporary Computing and Informatics (IC3I) IEEE2016 pp 409ndash413

[35] S A Niyogi and E H Adelson ldquoAnalyzing gait with spatiotemporalsurfacesrdquo in Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects IEEE 1994 pp 64ndash69

[36] Z-Q Zhao P Zheng S-t Xu and X Wu ldquoObject detection with deeplearning A reviewrdquo IEEE transactions on neural networks and learningsystems vol 30 no 11 pp 3212ndash3232 2019

[37] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neural infor-mation processing systems 2012 pp 1097ndash1105

10

[38] S Ren K He R Girshick and J Sun ldquoFaster r-cnn Towards real-timeobject detection with region proposal networksrdquo in Advances in neuralinformation processing systems 2015 pp 91ndash99

[39] X Chen and A Gupta ldquoAn implementation of faster rcnn with studyfor region samplingrdquo arXiv preprint arXiv170202138 2017

[40] J Redmon S Divvala R Girshick and A Farhadi ldquoYou only lookonce Unified real-time object detectionrdquo in Proceedings of the IEEEconference on computer vision and pattern recognition 2016 pp 779ndash788

[41] M Putra Z Yussof K Lim and S Salim ldquoConvolutional neuralnetwork for person and car detection using yolo frameworkrdquo Journalof Telecommunication Electronic and Computer Engineering (JTEC)vol 10 no 1-7 pp 67ndash71 2018

[42] R Eshel and Y Moses ldquoHomography based multiple camera detectionand tracking of people in a dense crowdrdquo in 2008 IEEE Conference onComputer Vision and Pattern Recognition IEEE 2008 pp 1ndash8

[43] D-Y Chen C-W Su Y-C Zeng S-W Sun W-R Lai and H-Y M Liao ldquoAn online people counting system for electronic advertisingmachinesrdquo in 2009 IEEE International Conference on Multimedia andExpo IEEE 2009 pp 1262ndash1265

[44] C-W Su H-Y M Liao and H-R Tyan ldquoA vision-based peoplecounting approach based on the symmetry measurerdquo in 2009 IEEEInternational Symposium on Circuits and Systems IEEE 2009 pp2617ndash2620

[45] J Yao and J-M Odobez ldquoFast human detection from joint appearanceand foreground feature subset covariancesrdquo Computer Vision and ImageUnderstanding vol 115 no 10 pp 1414ndash1426 2011

[46] B Wu and R Nevatia ldquoDetection and tracking of multiple partiallyoccluded humans by bayesian combination of edgelet based part de-tectorsrdquo International Journal of Computer Vision vol 75 no 2 pp247ndash266 2007

[47] F Z Eishita A Rahman S A Azad and A Rahman ldquoOcclusionhandling in object detectionrdquo in Multidisciplinary Computational Intelli-gence Techniques Applications in Business Engineering and MedicineIGI Global 2012 pp 61ndash74

[48] M Singh A Basu and M K Mandal ldquoHuman activity recognitionbased on silhouette directionalityrdquo IEEE transactions on circuits andsystems for video technology vol 18 no 9 pp 1280ndash1292 2008

[49] N Dalal and B Triggs ldquoHistograms of oriented gradients for humandetectionrdquo in 2005 IEEE computer society conference on computervision and pattern recognition (CVPRrsquo05) vol 1 IEEE 2005 pp886ndash893

[50] P Huang A Hilton and J Starck ldquoShape similarity for 3d videosequences of peoplerdquo International Journal of Computer Vision vol 89no 2-3 pp 362ndash381 2010

[51] A Samal and P A Iyengar ldquoAutomatic recognition and analysis ofhuman faces and facial expressions A surveyrdquo Pattern recognitionvol 25 no 1 pp 65ndash77 1992

[52] D Cunado M S Nixon and J N Carter ldquoUsing gait as a biometricvia phase-weighted magnitude spectrardquo in International Conference onAudio-and Video-Based Biometric Person Authentication Springer1997 pp 93ndash102

[53] B Leibe E Seemann and B Schiele ldquoPedestrian detection in crowdedscenesrdquo in 2005 IEEE Computer Society Conference on Computer Visionand Pattern Recognition (CVPRrsquo05) vol 1 IEEE 2005 pp 878ndash885

[54] M Andriluka S Roth and B Schiele ldquoPeople-tracking-by-detectionand people-detection-by-trackingrdquo in 2008 IEEE Conference on com-puter vision and pattern recognition IEEE 2008 pp 1ndash8

[55] A Yilmaz O Javed and M Shah ldquoObject tracking A surveyrdquo Acmcomputing surveys (CSUR) vol 38 no 4 pp 13ndashes 2006

[56] C Schuldt I Laptev and B Caputo ldquoRecognizing human actions alocal svm approachrdquo in Proceedings of the 17th International Confer-ence on Pattern Recognition 2004 ICPR 2004 vol 3 IEEE 2004pp 32ndash36

[57] D Weinland R Ronfard and E Boyer ldquoFree viewpoint action recog-nition using motion history volumesrdquo Computer vision and imageunderstanding vol 104 no 2-3 pp 249ndash257 2006

[58] M Blank L Gorelick E Shechtman M Irani and R Basri ldquoActionsas space-time shapesrdquo in Tenth IEEE International Conference onComputer Vision (ICCVrsquo05) Volume 1 vol 2 IEEE 2005 pp 1395ndash1402

[59] O Parkhi A Vedaldi A Zisserman and C Jawahar ldquoThe oxford-iiitpet datasetrdquo 2012

[60] A Kuznetsova H Rom N Alldrin J Uijlings I Krasin J Pont-TusetS Kamali S Popov M Malloci A Kolesnikov et al ldquoThe open imagesdataset v4rdquo International Journal of Computer Vision pp 1ndash26 2020

[61] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic segmentationrdquoin Proceedings of the IEEE conference on computer vision and patternrecognition 2014 pp 580ndash587

[62] R Girshick ldquoFast r-cnnrdquo in Proceedings of the IEEE internationalconference on computer vision 2015 pp 1440ndash1448

[63] W Liu D Anguelov D Erhan C Szegedy S Reed C-Y Fu and A CBerg ldquoSsd Single shot multibox detectorrdquo in European conference oncomputer vision Springer 2016 pp 21ndash37

[64] J Redmon and A Farhadi ldquoYolo9000 better faster strongerrdquo inProceedings of the IEEE conference on computer vision and patternrecognition 2017 pp 7263ndash7271

[65] J R A Farhadi and J Redmon ldquoYolov3 An incremental improvementrdquoRetrieved September vol 17 p 2018 2018

[66] M Everingham L Van Gool C K Williams J Winn and A Zisser-man ldquoThe pascal visual object classes (voc) challengerdquo Internationaljournal of computer vision vol 88 no 2 pp 303ndash338 2010

[67] T-Y Lin M Maire S Belongie J Hays P Perona D RamananP Dollar and C L Zitnick ldquoMicrosoft coco Common objects incontextrdquo in European conference on computer vision Springer 2014pp 740ndash755

[68] K Simonyan and A Zisserman ldquoVery deep convolutional networks forlarge-scale image recognitionrdquo arXiv preprint arXiv14091556 2014

[69] K He X Zhang S Ren and J Sun ldquoDeep residual learning for imagerecognitionrdquo in Proceedings of the IEEE conference on computer visionand pattern recognition 2016 pp 770ndash778

[70] C Szegedy V Vanhoucke S Ioffe J Shlens and Z Wojna ldquoRethinkingthe inception architecture for computer visionrdquo in Proceedings of theIEEE conference on computer vision and pattern recognition 2016 pp2818ndash2826

[71] O Russakovsky J Deng H Su J Krause S Satheesh S MaZ Huang A Karpathy A Khosla M Bernstein et al ldquoImagenet largescale visual recognition challengerdquo International journal of computervision vol 115 no 3 pp 211ndash252 2015

[72] C Szegedy S Ioffe V Vanhoucke and A A Alemi ldquoInception-v4inception-resnet and the impact of residual connections on learningrdquo inThirty-first AAAI conference on artificial intelligence 2017

[73] Google ldquoOpen image dataset v6rdquo httpsstoragegoogleapiscomopenimageswebindexhtml 2020 [Online accessed 25-February-2020]

[74] N S Punn and S Agarwal ldquoInception u-net architecture for semanticsegmentation to identify nuclei in microscopy cell imagesrdquo ACM Trans-actions on Multimedia Computing Communications and Applications(TOMM) vol 16 no 1 pp 1ndash15 2020

[75] A Vaswani S Bengio E Brevdo F Chollet A N Gomez S GouwsL Jones Ł Kaiser N Kalchbrenner N Parmar et al ldquoTensor2tensorfor neural machine translationrdquo arXiv preprint arXiv180307416 2018

[76] D Amodei S Ananthanarayanan R Anubhai J Bai E BattenbergC Case J Casper B Catanzaro Q Cheng G Chen et al ldquoDeepspeech 2 End-to-end speech recognition in english and mandarinrdquo inInternational conference on machine learning 2016 pp 173ndash182

[77] A Sonawane ldquoMedium YOLOv3 A Huge Improvementrdquo httpsmediumcomanand sonawaneyolo3 2020 [Online accessed Dec6 2019]

[78] N Wojke A Bewley and D Paulus ldquoSimple online and realtimetracking with a deep association metricrdquo in 2017 IEEE internationalconference on image processing (ICIP) IEEE 2017 pp 3645ndash3649

[79] N Wojke and A Bewley ldquoDeep cosine metric learning for personre-identificationrdquo in 2018 IEEE winter conference on applications ofcomputer vision (WACV) IEEE 2018 pp 748ndash756

[80] S Pouyanfar S Sadiq Y Yan H Tian Y Tao M P Reyes M-L ShyuS-C Chen and S Iyengar ldquoA survey on deep learning Algorithmstechniques and applicationsrdquo ACM Computing Surveys (CSUR) vol 51no 5 pp 1ndash36 2018

[81] A Brunetti D Buongiorno G F Trotta and V Bevilacqua ldquoCom-puter vision and deep learning techniques for pedestrian detection andtracking A surveyrdquo Neurocomputing vol 300 pp 17ndash33 2018

[82] N S Punn and S Agarwal ldquoCrowd analysis for congestion controlearly warning system on foot over bridgerdquo in 2019 Twelfth InternationalConference on Contemporary Computing (IC3) IEEE 2019 pp 1ndash6

[83] Pias ldquoObject detection and distance measurementrdquo httpsgithubcompaul-piasObject-Detection-and-Distance-Measurement 2020 [Onlineaccessed 01-March-2020]

[84] J Harvey Adam LaPlace (2019) Megapixels Origins ethics andprivacy implications of publicly available face recognition imagedatasets [Online] Available httpsmegapixelscc

  • I Introduction
  • II Background study and related work
  • III Object detection and tracking models
    • III-A Anchor boxes
      • III-A1 Loss Function
        • III-B Faster RCNN
          • III-B1 Loss function
            • III-C Single Shot Detector (SSD)
              • III-C1 Loss function
                • III-D YOLO
                  • III-D1 Loss function
                    • III-E Deepsort
                      • IV Proposed approach
                        • IV-A Workflow
                          • V Experiments and results
                          • VI Output
                          • VII Future scope and challenges
                          • VIII Conclusion
                          • References

2

medicine available for treatment Hence precautionary stepsare taken by the whole world to restrict the spread of infectionRecently Eksin et al [8] proposed a modified SIR model withthe inclusion of a social distancing parameter a(IR) whichcan be determined with the help of the number of infected andrecovered persons represented as I and R respectively

dS

dt= minusβS I

Na(IN)

dI

dt= minusδI + βI

I

Na(IN)

dR

dt= δI

(1)

where β represents the infection rate and δ represents recoveryrate The population size is computed as N = S + I + RHere the social distancing term (a(IR) R2 ε [0 1]) mapsthe transition rate from a susceptible state (S) to an infectedstate (I) which is calculated by aβSI

N The social distancing models are of two types where the

first model is known as ldquolong-term awarenessrdquo in which theoccurrence of interaction of an individual with other is reducedproportionally with the cumulative percentage of affected(infectious and recovered) individuals (Eq 2)

a =

(1minus I +R

N

)k(2)

Meanwhile the second model is known as ldquoshort-term aware-nessrdquo where the reduction in interaction is directly propor-tional to the proportion of infectious individuals at a giveninstance (Eq 3)

a =

(1minus I

N

)k(3)

where k is behavior parameter defined as k ge 0 Higher valueof k implies that individuals are becoming sensitive to thedisease prevalence

In the similar background on April 16 2020 a companyLanding AI [10] under the leadership of most recognizablenames in AI Dr Andrew Ng [11] announced the creationof an AI tool to monitor social distancing at the workplaceIn a brief article the company claimed that the upcomingtool could detect if people are maintaining the safe physicaldistance from each other by analyzing real-time video streamsfrom the camera It is also claimed that this tool can easily getintegrated with existing security cameras available at differentworkplaces to maintain a safe distance among all workers Abrief demo was released that shows three steps calibrationdetection and measurement to monitor the social distancingOn April 21 2020 Gartner Inc identified Landing AI as CoolVendors in AI Core Technologies to appreciate their timelyinitiative in this revolutionary area to support the fight againstthe COVID -19 [12]

Motivated by this in this present work authors are at-tempting to check and compare the performance of popularobject detection and tracking schemes in monitoring the socialdistancing Rest of the paper structure is organized as followsSection II presents the recent work proposed in this fieldof study followed by the state-of-the-art object detectionand tracking models in Section III Later in Section IV the

deep learning based framework is proposed to monitor socialdistancing In Section V experimentation and the correspond-ing results are discussed accompanied by the outcome inSection VI In Section VII the future scope and challengesare discussed and lastly Section VIII presents the conclusionof the present research work

II BACKGROUND STUDY AND RELATED WORK

Social distancing is surely the most trustworthy techniqueto stop the spreading of infectious disease with this belief inthe background of December 2019 when COVID-19 emergedin Wuhan China it was opted as an unprecedented measureon January 23 2020 [13] Within one month the outbreakin China gained a peak in the first week of February with2000 to 4000 new confirmed cases per day Later for thefirst time after this outbreak there have been a sign of reliefwith no new confirmed cases for five consecutive days up to23 March 2020 [14] This is evident that social distancingmeasures enacted in China initially adopted worldwide laterto control COVID-19

Prem et al [15] aimed to study the effects of socialdistancing measures on the spread of the COVID-19 epi-demic Authors used synthetic location-specific contact pat-terns to simulate the ongoing trajectory of the outbreak usingsusceptible-exposed-infected-removed (SEIR) models It wasalso suggested that premature and sudden lifting of socialdistancing could lead to an earlier secondary peak whichcould be flattened by relaxing the interventions gradually [15]As we all understand social distancing though essential buteconomically painful measures to flatten the infection curveAdolph et al [16] highlighted the situation of the United Statesof America where due to lack of common consent among allpolicymakers it could not be adopted at an early stage which isresulting into on-going harm to public health Although socialdistancing impacted economic productivity many researchersare trying hard to overcome the loss Following from thiscontext Kylie et al [17] studied the correlation betweenthe strictness of social distancing and the economic statusof the region The study indicated that intermediate levelsof activities could be permitted while avoiding a massiveoutbreak

Since the novel coronavirus pandemic began many coun-tries have been taking the help of technology based solutionsin different capacities to contain the outbreak [18] [19] [20]Many developed countries including India and South Koreafor instance utilising GPS to track the movements of thesuspected or infected persons to monitor any possibility oftheir exposure among healthy people In India the governmentis using the Arogya Setu App which worked with the helpof GPS and bluetooth to locate the presence of COVID-19patients in the vicinity area It also helps others to keep asafe distance from the infected person [21] On the otherhand some law enforcement departments have been usingdrones and other surveillance cameras to detect mass gath-erings of people and taking regulatory actions to disperse thecrowd [22] [23] Such manual intervention in these criticalsituations might help flatten the curve but it also brings a

3

unique set of threats to the public and is challenging to theworkforce

Human detection using visual surveillance system is anestablished area of research which is relying upon manualmethods of identifying unusual activities however it haslimited capabilities [24] In this direction recent advancementsadvocate the need for intelligent systems to detect and capturehuman activities Although human detection is an ambitiousgoal due to a variety of constraints such as low-resolutionvideo varying articulated pose clothing lighting and back-ground complexities and limited machine vision capabilitieswherein prior knowledge on these challenges can improve thedetection performance [25]

Detecting an object which is in motion incorporates twostages object detection [26] and object classification [27] Theprimary stage of object detection could be achieved by usingbackground subtraction [28] optical flow [29] and spatio-temporal filtering techniques [30] In the background subtrac-tion method [31] the difference between the current frameand a background frame (first frame) at pixel or block level iscomputed Adaptive Gaussian mixture temporal differencinghierarchical background models warping background andnon-parametric background are the most popular approachesof background subtraction [32] In optical flow-based objectdetection technique [29] flow vectors associated with theobjectrsquos motion are characterised over a time span in order toidentify regions in motion for a given sequence of images [33]Researchers reported that optical flow based techniques consistof computational overheads and are sensitive to various motionrelated outliers such as noise colour and lighting etc [34]In another method of motion detection Aslani et al [30]proposed spatio-temporal filter based approach in which themotion parameters are identified by using three-dimensional(3D) spatio-temporal features of the person in motion in theimage sequence These methods are advantageous due to itssimplicity and less computational complexity however showslimited performance because of noise and uncertainties onmoving patterns [35]

Object detection problems have been efficiently addressedby recently developed advanced techniques In the last decadeconvolutional neural networks (CNN) region-based CNN [36]and faster region-based CNN [37] used region proposal tech-niques to generate the objectness score prior to its classifica-tion and later generates the bounding boxes around the objectof interest for visualization and other statistical analysis [38]Although these methods are efficient but suffer in termsof larger training time requirements [39] Since all theseCNN based approaches utilize classification another approachYOLO considers a regression based method to dimension-ally separate the bounding boxes and interpret their classprobabilities [40] In this method the designed frameworkefficiently divides the image into several portions representingbounding boxes along with the class probability scores foreach portion to consider as an object This approach offersexcellent improvements in terms of speed while trading thegained speed with the efficiency The detector module exhibitspowerful generalization capabilities of representing an entireimage [41]

Based on the above concepts many research findings havebeen reported in the last few years Crowd counting emerged asa promising area of research with many societal applicationsEshel et al [42] focused on crowd detection and personcount by proposing multiple height homographies for headtop detection and solved the occlusions problem associatedwith video surveillance related applications Chen et al [43]developed an electronic advertising application based on theconcept of crowd counting In similar application Chih-Wenet al [44] proposed a vision-based people counting modelFollowing this Yao et al [45] generated inputs from stationarycameras to perform background subtraction to train the modelfor the appearance and the foreground shape of the crowd invideos

Once an object is detected classification techniques canbe applied to identify a human on the basis of shape tex-ture or motion-based features In shape-based methods theshape related information of moving regions such as pointsboxes and blobs are determined to identify the human Thismethod performs poorly due to certain limitations in stan-dard template-matching schemes [46] [47] which is furtherenhanced by applying part-based template matching [48] ap-proach In another research Dalal et al [49] proposed texture-based schemes such as histograms of oriented gradient (HOG)which utilises high dimensional features based on edges alongwith the support vector machine (SVM) to detect humans

According to recent research further identification of aperson through video surveillance can be done by usingface [50] [51] and gait recognition [52] techniques Howeverdetection and tracking of people under crowd are difficultsometimes due to partial or full occlusion problems Leibeet al [53] proposed trajectory estimation based solution whileAndriluka et al [54] proposed a solution to detect partiallyoccluded people using tracklet-based detectors Many othertracking techniques including a variety of object and motionrepresentations are reviewed by Yilmaz et al [55]

A large number of studies are available in the area of videosurveillance Among many publically available datasets KTHhuman motion dataset [56] shows six categories of activitieswhereas INRIA XMAS multi-view dataset [57] and Weizmannhuman action dataset [58] contain 11 and 10 categories ofactions respectively Another dataset named as performanceevaluation of tracking and surveillance (PETS) is proposedby a group of researchers at university of Oxford [59] Thisdataset is available for vision based research comprising alarge number of datasets for varying tasks in the field ofcomputer vision In the present research in order to fine-tunethe object detection and tracking models for identifying theperson open images datasets [60] are considered It is a col-lection of 19957 classes out of which the models are trainedfor the identification of a person The images are annotatedwith image-level labels and corresponding coordinates of thebounding boxes representing the person Furthermore the finetuned proposed framework is simulated on the Oxford towncenter surveillance footage [23] to monitor social distancing

We believe that having a single dataset with unified an-notations for image classification object detection visualrelationship detection instance segmentation and multimodal

4

Fig 2 Performance overview of the most popular objectdetection models on PASCAL-VOC and MS-COCO datasets

image descriptions will enable us to study and perform ob-ject detection tasks efficiently and stimulate progress towardsgenuine understanding of the scene All explored literatureand related research work clearly establishes a picture thatthe application of human detection can easily get extended tomany applications to cater the situation that arises presentlysuch as to check prescribed standards for hygiene socialdistancing work practices etc

III OBJECT DETECTION AND TRACKING MODELS

As observed from Fig 2 the successful object detectionmodels like RCNN [61] fast RCNN [62] faster RCNN [38]SSD [63] YOLO v1 [40] YOLO v2 [64] and YOLO v3 [65]tested on PASCAL-VOC [66] and MS-COCO [67] datasetsundergo trade-off between speed and accuracy of the detec-tion which is dependent on various factors like backbonearchitecture (feature extraction network eg VGG-16 [68]ResNet-101 [69] Inception v2 [70] etc) input sizes modeldepth varying software and hardware environment A featureextractor tends to encode the modelrsquos input into certain featurerepresentation which aids in learning and discovering thepatterns associated with the desired objects In order to identifymultiple objects of varying scale or size it also uses predefinedboxes covering an entire image termed as anchor boxes Table Idescribes the performance in terms of accuracy for each ofthese popular and powerful feature extraction networks onILSVRC ImageNet challenge [71] along with the numberof trainable parameters which have a direct impact on thetraining speed and time As highlighted in Table I the ratio ofaccuracy to the number of parameters is highest for Inceptionv2 model indicating that Inception v2 achieved adequateclassification accuracy with minimal trainable parameters incontrast to other models and hence is utilized as a backbonearchitecture for faster and efficient computations in the fasterRCNN and SSD object detection models whereas YOLO v3uses different architecture Darknet-53 as proposed by Redmonet al [65]

A Anchor boxes

With the exhaustive literature survey it is observed thatevery popular object detection model utilizes the concept ofanchor boxes to detect multiple objects in the scene [36] Theseboxes are overlaid on the input image over various spatial

TABLE I Performance of the feature extraction network onImageNet challenge

Backbone model Accuracy (a) Parameters (p) Ratio (a100p)VGG-16 [68] 071 15 M 473ResNet-101 [69] 076 425 M 178Inception v2 [70] 074 10 M 740Inception v3 [72] 078 22 M 358Resnet v2 [72] 080 54 M 148

TABLE II Hyperparameters for generating the anchor boxes

Detectionmodel

Size vector(p)

Aspect ratio(r)

Anchorboxes

IoU thfor NMS

FasterRCNN [025 05 10] [05 10 20] 9 07

SSD [02 057 095] [03 05 10] 9 06YOLO v3 [025 05 10] [05 10 20] 9 07

locations (per filter) with varying sizes and aspect ratio Inthis article for an image of dimension breadth (b) times height(h) the anchor boxes are generated in the following mannerConsider the parameters size as p ε (0 1] and aspect ratioas r gt 0 then the anchor boxes for a certain location in animage can be constructed with dimensions as bp

radicr times hp

radicr

Table II shows the values of p and r configured for each modelLater the object detection model is trained to predict for eachgenerated anchor box to belong to a certain class and an offsetto adjust the dimensions of the anchor box to better fit theground-truth of the object while using the classification andregression loss Since there are many anchor boxes for a spatiallocation the object can get associated with more than oneanchor box This problem is dealt with non-max suppression(NMS) by computing intersection over union (IoU) parameterthat limits the anchor boxes association with the object ofinterest by calculating the score as the ratio of overlappingregions between the assigned anchor box and the ground-truthto the union of regions of the anchor box and the ground-truth The score value is then compared with the set thresholdhyperparameter to return the best bounding box for an object

1) Loss Function With each step of model training pre-dicted anchor box lsquoarsquo is assigned a label as positive (1) ornegative (0) based on its associativity with the object ofinterest having ground-truth box lsquogrsquo The positive anchor boxis then assigned a class label yo ε c1 c2 cn here cnindicates the category of the nth object while also generatingthe encoding vector for box lsquogrsquo with respect to lsquoarsquo as f(ga|a)where yo = 0 for negative anchor boxes Consider an imageI for some anchor lsquoarsquo model with trained parameters ωpredicted the object class as Ycls(I|aω) and the correspond-ing box offset as Yreg(I|aω) then the loss for a singleanchor prediction can be computed (Lcls) and bounding boxregression loss (Lreg) as given by the Eq 4

L(a|Iω) = α1obja Lreg(f(ga|a)minus Yreg(I|aω))+βLcls(ya Ycls(I|aω))

(4)

where 1obja is 1 if lsquoarsquo is a positive anchor α and β arethe weights associated with the regression and classificationloss Later the overall loss of the model can be computed as

5

Fig 3 Schematic representation of faster RCNN architecture

the average of the L(a|Iw) over the predictions for all theanchors

B Faster RCNN

Proposed by Ren et al [38] the faster RCNN is derivedfrom its predecessors RCNN [61] and fast RCNN [62] whichrely on external region proposal approach based on selectivesearch (SS) [73] Many researchers [74] [75] [76] observedthat instead of using the SS it is recommended to utilizethe advantages of convolution layers for better and fasterlocalization of the objects Hence Ren et al proposed theRegion Proposal Network (RPN) which uses CNN modelseg VGGNet ResNet etc to generate the region proposalsthat made faster RCNN 10 times faster than fast RCNNFig 3 shows the schematic representation of faster RCNNarchitecture where RPN module performs binary classificationof an object or not an object (background) while classificationmodule assigns categories for each detected object (multi-class classification) by using the region of interest (RoI)pooling [38] on the extracted feature maps with projectedregions

1) Loss function The faster RCNN is the combinationof two modules RPN and fast RCNN detector The overallmulti-task loss function is composed of classification loss andbounding box regression loss as defined in Eq 4 with Lclsand Lreg functions defined in Eq 5

Lcls(pi plowasti ) = minusplowasti log(pi)minus (1minus plowasti ) log(1minus pi)

Lreg(tu v) =

sumxεxywh

Lsmooth1 (tui minus v)

Lsmooth1 (q) =

05q2 if | q |lt 1

| q | minus05 otherwise

(5)

where tu is the predicted corrections of the bounding box tu =tux tuy tuw tuh Here u is a true class label (x y) correspondsto the top-left coordinates of the bounding box with height hand width w v is a ground-truth bounding box plowasti is thepredicted class and pi is the actual class

C Single Shot Detector (SSD)

In this research single shot detector (SSD) [63] is also usedas another object identification method to detect people in real-time video surveillance system As discussed earlier fasterR-CNN works on region proposals to create boundary boxesto indicate objects shows better accuracy but has slow pro-cessing of frames per second (FPS) For real-time processingSSD further improves the accuracy and FPS by using multi-scale features and default boxes in a single process It follows

Fig 4 Schematic representation of SSD architecture

the principle of the feed-forward convolution network whichgenerates bounding boxes of fixed sizes along with a scorebased on the presence of object class instances in those boxesfollowed by NMS step to produce the final detections Thusit consists of two steps extracting feature maps and applyingconvolution filters to detect objects by using an architecturehaving three main parts First part is a base pretrained networkto extract feature maps whereas in the second part multi-scale feature layers are used in which series of convolutionfilters are cascaded after the base network The last part isa non-maximum suppression unit for eliminating overlappingboxes and one object only per box The architecture of SSDis shown in Fig 4

1) Loss function Similar to the above discussed fasterRCNN model the overall loss function of the SSD modelis equal to the sum of multi-class classification loss (Lcls)and bounding box regression loss (localization loss Lreg) asshown in Eq 4 where Lreg and Lcls is defined by Eq 6 and 7

Lreg(x l g) =

Nsumiεpos

summεcxcywh

xkijsmoothL1(lmi minus gmj )

gcxj =(gcxj minus a

cxi )

awi g

cyj =

(gcyj minus a

cyi )

ahi

gwj = log

(gwjawi

) ghj = log

(ghjahi

)

xpij =

1 if IoU gt 05

0 otherwise

(6)

where l is the predicted box g is the ground truth box xpij isan indicator that matches the ith anchor box to the jth groundtruth box cx and cy are offsets to the anchor box a

Lcls(x c) = minusNsum

iεPos

xpij log(cpi )minus

sumiεNeg

log(coi ) (7)

where cpi =exp cpisump exp cpi

and N is the number of default matchedboxes

D YOLO

For object detection another competitor of SSD isYOLO [40] This method can predict the type and location ofan object by looking only once at the image YOLO considersthe object detection problem as a regression task insteadof classification to assign class probabilities to the anchorboxes A single convolutional network simultaneously predicts

6

Fig 5 Schematic representation of YOLO v3 architecture

multiple bounding boxes and class probabilities Majorly thereare three versions of YOLO v1 v2 and v3 YOLO v1 isinspired by GoogleNet (Inception network) which is designedfor object classification in an image This network consistsof 24 convolutional layers and 2 fully connected layersInstead of the Inception modules used by GoogLeNet YOLOv1 simply uses a reduction layer followed by convolutionallayers Later YOLO v2 [64] is proposed with the objectiveof improving the accuracy significantly while making it fasterYOLO v2 uses Darknet-19 as a backbone network consistingof 19 convolution layers along with 5 max pooling layersand an output softmax layer for object classification YOLOv2 outperformed its predecessor (YOLO v1) with significantimprovements in mAP FPS and object classification scoreIn contrast YOLO v3 performs multi-label classification withthe help of logistic classifiers instead of using softmax asin case of YOLO v1 and v2 In YOLO v3 Redmon et alproposed Darknet-53 as a backbone architecture that extractsfeatures maps for classification In contrast to Darknet-19Darknet-53 consists of residual blocks (short connections)along with the upsampling layers for concatenation and addeddepth to the network YOLO v3 generates three predictionsfor each spatial location at different scales in an image whicheliminates the problem of not being able to detect small objectsefficiently [77] Each prediction is monitored by computingobjectness boundary box regressor and classification scoresIn Fig 5 a schematic description of the YOLOv3 architectureis presented

1) Loss function The overall loss function of YOLO v3consists of localization loss (bounding box regressor) crossentropy and confidence loss for classification score definedas follows

λcoord

S2sumi=0

Bsumj=0

1objij ((tx minus tx)2+ (ty minus ty)

2+ (tw minus tw)

2+

(th minus th)2)

+

S2sumi=0

Bsumj=0

1objij (minus log(σ(to)) +

Csumk=1

BCE(yk σ(sk)))

+λnoobj

S2sumi=0

Bsumj=0

1noobjij (minus log(1minus σ(to))

(8)where λcoord indicates the weight of the coordinate errorS2 indicates the number of grids in the image and B isthe number of generated bounding boxes per grid 1objij = 1

describes that object confines in the jth bounding box in grid

i otherwise it is 0

E Deepsort

Deepsort is a deep learning based approach to track customobjects in a video [78] In the present research Deepsort isutilized to track individuals present in the surveillance footageIt makes use of patterns learned via detected objects in theimages which is later combined with the temporal informationfor predicting associated trajectories of the objects of interestIt keeps track of each object under consideration by mappingunique identifiers for further statistical analysis Deepsort isalso useful to handle associated challenges such as occlusionmultiple viewpoints non-stationary cameras and annotatingtraining data For effective tracking the Kalman filter andthe Hungarian algorithm are used Kalman filter is recursivelyused for better association and it can predict future positionsbased on the current position Hungarian algorithm is used forassociation and id attribution that identifies if an object in thecurrent frame is the same as the one in the previous frameInitially a Faster RCNN is trained for person identification andfor tracking a linear constant velocity model [79] is utilized todescribe each target with eight dimensional space as follows

x = [u v λ h x y λ h]T (9)

where (u v) is the centroid of the bounding box a is the aspectratio and h is the height of the image The other variables arethe respective velocities of the variables Later the standardKalman filter is used with constant velocity motion and linearobservation model where the bounding coordinates (u v λ h)are taken as direct observations of the object state

For each track k starting from the last successful measure-ment association ak the total number of frames are calculatedWith positive prediction from the Kalman filter the counteris incremented and later when the track gets associated witha measurement it resets its value to 0 Furthermore if theidentified tracks exceed a predefined maximum age thenthose objects are considered to have left the scene and thecorresponding track gets removed from the track set Andif there are no tracks available for some detected objectsthen new track hypotheses are initiated for each unidentifiedtrack of novel detected objects that cannot be mapped to theexisting tracks For the first three frames the new tracks areclassified as indefinite until a successful measurement map-ping is computed If the tracks are not successfully mappedwith measurement then it gets deleted from the track setHungarian algorithm is then utilized in order to solve themapping problem between the newly arrived measurementsand the predicted Kalman states by considering the motion andappearance information with the help of Mahalanobis distancecomputed between them as defined in Eq 10

d(1)(i j) = (dj minus yi)TSminus1i (dj minus yi) (10)

where the projection of the ith track distribution into measure-ment space is represented by (yi Si) and the jth boundingbox detection by dj The Mahalanobis distance considers thisuncertainty by estimating the count of standard deviationsthe detection is away from the mean track location Further

7

using this metric it is possible to exclude unlikely associationsby thresholding the Mahalanobis distance This decision isdenoted with an indicator that evaluates to 1 if the associationbetween the ith track and jth detection is admissible (Eq 11)

b(1)ij = 1[d(1)(i j) lt t(1)] (11)

Though Mahalanobis distance performs efficiently but failsin the environment where camera motion is possible therebyanother metric is introduced for the assignment problem Thissecond metric measures the smallest cosine distance betweenthe ith track and jth detection in appearance space as follows

d(2)(i j) = min1minus rjT rk(i) | rk(i) ε R2 (12)

Again a binary variable is introduced to indicate if an asso-ciation is admissible according to the following metric

b(1)ij = 1[d(2)(i j) lt t(2)] (13)

and a suitable threshold is measured for this indicator on aseparate training dataset To build the association problemboth metrics are combined using a weighted sum

cij = λd(1)(i j) + (1minus λ)d(2)(i j) (14)

where an association is admissible if it is within the gatingregion of both metrics

bij =prodm=1

2b(m)ij (15)

The influence of each metric on the combined association costcan be controlled through hyperparameter λ

IV PROPOSED APPROACH

The emergence of deep learning has brought the best per-forming techniques for a wide variety of tasks and challengesincluding medical diagnosis [74] machine translation [75]speech recognition [76] and a lot more [80] Most of thesetasks are centred around object classification detection seg-mentation tracking and recognition [81] [82] In recent yearsthe convolution neural network (CNN) based architectureshave shown significant performance improvements that areleading towards the high quality of object detection as shownin Fig 2 which presents the performance of such modelsin terms of mAP and FPS on standard benchmark datasetsPASCAL-VOC [66] and MS-COCO [67] and similar hard-ware resources

In the present article a deep learning based framework isproposed that utilizes object detection and tracking modelsto aid in the social distancing remedy for dealing with theescalation of COVID-19 cases In order to maintain thebalance of speed and accuracy YOLO v3 [65] alongside theDeepsort [78] are utilized as object detection and trackingapproaches while surrounding each detected object with thebounding boxes Later these bounding boxes are utilized tocompute the pairwise L2 norm with computationally efficientvectorized representation for identifying the clusters of peoplenot obeying the order of social distancing Furthermore tovisualize the clusters in the live stream each bounding box

is color-coded based on its association with the group wherepeople belonging to the same group are represented withthe same color Each surveillance frame is also accompaniedwith the streamline plot depicting the statistical count of thenumber of social groups and an index term (violation index)representing the ratio of the number of people to the numberof groups Furthermore estimated violations can be computedby multiplying the violation index with the total number ofsocial groups

A Workflow

This section includes the necessary steps undertaken tocompose a framework for monitoring social distancing

1 Fine-tune the trained object detection model to identifyand track the person in a footage

2 The trained model is feeded with the surveillance footageThe model generates a set of bounding boxes and an IDfor each identified person

3 Each individual is associated with three-dimensional fea-ture space (x y d) where (x y) corresponds to thecentroid coordinates of the bounding box and d definesthe depth of the individual as observed from the camera

d = ((2 lowast 314 lowast 180)(w + h lowast 360) lowast 1000 + 3) (16)

where w is the width of the bounding box and h is theheight of the bounding box [83]

4 For the set of bounding boxes pairwise L2 norm iscomputed as given by the following equation

||D||2 =

radicradicradicradic nsumi=1

(qi minus pi)2 (17)

where in this work n = 35 The dense matrix of L2 norm is then utilized to assign the

neighbors for each individual that satisfies the closenesssensitivity With extensive trials the closeness thresholdis updated dynamically based on the spatial location ofthe person in a given frame ranging between (90 170)pixels

6 Any individual that meets the closeness property isassigned a neighbour or neighbours forming a grouprepresented in a different color coding in contrast to otherpeople

7 The formation of groups indicates the violation of thepractice of social distancing which is quantified with helpof the following

ndash Consider ng as number of groups or clusters identi-fied and np as total number of people found in closeproximity

ndash vi = npng where vi is the violation index

V EXPERIMENTS AND RESULTS

The above discussed object detection models are fine tunedfor binary classification (person or not a person) with Inceptionv2 as a backbone network on the Nvidia GTX 1060 GPUusing the dataset acquired from the open image dataset (OID)

8

Fig 6 Data samples showing (a) true samples and (b) falsesamples of a ldquoPersonrdquo class from the open image dataset

Fig 7 Losses per iteration of the object detection modelsduring the training phase on the OID validation set fordetecting the person in an image

repository [73] maintained by the Google open source com-munity The diverse images with a class label as ldquoPersonrdquo aredownloaded via OIDv4 toolkit [84] along with the annotationsFig 6 shows the sample images of the obtained dataset con-sisting of 800 images which is obtained by manually filteringto only contain the true samples The dataset is then dividedinto training and testing sets in 82 ratio In order to make thetesting robust the testing set is also accompanied by the framesof surveillance footage of the Oxford town center [23] Laterthis footage is also utilized to simulate the overall approach formonitoring the social distancing In case of faster RCNN theimages are resized to P pixels on the shorter edge with 600and 1024 for low and high resolution while in SSD and YOLOthe images are scaled to the fixed dimension P times P with Pvalue as 416 During the training phase the performance of themodels is continuously monitored using the mAP along withthe localization classification and overall loss in the detectionof the person as indicated in Fig 7 Table III summarizes theresults of each model obtained at the end of the training phasewith the training time (TT) number of iterations (NoI) mAPand total loss (TL) value It is observed that the faster RCNNmodel achieved minimal loss with maximum mAP howeverhas the lowest FPS which makes it not suitable for real-timeapplications Furthermore as compared to SSD YOLO v3achieved better results with balanced mAP training time andFPS score The trained YOLO v3 model is then utilized formonitoring the social distancing on the surveillance video

TABLE III Performance comparison of the object detectionmodels

Model TT (in sec) NoI mAP TL FPSFaster RCNN 9651 12135 0969 002 3SSD 2124 1200 0691 022 10YOLO v3 5659 7560 0846 087 23

Fig 8 Sample output of the proposed framework for monitor-ing social distancing on surveillance footage of Oxford TownCenter

VI OUTPUT

The proposed framework outputs (as shown in Fig 8) theprocessed frame with the identified people confined in thebounding boxes while also simulating the statistical analysisshowing the total number of social groups displayed by samecolor encoding and a violation index term computed as theratio of the number of people to the number of groups Theframes shown in Fig 8 displays violation index as 3 2 2 and233 The frames with detected violations are recorded withthe timestamp for future analysis

VII FUTURE SCOPE AND CHALLENGES

Since this application is intended to be used in any workingenvironment accuracy and precision are highly desired toserve the purpose Higher number of false positive may raisediscomfort and panic situation among people being observedThere may also be genuinely raised concerns about privacy andindividual rights which can be addressed with some additionalmeasures such as prior consents for such working environ-ments hiding a personrsquos identity in general and maintainingtransparency about its fair uses within limited stakeholders

VIII CONCLUSION

The article proposes an efficient real-time deep learningbased framework to automate the process of monitoring thesocial distancing via object detection and tracking approacheswhere each individual is identified in the real-time with the

9

help of bounding boxes The generated bounding boxes aidin identifying the clusters or groups of people satisfyingthe closeness property computed with the help of pairwisevectorized approach The number of violations are confirmedby computing the number of groups formed and violationindex term computed as the ratio of the number of peopleto the number of groups The extensive trials were conductedwith popular state-of-the-art object detection models FasterRCNN SSD and YOLO v3 where YOLO v3 illustratedthe efficient performance with balanced FPS and mAP scoreSince this approach is highly sensitive to the spatial locationof the camera the same approach can be fine tuned to betteradjust with the corresponding field of view

ACKNOWLEDGMENT

The authors gratefully acknowledge the helpful commentsand suggestions of colleagues Authors are also indebted toInterdisciplinary Cyber Physical Systems (ICPS) ProgrammeDepartment of Science and Technology (DST) Governmentof India (GoI) vide Reference No244 for their financialsupport to carry out the background research which helpedsignificantly for the implementation of present research work

REFERENCES

[1] W H Organization ldquoWHO corona-viruses (COVID-19)rdquo httpswwwwhointemergenciesdiseasesnovel-corona-virus-2019 2020 [Onlineaccessed May 02 2020]

[2] WHO ldquoWho director-generalrsquos opening remarks at the media briefingon covid-19-11 march 2020rdquo httpswwwwhointdgspeechesdetail2020 [Online accessed March 12 2020]

[3] L Hensley ldquoSocial distancing is out physical distancing is inmdashherersquoshow to do itrdquo Global NewsndashCanada (27 March 2020) 2020

[4] ECDPC ldquoConsiderations relating to social distancing measures inresponse to COVID-19 ndash second updaterdquo httpswwwecdceuropaeuenpublications-dataconsiderations 2020 [Online accessed March 232020]

[5] M W Fong H Gao J Y Wong J Xiao E Y Shiu S Ryu andB J Cowling ldquoNonpharmaceutical measures for pandemic influenza innonhealthcare settingsmdashsocial distancing measuresrdquo 2020

[6] F Ahmed N Zviedrite and A Uzicanin ldquoEffectiveness of workplacesocial distancing measures in reducing influenza transmission a system-atic reviewrdquo BMC public health vol 18 no 1 p 518 2018

[7] W O Kermack and A G McKendrick ldquoContributions to the mathe-matical theory of epidemicsndashi 1927rdquo 1991

[8] C Eksin K Paarporn and J S Weitz ldquoSystematic biases in diseaseforecastingndashthe role of behavior changerdquo Epidemics vol 27 pp 96ndash105 2019

[9] M Zhao and H Zhao ldquoAsymptotic behavior of global positive solutionto a stochastic sir model incorporating media coveragerdquo Advances inDifference Equations vol 2016 no 1 pp 1ndash17 2016

[10] P Alto ldquoLanding AI Named an April 2020 Cool Vendor in the Gart-ner Cool Vendors in AI Core Technologiesrdquo httpswwwyahoocomlifestylelanding-ai-named-april-2020-152100532html 2020 [Onlineaccessed April 21 2020]

[11] A Y Ng ldquoCurriculum Vitaerdquo httpsaistanfordedusimangcurriculum-vitaepdf

[12] L AI ldquoLanding AI Named an April 2020 Cool Vendor in the GartnerCool Vendors in AI Core Technologiesrdquo httpswwwprnewswirecomnews-releases 2020 [Online accessed April 22 2020]

[13] B News ldquoChina coronavirus Lockdown measures rise across Hubeiprovincerdquo httpswwwbbccouknewsworld-asia-china512174552020 [Online accessed January 23 2020]

[14] N H C of the Peoplersquos Republic of China ldquoDaily briefing on novelcoronavirus cases in Chinardquo httpennhcgovcn2020-0320c 78006htm 2020 [Online accessed March 20 2020]

[15] K Prem Y Liu T W Russell A J Kucharski R M Eggo N DaviesS Flasche S Clifford C A Pearson J D Munday et al ldquoThe effectof control strategies to reduce social mixing on outcomes of the covid-19 epidemic in wuhan china a modelling studyrdquo The Lancet PublicHealth 2020

[16] C Adolph K Amano B Bang-Jensen N Fullman and J WilkersonldquoPandemic politics Timing state-level social distancing responses tocovid-19rdquo medRxiv 2020

[17] K E Ainslie C E Walters H Fu S Bhatia H Wang X XiM Baguelin S Bhatt A Boonyasiri O Boyd et al ldquoEvidence ofinitial success for china exiting covid-19 social distancing policy afterachieving containmentrdquo Wellcome Open Research vol 5 no 81 p 812020

[18] S K Sonbhadra S Agarwal and P Nagabhushan ldquoTarget specificmining of covid-19 scholarly articles using one-class approachrdquo 2020

[19] N S Punn and S Agarwal ldquoAutomated diagnosis of covid-19 withlimited posteroanterior chest x-ray images using fine-tuned deep neuralnetworksrdquo 2020

[20] N S Punn S K Sonbhadra and S Agarwal ldquoCovid-19 epidemicanalysis using machine learning and deep learning algorithmsrdquomedRxiv 2020 [Online] Available httpswwwmedrxivorgcontentearly202004112020040820057679

[21] O website of Indian Government ldquoDistribution of the novelcoronavirus-infected pneumoni Aarogya Setu Mobile Apprdquo httpswwwmygovinaarogya-setu-app 2020

[22] M Robakowska A Tyranska-Fobke J Nowak D Slezak P ZuratynskiP Robakowski K Nadolny and J R Ładny ldquoThe use of drones duringmass eventsrdquo Disaster and Emergency Medicine Journal vol 2 no 3pp 129ndash134 2017

[23] J Harvey Adam LaPlace (2019) Megapixelscc Origins ethicsand privacy implications of publicly available face recognition imagedatasets [Online] Available httpsmegapixelscc

[24] N Sulman T Sanocki D Goldgof and R Kasturi ldquoHow effectiveis human video surveillance performancerdquo in 2008 19th InternationalConference on Pattern Recognition IEEE 2008 pp 1ndash3

[25] X Wang ldquoIntelligent multi-camera video surveillance A reviewrdquo Pat-tern recognition letters vol 34 no 1 pp 3ndash19 2013

[26] K A Joshi and D G Thakore ldquoA survey on moving object detectionand tracking in video surveillance systemrdquo International Journal of SoftComputing and Engineering vol 2 no 3 pp 44ndash48 2012

[27] O Javed and M Shah ldquoTracking and object classification for automatedsurveillancerdquo in European Conference on Computer Vision Springer2002 pp 343ndash357

[28] S Brutzer B Hoferlin and G Heidemann ldquoEvaluation of backgroundsubtraction techniques for video surveillancerdquo in CVPR 2011 IEEE2011 pp 1937ndash1944

[29] S Aslani and H Mahdavi-Nasab ldquoOptical flow based moving objectdetection and tracking for traffic surveillancerdquo International Journal ofElectrical Computer Energetic Electronic and Communication Engi-neering vol 7 no 9 pp 1252ndash1256 2013

[30] P Dollar V Rabaud G Cottrell and S Belongie ldquoBehavior recognitionvia sparse spatio-temporal featuresrdquo in 2005 IEEE International Work-shop on Visual Surveillance and Performance Evaluation of Trackingand Surveillance IEEE 2005 pp 65ndash72

[31] M Piccardi ldquoBackground subtraction techniques a reviewrdquo in 2004IEEE International Conference on Systems Man and Cybernetics (IEEECat No 04CH37583) vol 4 IEEE 2004 pp 3099ndash3104

[32] Y Xu J Dong B Zhang and D Xu ldquoBackground modeling methodsin video analysis A review and comparative evaluationrdquo CAAI Trans-actions on Intelligence Technology vol 1 no 1 pp 43ndash60 2016

[33] H Tsutsui J Miura and Y Shirai ldquoOptical flow-based person trackingby multiple camerasrdquo in Conference Documentation International Con-ference on Multisensor Fusion and Integration for Intelligent SystemsMFI 2001 (Cat No 01TH8590) IEEE 2001 pp 91ndash96

[34] A Agarwal S Gupta and D K Singh ldquoReview of optical flowtechnique for moving object detectionrdquo in 2016 2nd InternationalConference on Contemporary Computing and Informatics (IC3I) IEEE2016 pp 409ndash413

[35] S A Niyogi and E H Adelson ldquoAnalyzing gait with spatiotemporalsurfacesrdquo in Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects IEEE 1994 pp 64ndash69

[36] Z-Q Zhao P Zheng S-t Xu and X Wu ldquoObject detection with deeplearning A reviewrdquo IEEE transactions on neural networks and learningsystems vol 30 no 11 pp 3212ndash3232 2019

[37] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neural infor-mation processing systems 2012 pp 1097ndash1105

10

[38] S Ren K He R Girshick and J Sun ldquoFaster r-cnn Towards real-timeobject detection with region proposal networksrdquo in Advances in neuralinformation processing systems 2015 pp 91ndash99

[39] X Chen and A Gupta ldquoAn implementation of faster rcnn with studyfor region samplingrdquo arXiv preprint arXiv170202138 2017

[40] J Redmon S Divvala R Girshick and A Farhadi ldquoYou only lookonce Unified real-time object detectionrdquo in Proceedings of the IEEEconference on computer vision and pattern recognition 2016 pp 779ndash788

[41] M Putra Z Yussof K Lim and S Salim ldquoConvolutional neuralnetwork for person and car detection using yolo frameworkrdquo Journalof Telecommunication Electronic and Computer Engineering (JTEC)vol 10 no 1-7 pp 67ndash71 2018

[42] R Eshel and Y Moses ldquoHomography based multiple camera detectionand tracking of people in a dense crowdrdquo in 2008 IEEE Conference onComputer Vision and Pattern Recognition IEEE 2008 pp 1ndash8

[43] D-Y Chen C-W Su Y-C Zeng S-W Sun W-R Lai and H-Y M Liao ldquoAn online people counting system for electronic advertisingmachinesrdquo in 2009 IEEE International Conference on Multimedia andExpo IEEE 2009 pp 1262ndash1265

[44] C-W Su H-Y M Liao and H-R Tyan ldquoA vision-based peoplecounting approach based on the symmetry measurerdquo in 2009 IEEEInternational Symposium on Circuits and Systems IEEE 2009 pp2617ndash2620

[45] J Yao and J-M Odobez ldquoFast human detection from joint appearanceand foreground feature subset covariancesrdquo Computer Vision and ImageUnderstanding vol 115 no 10 pp 1414ndash1426 2011

[46] B Wu and R Nevatia ldquoDetection and tracking of multiple partiallyoccluded humans by bayesian combination of edgelet based part de-tectorsrdquo International Journal of Computer Vision vol 75 no 2 pp247ndash266 2007

[47] F Z Eishita A Rahman S A Azad and A Rahman ldquoOcclusionhandling in object detectionrdquo in Multidisciplinary Computational Intelli-gence Techniques Applications in Business Engineering and MedicineIGI Global 2012 pp 61ndash74

unique set of threats to the public and is challenging to the workforce.

Human detection using visual surveillance systems is an established area of research which has long relied upon manual methods of identifying unusual activities; however, these have limited capabilities [24]. In this direction, recent advancements advocate the need for intelligent systems to detect and capture human activities. Human detection remains an ambitious goal due to a variety of constraints such as low-resolution video, varying articulated pose, clothing, lighting, background complexities and limited machine vision capabilities, wherein prior knowledge of these challenges can improve the detection performance [25].

Detecting an object in motion incorporates two stages: object detection [26] and object classification [27]. The primary stage of object detection can be achieved by using background subtraction [28], optical flow [29] and spatio-temporal filtering techniques [30]. In the background subtraction method [31], the difference between the current frame and a background frame (first frame) is computed at the pixel or block level. Adaptive Gaussian mixture, temporal differencing, hierarchical background models, warping background and non-parametric background are the most popular approaches of background subtraction [32]. In the optical flow based object detection technique [29], flow vectors associated with the object's motion are characterised over a time span in order to identify regions in motion for a given sequence of images [33]. Researchers reported that optical flow based techniques involve computational overheads and are sensitive to various motion related outliers such as noise, colour and lighting [34]. In another method of motion detection, Aslani et al. [30] proposed a spatio-temporal filter based approach in which the motion parameters are identified by using three-dimensional (3D) spatio-temporal features of the person in motion in the image sequence. These methods are advantageous due to their simplicity and low computational complexity, but show limited performance because of noise and uncertainties in moving patterns [35].

Object detection problems have been efficiently addressed by recently developed advanced techniques. In the last decade, convolutional neural networks (CNN), region-based CNN [36] and faster region-based CNN [37] used region proposal techniques to generate an objectness score prior to classification, and later generate the bounding boxes around the objects of interest for visualization and other statistical analyses [38]. Although these methods are efficient, they suffer from large training time requirements [39]. Since all these CNN based approaches utilize classification, another approach, YOLO, considers a regression based method to dimensionally separate the bounding boxes and interpret their class probabilities [40]. In this method, the designed framework efficiently divides the image into several portions representing bounding boxes, along with class probability scores for each portion, to consider as an object. This approach offers excellent improvements in terms of speed, while trading the gained speed against some efficiency. The detector module exhibits powerful generalization capabilities of representing an entire image [41].

Based on the above concepts, many research findings have been reported in the last few years. Crowd counting emerged as a promising area of research with many societal applications. Eshel et al. [42] focused on crowd detection and person count by proposing multiple height homographies for head top detection, and solved the occlusion problems associated with video surveillance related applications. Chen et al. [43] developed an electronic advertising application based on the concept of crowd counting. In a similar application, Chih-Wen et al. [44] proposed a vision-based people counting model. Following this, Yao et al. [45] generated inputs from stationary cameras to perform background subtraction in order to train the model for the appearance and the foreground shape of the crowd in videos.

Once an object is detected, classification techniques can be applied to identify a human on the basis of shape, texture or motion-based features. In shape-based methods, the shape related information of moving regions, such as points, boxes and blobs, is determined to identify the human. This method performs poorly due to certain limitations of standard template-matching schemes [46], [47], which is further enhanced by applying a part-based template matching [48] approach. In another research, Dalal et al. [49] proposed texture-based schemes such as histograms of oriented gradients (HOG), which utilise high dimensional features based on edges along with a support vector machine (SVM) to detect humans.

According to recent research, further identification of a person through video surveillance can be done by using face [50], [51] and gait recognition [52] techniques. However, detection and tracking of people in a crowd is sometimes difficult due to partial or full occlusion problems. Leibe et al. [53] proposed a trajectory estimation based solution, while Andriluka et al. [54] proposed a solution to detect partially occluded people using tracklet-based detectors. Many other tracking techniques, including a variety of object and motion representations, are reviewed by Yilmaz et al. [55].

A large number of studies are available in the area of video surveillance. Among the many publicly available datasets, the KTH human motion dataset [56] shows six categories of activities, whereas the INRIA XMAS multi-view dataset [57] and the Weizmann human action dataset [58] contain 11 and 10 categories of actions, respectively. Another dataset, named performance evaluation of tracking and surveillance (PETS), is proposed by a group of researchers at the University of Oxford [59]. This dataset is available for vision based research, comprising a large number of datasets for varying tasks in the field of computer vision. In the present research, in order to fine-tune the object detection and tracking models for identifying persons, the open images dataset [60] is considered. It is a collection of 19,957 classes, out of which the models are trained for the identification of a person. The images are annotated with image-level labels and the corresponding coordinates of the bounding boxes representing the person. Furthermore, the fine-tuned proposed framework is simulated on the Oxford town center surveillance footage [23] to monitor social distancing.

We believe that having a single dataset with unified annotations for image classification, object detection, visual relationship detection, instance segmentation and multimodal image descriptions will enable us to study and perform object detection tasks efficiently and stimulate progress towards genuine understanding of the scene. All explored literature and related research work clearly establish that the application of human detection can easily be extended to many applications to cater to presently arising situations, such as checking prescribed standards for hygiene, social distancing, work practices, etc.

Fig. 2: Performance overview of the most popular object detection models on the PASCAL-VOC and MS-COCO datasets.

III. OBJECT DETECTION AND TRACKING MODELS

As observed from Fig. 2, successful object detection models like RCNN [61], fast RCNN [62], faster RCNN [38], SSD [63], YOLO v1 [40], YOLO v2 [64] and YOLO v3 [65], tested on the PASCAL-VOC [66] and MS-COCO [67] datasets, undergo a trade-off between speed and accuracy of the detection, which is dependent on various factors like the backbone architecture (feature extraction network, e.g. VGG-16 [68], ResNet-101 [69], Inception v2 [70], etc.), input sizes, model depth, and varying software and hardware environments. A feature extractor tends to encode the model's input into a certain feature representation which aids in learning and discovering the patterns associated with the desired objects. In order to identify multiple objects of varying scale or size, it also uses predefined boxes covering an entire image, termed anchor boxes. Table I describes the performance in terms of accuracy for each of these popular and powerful feature extraction networks on the ILSVRC ImageNet challenge [71], along with the number of trainable parameters, which have a direct impact on the training speed and time. As highlighted in Table I, the ratio of accuracy to the number of parameters is highest for the Inception v2 model, indicating that Inception v2 achieved adequate classification accuracy with minimal trainable parameters in contrast to other models; hence it is utilized as a backbone architecture for faster and more efficient computations in the faster RCNN and SSD object detection models, whereas YOLO v3 uses a different architecture, Darknet-53, as proposed by Redmon et al. [65].

A. Anchor boxes

With the exhaustive literature survey, it is observed that every popular object detection model utilizes the concept of anchor boxes to detect multiple objects in the scene [36]. These boxes are overlaid on the input image over various spatial locations (per filter) with varying sizes and aspect ratios. In this article, for an image of dimension breadth ($b$) $\times$ height ($h$), the anchor boxes are generated in the following manner. Consider the size parameter $p \in (0, 1]$ and the aspect ratio $r > 0$; then the anchor boxes for a certain location in an image can be constructed with dimensions $bp\sqrt{r} \times hp/\sqrt{r}$. Table II shows the values of $p$ and $r$ configured for each model. Later, the object detection model is trained to predict, for each generated anchor box, the class it belongs to, along with an offset to adjust the dimensions of the anchor box to better fit the ground-truth of the object, using the classification and regression losses. Since there are many anchor boxes for a spatial location, the object can get associated with more than one anchor box. This problem is dealt with by non-max suppression (NMS), which computes the intersection over union (IoU) parameter limiting the anchor boxes' association with the object of interest: the score is calculated as the ratio of the overlapping region between the assigned anchor box and the ground-truth to the union of the regions of the anchor box and the ground-truth. The score value is then compared with the set threshold hyperparameter to return the best bounding box for an object.

TABLE I: Performance of the feature extraction networks on the ImageNet challenge

Backbone model     | Accuracy (a) | Parameters (p) | Ratio (100a/p)
VGG-16 [68]        | 0.71         | 15 M           | 4.73
ResNet-101 [69]    | 0.76         | 42.5 M         | 1.78
Inception v2 [70]  | 0.74         | 10 M           | 7.40
Inception v3 [72]  | 0.78         | 22 M           | 3.58
ResNet v2 [72]     | 0.80         | 54 M           | 1.48

TABLE II: Hyperparameters for generating the anchor boxes

Detection model | Size vector (p)   | Aspect ratio (r) | Anchor boxes | IoU th. for NMS
Faster RCNN     | [0.25, 0.5, 1.0]  | [0.5, 1.0, 2.0]  | 9            | 0.7
SSD             | [0.2, 0.57, 0.95] | [0.3, 0.5, 1.0]  | 9            | 0.6
YOLO v3         | [0.25, 0.5, 1.0]  | [0.5, 1.0, 2.0]  | 9            | 0.7
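As a concrete illustration, the following minimal NumPy sketch (not the authors' implementation; the helper names are hypothetical) generates the nine anchor boxes of Table II for one spatial location and computes the IoU score used by NMS:

```python
import numpy as np

def generate_anchors(b, h, sizes, ratios):
    """Anchor (width, height) pairs for one spatial location of a b x h image."""
    anchors = []
    for p in sizes:          # size fraction p in (0, 1]
        for r in ratios:     # aspect ratio r > 0
            anchors.append((b * p * np.sqrt(r), h * p / np.sqrt(r)))
    return np.array(anchors)  # len(sizes) * len(ratios) boxes, e.g. 3 x 3 = 9

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Faster RCNN / YOLO v3 configuration from Table II: 9 anchors per location.
print(generate_anchors(416, 416, [0.25, 0.5, 1.0], [0.5, 1.0, 2.0]))
print(iou((0, 0, 100, 100), (50, 50, 150, 150)))  # ~0.143
```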

1) Loss Function: With each step of model training, a predicted anchor box 'a' is assigned a label as positive (1) or negative (0) based on its associativity with the object of interest having ground-truth box 'g'. The positive anchor box is then assigned a class label $y_a \in \{c_1, c_2, \dots, c_n\}$, where $c_n$ indicates the category of the $n$th object, while also generating the encoding vector for box 'g' with respect to 'a' as $f(g_a|a)$, with $y_a = 0$ for negative anchor boxes. Consider an image $I$; for some anchor 'a', the model with trained parameters $\omega$ predicts the object class as $Y_{cls}(I|a;\omega)$ and the corresponding box offset as $Y_{reg}(I|a;\omega)$. The loss for a single anchor prediction is then composed of the classification loss ($L_{cls}$) and the bounding box regression loss ($L_{reg}$) as given by Eq. 4:

$$L(a|I;\omega) = \alpha\, \mathbb{1}_a^{obj}\, L_{reg}\big(f(g_a|a) - Y_{reg}(I|a;\omega)\big) + \beta\, L_{cls}\big(y_a, Y_{cls}(I|a;\omega)\big) \quad (4)$$

where $\mathbb{1}_a^{obj}$ is 1 if 'a' is a positive anchor, and $\alpha$ and $\beta$ are the weights associated with the regression and classification losses. Later, the overall loss of the model is computed as the average of $L(a|I;\omega)$ over the predictions for all the anchors.

Fig. 3: Schematic representation of the faster RCNN architecture.

B. Faster RCNN

Proposed by Ren et al. [38], the faster RCNN is derived from its predecessors RCNN [61] and fast RCNN [62], which rely on an external region proposal approach based on selective search (SS) [73]. Many researchers [74], [75], [76] observed that, instead of using SS, it is recommended to utilize the advantages of convolution layers for better and faster localization of the objects. Hence, Ren et al. proposed the Region Proposal Network (RPN), which uses CNN models, e.g. VGGNet, ResNet, etc., to generate the region proposals, making faster RCNN 10 times faster than fast RCNN. Fig. 3 shows the schematic representation of the faster RCNN architecture, where the RPN module performs binary classification of an object or not an object (background), while the classification module assigns categories for each detected object (multi-class classification) by using region of interest (RoI) pooling [38] on the extracted feature maps with projected regions.

1) Loss function: The faster RCNN is the combination of two modules: the RPN and the fast RCNN detector. The overall multi-task loss function is composed of the classification loss and the bounding box regression loss as defined in Eq. 4, with the $L_{cls}$ and $L_{reg}$ functions defined in Eq. 5:

$$L_{cls}(p_i, p_i^*) = -p_i^* \log(p_i) - (1 - p_i^*)\log(1 - p_i)$$
$$L_{reg}(t^u, v) = \sum_{x \in \{x, y, w, h\}} L_1^{smooth}(t_x^u - v_x)$$
$$L_1^{smooth}(q) = \begin{cases} 0.5q^2 & \text{if } |q| < 1 \\ |q| - 0.5 & \text{otherwise} \end{cases} \quad (5)$$

where $t^u = \{t_x^u, t_y^u, t_w^u, t_h^u\}$ is the predicted correction of the bounding box for the true class label $u$; $(x, y)$ corresponds to the top-left coordinates of the bounding box with height $h$ and width $w$; $v$ is the ground-truth bounding box; $p_i$ is the predicted class probability and $p_i^*$ is the actual class label.
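A short NumPy rendering of the Eq. 5 components may help; this is an illustrative sketch under the stated definitions, not the reference implementation:

```python
import numpy as np

def smooth_l1(q):
    """Smooth L1 (Huber-like) loss of Eq. 5, applied elementwise."""
    q = np.asarray(q, dtype=np.float64)
    return np.where(np.abs(q) < 1.0, 0.5 * q ** 2, np.abs(q) - 0.5)

def l_reg(t_u, v):
    """Bounding box regression loss: sum of smooth L1 over (x, y, w, h) offsets."""
    return smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()

def l_cls(p, p_star):
    """Binary cross-entropy classification loss of Eq. 5."""
    return -(p_star * np.log(p) + (1.0 - p_star) * np.log(1.0 - p))

# Example: predicted box corrections vs. ground truth (mixes both branches).
print(l_reg([0.2, -0.1, 1.5, 0.3], [0.0, 0.0, 0.0, 0.0]))  # -> 1.07
print(l_cls(0.9, 1.0))                                     # -> ~0.105
```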

C. Single Shot Detector (SSD)

In this research, the single shot detector (SSD) [63] is also used as another object identification method to detect people in a real-time video surveillance system. As discussed earlier, faster RCNN works on region proposals to create bounding boxes indicating objects and shows better accuracy, but has a slow frames per second (FPS) processing rate. For real-time processing, SSD further improves the accuracy and FPS by using multi-scale features and default boxes in a single process. It follows the principle of a feed-forward convolution network which generates bounding boxes of fixed sizes, along with a score based on the presence of object class instances in those boxes, followed by an NMS step to produce the final detections. Thus, it consists of two steps, extracting feature maps and applying convolution filters to detect objects, using an architecture having three main parts. The first part is a base pretrained network to extract feature maps, whereas the second part consists of multi-scale feature layers, in which series of convolution filters are cascaded after the base network. The last part is a non-maximum suppression unit for eliminating overlapping boxes, keeping only one box per object. The architecture of SSD is shown in Fig. 4.

Fig. 4: Schematic representation of the SSD architecture.

1) Loss function: Similar to the above discussed faster RCNN model, the overall loss function of the SSD model is equal to the sum of the multi-class classification loss ($L_{cls}$) and the bounding box regression loss (localization loss, $L_{reg}$) as shown in Eq. 4, where $L_{reg}$ and $L_{cls}$ are defined by Eq. 6 and Eq. 7:

$$L_{reg}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, smooth_{L1}(l_i^m - \hat{g}_j^m)$$
$$\hat{g}_j^{cx} = \frac{g_j^{cx} - a_i^{cx}}{a_i^{w}}, \qquad \hat{g}_j^{cy} = \frac{g_j^{cy} - a_i^{cy}}{a_i^{h}}$$
$$\hat{g}_j^{w} = \log\left(\frac{g_j^{w}}{a_i^{w}}\right), \qquad \hat{g}_j^{h} = \log\left(\frac{g_j^{h}}{a_i^{h}}\right)$$
$$x_{ij}^{p} = \begin{cases} 1 & \text{if } IoU > 0.5 \\ 0 & \text{otherwise} \end{cases} \quad (6)$$

where $l$ is the predicted box, $g$ is the ground-truth box, $x_{ij}^p$ is an indicator that matches the $i$th anchor box to the $j$th ground-truth box, and $cx$ and $cy$ are offsets to the anchor box $a$.

$$L_{cls}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0}) \quad (7)$$

where $\hat{c}_i^{p} = \frac{\exp(c_i^p)}{\sum_p \exp(c_i^p)}$ and $N$ is the number of matched default boxes.
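For intuition, the offset encoding $\hat{g}$ of Eq. 6 can be sketched as follows; the helper is hypothetical and assumes boxes in center-size format:

```python
import numpy as np

def encode_offsets(g, a):
    """Eq. 6 offset encoding of a ground-truth box w.r.t. an anchor box.

    Both boxes are given as (cx, cy, w, h); returns the regression targets
    (g_cx_hat, g_cy_hat, g_w_hat, g_h_hat) that the SSD head learns to predict.
    """
    g_cx, g_cy, g_w, g_h = g
    a_cx, a_cy, a_w, a_h = a
    return np.array([
        (g_cx - a_cx) / a_w,   # horizontal center offset, scaled by anchor width
        (g_cy - a_cy) / a_h,   # vertical center offset, scaled by anchor height
        np.log(g_w / a_w),     # log-scale width correction
        np.log(g_h / a_h),     # log-scale height correction
    ])

# Example: ground truth slightly right of and larger than the anchor.
print(encode_offsets((110, 100, 60, 120), (100, 100, 50, 100)))
```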

D. YOLO

For object detection, another competitor of SSD is YOLO [40]. This method can predict the type and location of an object by looking only once at the image. YOLO considers the object detection problem as a regression task instead of classification to assign class probabilities to the anchor boxes. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities. There are majorly three versions of YOLO: v1, v2 and v3. YOLO v1 is inspired by GoogLeNet (the Inception network), which is designed for object classification in an image. This network consists of 24 convolutional layers and 2 fully connected layers; instead of the Inception modules used by GoogLeNet, YOLO v1 simply uses a reduction layer followed by convolutional layers. Later, YOLO v2 [64] was proposed with the objective of improving the accuracy significantly while making it faster. YOLO v2 uses Darknet-19 as a backbone network, consisting of 19 convolution layers along with 5 max pooling layers and an output softmax layer for object classification. YOLO v2 outperformed its predecessor (YOLO v1) with significant improvements in mAP, FPS and object classification score. In contrast, YOLO v3 performs multi-label classification with the help of logistic classifiers instead of using softmax as in the case of YOLO v1 and v2. In YOLO v3, Redmon et al. proposed Darknet-53 as a backbone architecture that extracts feature maps for classification. In contrast to Darknet-19, Darknet-53 consists of residual blocks (short connections) along with upsampling layers for concatenation, adding depth to the network. YOLO v3 generates three predictions for each spatial location at different scales in an image, which eliminates the problem of not being able to detect small objects efficiently [77]. Each prediction is monitored by computing the objectness, boundary box regressor and classification scores. Fig. 5 presents a schematic description of the YOLO v3 architecture.

Fig. 5: Schematic representation of the YOLO v3 architecture.

1) Loss function: The overall loss function of YOLO v3 consists of a localization loss (bounding box regressor), cross-entropy, and a confidence loss for the classification score, defined as follows:

$$\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \Big[ (t_x - \hat{t}_x)^2 + (t_y - \hat{t}_y)^2 + (t_w - \hat{t}_w)^2 + (t_h - \hat{t}_h)^2 \Big]$$
$$+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \Big( -\log(\sigma(t_o)) + \sum_{k=1}^{C} BCE(\hat{y}_k, \sigma(s_k)) \Big)$$
$$+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \big( -\log(1 - \sigma(t_o)) \big) \quad (8)$$

where $\lambda_{coord}$ indicates the weight of the coordinate error, $S^2$ indicates the number of grid cells in the image, and $B$ is the number of generated bounding boxes per grid cell; $\mathbb{1}_{ij}^{obj} = 1$ describes that the object is confined in the $j$th bounding box of grid cell $i$, and otherwise it is 0.
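A toy NumPy rendering of Eq. 8 is sketched below for already-matched predictions; the weights $\lambda_{coord}$ and $\lambda_{noobj}$ are illustrative defaults, and the layout of the prediction tensor is an assumption, not the Darknet format:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def yolo_v3_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Toy evaluation of Eq. 8 for already-matched predictions.

    pred, target: arrays of shape (S*S*B, 4 + 1 + C) holding the raw box
    offsets (tx, ty, tw, th), the objectness logit to, and C class logits.
    obj_mask: boolean array of shape (S*S*B,), the indicator 1_ij^obj.
    """
    box_err = np.sum((pred[obj_mask, :4] - target[obj_mask, :4]) ** 2)
    obj_conf = -np.log(sigmoid(pred[obj_mask, 4])).sum()
    noobj_conf = -np.log(1.0 - sigmoid(pred[~obj_mask, 4])).sum()
    p, y = sigmoid(pred[obj_mask, 5:]), target[obj_mask, 5:]
    bce = -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()  # multi-label BCE
    return lambda_coord * box_err + obj_conf + bce + lambda_noobj * noobj_conf

# Example: S*S*B = 2 candidate boxes, C = 1 class ("person").
pred = np.array([[0.1, 0.2, 0.0, 0.1, 2.0, 3.0],
                 [0.0, 0.0, 0.0, 0.0, -2.5, 0.0]])
target = np.array([[0.0, 0.0, 0.0, 0.0, 1.0, 1.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
print(yolo_v3_loss(pred, target, obj_mask=np.array([True, False])))
```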

E. Deepsort

Deepsort is a deep learning based approach to track custom objects in a video [78]. In the present research, Deepsort is utilized to track the individuals present in the surveillance footage. It makes use of patterns learned via detected objects in the images, which are later combined with the temporal information for predicting the associated trajectories of the objects of interest. It keeps track of each object under consideration by mapping unique identifiers for further statistical analysis. Deepsort is also useful in handling associated challenges such as occlusion, multiple viewpoints, non-stationary cameras and annotating training data. For effective tracking, the Kalman filter and the Hungarian algorithm are used. The Kalman filter is used recursively for better association, and it can predict future positions based on the current position. The Hungarian algorithm is used for association and ID attribution, identifying whether an object in the current frame is the same as the one in the previous frame. Initially, a faster RCNN is trained for person identification, and for tracking, a linear constant velocity model [79] is utilized to describe each target with an eight-dimensional state space as follows:

$$x = [u, v, \lambda, h, \dot{u}, \dot{v}, \dot{\lambda}, \dot{h}]^T \quad (9)$$

where $(u, v)$ is the centroid of the bounding box, $\lambda$ is the aspect ratio, and $h$ is the height of the bounding box; the remaining variables are the respective velocities of these quantities. Later, the standard Kalman filter with a constant velocity motion and linear observation model is used, where the bounding box coordinates $(u, v, \lambda, h)$ are taken as direct observations of the object state.
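The prediction and update steps of this filter can be sketched in a few lines; this is a minimal textbook Kalman filter under the constant-velocity assumption, not the tuned filter of the Deepsort release:

```python
import numpy as np

# Constant-velocity model for the 8-dim Deepsort state of Eq. 9:
# x = [u, v, lam, h, du, dv, dlam, dh]. Positions advance by their
# velocities each frame (dt = 1); the observation picks out (u, v, lam, h).
dim = 4
F = np.eye(2 * dim)
F[:dim, dim:] = np.eye(dim)        # x_{t+1} = x_t + velocity * dt
H = np.eye(dim, 2 * dim)           # measurement model: observe (u, v, lam, h)

def predict(x, P, Q):
    """One Kalman prediction step under the constant-velocity model."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, R):
    """Standard Kalman update with measurement z = (u, v, lam, h)."""
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x + K @ (z - H @ x)
    P_new = (np.eye(2 * dim) - K @ H) @ P
    return x_new, P_new, S               # S is reused for the Eq. 10 gate

x, P = np.zeros(8), np.eye(8)
x, P = predict(x, P, Q=0.01 * np.eye(8))
x, P, S = update(x, P, z=np.array([5.0, 5.0, 0.5, 100.0]), R=np.eye(4))
```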

For each track $k$, the total number of frames since the last successful measurement association $a_k$ is counted. This counter is incremented with every prediction of the Kalman filter, and is reset to 0 when the track gets associated with a measurement. Furthermore, if an identified track exceeds a predefined maximum age, the corresponding object is considered to have left the scene and the track gets removed from the track set. If there are detected objects that cannot be mapped to existing tracks, new track hypotheses are initiated for each such unidentified detection. For the first three frames, the new tracks are classified as indefinite until a successful measurement mapping is computed; tracks not successfully mapped with a measurement within this span get deleted from the track set. The Hungarian algorithm is then utilized to solve the mapping problem between the newly arrived measurements and the predicted Kalman states, by considering the motion and appearance information with the help of the Mahalanobis distance computed between them, as defined in Eq. 10:

$$d^{(1)}(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i) \quad (10)$$

where the projection of the $i$th track distribution into measurement space is represented by $(y_i, S_i)$ and the $j$th bounding box detection by $d_j$. The Mahalanobis distance accounts for the state estimation uncertainty by measuring how many standard deviations the detection is away from the mean track location. Further, using this metric, it is possible to exclude unlikely associations by thresholding the Mahalanobis distance. This decision is denoted with an indicator that evaluates to 1 if the association between the $i$th track and the $j$th detection is admissible (Eq. 11):

$$b_{ij}^{(1)} = \mathbb{1}[d^{(1)}(i, j) < t^{(1)}] \quad (11)$$

Though the Mahalanobis distance performs efficiently, it fails in environments where camera motion is possible; thereby another metric is introduced for the assignment problem. This second metric measures the smallest cosine distance between the $i$th track and the $j$th detection in appearance space as follows:

$$d^{(2)}(i, j) = \min\{1 - r_j^T r_k^{(i)} \mid r_k^{(i)} \in \mathcal{R}_i\} \quad (12)$$

Again, a binary variable is introduced to indicate whether an association is admissible according to this metric:

$$b_{ij}^{(2)} = \mathbb{1}[d^{(2)}(i, j) < t^{(2)}] \quad (13)$$

and a suitable threshold is measured for this indicator on a separate training dataset. To build the association problem, both metrics are combined using a weighted sum:

$$c_{ij} = \lambda d^{(1)}(i, j) + (1 - \lambda) d^{(2)}(i, j) \quad (14)$$

where an association is admissible if it is within the gating region of both metrics:

$$b_{ij} = \prod_{m=1}^{2} b_{ij}^{(m)} \quad (15)$$

The influence of each metric on the combined association cost can be controlled through the hyperparameter $\lambda$.
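Putting Eqs. 10–15 together, the gated assignment can be sketched with SciPy's Hungarian solver; the distance matrices are assumed precomputed, and the gating thresholds are illustrative (9.4877 is the 0.95 chi-square quantile for four degrees of freedom, commonly used for the Mahalanobis gate):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(maha, cosine, lam=0.5, t1=9.4877, t2=0.2):
    """Gated assignment of detections to tracks (sketch of Eqs. 10-15).

    maha:   (num_tracks, num_dets) Mahalanobis distances d^(1)
    cosine: (num_tracks, num_dets) appearance cosine distances d^(2)
    lam:    weight between motion and appearance cost (Eq. 14)
    """
    cost = lam * maha + (1.0 - lam) * cosine    # Eq. 14
    admissible = (maha < t1) & (cosine < t2)    # Eqs. 11, 13 and 15
    cost = np.where(admissible, cost, 1e5)      # gate out bad pairs
    rows, cols = linear_sum_assignment(cost)    # Hungarian algorithm
    # Keep only the matches that passed both gates.
    return [(i, j) for i, j in zip(rows, cols) if admissible[i, j]]

# Toy example: two tracks, three detections.
maha = np.array([[1.0, 8.0, 50.0], [40.0, 2.0, 60.0]])
cosine = np.array([[0.05, 0.15, 0.9], [0.8, 0.1, 0.95]])
print(associate(maha, cosine))  # -> [(0, 0), (1, 1)]
```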

IV. PROPOSED APPROACH

The emergence of deep learning has brought the best performing techniques for a wide variety of tasks and challenges, including medical diagnosis [74], machine translation [75], speech recognition [76] and a lot more [80]. Most of these tasks are centred around object classification, detection, segmentation, tracking and recognition [81], [82]. In recent years, convolution neural network (CNN) based architectures have shown significant performance improvements that are leading towards high quality object detection, as shown in Fig. 2, which presents the performance of such models in terms of mAP and FPS on the standard benchmark datasets PASCAL-VOC [66] and MS-COCO [67], with similar hardware resources.

In the present article, a deep learning based framework is proposed that utilizes object detection and tracking models to aid in the social distancing remedy for dealing with the escalation of COVID-19 cases. In order to maintain the balance of speed and accuracy, YOLO v3 [65] alongside Deepsort [78] are utilized as the object detection and tracking approaches, surrounding each detected object with a bounding box. Later, these bounding boxes are utilized to compute the pairwise L2 norm with a computationally efficient vectorized representation for identifying the clusters of people not obeying the order of social distancing. Furthermore, to visualize the clusters in the live stream, each bounding box is color-coded based on its association with a group, where people belonging to the same group are represented with the same color. Each surveillance frame is also accompanied by a streamline plot depicting the statistical count of the number of social groups and an index term (violation index) representing the ratio of the number of people to the number of groups. Furthermore, estimated violations can be computed by multiplying the violation index with the total number of social groups.

A. Workflow

This section includes the necessary steps undertaken to compose a framework for monitoring social distancing.

1. Fine-tune the trained object detection model to identify and track persons in a footage.

2. The trained model is fed with the surveillance footage. The model generates a set of bounding boxes and an ID for each identified person.

3. Each individual is associated with a three-dimensional feature space $(x, y, d)$, where $(x, y)$ corresponds to the centroid coordinates of the bounding box and $d$ defines the depth of the individual as observed from the camera:

$$d = \frac{2 \times 3.14 \times 180}{(w + h \times 360)} \times 1000 + 3 \quad (16)$$

where $w$ is the width of the bounding box and $h$ is the height of the bounding box [83].

4. For the set of bounding boxes, the pairwise L2 norm is computed as given by the following equation:

$$\|D\|_2 = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \quad (17)$$

where in this work $n = 3$.

5. The dense matrix of the L2 norm is then utilized to assign neighbors to each individual that satisfy the closeness sensitivity. With extensive trials, the closeness threshold is updated dynamically based on the spatial location of the person in a given frame, ranging between 90 and 170 pixels.

6. Any individual that meets the closeness property is assigned a neighbour or neighbours, forming a group represented in a different color coding in contrast to other people.

7. The formation of groups indicates the violation of the practice of social distancing, which is quantified as follows (and sketched in code after this list):
   – Consider $n_g$ as the number of groups or clusters identified, and $n_p$ as the total number of people found in close proximity.
   – $v_i = n_p / n_g$, where $v_i$ is the violation index.
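The following minimal sketch (helper names hypothetical; the closeness threshold is an illustrative mid-range value from the 90–170 pixel band) combines Eqs. 16 and 17 with a simple connected-components grouping to produce the violation index:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def depth(w, h):
    """Depth estimate of Eq. 16 from the bounding box width and height [83]."""
    return (2 * 3.14 * 180) / (w + h * 360) * 1000 + 3

def violation_index(boxes, threshold=130):
    """Group people by pairwise closeness in (x, y, d) space.

    boxes: list of (x, y, w, h), with (x, y) the bounding box centroid.
    Returns (people_in_proximity, num_groups, violation_index).
    """
    feats = np.array([[x, y, depth(w, h)] for x, y, w, h in boxes])
    close = squareform(pdist(feats)) < threshold   # pairwise L2 norm (Eq. 17)
    np.fill_diagonal(close, False)
    # Union-find over the closeness graph to form groups.
    parent = list(range(len(boxes)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in zip(*np.nonzero(close)):
        parent[find(i)] = find(j)
    members = [i for i in range(len(boxes)) if close[i].any()]
    groups = {find(i) for i in members}
    n_p, n_g = len(members), len(groups)
    return n_p, n_g, (n_p / n_g if n_g else 0.0)

# Two people close together form one group; the third person is isolated.
print(violation_index([(100, 100, 40, 80), (150, 120, 42, 85), (600, 400, 40, 80)]))
```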

V. EXPERIMENTS AND RESULTS

The above discussed object detection models are fine-tuned for binary classification (person or not a person) with Inception v2 as a backbone network on an Nvidia GTX 1060 GPU, using the dataset acquired from the open image dataset (OID) repository [73] maintained by the Google open source community. Diverse images with the class label "Person" are downloaded via the OIDv4 toolkit [84] along with the annotations. Fig. 6 shows sample images of the obtained dataset, consisting of 800 images obtained by manual filtering to contain only true samples. The dataset is then divided into training and testing sets in an 8:2 ratio. In order to make the testing robust, the testing set is also accompanied by frames of the surveillance footage of the Oxford town center [23]. Later, this footage is also utilized to simulate the overall approach for monitoring social distancing. In the case of faster RCNN, the images are resized to P pixels on the shorter edge, with P as 600 and 1024 for low and high resolution, while in SSD and YOLO the images are scaled to the fixed dimension P x P with P = 416. During the training phase, the performance of the models is continuously monitored using the mAP along with the localization, classification and overall loss in the detection of the person, as indicated in Fig. 7. Table III summarizes the results of each model obtained at the end of the training phase, with the training time (TT), number of iterations (NoI), mAP and total loss (TL) values. It is observed that the faster RCNN model achieved minimal loss with maximum mAP; however, it has the lowest FPS, which makes it unsuitable for real-time applications. Furthermore, as compared to SSD, YOLO v3 achieved better results with a balanced mAP, training time and FPS score. The trained YOLO v3 model is then utilized for monitoring social distancing on the surveillance video.

Fig. 6: Data samples showing (a) true samples and (b) false samples of a "Person" class from the open image dataset.

Fig. 7: Losses per iteration of the object detection models during the training phase on the OID validation set for detecting the person in an image.

TABLE III: Performance comparison of the object detection models

Model       | TT (in sec) | NoI   | mAP   | TL   | FPS
Faster RCNN | 9651        | 12135 | 0.969 | 0.02 | 3
SSD         | 2124        | 1200  | 0.691 | 0.22 | 10
YOLO v3     | 5659        | 7560  | 0.846 | 0.87 | 23

Fig. 8: Sample output of the proposed framework for monitoring social distancing on the surveillance footage of the Oxford town center.

VI. OUTPUT

The proposed framework outputs (as shown in Fig. 8) the processed frame with the identified people confined in bounding boxes, while also providing a statistical analysis showing the total number of social groups displayed by the same color encoding and a violation index term computed as the ratio of the number of people to the number of groups. The frames shown in Fig. 8 display violation indices of 3, 2, 2 and 2.33. The frames with detected violations are recorded with the timestamp for future analysis.

VII. FUTURE SCOPE AND CHALLENGES

Since this application is intended to be used in any working environment, accuracy and precision are highly desired to serve the purpose. A higher number of false positives may raise discomfort and panic among the people being observed. There may also be genuinely raised concerns about privacy and individual rights, which can be addressed with some additional measures such as prior consent for such working environments, hiding a person's identity in general, and maintaining transparency about its fair use within limited stakeholders.

VIII. CONCLUSION

The article proposes an efficient real-time deep learning based framework to automate the process of monitoring social distancing via object detection and tracking approaches, where each individual is identified in real time with the help of bounding boxes. The generated bounding boxes aid in identifying the clusters or groups of people satisfying the closeness property, computed with the help of a pairwise vectorized approach. The number of violations is confirmed by computing the number of groups formed and the violation index term, computed as the ratio of the number of people to the number of groups. Extensive trials were conducted with popular state-of-the-art object detection models, faster RCNN, SSD and YOLO v3, where YOLO v3 illustrated efficient performance with balanced FPS and mAP scores. Since this approach is highly sensitive to the spatial location of the camera, the same approach can be fine-tuned to better adjust with the corresponding field of view.

ACKNOWLEDGMENT

The authors gratefully acknowledge the helpful comments and suggestions of colleagues. The authors are also indebted to the Interdisciplinary Cyber Physical Systems (ICPS) Programme, Department of Science and Technology (DST), Government of India (GoI), vide Reference No. 244, for their financial support to carry out the background research which helped significantly in the implementation of the present research work.

REFERENCES

[1] W. H. Organization, "WHO corona-viruses (COVID-19)," https://www.who.int/emergencies/diseases/novel-corona-virus-2019, 2020, [Online; accessed May 02, 2020].
[2] WHO, "WHO Director-General's opening remarks at the media briefing on COVID-19 - 11 March 2020," https://www.who.int/dg/speeches/detail, 2020, [Online; accessed March 12, 2020].
[3] L. Hensley, "Social distancing is out, physical distancing is in - here's how to do it," Global News–Canada (27 March 2020), 2020.
[4] ECDPC, "Considerations relating to social distancing measures in response to COVID-19 – second update," https://www.ecdc.europa.eu/en/publications-data/considerations, 2020, [Online; accessed March 23, 2020].
[5] M. W. Fong, H. Gao, J. Y. Wong, J. Xiao, E. Y. Shiu, S. Ryu, and B. J. Cowling, "Nonpharmaceutical measures for pandemic influenza in nonhealthcare settings - social distancing measures," 2020.
[6] F. Ahmed, N. Zviedrite, and A. Uzicanin, "Effectiveness of workplace social distancing measures in reducing influenza transmission: a systematic review," BMC Public Health, vol. 18, no. 1, p. 518, 2018.
[7] W. O. Kermack and A. G. McKendrick, "Contributions to the mathematical theory of epidemics–I. 1927," 1991.
[8] C. Eksin, K. Paarporn, and J. S. Weitz, "Systematic biases in disease forecasting–the role of behavior change," Epidemics, vol. 27, pp. 96–105, 2019.
[9] M. Zhao and H. Zhao, "Asymptotic behavior of global positive solution to a stochastic SIR model incorporating media coverage," Advances in Difference Equations, vol. 2016, no. 1, pp. 1–17, 2016.
[10] P. Alto, "Landing AI named an April 2020 Cool Vendor in the Gartner Cool Vendors in AI Core Technologies," https://www.yahoo.com/lifestyle/landing-ai-named-april-2020-152100532.html, 2020, [Online; accessed April 21, 2020].
[11] A. Y. Ng, "Curriculum Vitae," https://ai.stanford.edu/~ang/curriculum-vitae.pdf.
[12] L. AI, "Landing AI named an April 2020 Cool Vendor in the Gartner Cool Vendors in AI Core Technologies," https://www.prnewswire.com/news-releases, 2020, [Online; accessed April 22, 2020].
[13] B. News, "China coronavirus: Lockdown measures rise across Hubei province," https://www.bbc.co.uk/news/world-asia-china-51217455, 2020, [Online; accessed January 23, 2020].
[14] N. H. C. of the People's Republic of China, "Daily briefing on novel coronavirus cases in China," http://en.nhc.gov.cn/2020-03/20/c_78006.htm, 2020, [Online; accessed March 20, 2020].
[15] K. Prem, Y. Liu, T. W. Russell, A. J. Kucharski, R. M. Eggo, N. Davies, S. Flasche, S. Clifford, C. A. Pearson, J. D. Munday, et al., "The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study," The Lancet Public Health, 2020.
[16] C. Adolph, K. Amano, B. Bang-Jensen, N. Fullman, and J. Wilkerson, "Pandemic politics: Timing state-level social distancing responses to COVID-19," medRxiv, 2020.
[17] K. E. Ainslie, C. E. Walters, H. Fu, S. Bhatia, H. Wang, X. Xi, M. Baguelin, S. Bhatt, A. Boonyasiri, O. Boyd, et al., "Evidence of initial success for China exiting COVID-19 social distancing policy after achieving containment," Wellcome Open Research, vol. 5, no. 81, p. 81, 2020.
[18] S. K. Sonbhadra, S. Agarwal, and P. Nagabhushan, "Target specific mining of COVID-19 scholarly articles using one-class approach," 2020.
[19] N. S. Punn and S. Agarwal, "Automated diagnosis of COVID-19 with limited posteroanterior chest X-ray images using fine-tuned deep neural networks," 2020.
[20] N. S. Punn, S. K. Sonbhadra, and S. Agarwal, "COVID-19 epidemic analysis using machine learning and deep learning algorithms," medRxiv, 2020. [Online]. Available: https://www.medrxiv.org/content/early/2020/04/11/2020.04.08.20057679
[21] Official website of Indian Government, "Distribution of the novel coronavirus-infected pneumonia: Aarogya Setu Mobile App," https://www.mygov.in/aarogya-setu-app, 2020.
[22] M. Robakowska, A. Tyranska-Fobke, J. Nowak, D. Slezak, P. Zuratynski, P. Robakowski, K. Nadolny, and J. R. Ładny, "The use of drones during mass events," Disaster and Emergency Medicine Journal, vol. 2, no. 3, pp. 129–134, 2017.
[23] J. Harvey, Adam. LaPlace. (2019) Megapixels.cc: Origins, ethics, and privacy implications of publicly available face recognition image datasets. [Online]. Available: https://megapixels.cc
[24] N. Sulman, T. Sanocki, D. Goldgof, and R. Kasturi, "How effective is human video surveillance performance?" in 2008 19th International Conference on Pattern Recognition. IEEE, 2008, pp. 1–3.
[25] X. Wang, "Intelligent multi-camera video surveillance: A review," Pattern Recognition Letters, vol. 34, no. 1, pp. 3–19, 2013.
[26] K. A. Joshi and D. G. Thakore, "A survey on moving object detection and tracking in video surveillance system," International Journal of Soft Computing and Engineering, vol. 2, no. 3, pp. 44–48, 2012.
[27] O. Javed and M. Shah, "Tracking and object classification for automated surveillance," in European Conference on Computer Vision. Springer, 2002, pp. 343–357.
[28] S. Brutzer, B. Hoferlin, and G. Heidemann, "Evaluation of background subtraction techniques for video surveillance," in CVPR 2011. IEEE, 2011, pp. 1937–1944.
[29] S. Aslani and H. Mahdavi-Nasab, "Optical flow based moving object detection and tracking for traffic surveillance," International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering, vol. 7, no. 9, pp. 1252–1256, 2013.
[30] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance. IEEE, 2005, pp. 65–72.
[31] M. Piccardi, "Background subtraction techniques: a review," in 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), vol. 4. IEEE, 2004, pp. 3099–3104.
[32] Y. Xu, J. Dong, B. Zhang, and D. Xu, "Background modeling methods in video analysis: A review and comparative evaluation," CAAI Transactions on Intelligence Technology, vol. 1, no. 1, pp. 43–60, 2016.
[33] H. Tsutsui, J. Miura, and Y. Shirai, "Optical flow-based person tracking by multiple cameras," in Conference Documentation International Conference on Multisensor Fusion and Integration for Intelligent Systems. MFI 2001 (Cat. No. 01TH8590). IEEE, 2001, pp. 91–96.
[34] A. Agarwal, S. Gupta, and D. K. Singh, "Review of optical flow technique for moving object detection," in 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I). IEEE, 2016, pp. 409–413.
[35] S. A. Niyogi and E. H. Adelson, "Analyzing gait with spatiotemporal surfaces," in Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects. IEEE, 1994, pp. 64–69.
[36] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[38] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[39] X. Chen and A. Gupta, "An implementation of faster RCNN with study for region sampling," arXiv preprint arXiv:1702.02138, 2017.
[40] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[41] M. Putra, Z. Yussof, K. Lim, and S. Salim, "Convolutional neural network for person and car detection using YOLO framework," Journal of Telecommunication, Electronic and Computer Engineering (JTEC), vol. 10, no. 1-7, pp. 67–71, 2018.
[42] R. Eshel and Y. Moses, "Homography based multiple camera detection and tracking of people in a dense crowd," in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[43] D.-Y. Chen, C.-W. Su, Y.-C. Zeng, S.-W. Sun, W.-R. Lai, and H.-Y. M. Liao, "An online people counting system for electronic advertising machines," in 2009 IEEE International Conference on Multimedia and Expo. IEEE, 2009, pp. 1262–1265.
[44] C.-W. Su, H.-Y. M. Liao, and H.-R. Tyan, "A vision-based people counting approach based on the symmetry measure," in 2009 IEEE International Symposium on Circuits and Systems. IEEE, 2009, pp. 2617–2620.
[45] J. Yao and J.-M. Odobez, "Fast human detection from joint appearance and foreground feature subset covariances," Computer Vision and Image Understanding, vol. 115, no. 10, pp. 1414–1426, 2011.
[46] B. Wu and R. Nevatia, "Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors," International Journal of Computer Vision, vol. 75, no. 2, pp. 247–266, 2007.
[47] F. Z. Eishita, A. Rahman, S. A. Azad, and A. Rahman, "Occlusion handling in object detection," in Multidisciplinary Computational Intelligence Techniques: Applications in Business, Engineering and Medicine. IGI Global, 2012, pp. 61–74.
[48] M. Singh, A. Basu, and M. K. Mandal, "Human activity recognition based on silhouette directionality," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 9, pp. 1280–1292, 2008.
[49] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886–893.
[50] P. Huang, A. Hilton, and J. Starck, "Shape similarity for 3D video sequences of people," International Journal of Computer Vision, vol. 89, no. 2-3, pp. 362–381, 2010.
[51] A. Samal and P. A. Iyengar, "Automatic recognition and analysis of human faces and facial expressions: A survey," Pattern Recognition, vol. 25, no. 1, pp. 65–77, 1992.
[52] D. Cunado, M. S. Nixon, and J. N. Carter, "Using gait as a biometric, via phase-weighted magnitude spectra," in International Conference on Audio- and Video-Based Biometric Person Authentication. Springer, 1997, pp. 93–102.
[53] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian detection in crowded scenes," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 878–885.
[54] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[55] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Computing Surveys (CSUR), vol. 38, no. 4, pp. 13–es, 2006.
[56] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., vol. 3. IEEE, 2004, pp. 32–36.
[57] D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 249–257, 2006.
[58] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, vol. 2. IEEE, 2005, pp. 1395–1402.
[59] O. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, "The Oxford-IIIT pet dataset," 2012.
[60] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al., "The open images dataset v4," International Journal of Computer Vision, pp. 1–26, 2020.
[61] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[62] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[63] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[64] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[65] J. R. A. Farhadi and J. Redmon, "YOLOv3: An incremental improvement," Retrieved September, vol. 17, p. 2018, 2018.
[66] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[67] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[68] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[69] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[70] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[71] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[72] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[73] Google, "Open image dataset v6," https://storage.googleapis.com/openimages/web/index.html, 2020, [Online; accessed 25-February-2020].
[74] N. S. Punn and S. Agarwal, "Inception U-Net architecture for semantic segmentation to identify nuclei in microscopy cell images," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 1, pp. 1–15, 2020.
[75] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar, et al., "Tensor2Tensor for neural machine translation," arXiv preprint arXiv:1803.07416, 2018.
[76] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.
[77] A. Sonawane, "Medium: YOLOv3: A Huge Improvement," https://medium.com/anand_sonawane/yolo3, 2020, [Online; accessed Dec 6, 2019].
[78] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 3645–3649.
[79] N. Wojke and A. Bewley, "Deep cosine metric learning for person re-identification," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 748–756.
[80] S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. P. Reyes, M.-L. Shyu, S.-C. Chen, and S. Iyengar, "A survey on deep learning: Algorithms, techniques, and applications," ACM Computing Surveys (CSUR), vol. 51, no. 5, pp. 1–36, 2018.
[81] A. Brunetti, D. Buongiorno, G. F. Trotta, and V. Bevilacqua, "Computer vision and deep learning techniques for pedestrian detection and tracking: A survey," Neurocomputing, vol. 300, pp. 17–33, 2018.
[82] N. S. Punn and S. Agarwal, "Crowd analysis for congestion control early warning system on foot over bridge," in 2019 Twelfth International Conference on Contemporary Computing (IC3). IEEE, 2019, pp. 1–6.
[83] Pias, "Object detection and distance measurement," https://github.com/paul-pias/Object-Detection-and-Distance-Measurement, 2020, [Online; accessed 01-March-2020].
[84] J. Harvey, Adam. LaPlace. (2019) Megapixels: Origins, ethics, and privacy implications of publicly available face recognition image datasets. [Online]. Available: https://megapixels.cc

  • I Introduction
  • II Background study and related work
  • III Object detection and tracking models
    • III-A Anchor boxes
      • III-A1 Loss Function
        • III-B Faster RCNN
          • III-B1 Loss function
            • III-C Single Shot Detector (SSD)
              • III-C1 Loss function
                • III-D YOLO
                  • III-D1 Loss function
                    • III-E Deepsort
                      • IV Proposed approach
                        • IV-A Workflow
                          • V Experiments and results
                          • VI Output
                          • VII Future scope and challenges
                          • VIII Conclusion
                          • References

4

Fig 2 Performance overview of the most popular objectdetection models on PASCAL-VOC and MS-COCO datasets

image descriptions will enable us to study and perform ob-ject detection tasks efficiently and stimulate progress towardsgenuine understanding of the scene All explored literatureand related research work clearly establishes a picture thatthe application of human detection can easily get extended tomany applications to cater the situation that arises presentlysuch as to check prescribed standards for hygiene socialdistancing work practices etc

III OBJECT DETECTION AND TRACKING MODELS

As observed from Fig 2 the successful object detectionmodels like RCNN [61] fast RCNN [62] faster RCNN [38]SSD [63] YOLO v1 [40] YOLO v2 [64] and YOLO v3 [65]tested on PASCAL-VOC [66] and MS-COCO [67] datasetsundergo trade-off between speed and accuracy of the detec-tion which is dependent on various factors like backbonearchitecture (feature extraction network eg VGG-16 [68]ResNet-101 [69] Inception v2 [70] etc) input sizes modeldepth varying software and hardware environment A featureextractor tends to encode the modelrsquos input into certain featurerepresentation which aids in learning and discovering thepatterns associated with the desired objects In order to identifymultiple objects of varying scale or size it also uses predefinedboxes covering an entire image termed as anchor boxes Table Idescribes the performance in terms of accuracy for each ofthese popular and powerful feature extraction networks onILSVRC ImageNet challenge [71] along with the numberof trainable parameters which have a direct impact on thetraining speed and time As highlighted in Table I the ratio ofaccuracy to the number of parameters is highest for Inceptionv2 model indicating that Inception v2 achieved adequateclassification accuracy with minimal trainable parameters incontrast to other models and hence is utilized as a backbonearchitecture for faster and efficient computations in the fasterRCNN and SSD object detection models whereas YOLO v3uses different architecture Darknet-53 as proposed by Redmonet al [65]

A Anchor boxes

With the exhaustive literature survey it is observed thatevery popular object detection model utilizes the concept ofanchor boxes to detect multiple objects in the scene [36] Theseboxes are overlaid on the input image over various spatial

TABLE I Performance of the feature extraction network onImageNet challenge

Backbone model Accuracy (a) Parameters (p) Ratio (a100p)VGG-16 [68] 071 15 M 473ResNet-101 [69] 076 425 M 178Inception v2 [70] 074 10 M 740Inception v3 [72] 078 22 M 358Resnet v2 [72] 080 54 M 148

TABLE II Hyperparameters for generating the anchor boxes

Detectionmodel

Size vector(p)

Aspect ratio(r)

Anchorboxes

IoU thfor NMS

FasterRCNN [025 05 10] [05 10 20] 9 07

SSD [02 057 095] [03 05 10] 9 06YOLO v3 [025 05 10] [05 10 20] 9 07

locations (per filter) with varying sizes and aspect ratio Inthis article for an image of dimension breadth (b) times height(h) the anchor boxes are generated in the following mannerConsider the parameters size as p ε (0 1] and aspect ratioas r gt 0 then the anchor boxes for a certain location in animage can be constructed with dimensions as bp

radicr times hp

radicr

Table II shows the values of p and r configured for each modelLater the object detection model is trained to predict for eachgenerated anchor box to belong to a certain class and an offsetto adjust the dimensions of the anchor box to better fit theground-truth of the object while using the classification andregression loss Since there are many anchor boxes for a spatiallocation the object can get associated with more than oneanchor box This problem is dealt with non-max suppression(NMS) by computing intersection over union (IoU) parameterthat limits the anchor boxes association with the object ofinterest by calculating the score as the ratio of overlappingregions between the assigned anchor box and the ground-truthto the union of regions of the anchor box and the ground-truth The score value is then compared with the set thresholdhyperparameter to return the best bounding box for an object

1) Loss Function With each step of model training pre-dicted anchor box lsquoarsquo is assigned a label as positive (1) ornegative (0) based on its associativity with the object ofinterest having ground-truth box lsquogrsquo The positive anchor boxis then assigned a class label yo ε c1 c2 cn here cnindicates the category of the nth object while also generatingthe encoding vector for box lsquogrsquo with respect to lsquoarsquo as f(ga|a)where yo = 0 for negative anchor boxes Consider an imageI for some anchor lsquoarsquo model with trained parameters ωpredicted the object class as Ycls(I|aω) and the correspond-ing box offset as Yreg(I|aω) then the loss for a singleanchor prediction can be computed (Lcls) and bounding boxregression loss (Lreg) as given by the Eq 4

$L(a|I, \omega) = \alpha \, 1_a^{obj} \, L_{reg}\big(f(g_a|a) - Y_{reg}(I|a, \omega)\big) + \beta \, L_{cls}\big(y_a, Y_{cls}(I|a, \omega)\big)$    (4)

where $1_a^{obj}$ is 1 if 'a' is a positive anchor, and α and β are the weights associated with the regression and classification losses. Later, the overall loss of the model can be computed as


Fig. 3: Schematic representation of the faster RCNN architecture.

the average of $L(a|I, \omega)$ over the predictions for all the anchors.

B. Faster RCNN

Proposed by Ren et al. [38], the faster RCNN is derived from its predecessors RCNN [61] and fast RCNN [62], which rely on an external region proposal approach based on selective search (SS) [73]. Many researchers [74], [75], [76] observed that instead of using SS, it is recommended to utilize the advantages of convolution layers for better and faster localization of the objects. Hence, Ren et al. proposed the region proposal network (RPN), which uses CNN models, e.g. VGGNet, ResNet, etc., to generate the region proposals, making faster RCNN 10 times faster than fast RCNN. Fig. 3 shows the schematic representation of the faster RCNN architecture, where the RPN module performs binary classification of an object or not an object (background), while the classification module assigns a category to each detected object (multi-class classification) by using region of interest (RoI) pooling [38] on the extracted feature maps with projected regions.

1) Loss function: The faster RCNN is the combination of two modules: the RPN and the fast RCNN detector. The overall multi-task loss function is composed of the classification loss and the bounding box regression loss, as defined in Eq. 4, with the $L_{cls}$ and $L_{reg}$ functions defined in Eq. 5.

$L_{cls}(p_i, p_i^*) = -p_i^* \log(p_i) - (1 - p_i^*)\log(1 - p_i)$

$L_{reg}(t^u, v) = \sum_{x \in \{x, y, w, h\}} L_1^{smooth}(t_i^u - v)$

$L_1^{smooth}(q) = \begin{cases} 0.5\,q^2 & \text{if } |q| < 1 \\ |q| - 0.5 & \text{otherwise} \end{cases}$    (5)

where $t^u$ is the predicted correction of the bounding box, $t^u = \{t_x^u, t_y^u, t_w^u, t_h^u\}$. Here u is the true class label, (x, y) corresponds to the top-left coordinates of the bounding box with height h and width w, v is the ground-truth bounding box, $p_i$ is the predicted class probability, and $p_i^*$ is the ground-truth class label.
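As a worked illustration of Eq. 5, the following is a small NumPy sketch of the smooth L1 regression term and the binary cross-entropy classification term for a single anchor; the numeric values are illustrative assumptions.

```python
import numpy as np

def smooth_l1(q):
    """Smooth L1 of Eq. 5: quadratic for |q| < 1, linear beyond."""
    q = np.asarray(q, dtype=float)
    return np.where(np.abs(q) < 1.0, 0.5 * q ** 2, np.abs(q) - 0.5)

def l_cls(p, p_star):
    """Binary cross-entropy of Eq. 5 between prediction p and label p*."""
    return -p_star * np.log(p) - (1.0 - p_star) * np.log(1.0 - p)

# Regression loss summed over the four box corrections (x, y, w, h).
t_u = np.array([0.3, -1.4, 0.05, 0.8])  # predicted corrections (illustrative)
v = np.array([0.1, -0.2, 0.00, 0.5])    # ground-truth targets (illustrative)
print(smooth_l1(t_u - v).sum())         # L_reg contribution
print(l_cls(0.9, 1.0))                  # L_cls for a positive anchor
```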

C. Single Shot Detector (SSD)

In this research, the single shot detector (SSD) [63] is also used as another object identification method to detect people in a real-time video surveillance system. As discussed earlier, faster R-CNN works on region proposals to create boundary boxes indicating objects, and shows better accuracy but has slow processing in frames per second (FPS). For real-time processing, SSD further improves the accuracy and FPS by using multi-scale features and default boxes in a single process. It follows

Fig. 4: Schematic representation of the SSD architecture.

the principle of the feed-forward convolution network, which generates bounding boxes of fixed sizes along with a score based on the presence of object class instances in those boxes, followed by an NMS step to produce the final detections. Thus, it consists of two steps, extracting feature maps and applying convolution filters to detect objects, using an architecture with three main parts. The first part is a base pretrained network to extract feature maps; in the second part, multi-scale feature layers are used, in which a series of convolution filters are cascaded after the base network. The last part is a non-maximum suppression unit for eliminating overlapping boxes, keeping only one box per object. The architecture of SSD is shown in Fig. 4.

1) Loss function: Similar to the above discussed faster RCNN model, the overall loss function of the SSD model is equal to the sum of the multi-class classification loss ($L_{cls}$) and the bounding box regression loss (localization loss, $L_{reg}$), as shown in Eq. 4, where $L_{reg}$ and $L_{cls}$ are defined by Eq. 6 and 7.

$L_{reg}(x, l, g) = \sum_{i \in pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k} \, L_1^{smooth}(l_i^m - \hat{g}_j^m)$

$\hat{g}_j^{cx} = (g_j^{cx} - a_i^{cx})/a_i^{w}, \quad \hat{g}_j^{cy} = (g_j^{cy} - a_i^{cy})/a_i^{h}$

$\hat{g}_j^{w} = \log(g_j^{w}/a_i^{w}), \quad \hat{g}_j^{h} = \log(g_j^{h}/a_i^{h})$

$x_{ij}^{p} = \begin{cases} 1 & \text{if } IoU > 0.5 \\ 0 & \text{otherwise} \end{cases}$    (6)

where l is the predicted box, g is the ground-truth box, $x_{ij}^{p}$ is an indicator that matches the ith anchor box to the jth ground-truth box, and cx and cy are offsets to the anchor box a.

$L_{cls}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log(\hat{c}_i^{p}) - \sum_{i \in Neg} \log(\hat{c}_i^{0})$    (7)

where $\hat{c}_i^{p} = \frac{\exp(c_i^p)}{\sum_p \exp(c_i^p)}$ and N is the number of matched default boxes.

D. YOLO

For object detection, another competitor of SSD is YOLO [40]. This method can predict the type and location of an object by looking only once at the image. YOLO considers the object detection problem as a regression task instead of classification to assign class probabilities to the anchor boxes. A single convolutional network simultaneously predicts

6

Fig. 5: Schematic representation of the YOLO v3 architecture.

multiple bounding boxes and class probabilities. Majorly, there are three versions of YOLO: v1, v2 and v3. YOLO v1 is inspired by GoogLeNet (the Inception network), which is designed for object classification in an image. This network consists of 24 convolutional layers and 2 fully connected layers. Instead of the Inception modules used by GoogLeNet, YOLO v1 simply uses a reduction layer followed by convolutional layers. Later, YOLO v2 [64] was proposed with the objective of improving the accuracy significantly while making it faster. YOLO v2 uses Darknet-19 as a backbone network, consisting of 19 convolution layers along with 5 max pooling layers and an output softmax layer for object classification. YOLO v2 outperformed its predecessor (YOLO v1) with significant improvements in mAP, FPS and object classification score. In contrast, YOLO v3 performs multi-label classification with the help of logistic classifiers instead of using softmax as in the case of YOLO v1 and v2. In YOLO v3, Redmon et al. proposed Darknet-53 as a backbone architecture that extracts feature maps for classification. In contrast to Darknet-19, Darknet-53 consists of residual blocks (short connections) along with upsampling layers for concatenation and added depth to the network. YOLO v3 generates three predictions for each spatial location at different scales in an image, which eliminates the problem of not being able to detect small objects efficiently [77]. Each prediction is monitored by computing objectness, boundary box regressor and classification scores. In Fig. 5, a schematic description of the YOLO v3 architecture is presented.
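As a usage illustration (not the training pipeline of Section V), the following sketch runs a pretrained YOLO v3 network with OpenCV's dnn module and keeps only "person" detections; the file paths, the COCO class ordering (class 0 = person) and the thresholds are assumptions for the example.

```python
import cv2
import numpy as np

# Paths are illustrative; yolov3.cfg/yolov3.weights are the standard Darknet release files.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

frame = cv2.imread("frame.jpg")
h, w = frame.shape[:2]

# YOLO v3 expects a square input (416 x 416 here), scaled to [0, 1], in RGB order.
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())  # predictions at 3 scales

boxes, scores = [], []
for out in outputs:
    for det in out:  # det = [cx, cy, bw, bh, objectness, per-class scores...]
        person_score = det[5]  # class 0 = "person" in the COCO label ordering
        if person_score > 0.5:
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(float(person_score))

# NMS with the IoU threshold of Table II keeps a single box per person.
keep = cv2.dnn.NMSBoxes(boxes, scores, score_threshold=0.5, nms_threshold=0.7)
```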

1) Loss function: The overall loss function of YOLO v3 consists of the localization loss (bounding box regressor), the cross entropy, and the confidence loss for the classification score, defined as follows:

$\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[ (t_x - \hat{t}_x)^2 + (t_y - \hat{t}_y)^2 + (t_w - \hat{t}_w)^2 + (t_h - \hat{t}_h)^2 \right]$

$+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} \left[ -\log(\sigma(t_o)) + \sum_{k=1}^{C} BCE(\hat{y}_k, \sigma(s_k)) \right]$

$+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{noobj} \left[ -\log(1 - \sigma(t_o)) \right]$    (8)

where $\lambda_{coord}$ indicates the weight of the coordinate error, $S^2$ indicates the number of grid cells in the image, and B is the number of generated bounding boxes per grid cell. $1_{ij}^{obj} = 1$ indicates that the object is confined in the jth bounding box of grid cell i, and it is 0 otherwise.
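The three terms of Eq. 8 can be traced with a toy NumPy example: a box responsible for an object contributes the coordinate and objectness/class terms, while a background box contributes only the no-object term. All numeric values, including the loss weights, are illustrative assumptions.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
bce = lambda y, p: -(y * np.log(p) + (1 - y) * np.log(1 - p))

lambda_coord, lambda_noobj = 5.0, 0.5  # illustrative weights

# Coordinate term: predicted vs. target offsets (t_x, t_y, t_w, t_h) of one box.
t_pred = np.array([0.4, 0.6, -0.2, 0.1])
t_true = np.array([0.5, 0.5, 0.0, 0.0])
coord_loss = lambda_coord * np.sum((t_true - t_pred) ** 2)

# Objectness and (multi-label) class terms for the same box, with C = 1 ("person").
t_o, s_person, y_person = 2.0, 1.5, 1.0
obj_loss = -np.log(sigmoid(t_o)) + bce(y_person, sigmoid(s_person))

# A background box contributes only the weighted no-object term.
noobj_loss = lambda_noobj * -np.log(1.0 - sigmoid(-3.0))

print(coord_loss + obj_loss + noobj_loss)
```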

E. Deepsort

Deepsort is a deep learning based approach to track custom objects in a video [78]. In the present research, Deepsort is utilized to track the individuals present in the surveillance footage. It makes use of patterns learned from the detected objects in the images, which are later combined with the temporal information for predicting the associated trajectories of the objects of interest. It keeps track of each object under consideration by mapping unique identifiers for further statistical analysis. Deepsort is also useful in handling associated challenges such as occlusion, multiple viewpoints, non-stationary cameras and annotating training data. For effective tracking, the Kalman filter and the Hungarian algorithm are used. The Kalman filter is used recursively for better association, and it can predict future positions based on the current position. The Hungarian algorithm is used for association and ID attribution, identifying whether an object in the current frame is the same as the one in the previous frame. Initially, a faster RCNN is trained for person identification, and for tracking, a linear constant velocity model [79] is utilized to describe each target with an eight dimensional space as follows:

$x = [u, v, \lambda, h, \dot{x}, \dot{y}, \dot{\lambda}, \dot{h}]^T$    (9)

where (u, v) is the centroid of the bounding box, λ is the aspect ratio, and h is the height of the bounding box. The other variables are the respective velocities of these quantities. Later, the standard Kalman filter with a constant velocity motion and linear observation model is used, where the bounding coordinates (u, v, λ, h) are taken as direct observations of the object state.

For each track k, the number of frames since the last successful measurement association $a_k$ is counted. This counter is incremented with every Kalman filter prediction, and it resets to 0 when the track gets associated with a measurement. Furthermore, if an identified track exceeds a predefined maximum age, the object is considered to have left the scene and the corresponding track is removed from the track set. If some detected objects cannot be mapped to the existing tracks, new track hypotheses are initiated for each such novel detection. For the first three frames, the new tracks are classified as indefinite until a successful measurement mapping is computed; if a track is not successfully mapped with a measurement, it is deleted from the track set. The Hungarian algorithm is then utilized to solve the mapping problem between the newly arrived measurements and the predicted Kalman states, considering the motion and appearance information with the help of the Mahalanobis distance computed between them, as defined in Eq. 10:

$d^{(1)}(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i)$    (10)

where the projection of the ith track distribution into measurement space is represented by $(y_i, S_i)$ and the jth bounding box detection by $d_j$. The Mahalanobis distance accounts for this uncertainty by estimating how many standard deviations the detection is away from the mean track location. Further,


using this metric, it is possible to exclude unlikely associations by thresholding the Mahalanobis distance. This decision is denoted with an indicator that evaluates to 1 if the association between the ith track and the jth detection is admissible (Eq. 11):

$b_{ij}^{(1)} = \mathbb{1}[d^{(1)}(i, j) < t^{(1)}]$    (11)

Though the Mahalanobis distance performs efficiently, it fails in environments where camera motion is possible; thereby another metric is introduced for the assignment problem. This second metric measures the smallest cosine distance between the ith track and the jth detection in appearance space, as follows:

$d^{(2)}(i, j) = \min\{1 - r_j^T r_k^{(i)} \mid r_k^{(i)} \in \mathcal{R}_i\}$    (12)

Again, a binary variable is introduced to indicate if an association is admissible according to this metric:

$b_{ij}^{(2)} = \mathbb{1}[d^{(2)}(i, j) < t^{(2)}]$    (13)

and a suitable threshold is measured for this indicator on a separate training dataset. To build the association problem, both metrics are combined using a weighted sum:

$c_{ij} = \lambda \, d^{(1)}(i, j) + (1 - \lambda) \, d^{(2)}(i, j)$    (14)

where an association is admissible if it is within the gating region of both metrics:

$b_{ij} = \prod_{m=1}^{2} b_{ij}^{(m)}$    (15)

The influence of each metric on the combined association cost can be controlled through the hyperparameter λ.
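The gated association cost of Eqs. 10–15 can be sketched as follows with NumPy and SciPy's Hungarian solver; the track/detection data layout, the λ value and the gating thresholds (9.4877 is the 95% chi-square quantile for four degrees of freedom) are assumptions for the example, not the exact configuration used in this work.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mahalanobis_sq(d, y, S):
    """Squared Mahalanobis distance of Eq. 10 between detection d and track (y, S)."""
    diff = d - y
    return float(diff @ np.linalg.inv(S) @ diff)

def cosine_distance(r, gallery):
    """Smallest cosine distance of Eq. 12 over a track's gallery of unit appearance vectors."""
    return float(np.min(1.0 - gallery @ r))

def association_costs(tracks, detections, lam=0.5, t1=9.4877, t2=0.2):
    """Combined cost of Eq. 14, gated as in Eqs. 11, 13 and 15.
    tracks: list of (y, S, gallery); detections: list of (d, r)."""
    cost = np.full((len(tracks), len(detections)), 1e5)  # large cost = inadmissible pair
    for i, (y, S, gallery) in enumerate(tracks):
        for j, (d, r) in enumerate(detections):
            d1 = mahalanobis_sq(d, y, S)
            d2 = cosine_distance(r, gallery)
            if d1 < t1 and d2 < t2:  # b_ij = 1 only when both gates pass
                cost[i, j] = lam * d1 + (1 - lam) * d2
    return cost

# The Hungarian algorithm then maps tracks to detections with minimal total cost:
# rows, cols = linear_sum_assignment(association_costs(tracks, detections))
```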

IV. PROPOSED APPROACH

The emergence of deep learning has brought the best performing techniques for a wide variety of tasks and challenges, including medical diagnosis [74], machine translation [75], speech recognition [76] and a lot more [80]. Most of these tasks are centred around object classification, detection, segmentation, tracking and recognition [81], [82]. In recent years, convolution neural network (CNN) based architectures have shown significant performance improvements that are leading towards high quality object detection, as shown in Fig. 2, which presents the performance of such models in terms of mAP and FPS on the standard benchmark datasets PASCAL-VOC [66] and MS-COCO [67], with similar hardware resources.

In the present article, a deep learning based framework is proposed that utilizes object detection and tracking models to aid in the social distancing remedy for dealing with the escalation of COVID-19 cases. In order to maintain the balance of speed and accuracy, YOLO v3 [65] alongside Deepsort [78] are utilized as the object detection and tracking approaches, while surrounding each detected object with bounding boxes. Later, these bounding boxes are utilized to compute the pairwise L2 norm with a computationally efficient vectorized representation for identifying the clusters of people not obeying the order of social distancing. Furthermore, to visualize the clusters in the live stream, each bounding box is color-coded based on its association with the group, where people belonging to the same group are represented with the same color. Each surveillance frame is also accompanied with a streamline plot depicting the statistical count of the number of social groups and an index term (violation index) representing the ratio of the number of people to the number of groups. Furthermore, the estimated violations can be computed by multiplying the violation index with the total number of social groups.

A. Workflow

This section includes the necessary steps undertaken to compose a framework for monitoring social distancing (a vectorized sketch of steps 3–7 is given after the list).

1. Fine-tune the trained object detection model to identify and track people in a footage.

2. The trained model is fed with the surveillance footage. The model generates a set of bounding boxes and an ID for each identified person.

3. Each individual is associated with a three-dimensional feature space (x, y, d), where (x, y) corresponds to the centroid coordinates of the bounding box and d defines the depth of the individual as observed from the camera:

$d = \frac{2 \times 3.14 \times 180}{(w + h \times 360)} \times 1000 + 3$    (16)

where w is the width of the bounding box and h is the height of the bounding box [83].

4. For the set of bounding boxes, the pairwise L2 norm is computed as given by the following equation:

$\|D\|_2 = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$    (17)

where in this work n = 3.

5. The dense matrix of the L2 norm is then utilized to assign neighbors to each individual that satisfy the closeness sensitivity. With extensive trials, the closeness threshold is updated dynamically based on the spatial location of the person in a given frame, ranging between (90, 170) pixels.

6. Any individual that meets the closeness property is assigned a neighbour or neighbours, forming a group represented in a different color coding in contrast to other people.

7. The formation of groups indicates the violation of the practice of social distancing, which is quantified with the help of the following:

– Consider ng as the number of groups or clusters identified and np as the total number of people found in close proximity.
– vi = np / ng, where vi is the violation index.
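A minimal NumPy sketch of steps 3–7 follows, under two stated simplifications: the closeness threshold is fixed rather than dynamically updated per spatial location, and grouping is done with a plain union-find over the closeness graph; all names and values are illustrative.

```python
import numpy as np

def depth(w, h):
    """Depth estimate of Eq. 16 from bounding box width w and height h [83]."""
    return (2 * 3.14 * 180) / (w + h * 360) * 1000 + 3

def pairwise_l2(features):
    """Dense pairwise L2-norm matrix (Eq. 17) over the (x, y, d) feature space,
    vectorized via broadcasting instead of an explicit double loop."""
    diff = features[:, None, :] - features[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def violation_index(features, threshold=130):
    """Return (n_groups, n_people_in_proximity, violation index vi = np / ng)."""
    dist = pairwise_l2(features)
    close = (dist < threshold) & ~np.eye(len(features), dtype=bool)
    parent = list(range(len(features)))  # union-find over the closeness graph
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    for i, j in zip(*np.nonzero(close)):
        parent[find(i)] = find(j)
    members = [i for i in range(len(features)) if close[i].any()]
    ng = len({find(i) for i in members})
    return ng, len(members), (len(members) / ng if ng else 0.0)

# Example: three people, the first two standing close together.
pts = np.array([[100.0, 200.0, 8.0], [140.0, 210.0, 8.5], [600.0, 400.0, 9.0]])
print(violation_index(pts))  # (1, 2, 2.0)
```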

V. EXPERIMENTS AND RESULTS

The above discussed object detection models are fine-tuned for binary classification (person or not a person), with Inception v2 as the backbone network, on an Nvidia GTX 1060 GPU, using the dataset acquired from the open image dataset (OID)


Fig. 6: Data samples showing (a) true samples and (b) false samples of the "Person" class from the open image dataset.

Fig. 7: Losses per iteration of the object detection models during the training phase on the OID validation set for detecting a person in an image.

repository [73], maintained by the Google open source community. The diverse images with the class label "Person" are downloaded via the OIDv4 toolkit [84], along with the annotations. Fig. 6 shows sample images of the obtained dataset, consisting of 800 images, which is obtained by manually filtering to only contain the true samples. The dataset is then divided into training and testing sets in an 8:2 ratio. In order to make the testing robust, the testing set is also accompanied by frames of the surveillance footage of the Oxford town center [23]. Later, this footage is also utilized to simulate the overall approach for monitoring social distancing. In the case of faster RCNN, the images are resized to P pixels on the shorter edge, with 600 and 1024 for low and high resolution, while in SSD and YOLO the images are scaled to the fixed dimension P × P with a P value of 416. During the training phase, the performance of the models is continuously monitored using the mAP along with the localization, classification and overall loss in the detection of the person, as indicated in Fig. 7. Table III summarizes the results of each model obtained at the end of the training phase, with the training time (TT), number of iterations (NoI), mAP and total loss (TL) value. It is observed that the faster RCNN model achieved minimal loss with maximum mAP; however, it has the lowest FPS, which makes it not suitable for real-time applications. Furthermore, as compared to SSD, YOLO v3 achieved better results with a balanced mAP, training time and FPS score. The trained YOLO v3 model is then utilized for monitoring social distancing on the surveillance video.

TABLE III: Performance comparison of the object detection models.

Model         TT (in sec)   NoI     mAP     TL     FPS
Faster RCNN   9651          12135   0.969   0.02   3
SSD           2124          1200    0.691   0.22   10
YOLO v3       5659          7560    0.846   0.87   23

Fig. 8: Sample output of the proposed framework for monitoring social distancing on the surveillance footage of Oxford Town Center.

VI. OUTPUT

The proposed framework outputs (as shown in Fig. 8) the processed frame with the identified people confined in bounding boxes, while also simulating the statistical analysis showing the total number of social groups displayed by the same color encoding, and a violation index term computed as the ratio of the number of people to the number of groups. The frames shown in Fig. 8 display violation index values of 3, 2, 2 and 2.33. The frames with detected violations are recorded with the timestamp for future analysis.

VII. FUTURE SCOPE AND CHALLENGES

Since this application is intended to be used in any working environment, accuracy and precision are highly desired to serve the purpose. A higher number of false positives may raise discomfort and panic among the people being observed. There may also be genuinely raised concerns about privacy and individual rights, which can be addressed with some additional measures such as prior consent for such working environments, hiding a person's identity in general, and maintaining transparency about its fair use within limited stakeholders.

VIII. CONCLUSION

The article proposes an efficient real-time deep learning based framework to automate the process of monitoring social distancing via object detection and tracking approaches, where each individual is identified in real-time with the


help of bounding boxes. The generated bounding boxes aid in identifying the clusters or groups of people satisfying the closeness property, computed with the help of the pairwise vectorized approach. The number of violations is confirmed by computing the number of groups formed and the violation index term, computed as the ratio of the number of people to the number of groups. Extensive trials were conducted with the popular state-of-the-art object detection models faster RCNN, SSD and YOLO v3, where YOLO v3 illustrated efficient performance with a balanced FPS and mAP score. Since this approach is highly sensitive to the spatial location of the camera, the same approach can be fine-tuned to better adjust to the corresponding field of view.

ACKNOWLEDGMENT

The authors gratefully acknowledge the helpful comments and suggestions of colleagues. The authors are also indebted to the Interdisciplinary Cyber Physical Systems (ICPS) Programme, Department of Science and Technology (DST), Government of India (GoI), vide Reference No. 244, for the financial support to carry out the background research, which helped significantly in the implementation of the present research work.

REFERENCES

[1] W. H. Organization, "WHO corona-viruses (COVID-19)," https://www.who.int/emergencies/diseases/novel-corona-virus-2019, 2020, [Online; accessed May 02, 2020].
[2] WHO, "WHO Director-General's opening remarks at the media briefing on COVID-19 - 11 March 2020," https://www.who.int/dg/speeches/detail, 2020, [Online; accessed March 12, 2020].
[3] L. Hensley, "Social distancing is out, physical distancing is in—here's how to do it," Global News–Canada (27 March 2020), 2020.
[4] ECDPC, "Considerations relating to social distancing measures in response to COVID-19 – second update," https://www.ecdc.europa.eu/en/publications-data/considerations, 2020, [Online; accessed March 23, 2020].
[5] M. W. Fong, H. Gao, J. Y. Wong, J. Xiao, E. Y. Shiu, S. Ryu, and B. J. Cowling, "Nonpharmaceutical measures for pandemic influenza in nonhealthcare settings—social distancing measures," 2020.
[6] F. Ahmed, N. Zviedrite, and A. Uzicanin, "Effectiveness of workplace social distancing measures in reducing influenza transmission: a systematic review," BMC Public Health, vol. 18, no. 1, p. 518, 2018.
[7] W. O. Kermack and A. G. McKendrick, "Contributions to the mathematical theory of epidemics–I. 1927," 1991.
[8] C. Eksin, K. Paarporn, and J. S. Weitz, "Systematic biases in disease forecasting–the role of behavior change," Epidemics, vol. 27, pp. 96–105, 2019.
[9] M. Zhao and H. Zhao, "Asymptotic behavior of global positive solution to a stochastic SIR model incorporating media coverage," Advances in Difference Equations, vol. 2016, no. 1, pp. 1–17, 2016.
[10] P. Alto, "Landing AI named an April 2020 Cool Vendor in the Gartner Cool Vendors in AI core technologies," https://www.yahoo.com/lifestyle/landing-ai-named-april-2020-152100532.html, 2020, [Online; accessed April 21, 2020].
[11] A. Y. Ng, "Curriculum vitae," https://ai.stanford.edu/~ang/curriculum-vitae.pdf.
[12] L. AI, "Landing AI named an April 2020 Cool Vendor in the Gartner Cool Vendors in AI core technologies," https://www.prnewswire.com/news-releases, 2020, [Online; accessed April 22, 2020].
[13] B. News, "China coronavirus: Lockdown measures rise across Hubei province," https://www.bbc.co.uk/news/world-asia-china-51217455, 2020, [Online; accessed January 23, 2020].
[14] N. H. C. of the People's Republic of China, "Daily briefing on novel coronavirus cases in China," http://en.nhc.gov.cn/2020-03/20/c_78006.htm, 2020, [Online; accessed March 20, 2020].
[15] K. Prem, Y. Liu, T. W. Russell, A. J. Kucharski, R. M. Eggo, N. Davies, S. Flasche, S. Clifford, C. A. Pearson, J. D. Munday et al., "The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study," The Lancet Public Health, 2020.
[16] C. Adolph, K. Amano, B. Bang-Jensen, N. Fullman, and J. Wilkerson, "Pandemic politics: Timing state-level social distancing responses to COVID-19," medRxiv, 2020.
[17] K. E. Ainslie, C. E. Walters, H. Fu, S. Bhatia, H. Wang, X. Xi, M. Baguelin, S. Bhatt, A. Boonyasiri, O. Boyd et al., "Evidence of initial success for China exiting COVID-19 social distancing policy after achieving containment," Wellcome Open Research, vol. 5, no. 81, p. 81, 2020.
[18] S. K. Sonbhadra, S. Agarwal, and P. Nagabhushan, "Target specific mining of COVID-19 scholarly articles using one-class approach," 2020.
[19] N. S. Punn and S. Agarwal, "Automated diagnosis of COVID-19 with limited posteroanterior chest X-ray images using fine-tuned deep neural networks," 2020.
[20] N. S. Punn, S. K. Sonbhadra, and S. Agarwal, "COVID-19 epidemic analysis using machine learning and deep learning algorithms," medRxiv, 2020. [Online]. Available: https://www.medrxiv.org/content/early/2020/04/11/2020.04.08.20057679
[21] O. website of Indian Government, "Distribution of the novel coronavirus-infected pneumonia: Aarogya Setu mobile app," https://www.mygov.in/aarogya-setu-app, 2020.
[22] M. Robakowska, A. Tyranska-Fobke, J. Nowak, D. Slezak, P. Zuratynski, P. Robakowski, K. Nadolny, and J. R. Ładny, "The use of drones during mass events," Disaster and Emergency Medicine Journal, vol. 2, no. 3, pp. 129–134, 2017.
[23] J. Harvey and A. LaPlace. (2019) Megapixels.cc: Origins, ethics, and privacy implications of publicly available face recognition image datasets. [Online]. Available: https://megapixels.cc
[24] N. Sulman, T. Sanocki, D. Goldgof, and R. Kasturi, "How effective is human video surveillance performance?" in 2008 19th International Conference on Pattern Recognition. IEEE, 2008, pp. 1–3.
[25] X. Wang, "Intelligent multi-camera video surveillance: A review," Pattern Recognition Letters, vol. 34, no. 1, pp. 3–19, 2013.
[26] K. A. Joshi and D. G. Thakore, "A survey on moving object detection and tracking in video surveillance system," International Journal of Soft Computing and Engineering, vol. 2, no. 3, pp. 44–48, 2012.
[27] O. Javed and M. Shah, "Tracking and object classification for automated surveillance," in European Conference on Computer Vision. Springer, 2002, pp. 343–357.
[28] S. Brutzer, B. Hoferlin, and G. Heidemann, "Evaluation of background subtraction techniques for video surveillance," in CVPR 2011. IEEE, 2011, pp. 1937–1944.
[29] S. Aslani and H. Mahdavi-Nasab, "Optical flow based moving object detection and tracking for traffic surveillance," International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering, vol. 7, no. 9, pp. 1252–1256, 2013.
[30] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance. IEEE, 2005, pp. 65–72.
[31] M. Piccardi, "Background subtraction techniques: a review," in 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), vol. 4. IEEE, 2004, pp. 3099–3104.
[32] Y. Xu, J. Dong, B. Zhang, and D. Xu, "Background modeling methods in video analysis: A review and comparative evaluation," CAAI Transactions on Intelligence Technology, vol. 1, no. 1, pp. 43–60, 2016.
[33] H. Tsutsui, J. Miura, and Y. Shirai, "Optical flow-based person tracking by multiple cameras," in Conference Documentation, International Conference on Multisensor Fusion and Integration for Intelligent Systems, MFI 2001 (Cat. No. 01TH8590). IEEE, 2001, pp. 91–96.
[34] A. Agarwal, S. Gupta, and D. K. Singh, "Review of optical flow technique for moving object detection," in 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I). IEEE, 2016, pp. 409–413.
[35] S. A. Niyogi and E. H. Adelson, "Analyzing gait with spatiotemporal surfaces," in Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects. IEEE, 1994, pp. 64–69.
[36] Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[38] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[39] X. Chen and A. Gupta, "An implementation of faster RCNN with study for region sampling," arXiv preprint arXiv:1702.02138, 2017.
[40] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[41] M. Putra, Z. Yussof, K. Lim, and S. Salim, "Convolutional neural network for person and car detection using YOLO framework," Journal of Telecommunication, Electronic and Computer Engineering (JTEC), vol. 10, no. 1-7, pp. 67–71, 2018.
[42] R. Eshel and Y. Moses, "Homography based multiple camera detection and tracking of people in a dense crowd," in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[43] D.-Y. Chen, C.-W. Su, Y.-C. Zeng, S.-W. Sun, W.-R. Lai, and H.-Y. M. Liao, "An online people counting system for electronic advertising machines," in 2009 IEEE International Conference on Multimedia and Expo. IEEE, 2009, pp. 1262–1265.
[44] C.-W. Su, H.-Y. M. Liao, and H.-R. Tyan, "A vision-based people counting approach based on the symmetry measure," in 2009 IEEE International Symposium on Circuits and Systems. IEEE, 2009, pp. 2617–2620.
[45] J. Yao and J.-M. Odobez, "Fast human detection from joint appearance and foreground feature subset covariances," Computer Vision and Image Understanding, vol. 115, no. 10, pp. 1414–1426, 2011.
[46] B. Wu and R. Nevatia, "Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors," International Journal of Computer Vision, vol. 75, no. 2, pp. 247–266, 2007.
[47] F. Z. Eishita, A. Rahman, S. A. Azad, and A. Rahman, "Occlusion handling in object detection," in Multidisciplinary Computational Intelligence Techniques: Applications in Business, Engineering and Medicine. IGI Global, 2012, pp. 61–74.
[48] M. Singh, A. Basu, and M. K. Mandal, "Human activity recognition based on silhouette directionality," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 9, pp. 1280–1292, 2008.
[49] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886–893.
[50] P. Huang, A. Hilton, and J. Starck, "Shape similarity for 3D video sequences of people," International Journal of Computer Vision, vol. 89, no. 2-3, pp. 362–381, 2010.
[51] A. Samal and P. A. Iyengar, "Automatic recognition and analysis of human faces and facial expressions: A survey," Pattern Recognition, vol. 25, no. 1, pp. 65–77, 1992.
[52] D. Cunado, M. S. Nixon, and J. N. Carter, "Using gait as a biometric, via phase-weighted magnitude spectra," in International Conference on Audio- and Video-Based Biometric Person Authentication. Springer, 1997, pp. 93–102.
[53] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian detection in crowded scenes," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 878–885.
[54] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[55] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Computing Surveys (CSUR), vol. 38, no. 4, pp. 13–es, 2006.
[56] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., vol. 3. IEEE, 2004, pp. 32–36.
[57] D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 249–257, 2006.
[58] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, vol. 2. IEEE, 2005, pp. 1395–1402.
[59] O. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, "The Oxford-IIIT pet dataset," 2012.
[60] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., "The open images dataset v4," International Journal of Computer Vision, pp. 1–26, 2020.
[61] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[62] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[63] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[64] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[65] J. R. A. Farhadi and J. Redmon, "YOLOv3: An incremental improvement," Retrieved September, vol. 17, p. 2018, 2018.
[66] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[67] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[68] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[69] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[70] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[71] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[72] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[73] Google, "Open image dataset v6," https://storage.googleapis.com/openimages/web/index.html, 2020, [Online; accessed 25-February-2020].
[74] N. S. Punn and S. Agarwal, "Inception U-Net architecture for semantic segmentation to identify nuclei in microscopy cell images," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 1, pp. 1–15, 2020.
[75] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar et al., "Tensor2Tensor for neural machine translation," arXiv preprint arXiv:1803.07416, 2018.
[76] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.
[77] A. Sonawane, "Medium: YOLOv3: A huge improvement," https://medium.com/@anand_sonawane/yolo3, 2020, [Online; accessed Dec 6, 2019].
[78] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 3645–3649.
[79] N. Wojke and A. Bewley, "Deep cosine metric learning for person re-identification," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 748–756.
[80] S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. P. Reyes, M.-L. Shyu, S.-C. Chen, and S. Iyengar, "A survey on deep learning: Algorithms, techniques, and applications," ACM Computing Surveys (CSUR), vol. 51, no. 5, pp. 1–36, 2018.
[81] A. Brunetti, D. Buongiorno, G. F. Trotta, and V. Bevilacqua, "Computer vision and deep learning techniques for pedestrian detection and tracking: A survey," Neurocomputing, vol. 300, pp. 17–33, 2018.
[82] N. S. Punn and S. Agarwal, "Crowd analysis for congestion control early warning system on foot over bridge," in 2019 Twelfth International Conference on Contemporary Computing (IC3). IEEE, 2019, pp. 1–6.
[83] Pias, "Object detection and distance measurement," https://github.com/paul-pias/Object-Detection-and-Distance-Measurement, 2020, [Online; accessed 01-March-2020].
[84] J. Harvey and A. LaPlace. (2019) Megapixels: Origins, ethics, and privacy implications of publicly available face recognition image datasets. [Online]. Available: https://megapixels.cc



[27] O Javed and M Shah ldquoTracking and object classification for automatedsurveillancerdquo in European Conference on Computer Vision Springer2002 pp 343ndash357

[28] S Brutzer B Hoferlin and G Heidemann ldquoEvaluation of backgroundsubtraction techniques for video surveillancerdquo in CVPR 2011 IEEE2011 pp 1937ndash1944

[29] S Aslani and H Mahdavi-Nasab ldquoOptical flow based moving objectdetection and tracking for traffic surveillancerdquo International Journal ofElectrical Computer Energetic Electronic and Communication Engi-neering vol 7 no 9 pp 1252ndash1256 2013

[30] P Dollar V Rabaud G Cottrell and S Belongie ldquoBehavior recognitionvia sparse spatio-temporal featuresrdquo in 2005 IEEE International Work-shop on Visual Surveillance and Performance Evaluation of Trackingand Surveillance IEEE 2005 pp 65ndash72

[31] M Piccardi ldquoBackground subtraction techniques a reviewrdquo in 2004IEEE International Conference on Systems Man and Cybernetics (IEEECat No 04CH37583) vol 4 IEEE 2004 pp 3099ndash3104

[32] Y Xu J Dong B Zhang and D Xu ldquoBackground modeling methodsin video analysis A review and comparative evaluationrdquo CAAI Trans-actions on Intelligence Technology vol 1 no 1 pp 43ndash60 2016

[33] H Tsutsui J Miura and Y Shirai ldquoOptical flow-based person trackingby multiple camerasrdquo in Conference Documentation International Con-ference on Multisensor Fusion and Integration for Intelligent SystemsMFI 2001 (Cat No 01TH8590) IEEE 2001 pp 91ndash96

[34] A Agarwal S Gupta and D K Singh ldquoReview of optical flowtechnique for moving object detectionrdquo in 2016 2nd InternationalConference on Contemporary Computing and Informatics (IC3I) IEEE2016 pp 409ndash413

[35] S A Niyogi and E H Adelson ldquoAnalyzing gait with spatiotemporalsurfacesrdquo in Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects IEEE 1994 pp 64ndash69

[36] Z-Q Zhao P Zheng S-t Xu and X Wu ldquoObject detection with deeplearning A reviewrdquo IEEE transactions on neural networks and learningsystems vol 30 no 11 pp 3212ndash3232 2019

[37] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neural infor-mation processing systems 2012 pp 1097ndash1105

10

[38] S Ren K He R Girshick and J Sun ldquoFaster r-cnn Towards real-timeobject detection with region proposal networksrdquo in Advances in neuralinformation processing systems 2015 pp 91ndash99

[39] X Chen and A Gupta ldquoAn implementation of faster rcnn with studyfor region samplingrdquo arXiv preprint arXiv170202138 2017

[40] J Redmon S Divvala R Girshick and A Farhadi ldquoYou only lookonce Unified real-time object detectionrdquo in Proceedings of the IEEEconference on computer vision and pattern recognition 2016 pp 779ndash788

[41] M Putra Z Yussof K Lim and S Salim ldquoConvolutional neuralnetwork for person and car detection using yolo frameworkrdquo Journalof Telecommunication Electronic and Computer Engineering (JTEC)vol 10 no 1-7 pp 67ndash71 2018

[42] R Eshel and Y Moses ldquoHomography based multiple camera detectionand tracking of people in a dense crowdrdquo in 2008 IEEE Conference onComputer Vision and Pattern Recognition IEEE 2008 pp 1ndash8

[43] D-Y Chen C-W Su Y-C Zeng S-W Sun W-R Lai and H-Y M Liao ldquoAn online people counting system for electronic advertisingmachinesrdquo in 2009 IEEE International Conference on Multimedia andExpo IEEE 2009 pp 1262ndash1265

[44] C-W Su H-Y M Liao and H-R Tyan ldquoA vision-based peoplecounting approach based on the symmetry measurerdquo in 2009 IEEEInternational Symposium on Circuits and Systems IEEE 2009 pp2617ndash2620

[45] J Yao and J-M Odobez ldquoFast human detection from joint appearanceand foreground feature subset covariancesrdquo Computer Vision and ImageUnderstanding vol 115 no 10 pp 1414ndash1426 2011

[46] B Wu and R Nevatia ldquoDetection and tracking of multiple partiallyoccluded humans by bayesian combination of edgelet based part de-tectorsrdquo International Journal of Computer Vision vol 75 no 2 pp247ndash266 2007

[47] F Z Eishita A Rahman S A Azad and A Rahman ldquoOcclusionhandling in object detectionrdquo in Multidisciplinary Computational Intelli-gence Techniques Applications in Business Engineering and MedicineIGI Global 2012 pp 61ndash74

[48] M Singh A Basu and M K Mandal ldquoHuman activity recognitionbased on silhouette directionalityrdquo IEEE transactions on circuits andsystems for video technology vol 18 no 9 pp 1280ndash1292 2008

[49] N Dalal and B Triggs ldquoHistograms of oriented gradients for humandetectionrdquo in 2005 IEEE computer society conference on computervision and pattern recognition (CVPRrsquo05) vol 1 IEEE 2005 pp886ndash893

[50] P Huang A Hilton and J Starck ldquoShape similarity for 3d videosequences of peoplerdquo International Journal of Computer Vision vol 89no 2-3 pp 362ndash381 2010

[51] A Samal and P A Iyengar ldquoAutomatic recognition and analysis ofhuman faces and facial expressions A surveyrdquo Pattern recognitionvol 25 no 1 pp 65ndash77 1992

[52] D Cunado M S Nixon and J N Carter ldquoUsing gait as a biometricvia phase-weighted magnitude spectrardquo in International Conference onAudio-and Video-Based Biometric Person Authentication Springer1997 pp 93ndash102

[53] B Leibe E Seemann and B Schiele ldquoPedestrian detection in crowdedscenesrdquo in 2005 IEEE Computer Society Conference on Computer Visionand Pattern Recognition (CVPRrsquo05) vol 1 IEEE 2005 pp 878ndash885

[54] M Andriluka S Roth and B Schiele ldquoPeople-tracking-by-detectionand people-detection-by-trackingrdquo in 2008 IEEE Conference on com-puter vision and pattern recognition IEEE 2008 pp 1ndash8

[55] A Yilmaz O Javed and M Shah ldquoObject tracking A surveyrdquo Acmcomputing surveys (CSUR) vol 38 no 4 pp 13ndashes 2006

[56] C Schuldt I Laptev and B Caputo ldquoRecognizing human actions alocal svm approachrdquo in Proceedings of the 17th International Confer-ence on Pattern Recognition 2004 ICPR 2004 vol 3 IEEE 2004pp 32ndash36

[57] D Weinland R Ronfard and E Boyer ldquoFree viewpoint action recog-nition using motion history volumesrdquo Computer vision and imageunderstanding vol 104 no 2-3 pp 249ndash257 2006

[58] M Blank L Gorelick E Shechtman M Irani and R Basri ldquoActionsas space-time shapesrdquo in Tenth IEEE International Conference onComputer Vision (ICCVrsquo05) Volume 1 vol 2 IEEE 2005 pp 1395ndash1402

[59] O Parkhi A Vedaldi A Zisserman and C Jawahar ldquoThe oxford-iiitpet datasetrdquo 2012

[60] A Kuznetsova H Rom N Alldrin J Uijlings I Krasin J Pont-TusetS Kamali S Popov M Malloci A Kolesnikov et al ldquoThe open imagesdataset v4rdquo International Journal of Computer Vision pp 1ndash26 2020

[61] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic segmentationrdquoin Proceedings of the IEEE conference on computer vision and patternrecognition 2014 pp 580ndash587

[62] R Girshick ldquoFast r-cnnrdquo in Proceedings of the IEEE internationalconference on computer vision 2015 pp 1440ndash1448

[63] W Liu D Anguelov D Erhan C Szegedy S Reed C-Y Fu and A CBerg ldquoSsd Single shot multibox detectorrdquo in European conference oncomputer vision Springer 2016 pp 21ndash37

[64] J Redmon and A Farhadi ldquoYolo9000 better faster strongerrdquo inProceedings of the IEEE conference on computer vision and patternrecognition 2017 pp 7263ndash7271

[65] J R A Farhadi and J Redmon ldquoYolov3 An incremental improvementrdquoRetrieved September vol 17 p 2018 2018

[66] M Everingham L Van Gool C K Williams J Winn and A Zisser-man ldquoThe pascal visual object classes (voc) challengerdquo Internationaljournal of computer vision vol 88 no 2 pp 303ndash338 2010

[67] T-Y Lin M Maire S Belongie J Hays P Perona D RamananP Dollar and C L Zitnick ldquoMicrosoft coco Common objects incontextrdquo in European conference on computer vision Springer 2014pp 740ndash755

[68] K Simonyan and A Zisserman ldquoVery deep convolutional networks forlarge-scale image recognitionrdquo arXiv preprint arXiv14091556 2014

[69] K He X Zhang S Ren and J Sun ldquoDeep residual learning for imagerecognitionrdquo in Proceedings of the IEEE conference on computer visionand pattern recognition 2016 pp 770ndash778

[70] C Szegedy V Vanhoucke S Ioffe J Shlens and Z Wojna ldquoRethinkingthe inception architecture for computer visionrdquo in Proceedings of theIEEE conference on computer vision and pattern recognition 2016 pp2818ndash2826

[71] O Russakovsky J Deng H Su J Krause S Satheesh S MaZ Huang A Karpathy A Khosla M Bernstein et al ldquoImagenet largescale visual recognition challengerdquo International journal of computervision vol 115 no 3 pp 211ndash252 2015

[72] C Szegedy S Ioffe V Vanhoucke and A A Alemi ldquoInception-v4inception-resnet and the impact of residual connections on learningrdquo inThirty-first AAAI conference on artificial intelligence 2017

[73] Google ldquoOpen image dataset v6rdquo httpsstoragegoogleapiscomopenimageswebindexhtml 2020 [Online accessed 25-February-2020]

[74] N S Punn and S Agarwal ldquoInception u-net architecture for semanticsegmentation to identify nuclei in microscopy cell imagesrdquo ACM Trans-actions on Multimedia Computing Communications and Applications(TOMM) vol 16 no 1 pp 1ndash15 2020

[75] A Vaswani S Bengio E Brevdo F Chollet A N Gomez S GouwsL Jones Ł Kaiser N Kalchbrenner N Parmar et al ldquoTensor2tensorfor neural machine translationrdquo arXiv preprint arXiv180307416 2018

[76] D Amodei S Ananthanarayanan R Anubhai J Bai E BattenbergC Case J Casper B Catanzaro Q Cheng G Chen et al ldquoDeepspeech 2 End-to-end speech recognition in english and mandarinrdquo inInternational conference on machine learning 2016 pp 173ndash182

[77] A Sonawane ldquoMedium YOLOv3 A Huge Improvementrdquo httpsmediumcomanand sonawaneyolo3 2020 [Online accessed Dec6 2019]

[78] N Wojke A Bewley and D Paulus ldquoSimple online and realtimetracking with a deep association metricrdquo in 2017 IEEE internationalconference on image processing (ICIP) IEEE 2017 pp 3645ndash3649

[79] N Wojke and A Bewley ldquoDeep cosine metric learning for personre-identificationrdquo in 2018 IEEE winter conference on applications ofcomputer vision (WACV) IEEE 2018 pp 748ndash756

[80] S Pouyanfar S Sadiq Y Yan H Tian Y Tao M P Reyes M-L ShyuS-C Chen and S Iyengar ldquoA survey on deep learning Algorithmstechniques and applicationsrdquo ACM Computing Surveys (CSUR) vol 51no 5 pp 1ndash36 2018

[81] A Brunetti D Buongiorno G F Trotta and V Bevilacqua ldquoCom-puter vision and deep learning techniques for pedestrian detection andtracking A surveyrdquo Neurocomputing vol 300 pp 17ndash33 2018

[82] N S Punn and S Agarwal ldquoCrowd analysis for congestion controlearly warning system on foot over bridgerdquo in 2019 Twelfth InternationalConference on Contemporary Computing (IC3) IEEE 2019 pp 1ndash6

[83] Pias ldquoObject detection and distance measurementrdquo httpsgithubcompaul-piasObject-Detection-and-Distance-Measurement 2020 [Onlineaccessed 01-March-2020]

[84] J Harvey Adam LaPlace (2019) Megapixels Origins ethics andprivacy implications of publicly available face recognition imagedatasets [Online] Available httpsmegapixelscc

  • I Introduction
  • II Background study and related work
  • III Object detection and tracking models
    • III-A Anchor boxes
      • III-A1 Loss Function
        • III-B Faster RCNN
          • III-B1 Loss function
            • III-C Single Shot Detector (SSD)
              • III-C1 Loss function
                • III-D YOLO
                  • III-D1 Loss function
                    • III-E Deepsort
                      • IV Proposed approach
                        • IV-A Workflow
                          • V Experiments and results
                          • VI Output
                          • VII Future scope and challenges
                          • VIII Conclusion
                          • References

6

Fig. 5: Schematic representation of the YOLO v3 architecture.

multiple bounding boxes and class probabilities. There are three major versions of YOLO: v1, v2 and v3. YOLO v1 is inspired by GoogLeNet (the Inception network), which is designed for object classification in an image. This network consists of 24 convolutional layers and 2 fully connected layers. Instead of the Inception modules used by GoogLeNet, YOLO v1 simply uses a reduction layer followed by convolutional layers. Later, YOLO v2 [64] was proposed with the objective of improving the accuracy significantly while making it faster. YOLO v2 uses Darknet-19 as a backbone network, consisting of 19 convolution layers along with 5 max pooling layers and an output softmax layer for object classification. YOLO v2 outperformed its predecessor (YOLO v1) with significant improvements in mAP, FPS and object classification score. In contrast, YOLO v3 performs multi-label classification with the help of logistic classifiers instead of the softmax used in YOLO v1 and v2. In YOLO v3, Redmon et al. proposed Darknet-53 as a backbone architecture that extracts feature maps for classification. In contrast to Darknet-19, Darknet-53 consists of residual blocks (short connections) along with upsampling layers for concatenation, which add depth to the network. YOLO v3 generates three predictions for each spatial location at different scales in an image, which mitigates the problem of not being able to detect small objects efficiently [77]. Each prediction is monitored by computing objectness, boundary box regressor and classification scores. Fig. 5 presents a schematic description of the YOLO v3 architecture.
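Since the residual (short-connection) blocks are the distinguishing element of Darknet-53, a minimal PyTorch-style sketch of one such unit is given below. The 1×1-then-3×3 layout follows the published Darknet-53 design, while the batch normalization placement and the leaky-ReLU slope are conventional assumptions rather than details stated in this article.

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """Sketch of a Darknet-53 residual unit: a 1x1 channel reduction
    followed by a 3x3 convolution, with a shortcut (residual) connection."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        # The short connection adds the block input back to its output.
        return x + self.block(x)
```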

1) Loss function: The overall loss function of YOLO v3 consists of a localization loss (bounding box regressor), cross-entropy, and a confidence loss for the classification score, defined as follows:

$$
\begin{aligned}
\mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \Big[ (t_x - \hat{t}_x)^2 + (t_y - \hat{t}_y)^2 + (t_w - \hat{t}_w)^2 + (t_h - \hat{t}_h)^2 \Big] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \Big( -\log(\sigma(t_o)) + \sum_{k=1}^{C} \mathrm{BCE}(\hat{y}_k, \sigma(s_k)) \Big) \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \big( -\log(1 - \sigma(t_o)) \big)
\end{aligned} \tag{8}
$$

where $\lambda_{coord}$ indicates the weight of the coordinate error, $S^2$ indicates the number of grid cells in the image, and $B$ is the number of bounding boxes generated per grid cell. $\mathbb{1}_{ij}^{obj} = 1$ indicates that the object is confined in the $j$-th bounding box of grid cell $i$, and is 0 otherwise.
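To make Eq. 8 concrete, the following NumPy sketch evaluates its three terms for a single scale. The tensor layout (S·S, B, 5 + C) and the λ values are illustrative assumptions, not settings reported in this article.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(y, p, eps=1e-9):
    # Binary cross entropy between target y and predicted probability p.
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def yolo_v3_loss(pred, target, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Sketch of Eq. 8 for one scale. pred and target have shape
    (S*S, B, 5 + C) holding (tx, ty, tw, th, to, class scores); obj_mask
    is a boolean (S*S, B) array marking boxes responsible for an object."""
    noobj_mask = ~obj_mask

    # Localization loss over responsible boxes only (first term of Eq. 8).
    coord_err = np.sum((pred[..., :4] - target[..., :4]) ** 2, axis=-1)
    loc_loss = lambda_coord * np.sum(coord_err * obj_mask)

    # Objectness loss: -log sigma(to) for object boxes,
    # -log(1 - sigma(to)) for background boxes (second and third terms).
    conf = sigmoid(pred[..., 4])
    obj_loss = -np.sum(np.log(conf + 1e-9) * obj_mask)
    noobj_loss = -lambda_noobj * np.sum(np.log(1 - conf + 1e-9) * noobj_mask)

    # Multi-label classification via independent logistic classifiers (BCE),
    # as YOLO v3 replaces softmax with per-class logistic regression.
    cls_loss = np.sum(bce(target[..., 5:], sigmoid(pred[..., 5:]))
                      * obj_mask[..., None])

    return loc_loss + obj_loss + cls_loss + noobj_loss
```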

E. Deepsort

Deepsort is a deep learning based approach to track custom objects in a video [78]. In the present research, Deepsort is utilized to track the individuals present in the surveillance footage. It makes use of patterns learned from the detected objects in the images, which are later combined with temporal information for predicting the associated trajectories of the objects of interest. It keeps track of each object under consideration by mapping unique identifiers for further statistical analysis. Deepsort is also useful for handling associated challenges such as occlusion, multiple viewpoints, non-stationary cameras and annotating training data. For effective tracking, the Kalman filter and the Hungarian algorithm are used. The Kalman filter is used recursively for better association, and it can predict future positions based on the current position. The Hungarian algorithm is used for association and ID attribution, identifying whether an object in the current frame is the same as one in the previous frame. Initially, a Faster RCNN is trained for person identification, and for tracking, a linear constant velocity model [79] is utilized to describe each target with an eight-dimensional state space as follows:

$$ \mathbf{x} = [u, v, \gamma, h, \dot{u}, \dot{v}, \dot{\gamma}, \dot{h}]^T \tag{9} $$

where $(u, v)$ is the centroid of the bounding box, $\gamma$ is the aspect ratio, and $h$ is the height of the bounding box; the remaining variables are the respective velocities. Later, the standard Kalman filter is used with a constant velocity motion and linear observation model, where the bounding box coordinates $(u, v, \gamma, h)$ are taken as direct observations of the object state.
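A minimal sketch of this constant-velocity Kalman model over the eight-dimensional state of Eq. 9 is given below; the unit time step and the noise scales q and r are illustrative assumptions, not values from the article.

```python
import numpy as np

# Constant-velocity model over the state of Eq. 9: (u, v, gamma, h)
# plus their velocities.
dt = 1.0
F = np.eye(8)
F[:4, 4:] = dt * np.eye(4)                    # position += velocity * dt
H = np.hstack([np.eye(4), np.zeros((4, 4))])  # observe (u, v, gamma, h) only

def predict(x, P, q=1e-2):
    """One Kalman prediction step for track state x with covariance P."""
    x = F @ x
    P = F @ P @ F.T + q * np.eye(8)
    return x, P

def update(x, P, z, r=1e-1):
    """Fuse a new bounding-box observation z = (u, v, gamma, h)."""
    S = H @ P @ H.T + r * np.eye(4)   # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)    # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(8) - K @ H) @ P
    return x, P
```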

For each track $k$, the total number of frames since the last successful measurement association $a_k$ is counted. With each positive prediction from the Kalman filter the counter is incremented, and when the track gets associated with a measurement its value is reset to 0. Furthermore, if an identified track exceeds a predefined maximum age, the object is considered to have left the scene and the corresponding track gets removed from the track set. If there are detected objects that cannot be mapped to existing tracks, new track hypotheses are initiated for each such detection. For the first three frames, new tracks are classified as indefinite until a successful measurement mapping is computed; tracks that are not successfully mapped to a measurement get deleted from the track set. The Hungarian algorithm is then utilized to solve the mapping problem between the newly arrived measurements and the predicted Kalman states, by considering both motion and appearance information. The motion information is captured with the Mahalanobis distance computed between them, as defined in Eq. 10:

$$ d^{(1)}(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i) \tag{10} $$

where the projection of the $i$-th track distribution into measurement space is represented by $(y_i, S_i)$ and the $j$-th bounding box detection by $d_j$. The Mahalanobis distance accounts for the estimation uncertainty by measuring how many standard deviations the detection lies away from the mean track location. Further, using this metric, it is possible to exclude unlikely associations by thresholding the Mahalanobis distance. This decision is denoted with an indicator that evaluates to 1 if the association between the $i$-th track and the $j$-th detection is admissible (Eq. 11):

$$ b_{ij}^{(1)} = \mathbb{1}[d^{(1)}(i, j) < t^{(1)}] \tag{11} $$

Although the Mahalanobis distance performs efficiently, it fails in environments where camera motion is possible; therefore, another metric is introduced for the assignment problem. This second metric measures the smallest cosine distance between the $i$-th track and the $j$-th detection in appearance space as follows:

$$ d^{(2)}(i, j) = \min\{1 - r_j^T r_k^{(i)} \mid r_k^{(i)} \in \mathcal{R}_i\} \tag{12} $$

Again, a binary variable is introduced to indicate whether an association is admissible according to this metric:

$$ b_{ij}^{(2)} = \mathbb{1}[d^{(2)}(i, j) < t^{(2)}] \tag{13} $$

and a suitable threshold is measured for this indicator on a separate training dataset. To build the association problem, both metrics are combined using a weighted sum:

$$ c_{ij} = \lambda d^{(1)}(i, j) + (1 - \lambda) d^{(2)}(i, j) \tag{14} $$

where an association is admissible if it is within the gating region of both metrics:

$$ b_{ij} = \prod_{m=1}^{2} b_{ij}^{(m)} \tag{15} $$

The influence of each metric on the combined association cost can be controlled through the hyperparameter $\lambda$.
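The following sketch ties Eqs. 10 to 15 together: it gates candidate pairs with both thresholds, builds the weighted cost of Eq. 14, and solves the resulting assignment problem (the Hungarian step) via scipy.optimize.linear_sum_assignment. The track/detection dictionary layout and the thresholds t1 and t2 are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, lam=0.5, t1=9.4877, t2=0.2):
    """Sketch of Eqs. 10-15. Each track carries a Kalman mean y, an
    innovation covariance S and a gallery R of past appearance descriptors;
    each detection carries a box vector d and an appearance descriptor r.
    t1 is the chi-square 0.95 quantile for 4 degrees of freedom; t2 is an
    illustrative appearance threshold."""
    n, m = len(tracks), len(detections)
    cost = np.full((n, m), 1e5)  # large cost marks gated-out pairs
    for i, trk in enumerate(tracks):
        S_inv = np.linalg.inv(trk["S"])
        for j, det in enumerate(detections):
            diff = det["d"] - trk["y"]
            d1 = diff @ S_inv @ diff                            # Eq. 10
            d2 = min(1.0 - det["r"] @ rk for rk in trk["R"])    # Eq. 12
            if d1 < t1 and d2 < t2:                # gating: Eqs. 11, 13, 15
                cost[i, j] = lam * d1 + (1 - lam) * d2          # Eq. 14
    rows, cols = linear_sum_assignment(cost)       # Hungarian assignment
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e5]
```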

IV. PROPOSED APPROACH

The emergence of deep learning has brought the best performing techniques for a wide variety of tasks and challenges, including medical diagnosis [74], machine translation [75], speech recognition [76], and many more [80]. Most of these tasks are centred around object classification, detection, segmentation, tracking and recognition [81], [82]. In recent years, convolution neural network (CNN) based architectures have shown significant performance improvements leading towards high-quality object detection, as shown in Fig. 2, which presents the performance of such models in terms of mAP and FPS on the standard benchmark datasets PASCAL-VOC [66] and MS-COCO [67], under similar hardware resources.

In the present article, a deep learning based framework is proposed that utilizes object detection and tracking models to aid the social distancing remedy for dealing with the escalation of COVID-19 cases. In order to maintain a balance of speed and accuracy, YOLO v3 [65] and Deepsort [78] are utilized as the object detection and tracking approaches, surrounding each detected object with a bounding box. These bounding boxes are then utilized to compute the pairwise L2 norm with a computationally efficient vectorized representation, in order to identify the clusters of people not obeying the order of social distancing. Furthermore, to visualize the clusters in the live stream, each bounding box is color-coded based on its association with a group, where people belonging to the same group are represented with the same color. Each surveillance frame is also accompanied by a streamline plot depicting the statistical count of the number of social groups and an index term (violation index) representing the ratio of the number of people to the number of groups. The estimated violations can then be computed by multiplying the violation index with the total number of social groups.

A. Workflow

This section describes the necessary steps undertaken to compose the framework for monitoring social distancing; a minimal sketch of the distance and grouping computation (steps 3 to 7) follows the list.

1) Fine-tune the trained object detection model to identify and track persons in the footage.

2) The trained model is fed the surveillance footage; the model generates a set of bounding boxes and an ID for each identified person.

3) Each individual is associated with a three-dimensional feature space $(x, y, d)$, where $(x, y)$ corresponds to the centroid coordinates of the bounding box and $d$ defines the depth of the individual as observed from the camera [83]:

$$ d = \frac{2 \times 3.14 \times 180}{(w + h \times 360)} \times 1000 + 3 \tag{16} $$

where $w$ is the width of the bounding box and $h$ is the height of the bounding box.

4) For the set of bounding boxes, the pairwise L2 norm is computed as given by the following equation:

$$ \|D\|_2 = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \tag{17} $$

where in this work $n = 3$.

5) The dense matrix of L2 norms is then utilized to assign to each individual the neighbors that satisfy the closeness sensitivity. Following extensive trials, the closeness threshold is updated dynamically based on the spatial location of the person in a given frame, ranging between (90, 170) pixels.

6) Any individual that meets the closeness property is assigned one or more neighbours, forming a group represented in a color coding that contrasts with other people.

7) The formation of groups indicates violation of the practice of social distancing, which is quantified as follows:

– consider $n_g$ as the number of groups or clusters identified and $n_p$ as the total number of people found in close proximity;

– $v_i = n_p / n_g$, where $v_i$ is the violation index.
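Below is a minimal NumPy sketch of steps 3 to 7. It uses a single fixed closeness threshold in place of the dynamically updated (90, 170) pixel range, and a simple union-find pass to form the groups; both simplifications are assumptions for illustration.

```python
import numpy as np

def violation_index(boxes, threshold=120.0):
    """boxes: (N, 4) array of (x, y, w, h) with centroid coordinates.
    Returns vi = np / ng over people found in close proximity."""
    x, y, w, h = boxes.T
    d = (2 * 3.14 * 180) / (w + h * 360) * 1000 + 3  # depth feature, Eq. 16
    feats = np.stack([x, y, d], axis=1)              # (x, y, d) per person

    # Pairwise vectorized L2 norm (Eq. 17) via broadcasting.
    diff = feats[:, None, :] - feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Adjacency of people violating the closeness threshold.
    close = (dist < threshold) & ~np.eye(len(boxes), dtype=bool)

    # Union-find style grouping over the adjacency matrix.
    parent = list(range(len(boxes)))
    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a
    for i, j in zip(*np.nonzero(close)):
        parent[find(i)] = find(j)

    members = [i for i in range(len(boxes)) if close[i].any()]
    groups = {find(i) for i in members}
    ng, n_people = len(groups), len(members)
    return n_people / ng if ng else 0.0   # vi = np / ng
```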

V. EXPERIMENTS AND RESULTS

Fig. 6: Data samples showing (a) true samples and (b) false samples of the "Person" class from the open image dataset.

Fig. 7: Losses per iteration of the object detection models during the training phase on the OID validation set for detecting the person in an image.

The above discussed object detection models are fine-tuned for binary classification (person or not a person), with Inception v2 as a backbone network, on an Nvidia GTX 1060 GPU, using a dataset acquired from the open image dataset (OID) repository [73] maintained by the Google open source community. Diverse images with the class label "Person" are downloaded via the OIDv4 toolkit [84] along with the annotations. Fig. 6 shows sample images of the obtained dataset, which consists of 800 images obtained by manual filtering to contain only true samples. The dataset is then divided into training and testing sets in an 8:2 ratio. In order to make the testing robust, the testing set is also accompanied by frames of the surveillance footage of the Oxford Town Center [23]; later, this footage is also utilized to simulate the overall approach for monitoring social distancing. In the case of Faster RCNN, the images are resized to P pixels on the shorter edge, with P = 600 and 1024 for low and high resolution, while in SSD and YOLO the images are scaled to the fixed dimension P × P with P = 416 (a resizing sketch is given below). During the training phase, the performance of the models is continuously monitored using mAP along with the localization, classification and overall loss in the detection of the person, as indicated in Fig. 7. Table III summarizes the results of each model obtained at the end of the training phase, with the training time (TT), number of iterations (NoI), mAP and total loss (TL) value. It is observed that the Faster RCNN model achieved minimal loss with maximum mAP, but has the lowest FPS, which makes it unsuitable for real-time applications. Furthermore, compared to SSD, YOLO v3 achieved better results with a balanced mAP, training time and FPS score. The trained YOLO v3 model is then utilized for monitoring social distancing on the surveillance video.
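The two resizing conventions can be sketched as follows, assuming OpenCV is used for preprocessing; the function names are illustrative.

```python
import cv2

def resize_shorter_edge(img, p=600):
    """Faster RCNN-style preprocessing: scale so the shorter edge equals p,
    preserving the aspect ratio."""
    h, w = img.shape[:2]
    scale = p / min(h, w)
    return cv2.resize(img, (int(w * scale), int(h * scale)))

def resize_fixed(img, p=416):
    """SSD/YOLO-style preprocessing: scale to a fixed p x p input."""
    return cv2.resize(img, (p, p))
```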

TABLE III: Performance comparison of the object detection models

Model          TT (in sec)   NoI     mAP     TL     FPS
Faster RCNN    9651          12135   0.969   0.02   3
SSD            2124          1200    0.691   0.22   10
YOLO v3        5659          7560    0.846   0.87   23

Fig. 8: Sample output of the proposed framework for monitoring social distancing on the surveillance footage of Oxford Town Center.

VI. OUTPUT

The proposed framework outputs (as shown in Fig. 8) the processed frame with the identified people confined in bounding boxes, while also simulating the statistical analysis showing the total number of social groups, displayed by the same color encoding, and a violation index term computed as the ratio of the number of people to the number of groups. The frames shown in Fig. 8 display violation index values of 3, 2, 2 and 2.33. The frames with detected violations are recorded with the timestamp for future analysis; a minimal recording sketch is given below.
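The recording step can be sketched as below. The output directory, the file naming and the vi > 1 trigger are assumptions for illustration, since the article does not specify how violation frames are persisted.

```python
import os
from datetime import datetime

import cv2

def record_violation(frame, vi, out_dir="violations"):
    """Persist frames whose violation index signals a violation,
    stamped with the capture time for later analysis."""
    if vi > 1.0:  # more than one person per group on average (assumed trigger)
        os.makedirs(out_dir, exist_ok=True)
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        cv2.imwrite(os.path.join(out_dir, f"violation-{stamp}-vi{vi:.2f}.jpg"),
                    frame)
```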

VII. FUTURE SCOPE AND CHALLENGES

Since this application is intended to be used in any working environment, high accuracy and precision are desired to serve the purpose. A higher number of false positives may raise discomfort and panic among the people being observed. There may also be genuinely raised concerns about privacy and individual rights, which can be addressed with additional measures such as prior consent for such working environments, hiding a person's identity in general, and maintaining transparency about fair use within a limited set of stakeholders.

VIII. CONCLUSION

The article proposes an efficient real-time deep learning based framework to automate the process of monitoring social distancing via object detection and tracking approaches, where each individual is identified in real-time with the help of bounding boxes. The generated bounding boxes aid in identifying the clusters or groups of people satisfying the closeness property, computed with the help of a pairwise vectorized approach. The number of violations is confirmed by computing the number of groups formed and the violation index, computed as the ratio of the number of people to the number of groups. Extensive trials were conducted with the popular state-of-the-art object detection models Faster RCNN, SSD and YOLO v3, where YOLO v3 illustrated the most efficient performance with a balanced FPS and mAP score. Since this approach is highly sensitive to the spatial location of the camera, the same approach can be fine-tuned to better adjust to the corresponding field of view.

ACKNOWLEDGMENT

The authors gratefully acknowledge the helpful comments and suggestions of colleagues. The authors are also indebted to the Interdisciplinary Cyber Physical Systems (ICPS) Programme, Department of Science and Technology (DST), Government of India (GoI), vide Reference No.244, for their financial support to carry out the background research, which helped significantly in the implementation of the present research work.

REFERENCES

[1] W. H. Organization, "WHO corona-viruses (COVID-19)," https://www.who.int/emergencies/diseases/novel-corona-virus-2019, 2020, [Online; accessed May 02, 2020].
[2] WHO, "WHO director-general's opening remarks at the media briefing on COVID-19 – 11 March 2020," https://www.who.int/dg/speeches/detail, 2020, [Online; accessed March 12, 2020].
[3] L. Hensley, "Social distancing is out, physical distancing is in – here's how to do it," Global News–Canada (27 March 2020), 2020.
[4] ECDPC, "Considerations relating to social distancing measures in response to COVID-19 – second update," https://www.ecdc.europa.eu/en/publications-data/considerations, 2020, [Online; accessed March 23, 2020].
[5] M. W. Fong, H. Gao, J. Y. Wong, J. Xiao, E. Y. Shiu, S. Ryu, and B. J. Cowling, "Nonpharmaceutical measures for pandemic influenza in nonhealthcare settings – social distancing measures," 2020.
[6] F. Ahmed, N. Zviedrite, and A. Uzicanin, "Effectiveness of workplace social distancing measures in reducing influenza transmission: a systematic review," BMC Public Health, vol. 18, no. 1, p. 518, 2018.
[7] W. O. Kermack and A. G. McKendrick, "Contributions to the mathematical theory of epidemics – I. 1927," 1991.
[8] C. Eksin, K. Paarporn, and J. S. Weitz, "Systematic biases in disease forecasting – the role of behavior change," Epidemics, vol. 27, pp. 96–105, 2019.
[9] M. Zhao and H. Zhao, "Asymptotic behavior of global positive solution to a stochastic SIR model incorporating media coverage," Advances in Difference Equations, vol. 2016, no. 1, pp. 1–17, 2016.
[10] P. Alto, "Landing AI named an April 2020 Cool Vendor in the Gartner Cool Vendors in AI Core Technologies," https://www.yahoo.com/lifestyle/landing-ai-named-april-2020-152100532.html, 2020, [Online; accessed April 21, 2020].
[11] A. Y. Ng, "Curriculum Vitae," https://ai.stanford.edu/~ang/curriculum-vitae.pdf.
[12] L. AI, "Landing AI named an April 2020 Cool Vendor in the Gartner Cool Vendors in AI Core Technologies," https://www.prnewswire.com/news-releases, 2020, [Online; accessed April 22, 2020].
[13] B. News, "China coronavirus: Lockdown measures rise across Hubei province," https://www.bbc.co.uk/news/world-asia-china-51217455, 2020, [Online; accessed January 23, 2020].
[14] N. H. C. of the People's Republic of China, "Daily briefing on novel coronavirus cases in China," http://en.nhc.gov.cn/2020-03/20/c_78006.htm, 2020, [Online; accessed March 20, 2020].
[15] K. Prem, Y. Liu, T. W. Russell, A. J. Kucharski, R. M. Eggo, N. Davies, S. Flasche, S. Clifford, C. A. Pearson, J. D. Munday et al., "The effect of control strategies to reduce social mixing on outcomes of the COVID-19 epidemic in Wuhan, China: a modelling study," The Lancet Public Health, 2020.
[16] C. Adolph, K. Amano, B. Bang-Jensen, N. Fullman, and J. Wilkerson, "Pandemic politics: Timing state-level social distancing responses to COVID-19," medRxiv, 2020.
[17] K. E. Ainslie, C. E. Walters, H. Fu, S. Bhatia, H. Wang, X. Xi, M. Baguelin, S. Bhatt, A. Boonyasiri, O. Boyd et al., "Evidence of initial success for China exiting COVID-19 social distancing policy after achieving containment," Wellcome Open Research, vol. 5, no. 81, p. 81, 2020.
[18] S. K. Sonbhadra, S. Agarwal, and P. Nagabhushan, "Target specific mining of COVID-19 scholarly articles using one-class approach," 2020.
[19] N. S. Punn and S. Agarwal, "Automated diagnosis of COVID-19 with limited posteroanterior chest x-ray images using fine-tuned deep neural networks," 2020.
[20] N. S. Punn, S. K. Sonbhadra, and S. Agarwal, "COVID-19 epidemic analysis using machine learning and deep learning algorithms," medRxiv, 2020. [Online]. Available: https://www.medrxiv.org/content/early/2020/04/11/2020.04.08.20057679
[21] O. website of Indian Government, "Distribution of the novel coronavirus-infected pneumonia: Aarogya Setu Mobile App," https://www.mygov.in/aarogya-setu-app, 2020.
[22] M. Robakowska, A. Tyranska-Fobke, J. Nowak, D. Slezak, P. Zuratynski, P. Robakowski, K. Nadolny, and J. R. Ładny, "The use of drones during mass events," Disaster and Emergency Medicine Journal, vol. 2, no. 3, pp. 129–134, 2017.
[23] J. Harvey and A. LaPlace. (2019) MegaPixels.cc: Origins, ethics, and privacy implications of publicly available face recognition image datasets. [Online]. Available: https://megapixels.cc
[24] N. Sulman, T. Sanocki, D. Goldgof, and R. Kasturi, "How effective is human video surveillance performance?" in 2008 19th International Conference on Pattern Recognition. IEEE, 2008, pp. 1–3.
[25] X. Wang, "Intelligent multi-camera video surveillance: A review," Pattern Recognition Letters, vol. 34, no. 1, pp. 3–19, 2013.
[26] K. A. Joshi and D. G. Thakore, "A survey on moving object detection and tracking in video surveillance system," International Journal of Soft Computing and Engineering, vol. 2, no. 3, pp. 44–48, 2012.
[27] O. Javed and M. Shah, "Tracking and object classification for automated surveillance," in European Conference on Computer Vision. Springer, 2002, pp. 343–357.
[28] S. Brutzer, B. Hoferlin, and G. Heidemann, "Evaluation of background subtraction techniques for video surveillance," in CVPR 2011. IEEE, 2011, pp. 1937–1944.
[29] S. Aslani and H. Mahdavi-Nasab, "Optical flow based moving object detection and tracking for traffic surveillance," International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering, vol. 7, no. 9, pp. 1252–1256, 2013.
[30] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance. IEEE, 2005, pp. 65–72.
[31] M. Piccardi, "Background subtraction techniques: a review," in 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No. 04CH37583), vol. 4. IEEE, 2004, pp. 3099–3104.
[32] Y. Xu, J. Dong, B. Zhang, and D. Xu, "Background modeling methods in video analysis: A review and comparative evaluation," CAAI Transactions on Intelligence Technology, vol. 1, no. 1, pp. 43–60, 2016.
[33] H. Tsutsui, J. Miura, and Y. Shirai, "Optical flow-based person tracking by multiple cameras," in Conference Documentation, International Conference on Multisensor Fusion and Integration for Intelligent Systems, MFI 2001 (Cat. No. 01TH8590). IEEE, 2001, pp. 91–96.
[34] A. Agarwal, S. Gupta, and D. K. Singh, "Review of optical flow technique for moving object detection," in 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I). IEEE, 2016, pp. 409–413.
[35] S. A. Niyogi and E. H. Adelson, "Analyzing gait with spatiotemporal surfaces," in Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects. IEEE, 1994, pp. 64–69.
[36] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
[37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[38] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[39] X. Chen and A. Gupta, "An implementation of Faster RCNN with study for region sampling," arXiv preprint arXiv:1702.02138, 2017.
[40] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
[41] M. Putra, Z. Yussof, K. Lim, and S. Salim, "Convolutional neural network for person and car detection using YOLO framework," Journal of Telecommunication, Electronic and Computer Engineering (JTEC), vol. 10, no. 1-7, pp. 67–71, 2018.
[42] R. Eshel and Y. Moses, "Homography based multiple camera detection and tracking of people in a dense crowd," in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[43] D.-Y. Chen, C.-W. Su, Y.-C. Zeng, S.-W. Sun, W.-R. Lai, and H.-Y. M. Liao, "An online people counting system for electronic advertising machines," in 2009 IEEE International Conference on Multimedia and Expo. IEEE, 2009, pp. 1262–1265.
[44] C.-W. Su, H.-Y. M. Liao, and H.-R. Tyan, "A vision-based people counting approach based on the symmetry measure," in 2009 IEEE International Symposium on Circuits and Systems. IEEE, 2009, pp. 2617–2620.
[45] J. Yao and J.-M. Odobez, "Fast human detection from joint appearance and foreground feature subset covariances," Computer Vision and Image Understanding, vol. 115, no. 10, pp. 1414–1426, 2011.
[46] B. Wu and R. Nevatia, "Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors," International Journal of Computer Vision, vol. 75, no. 2, pp. 247–266, 2007.
[47] F. Z. Eishita, A. Rahman, S. A. Azad, and A. Rahman, "Occlusion handling in object detection," in Multidisciplinary Computational Intelligence Techniques: Applications in Business, Engineering and Medicine. IGI Global, 2012, pp. 61–74.
[48] M. Singh, A. Basu, and M. K. Mandal, "Human activity recognition based on silhouette directionality," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 9, pp. 1280–1292, 2008.
[49] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886–893.
[50] P. Huang, A. Hilton, and J. Starck, "Shape similarity for 3D video sequences of people," International Journal of Computer Vision, vol. 89, no. 2-3, pp. 362–381, 2010.
[51] A. Samal and P. A. Iyengar, "Automatic recognition and analysis of human faces and facial expressions: A survey," Pattern Recognition, vol. 25, no. 1, pp. 65–77, 1992.
[52] D. Cunado, M. S. Nixon, and J. N. Carter, "Using gait as a biometric, via phase-weighted magnitude spectra," in International Conference on Audio- and Video-Based Biometric Person Authentication. Springer, 1997, pp. 93–102.
[53] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian detection in crowded scenes," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 878–885.
[54] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," in 2008 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2008, pp. 1–8.
[55] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Computing Surveys (CSUR), vol. 38, no. 4, pp. 13–es, 2006.
[56] C. Schuldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., vol. 3. IEEE, 2004, pp. 32–36.
[57] D. Weinland, R. Ronfard, and E. Boyer, "Free viewpoint action recognition using motion history volumes," Computer Vision and Image Understanding, vol. 104, no. 2-3, pp. 249–257, 2006.
[58] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, vol. 2. IEEE, 2005, pp. 1395–1402.
[59] O. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar, "The Oxford-IIIT pet dataset," 2012.
[60] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., "The open images dataset v4," International Journal of Computer Vision, pp. 1–26, 2020.
[61] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
[62] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[63] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
[64] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263–7271.
[65] J. R. A. Farhadi and J. Redmon, "YOLOv3: An incremental improvement," Retrieved September, vol. 17, p. 2018, 2018.
[66] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[67] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[68] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[69] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[70] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[71] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[72] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[73] Google, "Open image dataset v6," https://storage.googleapis.com/openimages/web/index.html, 2020, [Online; accessed 25-February-2020].
[74] N. S. Punn and S. Agarwal, "Inception U-Net architecture for semantic segmentation to identify nuclei in microscopy cell images," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 16, no. 1, pp. 1–15, 2020.
[75] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. N. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar et al., "Tensor2Tensor for neural machine translation," arXiv preprint arXiv:1803.07416, 2018.
[76] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen et al., "Deep Speech 2: End-to-end speech recognition in English and Mandarin," in International Conference on Machine Learning, 2016, pp. 173–182.
[77] A. Sonawane, "Medium: YOLOv3: A huge improvement," https://medium.com/anand_sonawane/yolo3, 2020, [Online; accessed Dec 6, 2019].
[78] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 3645–3649.
[79] N. Wojke and A. Bewley, "Deep cosine metric learning for person re-identification," in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 748–756.
[80] S. Pouyanfar, S. Sadiq, Y. Yan, H. Tian, Y. Tao, M. P. Reyes, M.-L. Shyu, S.-C. Chen, and S. Iyengar, "A survey on deep learning: Algorithms, techniques, and applications," ACM Computing Surveys (CSUR), vol. 51, no. 5, pp. 1–36, 2018.
[81] A. Brunetti, D. Buongiorno, G. F. Trotta, and V. Bevilacqua, "Computer vision and deep learning techniques for pedestrian detection and tracking: A survey," Neurocomputing, vol. 300, pp. 17–33, 2018.
[82] N. S. Punn and S. Agarwal, "Crowd analysis for congestion control early warning system on foot over bridge," in 2019 Twelfth International Conference on Contemporary Computing (IC3). IEEE, 2019, pp. 1–6.
[83] Pias, "Object detection and distance measurement," https://github.com/paul-pias/Object-Detection-and-Distance-Measurement, 2020, [Online; accessed 01-March-2020].
[84] J. Harvey and A. LaPlace. (2019) MegaPixels: Origins, ethics, and privacy implications of publicly available face recognition image datasets. [Online]. Available: https://megapixels.cc



[19] N S Punn and S Agarwal ldquoAutomated diagnosis of covid-19 withlimited posteroanterior chest x-ray images using fine-tuned deep neuralnetworksrdquo 2020

[20] N S Punn S K Sonbhadra and S Agarwal ldquoCovid-19 epidemicanalysis using machine learning and deep learning algorithmsrdquomedRxiv 2020 [Online] Available httpswwwmedrxivorgcontentearly202004112020040820057679

[21] O website of Indian Government ldquoDistribution of the novelcoronavirus-infected pneumoni Aarogya Setu Mobile Apprdquo httpswwwmygovinaarogya-setu-app 2020

[22] M Robakowska A Tyranska-Fobke J Nowak D Slezak P ZuratynskiP Robakowski K Nadolny and J R Ładny ldquoThe use of drones duringmass eventsrdquo Disaster and Emergency Medicine Journal vol 2 no 3pp 129ndash134 2017

[23] J Harvey Adam LaPlace (2019) Megapixelscc Origins ethicsand privacy implications of publicly available face recognition imagedatasets [Online] Available httpsmegapixelscc

[24] N Sulman T Sanocki D Goldgof and R Kasturi ldquoHow effectiveis human video surveillance performancerdquo in 2008 19th InternationalConference on Pattern Recognition IEEE 2008 pp 1ndash3

[25] X Wang ldquoIntelligent multi-camera video surveillance A reviewrdquo Pat-tern recognition letters vol 34 no 1 pp 3ndash19 2013

[26] K A Joshi and D G Thakore ldquoA survey on moving object detectionand tracking in video surveillance systemrdquo International Journal of SoftComputing and Engineering vol 2 no 3 pp 44ndash48 2012

[27] O Javed and M Shah ldquoTracking and object classification for automatedsurveillancerdquo in European Conference on Computer Vision Springer2002 pp 343ndash357

[28] S Brutzer B Hoferlin and G Heidemann ldquoEvaluation of backgroundsubtraction techniques for video surveillancerdquo in CVPR 2011 IEEE2011 pp 1937ndash1944

[29] S Aslani and H Mahdavi-Nasab ldquoOptical flow based moving objectdetection and tracking for traffic surveillancerdquo International Journal ofElectrical Computer Energetic Electronic and Communication Engi-neering vol 7 no 9 pp 1252ndash1256 2013

[30] P Dollar V Rabaud G Cottrell and S Belongie ldquoBehavior recognitionvia sparse spatio-temporal featuresrdquo in 2005 IEEE International Work-shop on Visual Surveillance and Performance Evaluation of Trackingand Surveillance IEEE 2005 pp 65ndash72

[31] M Piccardi ldquoBackground subtraction techniques a reviewrdquo in 2004IEEE International Conference on Systems Man and Cybernetics (IEEECat No 04CH37583) vol 4 IEEE 2004 pp 3099ndash3104

[32] Y Xu J Dong B Zhang and D Xu ldquoBackground modeling methodsin video analysis A review and comparative evaluationrdquo CAAI Trans-actions on Intelligence Technology vol 1 no 1 pp 43ndash60 2016

[33] H Tsutsui J Miura and Y Shirai ldquoOptical flow-based person trackingby multiple camerasrdquo in Conference Documentation International Con-ference on Multisensor Fusion and Integration for Intelligent SystemsMFI 2001 (Cat No 01TH8590) IEEE 2001 pp 91ndash96

[34] A Agarwal S Gupta and D K Singh ldquoReview of optical flowtechnique for moving object detectionrdquo in 2016 2nd InternationalConference on Contemporary Computing and Informatics (IC3I) IEEE2016 pp 409ndash413

[35] S A Niyogi and E H Adelson ldquoAnalyzing gait with spatiotemporalsurfacesrdquo in Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects IEEE 1994 pp 64ndash69

[36] Z-Q Zhao P Zheng S-t Xu and X Wu ldquoObject detection with deeplearning A reviewrdquo IEEE transactions on neural networks and learningsystems vol 30 no 11 pp 3212ndash3232 2019

[37] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neural infor-mation processing systems 2012 pp 1097ndash1105

10

[38] S Ren K He R Girshick and J Sun ldquoFaster r-cnn Towards real-timeobject detection with region proposal networksrdquo in Advances in neuralinformation processing systems 2015 pp 91ndash99

[39] X Chen and A Gupta ldquoAn implementation of faster rcnn with studyfor region samplingrdquo arXiv preprint arXiv170202138 2017

[40] J Redmon S Divvala R Girshick and A Farhadi ldquoYou only lookonce Unified real-time object detectionrdquo in Proceedings of the IEEEconference on computer vision and pattern recognition 2016 pp 779ndash788

[41] M Putra Z Yussof K Lim and S Salim ldquoConvolutional neuralnetwork for person and car detection using yolo frameworkrdquo Journalof Telecommunication Electronic and Computer Engineering (JTEC)vol 10 no 1-7 pp 67ndash71 2018

[42] R Eshel and Y Moses ldquoHomography based multiple camera detectionand tracking of people in a dense crowdrdquo in 2008 IEEE Conference onComputer Vision and Pattern Recognition IEEE 2008 pp 1ndash8

[43] D-Y Chen C-W Su Y-C Zeng S-W Sun W-R Lai and H-Y M Liao ldquoAn online people counting system for electronic advertisingmachinesrdquo in 2009 IEEE International Conference on Multimedia andExpo IEEE 2009 pp 1262ndash1265

[44] C-W Su H-Y M Liao and H-R Tyan ldquoA vision-based peoplecounting approach based on the symmetry measurerdquo in 2009 IEEEInternational Symposium on Circuits and Systems IEEE 2009 pp2617ndash2620

[45] J Yao and J-M Odobez ldquoFast human detection from joint appearanceand foreground feature subset covariancesrdquo Computer Vision and ImageUnderstanding vol 115 no 10 pp 1414ndash1426 2011

[46] B Wu and R Nevatia ldquoDetection and tracking of multiple partiallyoccluded humans by bayesian combination of edgelet based part de-tectorsrdquo International Journal of Computer Vision vol 75 no 2 pp247ndash266 2007

[47] F Z Eishita A Rahman S A Azad and A Rahman ldquoOcclusionhandling in object detectionrdquo in Multidisciplinary Computational Intelli-gence Techniques Applications in Business Engineering and MedicineIGI Global 2012 pp 61ndash74

[48] M Singh A Basu and M K Mandal ldquoHuman activity recognitionbased on silhouette directionalityrdquo IEEE transactions on circuits andsystems for video technology vol 18 no 9 pp 1280ndash1292 2008

[49] N Dalal and B Triggs ldquoHistograms of oriented gradients for humandetectionrdquo in 2005 IEEE computer society conference on computervision and pattern recognition (CVPRrsquo05) vol 1 IEEE 2005 pp886ndash893

[50] P Huang A Hilton and J Starck ldquoShape similarity for 3d videosequences of peoplerdquo International Journal of Computer Vision vol 89no 2-3 pp 362ndash381 2010

[51] A Samal and P A Iyengar ldquoAutomatic recognition and analysis ofhuman faces and facial expressions A surveyrdquo Pattern recognitionvol 25 no 1 pp 65ndash77 1992

[52] D Cunado M S Nixon and J N Carter ldquoUsing gait as a biometricvia phase-weighted magnitude spectrardquo in International Conference onAudio-and Video-Based Biometric Person Authentication Springer1997 pp 93ndash102

[53] B Leibe E Seemann and B Schiele ldquoPedestrian detection in crowdedscenesrdquo in 2005 IEEE Computer Society Conference on Computer Visionand Pattern Recognition (CVPRrsquo05) vol 1 IEEE 2005 pp 878ndash885

[54] M Andriluka S Roth and B Schiele ldquoPeople-tracking-by-detectionand people-detection-by-trackingrdquo in 2008 IEEE Conference on com-puter vision and pattern recognition IEEE 2008 pp 1ndash8

[55] A Yilmaz O Javed and M Shah ldquoObject tracking A surveyrdquo Acmcomputing surveys (CSUR) vol 38 no 4 pp 13ndashes 2006

[56] C Schuldt I Laptev and B Caputo ldquoRecognizing human actions alocal svm approachrdquo in Proceedings of the 17th International Confer-ence on Pattern Recognition 2004 ICPR 2004 vol 3 IEEE 2004pp 32ndash36

[57] D Weinland R Ronfard and E Boyer ldquoFree viewpoint action recog-nition using motion history volumesrdquo Computer vision and imageunderstanding vol 104 no 2-3 pp 249ndash257 2006

[58] M Blank L Gorelick E Shechtman M Irani and R Basri ldquoActionsas space-time shapesrdquo in Tenth IEEE International Conference onComputer Vision (ICCVrsquo05) Volume 1 vol 2 IEEE 2005 pp 1395ndash1402

[59] O Parkhi A Vedaldi A Zisserman and C Jawahar ldquoThe oxford-iiitpet datasetrdquo 2012

[60] A Kuznetsova H Rom N Alldrin J Uijlings I Krasin J Pont-TusetS Kamali S Popov M Malloci A Kolesnikov et al ldquoThe open imagesdataset v4rdquo International Journal of Computer Vision pp 1ndash26 2020

[61] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic segmentationrdquoin Proceedings of the IEEE conference on computer vision and patternrecognition 2014 pp 580ndash587

[62] R Girshick ldquoFast r-cnnrdquo in Proceedings of the IEEE internationalconference on computer vision 2015 pp 1440ndash1448

[63] W Liu D Anguelov D Erhan C Szegedy S Reed C-Y Fu and A CBerg ldquoSsd Single shot multibox detectorrdquo in European conference oncomputer vision Springer 2016 pp 21ndash37

[64] J Redmon and A Farhadi ldquoYolo9000 better faster strongerrdquo inProceedings of the IEEE conference on computer vision and patternrecognition 2017 pp 7263ndash7271

[65] J R A Farhadi and J Redmon ldquoYolov3 An incremental improvementrdquoRetrieved September vol 17 p 2018 2018

[66] M Everingham L Van Gool C K Williams J Winn and A Zisser-man ldquoThe pascal visual object classes (voc) challengerdquo Internationaljournal of computer vision vol 88 no 2 pp 303ndash338 2010

[67] T-Y Lin M Maire S Belongie J Hays P Perona D RamananP Dollar and C L Zitnick ldquoMicrosoft coco Common objects incontextrdquo in European conference on computer vision Springer 2014pp 740ndash755

[68] K Simonyan and A Zisserman ldquoVery deep convolutional networks forlarge-scale image recognitionrdquo arXiv preprint arXiv14091556 2014

[69] K He X Zhang S Ren and J Sun ldquoDeep residual learning for imagerecognitionrdquo in Proceedings of the IEEE conference on computer visionand pattern recognition 2016 pp 770ndash778

[70] C Szegedy V Vanhoucke S Ioffe J Shlens and Z Wojna ldquoRethinkingthe inception architecture for computer visionrdquo in Proceedings of theIEEE conference on computer vision and pattern recognition 2016 pp2818ndash2826

[71] O Russakovsky J Deng H Su J Krause S Satheesh S MaZ Huang A Karpathy A Khosla M Bernstein et al ldquoImagenet largescale visual recognition challengerdquo International journal of computervision vol 115 no 3 pp 211ndash252 2015

[72] C Szegedy S Ioffe V Vanhoucke and A A Alemi ldquoInception-v4inception-resnet and the impact of residual connections on learningrdquo inThirty-first AAAI conference on artificial intelligence 2017

[73] Google ldquoOpen image dataset v6rdquo httpsstoragegoogleapiscomopenimageswebindexhtml 2020 [Online accessed 25-February-2020]

[74] N S Punn and S Agarwal ldquoInception u-net architecture for semanticsegmentation to identify nuclei in microscopy cell imagesrdquo ACM Trans-actions on Multimedia Computing Communications and Applications(TOMM) vol 16 no 1 pp 1ndash15 2020

[75] A Vaswani S Bengio E Brevdo F Chollet A N Gomez S GouwsL Jones Ł Kaiser N Kalchbrenner N Parmar et al ldquoTensor2tensorfor neural machine translationrdquo arXiv preprint arXiv180307416 2018

[76] D Amodei S Ananthanarayanan R Anubhai J Bai E BattenbergC Case J Casper B Catanzaro Q Cheng G Chen et al ldquoDeepspeech 2 End-to-end speech recognition in english and mandarinrdquo inInternational conference on machine learning 2016 pp 173ndash182

[77] A Sonawane ldquoMedium YOLOv3 A Huge Improvementrdquo httpsmediumcomanand sonawaneyolo3 2020 [Online accessed Dec6 2019]

[78] N Wojke A Bewley and D Paulus ldquoSimple online and realtimetracking with a deep association metricrdquo in 2017 IEEE internationalconference on image processing (ICIP) IEEE 2017 pp 3645ndash3649

[79] N Wojke and A Bewley ldquoDeep cosine metric learning for personre-identificationrdquo in 2018 IEEE winter conference on applications ofcomputer vision (WACV) IEEE 2018 pp 748ndash756

[80] S Pouyanfar S Sadiq Y Yan H Tian Y Tao M P Reyes M-L ShyuS-C Chen and S Iyengar ldquoA survey on deep learning Algorithmstechniques and applicationsrdquo ACM Computing Surveys (CSUR) vol 51no 5 pp 1ndash36 2018

[81] A Brunetti D Buongiorno G F Trotta and V Bevilacqua ldquoCom-puter vision and deep learning techniques for pedestrian detection andtracking A surveyrdquo Neurocomputing vol 300 pp 17ndash33 2018

[82] N S Punn and S Agarwal ldquoCrowd analysis for congestion controlearly warning system on foot over bridgerdquo in 2019 Twelfth InternationalConference on Contemporary Computing (IC3) IEEE 2019 pp 1ndash6

[83] Pias ldquoObject detection and distance measurementrdquo httpsgithubcompaul-piasObject-Detection-and-Distance-Measurement 2020 [Onlineaccessed 01-March-2020]

[84] J Harvey Adam LaPlace (2019) Megapixels Origins ethics andprivacy implications of publicly available face recognition imagedatasets [Online] Available httpsmegapixelscc

  • I Introduction
  • II Background study and related work
  • III Object detection and tracking models
    • III-A Anchor boxes
      • III-A1 Loss Function
        • III-B Faster RCNN
          • III-B1 Loss function
            • III-C Single Shot Detector (SSD)
              • III-C1 Loss function
                • III-D YOLO
                  • III-D1 Loss function
                    • III-E Deepsort
                      • IV Proposed approach
                        • IV-A Workflow
                          • V Experiments and results
                          • VI Output
                          • VII Future scope and challenges
                          • VIII Conclusion
                          • References

8

Fig 6 Data samples showing (a) true samples and (b) falsesamples of a ldquoPersonrdquo class from the open image dataset

Fig 7 Losses per iteration of the object detection modelsduring the training phase on the OID validation set fordetecting the person in an image

repository [73] maintained by the Google open source com-munity The diverse images with a class label as ldquoPersonrdquo aredownloaded via OIDv4 toolkit [84] along with the annotationsFig 6 shows the sample images of the obtained dataset con-sisting of 800 images which is obtained by manually filteringto only contain the true samples The dataset is then dividedinto training and testing sets in 82 ratio In order to make thetesting robust the testing set is also accompanied by the framesof surveillance footage of the Oxford town center [23] Laterthis footage is also utilized to simulate the overall approach formonitoring the social distancing In case of faster RCNN theimages are resized to P pixels on the shorter edge with 600and 1024 for low and high resolution while in SSD and YOLOthe images are scaled to the fixed dimension P times P with Pvalue as 416 During the training phase the performance of themodels is continuously monitored using the mAP along withthe localization classification and overall loss in the detectionof the person as indicated in Fig 7 Table III summarizes theresults of each model obtained at the end of the training phasewith the training time (TT) number of iterations (NoI) mAPand total loss (TL) value It is observed that the faster RCNNmodel achieved minimal loss with maximum mAP howeverhas the lowest FPS which makes it not suitable for real-timeapplications Furthermore as compared to SSD YOLO v3achieved better results with balanced mAP training time andFPS score The trained YOLO v3 model is then utilized formonitoring the social distancing on the surveillance video

TABLE III Performance comparison of the object detectionmodels

Model TT (in sec) NoI mAP TL FPSFaster RCNN 9651 12135 0969 002 3SSD 2124 1200 0691 022 10YOLO v3 5659 7560 0846 087 23

Fig 8 Sample output of the proposed framework for monitor-ing social distancing on surveillance footage of Oxford TownCenter

VI OUTPUT

The proposed framework outputs (as shown in Fig 8) theprocessed frame with the identified people confined in thebounding boxes while also simulating the statistical analysisshowing the total number of social groups displayed by samecolor encoding and a violation index term computed as theratio of the number of people to the number of groups Theframes shown in Fig 8 displays violation index as 3 2 2 and233 The frames with detected violations are recorded withthe timestamp for future analysis

VII FUTURE SCOPE AND CHALLENGES

Since this application is intended to be used in any workingenvironment accuracy and precision are highly desired toserve the purpose Higher number of false positive may raisediscomfort and panic situation among people being observedThere may also be genuinely raised concerns about privacy andindividual rights which can be addressed with some additionalmeasures such as prior consents for such working environ-ments hiding a personrsquos identity in general and maintainingtransparency about its fair uses within limited stakeholders

VIII CONCLUSION

The article proposes an efficient real-time deep learningbased framework to automate the process of monitoring thesocial distancing via object detection and tracking approacheswhere each individual is identified in the real-time with the

9

help of bounding boxes The generated bounding boxes aidin identifying the clusters or groups of people satisfyingthe closeness property computed with the help of pairwisevectorized approach The number of violations are confirmedby computing the number of groups formed and violationindex term computed as the ratio of the number of peopleto the number of groups The extensive trials were conductedwith popular state-of-the-art object detection models FasterRCNN SSD and YOLO v3 where YOLO v3 illustratedthe efficient performance with balanced FPS and mAP scoreSince this approach is highly sensitive to the spatial locationof the camera the same approach can be fine tuned to betteradjust with the corresponding field of view

ACKNOWLEDGMENT

The authors gratefully acknowledge the helpful commentsand suggestions of colleagues Authors are also indebted toInterdisciplinary Cyber Physical Systems (ICPS) ProgrammeDepartment of Science and Technology (DST) Governmentof India (GoI) vide Reference No244 for their financialsupport to carry out the background research which helpedsignificantly for the implementation of present research work

REFERENCES

[1] W H Organization ldquoWHO corona-viruses (COVID-19)rdquo httpswwwwhointemergenciesdiseasesnovel-corona-virus-2019 2020 [Onlineaccessed May 02 2020]

[2] WHO ldquoWho director-generalrsquos opening remarks at the media briefingon covid-19-11 march 2020rdquo httpswwwwhointdgspeechesdetail2020 [Online accessed March 12 2020]

[3] L Hensley ldquoSocial distancing is out physical distancing is inmdashherersquoshow to do itrdquo Global NewsndashCanada (27 March 2020) 2020

[4] ECDPC ldquoConsiderations relating to social distancing measures inresponse to COVID-19 ndash second updaterdquo httpswwwecdceuropaeuenpublications-dataconsiderations 2020 [Online accessed March 232020]

[5] M W Fong H Gao J Y Wong J Xiao E Y Shiu S Ryu andB J Cowling ldquoNonpharmaceutical measures for pandemic influenza innonhealthcare settingsmdashsocial distancing measuresrdquo 2020

[6] F Ahmed N Zviedrite and A Uzicanin ldquoEffectiveness of workplacesocial distancing measures in reducing influenza transmission a system-atic reviewrdquo BMC public health vol 18 no 1 p 518 2018

[7] W O Kermack and A G McKendrick ldquoContributions to the mathe-matical theory of epidemicsndashi 1927rdquo 1991

[8] C Eksin K Paarporn and J S Weitz ldquoSystematic biases in diseaseforecastingndashthe role of behavior changerdquo Epidemics vol 27 pp 96ndash105 2019

[9] M Zhao and H Zhao ldquoAsymptotic behavior of global positive solutionto a stochastic sir model incorporating media coveragerdquo Advances inDifference Equations vol 2016 no 1 pp 1ndash17 2016

[10] P Alto ldquoLanding AI Named an April 2020 Cool Vendor in the Gart-ner Cool Vendors in AI Core Technologiesrdquo httpswwwyahoocomlifestylelanding-ai-named-april-2020-152100532html 2020 [Onlineaccessed April 21 2020]

[11] A Y Ng ldquoCurriculum Vitaerdquo httpsaistanfordedusimangcurriculum-vitaepdf

[12] L AI ldquoLanding AI Named an April 2020 Cool Vendor in the GartnerCool Vendors in AI Core Technologiesrdquo httpswwwprnewswirecomnews-releases 2020 [Online accessed April 22 2020]

[13] B News ldquoChina coronavirus Lockdown measures rise across Hubeiprovincerdquo httpswwwbbccouknewsworld-asia-china512174552020 [Online accessed January 23 2020]

[14] N H C of the Peoplersquos Republic of China ldquoDaily briefing on novelcoronavirus cases in Chinardquo httpennhcgovcn2020-0320c 78006htm 2020 [Online accessed March 20 2020]

[15] K Prem Y Liu T W Russell A J Kucharski R M Eggo N DaviesS Flasche S Clifford C A Pearson J D Munday et al ldquoThe effectof control strategies to reduce social mixing on outcomes of the covid-19 epidemic in wuhan china a modelling studyrdquo The Lancet PublicHealth 2020

[16] C Adolph K Amano B Bang-Jensen N Fullman and J WilkersonldquoPandemic politics Timing state-level social distancing responses tocovid-19rdquo medRxiv 2020

[17] K E Ainslie C E Walters H Fu S Bhatia H Wang X XiM Baguelin S Bhatt A Boonyasiri O Boyd et al ldquoEvidence ofinitial success for china exiting covid-19 social distancing policy afterachieving containmentrdquo Wellcome Open Research vol 5 no 81 p 812020

[18] S K Sonbhadra S Agarwal and P Nagabhushan ldquoTarget specificmining of covid-19 scholarly articles using one-class approachrdquo 2020

[19] N S Punn and S Agarwal ldquoAutomated diagnosis of covid-19 withlimited posteroanterior chest x-ray images using fine-tuned deep neuralnetworksrdquo 2020

[20] N S Punn S K Sonbhadra and S Agarwal ldquoCovid-19 epidemicanalysis using machine learning and deep learning algorithmsrdquomedRxiv 2020 [Online] Available httpswwwmedrxivorgcontentearly202004112020040820057679

[21] O website of Indian Government ldquoDistribution of the novelcoronavirus-infected pneumoni Aarogya Setu Mobile Apprdquo httpswwwmygovinaarogya-setu-app 2020

[22] M Robakowska A Tyranska-Fobke J Nowak D Slezak P ZuratynskiP Robakowski K Nadolny and J R Ładny ldquoThe use of drones duringmass eventsrdquo Disaster and Emergency Medicine Journal vol 2 no 3pp 129ndash134 2017

[23] J Harvey Adam LaPlace (2019) Megapixelscc Origins ethicsand privacy implications of publicly available face recognition imagedatasets [Online] Available httpsmegapixelscc

[24] N Sulman T Sanocki D Goldgof and R Kasturi ldquoHow effectiveis human video surveillance performancerdquo in 2008 19th InternationalConference on Pattern Recognition IEEE 2008 pp 1ndash3

[25] X Wang ldquoIntelligent multi-camera video surveillance A reviewrdquo Pat-tern recognition letters vol 34 no 1 pp 3ndash19 2013

[26] K A Joshi and D G Thakore ldquoA survey on moving object detectionand tracking in video surveillance systemrdquo International Journal of SoftComputing and Engineering vol 2 no 3 pp 44ndash48 2012

[27] O Javed and M Shah ldquoTracking and object classification for automatedsurveillancerdquo in European Conference on Computer Vision Springer2002 pp 343ndash357

[28] S Brutzer B Hoferlin and G Heidemann ldquoEvaluation of backgroundsubtraction techniques for video surveillancerdquo in CVPR 2011 IEEE2011 pp 1937ndash1944

[29] S Aslani and H Mahdavi-Nasab ldquoOptical flow based moving objectdetection and tracking for traffic surveillancerdquo International Journal ofElectrical Computer Energetic Electronic and Communication Engi-neering vol 7 no 9 pp 1252ndash1256 2013

[30] P Dollar V Rabaud G Cottrell and S Belongie ldquoBehavior recognitionvia sparse spatio-temporal featuresrdquo in 2005 IEEE International Work-shop on Visual Surveillance and Performance Evaluation of Trackingand Surveillance IEEE 2005 pp 65ndash72

[31] M Piccardi ldquoBackground subtraction techniques a reviewrdquo in 2004IEEE International Conference on Systems Man and Cybernetics (IEEECat No 04CH37583) vol 4 IEEE 2004 pp 3099ndash3104

[32] Y Xu J Dong B Zhang and D Xu ldquoBackground modeling methodsin video analysis A review and comparative evaluationrdquo CAAI Trans-actions on Intelligence Technology vol 1 no 1 pp 43ndash60 2016

[33] H Tsutsui J Miura and Y Shirai ldquoOptical flow-based person trackingby multiple camerasrdquo in Conference Documentation International Con-ference on Multisensor Fusion and Integration for Intelligent SystemsMFI 2001 (Cat No 01TH8590) IEEE 2001 pp 91ndash96

[34] A Agarwal S Gupta and D K Singh ldquoReview of optical flowtechnique for moving object detectionrdquo in 2016 2nd InternationalConference on Contemporary Computing and Informatics (IC3I) IEEE2016 pp 409ndash413

[35] S A Niyogi and E H Adelson ldquoAnalyzing gait with spatiotemporalsurfacesrdquo in Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects IEEE 1994 pp 64ndash69

[36] Z-Q Zhao P Zheng S-t Xu and X Wu ldquoObject detection with deeplearning A reviewrdquo IEEE transactions on neural networks and learningsystems vol 30 no 11 pp 3212ndash3232 2019

[37] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neural infor-mation processing systems 2012 pp 1097ndash1105

10

[38] S Ren K He R Girshick and J Sun ldquoFaster r-cnn Towards real-timeobject detection with region proposal networksrdquo in Advances in neuralinformation processing systems 2015 pp 91ndash99

[39] X Chen and A Gupta ldquoAn implementation of faster rcnn with studyfor region samplingrdquo arXiv preprint arXiv170202138 2017

[40] J Redmon S Divvala R Girshick and A Farhadi ldquoYou only lookonce Unified real-time object detectionrdquo in Proceedings of the IEEEconference on computer vision and pattern recognition 2016 pp 779ndash788

[41] M Putra Z Yussof K Lim and S Salim ldquoConvolutional neuralnetwork for person and car detection using yolo frameworkrdquo Journalof Telecommunication Electronic and Computer Engineering (JTEC)vol 10 no 1-7 pp 67ndash71 2018

[42] R Eshel and Y Moses ldquoHomography based multiple camera detectionand tracking of people in a dense crowdrdquo in 2008 IEEE Conference onComputer Vision and Pattern Recognition IEEE 2008 pp 1ndash8

[43] D-Y Chen C-W Su Y-C Zeng S-W Sun W-R Lai and H-Y M Liao ldquoAn online people counting system for electronic advertisingmachinesrdquo in 2009 IEEE International Conference on Multimedia andExpo IEEE 2009 pp 1262ndash1265

[44] C-W Su H-Y M Liao and H-R Tyan ldquoA vision-based peoplecounting approach based on the symmetry measurerdquo in 2009 IEEEInternational Symposium on Circuits and Systems IEEE 2009 pp2617ndash2620

[45] J Yao and J-M Odobez ldquoFast human detection from joint appearanceand foreground feature subset covariancesrdquo Computer Vision and ImageUnderstanding vol 115 no 10 pp 1414ndash1426 2011

[46] B Wu and R Nevatia ldquoDetection and tracking of multiple partiallyoccluded humans by bayesian combination of edgelet based part de-tectorsrdquo International Journal of Computer Vision vol 75 no 2 pp247ndash266 2007

[47] F Z Eishita A Rahman S A Azad and A Rahman ldquoOcclusionhandling in object detectionrdquo in Multidisciplinary Computational Intelli-gence Techniques Applications in Business Engineering and MedicineIGI Global 2012 pp 61ndash74

[48] M Singh A Basu and M K Mandal ldquoHuman activity recognitionbased on silhouette directionalityrdquo IEEE transactions on circuits andsystems for video technology vol 18 no 9 pp 1280ndash1292 2008

[49] N Dalal and B Triggs ldquoHistograms of oriented gradients for humandetectionrdquo in 2005 IEEE computer society conference on computervision and pattern recognition (CVPRrsquo05) vol 1 IEEE 2005 pp886ndash893

[50] P Huang A Hilton and J Starck ldquoShape similarity for 3d videosequences of peoplerdquo International Journal of Computer Vision vol 89no 2-3 pp 362ndash381 2010

[51] A Samal and P A Iyengar ldquoAutomatic recognition and analysis ofhuman faces and facial expressions A surveyrdquo Pattern recognitionvol 25 no 1 pp 65ndash77 1992

[52] D Cunado M S Nixon and J N Carter ldquoUsing gait as a biometricvia phase-weighted magnitude spectrardquo in International Conference onAudio-and Video-Based Biometric Person Authentication Springer1997 pp 93ndash102

[53] B Leibe E Seemann and B Schiele ldquoPedestrian detection in crowdedscenesrdquo in 2005 IEEE Computer Society Conference on Computer Visionand Pattern Recognition (CVPRrsquo05) vol 1 IEEE 2005 pp 878ndash885

[54] M Andriluka S Roth and B Schiele ldquoPeople-tracking-by-detectionand people-detection-by-trackingrdquo in 2008 IEEE Conference on com-puter vision and pattern recognition IEEE 2008 pp 1ndash8

[55] A Yilmaz O Javed and M Shah ldquoObject tracking A surveyrdquo Acmcomputing surveys (CSUR) vol 38 no 4 pp 13ndashes 2006

[56] C Schuldt I Laptev and B Caputo ldquoRecognizing human actions alocal svm approachrdquo in Proceedings of the 17th International Confer-ence on Pattern Recognition 2004 ICPR 2004 vol 3 IEEE 2004pp 32ndash36

[57] D Weinland R Ronfard and E Boyer ldquoFree viewpoint action recog-nition using motion history volumesrdquo Computer vision and imageunderstanding vol 104 no 2-3 pp 249ndash257 2006

[58] M Blank L Gorelick E Shechtman M Irani and R Basri ldquoActionsas space-time shapesrdquo in Tenth IEEE International Conference onComputer Vision (ICCVrsquo05) Volume 1 vol 2 IEEE 2005 pp 1395ndash1402

[59] O Parkhi A Vedaldi A Zisserman and C Jawahar ldquoThe oxford-iiitpet datasetrdquo 2012

[60] A Kuznetsova H Rom N Alldrin J Uijlings I Krasin J Pont-TusetS Kamali S Popov M Malloci A Kolesnikov et al ldquoThe open imagesdataset v4rdquo International Journal of Computer Vision pp 1ndash26 2020

[61] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic segmentationrdquoin Proceedings of the IEEE conference on computer vision and patternrecognition 2014 pp 580ndash587

[62] R Girshick ldquoFast r-cnnrdquo in Proceedings of the IEEE internationalconference on computer vision 2015 pp 1440ndash1448

[63] W Liu D Anguelov D Erhan C Szegedy S Reed C-Y Fu and A CBerg ldquoSsd Single shot multibox detectorrdquo in European conference oncomputer vision Springer 2016 pp 21ndash37

[64] J Redmon and A Farhadi ldquoYolo9000 better faster strongerrdquo inProceedings of the IEEE conference on computer vision and patternrecognition 2017 pp 7263ndash7271

[65] J R A Farhadi and J Redmon ldquoYolov3 An incremental improvementrdquoRetrieved September vol 17 p 2018 2018

[66] M Everingham L Van Gool C K Williams J Winn and A Zisser-man ldquoThe pascal visual object classes (voc) challengerdquo Internationaljournal of computer vision vol 88 no 2 pp 303ndash338 2010

[67] T-Y Lin M Maire S Belongie J Hays P Perona D RamananP Dollar and C L Zitnick ldquoMicrosoft coco Common objects incontextrdquo in European conference on computer vision Springer 2014pp 740ndash755

[68] K Simonyan and A Zisserman ldquoVery deep convolutional networks forlarge-scale image recognitionrdquo arXiv preprint arXiv14091556 2014

[69] K He X Zhang S Ren and J Sun ldquoDeep residual learning for imagerecognitionrdquo in Proceedings of the IEEE conference on computer visionand pattern recognition 2016 pp 770ndash778

[70] C Szegedy V Vanhoucke S Ioffe J Shlens and Z Wojna ldquoRethinkingthe inception architecture for computer visionrdquo in Proceedings of theIEEE conference on computer vision and pattern recognition 2016 pp2818ndash2826

[71] O Russakovsky J Deng H Su J Krause S Satheesh S MaZ Huang A Karpathy A Khosla M Bernstein et al ldquoImagenet largescale visual recognition challengerdquo International journal of computervision vol 115 no 3 pp 211ndash252 2015

[72] C Szegedy S Ioffe V Vanhoucke and A A Alemi ldquoInception-v4inception-resnet and the impact of residual connections on learningrdquo inThirty-first AAAI conference on artificial intelligence 2017

[73] Google ldquoOpen image dataset v6rdquo httpsstoragegoogleapiscomopenimageswebindexhtml 2020 [Online accessed 25-February-2020]

[74] N S Punn and S Agarwal ldquoInception u-net architecture for semanticsegmentation to identify nuclei in microscopy cell imagesrdquo ACM Trans-actions on Multimedia Computing Communications and Applications(TOMM) vol 16 no 1 pp 1ndash15 2020

[75] A Vaswani S Bengio E Brevdo F Chollet A N Gomez S GouwsL Jones Ł Kaiser N Kalchbrenner N Parmar et al ldquoTensor2tensorfor neural machine translationrdquo arXiv preprint arXiv180307416 2018

[76] D Amodei S Ananthanarayanan R Anubhai J Bai E BattenbergC Case J Casper B Catanzaro Q Cheng G Chen et al ldquoDeepspeech 2 End-to-end speech recognition in english and mandarinrdquo inInternational conference on machine learning 2016 pp 173ndash182

[77] A Sonawane ldquoMedium YOLOv3 A Huge Improvementrdquo httpsmediumcomanand sonawaneyolo3 2020 [Online accessed Dec6 2019]

[78] N Wojke A Bewley and D Paulus ldquoSimple online and realtimetracking with a deep association metricrdquo in 2017 IEEE internationalconference on image processing (ICIP) IEEE 2017 pp 3645ndash3649

[79] N Wojke and A Bewley ldquoDeep cosine metric learning for personre-identificationrdquo in 2018 IEEE winter conference on applications ofcomputer vision (WACV) IEEE 2018 pp 748ndash756

[80] S Pouyanfar S Sadiq Y Yan H Tian Y Tao M P Reyes M-L ShyuS-C Chen and S Iyengar ldquoA survey on deep learning Algorithmstechniques and applicationsrdquo ACM Computing Surveys (CSUR) vol 51no 5 pp 1ndash36 2018

[81] A Brunetti D Buongiorno G F Trotta and V Bevilacqua ldquoCom-puter vision and deep learning techniques for pedestrian detection andtracking A surveyrdquo Neurocomputing vol 300 pp 17ndash33 2018

[82] N S Punn and S Agarwal ldquoCrowd analysis for congestion controlearly warning system on foot over bridgerdquo in 2019 Twelfth InternationalConference on Contemporary Computing (IC3) IEEE 2019 pp 1ndash6

[83] Pias ldquoObject detection and distance measurementrdquo httpsgithubcompaul-piasObject-Detection-and-Distance-Measurement 2020 [Onlineaccessed 01-March-2020]

[84] J Harvey Adam LaPlace (2019) Megapixels Origins ethics andprivacy implications of publicly available face recognition imagedatasets [Online] Available httpsmegapixelscc

  • I Introduction
  • II Background study and related work
  • III Object detection and tracking models
    • III-A Anchor boxes
      • III-A1 Loss Function
        • III-B Faster RCNN
          • III-B1 Loss function
            • III-C Single Shot Detector (SSD)
              • III-C1 Loss function
                • III-D YOLO
                  • III-D1 Loss function
                    • III-E Deepsort
                      • IV Proposed approach
                        • IV-A Workflow
                          • V Experiments and results
                          • VI Output
                          • VII Future scope and challenges
                          • VIII Conclusion
                          • References

9

help of bounding boxes The generated bounding boxes aidin identifying the clusters or groups of people satisfyingthe closeness property computed with the help of pairwisevectorized approach The number of violations are confirmedby computing the number of groups formed and violationindex term computed as the ratio of the number of peopleto the number of groups The extensive trials were conductedwith popular state-of-the-art object detection models FasterRCNN SSD and YOLO v3 where YOLO v3 illustratedthe efficient performance with balanced FPS and mAP scoreSince this approach is highly sensitive to the spatial locationof the camera the same approach can be fine tuned to betteradjust with the corresponding field of view

ACKNOWLEDGMENT

The authors gratefully acknowledge the helpful commentsand suggestions of colleagues Authors are also indebted toInterdisciplinary Cyber Physical Systems (ICPS) ProgrammeDepartment of Science and Technology (DST) Governmentof India (GoI) vide Reference No244 for their financialsupport to carry out the background research which helpedsignificantly for the implementation of present research work

REFERENCES

[1] W H Organization ldquoWHO corona-viruses (COVID-19)rdquo httpswwwwhointemergenciesdiseasesnovel-corona-virus-2019 2020 [Onlineaccessed May 02 2020]

[2] WHO ldquoWho director-generalrsquos opening remarks at the media briefingon covid-19-11 march 2020rdquo httpswwwwhointdgspeechesdetail2020 [Online accessed March 12 2020]

[3] L Hensley ldquoSocial distancing is out physical distancing is inmdashherersquoshow to do itrdquo Global NewsndashCanada (27 March 2020) 2020

[4] ECDPC ldquoConsiderations relating to social distancing measures inresponse to COVID-19 ndash second updaterdquo httpswwwecdceuropaeuenpublications-dataconsiderations 2020 [Online accessed March 232020]

[5] M W Fong H Gao J Y Wong J Xiao E Y Shiu S Ryu andB J Cowling ldquoNonpharmaceutical measures for pandemic influenza innonhealthcare settingsmdashsocial distancing measuresrdquo 2020

[6] F Ahmed N Zviedrite and A Uzicanin ldquoEffectiveness of workplacesocial distancing measures in reducing influenza transmission a system-atic reviewrdquo BMC public health vol 18 no 1 p 518 2018

[7] W O Kermack and A G McKendrick ldquoContributions to the mathe-matical theory of epidemicsndashi 1927rdquo 1991

[8] C Eksin K Paarporn and J S Weitz ldquoSystematic biases in diseaseforecastingndashthe role of behavior changerdquo Epidemics vol 27 pp 96ndash105 2019

[9] M Zhao and H Zhao ldquoAsymptotic behavior of global positive solutionto a stochastic sir model incorporating media coveragerdquo Advances inDifference Equations vol 2016 no 1 pp 1ndash17 2016

[10] P Alto ldquoLanding AI Named an April 2020 Cool Vendor in the Gart-ner Cool Vendors in AI Core Technologiesrdquo httpswwwyahoocomlifestylelanding-ai-named-april-2020-152100532html 2020 [Onlineaccessed April 21 2020]

[11] A Y Ng ldquoCurriculum Vitaerdquo httpsaistanfordedusimangcurriculum-vitaepdf

[12] L AI ldquoLanding AI Named an April 2020 Cool Vendor in the GartnerCool Vendors in AI Core Technologiesrdquo httpswwwprnewswirecomnews-releases 2020 [Online accessed April 22 2020]

[13] B News ldquoChina coronavirus Lockdown measures rise across Hubeiprovincerdquo httpswwwbbccouknewsworld-asia-china512174552020 [Online accessed January 23 2020]

[14] N H C of the Peoplersquos Republic of China ldquoDaily briefing on novelcoronavirus cases in Chinardquo httpennhcgovcn2020-0320c 78006htm 2020 [Online accessed March 20 2020]

[15] K Prem Y Liu T W Russell A J Kucharski R M Eggo N DaviesS Flasche S Clifford C A Pearson J D Munday et al ldquoThe effectof control strategies to reduce social mixing on outcomes of the covid-19 epidemic in wuhan china a modelling studyrdquo The Lancet PublicHealth 2020

[16] C Adolph K Amano B Bang-Jensen N Fullman and J WilkersonldquoPandemic politics Timing state-level social distancing responses tocovid-19rdquo medRxiv 2020

[17] K E Ainslie C E Walters H Fu S Bhatia H Wang X XiM Baguelin S Bhatt A Boonyasiri O Boyd et al ldquoEvidence ofinitial success for china exiting covid-19 social distancing policy afterachieving containmentrdquo Wellcome Open Research vol 5 no 81 p 812020

[18] S K Sonbhadra S Agarwal and P Nagabhushan ldquoTarget specificmining of covid-19 scholarly articles using one-class approachrdquo 2020

[19] N S Punn and S Agarwal ldquoAutomated diagnosis of covid-19 withlimited posteroanterior chest x-ray images using fine-tuned deep neuralnetworksrdquo 2020

[20] N S Punn S K Sonbhadra and S Agarwal ldquoCovid-19 epidemicanalysis using machine learning and deep learning algorithmsrdquomedRxiv 2020 [Online] Available httpswwwmedrxivorgcontentearly202004112020040820057679

[21] O website of Indian Government ldquoDistribution of the novelcoronavirus-infected pneumoni Aarogya Setu Mobile Apprdquo httpswwwmygovinaarogya-setu-app 2020

[22] M Robakowska A Tyranska-Fobke J Nowak D Slezak P ZuratynskiP Robakowski K Nadolny and J R Ładny ldquoThe use of drones duringmass eventsrdquo Disaster and Emergency Medicine Journal vol 2 no 3pp 129ndash134 2017

[23] J Harvey Adam LaPlace (2019) Megapixelscc Origins ethicsand privacy implications of publicly available face recognition imagedatasets [Online] Available httpsmegapixelscc

[24] N Sulman T Sanocki D Goldgof and R Kasturi ldquoHow effectiveis human video surveillance performancerdquo in 2008 19th InternationalConference on Pattern Recognition IEEE 2008 pp 1ndash3

[25] X Wang ldquoIntelligent multi-camera video surveillance A reviewrdquo Pat-tern recognition letters vol 34 no 1 pp 3ndash19 2013

[26] K A Joshi and D G Thakore ldquoA survey on moving object detectionand tracking in video surveillance systemrdquo International Journal of SoftComputing and Engineering vol 2 no 3 pp 44ndash48 2012

[27] O Javed and M Shah ldquoTracking and object classification for automatedsurveillancerdquo in European Conference on Computer Vision Springer2002 pp 343ndash357

[28] S Brutzer B Hoferlin and G Heidemann ldquoEvaluation of backgroundsubtraction techniques for video surveillancerdquo in CVPR 2011 IEEE2011 pp 1937ndash1944

[29] S Aslani and H Mahdavi-Nasab ldquoOptical flow based moving objectdetection and tracking for traffic surveillancerdquo International Journal ofElectrical Computer Energetic Electronic and Communication Engi-neering vol 7 no 9 pp 1252ndash1256 2013

[30] P Dollar V Rabaud G Cottrell and S Belongie ldquoBehavior recognitionvia sparse spatio-temporal featuresrdquo in 2005 IEEE International Work-shop on Visual Surveillance and Performance Evaluation of Trackingand Surveillance IEEE 2005 pp 65ndash72

[31] M Piccardi ldquoBackground subtraction techniques a reviewrdquo in 2004IEEE International Conference on Systems Man and Cybernetics (IEEECat No 04CH37583) vol 4 IEEE 2004 pp 3099ndash3104

[32] Y Xu J Dong B Zhang and D Xu ldquoBackground modeling methodsin video analysis A review and comparative evaluationrdquo CAAI Trans-actions on Intelligence Technology vol 1 no 1 pp 43ndash60 2016

[33] H Tsutsui J Miura and Y Shirai ldquoOptical flow-based person trackingby multiple camerasrdquo in Conference Documentation International Con-ference on Multisensor Fusion and Integration for Intelligent SystemsMFI 2001 (Cat No 01TH8590) IEEE 2001 pp 91ndash96

[34] A Agarwal S Gupta and D K Singh ldquoReview of optical flowtechnique for moving object detectionrdquo in 2016 2nd InternationalConference on Contemporary Computing and Informatics (IC3I) IEEE2016 pp 409ndash413

[35] S A Niyogi and E H Adelson ldquoAnalyzing gait with spatiotemporalsurfacesrdquo in Proceedings of 1994 IEEE Workshop on Motion of Non-rigid and Articulated Objects IEEE 1994 pp 64ndash69

[36] Z-Q Zhao P Zheng S-t Xu and X Wu ldquoObject detection with deeplearning A reviewrdquo IEEE transactions on neural networks and learningsystems vol 30 no 11 pp 3212ndash3232 2019

[37] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classificationwith deep convolutional neural networksrdquo in Advances in neural infor-mation processing systems 2012 pp 1097ndash1105

10

[38] S Ren K He R Girshick and J Sun ldquoFaster r-cnn Towards real-timeobject detection with region proposal networksrdquo in Advances in neuralinformation processing systems 2015 pp 91ndash99

[39] X Chen and A Gupta ldquoAn implementation of faster rcnn with studyfor region samplingrdquo arXiv preprint arXiv170202138 2017

[40] J Redmon S Divvala R Girshick and A Farhadi ldquoYou only lookonce Unified real-time object detectionrdquo in Proceedings of the IEEEconference on computer vision and pattern recognition 2016 pp 779ndash788

[41] M Putra Z Yussof K Lim and S Salim ldquoConvolutional neuralnetwork for person and car detection using yolo frameworkrdquo Journalof Telecommunication Electronic and Computer Engineering (JTEC)vol 10 no 1-7 pp 67ndash71 2018

[42] R Eshel and Y Moses ldquoHomography based multiple camera detectionand tracking of people in a dense crowdrdquo in 2008 IEEE Conference onComputer Vision and Pattern Recognition IEEE 2008 pp 1ndash8

[43] D-Y Chen C-W Su Y-C Zeng S-W Sun W-R Lai and H-Y M Liao ldquoAn online people counting system for electronic advertisingmachinesrdquo in 2009 IEEE International Conference on Multimedia andExpo IEEE 2009 pp 1262ndash1265

[44] C-W Su H-Y M Liao and H-R Tyan ldquoA vision-based peoplecounting approach based on the symmetry measurerdquo in 2009 IEEEInternational Symposium on Circuits and Systems IEEE 2009 pp2617ndash2620

[45] J Yao and J-M Odobez ldquoFast human detection from joint appearanceand foreground feature subset covariancesrdquo Computer Vision and ImageUnderstanding vol 115 no 10 pp 1414ndash1426 2011

[46] B Wu and R Nevatia ldquoDetection and tracking of multiple partiallyoccluded humans by bayesian combination of edgelet based part de-tectorsrdquo International Journal of Computer Vision vol 75 no 2 pp247ndash266 2007

[47] F Z Eishita A Rahman S A Azad and A Rahman ldquoOcclusionhandling in object detectionrdquo in Multidisciplinary Computational Intelli-gence Techniques Applications in Business Engineering and MedicineIGI Global 2012 pp 61ndash74

[48] M Singh A Basu and M K Mandal ldquoHuman activity recognitionbased on silhouette directionalityrdquo IEEE transactions on circuits andsystems for video technology vol 18 no 9 pp 1280ndash1292 2008

[49] N Dalal and B Triggs ldquoHistograms of oriented gradients for humandetectionrdquo in 2005 IEEE computer society conference on computervision and pattern recognition (CVPRrsquo05) vol 1 IEEE 2005 pp886ndash893

[50] P Huang A Hilton and J Starck ldquoShape similarity for 3d videosequences of peoplerdquo International Journal of Computer Vision vol 89no 2-3 pp 362ndash381 2010

[51] A Samal and P A Iyengar ldquoAutomatic recognition and analysis ofhuman faces and facial expressions A surveyrdquo Pattern recognitionvol 25 no 1 pp 65ndash77 1992

[52] D Cunado M S Nixon and J N Carter ldquoUsing gait as a biometricvia phase-weighted magnitude spectrardquo in International Conference onAudio-and Video-Based Biometric Person Authentication Springer1997 pp 93ndash102

[53] B Leibe E Seemann and B Schiele ldquoPedestrian detection in crowdedscenesrdquo in 2005 IEEE Computer Society Conference on Computer Visionand Pattern Recognition (CVPRrsquo05) vol 1 IEEE 2005 pp 878ndash885

[54] M Andriluka S Roth and B Schiele ldquoPeople-tracking-by-detectionand people-detection-by-trackingrdquo in 2008 IEEE Conference on com-puter vision and pattern recognition IEEE 2008 pp 1ndash8

[55] A Yilmaz O Javed and M Shah ldquoObject tracking A surveyrdquo Acmcomputing surveys (CSUR) vol 38 no 4 pp 13ndashes 2006

[56] C Schuldt I Laptev and B Caputo ldquoRecognizing human actions alocal svm approachrdquo in Proceedings of the 17th International Confer-ence on Pattern Recognition 2004 ICPR 2004 vol 3 IEEE 2004pp 32ndash36

[57] D Weinland R Ronfard and E Boyer ldquoFree viewpoint action recog-nition using motion history volumesrdquo Computer vision and imageunderstanding vol 104 no 2-3 pp 249ndash257 2006

[58] M Blank L Gorelick E Shechtman M Irani and R Basri ldquoActionsas space-time shapesrdquo in Tenth IEEE International Conference onComputer Vision (ICCVrsquo05) Volume 1 vol 2 IEEE 2005 pp 1395ndash1402

[59] O Parkhi A Vedaldi A Zisserman and C Jawahar ldquoThe oxford-iiitpet datasetrdquo 2012

[60] A Kuznetsova H Rom N Alldrin J Uijlings I Krasin J Pont-TusetS Kamali S Popov M Malloci A Kolesnikov et al ldquoThe open imagesdataset v4rdquo International Journal of Computer Vision pp 1ndash26 2020

[61] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic segmentationrdquoin Proceedings of the IEEE conference on computer vision and patternrecognition 2014 pp 580ndash587

[62] R Girshick ldquoFast r-cnnrdquo in Proceedings of the IEEE internationalconference on computer vision 2015 pp 1440ndash1448

[63] W Liu D Anguelov D Erhan C Szegedy S Reed C-Y Fu and A CBerg ldquoSsd Single shot multibox detectorrdquo in European conference oncomputer vision Springer 2016 pp 21ndash37

[64] J Redmon and A Farhadi ldquoYolo9000 better faster strongerrdquo inProceedings of the IEEE conference on computer vision and patternrecognition 2017 pp 7263ndash7271

[65] J R A Farhadi and J Redmon ldquoYolov3 An incremental improvementrdquoRetrieved September vol 17 p 2018 2018

[66] M Everingham L Van Gool C K Williams J Winn and A Zisser-man ldquoThe pascal visual object classes (voc) challengerdquo Internationaljournal of computer vision vol 88 no 2 pp 303ndash338 2010

[67] T-Y Lin M Maire S Belongie J Hays P Perona D RamananP Dollar and C L Zitnick ldquoMicrosoft coco Common objects incontextrdquo in European conference on computer vision Springer 2014pp 740ndash755

[68] K Simonyan and A Zisserman ldquoVery deep convolutional networks forlarge-scale image recognitionrdquo arXiv preprint arXiv14091556 2014

[69] K He X Zhang S Ren and J Sun ldquoDeep residual learning for imagerecognitionrdquo in Proceedings of the IEEE conference on computer visionand pattern recognition 2016 pp 770ndash778

[70] C Szegedy V Vanhoucke S Ioffe J Shlens and Z Wojna ldquoRethinkingthe inception architecture for computer visionrdquo in Proceedings of theIEEE conference on computer vision and pattern recognition 2016 pp2818ndash2826

[71] O Russakovsky J Deng H Su J Krause S Satheesh S MaZ Huang A Karpathy A Khosla M Bernstein et al ldquoImagenet largescale visual recognition challengerdquo International journal of computervision vol 115 no 3 pp 211ndash252 2015

[72] C Szegedy S Ioffe V Vanhoucke and A A Alemi ldquoInception-v4inception-resnet and the impact of residual connections on learningrdquo inThirty-first AAAI conference on artificial intelligence 2017

[73] Google ldquoOpen image dataset v6rdquo httpsstoragegoogleapiscomopenimageswebindexhtml 2020 [Online accessed 25-February-2020]

[74] N S Punn and S Agarwal ldquoInception u-net architecture for semanticsegmentation to identify nuclei in microscopy cell imagesrdquo ACM Trans-actions on Multimedia Computing Communications and Applications(TOMM) vol 16 no 1 pp 1ndash15 2020

[75] A Vaswani S Bengio E Brevdo F Chollet A N Gomez S GouwsL Jones Ł Kaiser N Kalchbrenner N Parmar et al ldquoTensor2tensorfor neural machine translationrdquo arXiv preprint arXiv180307416 2018

[76] D Amodei S Ananthanarayanan R Anubhai J Bai E BattenbergC Case J Casper B Catanzaro Q Cheng G Chen et al ldquoDeepspeech 2 End-to-end speech recognition in english and mandarinrdquo inInternational conference on machine learning 2016 pp 173ndash182

[77] A Sonawane ldquoMedium YOLOv3 A Huge Improvementrdquo httpsmediumcomanand sonawaneyolo3 2020 [Online accessed Dec6 2019]

[78] N Wojke A Bewley and D Paulus ldquoSimple online and realtimetracking with a deep association metricrdquo in 2017 IEEE internationalconference on image processing (ICIP) IEEE 2017 pp 3645ndash3649

[79] N Wojke and A Bewley ldquoDeep cosine metric learning for personre-identificationrdquo in 2018 IEEE winter conference on applications ofcomputer vision (WACV) IEEE 2018 pp 748ndash756

[80] S Pouyanfar S Sadiq Y Yan H Tian Y Tao M P Reyes M-L ShyuS-C Chen and S Iyengar ldquoA survey on deep learning Algorithmstechniques and applicationsrdquo ACM Computing Surveys (CSUR) vol 51no 5 pp 1ndash36 2018

[81] A Brunetti D Buongiorno G F Trotta and V Bevilacqua ldquoCom-puter vision and deep learning techniques for pedestrian detection andtracking A surveyrdquo Neurocomputing vol 300 pp 17ndash33 2018

[82] N S Punn and S Agarwal ldquoCrowd analysis for congestion controlearly warning system on foot over bridgerdquo in 2019 Twelfth InternationalConference on Contemporary Computing (IC3) IEEE 2019 pp 1ndash6

[83] Pias ldquoObject detection and distance measurementrdquo httpsgithubcompaul-piasObject-Detection-and-Distance-Measurement 2020 [Onlineaccessed 01-March-2020]

[84] J Harvey Adam LaPlace (2019) Megapixels Origins ethics andprivacy implications of publicly available face recognition imagedatasets [Online] Available httpsmegapixelscc

  • I Introduction
  • II Background study and related work
  • III Object detection and tracking models
    • III-A Anchor boxes
      • III-A1 Loss Function
        • III-B Faster RCNN
          • III-B1 Loss function
            • III-C Single Shot Detector (SSD)
              • III-C1 Loss function
                • III-D YOLO
                  • III-D1 Loss function
                    • III-E Deepsort
                      • IV Proposed approach
                        • IV-A Workflow
                          • V Experiments and results
                          • VI Output
                          • VII Future scope and challenges
                          • VIII Conclusion
                          • References

10
