IJSDR · Web viewIEEE Transactions on Industrial Electronics, 65 (6), 5060-5068 (2017) 6. Sindagi, V.A., Patel, V.M.: A survey of recent advances in cnn-based single image crowd counting

Perspective Crowd Counting: Compilation of Recent Practices and Studies

G. Roshini1 *

1 School of Computing, SASTRA Deemed to be University, Thanjavur 613401, India

[email protected]

Abstract. Crowd counting aims to estimate the total number of people in a crowded scene, typically together with a density map. It has a wide range of applications and is closely related to tasks such as object detection and image segmentation. Earlier methods handled only slight changes in density and scale, whereas current methods target large variations in both. In recent years, most progress in crowd counting has come from neural networks. This paper reviews recent crowd counting methods that use deep learning and show significant improvement over traditional techniques based on hand-crafted representations. Details of standard benchmark datasets are presented, and promising research directions in this emerging field are discussed.

Keywords: Crowd Counting, Density Maps, Deep learning

1 Introduction

Crowd counting is an active topic in computer vision and is widely used in urban planning, behavior understanding, and video surveillance [1-5]. A variety of CNN-based methods improve on traditional methods for estimating the number of people. The task is important in the surveillance (security) domain and carries over to other tasks such as vehicle counting, cell counting under the microscope, and animal crowd estimation for ecological surveys and environmental investigation. It is extensively used to estimate crowds at sports events, political rallies, and high-traffic areas. A major problem is that background regions can look similar to the crowd, which leads to miscounts in the estimated density. In addition, the perspective transformation of a crowded scene cannot be completely encoded. These problems lead to low-quality density maps. Advances in deep learning have given rise to CNN-based models that encode crowd density maps more effectively.

Heavy clutter, high appearance similarity, and the complex perspective transformation of scenes viewed from overhead degrade the performance of the models. For the first problem, numerous methods attempt to detect every head in overcrowded scenes. However, heads in such scenes are too small to be identified completely and exactly. For the second problem, the whole image is processed using hand-crafted features to exploit global context. Yet these methods do not completely encode perspective changes in complex scenes, and even in common scenes the real-world density distribution does not always follow the perspective.

On the other hand, in 2-D models the density of people is expressed over the image area that corresponds to the scene. Density measured over that area can indicate how populated a place is, but this does not always hold for crowded scenes. Intuitively, within a single model, both the apparent density of people and the direction in which it changes are driven by changes in perspective.

To estimate the number of people in a crowded area, input images are first fed to the network, as shown in Fig. 1. Features are then extracted, from which a density map of the scene is generated, as presented in Fig. 2. In the learning step the model is trained, and finally the predicted counts are compared with the ground truth provided for the test data.


Fig. 1. Steps for creating density map

Fig. 2. (i) Input crowd image (ii) Density map of the corresponding image
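As an illustration of the density map step, the following minimal sketch builds a ground-truth density map from head annotations. It assumes a fixed-width Gaussian kernel normalized so that every head contributes one unit of mass, which is a common convention but is not specified in this paper; the function name, the `(row, col)` input format, and the `sigma` and `radius` values are all hypothetical.

```python
import math

def density_map(points, height, width, sigma=2.0, radius=6):
    """Place a normalized Gaussian at each annotated head position.

    points: list of (row, col) head annotations inside the image
            (hypothetical input format).
    Each head contributes a total mass of 1, so the map sums to the count.
    """
    dmap = [[0.0] * width for _ in range(height)]
    for (r0, c0) in points:
        # Collect kernel weights only for pixels that fall inside the image.
        weights = {}
        for r in range(max(0, r0 - radius), min(height, r0 + radius + 1)):
            for c in range(max(0, c0 - radius), min(width, c0 + radius + 1)):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                weights[(r, c)] = math.exp(-d2 / (2.0 * sigma ** 2))
        total = sum(weights.values())
        # Normalize so that truncation at the image border loses no mass.
        for (r, c), w in weights.items():
            dmap[r][c] += w / total
    return dmap

heads = [(5, 5), (10, 20), (0, 0)]          # toy annotations
dmap = density_map(heads, height=16, width=32)
estimated_count = sum(sum(row) for row in dmap)
```

Summing the map recovers the number of annotated heads, which is the property that lets a network trained to regress such maps be used as a counter.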

1.1 Related previous reviews

As shown in Table 1, Sindagi and Patel [6] reviewed CNN-based single-image crowd counting and density estimation and showed the improvement of CNN-based approaches over earlier ones. Ryan et al. [7] evaluated various counting approaches, features, and regression models under a cross-validation protocol. Saleh et al. [8] contributed a survey of crowd density estimation and counting for visual surveillance and listed novel counting methods from computer vision research. Li et al. [9] surveyed crowded scene analysis in automated video surveillance. Kang et al. [10] compared density maps across crowd analysis tasks that rely on density estimation. Zitouni et al. [11] gathered statistical evidence from the current literature to identify key issues and offered suggestions for addressing them. Gao et al. [12] surveyed over 220 works on CNN-based density map estimation and crowd counting.

Table 1. List of previous surveys and reviews

Title | Description
Sindagi and Patel [6] | Establishes the improvement of CNN-based methods over previous approaches that depend on hand-crafted representations
Ryan et al. [7] | Evaluates crowd counting algorithms in the holistic, local, and histogram-based categories
Saleh et al. [8] | Aims to fill the gap between density estimation and people counting
Li et al. [9] | Covers crowded scene analysis in automated video surveillance
Kang et al. [10] | Analyzes density maps through the tasks of counting, detection, and tracking by density estimation
Zitouni et al. [11] | Reviews advances and trends in crowd modelling research from various theoretical and practical standpoints
Gao et al. [12] | Gives a complete and methodical overview from various aspects, with attribute-based performance analysis and open questions

2 Related Work

2.1 Traditional Techniques

The traditional approaches comprise detection-based, regression-based, and density estimation-based techniques.

2.1.1 Detection Based Techniques

In detection-based techniques, a sliding-window detector is run over the image to find people, and the detections are counted to obtain the crowd size. The approach depends on well-trained classifiers built on low-level features. It works well for detecting faces but fails on crowded images, where individual objects are not clearly visible [13-14]. In AGRD [27], a head detection network is designed to detect heads, and a low-cost method is introduced to generate bounding boxes as ground truth. The enhanced detection network detects multi-scale heads in relatively sparse areas. LSC-CNN [26] introduces a detection framework that removes the need for a density regression model in dense areas.

2.1.2 Regression Based Techniques

Detection-based approaches cannot extract reliable low-level features for individual people in crowded scenes. To overcome this issue, regression-based approaches crop patches from the image, extract low-level features from each patch, and map these features to a crowd count [15-16]. In AGRD [27], density maps are predicted by regression in highly dense areas.

2.1.3 Density Estimation Based Techniques

Density estimation methods create a density map of the objects in an image. The algorithm learns a mapping between the extracted features and the object density maps. Lempitsky and Zisserman [17] in 2010 proposed capturing spatial information by learning a linear mapping between local features and the corresponding density maps. The number of objects in any region can then be obtained by integrating the density map over that region. Pham et al. [18] in 2015 proposed learning a nonlinear mapping between the local appearance of image patches and the density maps. Xu and Qiu [19] in 2016, inspired by the high-dimensional features used in face recognition, proposed an approach that achieves better performance by using a much larger feature set. Random forest regression is adopted as the regression model to learn the nonlinear mapping, because ridge regression has difficulty dealing with high-dimensional features.
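The idea of a linear mapping from local features to density, followed by integration over a region, can be sketched as follows. This is a toy illustration only: the feature vectors and the weight vector are made-up values, and the learning procedure used in [17] to obtain the weights is not reproduced here.

```python
def pixel_density(features, weights):
    """Linear mapping from a per-pixel feature vector to a density value."""
    return sum(f * w for f, w in zip(features, weights))

def region_count(feature_map, weights, top, left, bottom, right):
    """Integrate (sum) the predicted density over rows [top, bottom)
    and columns [left, right)."""
    return sum(
        pixel_density(feature_map[r][c], weights)
        for r in range(top, bottom)
        for c in range(left, right)
    )

# Toy 2x3 image with two features per pixel and hypothetical learned weights.
feature_map = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]],
]
weights = [0.5, 0.25]

whole_image = region_count(feature_map, weights, 0, 0, 2, 3)
left_column = region_count(feature_map, weights, 0, 0, 2, 1)
```

Because the count is a sum over pixels, the same map answers queries for the whole image or for any sub-region without retraining.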

3 Taxonomy of Crowd Counting

In the taxonomy of crowd counting, CNN-Based Methods are classified based upon the Network Architecture and Training Process.

Fig. 3. Taxonomy of crowd counting

3.1 CNN Based Methods

Some CNN-based approaches build an end-to-end regression model instead of working on image patches. Such a model takes the entire image as input and directly outputs the number of heads in the crowd. CNN-based approaches also outperform traditional regression and classification approaches in generating density maps. Rapid advances in deep learning have produced numerous CNN-based approaches that improve greatly on traditional methods in recent years [20-23]. The network architectures are classified as Basic CNN, Single layer, Multi-layer/scale-aware, and Multi-task, as shown in Table 2.

3.1.1 Basic CNN

The basic CNN is the earliest deep learning technique for density estimation and counting. Such methods consist of convolutional, pooling, and fully connected layers. DUBNet [24], where “DUB” stands for decomposed uncertainty using a bootstrap ensemble, introduces a scalable neural network scheme that decomposes the predictive uncertainty by applying a bootstrap ensemble, and its uncertainty estimates are reported to be accurate. It separates the uncertainty attributable to the model from the uncertainty attributable to the inputs, which matter to different degrees in image analysis. The uncertainty quantification framework is general and can be applied flexibly to other architectures. The network consists of a front end of convolutional layers for feature extraction, a back end of dilated convolutions whose kernels provide larger receptive fields in place of pooling operations, and a bootstrap ensemble as the output layer. Epistemic uncertainty captures what the model has not learned from the data, and aleatoric uncertainty captures the inherent noise in the image.
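The split into epistemic and aleatoric uncertainty can be illustrated with a small sketch. It assumes that each ensemble head outputs a predicted count and a predicted noise variance, and it uses a widely used decomposition (spread of the head means versus average predicted variance); whether this matches the exact formulation in [24] is an assumption, and all names and numbers are hypothetical.

```python
def decompose_uncertainty(head_means, head_variances):
    """Combine K ensemble heads into a prediction and two uncertainty terms.

    head_means:     count predicted by each head
    head_variances: noise variance predicted by each head
    Returns (mean prediction, epistemic variance, aleatoric variance).
    """
    k = len(head_means)
    mean = sum(head_means) / k
    # Epistemic: disagreement between heads (model uncertainty).
    epistemic = sum((m - mean) ** 2 for m in head_means) / k
    # Aleatoric: average noise level the heads attribute to the input.
    aleatoric = sum(head_variances) / k
    return mean, epistemic, aleatoric

# Toy outputs of a four-head ensemble for one image.
head_means = [98.0, 102.0, 100.0, 104.0]
head_variances = [4.0, 6.0, 5.0, 5.0]
mean, epistemic, aleatoric = decompose_uncertainty(head_means, head_variances)
total_variance = epistemic + aleatoric
```

In this reading, more training data should shrink the epistemic term, whereas the aleatoric term stays tied to image quality and occlusion.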

3.1.2 Single Layer

A single-layer (single-column) CNN is used instead of the enlarged structure of a multi-layer architecture to reduce network complexity. The Hierarchical Scale Recalibration Network (HSRNet) [25] is a single-column but multi-scale network that exploits global contextual information and aggregates multi-scale information. It consists of two main parts: a Scale Focus Module (SFM) and a Scale Recalibration Module (SRM). The SFM models global contextual dependencies along the channel and spatial dimensions, which yields more descriptive feature representations. The SRM recalibrates the feature responses produced by the SFM to make multi-scale predictions and applies a scale-specific fusion approach to combine the scale-associated outputs into the final density map. A Scale Consistency loss is introduced so that the scale-associated outputs approach the corresponding multi-scale ground-truth density maps. With these modules integrated, the network copes with scale variation and produces more accurate density maps in highly congested scenes.

3.1.3 Multi-Layer/ Scale-Aware Models

Multi-layer networks use different layers to capture multi-scale information corresponding to different receptive fields. They have achieved strong results in crowd counting and make the basic CNN approach more robust to changes in scale. LSC-CNN [26] uses a multi-layer architecture that extracts features at different resolutions with a top-down feature extractor and produces refined predictions from these multiple resolutions. Non-Maximum Suppression (NMS) then selects the accurate detections to form the final output. For training, LSC-CNN uses a winner-take-all (WTA) model, and in testing, the GWTA module follows the prediction fusion operation to generate predictions at different resolutions.
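The NMS step mentioned above can be sketched in its standard greedy form. This is a generic illustration rather than the exact procedure of LSC-CNN [26]; the box format `(x1, y1, x2, y2)`, the threshold of 0.5, and the toy detections are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_maximum_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two overlapping head detections and one separate detection.
boxes = [(10, 10, 20, 20), (11, 11, 21, 21), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7]
kept = non_maximum_suppression(boxes, scores)
crowd_count = len(kept)   # in a detection-based counter, count = kept boxes
```

For a detection-based counter the crowd count is simply the number of boxes that survive suppression, so the overlap threshold directly affects the count in dense regions.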

3.1.4 Multi-Task

Multi-task training offers several ways to combine crowd counting and density estimation with additional tasks such as foreground-background segmentation. AGRD [27] observes that regression-based methods overestimate the number of people in sparse areas, whereas detection-based approaches miscount in dense areas. It proposes an attention mechanism that splits the image into dense and sparse regions and fuses the regression and detection methods accordingly. The enhanced detection network detects heads at multiple scales in sparse areas. In dense areas, the method is adapted to obtain accurate head bounding boxes.
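The fusion idea can be sketched in a highly simplified form: a binary mask marks dense pixels, the regression output is trusted inside the mask, and detections are trusted outside it. The hard mask, the data layout, and the function name are assumptions made for illustration and do not reproduce the AGRD architecture.

```python
def fused_count(density, dense_mask, detections):
    """Combine a regressed density map with head detections.

    density:    2-D list of regressed density values
    dense_mask: 2-D list of booleans, True where the region is dense
    detections: list of (row, col) centres of detected heads
    """
    # Dense regions: integrate the regressed density.
    regression_part = sum(
        density[r][c]
        for r in range(len(density))
        for c in range(len(density[0]))
        if dense_mask[r][c]
    )
    # Sparse regions: count the detected heads.
    detection_part = sum(1 for (r, c) in detections if not dense_mask[r][c])
    return regression_part + detection_part

density = [[0.0, 0.0, 1.5, 2.0],
           [0.1, 0.0, 2.5, 1.0]]
dense_mask = [[False, False, True, True],
              [False, False, True, True]]
detections = [(0, 0), (1, 1), (0, 2)]   # the last one lies in a dense region
total = fused_count(density, dense_mask, detections)
```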

3.2 Inference/Training process

The CNN-based methods are classified based on the inference techniques into Patch-based methods and Whole-image based methods as shown in Table 2.

3.2.1 Patch-Based Methods

In patch-based methods, training is performed on patches cropped from the input images, and different methods use different crop sizes. Testing runs a sliding window over the whole test image, obtains an estimate for each window, and combines the estimates into the overall count for the image. DUBNet [24] is a single-network framework: to capture epistemic uncertainty, each training iteration randomly selects one output head, and the heads are combined during the testing phase. Incorporating aleatoric uncertainty is useful because images can come from diverse cameras and scenes. Likewise, occlusion and perspective cause problems even within a single image, and observation noise differs from one image patch to another. LSC-CNN [26] predicts bounding boxes on the heads of people in crowd images. Although first locating and then sizing each person looks like a multi-stage task, the authors develop an end-to-end single-stage process. The AGRD [27] network combines the two methods to increase counting accuracy while also producing bounding boxes.
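Patch-based testing can be sketched as follows. The sketch assumes non-overlapping windows so that no person is counted twice (overlapping windows would require averaging the overlaps), and `toy_patch_counter` is a stand-in for a trained CNN, not a real model.

```python
def iter_patches(image, patch_size):
    """Yield non-overlapping patches; patches at the border may be smaller."""
    height, width = len(image), len(image[0])
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            yield [row[left:left + patch_size]
                   for row in image[top:top + patch_size]]

def count_image(image, patch_size, count_patch):
    """Sum the per-patch estimates to obtain the count for the whole image."""
    return sum(count_patch(p) for p in iter_patches(image, patch_size))

def toy_patch_counter(patch):
    """Hypothetical stand-in for a CNN: counts pixels marked as heads."""
    return sum(sum(row) for row in patch)

# Toy 3x5 "image" in which 1 marks a head.
image = [
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
]
total = count_image(image, patch_size=2, count_patch=toy_patch_counter)
num_patches = len(list(iter_patches(image, 2)))
```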

3.2.2 Whole Image-Based Methods

Whole image-based methods avoid the sliding-window operation: they take the entire image as input and output a density map or the total count. These approaches may occasionally lose local information, but they outperform patch-based approaches, which neglect global information because of the sliding-window process, and they also alleviate the problem of high computational cost. In HSRNet [25], the receptive fields of the sequential convolutional layers grow from shallow to deep, so each layer captures a distinct range of pedestrian scales. Two inferences follow: the deeper the network, the wider the range of scales captured by the corresponding convolutional layers, and the sensitivity to different scales varies across layers. HSRNet is trained in a whole-image manner.

Table 2. Classification of CNN-based methods

Method | Network Property | Inference Manner/Training Process
DUBNet [24] | Basic | Patch-based
HSRNet [25] | Single layer | Whole image-based
LSC-CNN [26] | Scale-aware/Multi-layer | Patch-based
AGRD [27] | Multi-task | Patch-based

4 Datasets

Earlier datasets consist of low-density images, whereas recent ones contain high-density images, which pushes algorithms to handle challenges such as scale variation, occlusion, and poor illumination.

ShanghaiTech: The ShanghaiTech dataset [28] has 1198 annotated images containing 330,165 persons. Part A consists of 482 images randomly gathered from the internet, with highly congested scenes, and Part B consists of 716 images of street views with relatively sparse crowds.

UCF_CC_50: This early and challenging dataset [29] was built from publicly accessible web images and covers diverse densities and various perspective distortions. It contains 50 images, and the standard protocol is 5-fold cross-validation. Owing to the limited amount of data, even advanced CNN-based approaches are far from optimal on it.
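A 5-fold protocol on 50 images can be sketched as below. The sketch assumes contiguous folds of equal size for simplicity; in practice the image order is usually shuffled first, and the function name is hypothetical.

```python
def five_fold_splits(num_images=50, num_folds=5):
    """Yield (train_indices, test_indices) for each fold.

    Assumes num_images is divisible by num_folds.
    """
    indices = list(range(num_images))
    fold_size = num_images // num_folds
    for k in range(num_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        train = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
        yield train, test

splits = list(five_fold_splits())
# Each image is tested exactly once; reported errors are averaged over folds.
tested = sorted(i for _, test in splits for i in test)
```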

UCSD: The UCSD dataset [30] is the first dataset collected with a camera on a sidewalk. It consists of 2000 frames at a resolution of 238×158, with ground-truth annotations of all pedestrians in every fifth frame. For the remaining frames, the labels are obtained by linear interpolation. Because it was collected at a single location, there is no difference in scene perspective across frames.
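Linear interpolation of labels between annotated frames can be sketched as follows, assuming that a pedestrian's position is known at two annotated key frames and is filled in for the frames between them. The coordinates and frame indices are toy values, and the function name is hypothetical.

```python
def interpolate_position(frame, frame_a, pos_a, frame_b, pos_b):
    """Linearly interpolate an (x, y) position between two annotated frames."""
    t = (frame - frame_a) / (frame_b - frame_a)
    return tuple(a + t * (b - a) for a, b in zip(pos_a, pos_b))

# Pedestrian annotated at frames 0 and 5; frames 1 to 4 are interpolated.
track = {
    f: interpolate_position(f, 0, (10.0, 50.0), 5, (20.0, 40.0))
    for f in range(6)
}
```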

UCF-QNRF: The UCF-QNRF dataset [31], proposed by Idrees et al. in 2018, contains 1201 training images and 334 testing images with 1.25 million dot annotations in total, and the count per image ranges from 49 to 12,865.

Mall dataset: This dataset [15] was collected from the surveillance video of a shopping mall. The video sequence comprises 2000 frames at a resolution of 320×240 and contains 62,325 pedestrians in total. Compared with UCSD [30], the Mall dataset covers more diverse densities and activity patterns (static and moving persons) under more varied illumination conditions. It also exhibits stronger perspective distortion, which causes larger changes in the size and appearance of objects, as well as severe occlusions caused by objects in the scene.

4.1 Discussion on Results

The datasets are described in Table 3, and the results of recent traditional and CNN-based methods are tabulated in Table 4. The metrics used are Mean Absolute Error (MAE) and Mean Square Error (MSE) [32-34].

4.1.1 Evaluation Metrics

MAE measures the average magnitude of the errors over the test set without considering their direction: it is the mean absolute difference between the actual and predicted counts and indicates accuracy. MSE, as defined in Eq. (2), is the square root of the mean of the squared differences between the actual and predicted counts over all test samples and indicates robustness.

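\[ \mathrm{MAE} = \frac{1}{N} \sum_{k=1}^{N} \left| v_k - \hat{v}_k \right| \tag{1} \]

\[ \mathrm{MSE} = \sqrt{ \frac{1}{N} \sum_{k=1}^{N} \left| v_k - \hat{v}_k \right|^{2} } \tag{2} \]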

where $N$ is the number of test samples, $v_k$ is the ground-truth count of the $k$-th input image, and $\hat{v}_k$ is the estimated count obtained from the corresponding density map of the $k$-th sample.
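The two metrics can be computed as in the short sketch below, which follows Eqs. (1) and (2) and therefore includes the square root in MSE. The counts are toy values used only to exercise the functions.

```python
import math

def mae(actual, predicted):
    """Mean absolute error between actual and predicted counts, Eq. (1)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Root of the mean squared error, matching Eq. (2)."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))

actual = [100, 250, 40]       # toy ground-truth counts
predicted = [110, 240, 45]    # toy estimated counts
mae_value = mae(actual, predicted)
mse_value = mse(actual, predicted)
```

Because the errors are squared before averaging, a single badly mispredicted image raises MSE much more than MAE, which is why MSE is read as a robustness indicator.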

Table 3. Description of crowd counting datasets

Dataset | Images | Resolution | Total | Min | Average | Max
ShanghaiTech Part A [28] | 482 | 589×868 | 241,677 | 33 | 501.4 | 3,139
ShanghaiTech Part B [28] | 716 | 768×1024 | 88,488 | 9 | 123.6 | 578
UCF_CC_50 [29] | 50 | 2101×2888 | 63,974 | 94 | 1,280 | 4,543
UCSD [30] | 2000 | 238×158 | 49,885 | 11 | 24.9 | 46
UCF-QNRF [31] | 1,535 | 2013×2902 | 1,251,642 | 49 | 815 | 12,865
Mall [15] | 2000 | 320×240 | 62,325 | 13 | 31 | 53

Fig. 4. Representation of a few images from crowd counting datasets: (i) ShanghaiTech Part A [28] (ii) ShanghaiTech Part B [28] (iii) UCF_CC_50 [29] (iv) UCSD [30] (v) UCF-QNRF [31] (vi) Mall [15]

5 Benchmarking and Analysis

The results of the reviewed methods on five standard crowd counting datasets are presented in Table 4. The two widely used evaluation metrics, MAE and MSE, which measure a model's accuracy and robustness respectively, are defined in Eqs. (1) and (2). The results are listed per method, following the categorization described above.

Compared with the traditional approaches, CNN-based approaches show improved performance by a large margin. This reflects the training capability of deep convolutional neural networks when a large amount of annotated data is available.

Comparing the CNN-based methods published since 2015, performance has improved steadily, which has supported important advances in crowd counting models.

Table 4. Comparison of methods results using various benchmark datasets

Method | ShanghaiTech Part A (MAE / MSE) | ShanghaiTech Part B (MAE / MSE) | UCF_CC_50 (MAE / MSE) | UCSD (MAE / MSE) | UCF-QNRF (MAE / MSE) | Mall (MAE / MSE)
DUBNet [24] | 64.6 / 106.8 | 7.7 / 12.5 | 243.8 / 329.3 | - | 105.6 / 180.5 | -
HSRNet [25] | 62.3 / 100.3 | 7.2 / 11.8 | - | 1.03 / 1.32 | - | 1.80 / 2.28
LSC-CNN [26] | 66.4 / 117.0 | 8.1 / 12.7 | 225.6 / 302.7 | - | 120.5 / 218.2 | -
AGRD [27] | 61.4 / 97.5 | 7.2 / 11.8 | 194.7 / 246.8 | - | - | -

6 Future Research Directions

Large crowd counting datasets are needed to train deep networks, since they are data hungry. Adapting trained models to new scenes remains a significant difficulty given the limitations of existing models. Current mechanisms must be retrained on images from each new scene, which makes them impractical to deploy in real-world systems, and obtaining annotations for each unique scene is a tedious process.

We consider exploiting additional scale information, building on the notable improvements of scale-aware and context-aware models, to obtain better performance on the crowd counting task.

7 Conclusion

This paper has presented an analysis of recent CNN-based approaches to crowd counting and density estimation. Traditional methods and CNN-based methods are reviewed, and the CNN-based techniques are classified by network architecture and training process. The results of the methods are compared and analyzed. As a future direction, we suggest that integrating scale and context information can reduce errors in CNN-based techniques.

References

1. Coşar, S., Donatiello, G., Bogorny, V., Garate, C., Alvares, L.O., Brémond, F.: Toward abnormal trajectory and event detection in video surveillance. IEEE Transactions on Circuits and Systems for Video Technology, 27(3), 683-695 (2016)

2. Zhao, L., He, Z., Cao, W., Zhao, D.: Real-time moving object segmentation and classification from HEVC compressed surveillance video. IEEE Transactions on Circuits and Systems for Video Technology, 28(6), 1346-1357 (2016)

3. Li, X., Chen, M., Nie, F., Wang, Q.: A multiview-based parameter free framework for group detection. In: Thirty-First AAAI Conference on Artificial Intelligence, February (2017)

4. Chen, M., Wang, Q., Li, X.: Patch-based topic model for group detection. Science China Information Sciences, 60(11), 113101 (2017)

5. Yu, J., Hong, C., Rui, Y., Tao, D.: Multitask autoencoder model for recovering human poses. IEEE Transactions on Industrial Electronics, 65(6), 5060-5068 (2017)

6. Sindagi, V.A., Patel, V.M.: A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognition Letters, 107, 3-16 (2018)

7. Ryan, D., Denman, S., Sridharan, S., Fookes, C.: An evaluation of crowd counting methods, features and regression models. Computer Vision and Image Understanding, 130, 1-17 (2015)

8. Saleh, S.A.M., Suandi, S.A., Ibrahim, H.: Recent survey on crowd density estimation and counting for visual surveillance. Engineering Applications of Artificial Intelligence, 41, 103-114 (2015)

9. Li, T., Chang, H., Wang, M., Ni, B., Hong, R., Yan, S.: Crowded scene analysis: A survey. IEEE transactions on circuits and systems for video technology, 25(3), 367-386 (2014)

10. Kang, D., Ma, Z., Chan, A.B.: Beyond Counting: Comparisons of Density Maps for Crowd Analysis Tasks-Counting, Detection, and Tracking. IEEE Transactions on Circuits and Systems for Video Technology, 29(5), 1408-1422 (2018)

11. Zitouni, M.S., Bhaskar, H., Dias, J., Al-Mualla, M.E.: Advances and trends in visual crowd analysis: A systematic survey and evaluation of crowd modelling techniques. Neurocomputing, 186, 139-159 (2016)

12. Gao, G., Gao, J., Liu, Q., Wang, Q., Wang, Y.: CNN-based Density Estimation and Crowd Counting: A Survey. Computer vision and pattern recognition (2020)

13. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05), Vol. 1, pp. 886-893. IEEE, June (2005)

14. Wu, B., Nevatia, R.: Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In: Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, Vol. 1, pp. 90-97. IEEE, October (2005)

15. Chen, K., Loy, C.C., Gong, S., Xiang, T.: Feature mining for localised crowd counting. In: BMVC. 1(2), 3, September (2012)

16. Chan, A.B., Vasconcelos, N.: Bayesian poisson regression for crowd counting. In: 2009 IEEE 12th international conference on computer vision, pp. 545-551. IEEE, September (2009)

17. Lempitsky, V., Zisserman, A.: Learning to count objects in images. In Advances in neural information processing systems, pp. 1324-1332 (2010)

18. Pham, V.Q., Kozakaya, T., Yamaguchi, O., Okada, R.: Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3253-3261 (2015)

19. Xu, B., Qiu, G.: Crowd density estimation based on rich features and random projection forest. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1-8. IEEE, March (2016)

20. Liu, J., Gao, C., Meng, D., Hauptmann, A.G.: Decidenet: Counting varying density crowds through attention guided detection and density estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197-5206 (2018)

21. Sindagi, V.A., Patel, V.M.: Generating high-quality crowd density maps using contextual pyramid cnns. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1861-1870 (2017)

22. Zhang, Y., Zhou, D., Chen, S., Gao, S., Ma, Y.: Single-image crowd counting via multi-column convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 589-597 (2016)

23. Wang, Q., Chen, M., Nie, F., Li, X.: Detecting coherent groups in crowd scenes by multiview clustering. IEEE transactions on pattern analysis and machine intelligence, 42(1), 46-58 (2018)

24. Oh, M.H., Olsen, P.A., Ramamurthy, K.N.: Crowd Counting with Decomposed Uncertainty. In: AAAI, pp. 11799-11806 (2020)

25. Zou, Z., Liu, Y., Xu, S., Wei, W., Wen, S., Zhou, P.: Crowd Counting via Hierarchical Scale Recalibration Network. Computer Vision and Pattern Recognition (2020)

26. Sam, D.B., Peri, S.V., Sundararaman, M.N., Kamath, A., Radhakrishnan, V.B.: Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (2020)

27. Pan, X., Mo, H., Zhou, Z., Wu, W.: Attention Guided Region Division for Crowd Counting. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2568-2572. IEEE, May (2020)


29. Idrees, H., Saleemi, I., Seibert, C., Shah, M.: Multi-source multi-scale counting in extremely dense crowd images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2547-2554 (2013)

30. Chan, A.B., Liang, Z.S. J., Vasconcelos, N.: Privacy preserving crowd monitoring: Counting people without people models or tracking. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-7. IEEE, June (2008)

31. Idrees, H., Tayyab, M., Athrey, K., Zhang, D., Al-Maadeed, S., Rajpoot, N., Shah, M.: Composition loss for counting, density map estimation and localization in dense crowds. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 532-546 (2018)

32. Sermanet, P., Chintala, S., LeCun, Y.: Convolutional neural networks applied to house numbers digit classification. In: Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pp. 3288-3291. IEEE, November (2012)

33. Zhang, C., Li, H., Wang, X., Yang, X.: Cross-scene crowd counting via deep convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 833-841 (2015)

