Low-Cost IoT Surveillance System Using Hardware-Acceleration and Convolutional Neural Networks

Epaminondas S. Lage1, Rodolfo L. Santos2, Sandro M.T. Junior3, Fernando Andreotti4

Abstract—Reliable object detection is crucial for intelligent surveillance systems. Despite the abundance of available systems, these are often expensive or require cloud connectivity, which restricts their usage. In this paper, we present a complete low-cost solution using embedded devices. Person detection is performed locally, using an efficient implementation of a state-of-the-art convolutional Single Shot Detector. The proposed model is fine-tuned to the surveillance task at hand and tested using a Raspberry Pi and the Up Squared Board devices. Additionally, the benefits of hardware acceleration are evaluated using an Intel® Movidius™ Neural Compute Stick (NCS). The final model performs person detection with an average precision of 0.48. The Raspberry Pi is capable of acquiring and processing images at a rate of 0.4 to 5.3 FPS without/with the NCS, whereas the Up Board achieves 2.1 to 7.3 FPS. Further tests demonstrate how the system performs when multiple cameras and multiple NCS devices are deployed. In conclusion, the proposed architecture is suitable for real-time, affordable surveillance systems in home and commercial settings.

I. INTRODUCTION

Securing access to facilities is crucial for both businesses and homes. Following the novel paradigms of the Internet of Things (IoT), several commercial surveillance systems have been proposed in recent years, e.g. Nest Cam (Nest Labs Inc., Palo Alto, USA), Arlo Pro (Arlo Technologies Inc., San Jose, USA) and Spotlight Cam (Ring Inc., Santa Monica, USA). Despite the abundance of available monitoring systems, most depend on expensive proprietary cameras or require internet connectivity for cloud-based analysis, the so-called Video Surveillance as a Service (VSaaS). Current solutions are generally intended for developed countries, and their elevated costs render them unaffordable for small businesses and personal use in low/medium income countries. In addition, concerns about privacy and unauthorized access to the intranet and footage, or even poor/nonexistent internet connections, deter some users from VSaaS solutions. Therefore, an offline low-cost surveillance system is desirable to meet the needs of low-resource customers who aim to prevent theft and unauthorized access to their facilities/homes.

1 Department of Electrical and Automation, Centro Federal de Educação Tecnológica de Minas Gerais, Belo Horizonte, Brazil
2 Centro Universitário Cesumar, Maringá, Brazil
3 Centro Universitário de Belo Horizonte, Belo Horizonte, Brazil
4 Department of Engineering Science, University of Oxford, UK

ESL thanks the Centro Federal de Educação Tecnológica de Minas Gerais (CEFET-MG) for its support. FA is supported by the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC) and the Engineering and Physical Sciences Research Council (EPSRC – grant EP/N024966/1).

Corresponding author: E.S. Lage ([email protected])

Video surveillance is a multidisciplinary field consisting of sensor technology, network/storage solutions, display interfaces and software development for image processing algorithms [1]. Image processing is an active area of research, including topics such as object detection, tracking, person identification, and crowd surveillance systems [2]. With recent developments in machine/deep learning, the accuracy of these systems has increased, as have their computational requirements and complexity. Open-source frameworks like TensorFlow [3] have promoted a further democratization of deep neural networks by facilitating their development, training and deployment. However, with the growth of ubiquitous CCTV cameras and the increasing complexity of the required algorithms, in-house self-contained video surveillance systems have become a chimera for most small companies [1].

Some studies have focused on developing low-cost object detection systems [4], [5], [6], [7], many of which made use of low-powered computers. To compensate for the lower computational power, these approaches limited themselves to processing frames only once movement is detected. For this purpose, some studies proposed the usage of additional passive infrared sensors [4], whereas others perform background subtraction followed by motion detection on the camera stream. After motion detection, videos are usually collected and person detection is performed. Traditional person detection algorithms make use of features derived from Histograms of Oriented Gradients (HOG) [8] or Haar-like features for face recognition [9], [4], [7]. Feature extraction is followed by vanilla classifiers such as Support Vector Machines [10], [8] or Adaptive Boosting. The reader is referred to [11] for a review of such approaches. Finally, these systems notify users via email [7], web-stream [6], SMS messages [5], or in-app or Telegram (Telegram Messenger LLP, London, UK) notifications [4]. Nonetheless, very few studies [4], [12], [13] evaluated the usage of low-power computers for IoT applications.

In this study we propose a low-cost IoT solution to perform object detection with state-of-the-art deep learning algorithms in situ. The system is tested for the purpose of surveillance, i.e. the detection of people, using different low-powered devices. These devices include two single-board computers and a neural compute stick with dedicated hardware for machine learning purposes. The setup performance is evaluated using multiple cameras to assess its real-time potential. When an internet connection is available, the system is capable of informing end-users about events via Telegram notifications.

978-1-5386-4980-0/19/$31.00 © 2019 IEEE


Fig. 1: Architecture of the low-cost IoT surveillance system proposed in this study.

II. MATERIALS AND METHODS

A. Architecture Description

The architecture of the proposed low-cost video surveillance solution is depicted in Fig. 1. Two widely available low-power, low-cost, single-board computers were evaluated, namely the Raspberry Pi (RPi) and the Up Squared (UP2). Both devices are sold in different configurations; the specifications used in this study are shown in Table I. Further, we evaluate the addition of the Intel® Movidius™ Neural Compute Stick (NCS) to the embedded computers. The NCS is a low-power USB stick containing a dedicated Myriad™ 2 Vision Processing Unit (VPU), specialized in running deep learning networks, encoding/decoding video and performing image processing tasks. The proposed system enables the processing and storing of object detection events independently of internet connectivity.

The single-board computer orchestrates the variety of services illustrated in Fig. 2. The system supports native Real Time Streaming Protocol (RTSP) streams from legacy Digital/Network Video Recorder (DVR/NVR) installations by using an Nginx server on the cloud, which encodes RTSP to HTTP Live Streaming (HLS) on the fly using the FFmpeg library. Similarly, surveillance camera streams are converted to Motion JPEG or WebM streams using FFmpeg. After encoding, images are available for processing. The embedded device also acts as a Message Queuing Telemetry Transport (MQTT) protocol broker, providing secure transmission for further IoT devices connected to the intranet, and additionally acts as a subscriber to the cloud network.
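The RTSP-to-HLS conversion step could be driven from Python roughly as follows. The FFmpeg flags mirror what the text describes, but the segment duration and playlist size are illustrative assumptions, not values from the paper:

```python
import subprocess

def hls_command(rtsp_url, playlist="stream.m3u8"):
    # Build an FFmpeg command that re-packages an RTSP camera
    # stream into HLS segments on the fly (no re-encoding).
    return [
        "ffmpeg",
        "-rtsp_transport", "tcp",         # more robust than UDP over Wi-Fi
        "-i", rtsp_url,
        "-c:v", "copy", "-an",            # pass video through, drop audio
        "-f", "hls",
        "-hls_time", "4",                 # 4 s segments (assumption)
        "-hls_list_size", "5",
        "-hls_flags", "delete_segments",  # keep a rolling window on disk
        playlist,
    ]

def start_stream(rtsp_url):
    # Launch the encoder as a background process on the embedded device.
    return subprocess.Popen(hls_command(rtsp_url))
```

Since `hls_command` only assembles the argument list, it can be inspected and tested without FFmpeg installed.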

In addition to the data acquisition services, the embedded device contains a relational database using MySQL, which is used to register object detection events, sensor inputs and relay outputs. Webmin v1.860 provides a user interface via port 80, allowing the registration of new cameras/devices on the network. When an internet connection is available, PHP v7.0 is used to notify users of these events via a Telegram bot or email alerts. These events comprise 10 s MPEG-4 videos and can be activated/deactivated using Telegram/Webmin. Lastly, the cameras can be remotely observed/controlled from anywhere using an external MQTT broker hosted on online cloud platforms such as Amazon Web Services (Amazon.com, Inc., Seattle, USA).

TABLE I: Single board devices used in this study.

Specs.     | Raspberry Pi 3B                 | Up Board UP2
Manufact.  | Raspberry Pi Foundation, UK     | AAEON Technology Inc., Taiwan
OS         | Raspbian OS (Stretch Lite v4.9) | Ubuntu 16.04.2 LTS
Processor  | ARM Cortex-A53 (1.4 GHz)        | Intel® Pentium™ N4200 (2.5 GHz)
RAM        | 1 GB LPDDR2                     | 8 GB LPDDR4
Storage    | 16 GB miniSD                    | 64 GB eMMC
Ethernet   | yes                             | yes (2×)
Wireless   | yes                             | no
Bluetooth  | yes                             | no
USB        | 4× USB 2.0                      | 2× USB 2.0, 3× USB 3.0
Cost       | approx. US$35                   | approx. US$359

Fig. 2: Services running on embedded device.
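The Telegram alert path described above can be sketched against the Bot API's `sendMessage` endpoint using only the standard library; the token, chat id and message text below are placeholders that a deployment would supply:

```python
import json
import urllib.request

def telegram_request(token, chat_id, text):
    # Build (but do not send) a sendMessage call to the Telegram Bot API.
    url = "https://api.telegram.org/bot{}/sendMessage".format(token)
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"})

def notify(token, chat_id, text):
    # Dispatch the alert; call only when internet is available.
    with urllib.request.urlopen(telegram_request(token, chat_id, text)) as resp:
        return resp.status == 200
```

Separating request construction from dispatch keeps the offline code path testable on a device without connectivity.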

B. Object Detection Algorithm

In the last decade, the field of computer vision has experienced unprecedented growth. This rapid rate of development is due to three factors: advances in hardware technology (graphics processing units, or GPUs, becoming less expensive and more powerful), the increasing availability of image data (datasets containing millions of labelled images, e.g. Microsoft COCO [14]) and algorithmic improvements at various levels of deep neural networks and optimization methodologies [15]. The advent of Convolutional Neural Networks (CNNs) enabled the generation of features at multiple levels of abstraction (i.e. layers), which allowed systems to learn complex functions mapping the input to the output directly from data, not relying on hand-engineered features (e.g. HOG features) [16]. Moreover, due to their computational efficiency and properties such as translation invariance, parameter sharing and sparse connectivity, CNNs are often the method of choice for operating over grid-like structures such as images [17]. With advancements in deep learning, novel object detection algorithms such as SSD [18], R-CNN [19] and Mask R-CNN [20] were proposed.

Fig. 3: Flowchart of the person detection routine. OpenCV's VideoCapture acquires frames; the first frame is stored as background and each following frame is compared against it (i.e. optical flow); when the pixel change exceeds 10 px, TensorFlow runs object detection with the frozen ssdlite_mobilenet_v1_coco model, returning boxes and prediction scores; if an object of interest is present, frames are saved for 10 s (the first 5 frames are processed) and FFmpeg creates a 5 FPS video; the event is saved in the local database and, if internet is available, alerts are sent via Telegram/email and IP relays are activated.

Despite their high accuracy, the general trend is towards deeper and more complex networks, which are not necessarily more efficient with respect to model size and speed [21]. In fact, these models are usually only capable of processing a few frames per second (FPS) on high-end GPUs [22]. Due to the limited computational resources of the low-cost devices proposed in this study, these networks cannot be directly applied. A few strategies to reduce model size and increase computational speed were recently proposed, focusing on embedded and smartphone applications [23], [21], [24], [22].

A recent development that enabled smaller and faster CNNs was the MobileNet model [21]. MobileNet is built primarily from depth-wise separable convolutions, a form of factorized convolution which factorizes a standard convolution into a depth-wise and a point-wise convolution (i.e. a 1×1 convolution). In this model, the depth-wise convolution is applied to each input channel, whereas the point-wise convolution is used to create a linear combination of the outputs of the depth-wise layer. This structure leads to a substantial reduction in computational cost. Counting depth-wise and point-wise convolutions as separate layers, MobileNet has 28 layers and two model-shrinking hyper-parameters: a width multiplier and a resolution multiplier [21].
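The reduction in cost can be verified with a quick back-of-the-envelope count of multiply-add operations; the layer sizes below are illustrative, not taken from the paper:

```python
def standard_conv_cost(k, c_in, c_out, h, w):
    # Multiply-adds of a standard k x k convolution over an h x w map.
    return k * k * c_in * c_out * h * w

def separable_conv_cost(k, c_in, c_out, h, w):
    # Depth-wise (k x k per input channel) plus point-wise (1x1) cost.
    return k * k * c_in * h * w + c_in * c_out * h * w

# Example: 3x3 convolution, 32 -> 64 channels, 112 x 112 feature map.
std = standard_conv_cost(3, 32, 64, 112, 112)
sep = separable_conv_cost(3, 32, 64, 112, 112)
print(round(std / sep, 2))  # prints 7.89, i.e. roughly 8x fewer operations
```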

In this work, person detection (i.e. object detection focusing solely on the class "person") is achieved using the strategy summarized in Fig. 3. Image acquisition from the camera or DVR/NVR streams and preliminary processing were developed in Python v3.5 and OpenCV v3.3.1. Motion detection was performed using OpenCV's absolute difference between the background image and the current frame. In case movement was detected, a 10 s video at 5 FPS and a resolution of 320×240 pixels was saved. This temporal resolution enables the usage of state-of-the-art computer vision approaches at near real-time performance; similar frame rates are achieved by Faster R-CNN [25] on high-end GPUs. Person detection is then performed over each frame of the video using SSD MobileNet [21], implemented using TensorFlow v1.5.
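The motion trigger can be sketched as follows. NumPy's absolute difference stands in for OpenCV's `cv2.absdiff`, and both thresholds are illustrative assumptions rather than the values used in the paper:

```python
import numpy as np

def motion_detected(background, frame, pixel_thresh=25, min_changed=500):
    # Per-pixel absolute difference between the stored background
    # frame and the current frame (grayscale uint8 arrays).
    diff = np.abs(background.astype(np.int16) - frame.astype(np.int16))
    # Trigger when enough pixels changed by more than pixel_thresh.
    return int(np.count_nonzero(diff > pixel_thresh)) > min_changed

background = np.zeros((240, 320), dtype=np.uint8)
frame = background.copy()
frame[100:150, 100:150] = 200              # simulate a person entering
print(motion_detected(background, frame))  # True
```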

C. Dataset, Training and Performance Evaluation

Several pre-trained SSD MobileNet checkpoints (i.e. pre-trained models) are openly available in the TensorFlow Object Detection API [26]. These checkpoints depend on the database used during training and the task at hand. For the surveillance task, the most suitable starting point is the ssd_mobilenet_v1_coco_2017_11_17 checkpoint, which was pre-trained on Microsoft's COCO Database (MSCOCO) [14]. This dataset contains various objects (a total of 91 categories, e.g. humans, toasters, motorcycles, toilets) in common scenes. Although MSCOCO images present everyday scenes, they are often compositions of close-up objects that do not match what is expected from a distant, generally high-positioned surveillance camera.
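Fine-tuning with the TensorFlow Object Detection API is driven by a pipeline configuration. A fragment of such a config might look as below; the file names, batch size and step count are illustrative assumptions, not the paper's settings:

```
model {
  ssd {
    num_classes: 1  # single "person" class after merging categories
  }
}
train_config {
  batch_size: 24
  fine_tune_checkpoint: "ssd_mobilenet_v1_coco_2017_11_17/model.ckpt"
  num_steps: 20000
}
train_input_reader {
  label_map_path: "person_label_map.pbtxt"        # hypothetical file name
  tf_record_input_reader {
    input_path: "wider_surveillance_train.record"  # hypothetical file name
  }
}
```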

In order to adapt the model to the farther images typical of surveillance cameras, we made use of the WIDER Pedestrian Challenge Database [27]. The challenge aimed at detecting pedestrians and cyclists (i.e. 2 classes) in unconstrained environments. The database consists of surveillance camera and car driving footage and is split into 11,500 images for the training set (46,513 labels), 5,000 for the validation set (19,696 labels) and 3,500 for the test set. In this work, we fine-tuned the model using the surveillance camera training data (5,759 images) while regarding both categories as one, so that both cyclists and pedestrians were detected as "persons". The validation set from surveillance cameras (2,481 images) was then used to report the model's accuracy in terms of average precision (AP) at an Intersection over Union (IoU) threshold of 50% (AP@0.5, i.e. the PASCAL VOC metric).

#Cameras | RPi       | UP2       | RPi with NCS | UP2 with NCS
1        | 0.38      | 2.10      | 5.27         | 7.31
2        | 0.25±0.00 | 1.04±0.06 | 1.23±0.01    | 2.44±0.01
3        | 0.17±0.00 | 0.71±0.05 | 0.75±0.01    | 1.57±0.01
4        | 0.14±0.00 | 0.57±0.05 | 0.46±0.00    | 1.00±0.01
5        | 0.11±0.00 | 0.49±0.05 | 0.33±0.01    | 0.71±0.00

TABLE II: Real-time performance (FPS) for object detection using the various embedded devices proposed in this study. Results are presented as mean ± standard deviation depending on the number of camera streams used.
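The AP@0.5 criterion counts a detection as correct when its IoU with a ground-truth box reaches 0.5; the IoU computation itself is only a few lines:

```python
def iou(box_a, box_b):
    # Boxes given as (x_min, y_min, x_max, y_max) in pixels.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# Two boxes overlapping on half their width:
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # 0.333
```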

III. RESULTS

The fine-tuned SSD MobileNet v1 model achieved a detection accuracy of AP@0.5 = 0.48. The results of applying the SSD model before and after fine-tuning are shown in Fig. 4. In Fig. 5, day and night-time images of three surveillance cameras demonstrate the performance of our model in real-world scenarios.

Further, to evaluate the system's real-time performance, we measured the maximum FPS that each setup achieves given the different embedded hardware and number of cameras. For this purpose, a 176 s video was recorded in which various events take place. This video was streamed up to five times simultaneously (i.e. simulating 5 camera streams) to each embedded device. The devices performed object detection on each frame using the SSD model. As the NCS support for TensorFlow is limited, the SSD MobileNet Caffe model available in [28] was used for benchmarking the neural stick. Results are displayed in Table II.
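The throughput figures in Table II correspond to a timing loop of roughly this shape; this is a sketch, as the paper's actual benchmarking harness is not published:

```python
import time

def measure_fps(process_frame, frames):
    # Run the detector over every frame and report frames per second.
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```

`process_frame` would wrap a single SSD inference call; passing a trivial function instead gives an upper bound on the acquisition loop alone.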

(a) Pre-trained model

(b) Fine-tuned model

Fig. 4: Example of model performance using the pre-trained SSD model on the COCO database (a) and the fine-tuned model on the WIDER Pedestrian database (b). The example image is cropped from img00063.jpg of the WIDER validation set. Blue bounding boxes represent reference annotations and red boxes detections.

IV. DISCUSSION

The proposed architecture allows low-cost on-site detection of people in near real-time. The solution is suitable for low/medium income home and commercial use. The embedded system achieved an accuracy of AP@0.5 = 0.48. Comparison with similar studies from the literature is only partially possible, since many make use of different object categories, CNN models, performance metrics and image sizes. Nonetheless, the original implementation of MobileNet v1 [26] shows a mean average precision (mAP) of 0.21 on all 91 categories of the MSCOCO database [14]; the mAP metric from the MSCOCO evaluation protocol takes into account various levels of IoU thresholds. Similarly, the best scoring entry during the test phase of the WIDER Pedestrian Challenge achieved mAP = 0.71 [27], [14], although the model used is unclear. In addition, as mentioned in Section II-C, the challenge consisted of two classes (cyclists and pedestrians). All in all, the current result illustrates the well-known trade-off between accuracy and network computational cost [29].

(a) Example 1 street camera. (b) Example 2 street camera. (c) Example indoors camera.

Fig. 5: Person detection during day (top row) and night (bottom row). White rectangles indicate the user-defined region of interest (ROI) and red rectangles show the bounding boxes produced by the object detection algorithm.

From Fig. 4 it is evident that fine-tuning improves person detection in surveillance settings. In Fig. 4(a) the only detection is a person whose face is clearly distinguishable and facing the camera, whereas in (b) even further-away individuals are correctly identified. Moreover, Fig. 5 confirms that the model is capable of detecting people in real-world scenario images in both day and night conditions.

With regard to the near real-time performance (presented in Table II), as expected, neither the RPi nor the UP2 on its own is able to run deep learning algorithms at desirable frame rates. The addition of an NCS reduces their processing time considerably and the FPS improvement is evident. As a result of externalizing the execution of computer vision algorithms, simpler, lower-cost hardware can be used. In [13], a comparison between the RPi and competing devices is shown for the purpose of IoT applications. As mentioned, the choice of model also plays an important role in the performance; [24], [29] compare the FPS achieved by various methods using high-end GPUs. As novel reduced network architectures emerge in the literature, performance improvements for embedded/mobile devices are to be expected.

During the development of this study, we noticed some setup factors that lead to higher efficiency and detection accuracy. First, camera positioning with at least 15 lux during day-time and infra-red support during night-time is essential for reliable detections. Second, we recommend detecting individuals at a distance between 4 and 20 m, with the camera angle as close to perpendicular to the person as possible. Last, the usage of user-defined regions of interest (ROIs, as seen in Fig. 5) around gateways lessens the computational load on the embedded device, as the video size is reduced.

Future work will focus on evaluating the usage of multiple NCS in a single embedded device. Alternative networks such as Tiny YOLO [23], Tiny SSD [22] and Feature Pyramid Networks (FPN) [30] also appear as suitable alternatives. Recently, TensorFlow Lite, a lightweight version of the TensorFlow library for mobile and embedded devices, was released. The platform should enable on-device machine learning inference with low latency and small binary size, and also supports hardware acceleration (for now on Android only). At last, dedicated hardware solutions such as the NVIDIA Jetson TX2 may provide increased performance at a slightly higher cost, which should be evaluated.
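The ROI optimization described in the discussion amounts to cropping each frame before detection; a minimal sketch, assuming an (x, y, width, height) convention for the user-drawn rectangle:

```python
import numpy as np

def crop_roi(frame, roi):
    # roi = (x, y, width, height) in pixels, e.g. drawn around a gateway.
    x, y, w, h = roi
    return frame[y:y + h, x:x + w]

frame = np.zeros((240, 320, 3), dtype=np.uint8)  # full camera frame
gate = crop_roi(frame, (100, 60, 80, 120))       # user-defined ROI
print(gate.shape)  # (120, 80, 3)
```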

V. CONCLUSIONS

In this contribution, we propose a complete low-cost surveillance system using state-of-the-art deep learning object detection. The study demonstrates the benefits of hardware acceleration for embedded computer vision tasks, which can provide near real-time performance using the proposed solution.

REFERENCES

[1] A. Prati, R. Vezzani, M. Fornaciari, and R. Cucchiara, Intelligent Video Surveillance as a Service. Springer, Nov. 2013, pp. 1–16.

[2] A. Hampapur, L. Brown, J. Connell, A. Ekin, N. Haas, M. Lu, H. Merkl, and S. Pankanti, "Smart video surveillance: exploring the concept of multiscale spatiotemporal tracking," IEEE Signal Processing Magazine, vol. 22, no. 2, pp. 38–51, Mar. 2005.


[3] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/

[4] I. Aydin and N. A. Othman, "A new IoT combined face detection of people by using computer vision for security application," in 2017 International Artificial Intelligence and Data Processing Symposium (IDAP). IEEE, 2017, pp. 1–6.

[5] K. H. S. Murugan, V. Jacintha, and S. A. Shifani, "Security system using Raspberry Pi," in 2017 Third International Conference on Science Technology Engineering & Management (ICONSTEM), Mar. 2017, pp. 863–864.

[6] H.-Q. Nguyen, T. T. K. Loan, B. D. Mao, and E.-N. Huh, "Low cost real-time system monitoring using Raspberry Pi," in 2015 Seventh International Conference on Ubiquitous and Future Networks, 2015, pp. 857–859.

[7] W. F. Abaya, J. Basa, M. Sy, A. C. Abad, and E. P. Dadios, "Low cost smart security camera with night vision capability using Raspberry Pi and OpenCV," in 2014 International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM), Nov. 2014, pp. 1–6.

[8] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1, 2005, pp. 886–893.

[9] P. Viola and M. J. Jones, "Robust Real-Time Face Detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.

[10] S. Rujikietgumjorn and N. Watcharapinchai, "Real-Time HOG-based pedestrian detection in thermal images for an embedded system," in 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2017, pp. 1–6.

[11] P. Dollar, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.

[12] D. Shah and V. Haradi, "IoT Based Biometrics Implementation on Raspberry Pi," Procedia Comput. Sci., vol. 79, no. 8, pp. 328–336, Feb. 2016.

[13] M. Maksimovic, V. Vujovic, N. Davidovic, V. Milosevic, and B. Perisic, "Raspberry Pi as Internet of Things hardware: Performances and Constraints," in Proceedings of 1st International Conference on Electrical, Electronic and Computing Engineering, 2014, p. 6.

[14] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollar, "Microsoft COCO: Common Objects in Context," ArXiv:1405.0312, pp. 1–15, 2014.

[15] F. Chollet, Deep Learning with Python. Manning, 2017.

[16] Y. Bengio, "Learning Deep Architectures for AI," Found. Trends Machine Learn., vol. 2, no. 1, pp. 1–127, 2009.

[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single Shot MultiBox Detector," ArXiv:1512.02325, 2015.

[19] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 142–158, Nov. 2013.

[20] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," ArXiv:1703.06870, 2017.

[21] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," ArXiv:1704.04861, 2017.

[22] A. Wong, M. J. Shafiee, F. Li, and B. Chwyl, "Tiny SSD: A Tiny Single-shot Detection Deep Convolutional Neural Network for Real-time Embedded Object Detection," ArXiv:1802.06488, 2018.

[23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, Real-Time Object Detection," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 779–788.

[24] B. Wu, A. Wan, F. Iandola, P. H. Jin, and K. Keutzer, "SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving," ArXiv:1612.01051, 2016.

[25] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2015.

[26] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, "Speed/accuracy trade-offs for modern convolutional object detectors," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017, pp. 3296–3305.

[27] "WIDER Pedestrian Challenge 2018," in ECCV 2018, 2018. [Online]. Available: https://competitions.codalab.org/competitions/19118

[28] C. Qi, "MobileNet-SSD," 2018. [Online]. Available: https://github.com/chuanqi305/MobileNet-SSD

[29] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, "Speed/accuracy trade-offs for modern convolutional object detectors," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017, pp. 3296–3305.

[30] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017, pp. 936–944.
