A Distributed Perception Infrastructure for Robot Assisted Living

Stefano Ghidoni a, Salvatore M. Anzalone b, Matteo Munaro a, Stefano Michieletto a, Emanuele Menegatti a

a Intelligent Autonomous Systems Laboratory, University of Padova, Padova, Italy
b Institut des Systèmes Intelligents et de Robotique, University Pierre et Marie Curie, Paris, France

Abstract

This paper presents an ambient intelligence system designed for assisted living. The system processes audio and video data acquired from multiple sensors spread in the environment to automatically detect dangerous events and generate automatic warning messages. The paper presents the distributed perception infrastructure that has been implemented by means of an open-source software middleware called NMM. Different processing nodes have been developed which can cooperate to extract high-level information about the environment. Examples of implemented nodes running algorithms for people detection or face recognition are presented. Experiments on novel algorithms for people fall detection and for sound classification and localization are discussed. Finally, we present successful experiments in two test-bed scenarios.

Keywords: Ambient intelligence, Intelligent autonomous systems, Assisted living, Camera network, Distributed sensing, Autonomous robots.

1 Introduction

Distributing intelligence among agents and objects that share the same environment is the main idea of Ambient Intelligence (AmI) [1, 2, 3]. Such systems are based on sensors and actuators spread across the environment to capture relevant information and to act on the environment accordingly.

Research in this area has focused on the extraction of relevant information from the raw data streams collected by sensors [4, 5, 6] and on user profiling [7] (i.e., the intelligent environment can adapt its behavior according to the people that are inside it). In recent years, particular attention has been given to ambient intelligence applications aimed at helping elderly people live alone and maintain their independence, while assuring a high level of assistance when needed [8, 9].

In this paper, an ambient intelligence system meant for assisted living and surveillance applications is described. Our system is composed of several audio and video sensors spread in the environment that let the system process images and sounds from multiple sources, and also acquire panoramic views of the room. The system presented in this paper offers two main advantages over the works presented in the literature: first, it includes both audio and video event detection, which has been identified as highly beneficial by previous works [10]. Second, unlike most of the works in this field [11], it does not rely on accelerometers or any other technology that requires cooperation from the people living in the intelligent environment. This is the first time a “passive” system handling both audio and video data is presented in the literature.

The capability of an AmI system to work without needing wearable sensors is particularly important when assisting elderly people [10]. In fact, it is well known from clinical studies [12] and from companies in the business of assisted and monitored living for elderly people that any system based on the cooperation of the monitored persons is doomed to fail, even if this “cooperation” means only wearing a small intelligent device (e.g., a bracelet with an accelerometer or a more intrusive necklace with an emergency button on it). Therefore, our system is designed to be non-intrusive and to work with non-cooperating people.

An additional feature of our system is the use of the innovative Omnidome® [13] dual-camera sensor together with standard perspective and Pan-Tilt-Zoom (PTZ) cameras. The Omnidome sensor is particularly suited for video-surveillance applications, as it couples the 360° panoramic view of an omnidirectional camera with the high resolution offered by a PTZ unit; the cross-calibration of the two sensors is a strong plus that makes it possible to treat it as a single, high-resolution omnidirectional sensor. This paper presents the first experiments employing this sensor in an assisted living application, and shows its benefits in such a context.

The system focuses on the detection of salient audio and video events in order to guarantee a safe and secure environment for elderly people living at home. Several papers in the literature describe the main tasks to be accomplished for assisting elderly people living at home [1, 14]. The scenario outlined in these papers is rather complex, because the events to be detected can belong to the audio, video or inertial sensory field. Systems for assisted living are therefore equipped with very heterogeneous sensors, in order to gather a large amount of data about the environment and better knowledge about the context. The ultimate goal of assisted living is to provide a set of user-centered services [1, 15].

The structure of an assisted living system is usually based on a set of applications working at the same time [16, 17]. Such applications are quite independent, because they analyze data provided by diverse sensors and aim at detecting very different events: for this reason a data fusion stage is often missing, since observations made by different sensors are unrelated.

Our system aims at extracting high-level knowledge about the environment, and exploits several processing algorithms: some of them, like fall detection, are specific to assisted living and were expressly developed for this system, while others (like face detection) are more general, so state-of-the-art solutions have been employed. A Pioneer P3-AT robot is also part of the system, to extend the range of functionalities of a typical AmI setup: the integration of a mobile platform represents a strong advantage, since it enables the system to act on the environment [18].

A key advantage of the system we propose is that it is built upon a framework that handles communication among the different nodes that process the video streams acquired by the cameras. In other words, we exploited an existing framework meant for multimedia applications to implement the distributed architectures described in the literature [16]. This is the only way to ensure fast realization and scalability of this kind of system, which is spread across the environment. Since the system is multi-modal, i.e. it analyzes audio and video data, we chose a multimedia framework called NMM (Network-integrated Multimedia Middleware) [19].

2 System infrastructure

Our system is part of the “Safe Home” project, and was designed to provide increased safety to elderly people living at home. This includes a wide variety of services:

• a home security system,

• a home safety system,

• a domotic system for automating doors, windows and air conditioning,

• a video communication framework for connecting the assisted person with friends and relatives,

• a telemedicine system,

• a data collection and recording infrastructure,

• a front-end for handling the interface from and to external systems (e.g. for controlling the domotic system from outside the assisted living environment).

Regarding the first two points, it should be noted that security means the protection of the assisted people against possible intrusions from outside (e.g. a thief), while safety refers to the protection of the people against unwanted dangers originating inside the apartment (e.g. a fall).

The whole structure of the Safe Home system is shown in Figure 1. As can be seen, the system includes sensors, actuators and an interface to a remote control room that is used by human operators to react to the alarms issued by the system. The system has been developed in the context of a cooperation project between university and industry. Our task was to develop computer vision and audio processing algorithms to assist elderly people: we therefore decided to focus on fall detection, which is the most dangerous event in this context [12], and on sound classification, focused on events related to security.

Figure 1: Scheme of the complete system developed for the Safe Home project. The safety system is detailed in this paper.

As in the case of several AmI systems proposed in the literature [16], our system relies on multiple sensors (like microphones and cameras). Distributed processing is therefore desirable for scalability purposes. In our approach, each sensor is connected to a separate computer that processes its data stream to extract high-level information. This model lets the system communicate effectively, because only compact high-level information is exchanged, while heavy data streams are processed locally. A centralized approach would impose strong limitations on the total number of sensors, because only a few of them can be connected to the same computer, and cables have a maximum length in the order of a few meters.

To realize the distributed perception infrastructure described above, we exploited the open-source Network-integrated Multimedia Middleware (NMM) by Motama [19], which provides communication facilities over the network. In NMM, the basic processing unit is a node, and every processing algorithm should be placed inside a node in order to interact with other entities in the middleware.

Nodes are organized by means of graphs, which describe how nodes are connected and exchange data. One of the main advantages of using NMM is that a single graph can connect nodes that reside on different machines on the same network, without any overhead for handling the communication over the network. A complex perception system can be synthesized by implementing the system functionalities inside nodes on the appropriate machines, and connecting them using a distributed graph.

NMM is not designed to control active devices, so a different system has to be used to control mobile robots that can act in the environment. The control of the mobile robot is achieved through the Robot Operating System (ROS) framework [20], which provides easy access to both high-level behaviors and low-level control of the robot. We created a communication link between NMM and ROS to let the robot and the camera network cooperate effectively. In the following sections, we present the different processing modules and how we have implemented the cooperation among them, in order to achieve the desired system functionalities.

An example: the click & move functionality

To show the typical structure of an NMM graph and the set of processing nodes involved, consider the simple “click & move” application. This was developed for sending commands from the camera network to the robot. It presents a graphical user interface in which all video flows collected by the different cameras are shown (as in Figure 9): the user can click on any view, and the application will mark the same point on the other views and send the robot to that location.

The click & move application can be run on any combination of sensors, and a possible NMM graph is shown in Figure 2. In this example, four cameras available in the test room are involved:

• “door”, installed over the room door;

• “PTZ”, a PTZ camera fixed to the ceiling roughly at the center of the room;

• “ceiling”, a perspective camera placed so that its optical axis is perpendicular to the ground and, again, installed roughly at the center of the room;

• “omnicam”, an omnidirectional camera attached to the wall, at the center of one side of the room.

A set of nodes is activated on each machine connected to a sensor. The distributed system structure can be observed: the four blue columns in Figure 2 represent four different machines connected to the network. Each machine is directly connected to a sensor, which is managed by a set of NMM nodes that handle acquisition, processing and data transmission to other machines over the network. All MouseDetectionNodes send their output to one VideoEventReceiverNode; this last node manages the interface of the application. The nodes working on the stream generated by the door camera, for example, are the following:

• TVCardReadNode2: manages the acquisition from an analog source;

• YUVtoRGBConverterNode: converts the image format from YUV, provided by the acquisition node, to RGB;

• VideoScalingNode: rescales the video;

• VideoUndistortingNode: undistorts the images in the video, according to the results of the intrinsic parameter calibration;

• MouseDetectionNode: catches the mouse events on the window.

Figure 2: Graph of the click & move application. The distributed structure is visible, since all nodes (white dashed boxes) but the VideoEventReceiverNode are distributed among the different machines (blue dashed boxes) directly connected to the video sources.

All the nodes in the structure discussed above are included in the NMM basic functionalities, except the VideoUndistortingNode and the MouseDetectionNode, which were expressly developed for this project.

3 Distributed perception modules

In our system, cameras of different types and microphones cooperate to recognize events using images and sounds. The system structure is shown in Figure 3, where it is possible to identify a block for multimodal sensor integration that collects the output of several sensing blocks. This is not a sensor fusion module, because data provided by the different sensing systems are unrelated, as explained in section 1. In the sensor integration module two activities take place: on one side, observations made by computer vision modules from different viewpoints are related to each other: for example, if an event is detected by one camera but needs further investigation, the sensor integration module can provide information about sensor geometry to help other sensors focus on the same portion of the environment. The second task performed in this module is the creation of links between events coming from different sensory fields. For example, a fall and a scream can be related: if they occur within a short time of each other, they can be linked to each other.

Figure 3: Scheme of the proposed system. Cameras and microphones placed in the environment are used as input to the fall detection and sound classification modules.

The left part of the scheme in Figure 3 shows the video processing: images acquired by the cameras are processed using motion detection and fall detection, which is based on the motion patterns found in the first block. The fall detector provides a high-level output that is suitable for sensor integration. Sound is processed in two different blocks: one is dedicated to sound localization, while the other analyzes the sound content exploiting a feature extractor and a classifier. The sensor integration block is meant to collect the high-level data provided by all sensing blocks, fuse them together, and trigger further analyses of the environment, as will be explained in section 3.1.

3.1 Distributed pipeline and camera cooperation

Some of the events detected by the system can involve one or more of its sensing modalities: for instance, a scream, detected by the sound subsystem, could be caused by a person falling. On the other hand, a scream and a fall can occur at the same time, but happen to different people at different positions. To understand whether two events are related or not, cooperation algorithms between audio and video sensors have been developed, as shown in Figure 3. Such cooperation is based on spatial information, which is available for both modalities (audio and video). When events occur close in time, the system verifies whether they occurred in the same location and whether they are compatible: for example, the video event “person falling” is compatible with the audio events “scream” and “broken glass”, but not with “barking dog” (see section 5.3 for the discussion of the audio and video events detected by the system). This association mechanism provides a more detailed description of the events and aims at increasing the confidence of the detected events.
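To make the association rule concrete, the sketch below shows one possible way to implement it; the event class, the compatibility table and the time and distance thresholds are illustrative assumptions, not the actual implementation used in the system.

```python
from dataclasses import dataclass

# Illustrative compatibility table between video and audio event labels
# (assumption: only the labels listed in section 5.3 exist).
COMPATIBLE = {
    "person falling": {"scream", "broken glass"},
}

@dataclass
class Event:
    label: str         # e.g. "person falling", "scream"
    timestamp: float   # seconds
    position: tuple    # (x, y) on the ground plane, in meters

def related(video_ev: Event, audio_ev: Event,
            max_dt: float = 2.0, max_dist: float = 1.5) -> bool:
    """Return True if the two events are close in time and space
    and their types are compatible (thresholds are assumptions)."""
    close_in_time = abs(video_ev.timestamp - audio_ev.timestamp) <= max_dt
    dx = video_ev.position[0] - audio_ev.position[0]
    dy = video_ev.position[1] - audio_ev.position[1]
    close_in_space = (dx * dx + dy * dy) ** 0.5 <= max_dist
    compatible = audio_ev.label in COMPATIBLE.get(video_ev.label, set())
    return close_in_time and close_in_space and compatible

# Example: a fall and a scream 0.8 s apart, half a meter away from each other.
fall = Event("person falling", 10.0, (2.0, 3.0))
scream = Event("scream", 10.8, (2.3, 3.4))
print(related(fall, scream))  # True
```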

Cooperation between audio and video is also used to further investigate events: if only a single modality is able to detect an event, it can trigger the other to focus on it. For example, the localization by the audio processing system of a breaking glass or of a barking dog can trigger and guide a PTZ camera towards the sound source. This kind of geometric information is particularly useful when dealing with PTZ cameras: they are a natural choice for acquiring highly detailed views of the environment, but it is impossible to get an overall understanding of the environment from their images. Cooperation can overcome this problem by letting PTZ cameras be controlled by other modules: in this case the geometric information of all the sensors present in the environment is used to point the PTZ cameras in the direction of the perceived event, focusing on it while keeping the other parts of the environment under control.

The camera network is composed of PTZ, static and omnidirectional cameras that cooperate. The system exploits these sensors in different ways in order to achieve maximum coverage and performance. For example, omnidirectional cameras offer low-resolution but extremely wide images, which are suitable for keeping large spaces under control. On the other hand, only PTZ cameras can acquire high-resolution images that can be used for face recognition. The fields of view of the sensors should overlap, since a certain level of redundancy can be useful for overcoming the problem of occlusions and for increasing performance. As an example of this, we will present a distributed people recognition approach. Most people detection and face recognition algorithms work better when observing a person from a frontal view, but they drop in performance when the person or the face is seen from the side.

In the following, all system modules will be described in detail.

3.2 People detection

Recognizing people is one of the basic capabilities of an ambient intelligence system, because many events that should be observed, especially for elderly care, are related to human activities. In Figure 4 we propose a pipeline suitable for performing distributed face recognition in a heterogeneous camera network composed of fixed and PTZ cameras and a central server. Fixed cameras are used for running the people detection function based on a simple Viola-Jones algorithm trained with full bodies (obviously more modern and better performing approaches to people detection could be used, like [21, 22, 23], but the focus of this work is on distributed perception, and the performance of the Viola-Jones algorithm was satisfactory in the current scenario).
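As a reference for this detection step, the sketch below runs a full-body Viola-Jones detector on a camera frame with OpenCV; the cascade file and the detection parameters are generic defaults, not the values tuned for the system described here.

```python
import cv2

# Load the full-body Haar cascade shipped with OpenCV (the exact path
# depends on the local installation; cv2.data points to the bundled files).
cascade_path = cv2.data.haarcascades + "haarcascade_fullbody.xml"
body_detector = cv2.CascadeClassifier(cascade_path)

def detect_people(frame):
    """Return a list of (x, y, w, h) bounding boxes of detected people."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)  # improve contrast before detection
    boxes = body_detector.detectMultiScale(
        gray, scaleFactor=1.05, minNeighbors=3, minSize=(40, 80))
    return list(boxes)

# Example with a single frame grabbed from the default camera.
cap = cv2.VideoCapture(0)
ok, frame = cap.read()
if ok:
    for (x, y, w, h) in detect_people(frame):
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
cap.release()
```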

A people detection module runs on each fixed camera of the system. Once people are detected, an important security task is to recognize their identities, which is done by observing their faces; images provided by fixed cameras, however, have too low a resolution to perform accurate face recognition. To overcome this issue, the coordinates of the detected people are sent over the network to the nodes connected to the PTZ camera, which is able to zoom in on the faces to be recognized.

3.3 Distributed people recognition

Since the location of each person has already been measured by the other cameras, the expected position of the face to be detected is calculated to drive the PTZ unit to acquire a close-up image of the face. The coordinates of the person's position on the ground plane are estimated by intersecting the 3D ground plane equation, computed in the calibration phase, with the ray passing through the camera optical center and the lowest point of the person in the image. The 3D position of the face is estimated as the intersection of two 3D lines: one connects the image face position and the camera optical center, and the second is perpendicular to the ground plane and passes through the person's 3D position. If a PTZ camera receives only the person's position, it zooms at a standard height of 170 cm from the ground plane. A face detection algorithm is exploited to segment the face from the image; the selected region of interest is then sent over the network to the face recognition node, which stores the face database and runs the face recognition algorithm described in the next section. Only one PTZ unit is employed in our case, but multiple units could be added and exploit the same server.
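A minimal geometric sketch of this step is given below, assuming a calibrated pinhole camera (intrinsics K, world-to-camera rotation R and translation t) and a ground plane z = 0 in the world frame; the variable names and the 170 cm fallback height follow the description above, while the example calibration values are made up for illustration only.

```python
import numpy as np

def pixel_to_ground(u, v, K, R, t):
    """Intersect the viewing ray of pixel (u, v) with the ground plane z = 0.

    K: 3x3 intrinsic matrix; R, t: world-to-camera rotation and translation.
    Returns the 3D point on the ground plane, in world coordinates.
    """
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray in camera frame
    d_world = R.T @ d_cam                              # ray in world frame
    c_world = -R.T @ t              # camera optical center in world frame
    lam = -c_world[2] / d_world[2]  # solve c_z + lam * d_z = 0
    return c_world + lam * d_world

def face_position(ground_point, face_height=1.70):
    """Estimate the 3D face position as the point on the vertical line
    through the person's ground position, at the default 170 cm height."""
    x, y, _ = ground_point
    return np.array([x, y, face_height])

# Example with made-up calibration: camera 2.5 m above the ground, looking down.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
R = np.diag([1.0, -1.0, -1.0])
t = np.array([0.0, 0.0, 2.5])
ground = pixel_to_ground(320, 400, K, R, t)
print(ground, face_position(ground))
```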

Figure 4: Pipeline for distributed face recognition in a heterogeneous camera network.

Faces found by the Viola-Jones detector are processed to calculate a vector of features that best describes them, using the technique called Eigenfaces [24]. A small set of sample pictures is used to generate a space, called Eigenspace, capable of better encoding the variations between faces. In this project, the Eigenspace has been built using the samples of the free face database “YaleFaces” [25]; a generalized feature space for frontal faces has been obtained. As a final step, the features are fed into an SVM properly trained to classify frontal faces of human users, producing a face identity claim for each of them. A statistical refinement of the output can be performed at this stage, forcing the system to return as the current face identity the mode of the last 5 identities claimed.
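The sketch below reproduces this recognition chain with off-the-shelf tools (PCA as Eigenfaces, a linear-kernel SVM, and a mode filter over the last five claims); it uses scikit-learn rather than the original implementation, and the image size, number of components, kernel and random training data are placeholders.

```python
from collections import Counter, deque
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Training data: each row is a flattened grayscale face crop (e.g. 64x64),
# labels are integer identities. Random data stands in for real faces here.
rng = np.random.default_rng(0)
X_train = rng.random((60, 64 * 64))
y_train = rng.integers(0, 3, size=60)

# "Eigenfaces": project faces onto the first principal components.
eigenfaces = PCA(n_components=20).fit(X_train)
clf = SVC(kernel="linear").fit(eigenfaces.transform(X_train), y_train)

# Temporal refinement: keep the last 5 identity claims and return the mode.
recent_claims = deque(maxlen=5)

def recognize(face_vector):
    features = eigenfaces.transform(face_vector.reshape(1, -1))
    claim = int(clf.predict(features)[0])
    recent_claims.append(claim)
    return Counter(recent_claims).most_common(1)[0][0]

print(recognize(rng.random(64 * 64)))
```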

3.4 Fall detection

One event that deserves special care with elderly people is the fall: this is a dangerous situation, because after a fall people could be unable to move, even to push an alarm button. Several techniques based on diverse sensors are described in the literature [10], but there are two main approaches to the problem: computer vision and wearable inertial sensors. The former choice turns out to be more expensive, but the latter requires the assisted people to cooperate and always wear the sensors, which is never the case, because inertial sensors are felt to be intrusive by the elderly, or they just forget to wear them. For this reason, computer vision is the most favorable choice for assisting the elderly [10].

The topic of fall detection using computer vision has already been addressed: the main approach is based on motion analysis, since a fall is a movement that generates a specific motion pattern, as pointed out in [26, 27]. Fall detectors can be based on motion analysis of the whole body, on some parts only, as is the case of the head in [28], or, finally, on features like gradients [29]. Fall detectors operating on omnidirectional images [30] and range data [31] have also been described. In all cases, the aim of the system is to distinguish between a fall and a normal motion, like sitting on a chair or on a sofa, or going to bed. For this reason, motion direction and speed should be analyzed.

The developed fall detection module is aimed at detecting the particular motion cues of a falling person. The module exploits a mixed technique relying on both background subtraction and frame differencing. This combination is used to compensate for the drawbacks of each; in particular, frame differencing turns out to be very efficient when coping with illumination changes, as discussed in [13]. This is a very important feature to be considered in real-world scenarios, and a strong advantage over other methods like [27], whose performance has been assessed on sequences with a flat background.

Unlike the algorithms presented in the literature, our fall detector can be executed indifferently on images provided by perspective cameras or by omnidirectional ones, after the unwarping process. Even though an unwarped image is in theory similar to a perspective image, in practice a strong difference is represented by the non-uniformity of the image resolution. The upper part of the unwarped image has a much higher resolution than the lower part, because the former comes from the rectification of the outer circle in the omnidirectional image. The non-uniform resolution causes a certain degree of deformation that may invalidate some assumptions that can normally be made in fall detection systems [27]. On the other hand, algorithms focused on omnidirectional images [30] are not general, and cannot be applied to all video sources of the system, since just one omnidirectional camera is present in the system.

To keep our approach general and obtain a good performance on unwarped omnidirectional images, we developed a fall detection algorithm that considers all blobs found by the motion detector, and performs a further processing step based on three factors: aspect ratio, gradient, and movement during and after a possible fall. The first parameter, aspect ratio, is the ratio W/H between the width and height of the bounding box containing a person. This parameter is very commonly used in people detection and tracking, but for detecting falls it is used in a slightly different way here: it is tracked over time in order to detect possible changes in orientation. The change in the bounding box aspect ratio is measured by means of the β-parameter, evaluated as:

\beta = \sqrt{(W - W')^2 - (H - H')^2} , \quad (1)

where W and H are the bounding box width and height at the k-th frame, while W' and H' have the same meaning referred to frame k-1.

The second parameter considered by the fall detector is the gradient. For the gradient analysis, both horizontal and vertical Sobel edge detectors are exploited to evaluate the corresponding edge images I_x and I_y. Two features, called horizontal (G_x) and vertical (G_y) gradients, are then evaluated by summing all pixel values inside the considered motion blob in the two images:

G_{dir} = \sum_{i=0}^{W} \sum_{j=0}^{H} I_{dir}(i, j) , \quad (2)

where dir indicates the direction (either x or y). This could seem similar to the aspect ratio, but it represents a different approach, because the gradient analysis considers the content of the bounding box, neglecting the ratio between width and height: an object can present a strong horizontal gradient even though it is vertically placed, and vice-versa. Horizontal and vertical gradients provide information because people who are walking or standing have a vertical component that is sensibly stronger than the horizontal one, while fallen people have their main component in the horizontal direction. The fall detector then compares the two gradients and their evolution over time, to understand whether the moving blob corresponds to a person who is falling.

The third factor considered by the fall detection algorithm is the movement during and after the fall. As mentioned above, moving objects in the scene are detected using a mixed technique based on background subtraction with active background and frame differencing [13]. The output of this algorithm is used to validate those bounding boxes whose β-parameter and G_{dir} follow an evolution that is compatible with a fall, which happens when they are both above a threshold for a number of consecutive frames corresponding to 1 s. When such a pattern is found, the system verifies that no motion (or limited motion) is found inside the bounding box, in order to signal that a fall has happened.
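The sketch below illustrates how the first two cues can be computed for one motion blob per frame, together with a simple persistence check; the thresholds, the frame rate and the decision rule are illustrative assumptions (the paper does not report the tuned values), and the final low-motion check after the fall is omitted for brevity.

```python
import cv2
import numpy as np

def bbox_cues(gray, bbox, prev_bbox):
    """Compute the beta-parameter (Eq. 1) and the Sobel gradient sums (Eq. 2)
    for the motion blob enclosed by bbox = (x, y, w, h)."""
    x, y, w, h = bbox
    _, _, pw, ph = prev_bbox
    # Note: the argument of Eq. (1) can be negative; it is clipped to 0 here.
    beta = np.sqrt(max((w - pw) ** 2 - (h - ph) ** 2, 0.0))
    roi = gray[y:y + h, x:x + w].astype(np.float64)
    gx = np.abs(cv2.Sobel(roi, cv2.CV_64F, 1, 0)).sum()  # horizontal gradient
    gy = np.abs(cv2.Sobel(roi, cv2.CV_64F, 0, 1)).sum()  # vertical gradient
    return beta, gx, gy

def update_fall_state(counter, beta, gx, gy, fps=25,
                      beta_thr=10.0, ratio_thr=1.2):
    """Count consecutive frames whose cues look like a fall; a fall is
    signalled after roughly 1 s of consistent evidence (thresholds assumed)."""
    fall_like = beta > beta_thr and gx > ratio_thr * gy
    counter = counter + 1 if fall_like else 0
    return counter, counter >= fps   # (updated counter, fall detected?)
```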

Figure 5: Examples of fall detection: in (a) the green bounding box indicates a tracked person that did not experience any fall. Fallen people have been detected in (b) and (c), and marked by a red bounding box.

Some results of the fall detection algorithm are presented in Figure 5: in (a) a person is surrounded by a green bounding box, indicating that no fall occurred. In (b) and (c), two images in which falls have been correctly detected (indicated by the red bounding box) are shown. These results have been obtained on unwarped omnidirectional images, which represent a difficult test for the algorithm: it needs to process images with non-uniform resolution, as mentioned above, and bounding boxes of limited size (as can be seen in the images).

3.5 Sound identification and localization

Sound information is important for revealing dangerous events. The classification of audio events adds information to detections made on video data. Moreover, when objects are occluded or the illumination is too low, sound could be the only way to localize an event.

3.5.1 Sound identification

Focusing on the problem of sound identification inside an intelligent environment, some assumptions can be made: first of all, a closed set of “dangerous” sounds is identified, in our case screams, shots, bangs and breaking glass. Moreover, the system needs to work in real time. According to these constraints, a novel sound classification system has been developed and implemented in an NMM node. The node performs a short-time analysis of the perceived sound, whose spectrum is processed to extract relevant features that can be used as a “signature” of the sound class. Mel-Frequency Cepstral Coefficients (MFCC) are used as fast, real-time and reliable sound features [32].

During the feature extraction step, short frames of 30 ms of the audio signal are captured from the microphone, collected and filtered through a raised-cosine Hamming window to prevent aliasing. Each window overlaps the following one by 10 ms. The frequency spectrum is then calculated by applying the FFT algorithm. Finally, the MFCC are calculated: the magnitude in decibels of the frequency spectrum is multiplied by a Mel filter bank and then, finally, the Discrete Cosine Transform is applied. The Mel filter bank is a set of band-pass filters localized according to the Mel scale, an auditory scale capable of modeling the response of the human ear. The Discrete Cosine Transform is applied to the results of the Mel filter bank to obtain the Mel-Cepstral vectors, which represent the short-term power spectrum of a sound and can in turn be used as a feature of the current sound.
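Comparable features can be obtained with the off-the-shelf sketch below, which uses librosa rather than the NMM node developed by the authors; the 30 ms window with 10 ms overlap follows the text above, while the sampling rate and the number of coefficients are assumptions.

```python
import numpy as np
import librosa

def extract_mfcc(signal, sr=16000, n_mfcc=13):
    """Compute MFCC vectors over 30 ms Hamming windows overlapped by 10 ms
    (i.e. a 20 ms hop), as in the feature extraction step described above."""
    n_fft = int(0.030 * sr)          # 30 ms analysis window
    hop_length = int(0.020 * sr)     # 20 ms hop -> 10 ms overlap
    return librosa.feature.mfcc(
        y=signal, sr=sr, n_mfcc=n_mfcc,
        n_fft=n_fft, hop_length=hop_length,
        win_length=n_fft, window="hamming")

# Example on one second of synthetic audio (a 440 Hz tone).
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)
mfcc = extract_mfcc(tone, sr=sr)
print(mfcc.shape)   # (n_mfcc, number of frames)
```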

The features are classified in a second stage: during an offline training step, cepstral vectors are collected and used to create a model for each sound; in the identification phase the cepstral vectors are compared to these models in order to find an identity claim. The classification of the cepstral vectors is carried out by a Support Vector Machine.

In order to prevent fluctuations and to obtain more stable performance, a final refinement stage has been applied: through a statistical analysis of the last 50 identity claims, the most frequent identity is returned as the identity of the currently perceived sound.
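A possible implementation of the classification and refinement stages, again with scikit-learn instead of the original code, is sketched below; the class labels match the sounds listed in section 5.3, while the training data and the SVM parameters are placeholders.

```python
from collections import Counter, deque
import numpy as np
from sklearn.svm import SVC

CLASSES = ["scream", "gunshot", "broken_glass", "dog_bark", "background"]

# Placeholder training set: one MFCC vector per row, one label per row.
rng = np.random.default_rng(1)
X_train = rng.random((500, 13))
y_train = rng.integers(0, len(CLASSES), size=500)

svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

# Refinement: the reported identity is the most frequent one
# among the last 50 per-frame claims.
last_claims = deque(maxlen=50)

def classify_frame(mfcc_vector):
    claim = int(svm.predict(mfcc_vector.reshape(1, -1))[0])
    last_claims.append(claim)
    return CLASSES[Counter(last_claims).most_common(1)[0][0]]

print(classify_frame(rng.random(13)))
```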

3.5.2 Sound localization

Standard methodologies use two or more microphones spread over the environment in order to estimate the time delay of arrival of the sound wave. According to this delay and to the geometrical position of the microphones, it is possible to accurately calculate not only the direction but also the three-dimensional position of the sound source. Commonly used techniques to calculate the time delay are based on the Generalized Cross-Correlation (GCC) function between couples of signals. In particular, in order to take advantage of the redundant information provided by multiple sensors, the Multichannel Cross-Correlation Coefficient (MCCC) method can be a good choice for calculating the sound source position [33].

Sound source localization has been implemented in an NMM node connected to two microphones placed in the environment. Each couple of microphones is able to perceive a single sound source. Knowing the distance between the two microphones, we estimate the position of the audio source through the estimation of the time delay of arrival (TDOA) between the waveforms perceived by each microphone:

\theta = \arcsin\left(\frac{\tau c}{d}\right) , \quad (3)

where θ is the incoming direction of the sound, d is the distance between the microphones, c is the speed of sound and τ is the TDOA. The TDOA is evaluated using the Generalized Cross-Correlation. The GCC in the frequency domain is defined as:

r^{GCC}_{x_1 x_2}[p] = \sum_{f=0}^{N-1} \Psi[f]\, S_{x_1 x_2}[f]\, e^{j 2 \pi p f / L} , \quad (4)

where N is the number of samples and Ψ[f] is the frequency-domain weighting function. The cross-spectrum of the two input signals is defined as:

S_{X_1 X_2}(f) = E\{X_1(f) X_2^*(f)\} , \quad (5)

where X_1(f) and X_2(f) are the Discrete Fourier Transforms of the input signals and (*) denotes the complex conjugate. The GCC shows a peak in correspondence with the time delay, and minimizes the influence of uncorrelated noise. In order to exploit the phase information during the computation of the GCC, the PHAT (Phase Transform) weighting function has been used to normalize the amplitude of the spectral density of the two signals:

\Psi_{\mathrm{PHAT}}[f] = \frac{1}{|S_{x_1 x_2}[f]|} . \quad (6)
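The following sketch estimates the TDOA with a PHAT-weighted GCC computed via the FFT and converts it to a direction with Eq. (3); the sampling rate, the microphone spacing and the speed of sound are assumptions, and the synthetic signals are for illustration only.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs):
    """Estimate the time delay of arrival between two signals using
    the PHAT-weighted generalized cross-correlation (Eqs. 4-6)."""
    n = len(x1) + len(x2)
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    S = X1 * np.conj(X2)                     # cross-spectrum, Eq. (5)
    S /= np.abs(S) + 1e-12                   # PHAT weighting, Eq. (6)
    cc = np.fft.irfft(S, n=n)                # back to the time domain, Eq. (4)
    max_lag = n // 2
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lag = np.argmax(np.abs(cc)) - max_lag    # peak position, in samples
    return lag / fs

def sound_direction(tau, d=0.3, c=343.0):
    """Eq. (3): convert the TDOA tau into an arrival angle (degrees), given
    the microphone spacing d (m) and the speed of sound c (m/s)."""
    return np.degrees(np.arcsin(np.clip(tau * c / d, -1.0, 1.0)))

# Example: the same noise burst with a 5-sample delay between the microphones.
fs = 16000
rng = np.random.default_rng(2)
s = rng.standard_normal(4096)
x1, x2 = s, np.roll(s, 5)
tau = gcc_phat_tdoa(x1, x2, fs)
print(tau, sound_direction(tau))
```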

The GCC provides consistent information when the sound signal changes over time without any pseudo-periodic features; otherwise the performance is sensibly reduced.

Figure 6: The mobile platform used for the experiments.

4 Robot control

Elderly assistance sometimes requires small interventions by an external agent. Service robotics could be the solution in situations in which the human presence is not necessary, but an active device can further investigate or even solve dangerous events. A teleoperated robot can be guided by a human to a specific place in the intelligent environment in order to check on the elderly person's condition, help in a specific task, or simply give a better view of a scene through a mobile camera.

Our system integrates a tool to manage the movement of a mobile robotic platform inside the environment monitored by the camera network. The robot we use (Figure 6) is a Pioneer P3-AT equipped with several sensors. A Microsoft Kinect provides RGB-D data with 640×480 pixel resolution at 30 frames per second. A SICK LMS100 Laser Range Finder mounted on a pan-tilt unit scans an angle of 270° in the environment at a maximum frequency of 50 Hz. Front and rear sonars sense obstacles from 15 cm to 7 m, paying particular attention to transparent or moving objects. All these sensors are interfaced with the robot by using ROS. The Robot Operating System (ROS) [20] by Willow Garage is an open-source meta-operating system providing services like hardware abstraction, low-level device control, message passing between processes, and package management. ROS contains tools and libraries useful for robotics applications, such as navigation, motion planning, and image and 3D data processing. The system runs on a laptop equipped with an integrated Wi-Fi wireless LAN adapter in order to communicate with the whole intelligent environment. Among the drivers provided with ROS, the ones used in this system are:

• the OpenNI driver, which allows interfacing with common RGB-D devices;

• the p2os driver, which provides the controller for the P3-AT mobile platform and sonar;

• the LMS1xx driver, which obtains raw data from several types of Laser Range Finders.

The use of ROS gives us the possibility of achieving a full integration between robot and sensors.

Thanks to some open-source software, we could integrate many useful functionalities into our robot:

• the gmapping [34] or vslam packages implement methods for localization and mapping using depth information;

• the navigation [35] package processes information from odometry and sensor streams and provides velocity commands to be sent to the mobile robot;

• the peopleTracking [36] package is a multi-people tracking and following algorithm based on RGB-D data, specifically designed for mobile platforms.

In the intelligent environment two different systems have to coexist: NMM, to manage the communication between the different video and audio devices, and ROS, to make the robot able to move safely in an indoor environment. The cooperation between these systems is necessary to let the robot properly react to dangerous events.

As mentioned before, the communication between the robot and the intelligent environment is realized by message passing between NMM and ROS. The interface consists of a wrapper which takes information from NMM nodes and republishes it on the proper topic in the ROS framework. The only data to be exported consist of the action the robot has to accomplish and the position it has to reach, in 3D coordinates. Robot actions are decided by the sensing system based on the events detected in the environment; commands coming from the sensory system are then converted and sent to the ROS framework. It is important to note that all issues related to robot navigation and sensing are solved inside the ROS environment, which is best suited for this task.
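The ROS side of such a wrapper can be as simple as the node sketched below, which republishes a target position received from the perception system as a pose on a topic; the topic name, frame id and message type are assumptions, not the actual interface of the project.

```python
#!/usr/bin/env python
import rospy
from geometry_msgs.msg import PoseStamped

def publish_goal(pub, x, y, z, frame="map"):
    """Republish a 3D target position coming from the NMM side as a ROS pose."""
    goal = PoseStamped()
    goal.header.stamp = rospy.Time.now()
    goal.header.frame_id = frame
    goal.pose.position.x = x
    goal.pose.position.y = y
    goal.pose.position.z = z
    goal.pose.orientation.w = 1.0   # no specific heading requested
    pub.publish(goal)

if __name__ == "__main__":
    rospy.init_node("nmm_ros_wrapper")
    # Hypothetical topic name; a navigation stack could subscribe to it.
    pub = rospy.Publisher("/safe_home/goal", PoseStamped, queue_size=10)
    rospy.sleep(1.0)                  # give subscribers time to connect
    publish_goal(pub, 2.0, 3.5, 0.0)  # example target in the room frame
    rospy.spin()
```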

5 Experimental results

Tests of the system have been performed in two different intelligent environments. The first one was set up in our laboratory, the second one was installed in a real domotic apartment designed for elderly people. Tests conducted in the laboratory of Figure 7(a) used a PTZ camera, an Omnidome camera [13], two perspective cameras, two microphones and a Pioneer robot. More extensive tests have also been performed in the assisted living apartment for elderly people, set up jointly with industrial partners as a complex infrastructure in the context of the “Safe Home” project. Industrial partners provided several domotic nodes to control shutters, air conditioning, lights, doors and windows. A sketch of the system is shown in Figure 7(b): two perspective Axis M1103 IP cameras have been placed in the kitchen and in the bedroom. In the Safe Home environment the system has been tested during 10 days of uninterrupted running.

To show how the intelligent environment works, consider the test bed of Figure 7(a). When a human enters the room, the omnidirectional and perspective cameras perceive him and the tracking starts. Each detected person is associated with a numerical identifier and a bounding box, and the track of each person is kept until the person leaves the room.

Figure 7: The two testing environments used for the experiments: a laboratory equipped with a number of sensors (a) and a real apartment (b).

The system also monitors the bounding box of each track, and considers the variation of the bounding box dimensions in order to detect falls, as described in section 3.4. If a fall occurs, the PTZ camera receives a command to point in the direction of the falling person to get a more detailed image of the dangerous event; this is achieved by means of the intrinsic calibration of the two cameras constituting the Omnidome sensor, rather than by exploiting the sensor integration module. At the same time, a warning message is issued to a human operator in a remote control room, who can simultaneously monitor a high number of smart environments.

The behavior of the system in response to auditory stimuli is similar. The sound detection system isolates sounds from silence; then, if a sound has been detected, it is identified in order to recognize dangerous events, such as a scream, a gunshot or breaking glass, as opposed to natural human voice.

When events are detected by the system, a human operator has the possibility to control the robot and drive it to verify what happened in the environment.

5.1 Multi-camera calibration and robot action

The cameras placed in the environment need to be calibrated to cooperate effectively, so that observations made from each viewpoint can be referred to a common reference system. The common reference has been placed at one corner of the room and all cameras have been calibrated with respect to that reference. Two NMM nodes for each camera transform coordinates in the image reference from and to world coordinates.

The standard calibration process [37] implemented in the OpenCV library1 has been exploited. The process is based on multiple observations of a checkerboard for inferring the intrinsic parameters, and on the observation of the same object in a known position for the extrinsic parameters. The omnidirectional camera, which is part of the Omnidome sensor [13], was calibrated using the OCamCalib toolbox [38], which is again based on the observation of a checkerboard.
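For reference, the intrinsic part of this calibration can be reproduced with the standard OpenCV calls sketched below; the board size, square size and image folder are placeholders, and at least one image with a detectable board is assumed.

```python
import glob
import cv2
import numpy as np

BOARD = (9, 6)        # inner corners of the checkerboard (assumed)
SQUARE = 0.025        # square size in meters (assumed)

# 3D corner coordinates of the board in its own reference frame.
objp = np.zeros((BOARD[0] * BOARD[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:BOARD[0], 0:BOARD[1]].T.reshape(-1, 2) * SQUARE

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.png"):      # placeholder image folder
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, BOARD)
    if found:
        obj_points.append(objp)
        img_points.append(corners)

# Intrinsic parameters (camera matrix and distortion coefficients).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS error:", rms)
```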

The calibration accuracy that was achieved is sufficient for the modules of the system, which do not suffer from errors in the order of 20-30 mm. Simplicity was preferred over more complex calibration techniques because it would be unfeasible to require an accurate calibration when the system is installed in real-world scenarios.

The average calibration errors made when projecting image points onto the 3D ground plane were measured for all cameras. They are 3.96 cm for the PTZ camera, 7.7 cm for the door camera, 2.49 cm for the central camera and 4.37 cm for the omnidirectional camera. Since the uncertainties related to the detection algorithms composing the system are of the same order of magnitude, a more accurate calibration would not provide significant benefits. This calibration procedure, which aims at simplicity rather than precision, is therefore suitable for the system being considered.

1 http://www.opencv.org

Figure 8: Two experiments: in the lab, the omnidirectional camera detects a fall, then the PTZ camera focuses on it (a); in the real apartment, the perspective camera tracks a person and then detects a fall (b).

Figure 9: The result of two simple actions interpreted by the wrapper and executed by the robot, shown from different camera views.

Integration with the robot has also been tested by means of a simple application, similar to the click & move described in section 2. A video at http://youtu.be/gYqBdVLBqPQ shows how, just by clicking on one of the images of the distributed camera network, the operator can move the robot or the PTZ camera. Tests involved both communication and calibration, since the robot shares the same reference system used by the camera network. The uncertainties related to robot positioning are comparable with the reprojection errors.

5.2 People detection and recognition evaluation

For testing the pipeline described in Figure 4, we used a fixed and a PTZ camera placed as reported in Figure 7(a). A video at http://www.youtube.com/watch?v=j7SV0hl8wOs shows our system at work: two people enter the room and are detected by the fixed camera. Then, their positions are sent over the network and read by the PTZ camera, which takes a close-up of one person at a time, finds the face and sends the corresponding part of the image over the network again. Finally, the recognition server analyzes the face image and recognizes the person. Some screenshots of this video are reported in Figure 10.

The face recognition training set was composed of videos containing the faces of three people acquired in different positions, orientations and lighting conditions. In Figure 11, a graphical representation of the first two PCA components of the examples contained in the training set is shown. Different colors highlight points belonging to different people. As can be noticed, people's faces are well separated in this space. In Table 1, execution times are reported for every part of the algorithm. The most computationally expensive part is the face detection performed by the PTZ camera while zooming.

The performance of the system has also been evaluated by varying the number of subjects to be recognized during each experimental session. As shown in Figure 11, in the case of a small number of identities to be distinguished the performance of the system is very good, but as the number of subjects increases, the discrimination becomes more difficult. In order to avoid this problem it is possible to increase the dimension of the feature vectors extracted by the Eigenfaces algorithm: in this way it is possible to rely on a more detailed descriptor of each face. Figure 11 also shows a comparison of the results obtained using different dimensions of the feature vectors. Recognition problems might also be overcome by using more adaptive clustering techniques instead of the SVM, or by using a more reliable multimodal approach, e.g. exploiting voice claims.

Figure 10: Screenshots from the video showing our complete system at work. Colored arrows show the information flow through the network components.

Table 1: Computational times for each algorithm (ms).

People detection                    83.3
Face detection in a whole image    181.8
Face detection in a person bbox      9.7
Face recognition                     0.1

5.3 Audio and video events evaluation

Some experiments focused on situations with just one person in the scene, while others were performed with several people. Event-based performance has been measured: test sequences have been divided into events, i.e. sets of consecutive frames in which one or more people are either walking or falling. In 21 events a person falls, while in 35 no fall happens. The algorithm was tested on sequences acquired using both perspective and omnidirectional cameras, and the performance level is the same in both cases.

For measuring performance, true positives (TP) and true negatives (TN) have been counted, together with false positives (FP) and false negatives (FN). TP is the number of sequences in which the event (i.e. a fall in this case) occurs and is correctly detected by the system, while TN is the number of sequences in which the event does not occur and is correctly not detected by the system. Misclassifications are measured by FP and FN: FP is the number of occurrences in which the event does not occur but is detected by the system, while FN is the opposite.

Figure 11: First two PCA components of the faces used as training set. Different colors mean different people (a). Performance of the people identification by varying the number of faces to be recognized, and for different lengths of the feature vectors (b).

Table 2: Types of falls considered for testing the system, and performance indicators.

Fall type              TP   FP   FN
Front fall              8    3    1
Lateral fall            2    0    0
Backward fall           3    1    0
Crowded environment     6    3    1

Several different types of falls exist, which generate different motion cues. The test sequences covered a variety of situations, which are grouped in Table 2 together with the performance indicators mentioned above. Additional sequences not containing any fall were also used.

All the above statistics are event-based: the videos used for performance measurements have been divided into sub-sequences, each one representing a single event to be recognized. During the experiments, 90% true positives and 80% true negatives were measured, leading to an overall accuracy of 85% and a precision of 90% (see Table 3). Errors were mainly due to instabilities in the patterns found by the motion detector, and can therefore be corrected by smoothing the motion information.
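The event-based indicators follow directly from the four counts; the small helper below shows the formulas used, with arbitrary example numbers that are not the counts of Table 2.

```python
def event_metrics(tp, tn, fp, fn):
    """Accuracy and precision for event-based evaluation."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    return accuracy, precision

# Arbitrary example counts, for illustration only.
print(event_metrics(tp=18, tn=28, fp=2, fn=8))
```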

The sound detection system has been trained using a total of 60 seconds of pre-recorded sound data; the samples belonged to four classes: screams, bangs, breaking glass and dog barking, which are the sounds the system is able to recognize. A further “background” class of human voices in unharmful situations is also used. Experiments on the sound recognition system reported a performance of 83% true positives and 94% true negatives, with 16% false negatives and 5% false positives, showing a correct perception of sound events in 85% of the cases. The system achieved an overall accuracy of 88% and a precision of 83% on its measurements (see Table 3).

However, thanks to the multimodal approach, the complete system has been able to outperform the single modalities, giving a correct evaluation of true positives and negatives of 98%: when the system is not able to recognize an event through one channel, a second one recovers the situation, improving the total performance of the whole system. Finally, the tracking bounding boxes and the reliable sound localization estimates ensured a correct framing of the PTZ cameras on the event sources in 83% of the cases.

Table 3: The confusion matrices of (a) the people fall detection system and (b) the sound identification system.

(a) Fall detection (rows: actual, columns: predicted)

              Falling   Not falling
Falling        0.90       0.09
Not falling    0.20       0.80

(b) Sound identification (rows: actual, columns: predicted)

           Scream   Gunshot   Glass   Bark
Scream      0.85     0.11      0.04    0.00
Gunshot     0.08     0.85      0.07    0.00
Glass       0.02     0.02      0.96    0.00
Bark        0.06     0.27      0.00    0.67

6 Conclusions

In this paper, an intelligent perception system has been presented. It is capable of performing monitoring tasks based on multimodal sensors, which make it possible to perceive different kinds of features of the environment. The system has several strong points: it is equipped with omnidirectional and perspective cameras, and makes use of the novel Omnidome dual-camera sensor. The software modules have been developed to be general and to work on both kinds of images: for example, the fall detection algorithm is capable of working on both unwarped omnidirectional images and common perspective images. The system is multimodal, and is capable of offering several user-centered services, which are the goal of an assisted living environment.

The system was successfully used during the industrial project “Safe Home”, funded by the Veneto Region, Italy, in a real environment meant to provide assistance to elderly people, by detecting people's falls and classifying sounds connected with dangerous situations. The use of the visual and auditory modalities assured a performance improvement with respect to the single channels. The system has also been capable of coordinating active sensors (PTZ cameras) and a robot in order to react to perceived stimuli, for instance by letting the active cameras frame a location where a sound triggering an alarm has been perceived, or by moving the robot to that location.

Acknowledgments

The authors wish to thank Alberto Salamone, Riccardo Levorato, Andrea Cervellin and Carlo Giuliani. This project was partially funded by Regione Veneto with Cod. progetto 2105/201/5/2215/2009, approved with DGR n. 2215, 21/07/2009, with Cod. progetto 2105/201/4/1268/2008, approved with DDR n. 112, 15/10/2008, with DGR 1966, 15/7/2008 “Sicurezza Integrata” and with DGR 2549, 4/8/2009 “Safe Home”.

References

[1] E. Aarts and B. de Ruyter. New research perspectives on ambient intelligence. Journal of Ambient Intelligence and Smart Environments, 1(1):5–14, 2009.

[2] A. Kofod-Petersen and A. Aamodt. Case-based reasoning for situation-aware ambient intelligence: A hospital ward evaluation study. Case-Based Reasoning Research and Development, pages 450–464, 2009.

[3] H. Nakashima, H. Aghajan, and J.C. Augusto. Handbook of Ambient Intelligence and Smart Environments. Springer, 2010.

[4] Y. Shang, H. Shi, and S.S. Chen. An intelligent distributed environment for active learning. Journal on Educational Resources in Computing (JERIC), 1(2es):4, 2001.

[5] J. Bohn, V. Coroama, M. Langheinrich, F. Mattern, and M. Rohs. Social, economic, and ethical implications of ambient intelligence and ubiquitous computing. Ambient Intelligence, pages 5–29, 2005.

[6] D. Preuveneers, J. Van den Bergh, D. Wagelaar, A. Georges, P. Rigole, T. Clerckx, Y. Berbers, K. Coninx, V. Jonckers, and K. De Bosschere. Towards an extensible context ontology for ambient intelligence. Ambient Intelligence, pages 148–159, 2004.

[7] S. M. Anzalone, E. Menegatti, E. Pagello, Y. Yoshikawa, H. Ishiguro, and A. Chella. Audio-video people recognition system for an intelligent environment. In Human System Interactions (HSI), 2011 4th International Conference on. IEEE, 2011.

[8] T. Kleinberger, M. Becker, E. Ras, A. Holzinger, and P. Muller. Ambient intelligence in assisted living: Enable elderly people to handle future interfaces. In Constantine Stephanidis, editor, Universal Access in Human-Computer Interaction. Ambient Interaction, volume 4555 of Lecture Notes in Computer Science, pages 103–112. Springer Berlin Heidelberg, 2007.

[9] S. Chernbumroong, S. Cang, A. Atkins, and H. Yu. Elderly activities recognition and classification for applications in assisted living. Expert Systems with Applications, 40(5):1662–1674, 2013.

[10] M. Mubashir, L. Shao, and L. Seed. A survey on fall detection: Principles and approaches. Neurocomputing, 100:144–152, 2013.

[11] F. Cardinaux, D. Bhowmik, C. Abhayaratne, and M.S. Hawley. Video based technology for ambient assisted living: A review of the literature. Journal of Ambient Intelligence and Smart Environments, 3(3):253–269, 2011.

[12] S. R. Lord. Falls in older people: risk factors and strategies for prevention. Cambridge University Press, 2007.

[13] S. Ghidoni, A. Pretto, and E. Menegatti. Cooperative tracking of moving objects and face detection with a dual camera sensor. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 2568–2573, May 2010.


[14] J. Van Hoof, H.S.M. Kort, P.G.S. Rutten, and M.S.H. Duijnstee. Ageing-in-place with the use of ambient intelligence technology: Perspectives of older users. International Journal of Medical Informatics, 80(5):310–331, 2011.

[15] M. Hossain and D. Ahmed. Virtual caregiver: An ambient-aware elderly monitoring system. Information Technology in Biomedicine, IEEE Transactions on, 16(6):1024–1031, 2012.

[16] J. C. Augusto, H. Nakashima, and H. Aghajan. Ambient intelligence and smart environments: A state of the art. In Handbook of Ambient Intelligence and Smart Environments, pages 3–31. Springer, 2010.

[17] A. Costa, J.C. Castillo, P. Novais, A. Fernandez-Caballero, and R. Simoes. Sensor-driven agenda for intelligent home care of the elderly. Expert Systems with Applications, 39(15):12192–12204, 2012.

[18] A. Saffiotti and M. Broxvall. PEIS ecologies: Ambient intelligence meets autonomous robotics. In Proceedings of the 2005 Joint Conference on Smart Objects and Ambient Intelligence: Innovative Context-Aware Services: Usages and Technologies, pages 277–281. ACM, 2005.

[19] M. Lohse, M. Repplinger, and P. Slusallek. Network-Integrated Multimedia Middleware, Services, and Applications. Dissertation, Department of Computer Science, Saarland University, Germany, 2005.

[20] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. B. Foote, J. Leibs, R. Wheeler, and A. Y. Ng. ROS: an open-source Robot Operating System. In ICRA Workshop on Open Source Software, 2009.

[21] P. Dollar, S. Belongie, and P. Perona. The fastest pedestrian detector in the west. In BMVC 2010, Aberystwyth, UK, 2010.

[22] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(4):743–761, 2012.

[23] M. Munaro, F. Basso, and E. Menegatti. Tracking people within groups with RGB-D data. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 2101–2107. IEEE, 2012.

[24] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.

[25] A.S. Georghiades, P.N. Belhumeur, and D.J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intelligence, 23(6):643–660, 2001.

[26] N. Noury, A. Fleury, P. Rumeau, A.K. Bourke, G.O. Laighin, V. Rialle, and J.E. Lundy. Fall detection – principles and methods. In Engineering in Medicine and Biology Society, 2007. EMBS 2007. 29th Annual International Conference of the IEEE, pages 1663–1666, Aug. 2007.

[27] H. Foroughi, B.S. Aski, and H. Pourreza. Intelligent video surveillance for monitoring fall detection of elderly in home environments. In Computer and Information Technology, 2008. ICCIT 2008. 11th International Conference on, pages 219–224. IEEE, 2008.

[28] H. Nait-Charif and S.J. McKenna. Activity summarisation and fall detection in a supportive home environment. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 4, pages 323–326, Aug. 2004.

[29] M.J. Khan and H.A. Habib. Video analytic for fall detection from shape features and motion gradients. Lecture Notes in Engineering and Computer Science, 2179:1311–1316, 2009.

[30] S.-G. Miaou, P.-H. Sung, and C.-Y. Huang. A customized human fall detection system using omni-camera images and personal information. In Distributed Diagnosis and Home Healthcare, 2006. D2H2. 1st Transdisciplinary Conference on, pages 39–42. IEEE, 2006.

[31] G. Mastorakis and D. Makris. Fall detection system using Kinect's infrared sensor. Journal of Real-Time Image Processing, pages 1–12, 2012.

[32] F. Bimbot, J.F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacretaz, and D.A. Reynolds. A tutorial on text-independent speaker verification. EURASIP Journal on Applied Signal Processing, pages 430–451, 2004.

[33] D. Salvati, A. Roda, S. Canazza, and G. L. Foresti. A real-time system for multiple acoustic sources localization based on ISP comparison. In H. Pomberger, F. Zotter, and A. Sontacchi, editors, Proc. of the 13th Int. Conference on Digital Audio Effects - DAFx-10, pages 201–208, Graz, Austria, September 2010.

[34] G. Grisetti, C. Stachniss, and W. Burgard. Improved techniques for grid mapping with Rao-Blackwellized particle filters. Robotics, IEEE Transactions on, 23(1):34–46, Feb. 2007.

[35] E. Marder-Eppstein, E. Berger, T. Foote, B. Gerkey, and K. Konolige. The office marathon: Robust navigation in an indoor office environment. In International Conference on Robotics and Automation, 2010.

[36] M. Munaro, F. Basso, S. Michieletto, E. Pagello, and E. Menegatti. A software architecture for RGB-D people tracking based on ROS framework for a mobile robot. In Sukhan Lee, Kwang-Joon Yoon, and Jangmyung Lee, editors, Frontiers of Intelligent Autonomous Systems, volume 466 of Studies in Computational Intelligence, pages 53–68. Springer Berlin Heidelberg, 2013.

[37] Z. Zhang. A flexible new technique for camera calibration. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(11):1330–1334, 2000.

[38] D. Scaramuzza, A. Martinelli, and R. Siegwart. A toolbox for easily calibrating omnidirectional cameras. In Intelligent Robots and Systems, 2006 IEEE/RSJ International Conference on, pages 5695–5701, Oct. 2006.
