
BRAVE
BRidging gaps for the adoption of Automated VEhicles
No 723021

D4.4 Model for the prediction of VRUs intentions

Lead Author: Miguel Ángel Sotelo (UAH)
With contributions from: Rubén Izquierdo, Sandra Carrasco, Carlota Salinas, and Javier Alonso (UAH)

Deliverable nature: Report (R)
Dissemination level (Confidentiality): Public (PU)
Contractual delivery date: 31-May-2020
Actual delivery date: 28-May-2020
Version: 1.0
Total number of pages: 41
Keywords: predictions of intentions, pedestrians, cyclists, vulnerable road users

Ref. Ares(2020)2771430 - 28/05/2020


    Abstract

According to several reports published by worldwide organizations, thousands of pedestrians and cyclists (so-called Vulnerable Road Users, or VRUs for short) die in road accidents every year. Due to this fact, vehicular technologies have been evolving with the intention of reducing these fatalities. This evolution has not finished yet, since the prediction of VRU paths could improve current automatic emergency braking systems. For this reason, methods for predicting future VRU paths, poses, and intentions have been developed in the framework of the BRAVE project. This document provides a detailed description of the models developed for classifying VRU activities and for predicting their intentions (crossing, not crossing, etc.). Results are presented in a critical way together with the main conclusions and findings for future research in this field.



Executive summary

Due to the high number of road fatalities, during the last few years vehicles and infrastructures have been evolving to become intelligent machines with advanced technologies such as Assistive Intelligent Transportation Systems (AITS), Pedestrian Protection Systems (PPS), or other sorts of Advanced Driver-Assistance Systems (ADAS). Improving these technological advances is imperative because the longer the braking initiation time, the higher the impact speed and thus the injury risk. In fact, an adult has less than a 20% risk of dying if struck by a car at less than 50 km/h but almost a 60% risk of dying if hit at 80 km/h. Hence, a precise assessment of the current and future pedestrian positions and an early detection of people entering a road lane is a major challenge in order to increase the effectiveness of Automatic Emergency Braking (AEB) systems as well as other ADAS. Similarly, an early recognition of pedestrians' and cyclists' intentions can lead to much more accurate active interventions in last-second automatic manoeuvres. For these reasons, in the last few years a lot of effort has been put into recognizing Vulnerable Road User (VRU) activities and predicting their trajectories and intentions.

From the drivers' perspective, it is essential to infer the intentions of other road users (vehicles, pedestrians, cyclists, etc.) in order to anticipate their behaviour. There are many ways for drivers and cyclists to communicate their intentions, such as the use of blinkers (for vehicles) and the use of their arms to point in the intended direction of travel (for cyclists). However, in everyday traffic, a good deal of road users fail to use these means to communicate their intentions, or they simply forget or prefer not to use such active forms of communication. Therefore, other visual cues have to be used to anticipate critical encounters or events that might lead to potentially dangerous situations. This applies to all VRUs, comprising pedestrians, cyclists, and other two-wheelers, but it is especially critical for pedestrians, given that they are not expected to use any active form of communication when interacting with vehicles and other road users.

One of the major challenges concerning the interaction between Autonomous Vehicles and VRUs is to detect risky situations and react accordingly in order to avoid or mitigate collisions. Thus, an Autonomous Vehicle has to be capable of understanding the behaviour and intentions of VRUs as well as estimating their future motion. Typically, this is addressed by looking at their past and current behaviour, including their dynamics, activity and context, and providing an appropriate model to extrapolate their future positions. The problem is often treated as tracking a dynamic object by considering the changes in position, velocity, and orientation, and extrapolating the observed dynamics. However, this approach is subject to error for large look-ahead times due to the highly dynamic and nonlinear nature of VRU motion and the diversity and unpredictability of their behaviour. Predicting the trajectory and short-term destination of VRUs is closely related to predicting their intentions. For instance, a pedestrian walking along the sidewalk in the same direction of travel as our car, while looking back repeatedly at the oncoming rear traffic, might be giving a clear indication of an intention to cross the street in the next few seconds.

As a consequence, BRAVE is developing advanced methods for providing accurate predictions of Vulnerable Road Users' (VRUs) intentions and trajectories. This task is complex, given that it requires a deep understanding of the contextual situation, not only of kinematic data. For such purpose, several datasets containing representative data for VRUs have been used as the grounds for developing the predictive algorithms that are required for this task. The prediction of pedestrians' intentions has been developed in two incremental steps. In the first step, only body pose features have been considered to provide predictions over a time horizon of 1 s. In a second step, a context-based approach has been developed and tested in order to extend the ability to anticipate critical situations. Such an approach is based on a combination of CNNs and RNNs, yielding very accurate results in predictive tasks. This document provides a detailed description of the models developed for classifying VRU activities and for predicting their intentions (crossing, not crossing, etc.). Results are presented in a critical way together with the main conclusions and findings for future research in this field.


    Document Information

IST Project Number: 723021    Acronym: BRAVE
Full Title: BRidging gaps for the adoption of Automated VEhicles
Project URL: http://www.brave-project.eu/
EU Project Officer: Damian Bornas Cayuela

Deliverable: Number D4.4    Title: Model for the prediction of VRUs intentions
Work Package: Number WP4    Title: Vehicle-environment interaction concepts, enhancing ADAS
Date of Delivery: Contractual M36, Actual M36
Status: version 1.0    Nature: Report    Dissemination level: Public

Authors (Partner): University of Alcalá (UAH)
Responsible Author: Miguel Ángel Sotelo    E-mail: [email protected]
Partner: UAH    Phone: +34 91 885 6573

Abstract (for dissemination):

According to several reports published by worldwide organizations, thousands of pedestrians and cyclists (so-called Vulnerable Road Users, or VRUs for short) die in road accidents every year. Due to this fact, vehicular technologies have been evolving with the intention of reducing these fatalities. This evolution has not finished yet, since the prediction of VRU paths could improve current automatic emergency braking systems. For this reason, methods for predicting future VRU paths, poses, and intentions have been developed in the framework of the BRAVE project. This document provides a detailed description of the models developed for classifying VRU activities and for predicting their intentions (crossing, not crossing, etc.). Results are presented in a critical way together with the main conclusions and findings for future research in this field.

Keywords: predictions of intentions, pedestrians, cyclists, vulnerable road users

Version Log (Issue Date / Rev. No. / Author / Change):
1st May, 2020    0.1    Miguel Ángel Sotelo    Initial version
18th May, 2020   0.2    Miguel Ángel Sotelo    Internal UAH review
28th May, 2020   1.0    Miguel Ángel Sotelo    BRAVE internal review


Table of Contents

Executive summary ........ 3
Document Information ........ 4
Table of Contents ........ 5
Abbreviations ........ 6
1  Introduction ........ 7
2  VRU Activity Recognition ........ 9
  2.1  Homogenization of VRUs datasets ........ 9
  2.2  VRU detection performance comparison ........ 10
  2.3  Feature-based activity recognition ........ 10
  2.4  Experimental results ........ 12
    2.4.1  Activity recognition results ........ 12
    2.4.2  VRU detection results ........ 14
    2.4.3  Pose methods comparison results ........ 17
  2.5  Discussion ........ 18
3  Predictive systems ........ 21
  3.1  Pose-based pedestrian predictive system ........ 21
    3.1.1  Dataset description ........ 22
    3.1.2  Learning Pedestrian Activities ........ 23
    3.1.3  Activity recognition ........ 25
    3.1.4  Path, pose and intention prediction ........ 26
    3.1.5  Results ........ 26
  3.2  Pedestrian predictive system based on contextual cues ........ 30
    3.2.1  System description ........ 30
    3.2.2  Experimental setup ........ 32
    3.2.3  Results ........ 34
4  Final remarks ........ 38
5  References ........ 39


    Abbreviations

    ADAS: Advanced Driver Assistance Systems

    AEB: Automated Emergency Braking

    AI: Artificial Intelligence

    AITS: Assistive Intelligent Transportation Systems

    BCE: Binary Cross Entropy

BDLSTM: Bidirectional Long Short-Term Memory

    BDGRU: Bidirectional Gated Recurrent Unit

B-GPDM: Balanced Gaussian Process Dynamical Model

    CNN: Convolutional Neural Network

    DRIVERTIVE: DRIVER-less cooperaTIVE vehicle

    FHD: Full High Definition

    FPS: Frames per second

GPDM: Gaussian Process Dynamical Model

    GPS: Global Positioning System

    HD: High Definition

    HMM: Hidden Markov Model

    IoU: Intersection Over Union

    MED: Mean Euclidean Distance

    PCA: Principal Components Analysis

    PPS: Pedestrian Protection System

    RMSE: Root Mean Square Error

    RNN: Recurrent Neural Network

ROC: Receiver Operating Characteristic

    TTA: Time-To-Action

    TTC: Time-To-Curb

    TTE: Time-To-Event

    TTS: Time-To-STOP

    V2V: Vehicle-to-Vehicle

    V2VRU: Vehicle-to-Vulnerable Road User

    VRU: Vulnerable Road User


1 Introduction

Current advances in autonomous vehicles and active safety systems have demonstrated autonomy and safety in a wide set of driving scenarios. SAE Levels 3 and 4 have been achieved in some pre-defined areas under certain restrictions. In order to improve the level of safety and autonomy, self-driving cars need to be endowed with the capacity of anticipating potential hazards, which involves a deeper understanding of the complex driving behaviours of Vulnerable Road Users (VRUs) and other human-driven vehicles, including inter-vehicle interactions.

Due to the high number of road fatalities, during the last few years vehicles and infrastructures have been evolving to become intelligent machines with advanced technologies such as Assistive Intelligent Transportation Systems (AITS), Pedestrian Protection Systems (PPS), or other sorts of Advanced Driver-Assistance Systems (ADAS). Improving these technological advances is imperative because the longer the braking initiation time, the higher the impact speed and thus the injury risk. In fact, an adult has less than a 20% risk of dying if struck by a car at less than 50 km/h but almost a 60% risk of dying if hit at 80 km/h. Hence, a precise assessment of the current and future pedestrian positions and an early detection of people entering a road lane is a major challenge in order to increase the effectiveness of Automatic Emergency Braking (AEB) systems. Similarly, an early recognition of pedestrians' and cyclists' intentions can lead to much more accurate active interventions in last-second automatic manoeuvres. For these reasons, in the last few years a lot of effort has been put into recognizing Vulnerable Road User (VRU) activities and predicting their trajectories and intentions.

From the drivers' perspective, it is essential to infer the intentions of other road users (vehicles, pedestrians, cyclists, etc.) in order to anticipate their behaviour. There are many ways for drivers and cyclists to communicate their intentions, such as the use of blinkers (for vehicles) and the use of their arms to point in the intended direction of travel (for cyclists). However, in everyday traffic, a good deal of road users fail to use these means to communicate their intentions, or they simply forget or prefer not to use such active forms of communication. Therefore, other visual cues have to be used to anticipate critical encounters or events that might lead to potentially dangerous situations. This applies to all VRUs, comprising pedestrians, cyclists, and other two-wheelers, but it is especially critical for pedestrians, given that they are not expected to use any active form of communication when interacting with vehicles and other road users.

One of the major challenges concerning the interaction between Autonomous Vehicles and VRUs is to detect risky situations and react accordingly in order to avoid or mitigate collisions. Thus, an Autonomous Vehicle has to be capable of understanding the behaviour and intentions of VRUs as well as estimating their future motion. Typically, this is addressed by looking at their past and current behaviour, including their dynamics, activity and context, and providing an appropriate model to extrapolate their future positions. The problem is often treated as tracking a dynamic object by considering the changes in position, velocity, and orientation, and extrapolating the observed dynamics. However, this approach is subject to error for large look-ahead times due to the highly dynamic and nonlinear nature of VRU motion and the diversity and unpredictability of their behaviour. Predicting the trajectory and short-term destination of VRUs is closely related to predicting their intentions. For instance, a pedestrian walking along the sidewalk in the same direction of travel as our car, while looking back repeatedly at the oncoming rear traffic, might be giving a clear indication of an intention to cross the street in the next few seconds. Something similar could be said about a cyclist in front of our vehicle (in the same direction of travel) turning his/her head to the left in order to explore the surrounding vehicles. As drivers, we might interpret this visual cue as an indication of the cyclist's intention to start a left turn, even if the cyclist does not use his/her arms to signal such a manoeuvre. The complexity and variety of VRUs' movements and manoeuvres is extremely high, with notable differences between pedestrians, cyclists, and motorcyclists in terms of dynamics and use cases.

Even for the specific case of pedestrians, there are significant differences between standing pedestrians and sitting pedestrians in terms of situation analysis and road safety implications. For example, a sitting pedestrian enjoying dinner at a terrace on a summer evening represents a very different situation from that of a pedestrian standing at the road curb looking in the direction of crossing. In principle, it is safe to say that the second one must be watched more carefully in terms of road safety. Accordingly, the mere extrapolation of the observed dynamics is a weak indicator of the collision threat.

For pedestrians, it has been shown that the intention to cross and the future motion can be inferred from more cues than just pure dynamics. As a matter of fact, context has been shown to play an important role. By context we consider different factors such as the pedestrian's distance to the curb, the relative gap between the pedestrian and the vehicle, the presence of traffic signals and their state, the presence of zebra crossing lines, the structure of the street, the road width, etc. However, at the point of crossing, in more than 90% of the cases pedestrians use some form of attention mechanism to communicate their intention of crossing, with looking in the direction of the approaching vehicles (awareness) being the most prominent form of attention. Other forms of explicit communication such as nodding or hand gestures are observed in many cases, but awareness is one of the strongest indicators of crossing intention. Pedestrian awareness is usually measured by the pedestrian's head orientation relative to the vehicle, a very relevant feature that is not easy to measure accurately in practice.

In the context of the BRAVE project, a preliminary VRU activity recognition step is proposed in order to support the subsequent prediction phases. For such purpose, VRUs are categorized into four different classes, namely Standing Pedestrian, Sitting Person, Cyclist, and Motorcyclist. This prior knowledge eases the prediction steps by providing information about the appropriate dynamics model and assumptions to use in the different use cases. The activity recognition step allows approaching the prediction problem in a hierarchical fashion. The next step is to develop advanced methods for providing accurate predictions of Vulnerable Road Users' (VRUs) intentions and trajectories. This task is complex, given that it requires a deep understanding of the contextual situation, not only of kinematic data. For such purpose, several datasets containing representative data for VRUs have been used as the grounds for developing the predictive algorithms that are required for this task. The general scheme is provided in Figure 1. This document provides a detailed description of the models developed for classifying VRU activities and for predicting their intentions (crossing, not crossing, etc.). Results are presented in a critical way together with the main conclusions and findings for future research in this field.

    Fig. 1. Overview of the proposed methodology.


2 VRU Activity Recognition

VRU activity recognition is regarded as a preliminary step to discriminate the different types of VRUs (pedestrians, cyclists, motorcyclists, etc.) with a view to further focusing the attention on them. Different types of VRUs require different approaches. As previously mentioned, a sitting pedestrian does not pose the same kind of challenge to a moving vehicle as a pedestrian standing at the curb and looking for eye contact with the driver. VRU activity recognition can be best carried out by analysing the body features of all the persons in the road scene (head, hips, knees, shoulders, etc.), as acquired by a camera. This approach requires the development of a multi-person pose-estimation system.

Multi-person pose-estimation approaches and their corresponding datasets were not originally developed for predicting VRUs' intentions. Moreover, in many cases, they were not even developed for a specific final application. The absence of a common target has led every particular implementation to diverge from the others in terms of input/output data formats, the definition of what a "standard VRU" is (number of joints, selection of joints, order of the joints), or even what to consider inside the VRU group (sitting people, cyclists, bikers, etc.). In order to compare their potential for predicting the intentions of VRUs, a previous harmonization of the available datasets is mandatory. This was done and documented in Deliverable 4.2 of this project. For the sake of self-containment, the methodology for dataset homogenization is briefly described in the next section.

    2.1 Homogenization of VRUs datasets 

The first issue regarding the homogenisation of the different available VRU datasets was to find a common labelling format for a VRU detected in a given image. While in the datasets a detection normally consists of a bounding box, some of the multi-person pose estimators do not provide a bounding box for the candidate but a list of body joints. In order to be able to compare their performance as person detectors, these lists of joints had to be converted to bounding boxes. To do so, a fully connected neural network was trained using the keypoints as inputs and manually labelled bounding boxes as ground truth (see Figure 2).
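The deliverable does not detail this network, so the following is only a minimal sketch of such a keypoints-to-bounding-box regressor, assuming a PyTorch implementation; the layer sizes, loss and optimizer are illustrative assumptions rather than the project's settings.

    import torch
    import torch.nn as nn

    class KeypointsToBox(nn.Module):
        """Regress a box (x_min, y_min, x_max, y_max) from 18 (x, y) keypoints."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(36, 64), nn.ReLU(),
                nn.Linear(64, 64), nn.ReLU(),
                nn.Linear(64, 4),
            )

        def forward(self, keypoints):        # keypoints: (batch, 36) flattened (x, y) pairs
            return self.net(keypoints)

    model = KeypointsToBox()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # optimizer choice assumed
    criterion = nn.SmoothL1Loss()
    # Training loop sketch: keypoints as inputs, manually labelled boxes as targets.
    # for keypoints, gt_boxes in loader:
    #     loss = criterion(model(keypoints), gt_boxes)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()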

Fig. 2. Examples of key-point extraction (shown as a skeleton with different colours for different body parts) and their associated bounding boxes (in red): (a) walking person; (b) cyclist; (c) sitting person.

The second issue was to find a common definition for the labelled VRU classes in the different datasets. The most common partition among VRUs is to consider persons, cyclists, and motorcyclists. However, some of the datasets combine cyclists and motorcyclists, and some others split the person class into pedestrian, group, and unusual posture. As previously stated, the goal is to carry out a preliminary VRU activity recognition step in order to support the subsequent prediction phases. For such purpose, all the classes in the original datasets have been mapped onto four different classes, namely Standing Pedestrian, Sitting Person, Cyclist, and Motorcyclist. The underlying hypothesis is that this prior knowledge will ease the prediction steps by providing information about the appropriate dynamics model and assumptions to use in the different use cases.

    2.2 VRU detection performance comparison 

Most of the multi-person pose estimation approaches have been designed in two steps: first, a person detection Convolutional Neural Network (CNN) that performs bounding-box- or segmentation-based detection, and second, a key-point and pose extraction CNN that operates on the previously detected areas. Some methods skip the first detection step and look for keypoints in the whole image, relying on the key-point extraction step for the detection of the candidates. Our objective is to test how the different approaches affect their capability to detect VRUs in the sequences, regardless of the precision of the key-point extraction step, which will be analysed later on. For this study, two top-performing state-of-the-art methods trained on a generic dataset have been selected: Mask R-CNN [2] (using the Detectron framework [3]) and YOLOv3 [4]. Neither was specifically developed for VRUs, both having been trained on the COCO [1] dataset. In addition, a third method, OpenPose [5], has been used; it is not based on a previous detection of the candidates and provides feature detections (body parts) straightaway. As previously explained, to compare their VRU detection performance a conversion from keypoints to bounding box detections is needed for OpenPose. A fully connected neural network was trained to perform regression with the (x, y) coordinates of the 18 keypoints detected by OpenPose as inputs, and the ground truth bounding box coordinates as targets. The data association between the detected keypoints and the ground truth was performed using the Intersection-over-Union (IoU) of the ground truth bounding box and the bounding box that frames all the keypoints.
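As an illustration of this association step, the sketch below builds the box that frames all detected keypoints and matches it to the ground-truth box of highest IoU; it is a generic reconstruction, not the project code, and the 0.5 acceptance threshold is an assumption.

    import numpy as np

    def keypoints_frame_box(kps):
        """Bounding box framing all detected keypoints. kps: (N, 2) array of (x, y)."""
        x_min, y_min = kps.min(axis=0)
        x_max, y_max = kps.max(axis=0)
        return np.array([x_min, y_min, x_max, y_max])

    def iou(box_a, box_b):
        """Intersection-over-Union of two boxes in (x_min, y_min, x_max, y_max) format."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def associate(detections_kps, gt_boxes, min_iou=0.5):
        """Pair each keypoint detection with the ground-truth box of highest IoU."""
        pairs = []
        for kps in detections_kps:
            box = keypoints_frame_box(kps)
            scores = [iou(box, gt) for gt in gt_boxes]
            best = int(np.argmax(scores))
            if scores[best] >= min_iou:
                pairs.append((kps, gt_boxes[best]))
        return pairs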

    2.3 Feature-based activity recognition 

The Activity Recognition step classifies VRUs into four different classes to support the prediction phase. The proposed system is composed of two parts: a detection CNN and a classification CNN. For the detection, a Mask R-CNN model with a ResNeXt-101-64x4d backbone is used. For classification, the following models (pretrained on ImageNet [6]) have been tested:

- SqueezeNet [7] (213.6 fps).
- AlexNet [8] (197.9 fps).
- GoogleNet [9] (183.0 fps).
- VGG-16 [10] (69.9 fps).
- ResNet-50 [11] (87.8 fps).
- ResNet-101 [11] (61.3 fps).

The hyperparameters used in training were the following (a minimal fine-tuning sketch using them is provided after the list):

- Fixed learning rate: 0.001.
- Maximum number of epochs: 10.
- Minibatch size: 64 for AlexNet and SqueezeNet, and 32 for the rest of the models, due to memory limitations.
- Training method: Stochastic Gradient Descent with Momentum.
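The sketch below shows how these hyperparameters would be used for fine-tuning one of the classifiers, assuming PyTorch/torchvision; the momentum value, the weights argument and the way the classifier head is replaced are illustrative assumptions rather than the project's code.

    import torch
    import torch.nn as nn
    import torchvision

    NUM_CLASSES = 4  # standing pedestrian, sitting person, cyclist, motorcyclist

    model = torchvision.models.vgg16(weights="IMAGENET1K_V1")   # ImageNet-pretrained backbone
    model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CLASSES)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # momentum assumed
    criterion = nn.CrossEntropyLoss()

    # for epoch in range(10):                   # maximum number of epochs: 10
    #     for images, labels in train_loader:   # minibatch size 32 (64 for AlexNet/SqueezeNet)
    #         optimizer.zero_grad()
    #         loss = criterion(model(images), labels)
    #         loss.backward()
    #         optimizer.step()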


Prior to the training phase, the dataset was augmented by adding new samples to compensate for class imbalance. As can be seen in Table I, the initial class distribution underrepresented the motorcyclist and sitting person classes. In addition, traditional data augmentation was introduced in the training phase. 1) Adding new samples: New samples extracted from the Internet were manually added to the underrepresented classes, yielding 7558 and 7848 new samples for the motorcyclist and sitting person classes, respectively. To facilitate the extraction of new samples, the previously mentioned Mask R-CNN was used as a candidate extractor. After processing the data, all the detected samples were manually validated. Person and motorcycle detections were merged during manual validation given that the detector did not have a motorcyclist class. Some examples of the selected samples can be seen in Figure 3.

    Table I. Initial class distribution in selected datasets.

     

    Fig. 3. Examples of motorcyclist (lower row) and sitting people (upper row) classes extracted from the Internet.

     

The motorcyclist class did not exist in the CityPersons dataset either; however, several motorcyclists are labelled as cyclists. In this case, their class was changed, obtaining 259 new motorcyclist samples. 2) Data augmentation: Class-selective data augmentation was incorporated for the underrepresented classes in order to further balance the class representation. Data augmentation has also been proved to enhance generalization. Three transformations were randomly taken from the following list (a sketch of such an augmentation pipeline is provided after the list):

- Scaling by a factor in the [0.8, 1.2] range along the x and y axes.
- Translation in the [-0.2, 0.2] range (as a fraction of the image size) along the x and y axes.
- Rotation in the [-25, 25] degree range.
- Colour channel addition of up to 30.
- Random cropping of pixels in the [0, 20] interval.
- Horizontal flip with a 50% probability.
- Gaussian blur with standard deviation in the [0, 3.0] interval.
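The deliverable does not name the augmentation library; the sketch below expresses the seven transformations with imgaug, randomly applying three of them per sample. The parameter values mirror the list above, while the library choice is an assumption.

    import imgaug.augmenters as iaa

    # Randomly apply three of the listed transformations to each training image.
    augmenter = iaa.SomeOf(3, [
        iaa.Affine(scale=(0.8, 1.2)),                                   # scaling in x and y
        iaa.Affine(translate_percent={"x": (-0.2, 0.2), "y": (-0.2, 0.2)}),
        iaa.Affine(rotate=(-25, 25)),                                   # rotation in degrees
        iaa.Add((0, 30)),                                               # colour channel addition
        iaa.Crop(px=(0, 20)),                                           # random cropping
        iaa.Fliplr(0.5),                                                # horizontal flip, p = 0.5
        iaa.GaussianBlur(sigma=(0.0, 3.0)),                             # Gaussian blur
    ])

    # images_aug = augmenter(images=batch_of_numpy_images)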

    The resulting equalized dataset distribution can be seen in Table II. Results for the classifiers are provided in the next section.

    Table II. Final class distribution in selected datasets.  

    2.4 Experimental results

In this section, the detection performance of different multi-person pose estimation approaches on different datasets is analysed. The impact of the dataset design on a VRU detection system is also evaluated. In addition, a preliminary analysis of an activity recognition CNN is presented.

2.4.1 Activity recognition results

The average F1-score is used as the main performance metric. As expected, the equalisation of the number of samples per class significantly improves the results for the underrepresented classes (see Table III for the equalised dataset and Table IV for the imbalanced data). As can be observed, underrepresented classes exhibit a lower score. For this relatively simple classification problem, deeper models did not improve the results, with VGG-16 being the best. Figures 4 and 5 show the VGG-16 confusion matrices for the non-equalised and equalised datasets, respectively. After the equalisation, both precision and recall for the underrepresented classes increase to over 90%. Some examples of correctly classified samples are depicted in Figure 6.
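For reference, the macro-averaged F1-score used here is simply the per-class F1 (harmonic mean of precision and recall) averaged over the four classes; the numbers in the snippet below are illustrative placeholders, not values from Tables III/IV.

    # Macro-averaged F1-score from per-class precision and recall.
    def f1(precision, recall):
        return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    # Illustrative numbers only (not taken from the tables).
    per_class = {"standing": (0.95, 0.96), "sitting": (0.91, 0.90),
                 "cyclist": (0.93, 0.92), "motorcyclist": (0.90, 0.92)}
    macro_f1 = sum(f1(p, r) for p, r in per_class.values()) / len(per_class)
    print(round(macro_f1, 3))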

    Table III. F1-score equalized dataset.


    Table IV. F1-score imbalanced dataset.

    Fig. 4. Activity recognition results with VGG-16 with unbalanced classes.

    Fig. 5. Activity recognition results with VGG-16 in the equalized dataset.


Fig. 6. Activity recognition results with VGG-16 and extracted pose (OpenPose). From the upper row to the lower row: pedestrians, sitting people, motorcyclists and cyclists.

2.4.2 VRU detection results

In order to compare the results from the different VRU detectors, the same metrics used in the KITTI Detection Challenge are adopted. In this challenge, difficulty levels are established depending on the size of the target's bounding box. Occlusions and direction have not been considered, as they are not labelled in all datasets. OpenPose has been evaluated without any threshold restriction on the detection confidence. This is because the generated keypoint confidences have very low values, since poses are usually incomplete and the missing keypoints introduce null confidences. The no-threshold condition contributes to an increase in false positives. Figures 7, 8, 9, and 10 show the precision-recall curves for the different datasets and difficulties, while Table V shows comparative AP (Average Precision) detection results for the CityPersons [12], KITTI [13], INRIA [14], and Tsinghua [15] datasets.
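For clarity, the sketch below shows a generic way of computing AP as the area under a precision-recall curve using the precision envelope; it is not the exact KITTI evaluation code, which uses a fixed-point interpolation.

    import numpy as np

    def average_precision(recall, precision):
        """AP as the area under the precision-recall curve (precision envelope)."""
        r = np.concatenate(([0.0], recall, [1.0]))
        p = np.concatenate(([0.0], precision, [0.0]))
        # Make precision monotonically decreasing from right to left.
        for i in range(len(p) - 2, -1, -1):
            p[i] = max(p[i], p[i + 1])
        # Sum the rectangle areas where recall changes.
        idx = np.where(r[1:] != r[:-1])[0]
        return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))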

    Table V. AP detection results.

Mask R-CNN obtains the best mean Average Precision (mAP) in almost every dataset and difficulty. The YOLOv3 results are close to it, with the advantage of being a faster model, useful for real-time applications. The OpenPose results are only comparable on the INRIA dataset, which is the smallest and easiest one. It is important to note that those methods were trained on the COCO dataset, a general object detection dataset. In addition, the influence of the bounding box labelling method is noticeable. In CityPersons, which uses an aspect-ratio labelling approach, YOLOv3 and Mask R-CNN obtain the results shown in Figure 7. These results are worse than on the rest of the datasets (the YOLOv3 result is comparable to OpenPose). In those other cases, the approach is different, and the bounding boxes are labelled tightly around the objects, as in the COCO dataset. However, even with this disadvantage, Mask R-CNN obtains the best results in this case. For the KITTI dataset, the progressive performance decrease shown in Figures 8 and 9 was obtained. While on the easy subset (Figure 8) the results are similar for the three methods, on the hard subset Mask R-CNN is the least affected one, keeping its recall value nearly intact. YOLOv3 and OpenPose, even while maintaining the same precision values, see their recall decrease, enlarging the gap with respect to Mask R-CNN.

    Fig. 7. Cityscapes detection results (hard IoU >=0.5).

The need for safety imposes three main requirements on detection methods:

- Prioritizing the number of detections over their precision (without ignoring the false positive proportion).
- Anticipating detections as early as possible, i.e. the detection of small objects is important.
- Having pragmatic computational requirements, which ensures their use in real-time scenarios with the limited computing capacity inside the vehicle.

OpenPose does not satisfy these requirements, unlike the other methods, which, despite being trained on generalist detection datasets, satisfy them better, with a special mention for Mask R-CNN, which maintains a nearly constant recall across all difficulty sets. Some examples of the person and pose detection results on the different datasets are provided in Figure 11.

    Fig. 8. KITTI detection results (Easy IoU >=0.5).

    Fig. 9. KITTI detection results (Hard IoU >=0.5).


    Fig. 10. INRIA detection results (Hard IoU >=0.5).

    Fig. 11. Person and pose detection results.

    2.4.3 Pose methods comparison results

Figure 12 depicts the results of the different person keypoint detection methods for the same image. As can be observed, there are slight differences concerning the keypoint locations. Concerning the number of joints detected by each method, OpenPose intrinsically discards joints detected with low confidence, whereas YOLOv3 and Mask R-CNN do not filter low-confidence keypoints.


    Fig. 12. Example of keypoints detected by: a) OpenPose; b) YOLOv3; c) Mask-R-CNN.

    2.5 Discussion

Obtaining visual data and labelling it in order to train supervised methods is one of the most time-consuming tasks the Artificial Intelligence (AI) community has to face. Most of the existing automated driving datasets, including the ones used in this project, lack some crucial features because they are focused on detection or segmentation problems. There are some good examples of datasets more focused on prediction, including the KITTI object tracking benchmark, the JAAD dataset [16] and the Daimler Pedestrian Path Prediction Benchmark Dataset [17]. Individually they have some shortcomings for this task but, as a group, they gather most of the characteristics needed to advance in prediction tasks such as VRU intention awareness. As a conclusion, a list of desirable features that a dataset for this kind of task should include is provided below:

- Tracking information: both in image and in three-dimensional coordinates.
- Multi-person situations: in the Daimler dataset there are only single-person sequences. This requirement aids in the study of the influence of group behaviour on VRU intention.
- Naturalistic scenarios: it is important to record situations based on reality, instead of relying on actors. The reason is the higher certainty of the method's functionality in the real world.
- Multiple sensor data: single camera, stereo camera, LiDAR data, GPS data, etc.
- Context information: it would be an important improvement to include context information of the scene, such as traffic signals, zebra crossings, traffic lights, curbs, etc., similar to the JAAD dataset, in order to identify the key areas that affect VRU intention.
- Multiple labelling types: including bounding box, segmentation ground truth, keypoints, etc.
- Scene diversity: scenes with different weather, in day and night conditions, located in cities and villages and from different countries.
- Bounding boxes adjusted to the VRU: including all body parts and also including the means of transport in the corresponding classes. On the other hand, the aspect-ratio convention would be useful in cases with occlusion.
- Difficulty levels: criteria to separate sets based on difficulty, given that not all detections must be achieved at the same level in order to measure the performance of the system. These criteria could include:
  o Occlusion: including the visible percentage of the VRU, as in CityPersons. In KITTI there is a discrete approach with four visibility levels.
  o Bounding box size of the VRU, which is directly calculated from the ground truth, similar to KITTI.
- Balance between samples of each class: in the most used datasets, the sitting person class is not considered at the same level as the cyclist or pedestrian classes. In our opinion, this class must be at the same level of importance as the others. This is important given that detecting sitting persons can be used to tell apart different risk levels for VRUs. This means that an Autonomous Vehicle must be able to detect every person in the scene, while taking special care with standing people, who can move faster and interact earlier with the vehicle than sitting persons. This difference could help to increase the performance of real-time systems.

    After using and observing all datasets used in the experiments described in this deliverable, we can summarize a list of issues that must be solved in future research in order to improve the data quality:

    - CityPersons includes the motorcyclist as part of the cyclist class. Clearly, both classes must be split. This is important because the kinematics of these classes are different and must be taken into account in a different way in order to improve the prediction stages.

- INRIA labels correspond to a specific class: pedestrians without occlusion. This could cause an increase in the number of false positives, creating an artificial decrease of precision for the detectors. Tsinghua also has this problem, because in its training set only cyclists are labelled and, consequently, it cannot be used to test the generalization capacity of person detectors.

At the VRU detector performance level, several issues have been identified. Those must be taken into account before such detectors are used in Autonomous Driving scenarios. Person detection systems are prone to detecting people in wrong places, such as in traffic signals (Figures 13f, 14b and 15b), in vehicles (Figures 13a, 13b, 13c and 14a), in street posters (Figures 13d, 13e and 15d), in advertising on vehicles (Figure 15a) or in urban furniture (Figures 14d and 15c). Although some of these cases would be discarded due to the size of the VRU (such as the large VRUs on street posters), some of them could involve risky situations. Above all, if the detection corresponds to a moving vehicle or appears on a road signal, these false positives can be perceived as VRUs suddenly appearing and intersecting the Autonomous Vehicle trajectory.

    Fig. 13. OpenPose failed detection examples: Detected persons in vehicles (a), (b) and (c), street poster (d), (e) or in traffic signal (f).


    Fig. 14. AlphaPose failed detection examples: Detected persons on vehicles (a), on traffic signals (b) (c) or on urban furniture (d).

    Fig. 15: Mask R-CNN failed detection examples: Detected persons in the publicity of vehicles (a), in traffic signals (b), in urban furniture (c) or in street poster (d).

As a conclusion of this section, the potential of recent multi-person pose estimation approaches for VRU intention prediction has been explored by performing an exhaustive comparison of the best performing state-of-the-art multi-person pose estimation approaches. Sound conclusions about the optimal data format and typology required for VRU datasets have been described, this being one of the contributions of the proposed methodology. Also, as context has been shown to play an important role in the inference of VRU intentions, a preliminary VRU activity recognition step has been proposed in order to support the subsequent prediction phases. Some findings that could lead to improving these results in the future are described below:

- Explore new datasets specific to VRU prediction and add new useful classes (e.g. groups of people) to improve the prediction and safety of VRUs.

- Test new pose estimation methods (new OpenPose models, Microsoft pose models, etc.) and evaluate them in terms of speed and performance.


3 Predictive systems

The previous section has provided the description of the VRU activity recognition systems developed in this work package. Such systems allow discriminating between sitting persons, standing (or walking) persons, cyclists, and motorcyclists. Sitting persons are regarded as a low-challenge VRU for automated cars, given that their transition from sitting to standing takes a while and can be easily detected by the system. On the other hand, cyclists and motorcyclists exhibit very different behaviours compared to pedestrians. Thus, the tracking and prediction of intentions (and trajectories) of cyclists and motorcyclists is carried out based on kinematic information alone. From this point onward, the focus of the predictive systems is pedestrians.

This section shows the incremental development carried out in the framework of the BRAVE project. The first subsection describes a pedestrian intention prediction system based only on pose and kinematic information. The second subsection provides the description of a more complete predictive system based on pose and contextual cues.

    3.1 Pose-based pedestrian predictive system

    As previously stated, this section describes a pose-based pedestrian predictive system. This is the first version of the system developed in this work and can be used for last-second reactions, but it does not provide further anticipation based on context. The system relies on the analysis of the pedestrians’ body features as detected by a single colour camera. The general scheme of the system is provided in Figure 16.

    Fig. 16. General scheme of the pose-based pedestrian predictive system.


The system needs to be trained offline based on skeletons that represent the pose of human bodies. The online part is based on so-called B-GPDMs (Balanced Gaussian Process Dynamical Models). B-GPDMs reduce the 3D time-related positions and displacements, extracted from key points or joints placed along the pedestrian body, into low-dimensional latent spaces. The B-GPDM also has the peculiarity of inferring future latent positions and reconstructing, from the latent space, the observation associated with a latent position. Therefore, it is possible to reconstruct future observations from future latent positions. However, learning a generic model for all kinds of pedestrian activities, or combining some of them into a single model, normally provides inaccurate estimations of future observations. For that reason, the proposed method learns multiple models of each type of pedestrian activity, i.e. walking, stopping, starting and standing, and selects the most appropriate one to estimate future pedestrian states at each time step.

A training dataset of motion sequences, in which pedestrians perform different activities, is split into 8 subsets based on typical crossing orientations and type of activity. Then, a B-GPDM is obtained for each sequence with one activity contained in the dataset. After that, in the online execution, given a new pedestrian observation, the current activity is determined using a Hidden Markov Model (HMM). Thus, the selection of the most appropriate model among the trained ones is restricted solely to that activity. Finally, the selected model is used to predict the future latent positions and reconstruct the future pedestrian path and poses. A compact sketch of this online loop is given below.
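The deliverable describes this online loop in prose only; the sketch below is a minimal reconstruction assuming that per-activity B-GPDM objects and an HMM recogniser are already available. All method names (recognise, closest_latent, refine, predict_latent, reconstruct) are illustrative, not the project's API.

    ACTIVITIES = ("walking", "stopping", "starting", "standing")

    def predict_pedestrian(observation, hmm, models, horizon):
        """One online step of the pose-based predictive system (illustrative interfaces)."""
        # 1) Recognise the current activity with the first-order HMM over the four states.
        activity = hmm.recognise(observation)

        # 2) Among the B-GPDMs trained for that activity, find the one whose training
        #    observations are most similar to the current observation.
        best_model, best_latent, best_dist = None, None, float("inf")
        for model in models[activity]:              # models: dict activity -> list of B-GPDMs
            latent, dist = model.closest_latent(observation)
            if dist < best_dist:
                best_model, best_latent, best_dist = model, latent, dist

        # 3) Refine the latent position (gradient descent), roll the latent dynamics
        #    forward and reconstruct the future poses/displacements.
        latent = best_model.refine(best_latent, observation)
        future_latents = best_model.predict_latent(latent, steps=horizon)
        future_observations = [best_model.reconstruct(z) for z in future_latents]
        return activity, future_observations

In practice, models would map each of the four activities (further split by crossing orientation) to the list of B-GPDMs learned from the CMU-UAH sequences described next.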

3.1.1 Dataset description

An appropriate dataset is needed in order to train accurate models with different pedestrian dynamics and to test the feasibility and limits of the proposed method in an extensive way under ideal conditions. For that purpose, a high-frequency and low-noise dataset published by Carnegie Mellon University (CMU) [18] has been used. On the one hand, the high frequency of the dataset helps the algorithms to properly learn the dynamics of different activities and increases the probability of finding a similar test observation in the training data without missing intermediate observations. On the other hand, low-noise models improve the prediction when working with noisy test samples. The dataset contains sequences in which people simulate typical pedestrian activities. The 3D coordinates of 41 joints located along their bodies are gathered at 120 Hz. An example of a walking pedestrian observation from different points of view is shown in Figure 17. Nevertheless, not all the gathered joints offer discriminative information about the current and future pedestrian activities. In fact, joints located along the arms do not contribute to distinguishing walking, starting, stopping or standing activities. For that reason, a subset of 11 joints was selected in order to infer future pedestrian states. An example of a pedestrian observation of this subset, from different points of view, is shown with red markers in Figure 17.

    Fig. 17. Pedestrian observation extracted from the dataset published by CMU in which 41 joints, represented in blue markers, are shown. The subset composed of 11 joints is represented in red markers.


The CMU dataset contains a considerable number of different activities, including human interactions between several subjects. We pre-selected and filtered the available sequences according to the following criterion: only sequences including walking, starting, stopping and standing activities without orientation changes of the subjects were selected. In this way, a total of 490 sequences composed of 302.470 pedestrian poses from 31 different subjects were extracted. Hereafter, this set of sequences is referred to as the CMU-UAH dataset. After this extraction, the CMU-UAH dataset was divided into 8 subsets. The first division was based on the orientation of typical crossing activities, i.e. left-to-right and right-to-left. The second one was based on the type of activity, i.e. walking, starting, stopping and standing. Those sequences with more than one activity were cropped into sub-sequences with only one action. A breakdown of the CMU-UAH dataset in terms of number of sequences and pedestrian observations is shown in Table VI.

    Table VI. Number of sequences and number of observations per activity type.

It is worth remarking that each pedestrian observation is composed of pose and displacements. The former is related to the 3D position of each joint, and the latter is associated with the displacement of each joint between two consecutive iterations. In practice, the joint displacements are key features since they increase the feasibility of reconstructing future pedestrian paths and improve the accuracy of the pedestrian activity classification. The event-labelling methodology that has been followed in this task allows identifying the instant at which a pedestrian starts or finishes an activity. Specifically, a starting activity is defined as the action that begins when the pedestrian moves one knee to initiate the gait and ends when the foot of that leg touches the ground again. A walking activity is defined as the action that happens after a starting activity and before a stopping activity. Moreover, a stopping activity is defined as the action that begins when a foot is raised for the last step and finishes when that foot touches the ground. Finally, standing activities are defined as the actions that happen after stopping activities and before starting activities. These criteria were adopted because these transitions are easily labelled by human experts, thus enabling the creation of reliable ground truths. The well-known AlphaPose algorithm has been used for extracting the pedestrian skeleton from the information provided by a colour camera. After extracting the body joints, the skeleton estimation can be carried out as shown in the example depicted in Figure 18. A small sketch of how such observations can be assembled is given below.
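As an illustration, the sketch below concatenates, for each frame, the 3D positions of the 11 selected joints with their per-joint displacement with respect to the previous frame (which is also why one observation fewer than the number of poses is obtained per sequence); the array layout is an assumption.

    import numpy as np

    def build_observations(poses):
        """poses: (T, 11, 3) 3D joint coordinates of one sequence.
        Returns (T - 1, 66) observations: pose part followed by displacement part."""
        displacements = poses[1:] - poses[:-1]            # per-joint displacement between frames
        observations = np.concatenate(
            [poses[1:].reshape(len(poses) - 1, -1),       # 11 x 3 pose values
             displacements.reshape(len(poses) - 1, -1)],  # 11 x 3 displacement values
            axis=1)
        return observations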

3.1.2 Learning Pedestrian Activities

The B-GPDM is a modified version of the Gaussian Process Dynamical Model (GPDM), used to learn 3D time-related information extracted from pedestrian joints in order to predict paths, poses and intentions. The GPDM provides a framework for transforming a sequence of feature vectors, which are related in time, into a low-dimensional latent space. In order to apply this transformation, the observation and dynamics mappings are computed separately in a non-linear form, marginalizing out both mappings and optimizing the latent variables and the hyperparameters of the kernels. The incorporation of dynamics not only allows making predictions about future data but also helps to regularize the latent space for modelling temporal data. Therefore, if the dynamical process defined by the latent trajectories in the latent space is smooth, the models will tend to make good predictions. Likewise, given a latent position from the latent space, the associated observation can be reconstructed.

    Fig. 18. Example of a pedestrian skeleton estimation. Green markers correspond to left joints, blue markers to right joints and red markers to head, centre of shoulders and centre of hips. The lines represent the pedestrian heading computed from each body part.

Nonetheless, learning a generic model for all kinds of pedestrian activities, or combining some of them into a single model, could produce poor dynamical predictions. For that reason, the proposed method learns multiple models for each type of pedestrian activity, i.e. walking, stopping, starting and standing, and selects the most appropriate among them to predict future pedestrian states at each time step. The learning stage starts by loading all sequences contained in the CMU-UAH dataset. Because the coordinate system of these sequences is referenced to the sensor, the 3D translation parameters of each observation are removed, so that the origin of the reference system is relocated at the pedestrian. This allows the algorithms to deal with pedestrians regardless of their positions with respect to the sensors. After that, the variables are scaled by subtracting the mean and dividing each one by its standard deviation in order to have zero-mean and unit-variance data. Since the B-GPDM requires iterative procedures for minimizing the log-posterior function, the latent positions X, the hyperparameters θ and β, and the constant κ have to be properly initialized. The latent coordinates are initialized by Principal Components Analysis (PCA). Finally, a B-GPDM is learned for each sequence (a sketch of this preprocessing and initialization is given after Figure 19). An example of a model corresponding to a pedestrian walking 6 steps is shown in Figure 19. The green markers indicate the projection of the pedestrian observations onto the subspace. The model variance is represented from cold to warm colours. Whereas a high variance (warm colours) indicates that illogical pedestrian observations can be reconstructed, a low variance (cold colours) indicates that pedestrian observations similar to the learned sequence may be obtained from a latent position.


    Fig. 19. B-GPDM corresponding to a pedestrian that walks 6 steps. The projection of the pedestrian motion sequence onto the subspace is represented by green markers. The model variance is indicated from cold to warm colours.
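A minimal sketch of the preprocessing and PCA initialization described above is given below; the choice of root joint and a latent dimensionality of 3 are illustrative assumptions, and the subsequent B-GPDM optimization of X, θ, β and κ is not shown.

    import numpy as np

    def remove_translation(poses, root_joint=0):
        """poses: (T, J, 3). Remove per-frame translation by re-centring on a root joint."""
        return poses - poses[:, root_joint:root_joint + 1, :]

    def standardize(features):
        """features: (T, D). Zero-mean, unit-variance scaling per dimension."""
        return (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-9)

    def pca_init(features, latent_dim=3):
        """PCA initialization of the latent positions X (latent_dim is illustrative)."""
        centred = features - features.mean(axis=0)
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        return centred @ vt[:latent_dim].T      # project onto the top principal components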

    3.1.3 Activity recognition

Since several models with different dynamics are trained beforehand, recognizing the activity of the current pedestrian observation allows selecting afterwards the most accurate model to estimate future pedestrian states. Naïve-Bayes classifiers, or the maximum similarity between the current observation and each observation of the training dataset, may determine the activity. Nevertheless, in the latter case, if the maximum similarity were applied directly, i.e. without modelling the evolution of the pedestrian activity, higher errors would be obtained when selecting the most appropriate model, due to the similarity between observations of different dynamics; e.g., an observation of a pedestrian that is walking may be similar to an observation belonging to the beginning of a stopping sequence or to the end of a starting sequence. Thus, if the previous activity was recognized as walking, then the next dynamics should be determined as walking or stopping, not as starting. Thereby, the process of how a pedestrian changes its dynamics over time can be described by a Markov process. At any time, the pedestrian can perform one of a set of 4 distinct actions (s). These activities or states are not observable, since only 3D information from the pedestrian's joints is available. Therefore, the states can only be inferred through the observations (x). For this reason, the implementation of a first-order HMM allows modelling the transitions between activities and recognizing the correct one taking into account the previous dynamics. The Viterbi algorithm is a dynamic programming procedure for finding the most likely state sequence given an observation sequence; a minimal sketch is given below. The values of the transitions between states were experimentally fixed by maximizing the success rate. The Sum of Squared Errors is computed between the current pedestrian observation xt and the N observations of the training data subset belonging to the j-th state of s. Before computing αi, the pose of the current pedestrian observation and the poses of the training observations are scaled and referenced to the same joint. The scale factor applied to each observation is obtained from the sum of the ankle-knee and knee-hip distances. The displacements are not scaled, with the intent of finding pedestrians with similar joint velocities.
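A minimal log-space Viterbi sketch for the four-state activity HMM is given below; the emission and transition probabilities are assumed to be provided (in the deliverable the transitions were fixed experimentally and the emissions derive from the similarity scores).

    import numpy as np

    STATES = ("walking", "stopping", "starting", "standing")

    def viterbi(emission_log_probs, transition_log_probs, initial_log_probs):
        """emission_log_probs: (T, S); transition_log_probs: (S, S); returns best state path."""
        T, S = emission_log_probs.shape
        delta = np.zeros((T, S))
        backpointer = np.zeros((T, S), dtype=int)
        delta[0] = initial_log_probs + emission_log_probs[0]
        for t in range(1, T):
            scores = delta[t - 1][:, None] + transition_log_probs      # (S, S) score matrix
            backpointer[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) + emission_log_probs[t]
        path = [int(delta[-1].argmax())]
        for t in range(T - 1, 0, -1):
            path.append(int(backpointer[t, path[-1]]))
        return [STATES[i] for i in reversed(path)]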


    3.1.4 Path, pose and intention prediction

Once the pedestrian activity at time t has been estimated, the selection of the most appropriate model allows making accurate predictions about paths, poses and intentions. For these tasks, a search for the most similar training observation and its model is computed. This observation corresponds to the i-th element for activity s. After that, the latent position that represents the most similar observation is used as the starting point for a more accurate search in the selected model, applying a gradient descent algorithm. Due to the fact that close points in the latent space are also close in the data space, it is expected that a more similar non-trained observation can be found around this starting point. Once the final latent position has been estimated, predictions of N latent coordinates are iteratively made, and their associated observations are reconstructed. Thereby, given the current pedestrian location with respect to the sensor, the future pedestrian path can be computed by adding the consecutive N predicted displacements. It is noteworthy that the reference point used to reconstruct the path is the right hip, since it corresponds to a point close to the centre of gravity. Additionally, given the N future pedestrian observations, the future intentions can be estimated through the application of the activity recognition algorithm. A sketch of this path reconstruction step is given below.
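The sketch below illustrates only the final path-accumulation step, assuming the N predicted observations follow the pose-plus-displacement layout used earlier; the index of the right-hip joint and the array layout are assumptions.

    import numpy as np

    def reconstruct_path(current_location, predicted_observations, right_hip_index, n_joints=11):
        """predicted_observations: (N, 2 * n_joints * 3) pose+displacement vectors.
        Accumulate the predicted right-hip displacement from the current location."""
        path = [np.asarray(current_location, dtype=float)]
        for obs in predicted_observations:
            displacements = obs[n_joints * 3:].reshape(n_joints, 3)   # displacement half
            step = displacements[right_hip_index]                     # right hip as reference
            path.append(path[-1] + step)
        return np.stack(path)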

3.1.5 Results

Throughout this section, the main results of the algorithms described above are discussed. All algorithms were tested using the CMU-UAH dataset adopting a one-vs-all strategy, meaning that all the models generated by one test subject were removed from the training data before performing tests on this subject. Since the pedestrian displacements are computed from two consecutive poses, 301,980 observations are finally analysed. Additionally, the activity recognition and prediction algorithms were also tested using a sequence of pedestrian data extracted by the skeleton estimation algorithm.

A. Activity Recognition Results

The activity recognition results are summarized in the confusion matrix shown in Table VII. A more exhaustive assessment is presented in Table VIII, where the different activity recognition performances are compared taking into account the pedestrian features, the number of joints and the activity.

1) Joints Influence on the Performance: The results verify that shoulder and leg motions, which are associated with the 11 joints, are more valuable sources of information than other body parts for recognizing the current pedestrian action. For example, including the arms and upper body parts does not improve the results, probably because they do not introduce distinctive information about the activities. Specifically, the maximum accuracy, 95.13%, is achieved when the observations are composed of poses and displacements from only 11 joints, whereas the accuracy falls to 90.69% when 41 joints are used. Considering only body poses, a similar conclusion is drawn, since the maximum accuracy is 91.28% and 88.39% for 11 and 41 joints respectively. Finally, when the observations are composed solely of pedestrian displacements, the activity recognition results are not significantly influenced by the number of joints.

Regarding the distinction among activities, the displacements yield a better separation of standing actions from the rest of the activities. However, with respect to starting and stopping actions, a higher number of critical misclassifications is produced, which means that the displacements do not make it possible to reliably distinguish whether a pedestrian is carrying out the first or the last step. Poses and displacements together offer more discriminative information in these cases. Considering the body pose as the only feature, standing actions are repeatedly recognized as walking activities since, when the pedestrian legs are closed, the poses of both states are very similar at those instants of time. Therefore, the displacements provide valuable information in these cases. Regarding the observations composed of body poses and displacements, the most frequent misclassifications are produced by delays or by pedestrians with low-speed motions. The first cause is related to the event-labelling methodology selected by the human expert. It seems that the first half of the first step and the second half of the last step contain the most perceptible information to determine starting and stopping actions respectively; hence, the rest of these steps are normally recognized as a walking action. Concerning the second cause, walking activities are recognized as starting or stopping actions when pedestrians with low-speed motions are tested. These misclassifications are not critical from the point of view of the path estimation, since these actions have similar dynamics. Likewise, the beginning of a starting action and the ending of a stopping motion contain body poses which are equivalent to poses labelled as standing actions; hence, a significant number of misclassifications are also produced between these activities.

    Table VII. Confusion matrix using 11 pedestrian joints.

    Table VIII. Evaluation of activity recognition performance with respect to pedestrian features, number of joints and activity.

A graphical example of several of the previous statements is shown in Figure 20, where the classification probabilities using 11 joints are illustrated along with the ground truth. Several examples of pedestrian poses at different instants of time are shown at the top of the figure. These poses are represented in different colours according to the classification result: black represents standing, green starting, red walking and blue stopping. In the middle, the probabilities of each activity at each instant of time are shown. Finally, at the bottom, a zoom-in of the transitions is illustrated. The graph shows short delays in the standing-starting and stopping-standing transitions. On the other hand, throughout walking actions, local maxima and local minima of the walking probability appear in the graph when the pedestrian legs are open and closed respectively.

    Fig. 20. Example of activity recognition using poses and displacements extracted from 11 joints. Top: pedestrian poses at significant instants of time. Middle: probabilities for each activity. Bottom: zoom in of the transitions.

B. Pedestrian Path Prediction Results

Throughout this section, the evaluation of the path prediction results is performed considering 11 joints. As previously explained, once the pedestrian activity is estimated, the most appropriate model is selected and the prediction of future observations is performed iteratively with that model. Accordingly, a robust path prediction depends strongly on a robust activity recognition. The path prediction evaluation is performed using the activity recognition results discussed in the previous sections. In this evaluation, the Mean Euclidean Distance (MED) between the predicted pedestrian locations and the ground truth is analysed for time horizon values up to 1s. Since the most dangerous traffic situations usually happen when pedestrians start to cross or when they stop before crossing, the evaluation is carried out around these situations. Thereby, the MEDs are computed at different Time-To-Event (TTE) values, i.e., time to start walking and time to stop walking. Positive TTE values refer to instants of time before the event and negative values to instants of time after the event. In Table IX and Figure 21, the combined longitudinal and lateral MED along with the standard deviation are shown.

Regarding starting activities, the errors before the event are mainly produced because the algorithm assumes zero displacements when the pedestrian activity is recognized as standing, whereas small motions were actually recorded in the ground truth. The errors after the event grow exponentially, since the recognition of a starting activity has a mean delay of around 60ms and the pedestrian is accelerating. However, once the pedestrian finishes speeding up, the MED tends to grow linearly. Additionally, since the B-GPDM is a dimensionality reduction technique, the errors are not significantly influenced by the number of joints. In order to contextualise the errors, the mean displacement for starting activities belonging to the CMU-UAH dataset was computed. Throughout a starting activity, the pedestrian has a mean displacement value of 193.98±78.52mm. Likewise, the mean displacement at 1s after and before the event is 467.92±264.97 and 41.24±67.91mm respectively. It is worth mentioning that other dynamical changes could happen within the TTE range of [1-0] s, e.g., a stopping-standing transition could be carried out by the pedestrian a few hundred milliseconds before the event.

    Table IX. Combined longitudinal and lateral MED and STD in mm at different TTE for predictions up to 1s using 11 joints.

    Fig. 21. Combined longitudinal and lateral MEDs in mm at different TTEs for predictions up to 1s. (a) For starting events and 11 joints. (b) For stopping events and 11 joints.
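As a reference for how this evaluation metric could be computed, the following is a minimal sketch of the MED between a predicted trajectory and the ground truth over a prediction horizon. The array shapes and the aggregation over test sequences at each TTE are assumptions, not a description of the exact evaluation code used for Table IX.

```python
import numpy as np

def mean_euclidean_distance(pred_xy, gt_xy):
    """Mean Euclidean Distance (MED) between predicted and ground-truth 2D
    pedestrian locations, both of shape (n_horizon_steps, 2) in mm, e.g. the
    positions predicted over a 1 s horizon at a given TTE."""
    pred_xy = np.asarray(pred_xy, dtype=float)
    gt_xy = np.asarray(gt_xy, dtype=float)
    return float(np.linalg.norm(pred_xy - gt_xy, axis=1).mean())

def med_at_tte(per_sequence_meds):
    """Aggregate the MEDs of all test sequences at one TTE into a mean and a
    standard deviation (assumed aggregation behind Table IX and Figure 21)."""
    errors = np.asarray(per_sequence_meds, dtype=float)
    return errors.mean(), errors.std()
```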

The results for starting activities are similar to those achieved by other similar works in the state of the art. The method described in this document achieves a MED value of 89.1mm at the instant of a starting event for a time horizon of 0.5s. In contrast, some state-of-the-art works evaluate their predictions at all time steps instead of assessing them at different TTEs, and they need a temporal window of n trajectory points instead of only two observations.

Regarding stopping activities, the errors before the event tend to grow linearly, since the mean length of the stopping steps is 381.22±78.92ms and the second half of the last step contains the most perceptible information to determine stopping actions. Consequently, an appropriate model cannot be chosen until a few hundred milliseconds before the event. After the event, the error decreases and tends to be logarithmic. However, at a TTE value of -1s the errors grow, because a new dynamical change of the pedestrian could happen. Once again, in order to contextualize the errors, the mean displacement for stopping activities belonging to the CMU-UAH dataset was computed. Throughout these activities, the pedestrian has a mean displacement value of 164.37±63.33mm. Likewise, the mean displacement at 1s after and before the event is 102.15±63.50 and 679.15.37±306.77mm respectively. These results are similar to the outcomes achieved by other works. The proposed method achieves a MED value of 238.01mm for a TTE of 1s and a time horizon of 1s, and the RMSE value for a TTE of 1s and a time horizon of 1s is 314.5mm.

In conclusion, after an exhaustive assessment of activity recognition and path prediction, the results verify that shoulder and leg motions are more valuable sources of information than other body parts for recognizing the current pedestrian action. Specifically, the maximum accuracy, 95.13%, is achieved when observations composed of a few joints placed along these body parts are taken into consideration. Moreover, at least two types of features are needed for action recognition when more than two dynamical behaviours are considered, i.e., body poses and displacements. The proposed method detects starting intentions 125ms after gait initiation with an accuracy of 80% and recognizes stopping intentions 58.33ms before the event with an accuracy of 70% when joints from the shoulders and legs are considered. Concerning the path prediction results, errors similar to those of other works are obtained. The measure of accuracy chosen for the path evaluation is the MED at different TTEs, which gives objective information about the path prediction performance. Although other works achieved slightly smaller errors than the method proposed in this document, their prediction algorithms need a temporal window of n trajectory points instead of two observations, and their errors are evaluated for all time steps instead of being assessed at different TTEs. On the negative side, the method proves to be rather slow in practice, requiring up to 1 second to produce predictions. In addition, the method is difficult to scale up to hundreds of different pedestrian dynamics. While useful for last-second reactions, it seems reasonable to extend the method with contextual cues in order to increase its ability to anticipate critical situations in practice.

    3.2 Pedestrian predictive system based on contextual cues

As previously stated, a predictive system based on different contextual cues becomes necessary to overcome the limitations of the pose-based predictive approach. For this purpose, additional features will be gathered, such as looking/gaze direction, movement status, distance to curb, video input, etc. All these variables give a much better understanding of the contextual situation in which the pedestrian is located.

3.2.1 System description

The significant development of Deep Learning during the last decade has propelled the use of several variants of neural networks. In the context-based predictive model, two of these variants have been used: Convolutional Neural Networks (CNNs), which extract features from pedestrian image sequences, and Recurrent Neural Networks (RNNs), which extract temporal information from these features. The proposed deep learning system tries to answer the following question: "Is the pedestrian going to cross the street?". The task is formulated as a binary sequence classification problem in which the pedestrian intention at a future time horizon is inferred from an input sequence. In the next subsections, both the proposed problem and the architecture of the developed models are discussed.

A. Problem description

The purpose of the proposed Deep Learning system is to predict the crossing intention of pedestrians by using temporal information provided by image sequences and other categorical variables. The input sequences are defined as a set of features Xt = {Xt−N, Xt−N+1, ..., Xt}. The output is defined as a binary label Yt+M. In the following, the model architecture is presented, and the role of each module is explained separately.

B. General model architecture

The proposed system is composed of two main modules: a feature extractor, used to obtain useful information from the image data, and a many-to-one Recurrent Neural Network (RNN) module. At a high level, the extracted features are fed to the RNN module, the output of the RNN module is fed into a fully connected layer, and its output is passed through a sigmoid function in order to obtain the predicted probability of the crossing action at the trained time horizon. This architecture is represented in Figure 22.
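A minimal sketch of this architecture is shown below, assuming the default configuration mentioned later in this section (ResNet50 feature extractor, LSTM, hidden size 4, one layer, dropout 0.5, frozen backbone). The class and variable names are illustrative and not the project code.

```python
import torch
import torch.nn as nn
from torchvision import models

class CrossingIntentionNet(nn.Module):
    """Sketch of the CNN + many-to-one RNN architecture described above."""
    def __init__(self, hidden_dim=4, num_layers=1, dropout=0.5):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        # Cut off the last fully connected layer; keep the average-pool output.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        for p in self.features.parameters():
            p.requires_grad = False          # no fine-tuning of the extractor
        self.rnn = nn.LSTM(input_size=2048, hidden_size=hidden_dim,
                           num_layers=num_layers, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, frames):
        # frames: (batch, seq_len, 3, 224, 224)
        b, t = frames.shape[:2]
        with torch.no_grad():
            f = self.features(frames.flatten(0, 1))   # (b*t, 2048, 1, 1)
        f = f.flatten(1).view(b, t, -1)                # (b, t, 2048)
        out, _ = self.rnn(f)                           # many-to-one: keep last step
        logit = self.head(self.dropout(out[:, -1]))
        return torch.sigmoid(logit).squeeze(-1)        # crossing probability
```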

Fig. 22. Diagram describing the proposed method.

C. Feature extraction

Input features are extracted from colour video sequences using three alternative techniques:

- Pretrained CNN models from the ResNet family [20] and from the ResNeXt family [21]. All models are pretrained on ImageNet [22]. The network was modified by cutting off the last fully connected layer and obtaining the features from the output of the average pooling layer.

- Convolutional autoencoder with a pre-trained ResNet34 used as encoder [23]. An autoencoder is an encoder-decoder variant which is trained for the task of input reconstruction in a self-supervised manner. After the training process, the encoder is separated from the decoder and used as a feature extraction method. A high-level diagram of this architecture is shown in Figure 23. The network was trained with a learning rate of 10⁻³ using the Binary Cross Entropy (BCE) loss.

- SegNet-based autoencoder [24]. This method is pretrained on the Cars dataset [25].

No fine-tuning has been applied to any of the feature extractors during the training of the RNN models. Following the same pooling strategy as in the pretrained ResNet34, the output features of both trained autoencoder encoders, with size 512×7×7, are averaged with a pooling layer with a 7×7 kernel, obtaining a 512×1×1 tensor. This tensor is then flattened into a one-dimensional vector of size 512. In some experiments, categorical variables are used as inputs along with the images. These variables are embedded in order to learn a multidimensional representation of the relationships between their categories. The embeddings are learned during training, and the embedding dimension of each variable is established according to the following heuristic: min(Int(Nc/2 + 1), 50), where Nc is the number of categories of the variable (its cardinality).
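The pooling/flattening of the autoencoder features and the embedding-dimension heuristic can be illustrated as follows. The dictionary of categorical variables matches those introduced later in Section 3.2.2 (looking direction, orientation, state of movement); the variable names are assumptions.

```python
import torch
import torch.nn as nn

def embedding_dim(n_categories: int) -> int:
    """Heuristic from the text: min(Int(Nc/2 + 1), 50)."""
    return min(int(n_categories / 2 + 1), 50)

# Hypothetical categorical inputs: looking direction (2 categories),
# orientation (4) and state of movement (2).
cardinalities = {"looking": 2, "orientation": 4, "movement": 2}
embeddings = nn.ModuleDict({
    name: nn.Embedding(nc, embedding_dim(nc)) for name, nc in cardinalities.items()
})
# embedding_dim(2) = 2 and embedding_dim(4) = 3, so together with the normalized
# bounding-box centre (2 values) the RNN input grows by 3 + 2 + 2 + 2 = 9.

# Pooling/flattening of autoencoder features (512x7x7 -> 512), as described above.
pool = nn.AvgPool2d(kernel_size=7)
feats = torch.randn(1, 512, 7, 7)      # placeholder encoder output
vector = pool(feats).flatten(1)        # shape (1, 512)
```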

Fig. 23. High-level diagram describing a Convolutional Autoencoder. The output image example corresponds to the SegNet-based autoencoder.

D. RNN module

For the recurrent module of the system, two variants of RNNs are used: Long Short-Term Memory (LSTM) [26] and Gated Recurrent Unit (GRU) [27]. These variants help to compensate for the vanishing gradient problem of plain RNNs. The main difference between GRUs and LSTMs is that GRUs are computationally more efficient while achieving similar results in sequence modelling problems. Bidirectional variants of LSTMs and GRUs are also used in the experiments in order to test whether the additional information of the reversed sequence can improve the modelling of the problem.
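As a hedged illustration, the four recurrent variants compared later (LSTM, GRU, BDLSTM, BDGRU) can be built with the same hyperparameters by switching the module class and the bidirectional flag; the helper name and sizes are assumptions.

```python
import torch.nn as nn

def make_rnn(kind="lstm", input_size=2048, hidden_dim=4, bidirectional=False):
    """Build an LSTM or GRU with identical hyperparameters (illustrative only)."""
    cls = nn.LSTM if kind == "lstm" else nn.GRU
    return cls(input_size=input_size, hidden_size=hidden_dim,
               num_layers=1, batch_first=True, bidirectional=bidirectional)

# Note: a bidirectional RNN outputs 2*hidden_dim features per time step,
# so the fully connected head has to match that size.
```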

3.2.2 Experimental setup

All the experiments carried out are described separately below. Unless noted otherwise, an LSTM module with the pre-trained ResNet50 feature extractor is the configuration used in the tests.

A. Image Data Preparation

All models have been trained on the JAAD dataset [28], a naturalistic dataset focused on the behaviour of pedestrians during their road-crossing action. It comprises 346 videos, filmed from a moving vehicle, with durations ranging from 5 to 10 seconds. Their format varies both in frame rate and in resolution: there are 8 videos at 60 fps (frames per second) and 10 videos in HD (High Definition) resolution (1280×720), while the rest of the videos are filmed at 30 fps in FHD (Full High Definition) resolution (1920×1080). The default training and test splits suggested by the authors have been used in order to facilitate future comparisons with other algorithms. In these splits, HD videos are excluded, together with another set that presents low visibility (night scenes, heavy rain), leaving a total of 323 videos. The 60 fps videos included in these splits have been downsampled to 30 fps. Figure 24 shows some examples of images in the JAAD dataset.

Fig. 24. Two examples of JAAD pedestrian sequences: a crossing (top) and a not-crossing (bottom) situation.

The input to the model is composed of image sequences and, in some variants, categorical variables. Image sequences are extracted using the ground-truth 2D bounding box annotations of pedestrians with crossing behaviour. The height and width of the bounding boxes are equalized in order to avoid image deformation. All sequences are filtered by occlusion level and bounding box height: fully occluded samples and bounding boxes with a height lower than 50 pixels have been removed only from the training set, leaving the validation and test sets unchanged, in order to be able to test the behaviour of the model in challenging situations. Finally, in order to meet the input restrictions of the feature extraction methods, images are resized to 224×224 (the size used in training) and standardized using the per-channel mean and deviation of ImageNet.

B. Testing the influence of the feature extraction method

Various tests were performed changing the feature extraction method. Autoencoders are used in order to test whether features extracted with a method specialized in image reconstruction improve the network performance during training compared to features from a network pre-trained for classification.

C. Rescaling image features and normalization

The output of the average pooling layer has a range between 0 and N, where N is variable and depends on the input image and on the feature extractor. A rescaling approach has therefore been adopted. Rescaling is performed by dividing the sequence of image features by the maximum value in the batch (N), yielding values between 0 and 1.
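The following is a minimal sketch of the image preprocessing and the batch-wise rescaling described above, assuming torchvision transforms and the commonly used ImageNet channel statistics; the exact pipeline used in the project may differ.

```python
import torch
from torchvision import transforms

# Resize to 224x224 and standardize with the usual per-channel ImageNet statistics.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def rescale_features(batch_features: torch.Tensor) -> torch.Tensor:
    """Batch-wise rescaling described in C: divide the image features by the
    maximum value found in the batch so that they lie between 0 and 1."""
    return batch_features / batch_features.max().clamp(min=1e-8)
```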

D. Influence of additional variables

Three categorical variables related to pedestrians and extracted from the ground-truth annotations have been used to study their influence on predictions: looking/gaze direction, orientation and state of movement. The looking direction is a binary variable, whose value is 1 if the pedestrian looks at the vehicle (looking for direct eye contact with the driver) and 0 otherwise. The orientation variable has the following categories, defined relative to the car: front (0), back (1), left (2) and right (3). The state of movement has two possible values: standing and moving. Another variable used is the bounding box centre (uc, vc), extracted from the ground-truth annotations and divided by the maximum of each dimension in order to achieve independence from the camera sensor resolution. The output of each embedding layer and the centre of the bounding box are concatenated to the feature vector. As a result, the size of the input vector used in the RNN module increases by 3+2+2+2 = 9.

E. LSTM versus GRU

As previously mentioned, a study on the influence of the RNN type and its bidirectional variants has been conducted. With this objective in mind, four RNN models are compared with the same hyperparameter configuration: LSTM, GRU, Bidirectional Long Short-Term Memory (BDLSTM) and Bidirectional Gated Recurrent Unit (BDGRU).

F. Hyperparameter search

After an ablation study using grid search, the configuration used for the model is the following:

• RNN hidden dimension: 4.
• Number of stacked RNN layers: 1.
• Dropout (applied to RNN output): 0.5.

The simplicity of the network is due to the tendency of more complex networks to overfit.

G. Training configuration

PyTorch [29] has been the framework chosen to carry out the experiments. All experiments have been trained and tested on a single NVIDIA GTX TITAN X GPU. The Adam optimizer [30] has been used with a learning rate of 10⁻⁴. The loss function used for training is the BCE loss. To make computations deterministic, a fixed random seed has been established in all pseudorandom number generators. Finally, to avoid unnecessary processing, a validation patience of five epochs has been set, i.e., if the validation loss stops improving for five epochs, the training is ended.
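A compact sketch of this training configuration is shown below. The seed value, the placeholder model and the variable names are assumptions; the deliverable only states that a fixed seed, Adam with a learning rate of 10⁻⁴, BCE loss and a patience of five epochs are used.

```python
import random
import numpy as np
import torch
import torch.nn as nn

SEED = 0                      # hypothetical value; only "a fixed seed" is stated
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

model = nn.Linear(2048, 1)    # placeholder for the full CNN+RNN network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.BCELoss()      # binary cross-entropy on the sigmoid output
PATIENCE = 5                  # stop if validation loss does not improve for 5 epochs
```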

3.2.3 Results

In this section, the results obtained in the previously mentioned experiments are discussed. The metrics used to compare these results are accuracy, precision, recall and, finally, the Average Precision (AP) score, calculated as a weighted sum of precisions over the recall thresholds. All metric values in the tables are percentages.

A. Feature extraction method importance

Observing the results in Table X, pre-trained models obtain better results than self-trained ones. The increase in the complexity of the network is directly related to the increase in all performance metrics. One possible reason for these results is the difference in training data size and diversity between the ImageNet, JAAD and Cars datasets. Although the images are reconstructed quite accurately by the self-trained extractors, the output features of their encoders lack useful information for the RNN module. This is shown by the recall value of 100%, which means that the model has converged to predicting that every pedestrian will cross. This problem may be caused by the use of an average pooling layer only after training since, in the pre-trained models, average pooling is used during the training stage.

    Table X. Comparative results with different feature extraction methods.

B. Rescaling image features and normalization

Rescaling input image features contributes to an improvement in the results (see Table XI). These results show that the high variation in input features penalizes the learning process.

    Table XI. Comparative results after rescaling image features.

C. Influence of additional variables

The incorporation of all additional variables improves the AP from 75.62% to 80.00% (see Table XII). This result shows that the incorporation of meaningful data can act as a regularization factor that allows greater generalization. The 80% result is more than sufficient in practice for prediction purposes, given that it is obtained on a single-frame basis; after temporal integration, the prediction system becomes highly effective for anticipating reactions. Individually, orientation and looking direction are the variables with the greatest weight, followed by the state of movement. These variables are also used by drivers when they infer pedestrians' crossing intentions (e.g., a pedestrian walking towards the road and a pedestrian at the curb looking at the driver's car are more likely to cross than a pedestrian walking parallel to the car and suddenly stopping). The centre of the bounding box in the image has less influence on the result, probably because of its relative nature and high variation, as it is expressed in the image coordinate system.

    Table XII. Influence of additional variables on the final result.

D. LSTM versus GRU

According to Table XIII, the additional temporal information provided by bidirectionality can improve the results of an LSTM-based network in this problem. The GRU obtains worse results than the LSTM; in the case of the bidirectional variants, both RNNs improve the results, with the BDGRU being slightly better than the BDLSTM. This can be due to the high dropout and the fixed seed used for reproducibility.

    Table XIII. RNN selection results.

E. Final results

To see the effect of the above experiments together, a model has been trained including all the previous upgrades. According to the AP scores in Table XIV, the improvements work well together, yielding an increase of more than 8% with respect to the simpler model.

    Table XIV. Best model compared with each improvement model.

F. Qualitative results

In Figure 25, two example sequences are shown, with the input image sequence on the left and the output crossing probability on the right. The model used in this experiment is the best model from Table XIV with a change in the output dimension: instead of outputting the crossing probability one second into the future, the output is split into eight equidistant time steps between 0 and 1 second. Both sequences belong to the same pedestrian. In the top one, the pedestrian is not going to cross within one second, and in the bottom one, the pedestrian is beginning to cross. As the graphs show, the probability of crossing is low in the first time step of the top graph, but this value doubles by the end of the prediction, indicating a possible future crossing, which becomes more likely in the bottom case.

Fig. 25. Two examples from the test set. The top one represents a non-crossing sequence and the bottom one a crossing sequence. The left graphs show the output crossing probability at eight future time steps between 0 and 1 s (0 to 30 frames).
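The change in output dimension used for Figure 25 can be sketched as follows: the single-horizon head is replaced by an 8-output head so the network predicts the crossing probability at eight equidistant future time steps between 0 and 1 s (0 to 30 frames). Names and sizes are assumptions.

```python
import torch
import torch.nn as nn

hidden_dim, horizons = 4, 8
multi_horizon_head = nn.Linear(hidden_dim, horizons)

rnn_last_output = torch.randn(1, hidden_dim)                 # placeholder RNN output
probs = torch.sigmoid(multi_horizon_head(rnn_last_output))   # (1, 8) crossing probabilities
# Training would then use BCE against eight binary labels, one per future time step.
```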

G. Dataset limitations

The JAAD dataset is currently one of the few datasets focused on pedestrian behaviour. However, it is composed of short videos. In addition, there are challenging situations that affect training: occlusion by the windshield wipers, bad weather conditions (rain, snow) and reflections on the windshield. Small pedestrians are a problem that can be filtered out easily, but this is not the case for non-relevant pedestrians, i.e., pedestrians who are not in the path of the vehicle. Filtering out these problems can lead to better training convergence, but it also reduces the amount of training data. Figure 26 shows some examples of such challenging situations.

Fig. 26. Some challenging cases in the JAAD dataset.

4 Final remarks

This deliverable has described the different methods implemented to predict the intentions and trajectories of VRUs. A first step is the classification of activities based on state-of-the-art CNN systems, dividing VRUs into four main categories: standing pedestrians, sitting pedestrians, cyclists, and motorcyclists. After the activity classification step, standing pedestrians become the focus of the prediction system. The prediction of trajectories for cyclists and motorcyclists is carried out using traditional kinematic information and Kalman filtering, while sitting pedestrians are regarded as quasi-static elements that represent neither a challenge for the automated vehicle nor a threat to themselves.

The prediction of pedestrians' intentions has been developed in two incremental steps. In the first step, only body pose features have been considered to provide predictions in a time horizon of 1s. The accuracy of such predictions is sufficient for last-second reactions, but the system is computationally heavy and difficult to scale up to thousands of different pedestrian dynamics. Consequently, in a second step, a context-based approach has been developed and tested in order to extend the ability to anticipate critical situations. This approach is based on a combination of CNNs and RNNs, yielding very accurate results in predictive tasks.

As a final conclusion, even though the results attained in this project are considered sufficient for practical implementation in an automated car, further development of specific datasets for VRU prediction, in particular pedestrian prediction, is necessary, given that the few existing prediction datasets have significant limitations, such as a limited variety of realistic and critical situations. In this regard, the use of simulation with humans in the loop is strongly advised, with a view to recreating critical situations with real pedestrians in a non-dangerous way.

5 References

[1] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Computer Vision – ECCV 2014, 2014, pp. 740–755.
[2] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, "Mask R-CNN," 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, 2017.
[3] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He, "Detectron," https://github.com/facebookresearch/detectron, 2018.
[4] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," Tech. Rep. [Online]. Available: https://pjreddie.com/yolo/
[5] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields," Nov. 2016. [Online]. Available: https://arxiv.org/abs/1611.08050
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A Large-Scale Hierarchical Image Database," in CVPR09, 2009.
[7] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," Feb. 2016. [Online]. Available: http://arxiv.org/abs/1602.07360
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Tech. Rep. [Online]. Available: http://code.google.com/p/cuda-convnet/
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going Deeper with Convolutions," Sep. 2014. [Online]. Available: http://arxiv.org/abs/1409.4842
[10] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," Sep. 2014. [Online]. Available: https://arxiv.org/abs/1409.1556
[11] K. He, X. Zhang, S. Ren, and J.