Deep Learning Techniques for Autonomous Vehicle Path Prediction
Jagadish D. N., Arun Chauhan, Lakshman Mahto
Indian Institute of Information Technology Dharwad, India [email protected], [email protected], [email protected]
Abstract
Mobility of autonomous vehicles is a challenging task to implement. Under the given traffic circumstances, the behavior of all agent vehicles has to be understood, and their paths over a short future horizon need to be predicted, in order to decide upon the maneuver of the ego vehicle. In this work we explore deep learning techniques to predict the movement of agents by implementing residual and conditional variational autoencoder deep learning networks.
Introduction
Smart city management demands intelligent traffic control encompassing safer, accident-free traffic along with optimum maneuverability. Urban mobility will focus on providing starting-point-to-destination functionality, with purposefully directed movement of autonomous vehicles (AVs) in an integrated transportation infrastructure (Hancock, Illah, and Jack 2019). Differing forms of vehicle control will exist side by side on the road, ranging from pedestrians to AVs. As per a survey report (Bertoncello and Dominik 2015), AVs are going to become the primary means of transport before the middle of the century. The major consequent benefits will be: AVs freeing up to 50 minutes a day for drivers; a reduction of parking space by billions of square meters; and a 90% fall in road accidents, saving millions of lives and large expenditures. Needless to say, the way ahead is to have AVs on the road.
The routine task in an AV will be to apply intelligence to maneuver in the given scenario. The context could involve other intelligent or non-intelligent vehicles, cyclists and pedestrians in surroundings dictated by traffic signs and road locality, wherein each such agent could present independent movement in the vicinity of the AV. Understanding the scenario, the AV can maneuver by predicting the future movements of the other agents. With these predictions, dangerous situations ahead can be anticipated and the necessary reactions taken to avoid them.
Copyright © 2020, Association for the Advancement of Artificial Intelli-gence (www.aaai.org). All rights reserved.
Several motion models for trajectory predictions are
presented in the literature. Physics-based models
(Brännström, Coelingh, and Sjöberg 2010) rely on low
level motion properties. They fail to anticipate motion changes with changing context, and are therefore limited to very short-term motion prediction. Maneuver-based
motion models represent vehicles as independent maneu-
vering entities banking on the early recognition of the ma-
neuvers that drivers intend to perform (Ortiz et al. 2011).
These models also suffer from inaccuracies, especially
at traffic junctions. In the interaction-aware motion mod-
els, vehicles are maneuvering entities that interact with
each other. For methods relying on trajectory prototypes
(Lawitzky et al. 2013), trajectories leading to an unavoida-
ble collision are penalized. Dynamic Bayesian Networks
hold a higher share among the interaction-aware motion
models. Pairwise dependencies between agents are mod-
elled with asymmetric coupled Hidden Markov Models
(Oliver and Pentland 2000). The fact that agents’ interac-
tions are regulated by traffic rules is exploited in
(Agamennoni, Nieto, and Nebot 2012). The work in (Gin-
dele, Brechtel, and Dillmann 2010) rather accounts for
mutual influences instead of pairwise dependencies. The
causal dependencies between the agents are modeled as a
function of local situational context.
Figure 1: A sample semantic map describing a traffic scenario.

Computer vision techniques are utilized to predict pedestrians' behavior, as they are the vulnerable road users (Gupta et al. 2018; Fernando et al. 2018). Object detection
and behavior prediction are the primary goals of an AV.
Acquired time series sensory data helps to classify and
build the localized traffic scenario, while the agents’ dy-
namics are understood and the behavior of them in the near
future will guide the decision making (Zhan et al. 2018).
Prominent non-trivial tasks under behavior prediction involve learning the following: the interdependencies between the vehicles in the surroundings; the road geometry and traffic rules impacting the vehicle trajectories; and the multimodal trajectories of the vehicles.
Several practical limitations exist while implementing the prediction entity. Among them: the AV can only partially observe the surrounding environment using on-board sensors, which suffer from limited range, sensor noise and occlusion; and the computational resources available on-board are restricted. Most existing studies assume the availability of an unobstructed top-down view of the surrounding environment, which can be obtained by infrastructure sensors such as a surveillance camera overlooking the AV (Mozaffari et al. 2020). Such an arrangement is costly to implement. Nevertheless, a dataset (Houston et al. 2020) comprising EV translation and orientation, agents' translations and yaws, and traffic signal conditions, along with semantic maps of the road, is available for exploration using deep learning techniques. A sample semantic map highlighting vehicles, intended paths and traffic lights is shown in Figure 1. The AV under drive, which collects the sensor data and whose path prediction is the ultimate goal, is often referred to as the ego vehicle (EV).
In this paper, we make use of residual and conditional variational autoencoder (CVAE) deep learning networks for trajectory prediction of traffic agents. The remainder of the paper provides insight into related works in the application domain, the suggested methodologies and the experimental results.
Related Works
Several works have been carried out on interaction-aware
motion models. In (Zyner, Worrall, and Nebot 2018; Zyner
et al. 2017; Xin et al. 2018) an agent’s behavior is learnt by
considering track history (e.g. position, velocity, accelera-
tion, direction) in relation to EV. The model considers the
agent as an isolated entity despite being surrounded by
other agents. Simple recurrent neural networks (RNN)
(Hammer 2000) are good at processing temporal depend-
encies. The works fail to capture the interdependencies
among the agents. The model could produce erroneous
results under a crowded environment. The above model
drawback is addressed by providing surrounding agents
track history, in relation to the target agent considered, in
(Deo and Trivedi 2018; Phillips, Wheeler, and Kochender-
fer 2017; Dai, Li, and Li 2019; Hu, Zhan, and Tomizuka
2018). Varied number of surrounding agents in the neigh-
borhood of the target agent are considered based on the
lane, direction and distance in these works. Multiple RNNs
constituting a gated recurrent unit (GRU) (Cho et al. 2014)
or long short-term memory (LSTM) (Hochreiter and
Schmidhuber 1997) are utilized to make predictions and
capture interdependencies between agents. These models, however, lack information on environmental factors that could impact the behavior of the target agent. Since the sensing is done by the EV, a few of the surrounding agents in the vicinity of the target agent may be occluded as well. A
bird’s eye view (BEV) image represents agents with color
coded bounding boxes, and intended driving lanes and traf-
fic signals as color lines (Cui et al. 2019; Djuric et al.
2018) (refer Figure 1). Semantic segmentation of environ-
ment in BEV is proposed in (Lee et al. 2017). With image-
like data, convolutional neural network (CNN) layers can
explore the spatial relationship of agents in the traffic con-
text scenario very effectively. The results are very promis-
ing. Going ahead, in (Luo, Yang, and Urtasun 2018) 3D
convolutions are applied to the temporal dimension of in-
put data. Later, a series of 2D convolutions is utilized to
capture spatial features. Hybrid networks are developed to
reap the benefits of RNNs and CNNs (Lee et al. 2017;
Zhao et al. 2019). The actual output representation is complex, and hence a conditional generative model is suggested by Lee et al. (2017); for this purpose a CVAE network is used for prediction. As far as network predictions are
concerned, the network may predict future maneuver intentions, or unimodal or multimodal trajectories. The maneuver intention output is a high-level understanding of traffic scenarios that can only help to infer the probability of an agent moving on a straight lane or taking a left or right turn at an intersection. However, predicting a trajectory instead of a mere intention provides more precise information
about future behavior of agents. Given the traffic scenario
and history of motion of an agent, the network can output
several possible future trajectories for the agent. The corre-
sponding trajectories distribution is therefore multimodal.
However, a unimodal trajectory predictor is limited to predicting the one among these possible trajectories with the highest likelihood. The network in (Ho-
ermann, Bach, and Dietmayer 2018; Schreiber, Hoermann,
and Dietmayer 2019) outputs an occupancy map that will
describe the probability of occupancy of each pixel of BEV
image at each timestamp in the prediction horizon.
Methods
The problem of path prediction of an agent can be formulated using a probabilistic model. If x_t^i represents the state of the i-th agent, conveying its position, among N surrounding agents at time instance t, then the predicted path of all agents is defined as:

X_a^N = \{ x_t^i, x_{t+1}^i, x_{t+2}^i, \ldots, x_{t+k}^i \} \quad \forall i \in N

Here k is the length of the prediction horizon.
The problem boils down to computing the conditional distribution P(X_a^N | O_EV) on recording the observations O_EV by the EV. In this section we propose a few neural networks for our objective. Upfront we have a residual deep neural network, and later a CVAE deep neural network.
Residual Network
As this distribution is computationally challenging, we can resort to predicting the path of each agent separately, given by:

X_a^i = \{ x_t^i, x_{t+1}^i, x_{t+2}^i, \ldots, x_{t+k}^i \}
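As a concrete data-structure view, the per-agent paths and the multimodal outputs described above can be held in plain arrays; this is a minimal sketch with hypothetical sizes (the horizon k = 50 and the 3 modes match the settings used later, the agent count is arbitrary):

```python
import numpy as np

# Hypothetical sizes: k prediction-horizon steps, M modes per
# agent (the networks later predict 3), N agents in the scene.
k, M, N = 50, 3, 8

# Unimodal path X_a^i of one agent: k future (x, y) positions.
x_a_i = np.zeros((k, 2))

# Multimodal prediction for all agents: M candidate
# trajectories per agent plus a confidence score per mode.
trajectories = np.zeros((N, M, k, 2))
confidences = np.full((N, M), 1.0 / M)  # uniform to start

# Mode confidences must form a probability distribution per agent.
assert np.allclose(confidences.sum(axis=1), 1.0)
print(trajectories.shape)  # (8, 3, 50, 2)
```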
Figure 2 shows the network architecture. The convolu-
tion layer upfront segregates temporal information in data,
whereas later convolutional layers in series capture spatial
information of agents. The input to the network is
(224x224) pixels x 25 channels. The output consists of multimodal 2D agent coordinates and their associated confidence scores. Each identity block has 3 (conv + batch norm
+ activation) layers with skip connection, whereas a conv
block has a (conv + batch norm) layer in the skip connec-
tion path. The network (He et al. 2016) is trained to predict
trajectories for all agents for a horizon of 50 by looking at
10 historical coordinates of those agents.
Figure 2: Residual network architecture.
If a, b, W, g() and l represent the layer activations, bias, weights, activation function and layer number, respectively, in an identity block, then

a^{[l+1]} = g(W^{[l+1]} a^{[l]} + b^{[l+1]})
a^{[l+3]} = g(W^{[l+3]} a^{[l+2]} + b^{[l+3]} + a^{[l]})

L2 regularization encourages weight decay towards W^{[l+3]} = 0 and b^{[l+3]} = 0, and the ReLU activation then yields a^{[l+3]} = a^{[l]}, ensuring the higher layers perform at least as well as the lower layers.
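The identity-mapping argument can be checked numerically. Below is a minimal sketch with random hypothetical weights, omitting batch normalization, showing the block collapsing to the identity once the final layer's weights and bias decay to zero:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def identity_block(a_l, weights, biases):
    """Forward pass of a simplified 3-layer identity block
    (batch norm omitted): the input a_l is added back before
    the final activation, as in He et al. (2016)."""
    a = a_l
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    W3, b3 = weights[-1], biases[-1]
    return relu(W3 @ a + b3 + a_l)  # skip connection

# If weight decay drives the last layer's parameters to zero,
# the block reduces to the identity for non-negative inputs.
d = 4
a_l = relu(np.random.randn(d))  # activations are already ReLU outputs
Ws = [np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1, np.zeros((d, d))]
bs = [np.random.randn(d) * 0.1, np.random.randn(d) * 0.1, np.zeros(d)]
out = identity_block(a_l, Ws, bs)
print(np.allclose(out, a_l))  # True: a[l+3] == a[l]
```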
Conditional Variational Autoencoder
In contrast to the above architecture, which is essentially a many-to-one mapping, we understand the network output to be complex in representation so as to closely match the real traffic scenario. An autoencoder network is capable of this task. The observations O_EV of the EV and the future trajectories of all the agents X_a^N, fed into the encoder network, create a latent vector z. We rather generate a distribution P_θ(Z | O_EV, X_a^N) at the bottleneck while training the network. The network is trained to regenerate the trajectories O_EV and X_a^N at the output, conditioned on a vector y representing features of O_EV. The decoder output is compared against the fed input. The cost function is minimized by backpropagating the error using the stochastic gradient variational Bayes method. Once trained, for an observation of agents' motion O_EV, we obtain a sample from the stochastic latent space and condition the decoder on the vector y, to get predictions of all the agents through the conditional probability distribution P_θ(X_a^N | Z, Y) learnt by the decoder.
Figure 3: Conditional variational autoencoder architecture.
(a) (b) (c)
Figure 4: Sample images from dataset. (a) Sample image of a target agent (green color) along with other surrounding agents at a timestamp.
(b) Another sample image of another agent (in green color) and other agents at the same timestamp. (c) 2D view of a data tensor.
Figure 3 shows the CVAE network. The input to the network is a (224x224) pixels x 125 size tensor capturing 10 historical and 50 future trajectories. Out of this tensor, a (224x224) pixels x 25 size tensor, representing the 10 historical trajectories, is fed into a feature extractor module to deduce a vector representing O_EV. This module just extracts the trajectories and traffic light conditions present in the raw data, upon which the decoder part of the network is conditioned. The encoder part of the CVAE has a sequence of convolutional, batch normalization and activation layers and ends with dense layers. On the contrary, the decoder begins with dense layers and finishes with logit and softmax layers for multimodal trajectories and their probability prediction. At the bottleneck, between the encoder and the decoder parts, distributions are sampled and the decoder is conditioned.
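The sampling and conditioning at the bottleneck can be sketched as follows; the encoder and decoder here are hypothetical linear stand-ins rather than the convolutional and dense stacks described above, with the 6-dimensional latent space from our settings and an illustrative condition-vector size:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 6  # 6 means and 6 standard deviations

def encode(features):
    """Stand-in encoder head: maps features to the latent
    Gaussian parameters (mu, sigma) of P(Z | O_EV, X_a^N)."""
    mu = features[:latent_dim] * 0.1
    log_var = features[latent_dim:2 * latent_dim] * 0.1
    return mu, np.exp(0.5 * log_var)

def sample_latent(mu, sigma):
    # Reparameterization trick: z = mu + sigma * eps keeps the
    # sampling step differentiable with respect to mu and sigma.
    eps = rng.standard_normal(latent_dim)
    return mu + sigma * eps

def decode(z, y):
    """Stand-in conditioned decoder: concatenates the latent
    sample with the condition vector y (features of O_EV)."""
    return np.concatenate([z, y])

features = rng.standard_normal(2 * latent_dim)
y = rng.standard_normal(4)  # hypothetical condition-vector size
mu, sigma = encode(features)
z = sample_latent(mu, sigma)
out = decode(z, y)
print(out.shape)  # (10,)
```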
Experiment
In this section we introduce the dataset that we used for
experimenting on the selected neural networks, the pa-
rameter settings while implementing the networks and the
results.
Dataset
The dataset presented by Houston et al. (2020) gathers more than 1,000 hours of traffic agent motion data. A LiDAR-mounted self-driven EV moving on road segments captures its own and surrounding agents' translation, rotation, yaw and
extent. Several EVs are utilized to put up 16,000 miles of coverage. Over the travel segments, cameras are mounted at high elevations to capture the road conditions. Around 15,000 semantic map annotations are available. The preprocessed data is available as a collection of scenes. Each scene comprises 248.36 frames on average. The frames, sampled at 10 fps and arranged in chronological order, contain all the details of the agents' motion.
The library functions provided with the dataset can render the data as images, so that convolutional deep learning architectures can take advantage of it. The target agent is placed at a desired location of choice, and the other agents are thereafter placed relative to this position, with their extents represented by bounding box dimensions. Figure 4a shows a sample image of 224x224 pixels from the dataset. The target vehicle is here centered at (64,112). Figure 4b shows another sample image at the same timestamp, but a different agent in the frame is now assigned as the target agent and brought to location (64,112). In these images the intention lanes, traffic light status, semantic map and future trajectories of the target agent are also shown. Several of the images, centered on a target agent, are fused in chronological order. The tensor thus formed accommodates the past, present and future motion of all agents present over the road segment. To train the residual net, the past 10 frames of the available agents are fused, and the predictions are used to compute the cost function error. Figure 4c shows a 2D view of the 224x224x25 tensor. It can be seen that the trace of the bounding boxes in this image resembles the agents' motion within a timeframe.
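How such a tensor can be assembled is sketched below; we assume, as in the dataset's rasterization tooling, one channel per time step for the target agent and one for all other agents over the present plus 10 history frames, and 3 RGB channels for the semantic map, giving 2 x (10 + 1) + 3 = 25 channels (the frame contents here are placeholders):

```python
import numpy as np

H = W = 224
history = 10  # past frames fused with the present frame

# Placeholder rasters: one binary occupancy image per time step
# for the target/ego agent, and one for all the other agents.
ego_frames = [np.zeros((H, W), dtype=np.float32) for _ in range(history + 1)]
agent_frames = [np.zeros((H, W), dtype=np.float32) for _ in range(history + 1)]
semantic_map = np.zeros((3, H, W), dtype=np.float32)  # RGB map raster

# Fuse chronologically: the stacked channels trace each agent's
# motion in time, as seen in the 2D view of Figure 4c.
tensor = np.concatenate(
    [np.stack(ego_frames), np.stack(agent_frames), semantic_map], axis=0
)
print(tensor.shape)  # (25, 224, 224)
```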
Network Parameters
To train both networks we choose a pixel size of 0.5 m x 0.5 m, history_time_frames = 10, future_time_frames = 50, batch_size = 32, the Adam optimizer with learning_rate = 1e-03, batch normalization layers having ε = 1e-05 and momentum = 0.1, the networks predicting 3 probable trajectories per agent, and negative multi-log likelihood as the cost function. In the CVAE, we chose a hidden-layer depth of 12 on the encoder end and 6 on the decoder end. The latent space is described by 6 mean and 6 standard deviation variables.
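The negative multi-log likelihood cost can be sketched as follows; this follows the metric distributed with the dataset tooling, where the ground truth is scored under a mixture of unit-variance Gaussians centered on the predicted trajectories and weighted by their confidences (shapes and values here are illustrative):

```python
import numpy as np

def neg_multi_log_likelihood(gt, pred, confidences):
    """gt: (k, 2) ground-truth path; pred: (M, k, 2) candidate
    trajectories; confidences: (M,) mode probabilities summing to 1.
    Scores the ground truth under a unit-variance Gaussian mixture."""
    # Per-mode log term: log c_m - 0.5 * sum of squared errors.
    errors = ((gt[None, :, :] - pred) ** 2).sum(axis=(1, 2))  # (M,)
    log_terms = np.log(confidences) - 0.5 * errors
    # Numerically stable log-sum-exp over the modes, then negate.
    m = log_terms.max()
    return -(m + np.log(np.exp(log_terms - m).sum()))

k, M = 50, 3
gt = np.zeros((k, 2))
pred = np.zeros((M, k, 2))
pred[1] += 0.2  # one mode slightly off the ground truth
conf = np.array([0.5, 0.3, 0.2])
loss = neg_multi_log_likelihood(gt, pred, conf)
print(float(loss))  # small, since two confident modes match exactly
```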
Results
We train both the residual network and the CVAE network
for 50,000 iterations. A plot of the negative multi-log likelihood for both, averaged over 1,000 iterations, is shown in Figure 5. The validation error at those iterations is also shown. The trained residual network reaches a training error of 27.03, with the validation error lowering to 58.27. The CVAE network has its lowest training error equal to 113.67 and a validation error of 150.48.
The increased error of the CVAE can be attributed to the very limited scope in capturing features representing the observations and to the lower network complexity. The error could be lowered with increased training and a better feature representation while conditioning the generator module.
Figure 6 shows the probable trajectories predicted by the networks against the ground truth on a semantic map.
Figure 5: Performance of the networks. Plot of training and vali-
dation error (negative multi-log likelihood).
Figure 6: The ground truth and predicted trajectories plots for a horizon of 50. The upper row shows the ground truth trajectory for 3 sam-
ple images (shown in pink trace). The trajectories in the middle row are the predictions from the residual network (traces in blue, orange
and green color). The trajectories in bottom row are the predictions from the CVAE (again traces in blue, orange and green color). The
corresponding trajectory probability is also shown inside the image.
Shown in the first row is the ground truth trajectory for a
few of the input images from the dataset. Trajectories in
the second row images are predictions from the residual
network. The probabilities of the trajectories are also men-
tioned within the image. The same is shown for the CVAE
in the last row of the figure. It is evident that the trajectory
predictions of the residual network are much closer to the
ground truth.
Conclusion
In this paper we explore residual and conditional variational autoencoder deep learning networks for trajectory prediction of traffic agents. The data collected is essentially represented as a sequence of simplified bird's eye view images. The convolutional layers are able to capture spatial as well as temporal information. The networks' complexity, feature selection, parameters and training length directly impact the prediction accuracy.
References
Agamennoni, G.; Nieto, J.I.; and Nebot, E.M. 2012. Estimation of multivehicle dynamics by considering contextual information. IEEE Transactions on Robotics 28(4): 855-870.
Bertoncello, M.; and Dominik, W. 2015. Ten ways autonomous driving could redefine the automotive world. McKinsey & Company, 6.
Brännström, M.; Coelingh, E.; and Sjöberg, J. 2010. Model-based threat assessment for avoiding arbitrary vehicle collisions. IEEE Transactions on Intelligent Transportation Systems 11(3): 658-669.
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint. arXiv:1406.1078.
Cui, H.; Radosavljevic, V.; Chou, F.C.; Lin, T.H.; Nguyen, T.; Huang, T.K.; Schneider, J.; and Djuric, N. 2019. Multi-modal trajectory predictions for autonomous driving using deep convo-lutional networks. In International Conference on Robotics and Automation (ICRA), 2090-2096.
Dai, S.; Li, L.; and Li, Z. 2019. Modeling vehicle interactions via modified LSTM models for trajectory prediction. IEEE Access 7: 38287-38296.
Deo, N.; and Trivedi, M.M. 2018. Multi-modal trajectory predic-tion of surrounding vehicles with maneuver based LSTMs. In IEEE Intelligent Vehicles Symposium (IV), 1179-1184.
Djuric, N.; Radosavljevic, V.; Cui, H.; Nguyen, T.; Chou, F.C.; Lin, T.H.; and Schneider, J. 2018. Motion prediction of traffic actors for autonomous driving using deep convolutional net-works. arXiv preprint. arXiv:1808.05819, 2.
Fernando, T.; Denman, S.; Sridharan, S.; and Fookes, C. 2018. Soft+ hardwired attention: An LSTM framework for human tra-jectory prediction and abnormal event detection. Neural net-works 108: 466-478.
Gindele, T.; Brechtel, S.; and Dillmann, R. 2010. A probabilistic model for estimating driver behaviors and vehicle trajectories in
traffic environments. In 13th International IEEE Conference on Intelligent Transportation Systems, 1625-1631.
Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; and Alahi, A. 2018. Social GAN: Socially acceptable trajectories with genera-tive adversarial networks. In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition, 2255-2264.
Hammer, B. 2000. On the approximation capability of recurrent neural networks. Neurocomputing 31(1-4): 107-123.
Hancock, P. A.; Illah, N; and Jack, S. 2019. On the future of transportation in an era of automated and autonomous vehicles. In Proceedings of the National Academy of Sciences, 116, no. 16 7684-7691.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learn-ing for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770-778.
Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8): 1735-1780.
Hoermann, S.; Bach, M.; and Dietmayer, K. 2018. Dynamic oc-cupancy grid prediction for urban autonomous driving: A deep learning approach with fully automatic labeling. In IEEE Interna-tional Conference on Robotics and Automation (ICRA), 2056-2063.
Houston, J.; Zuidhof, G.; Bergamini, L.; Ye, Y.; Jain, A.; Omari, S.; Iglovikov, V.; and Ondruska, P. 2020. One Thousand and One Hours: Self-driving Motion Prediction Dataset. arXiv preprint. arXiv:2006.14480.
Hu, Y.; Zhan, W.; and Tomizuka, M. 2018. Probabilistic predic-tion of vehicle semantic intention and motion. In IEEE Intelligent Vehicles Symposium (IV), 307-313.
Lawitzky, A.; Althoff, D.; Passenberg, C.F.; Tanzmeister, G.; Wollherr, D.; and Buss, M. 2013. Interactive scene prediction for automotive applications. In IEEE Intelligent Vehicles Symposium (IV), 1028-1033.
Lee, N.; Choi, W.; Vernaza, P.; Choy, C.B.; Torr, P.H.; and Chandraker, M. 2017. Desire: Distant future prediction in dynam-ic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 336-345.
Luo, W.; Yang, B.; and Urtasun, R. 2018. Fast and furious: Real time end-to-end 3D detection, tracking and motion fore-casting with a single convolutional net. In Proceedings of the IEEE con-ference on Computer Vision and Pattern Recognition, 3569-3577.
Mozaffari, S.; Al-Jarrah, O.Y.; Dianati, M.; Jennings, P.; and Mouzakitis, A. 2020. Deep Learning-Based Vehicle Behavior Prediction for Autonomous Driving Applications: A Review. IEEE Transactions on Intelligent Transportation Systems.
Oliver, N.; and Pentland, A.P. 2000. Graphical models for driver behavior recognition in a smartcar. In Proceedings of the IEEE Intelligent Vehicles Symposium (Cat. No. 00TH8511), 7-12.
Ortiz, M.G.; Fritsch, J.; Kummert, F.; and Gepperth, A. 2011. Behavior prediction at multiple time-scales in inner-city scenari-os. In IEEE Intelligent Vehicles Symposium (IV), 1068-1073.
Phillips, D.J.; Wheeler, T.A.; and Kochenderfer, M.J. 2017. Gen-eralizable intention prediction of human drivers at intersections. In IEEE Intelligent Vehicles Symposium (IV), 1665-1670.
Schreiber, M.; Hoermann, S.; and Dietmayer, K. 2019. Long-term occupancy grid prediction using recurrent neural networks. In International Conference on Robotics and Automation (ICRA), 9299-9305.
Xin, L.; Wang, P.; Chan, C.Y.; Chen, J.; Li, S.E.; and Cheng, B. 2018. Intention-aware long horizon trajectory pre-diction of sur-rounding vehicles using dual LSTM networks. In 21st Interna-tional Conference on Intelligent Transportation Systems (ITSC), 1441-1446.
Zhan, W.; La de Fortelle, A.; Chen, Y.T.; Chan, C.Y.; and Tomi-zuka, M. 2018. Probabilistic prediction from planning perspec-tive: Problem formulation, representation simplification and eval-uation metric. In IEEE Intelligent Vehicles Symposium (IV), 1150-1156.
Zhao, T.; Xu, Y.; Monfort, M.; Choi, W.; Baker, C.; Zhao, Y.; Wang, Y.; and Wu, Y.N. 2019. Multi-agent tensor fusion for contextual trajectory prediction. In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition, 12126-12134.
Zyner, A.; Worrall, S.; and Nebot, E. 2018. A recurrent neural network solution for predicting driver intention at unsignalized intersections. IEEE Robotics and Automation Letters 3(3): 1759-1764.
Zyner, A.; Worrall, S.; Ward, J.; and Nebot, E. 2017. Long short-term memory for driver intent prediction. In IEEE Intelligent Vehicles Symposium (IV), 1484-1489.