Deep Learning Techniques for Autonomous Vehicle Path Prediction
Jagadish D. N., Arun Chauhan, Lakshman Mahto
Indian Institute of Information Technology Dharwad, India [email protected], [email protected], [email protected]
Abstract
Mobility of autonomous vehicles is a challenging task to implement. Under the given traffic circumstances, the behavior of all agent vehicles has to be understood, and their paths over a short future horizon need to be predicted, in order to decide upon the maneuver of the ego vehicle. In this work we explore deep learning techniques to predict the movement of agents by implementing residual and conditional variational autoencoder deep learning networks.
Introduction
Smart city management demands intelligent traffic control encompassing safer, accident-free traffic along with optimum maneuverability. Urban mobility will focus on providing starting-point-to-destination functionality, with purposefully directed movement of autonomous vehicles (AVs) in an integrated transportation infrastructure (Hancock, Illah, and Jack 2019). Differing forms of vehicle control will exist side by side on the road, ranging from pedestrians to AVs. As per a survey report (Bertoncello and Dominik 2015), AVs are going to become the primary means of transport before the middle of the century. The major consequent benefits will be: AVs freeing up to 50 minutes a day for drivers; a reduction of parking space by billions of square meters; and a 90% fall in road accidents, saving millions of lives and large expenditures. Needless to say, the way ahead is to have AVs on the road.
The routine task in an AV will be to apply intelligence to maneuver in the given scenario. The context could involve other intelligent or non-intelligent vehicles, cyclists and pedestrians in surroundings dictated by traffic signs and road locality, wherein each such agent could present independent movement in the vicinity of the AV. Understanding the scenario, the AV can maneuver by predicting the future movements of the other agents. With these predictions, dangerous situations ahead can be anticipated and the necessary reactions taken to avoid them.
Copyright © 2020, Association for the Advancement of Artificial Intelli-gence (www.aaai.org). All rights reserved.
Several motion models for trajectory predictions are
presented in the literature. Physics-based models
(Brännström, Coelingh, and Sjöberg 2010) rely on low
level motion properties. They fail to anticipate motion changes with changing context, and are therefore limited to very short-term motion prediction. Maneuver-based
motion models represent vehicles as independent maneu-
vering entities banking on the early recognition of the ma-
neuvers that drivers intend to perform (Ortiz et al. 2011).
These models also suffer from inaccuracies, especially
at traffic junctions. In the interaction-aware motion mod-
els, vehicles are maneuvering entities that interact with
each other. For methods relying on trajectory prototypes
(Lawitzky et al. 2013), trajectories leading to an unavoida-
ble collision are penalized. Dynamic Bayesian Networks
hold a higher share among the interaction-aware motion
models. Pairwise dependencies between agents are mod-
elled with asymmetric coupled Hidden Markov Models
(Oliver and Pentland 2000). The fact that agents’ interac-
tions are regulated by traffic rules is exploited in
(Agamennoni, Nieto, and Nebot 2012). The work in (Gin-
dele, Brechtel, and Dillmann 2010) rather accounts for
mutual influences instead of pairwise dependencies. The
causal dependencies between the agents are modeled as a
function of local situational context.
Figure 1: A sample semantic map describing a traffic scenario.

Computer vision techniques are utilized to predict pedestrians' behavior, as they are the vulnerable road users (Gupta et al. 2018; Fernando et al. 2018). Object detection
and behavior prediction are the primary goals of an AV.
Acquired time series sensory data helps to classify and
build the localized traffic scenario, while the agents’ dy-
namics are understood and the behavior of them in the near
future will guide the decision making (Zhan et al. 2018).
Prominent non-trivial tasks under behavior prediction involve learning the following: the interdependencies between the vehicles in the surroundings; the road geometry and traffic rules impacting the vehicle trajectories; and the multimodal trajectories of the vehicles.
Several practical limitations exist while implementing the prediction entity. Among them: the AV can only partially observe the surrounding environment using on-board sensors, which suffer from limited range, sensor noise and occlusion; and the computational resources available on-board are restricted. Most existing studies assume the availability of an unobstructed top-down view of the surrounding environment, which can be obtained by infrastructure sensors such as a surveillance camera overlooking the AV (Mozaffari et al. 2020). Such an arrangement is costly to implement. Nevertheless, a dataset (Houston et al. 2020) comprising EV translation and orientation, agents' translations and yaws, and traffic signal conditions, along with semantic maps of the road, is available for exploration using deep learning techniques. A sample semantic map highlighting vehicles, intended paths and traffic lights is shown in Figure 1. The AV under drive, which collects the sensor data and whose path prediction is the ultimate goal, is often referred to as the ego vehicle (EV).
In this paper, we make use of residual and conditional variational autoencoder (CVAE) deep learning networks for trajectory prediction of traffic agents. The remainder of the paper provides insight into related works in the application domain, the suggested methodologies and the experimental results.
Related Works
Several works have been carried out on interaction-aware
motion models. In (Zyner, Worrall, and Nebot 2018; Zyner
et al. 2017; Xin et al. 2018) an agent’s behavior is learnt by
considering track history (e.g. position, velocity, accelera-
tion, direction) in relation to EV. The model considers the
agent as an isolated entity despite being surrounded by
other agents. Simple recurrent neural networks (RNN)
(Hammer 2000) are good at processing temporal depend-
encies. The works fail to capture the interdependencies
among the agents. The model could produce erroneous
results under a crowded environment. The above model
drawback is addressed by providing surrounding agents
track history, in relation to the target agent considered, in
(Deo and Trivedi 2018; Phillips, Wheeler, and Kochender-
fer 2017; Dai, Li, and Li 2019; Hu, Zhan, and Tomizuka
2018). Varied number of surrounding agents in the neigh-
borhood of the target agent are considered based on the
lane, direction and distance in these works. Multiple RNNs
constituting a gated recurrent unit (GRU) (Cho et al. 2014)
or long short-term memory (LSTM) (Hochreiter and
Schmidhuber 1997) are utilized to make predictions and
capture interdependencies between agents. These models, however, lack information on environmental factors that could impact the behavior of the target agent. Since the sensing is done by the EV, a few of the surrounding agents in the vicinity of the target agent may be occluded as well. A
bird’s eye view (BEV) image represents agents with color
coded bounding boxes, and intended driving lanes and traf-
fic signals as color lines (Cui et al. 2019; Djuric et al.
2018) (refer Figure 1). Semantic segmentation of environ-
ment in BEV is proposed in (Lee et al. 2017). With image-
like data, convolutional neural network (CNN) layers can
explore the spatial relationship of agents in the traffic con-
text scenario very effectively. The results are very promis-
ing. Going ahead, in (Luo, Yang, and Urtasun 2018) 3D
convolutions are applied to the temporal dimension of in-
put data. Later, a series of 2D convolutions is utilized to
capture spatial features. Hybrid networks are developed to
reap the benefits of RNNs and CNNs (Lee et al. 2017;
Zhao et al. 2019). The actual output representation is complex, and hence a conditional generative model is suggested by Lee et al. (2017); for this purpose a CVAE network is used for prediction. As far as network predictions are
concerned, the network may predict future maneuver intentions, or unimodal or multimodal trajectories. The maneuver intention output is a high-level understanding of traffic scenarios that can only help to infer the probability of an agent moving on a straight lane or taking a left or right turn at an intersection. However, predicting a trajectory instead of a mere intention provides more precise information
about future behavior of agents. Given the traffic scenario
and history of motion of an agent, the network can output
several possible future trajectories for the agent. The corre-
sponding trajectories distribution is therefore multimodal.
However, a unimodal trajectory predictor is limited to predicting the one among these possible trajectories with the highest likelihood. The network in (Ho-
ermann, Bach, and Dietmayer 2018; Schreiber, Hoermann,
and Dietmayer 2019) outputs an occupancy map that will
describe the probability of occupancy of each pixel of BEV
image at each timestamp in the prediction horizon.
Methods
The problem of path prediction of an agent can be formulated using a probabilistic model. If x_t^i represents the state of the i-th agent, conveying its position, among N surrounding agents at time instance t, then the predicted path of all agents is defined as:

X_a^N = \{ x_t^i, x_{t+1}^i, x_{t+2}^i, \ldots, x_{t+k}^i \} \quad \forall i \in N

Here k is the length of the prediction horizon.
The problem boils down to computing the conditional distribution P(X_a^N | O_EV) on recording the observations O_EV by the EV. In this section we propose a few neural networks for our objective. Upfront we have a residual deep neural network, and later a CVAE deep neural network.
Residual Network
As this distribution is computationally challenging, we can resort to predicting the path of each agent separately, given by:

X_a^i = \{ x_t^i, x_{t+1}^i, x_{t+2}^i, \ldots, x_{t+k}^i \}
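As a concrete data-structure view, the per-agent paths and the multimodal outputs described above can be held in plain arrays; this is a minimal sketch with hypothetical sizes (the horizon k = 50 and the 3 modes match the settings used later, the agent count is arbitrary):

```python
import numpy as np

# Hypothetical sizes: k prediction-horizon steps, M modes per
# agent (the networks later predict 3), N agents in the scene.
k, M, N = 50, 3, 8

# Unimodal path X_a^i of one agent: k future (x, y) positions.
x_a_i = np.zeros((k, 2))

# Multimodal prediction for all agents: M candidate
# trajectories per agent plus a confidence score per mode.
trajectories = np.zeros((N, M, k, 2))
confidences = np.full((N, M), 1.0 / M)  # uniform to start

# Mode confidences must form a probability distribution per agent.
assert np.allclose(confidences.sum(axis=1), 1.0)
print(trajectories.shape)  # (8, 3, 50, 2)
```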
Figure 2 shows the network architecture. The convolu-
tion layer upfront segregates temporal information in data,
whereas later convolutional layers in series capture spatial
information of agents. The input to the network is
(224x224) pixels x 25 channels. The output consists of multimodal 2D agent coordinates and their associated confidence scores. Each identity block has 3 (conv + batch norm
+ activation) layers with skip connection, whereas a conv
block has a (conv + batch norm) layer in the skip connec-
tion path. The network (He et al. 2016) is trained to predict
trajectories for all agents for a horizon of 50 by looking at
10 historical coordinates of those agents.
Figure 2: Residual network architecture.
If a, b, W, g() and l represent the layer activations, bias, weights, activation function and layer number, respectively, in an identity block, then

a^{[l+1]} = g(W^{[l+1]} a^{[l]} + b^{[l+1]})
a^{[l+3]} = g(W^{[l+3]} a^{[l+2]} + b^{[l+3]} + a^{[l]})

L2 regularization encourages weight decay towards W^{[l+3]} = 0 and b^{[l+3]} = 0, and the ReLU activation then yields a^{[l+3]} = a^{[l]}, ensuring the higher layers perform at least as well as the lower layers.
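The identity-mapping argument can be checked numerically. Below is a minimal sketch with random hypothetical weights, omitting batch normalization, showing the block collapsing to the identity once the final layer's weights and bias decay to zero:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def identity_block(a_l, weights, biases):
    """Forward pass of a simplified 3-layer identity block
    (batch norm omitted): the input a_l is added back before
    the final activation, as in He et al. (2016)."""
    a = a_l
    for W, b in zip(weights[:-1], biases[:-1]):
        a = relu(W @ a + b)
    W3, b3 = weights[-1], biases[-1]
    return relu(W3 @ a + b3 + a_l)  # skip connection

# If weight decay drives the last layer's parameters to zero,
# the block reduces to the identity for non-negative inputs.
d = 4
a_l = relu(np.random.randn(d))  # activations are already ReLU outputs
Ws = [np.random.randn(d, d) * 0.1, np.random.randn(d, d) * 0.1, np.zeros((d, d))]
bs = [np.random.randn(d) * 0.1, np.random.randn(d) * 0.1, np.zeros(d)]
out = identity_block(a_l, Ws, bs)
print(np.allclose(out, a_l))  # True: a[l+3] == a[l]
```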
Conditional Variational Autoencoder
In contrast to the above architecture, which is essentially a many-to-one mapping, we understand the network output to be complex in representation so as to closely match the real traffic scenario. An autoencoder network is capable of this task. The observations O_EV of the EV and the future trajectories of all the agents X_a^N, fed into the encoder network, create a latent vector z. We rather generate a distribution P_θ(Z | O_EV, X_a^N) at the bottleneck while training the network. The network is trained to regenerate the trajectories O_EV and X_a^N at the output, conditioned on a vector y representing features of O_EV. The decoder output is compared against the fed input. The cost function is minimized by backpropagating the error using the stochastic gradient variational Bayes method. Once trained, for an observation of agents' motion O_EV, we obtain a sample from the stochastic latent space and condition the decoder on the vector y, to get predictions of all the agents through the conditional probability distribution P_θ(X_a^N | Z, Y) learnt by the decoder.
Figure 3: Conditional variational autoencoder architecture.
(a) (b) (c)
Figure 4: Sample images from dataset. (a) Sample image of a target agent (green color) along with other surrounding agents at a timestamp.
(b) Another sample image of another agent (in green color) and other agents at the same timestamp. (c) 2D view of a data tensor.
Figure 3 shows the CVAE network. The input to the network is a (224x224) pixels x 125 size tensor capturing 10 historical and 50 future trajectories. Out of this tensor, a (224x224) pixels x 25 size tensor, representing the 10 historical trajectories, is fed into a feature extractor module to deduce a vector representing O_EV. This module just extracts the trajectories and traffic light conditions present in the raw data, upon which the decoder part of the network is conditioned. The encoder part of the CVAE has a sequence of convolutional, batch normalization and activation layers and ends with dense layers. On the contrary, the decoder begins with dense layers and finishes with logit and softmax layers for multimodal trajectories and their probability prediction. At the bottleneck, between the encoder and the decoder parts, distributions are sampled and the decoder is conditioned.
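The sampling and conditioning at the bottleneck can be sketched as follows; the encoder and decoder here are hypothetical linear stand-ins rather than the convolutional and dense stacks described above, with the 6-dimensional latent space from our settings and an illustrative condition-vector size:

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 6  # 6 means and 6 standard deviations

def encode(features):
    """Stand-in encoder head: maps features to the latent
    Gaussian parameters (mu, sigma) of P(Z | O_EV, X_a^N)."""
    mu = features[:latent_dim] * 0.1
    log_var = features[latent_dim:2 * latent_dim] * 0.1
    return mu, np.exp(0.5 * log_var)

def sample_latent(mu, sigma):
    # Reparameterization trick: z = mu + sigma * eps keeps the
    # sampling step differentiable with respect to mu and sigma.
    eps = rng.standard_normal(latent_dim)
    return mu + sigma * eps

def decode(z, y):
    """Stand-in conditioned decoder: concatenates the latent
    sample with the condition vector y (features of O_EV)."""
    return np.concatenate([z, y])

features = rng.standard_normal(2 * latent_dim)
y = rng.standard_normal(4)  # hypothetical condition-vector size
mu, sigma = encode(features)
z = sample_latent(mu, sigma)
out = decode(z, y)
print(out.shape)  # (10,)
```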
Experiment
In this section we introduce the dataset that we used for
experimenting on the selected neural networks, the pa-
rameter settings while implementing the networks and the
results.
Dataset
The dataset presented by Houston et al. (2020) gathers more than 1,000 hours of traffic agent motion data. A LiDAR-mounted self-driven EV moving on road segments captures its own and surrounding agents' translation, rotation, yaw and
extent. Several EVs are utilized to put up 16,000 miles of coverage. Over the travel segments, cameras are mounted at high elevations to capture the road conditions. Around 15,000 semantic map annotations are available. The preprocessed data is available as a collection of scenes. Each scene comprises 248.36 frames on average. The frames, sampled at 10 fps and arranged in chronological order, contain all the details of the agents' motion.
The library functions provided with the dataset can render the data as images, so that convolutional deep learning architectures can take advantage of it. The target agent is placed at a desired location of choice, and the other agents are thereafter placed relative to this position, with their extents represented by bounding box dimensions. Figure 4a shows a sample image of 224x224 pixels from the dataset. The target vehicle is here centered at (64,112). Figure 4b shows another sample image at the same timestamp, but a different agent in the frame is now assigned as the target agent and brought to location (64,112). In these images the intention lanes, traffic light status, semantic map and future trajectories of the target agent are also shown. Several of the images, centered on a target agent, are fused in chronological order. The tensor thus formed accommodates the past, present and future motion of all agents present over the road segment. To train the residual net, the past 10 frames of the available agents are fused, and the predictions are used to compute the cost function error. Figure 4c shows a 2D view of the 224x224x25 tensor. It can be seen that the trace of the bounding boxes in this image resembles the agents' motion within a timeframe.
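How such a tensor can be assembled is sketched below; we assume, as in the dataset's rasterization tooling, one channel per time step for the target agent and one for all other agents over the present plus 10 history frames, and 3 RGB channels for the semantic map, giving 2 x (10 + 1) + 3 = 25 channels (the frame contents here are placeholders):

```python
import numpy as np

H = W = 224
history = 10  # past frames fused with the present frame

# Placeholder rasters: one binary occupancy image per time step
# for the target/ego agent, and one for all the other agents.
ego_frames = [np.zeros((H, W), dtype=np.float32) for _ in range(history + 1)]
agent_frames = [np.zeros((H, W), dtype=np.float32) for _ in range(history + 1)]
semantic_map = np.zeros((3, H, W), dtype=np.float32)  # RGB map raster

# Fuse chronologically: the stacked channels trace each agent's
# motion in time, as seen in the 2D view of Figure 4c.
tensor = np.concatenate(
    [np.stack(ego_frames), np.stack(agent_frames), semantic_map], axis=0
)
print(tensor.shape)  # (25, 224, 224)
```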
Network Parameters
To train both networks we choose a pixel size of 0.5 m x 0.5 m, history_time_frames = 10, future_time_frames = 50, batch_size = 32, the Adam optimizer with learning_rate = 1e-03, batch normalization layers having ε = 1e-05 and momentum = 0.1, the networks predicting 3 probable trajectories per agent, and negative multi-log likelihood as the cost function. In the CVAE, we chose a hidden-layer depth of 12 on the encoder end and 6 on the decoder end. The latent space is described by 6 mean and 6 standard deviation variables.
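The negative multi-log likelihood cost can be sketched as follows; this follows the metric distributed with the dataset tooling, where the ground truth is scored under a mixture of unit-variance Gaussians centered on the predicted trajectories and weighted by their confidences (shapes and values here are illustrative):

```python
import numpy as np

def neg_multi_log_likelihood(gt, pred, confidences):
    """gt: (k, 2) ground-truth path; pred: (M, k, 2) candidate
    trajectories; confidences: (M,) mode probabilities summing to 1.
    Scores the ground truth under a unit-variance Gaussian mixture."""
    # Per-mode log term: log c_m - 0.5 * sum of squared errors.
    errors = ((gt[None, :, :] - pred) ** 2).sum(axis=(1, 2))  # (M,)
    log_terms = np.log(confidences) - 0.5 * errors
    # Numerically stable log-sum-exp over the modes, then negate.
    m = log_terms.max()
    return -(m + np.log(np.exp(log_terms - m).sum()))

k, M = 50, 3
gt = np.zeros((k, 2))
pred = np.zeros((M, k, 2))
pred[1] += 0.2  # one mode slightly off the ground truth
conf = np.array([0.5, 0.3, 0.2])
loss = neg_multi_log_likelihood(gt, pred, conf)
print(float(loss))  # small, since two confident modes match exactly
```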
Results
We train both the residual network and the CVAE network
for 50,000 iterations. A plot of the negative multi-log likelihood for both, averaged over 1,000 iterations, is shown in Figure 5. The validation error at those iterations is also shown. The trained residual network reaches a training error of 27.03, with the validation error lowering to 58.27. The CVAE network has its lowest training error equal to 113.67 and a validation error of 150.48.
The increased error of the CVAE can be attributed to the very limited scope in capturing features representing the observations and to the lower network complexity. The error could be lowered with increased training and a better feature representation while conditioning the generator module.
Figure 6 shows the probable trajectories predicted by the networks against the ground truth on a semantic map.
Figure 5: Performance of the networks. Plot of training and vali-
dation error (negative multi-log likelihood).
Figure 6: The ground truth and predicted trajectories plots for a horizon of 50. The upper row shows the ground truth trajectory for 3 sam-
ple images (shown in pink trace). The trajectories in the middle row are the predictions from the residual network (traces in blue, orange
and green color). The trajectories in bottom row are the predictions from the CVAE (again traces in blue, orange and green color). The
corresponding trajectory probability is also shown inside the image.
Shown in the first row is the ground truth trajectory for a
few of the input images from the dataset. Trajectories in
the second row images are predictions from the residual
network. The probabilities of the trajectories are also men-
tioned within the image. The same is shown for the CVAE
in the last row of the figure. It is evident that the trajectory
predictions of the residual network are much closer to the
ground truth.
Conclusion
In this paper we explore residual and conditional variational autoencoder deep learning networks for trajectory prediction of traffic agents. The data collected is essentially represented as a sequence of simplified bird's eye view images. The convolutional layers are able to capture spatial as well as temporal information. The networks' complexity, feature selection, parameters and training length directly impact the prediction accuracy.
References
Agamennoni, G.; Nieto, J.I.; and Nebot, E.M. 2012. Estimation of multivehicle dynamics by considering contextual information. IEEE Transactions on Robotics 28(4): 855-870.
Bertoncello, M.; and Dominik, W. 2015. Ten ways autonomous driving could redefine the automotive world. McKinsey & Company, 6.
Brännström, M.; Coelingh, E.; and Sjöberg, J. 2010. Model-based threat assessment for avoiding arbitrary vehicle collisions. IEEE Transactions on Intelligent Transportation Systems 11(3): 658-669.
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint. arXiv:1406.1078.
Cui, H.; Radosavljevic, V.; Chou, F.C.; Lin, T.H.; Nguyen, T.; Huang, T.K.; Schneider, J.; and Djuric, N. 2019. Multi-modal trajectory predictions for autonomous driving using deep convo-lutional networks. In International Conference on Robotics and Automation (ICRA), 2090-2096.
Dai, S.; Li, L.; and Li, Z. 2019. Modeling vehicle interactions via modified LSTM models for trajectory prediction. IEEE Access 7: 38287-38296.
Deo, N.; and Trivedi, M.M. 2018. Multi-modal trajectory predic-tion of surrounding vehicles with maneuver based LSTMs. In IEEE Intelligent Vehicles Symposium (IV), 1179-1184.
Djuric, N.; Radosavljevic, V.; Cui, H.; Nguyen, T.; Chou, F.C.; Lin, T.H.; and Schneider, J. 2018. Motion prediction of traffic actors for autonomous driving using deep convolutional net-works. arXiv preprint. arXiv:1808.05819, 2.
Fernando, T.; Denman, S.; Sridharan, S.; and Fookes, C. 2018. Soft+ hardwired attention: An LSTM framework for human tra-jectory prediction and abnormal event detection. Neural net-works 108: 466-478.
Gindele, T.; Brechtel, S.; and Dillmann, R. 2010. A probabilistic model for estimating driver behaviors and vehicle trajectories in
traffic environments. In 13th International IEEE Conference on Intelligent Transportation Systems, 1625-1631.
Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; and Alahi, A. 2018. Social GAN: Socially acceptable trajectories with genera-tive adversarial networks. In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition, 2255-2264.
Hammer, B. 2000. On the approximation capability of recurrent neural networks. Neurocomputing 31(1-4): 107-123.
Hancock, P. A.; Illah, N; and Jack, S. 2019. On the future of transportation in an era of automated and autonomous vehicles. In Proceedings of the National Academy of Sciences, 116, no. 16 7684-7691.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learn-ing for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770-778.
Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8): 1735-1780.
Hoermann, S.; Bach, M.; and Dietmayer, K. 2018. Dynamic oc-cupancy grid prediction for urban autonomous driving: A deep learning approach with fully automatic labeling. In IEEE Interna-tional Conference on Robotics and Automation (ICRA), 2056-2063.
Houston, J.; Zuidhof, G.; Bergamini, L.; Ye, Y.; Jain, A.; Omari, S.; Iglovikov, V.; and Ondruska, P. 2020. One Thousand and One Hours: Self-driving Motion Prediction Dataset. arXiv preprint. arXiv:2006.14480.
Hu, Y.; Zhan, W.; and Tomizuka, M. 2018. Probabilistic predic-tion of vehicle semantic intention and motion. In IEEE Intelligent Vehicles Symposium (IV), 307-313.
Lawitzky, A.; Althoff, D.; Passenberg, C.F.; Tanzmeister, G.; Wollherr, D.; and Buss, M. 2013. Interactive scene prediction for automotive applications. In IEEE Intelligent Vehicles Symposium (IV), 1028-1033.
Lee, N.; Choi, W.; Vernaza, P.; Choy, C.B.; Torr, P.H.; and Chandraker, M. 2017. Desire: Distant future prediction in dynam-ic scenes with interacting agents. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 336-345.
Luo, W.; Yang, B.; and Urtasun, R. 2018. Fast and furious: Real time end-to-end 3D detection, tracking and motion fore-casting with a single convolutional net. In Proceedings of the IEEE con-ference on Computer Vision and Pattern Recognition, 3569-3577.
Mozaffari, S.; Al-Jarrah, O.Y.; Dianati, M.; Jennings, P.; and Mouzakitis, A. 2020. Deep Learning-Based Vehicle Behavior Prediction for Autonomous Driving Applications: A Review. IEEE Transactions on Intelligent Transportation Systems.
Oliver, N.; and Pentland, A.P. 2000. Graphical models for driver behavior recognition in a smartcar. In Proceedings of the IEEE Intelligent Vehicles Symposium (Cat. No. 00TH8511), 7-12.
Ortiz, M.G.; Fritsch, J.; Kummert, F.; and Gepperth, A. 2011. Behavior prediction at multiple time-scales in inner-city scenari-os. In IEEE Intelligent Vehicles Symposium (IV), 1068-1073.
Phillips, D.J.; Wheeler, T.A.; and Kochenderfer, M.J. 2017. Gen-eralizable intention prediction of human drivers at intersections. In IEEE Intelligent Vehicles Symposium (IV), 1665-1670.
Schreiber, M.; Hoermann, S.; and Dietmayer, K. 2019. Long-term occupancy grid prediction using recurrent neural networks. In International Conference on Robotics and Automation (ICRA), 9299-9305.
Xin, L.; Wang, P.; Chan, C.Y.; Chen, J.; Li, S.E.; and Cheng, B. 2018. Intention-aware long horizon trajectory pre-diction of sur-rounding vehicles using dual LSTM networks. In 21st Interna-tional Conference on Intelligent Transportation Systems (ITSC), 1441-1446.
Zhan, W.; La de Fortelle, A.; Chen, Y.T.; Chan, C.Y.; and Tomi-zuka, M. 2018. Probabilistic prediction from planning perspec-tive: Problem formulation, representation simplification and eval-uation metric. In IEEE Intelligent Vehicles Symposium (IV), 1150-1156.
Zhao, T.; Xu, Y.; Monfort, M.; Choi, W.; Baker, C.; Zhao, Y.; Wang, Y.; and Wu, Y.N. 2019. Multi-agent tensor fusion for contextual trajectory prediction. In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition, 12126-12134.
Zyner, A.; Worrall, S.; and Nebot, E. 2018. A recurrent neural network solution for predicting driver intention at unsignalized intersections. IEEE Robotics and Automation Letters 3(3): 1759-1764.
Zyner, A.; Worrall, S.; Ward, J.; and Nebot, E. 2017. Long short-term memory for driver intent prediction. In IEEE Intelligent Vehicles Symposium (IV), 1484-1489.