APPROXIMATE METHODS FOR VALIDATING AUTONOMOUS
SYSTEMS IN SIMULATION
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF AERONAUTICS AND
ASTRONAUTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Mark Koren
May 2021
© 2021 by Mark Koren. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/pv383pd8838
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Mykel Kochenderfer, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
J Gerdes
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Dorsa Sadigh
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Ritchie Lee
Approved for the Stanford University Committee on Graduate Studies.
Stacey F. Bent, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Because of the safety-critical nature of many autonomous systems, validation is essential before
deployment. However, validation is difficult—most of these systems act in high-dimensional spaces
that make formal methods intractable, and failures are too rare to rely on physical testing. Instead,
systems must be validated approximately in simulation. How to perform this validation tractably
while still ensuring safety is an open problem.
One approach to validation is adaptive stress testing (AST), where finding the most-likely failure
in simulation is formulated as a Markov decision process (MDP). Reinforcement learning techniques
can then be used to validate a system through falsification. We are interested in validating agents
that act in large, continuous, and complex spaces. Consequently, it is almost always the case that
forcing a failure is possible. Optimizing to find the most likely failure improves the relevance of
the failures uncovered, and provides valuable information to designers. This thesis presents two
new techniques for solving the MDP to find failures: 1) a deep reinforcement learning (DRL) based
approach and 2) a go-explore (GE) based approach.
Scalability is key to efficiently validating an autonomous agent, for which large, continuous
state and action spaces lead to a dimensional explosion in possible scenario rollouts. This problem is
exacerbated by the fact that designers are often interested in a space of similar test scenarios starting
from slightly different initial conditions. Running a validation method many times from different
initial conditions could quickly become intractable. DRL has been shown to perform better than
traditional reinforcement learning techniques, such as Monte Carlo tree search (MCTS), on problems
with continuous state spaces. In addition to scalability advantages, DRL can use recurrent networks
to explicitly capture the sequential structure of a policy. This thesis presents a DRL reinforcement
learner for AST based on recurrent neural networks (RNNs). By using an RNN, the reinforcement
learner learns a policy that generalizes across initial conditions, while also providing the scalability
advantages of deep learning.
While DRL techniques scale well, they also rely on the existence of a constant reward signal
to guide the agent towards better solutions during training. For validation, domain experts can
sometimes provide a heuristic that will guide the reinforcement learner towards failures. However,
without such a heuristic, the problem becomes a hard-exploration problem. GE has shown state-of-the-art results on traditional hard-exploration benchmarks such as Montezuma's Revenge. This
thesis uses the tree search phase of go-explore to find failures without heuristics in domains where
DRL and MCTS do not find failures. In addition, this thesis shows that phase 2 of go-explore,
the backward algorithm, can often be used to improve the likelihood of failures found by any
reinforcement learning method, with or without heuristics.
Autonomous vehicles are an example of an autonomous system that acts in a large, continuous
state space. In addition, failures are rare events for autonomous vehicles, with some experts proposing that they will not be safe enough until they crash only once every $1.0 \times 10^9$ miles. Consequently,
validating the safety of autonomous systems generally requires the use of high-fidelity simulators
that adequately capture the variability of real-world scenarios. However, it is generally not feasible
to exhaustively search the space of simulation scenarios for failures. AST uses reinforcement learning to find the most likely failure of a system. This thesis
presents a way of using low-fidelity simulation rollouts—generally much cheaper and faster to generate—to reduce the number of high-fidelity simulation rollouts needed to find failures, which allows us to validate autonomous vehicles at scale in high-fidelity simulators.
As autonomous systems become more prevalent, and their development more widespread and
distributed, validation techniques likewise must become widely available. Towards that end, the
final contribution of this thesis is the AST Toolbox, an open-source Python package for applying
AST to any autonomous system. The Toolbox contains pre-implemented MCTS, DRL, and GE
reinforcement learners to allow designers to apply the work of this thesis to validating their own
systems. In addition, the Toolbox provides templates to simplify the process of wrapping the system
and simulator in a format that is conducive to reinforcement learning.
Acknowledgments
Finishing a PhD is never easy (and even if it was, I would never underrate my own achievement by
admitting so). However, I owe sincere thanks to many wonderful people for making my PhD process
as smooth and enjoyable as it could possibly be.
First and foremost, I have to thank my advisor, Mykel Kochenderfer. As one would expect, I owe
so much of my educational and professional development to his tutelage. However, I certainly didn’t
anticipate how influential he would be regarding my personal development. Mykel has an iron-clad
commitment to empathy and kindness, and he shows by his actions that being a good person and
being a successful person are not at all contradictory goals. His job was to help me become the
latter, and he did, but in truth, I value his example on being the former even more. Thank you,
Mykel.
I also owe gratitude to all the brilliant people who agreed to be on my committee. Dorsa
Sadigh’s faculty talk was my first exposure at Stanford to safety validation of autonomous vehicles
as a research area that promised to be fascinating and rewarding. Chris Gerdes and I became
acquainted when he was running a research group on the ethics of autonomous vehicles, and our
meetings were hugely influential in the direction in which I ended up taking my thesis. Ritchie
Lee wrote the first adaptive stress testing paper, and in many ways that paper became the basis
of my entire thesis. And while I didn’t personally work on formal methods, Clark Barrett’s work
in the area has always been something I followed closely, as the features of his work and my thesis
complement each other in wonderful ways. Thank you all for your incredibly useful feedback and
advice throughout this process.
I would also like to thank the people who helped put me on the path that led me to Stanford. At
the University of Alabama (roll tide!), John Baker, Jane Batson, Gary Cheng, Darren Evans-Young,
Paul Hubner, Amy Lang, and Shane Sharpe were all instrumental in preparing me to be a successful
researcher and engineer. Preceding them, I might not even have made it to Alabama if not for the
guidance I received at Fremd High School from Karen Clindaniel, Paul Hardy, and Michael Karasch,
nor would I be the person I am today without the influence of Steven Buenning and LoriAnne Frieri.
Finally, a quick shout-out to a person I can only describe as my “engineering uncle,” John Koepke.
Thank you all so much for making me who I am today.
I would be remiss if I didn’t thank the people and organizations who financially made my time
at Stanford a reality. Thank you to Stanford and the Moler family for funding me as the Dr. Cleve
B. Moler Fellow. Thank you to Uber ATG and NVIDIA, as well as all the amazing folks I worked
with there, for both the funding and the internship opportunities. Additionally, thank you to the
Toyota Research Institute for funding my final quarter.
Finally, thank you to all of my friends and family, especially my mom, my dad, and my brother.
It seems unfair, considering how much you all have done for me, to only devote a few sentences to
thanking you. Alas, I don’t have room to thank all of you individually as you deserve, so I will leave
it at this: whether in small or large ways, every single one of you has influenced me for the better,
and I thank you for that.
Contents
Abstract
Acknowledgments
1 Introduction
1.1 Challenges
1.2 The Crosswalk—A Running Example
1.3 Black-Box Safety Validation
1.3.1 Optimization
1.3.2 Path Planning
1.3.3 Importance Sampling
1.3.4 Reinforcement Learning
1.4 Adaptive Stress Testing
1.5 Contributions
1.6 Overview
2 Adaptive Stress Testing
2.1 Sequential Decision Making
2.1.1 Markov Decision Process
2.1.2 Monte Carlo Tree Search
2.1.3 Deep Reinforcement Learning
2.2 Preliminaries
2.3 Problem Formulation
2.4 Reinforcement Learning
2.4.1 Reward Function
2.4.2 Proofs of Desirable Properties
2.5 Methodology
2.5.1 Direct Disturbance Control
2.5.2 Seed-Disturbance Control
2.6 Case Studies
2.6.1 Cartpole with Disturbances
2.6.2 Autonomous Vehicle at a Crosswalk
2.6.3 Aircraft Collision Avoidance Software
2.7 Discussion
3 Scalable Validation
3.1 Fully Connected Approach
3.2 Preserving the Black-box Simulator Assumption
3.3 Experiments
3.3.1 Simulator Design
3.3.2 Problem Formulation
3.4 Results
3.4.1 Performance
3.4.2 Trajectories
3.5 Discussion
4 Generalizing across Initial Conditions
4.1 Modified Recurrent Architecture
4.2 Experiments
4.2.1 Problem Formulation
4.2.2 Modified Reward Function
4.2.3 Reinforcement Learners
4.3 Results
4.3.1 Overall Performance
4.3.2 Comparison to Baselines
4.4 Discussion
5 Heuristic Rewards
5.1 Go-Explore
5.1.1 Phase 1
5.1.2 Phase 2
5.2 Go-Explore for Black-Box Validation
5.2.1 Cell Structure
5.2.2 Cell Selection
5.3 Experiments
5.3.1 Problem Description
5.3.2 Modified Reward Function
5.3.3 Reinforcement Learners
5.4 Results
5.5 Discussion
6 Robustification
6.1 The Backward Algorithm
6.2 Robustification
6.3 Experiments
6.4 Results
6.5 Discussion
7 Validation in High-Fidelity
7.1 Validation in High-Fidelity
7.2 Case Studies
7.2.1 Case Study: Time Discretization
7.2.2 Case Study: Dynamics
7.2.3 Case Study: Tracker
7.2.4 Case Study: Perception
7.2.5 Case Study: NVIDIA DriveSim
7.3 Discussion
8 The AST Toolbox
8.1 Comparison to Existing Software
8.2 AST Toolbox Design
8.2.1 Architecture Overview
8.2.2 Reinforcement Learners
8.2.3 Simulation Interface
8.2.4 Reward Structure
8.3 Case Studies
8.3.1 Cartpole
8.3.2 Autonomous Vehicle
8.3.3 Automatic Transmission
8.4 Discussion
9 Summary and Future Work
9.1 Summary
9.2 Contributions
9.3 Further Work
9.3.1 Full Generalization
9.3.2 Fault Injection in Vision Systems
9.3.3 Interpretability
List of Tables
3.1 Numerical results from both reinforcement learners. Reward without noise shows the reward of the MCTS path if sensor noise was set to zero, to illustrate the difficulty that MCTS has with eliminating noise. DRL is able to find a more probable path than MCTS with a large reduction in calls to the Step function.
4.1 The initial condition space. Initial conditions are drawn from a continuous uniform distribution defined by the supports below.
4.2 The aggregate results of the DRDRL and GRDRL reinforcement learners, as well as the MCTS and FCDRL reinforcement learners as baselines, on an autonomous driving scenario with a 5-dimensional initial condition space. Despite not having access to the simulator’s internal state, the DRDRL reinforcement learner achieves results that are competitive with both baselines. However, the GRDRL reinforcement learner demonstrates a significant improvement over the other three reinforcement learners.
5.1 Parameters that define the easy, medium, and hard scenarios. Changing the pedestrian location results in failures being further from the average action, making exploration more difficult, whereas changing the horizon and timestep lengths makes exploration more complex.
7.1 The results of the time discretization case study. Load Lofi Policy indicates whether the policy of the BA was initialized from scratch or the weights were loaded from the policy trained in lofi. The BA was able to significantly reduce the number of hifi steps needed, and loading the lofi policy produced a further reduction.
7.2 The results of the dynamics case study. Load Lofi Policy indicates whether the policy of the BA was initialized from scratch or the weights were loaded from the policy trained in lofi. The BA was able to significantly reduce the number of hifi steps needed, and loading the lofi policy produced a further reduction.
7.3 The results of the tracker case study. Load Lofi Policy indicates whether the policy of the BA was initialized from scratch or the weights were loaded from the policy trained in lofi. The BA was able to significantly reduce the number of hifi steps needed, and loading the lofi policy produced a further reduction.
7.4 The results of the perception case study. Due to differences in network sizes, resulting from different disturbance vector sizes, it was not possible to load a lofi policy in this case study. The BA was still able to significantly reduce the number of hifi steps needed.
8.1 A feature comparison of the AST Toolbox with three existing software solutions for system validation and verification. The AST Toolbox is unique in two features: 1) being able to treat the entire simulator as a black box, and 2) returning the most likely failure.
List of Figures
1.1 A general layout of the running crosswalk example. The ego vehicle approaches a crosswalk where a pedestrian is trying to cross. The ego vehicle estimates the pedestrian location based on noisy observations.
1.2 A naive approach to validation. The pedestrian is constrained to cross the street in a straight line. Different pedestrian velocities, shown as different sized arrows, may be simulated.
1.3 A validation approach that captures the full variance of real-world possibilities. The pedestrian is unconstrained and can move in any direction, as shown by the blue circle.
1.4 The general validation problem. A model of the system under test interacts with an environment, both of which are contained in a simulator. An adversary perturbs the simulation with disturbances in an effort to force failures.
1.5 The AST methodology. The simulator is treated as a black box. The reinforcement learner interacts with the simulator through disturbances and receives a reward. Maximizing reward results in the most likely failure path.
1.6 A graphical overview of the chapters in this thesis, organized by the category of the contribution in each chapter.
2.1 The AST methodology. The simulator is treated as a black box. The reinforcement learner interacts with the simulator through disturbances and receives a reward. Maximizing reward results in the most likely failure path.
2.2 The AST process using direct disturbance control. The reinforcement learner controls the simulator directly with disturbances, which are used by the reward function to calculate reward.
2.3 The AST process using seed-disturbance control. The reinforcement learner controls the simulator by outputting a seed for the random number generators to use. The reward function uses the transition likelihoods from the simulator to calculate reward.
2.4 Layout of the cartpole environment. A control policy tries to keep the bar from falling over or the cart from moving too far horizontally by applying a control force to the cart [65].
2.5 Layout of the autonomous vehicle scenario. A vehicle approaches a crosswalk on a neighborhood road as a single pedestrian attempts to walk across. Initial conditions are shown.
2.6 An example result from Lee, Mengshoel, Saksena, et al. [53], showing an NMAC identified by AST. Note that the planes must be both vertically and horizontally near to each other to register as an NMAC.
3.1 The network architecture of the fully-connected DRL reinforcement learner. A number of hidden layers learn to map the simulation state $s_t$ to $x$, the mean and diagonal covariance of a multivariate normal distribution. The disturbance $x$ is then sampled from the distribution.
3.2 The network architecture of the recurrent DRL reinforcement learner. An LSTM learns to map the previous disturbance $x_t$ and the previous hidden state $h_t$ to the next hidden state $h_{t+1}$ and to $x$, the mean and diagonal covariance of a multivariate normal distribution. The disturbance $x_{t+1}$ is then sampled from the distribution.
3.3 The modular simulator implementation. The modules of the simulator can be easily swapped to test different scenarios, SUTs, or sensor configurations.
3.4 A comparison of the reinforcement learner methods. MCTS uses a seed to control a random number generator. DRL outputs a distribution, which is then sampled. Both of these methods produce disturbances.
3.5 The setup of the three crosswalk scenarios.
3.6 Pedestrian motion trajectories for each scenario and algorithm. The collision point is the point of contact between the vehicle and the pedestrian. In scenario 3, pedestrian 1 does not collide with the vehicle.
4.1 An example of a scenario class for the crosswalk example. The ego vehicle and the pedestrian both have a range of initial conditions for their position and velocity. A concrete scenario instantiation could be created by sampling specific initial condition values from the ranges.
4.2 Contrasting the new and old AST architectures. The new reinforcement learner uses a recurrent architecture and is able to generalize across a continuous space of initial conditions with a single trained instance. These improvements allow AST to be used on problems that would have previously been intractable.
4.3 The network architecture of the generalized recurrent DRL reinforcement learner. An LSTM learns to map the input—a concatenation of the previous disturbance $x_t$ and the simulator’s initial conditions $s_0$—and the previous hidden state $h_t$ to the next hidden state $h_{t+1}$ and to $x$, the mean and diagonal covariance of a multivariate normal distribution. The disturbance $x_{t+1}$ is then sampled from the distribution.
4.4 The crosswalk scenario class. To instantiate a concrete scenario, the initial conditions $s_{0,\text{ped}}$, $s_{0,\text{car}}$, $v_{0,\text{ped}}$, and $v_{0,\text{car}}$ are drawn from their respective ranges, defined in table 4.1.
4.5 The Mahalanobis distance of the most likely failure found at each iteration for both architectures. The conservative discrete architecture runs each of the discrete reinforcement learners in sequential order. The optimistic discrete architecture runs each of the discrete reinforcement learners in a single batch.
5.1 The path to the first reward in the Atari 2600 version of Montezuma’s Revenge [78]. The player must take the numbered steps in order, without dying, before getting the first key.
5.2 The layout of the crosswalk example scenario. The car approaches the road where a pedestrian is trying to cross. Initial conditions are shown, and values for $s_{0,\text{ped},y}$ can be found in table 5.1.
5.3 The reward of the most likely failure found at each iteration of the GE, DRL, and MCTS reinforcement learners in the easy scenario. Results are cropped to only show results when a failure was found.
5.4 The reward of the most likely failure found at each iteration of the GE and MCTS reinforcement learners in the medium scenario. The DRL reinforcement learner was unable to find a failure. Results are cropped to show results only when a failure was found.
5.5 The reward of the most likely failure found at each iteration of the GE reinforcement learner in the hard scenario. The DRL and MCTS reinforcement learners were unable to find a failure.
6.1 The reward of the most likely failure found at each iteration of the GE, DRL, and MCTS reinforcement learners in the easy scenario, as well as GE+BA, DRL+BA, and MCTS+BA. The dashed lines indicate the respective scores after robustification of each reinforcement learner. Results are cropped to show results only when a failure was found.
6.2 The reward of the most likely failure found at each iteration of the GE and MCTS reinforcement learners in the medium scenario, as well as GE+BA and MCTS+BA. The dashed lines indicate the respective scores after robustification of each reinforcement learner. The DRL reinforcement learner was unable to find a failure. Results are cropped to show results only when a failure was found.
6.3 The reward of the most likely failure found at each iteration of the GE reinforcement learner in the hard scenario, as well as GE+BA. The dashed line indicates the score after robustification of the GE reinforcement learner. The DRL and MCTS reinforcement learners were unable to find a failure.
7.1 Layout of the crosswalk example. A car approaches a crosswalk on a neighborhood road with one lane in each direction. A pedestrian is attempting to cross the street at the crosswalk. Initial conditions are shown.
7.2 Example rendering of an intersection from NVIDIA’s Drivesim simulator, an industry example of a high-fidelity simulator.
8.1 The AST Toolbox framework architecture. The core concepts of the method are shown, as well as their associated abstract classes. ASTEnv combines the simulator and reward function in a gym environment. The reinforcement learner is implemented using the garage package.
8.2 Layout of the cartpole environment. A control policy tries to keep the bar from falling over, or the cart from moving too far horizontally, by applying a control force to the cart [65].
8.3 Best return found up to each iteration. The value is averaged over 10 different trials. Both the MCTS and DRL reinforcement learners are able to find failures, but the DRL reinforcement learner is more computationally efficient.
8.4 Layout of the autonomous vehicle scenario. A vehicle approaches a crosswalk on a neighborhood road as a single pedestrian attempts to walk across. Initial conditions are shown.
8.5 Reward of the most likely failure found at each iteration. The Batch Max is the maximum per-iteration summed Mahalanobis distance. The Cumulative Max is the best Batch Max up to that iteration. The reinforcement learner finds the best solution by iteration 6 out of 80.
8.6 The average return at each iteration. The Batch Average is the average return from each trajectory in an iteration, while the Cumulative Max Average is the maximum Batch Average so far. The reinforcement learner is mostly converged by iteration 10, although there are slight improvements later. The large returns indicate that not every trajectory is ending in a collision.
8.7 The results of the automatic transmission case study, averaged over 10 trials. The DRL reinforcement learner is able to outperform both MCTS and a random search baseline.
Chapter 1
Introduction
Recent years have seen an explosion of work on solving real-world problems with autonomous systems.
Engineers have proposed using autonomous systems for applications such as delivering packages with
quadcopters, delivering food orders with robots, providing security, and growing food. Autonomous
systems are even being proposed in safety-critical domains, for example autonomous vehicles. In
order for autonomous systems to deliver on their promising potential across these domains, they
will need to be safe. Designing the systems to be safe is not enough; safety validation—the process
of proving that a system is as safe as it is purported or required to be—is essential. Unfortunately,
validating the safety of autonomous systems is also a serious challenge.
1.1 Challenges
For many autonomous systems, validation through real-world testing is infeasible. For example, it
has been estimated that a fleet of autonomous vehicles would have to drive over 5 billion miles to
demonstrate safety equivalent to that of commercial airplanes [1]. That estimate only gets worse
when we consider that the 5 billion miles must be representative of the types of driving the vehicles
will be expected to do in operation—testing only on sunny highways, for example, does not provide
any validation of how the car performs in a snowy mountain neighborhood. Worse, this testing might
have to be redone for every update in the self-driving vehicle’s software. Due to the infeasibility
of validation through real-world testing, many designers of autonomous systems are turning to
simulation.
Validation in simulation presents its own challenges, however. First, autonomous systems are
complex, and may include components like large deep-learning networks. As a consequence, it may
be impossible in many cases to use formal methods to prove safety guarantees. Second, many
autonomous systems act in continuous, high-dimensional state and action spaces. As a result, there
are far too many possible rollout trajectories to exhaustively simulate all possible outcomes. Third,
Figure 1.1: A general layout of the running crosswalk example. The ego vehicle approaches a crosswalk where a pedestrian is trying to cross. The ego vehicle estimates the pedestrian location based on noisy observations.
some safety-critical autonomous systems require high levels of safety. Due to this requirement, we
may be interested in finding rare or low-likelihood failures. The rarity of failures complicates the
use of approximate methods for validating autonomous systems. Fourth, in order to have confidence
in the results of validation, simulators must be highly accurate representations of the real world,
which may require features like simulated perception or software-in-the-loop (SiL) simulation. The
downside to these desirable features is computational cost—high-fidelity simulators are often slow
and expensive to run. Consequently, designers are often stuck with a tradeoff between the robustness
of validation and computational cost. Finally, autonomous systems are often large and complex,
and therefore access to internal state variables and processes may be limited and challenging to
implement. If testing is being performed by a third-party or government institution, internal state
access may also be impossible for legal reasons. Therefore, we want to be able to validate autonomous
systems in a way that requires limited or no access to the internal state of the system-under-test,
an approach known as black-box validation. To contextualize these challenges, we now present a
real-world example.
1.2 The Crosswalk—A Running Example
Consider the case of an autonomous vehicle approaching a crosswalk on a neighborhood road—a
scenario that will serve as a running example through this thesis. The basic crosswalk layout is
shown in fig. 1.1. The autonomous vehicle is approaching at 25 mph to 34 mph, and the pedestrian
is on the side of the road, near the crosswalk entrance. In order to make validation tractable, the
current approach might be to constrain the pedestrian to move in a straight line across the crosswalk,
as shown in fig. 1.2. Designers could then do a grid search over different pedestrian velocities and
establish the range of pedestrian velocities that their vehicle can safely handle. However, this
simplification is far too restrictive to guarantee safety.
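For concreteness, the naive sweep just described could be expressed in a few lines, sketched below under an assumed interface: `simulate` is a hypothetical black box that rolls out one crossing with the pedestrian walking straight across at a fixed speed and reports whether the vehicle avoided a collision.

```python
import numpy as np

def naive_grid_validation(simulate, velocities=np.linspace(0.5, 2.0, 16)):
    """Sweep straight-line pedestrian crossings over a grid of speeds (m/s)."""
    # Each rollout constrains the pedestrian to a straight line at speed v;
    # `simulate(v)` is assumed to return True when the vehicle avoids a collision.
    return {float(v): simulate(v) for v in velocities}
```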
In the real world, pedestrians can and will take a massively diverse range of trajectories while
Figure 1.2: A naive approach to validation. The pedestrian is constrained to cross the street in a straight line. Different pedestrian velocities, shown as different sized arrows, may be simulated.
Figure 1.3: A validation approach that captures the full variance of real-world possibilities. The pedestrian is unconstrained and can move in any direction, as shown by the blue circle.
crossing the street, as shown in fig. 1.3. While many of these trajectories are unlikely, they are not
so unlikely that we can neglect them completely. However, there are so many different instantiations
of unlikely pedestrian trajectories that we would never be able to simulate them all through brute
force methods. Furthermore, many failures are unlikely enough to make them hard to find, but
likely enough that—in aggregate—they can still result in a high likelihood of system failure, the so-called heavy-tail problem [2]. We need a way to approximately search the space of possible rollouts
to make validation tractable, while still preserving the black-box assumption introduced earlier. We
are therefore interested in the field of black-box safety validation.
1.3 Black-Box Safety Validation
The general field of safety validation contains numerous sub-fields. For example, the field of formal
verification focuses on extracting provable guarantees about a system. We are interested specifically
in the sub-field of black-box safety validation, which is the process of ensuring the safety of a system
Figure 1.4: The general validation problem. A model of the system under test interacts with an environment, both of which are contained in a simulator. An adversary perturbs the simulation with disturbances in an effort to force failures.
with limited or no access to the internal state of the system. In contrast, white-box safety validation
would be the process of ensuring the safety of a system with full access to the internal state of the
system.
The general problem setup for black-box validation is shown in fig. 1.4. A model $\mathcal{M}$ of the system under test (SUT) takes actions $a$ in, and receives observations $o$ from, the environment $\mathcal{E}$. Both the system and the environment are contained in a simulator. Simultaneously, an adversary $\mathcal{A}$ that might have access to the environment state $s$ outputs disturbances $x$. The disturbances control
the environment, and they are chosen by the adversary with the goal of forcing a failure in the SUT.
We briefly cover some of the existing methods of black-box validation (see Corso, Moss, Koren, et
al. [3] for an in-depth survey of black-box validation).
1.3.1 Optimization
One approach to efficiently finding failures is to define a cost function that measures the level of safety of the system over the duration of an environment rollout $(s_0, \ldots, s_T)$. Once a cost function is defined, the black-box validation problem can be solved as an optimization problem. While a cost function is often application-specific, much work has been done on creating cost functions using temporal logic expressions for a range of domains [4]–[11], and a range of algorithms have been applied to black-box validation optimization problems (a minimal sketch follows the list below):
• Simulated annealing is a stochastic global optimization method that can perform well on
problems with many local minima, and therefore has been shown to be effective for black-box
validation [12], [13].
• Genetic algorithms mimic a basic model of genetic evolution, and they have been used to solve
black-box validation problems [14], [15].
• Bayesian optimization methods select disturbances that are likely to lower the cost function
by building a probabilistic surrogate model of the cost function over the space of disturbance
trajectories. Bayesian optimization is designed to handle stochastic objective functions and
uncertainty and has been shown to perform well on validation problems [10], [16]–[20].
• Extended ant-colony optimization, a probabilistic technique for solving continuous optimization problems inspired by the way certain ant species leave pheromone traces while searching for food, has also been successfully used to perform black-box validation [21].
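As a rough illustration of the optimization framing (not a reproduction of any one cited method), the sketch below minimizes a hypothetical user-supplied `cost` function over a flattened disturbance trajectory using SciPy's simulated annealing routine:

```python
import numpy as np
from scipy.optimize import dual_annealing

def search_for_failure(cost, horizon, disturbance_dim, bound=3.0):
    """Minimize a safety cost over disturbance trajectories (sketch).

    `cost` is assumed to map a flattened disturbance trajectory x_{0:T-1}
    (length horizon * disturbance_dim) to a scalar safety measure over the
    induced rollout (s_0, ..., s_T); lower values indicate less safe rollouts.
    """
    bounds = [(-bound, bound)] * (horizon * disturbance_dim)
    result = dual_annealing(cost, bounds, seed=0)  # global stochastic search
    return result.x.reshape(horizon, disturbance_dim), result.fun
```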
1.3.2 Path Planning
Another approach to finding failures is to treat the problem as one of path planning: find the best trajectory through the state space, using disturbances as control inputs. The disturbance trajectory is sequentially built to reach $E$, the subset of the state space containing failures, from the initial state $s_0$. Rapidly-exploring random tree (RRT) is one of the most popular path planning
algorithms, and has been frequently applied to black-box safety validation [7], [22]–[29]. However,
other approaches like multiple shooting methods [30], [31] and Las Vegas tree search (LVTS) [8]
have also been explored.
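The sketch below gives the skeleton of an RRT-style search in this setting, under assumed interfaces: `simulator.replay(path)` deterministically replays a disturbance sequence and returns the resulting state, while `sample_state`, `sample_disturbance`, `distance`, and `in_failure_set` are hypothetical problem-specific helpers. It is a simplified variant that extends the nearest node with a random disturbance rather than the best control toward the sample:

```python
def rrt_failure_search(simulator, sample_state, sample_disturbance,
                       distance, in_failure_set, n_iters=10_000):
    """Minimal RRT over disturbance trajectories (sketch)."""
    tree = [(simulator.replay([]), [])]       # nodes: (state, disturbance path)
    for _ in range(n_iters):
        target = sample_state()               # random point in the state space
        _, path = min(tree, key=lambda node: distance(node[0], target))
        x = sample_disturbance()              # disturbance as a control input
        new_state = simulator.replay(path + [x])
        tree.append((new_state, path + [x]))
        if in_failure_set(new_state):         # reached the failure set E
            return path + [x]
    return None                               # no failure found within budget
```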
1.3.3 Importance Sampling
Many of the above methods are focused on finding failures in a system when it is hard to do so. A
different subset of black-box validation is interested in estimating the overall likelihood of system
failure. Importance sampling (IS) techniques are the most common approach to estimating the
failure probability of a system, and there is a broad library of existing work on how to best use IS
for black-box safety validation. In some cases, the extreme rarity of failures can prevent IS from
converging. In such cases, IS with adaptive sampling has been shown to successfully find failure
probabilities [32]–[36]. Validation problems can be high-dimensional; therefore, a non-parametric
version of IS that uses Markov chain Monte Carlo (MCMC) estimation to achieve better scalability
has been used to find failures in systems [37]. Another approach to finding failures is to combine
IS with sequential decision-making techniques to efficiently find the optimal importance sampling
policy for a specific state [38], [39].
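To make the core idea concrete (a toy example, not one of the cited adaptive methods), the sketch below estimates the rare-event probability $P(X > 4)$ for $X \sim \mathcal{N}(0, 1)$ by sampling from a proposal centered on the failure region and reweighting by the likelihood ratio:

```python
import numpy as np

def importance_sampling_estimate(n_samples=100_000, seed=0):
    """Estimate P(X > 4) for X ~ N(0, 1) using proposal q = N(4, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(4.0, 1.0, n_samples)         # sample from the proposal q
    # Importance weight w(x) = p(x) / q(x); the normalizing constants
    # cancel because both densities have unit variance.
    log_w = -0.5 * x**2 + 0.5 * (x - 4.0) ** 2
    return np.mean((x > 4.0) * np.exp(log_w))   # weighted failure indicator
```

Because nearly every proposal sample lands near the failure boundary, the estimator concentrates effort where naive Monte Carlo would almost never sample.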
1.3.4 Reinforcement Learning
Scalability is an issue with many of the above approaches to black-box safety validation, as we are
searching for failures in the space of disturbance trajectories, which scales exponentially with the
number of simulation timesteps. One approach, the one taken throughout this thesis, is to formulate
the problem of black-box safety validation as a Markov decision process (MDP) (see section 2.1.1).
Reinforcement learning (RL) techniques can then be used to find failures. The two most common
reinforcement learning approaches for finding failures are Monte Carlo tree search (MCTS) [40]–[44]
and deep reinforcement learning (DRL) [45]–[52]. In this thesis we will focus on an approach that
is capable of using both MCTS and DRL, or other RL algorithms, to find the most likely failure in
a system: adaptive stress testing (AST).
Figure 1.5: The AST methodology. The simulator is treated as a black box. The reinforcement learner interacts with the simulator through disturbances and receives a reward. Maximizing reward results in the most likely failure path.
1.4 Adaptive Stress Testing
In adaptive stress testing (AST), we formulate the problem of finding the most likely failure as a
Markov decision process [53]. Under this formulation, the problem can be approached with standard reinforcement learning (RL) techniques for validation. The process is shown in fig. 1.5. The reinforcement learner, which contains the RL agent, controls the simulation through disturbances.
The simulation should be deterministic with respect to the disturbances. The simulation updates
according to the disturbances and then outputs the likelihood of the timestep, and whether an event
of interest—in our application, a collision—occurred. The output from the simulator is then used
to calculate a reward, which is used by the reinforcement learner to improve the RL agent through
optimization. See chapter 2 for detailed coverage of AST.
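A minimal sketch of one such rollout is shown below; the simulator interface (`reset`, and `step` returning a per-timestep log-likelihood and event flag) and the reward shaping are hypothetical placeholders for the formulation detailed in chapter 2:

```python
def ast_rollout(simulator, learner):
    """Run one AST episode against a black-box simulator (sketch)."""
    state = simulator.reset()
    trajectory, done = [], False
    while not done:
        x = learner.act(state)                  # propose a disturbance
        state, log_likelihood, event, done = simulator.step(x)
        # Rewarding the log-likelihood of each disturbance means that
        # maximizing return favors the most likely path; reaching the
        # horizon without an event incurs a large penalty.
        reward = log_likelihood if (event or not done) else -1e4
        trajectory.append((x, reward))
    learner.update(trajectory)                  # any standard RL update
    return trajectory
```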
While vanilla AST provides a useful starting point, there are still unsolved issues that limit the cases in which we can use AST for validation. AST needs to be able to scale to high-dimensional problems, which may be an issue when using Monte Carlo tree search (MCTS). AST needs to be able to find failures on systems where finding failures may be difficult or require long search horizons.
AST needs to be efficient enough to work with high-fidelity simulations. Finally, AST needs to be
readily available for researchers and engineers to apply to their systems. The work presented in this
thesis focuses on addressing these unresolved issues.
1.5 Contributions
This thesis contributes to the field of approximate validation for autonomous systems by addressing
several limitations in AST. Contributions are highlighted in the chapters in which they are presented,
and chapter 9 summarizes the specific contributions made throughout this thesis. This section
provides a brief, high-level overview of the primary contributions. Figure 1.6 shows a graphical
representation of these contributions.
Figure 1.6: A graphical overview of the chapters in this thesis, organized by the category of the contribution in each chapter: AST utility (chapter 4, generalizing across initial conditions; chapter 6, robustification), scalability (chapter 3, scalable validation; chapter 5, heuristic rewards), and applicability (chapter 7, validation in high-fidelity; chapter 8, the AST Toolbox).
In order to be useful for validating autonomous systems, which often act in large, continuous
state and action spaces, AST reinforcement learners must be scalable. Monte Carlo tree search
may not provide the scalability needed for validating systems that act based on perception, like
autonomous vehicles. This thesis introduces a deep reinforcement learning (DRL) reinforcement
learner that is shown to have better performance and scalability. Furthermore, we show that with
slight modifications the recurrent neural-network architecture allows the DRL reinforcement learner
to generalize across initial conditions, adding an avenue to significant computational savings.
An issue for DRL approaches is their reliance on a consistent reward signal to guide them to their
goals during training. For validation tasks, heuristic rewards can be used to provide a useful signal
for the reinforcement learner by giving reward for intermediate steps that lead towards failures.
However, it may not always be possible or desirable to craft heuristic reward functions, and, without
the reward signal, DRL could struggle to find failures. MCTS may perform better, but can also
struggle on these types of hard-exploration domains if the trajectories get too long. This thesis
introduces a reinforcement learner based on go-explore [54], a state-of-the-art algorithm for hard-exploration problems, which can find failures when no heuristics are available and search horizons are long.
The accuracy required by validation tasks necessitates the use of high-fidelity (hifi) simulators
when moving away from real-world testing. However, hifi can be slow and expensive. While AST
allows us to search the space of possible rollouts, its use of RL means that finding a failure could still
take hundreds or thousands of iterations, which may be intractable for a hifi simulator. This thesis
presents a way of running AST in low-fidelity simulation, and then using the information acquired
to make running in hifi tractable.
As autonomous vehicles find widespread adoption in safety-critical applications, it is essential
for safety to become collaborative, not competitive. To facilitate this collaboration, safety validation
methods should be as open-source and transparent as possible. Towards that end, this thesis presents
the AST Toolbox, an open-source Python package that allows designers to easily apply AST to their
own system.
1.6 Overview
This thesis presents advancements in approximate methods for validating an autonomous system
in simulation. This chapter introduced the overall challenges of validation, explained how adaptive stress testing might alleviate those challenges, presented outstanding limitations in AST, and
outlined the contributions of this thesis. The remainder of this thesis proceeds as follows:
Chapter 2 provides a background on Adaptive Stress Testing. We show that the formulation of
the reward function provides the same optimal trajectory as our motivational optimization problem,
and we provide guidance on how to set certain reward parameters to achieve desirable behavior.
Chapter 2 also provides background on Markov decision processes and deep reinforcement learning.
Chapter 3 presents a new reinforcement learner for AST that uses deep reinforcement learning
(DRL). It is essential for validation methods to be scalable in order to be applicable to autonomous
systems that act in high-dimensional or continuous state and action spaces. The use of DRL improves
the scalability of AST, allowing it to be applied to such systems.
Chapter 4 provides a way to use a single run of AST to validate a system over a set of initial
conditions. Designers are often interested in a scenario class, which is defined by a parameter range.
Instead of having to run AST an infeasible number of times from different instantiations of the
scenario class, we show that AST can learn to generalize across the entire scenario class.
Chapter 5 presents a new reinforcement learner for AST that uses go-explore to handle long-horizon validation problems with no heuristic rewards. Heuristic rewards are domain-specific terms
in the reward function meant to guide RL agents to goal states. We show that the new reinforcement
learner can find failures even in cases where it is infeasible or undesirable to craft heuristic rewards.
Chapter 5 also presents background on go-explore.
Chapter 6 shows that the backward algorithm can be used to robustify the results of all existing
reinforcement learners. AST must search a massive space of possible simulation rollouts to find
failures, which can lead to significant variance in the final validation results. To avoid such variance,
which is unsafe, we show that the backward algorithm can be applied after training to get improved
results that are also more consistent. Chapter 6 also provides background on the backward algorithm.
Chapter 7 presents a way of using the backward algorithm to transfer failures found in low-fidelity
simulation to high-fidelity simulation. High-fidelity simulation is needed for validation because of its
accurate representation of the real world, but it is slow and expensive to run. We show that we can
first find failures in low-fidelity simulation, which is much cheaper to run, and then transfer them
to high-fidelity, significantly reducing the number of high-fidelity simulation steps needed.
Chapter 8 introduces the AST Toolbox, an open-source software toolbox that allows designers to
easily apply AST to their own systems. The Toolbox provides an environment that turns AST into
a standard OpenAI gym reinforcement learning environment. Off-the-shelf reinforcement learners,
like those provided by garage, can then be used to find failures. The designer needs only to write a
wrapper to interface with their simulator. Chapter 8 also covers existing software packages.
Chapter 9 concludes the thesis with a summary of the contributions and results as well as a brief
discussion of ideas for further research.
Chapter 2
Adaptive Stress Testing
When performing approximate validation of autonomous systems, we must find the right balance
between computational cost and thoroughness. This tradeoff is a focus within the field of black-box
safety validation. Black-box safety validation is a rich field with a variety of different approaches
(see section 1.3 for a brief overview of the existing approaches to black-box safety validation). This
thesis focuses in on a single approach, adaptive stress testing (AST), that has the unique feature
of returning the most likely failure of a system for a given scenario. As the rest of this thesis
contributes several advancements to validating autonomous systems based on AST, this chapter will
provide background on the latest formulation of AST.
2.1 Sequential Decision Making
2.1.1 Markov Decision Process
Adaptive stress testing (AST) frames the problem of finding the most likely failure as a Markov
decision process (MDP) [55]. In an MDP, an agent takes action $a$ while in state $s$ at each timestep. The agent may receive a reward from the environment according to the reward function $R(s, a)$. The agent then transitions to the next state $s'$ according to the transition probability $P(s' \mid a, s)$. Both the reward and transition functions may be deterministic or stochastic. The Markov assumption requires that the next state and reward be independent of the past history conditioned on the current state-action pair $(s, a)$. An agent's behavior is specified by a policy $\pi(s)$ that maps states to actions,
either stochastically or deterministically. An optimal policy is one that maximizes expected reward.
Reinforcement learning is one way to approximately optimize policies in large MDPs.
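As a small illustration of these definitions, the sketch below rolls out a policy in a generic MDP under a hypothetical interface, with `mdp.reward` implementing $R(s, a)$ and `mdp.transition` sampling $s' \sim P(s' \mid a, s)$:

```python
def rollout_return(mdp, policy, s0, horizon):
    """Accumulate the total reward of following a policy (sketch)."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)                  # pi(s): maps the state to an action
        total += mdp.reward(s, a)      # R(s, a), possibly stochastic
        s = mdp.transition(s, a)       # s' ~ P(s' | a, s); by the Markov
    return total                       # assumption, depends only on (s, a)
```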
2.1.2 Monte Carlo Tree Search
Monte Carlo tree search (MCTS) [56] has been successfully applied to a variety of problems, including
the game of Go, and has been demonstrated to perform well on large scale MDPs [57]. MCTS
incrementally builds a search tree where each node represents a state or action in the MDP. It
uses forward simulation to evaluate the return of state-action pairs. To balance exploration and
exploitation, each action in the tree is chosen according to its upper confidence bound (UCB)
evaluation:
$$a \leftarrow \underset{a}{\arg\max}\; Q(s, a) + c \sqrt{\frac{\log N(s)}{N(s, a)}} \tag{2.1}$$
where $Q(s, a)$ is the average return of choosing action $a$ at state $s$, $N(s)$ is the number of times that $s$ has been visited, $N(s, a)$ is the number of times that $a$ has been chosen as the next action at state $s$, and $c$ is a parameter that controls the exploration. UCB helps bias the search to focus on the most promising areas of the action space.
For problems with a large or continuous action space (as is common in the AST context), a
technique called double progressive widening (DPW) is used to control the branching factor of the
tree [58]. In DPW, the number of different actions tried at each state node, $|N(s, a)|$, is constrained by $|N(s, a)| < k N(s)^{\alpha}$. The parameters $k$ and $\alpha$ control the widening speed of the tree. Since the transition is deterministic in AST, no constraint on the number of different next states, $|N(s, a, s')|$, is needed [40].
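The sketch below shows how eq. (2.1) and the DPW criterion might combine during action selection; the node bookkeeping and the `sample_disturbance` helper are hypothetical, and disturbances are assumed hashable (e.g., tuples):

```python
import math

def select_action(node, sample_disturbance, c=1.0, k=1.0, alpha=0.5):
    """UCB action selection with double progressive widening (sketch).

    `node.n` holds N(s); `node.children` maps each tried action a to a
    dict with its mean return q = Q(s, a) and visit count n = N(s, a).
    """
    # DPW: admit a new action only while the child count is below k * N(s)^alpha.
    if len(node.children) < k * max(node.n, 1) ** alpha:
        a_new = sample_disturbance()
        node.children.setdefault(a_new, {"q": 0.0, "n": 0})

    def ucb(a):
        stats = node.children[a]
        if stats["n"] == 0:
            return math.inf                     # always try untested actions first
        return stats["q"] + c * math.sqrt(math.log(node.n) / stats["n"])

    return max(node.children, key=ucb)          # eq. (2.1)
```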
2.1.3 Deep Reinforcement Learning
In deep reinforcement learning (DRL), a policy is represented by a neural network [59]. Whereas
a feed-forward neural network maps an input to an output, we use a recurrent neural network
(RNN), which maps an input and a hidden state from the previous timestep to an output and an
updated hidden state. An RNN is naturally suited to sequential data due to the hidden state, which
is a learned latent representation of the current state. RNNs suffer from exploding or vanishing
gradients, a problem addressed by variations such as long short-term memory (LSTM) [60] or gated
recurrent unit (GRU) [61] networks.
There are many different algorithms for optimizing a policy network, proximal policy optimization
(PPO) [62] being one of the most popular. PPO is a policy-gradient method that updates
the network parameters to maximize a surrogate objective. Improvement in a policy is measured by an
advantage function, which can be estimated for a batch of rollout trajectories using methods such as
generalized advantage estimation (GAE) [63]. However, variance in the advantage estimate can lead
to poor performance if the policy changes too much in a single step. To prevent such a step leading
to a collapse in training, PPO can limit the step size in two ways: 1) by incorporating a penalty
proportional to the KL-divergence between the new and old policies or 2) by clipping the probability
ratio between the new and old policies, which bounds the surrogate objective when the policies
diverge too far.
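As an illustration of the clipping mechanism, the following is a minimal PyTorch sketch of PPO's clipped surrogate loss; the tensor names and the clip range of 0.2 are illustrative assumptions, not values used in this thesis.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped PPO surrogate loss (to be minimized).

    Clipping the probability ratio r = pi_new / pi_old to
    [1 - epsilon, 1 + epsilon] bounds how much a single update can
    change the policy.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the pessimistic (minimum) objective, then negate for descent.
    return -torch.min(unclipped, clipped).mean()
```

In practice, either the clipping variant or the KL-penalty variant can be used; the clipped form is often preferred because it avoids tuning a penalty coefficient.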
2.2 Preliminaries
A safety validation problem consists of a system under test (SUT), represented by a SUT model $\mathcal{M}$ with state $\mu \in M$, that is acting in an environment $\mathcal{E}$, as shown in fig. 1.4. The safety validation
problem evolves over a discrete time range t ∈ [0, . . . , tend], where tend ≤ tmax for some horizon
tmax. We use a subscript to denote a variable at time t (e.g., the SUT state at time t is µt) and
a subscript colon range to denote a sequence of a variable over a range of timesteps (e.g., the SUT
state path from time 0 to time t is µ0:t = [µ0, . . . , µt]). The SUT receives an observation o ∈ O from
the environment—which depends on the environment state z ∈ Z—and then takes action a ∈ A.
The SUT state and action depend on the environment observations:
\[
\mu_{t+1}, a_t = \mathcal{M}(o_{0:t}) \tag{2.2}
\]
The SUT and environment are both contained by a simulator S with state s ∈ S, where the
simulator state is the stacked SUT and environment states [µ, z]. An adversary A also interacts
with the simulator by producing disturbances x ∈ X. A disturbance can take many forms (see
section 2.6), but the disturbance vector must control all stochasticity within the environment. The
disturbance can affect both the environment state and the observation:
\[
z_{t+1}, o_{t+1} = \mathcal{E}(a_{0:t}, x_{0:t}) \tag{2.3}
\]
Since we assume that the disturbance controls all stochasticity in the environment, eq. (2.2) and
eq. (2.3) together mean that the simulator state is determined by the disturbance:
\[
s_{t+1} = \mathcal{S}(x_{0:t}) \tag{2.4}
\]
We assume disturbances are independent across time and distributed with known probability density
p (x | s). The disturbance model could be learned from data or generated from expert knowledge.
Finally, we define an event space E ⊂ S where an event of interest occurs. While this event space
can be arbitrarily defined, for validation tasks we focus on failure events. A trajectory is said to be
a failure when $s_{t_{end}} \in E$.
2.3 Problem Formulation
Finding the most likely failure of a system is a sequential decision-making problem. Given an event
space $E$, we want to find the most likely simulator path $(x_{0:t_{end}-1}, s_{0:t_{end}})$ that ends in the event
space by controlling the adversary's disturbances:
\[
\begin{aligned}
\underset{x_0, \dots, x_{t_{end}-1}}{\text{maximize}} \quad & P(s_0, x_0, \dots, x_{t_{end}-1}, s_{t_{end}}) \\
\text{subject to} \quad & s_{t_{end}} \in E
\end{aligned} \tag{2.5}
\]
where $P(s_0, x_0, \dots, x_{t_{end}-1}, s_{t_{end}})$ is the probability of a path in simulator $\mathcal{S}$.

Figure 2.1: The AST methodology. The simulator is treated as a black box. The reinforcement
learner interacts with the simulator through disturbances and receives a reward. Maximizing reward
results in the most likely failure path.
Because we assume the simulator is Markovian,
\[
P(s_0, x_0, \dots, x_{t_{end}-1}, s_{t_{end}}) = P(s_0) \prod_{t=0}^{t_{end}-1} P(s_{t+1} \mid x_t, s_t)\, P(x_t \mid s_t) \tag{2.6}
\]
By definition, the simulator is deterministic with respect to xt, so P (st+1 | xt, st) = 1, and we
assume P (s0) = 1, and therefore eq. (2.5) becomes
\[
\begin{aligned}
\underset{x_0, \dots, x_{t_{end}-1}}{\text{maximize}} \quad & \prod_{t=0}^{t_{end}-1} P(x_t \mid s_t) \\
\text{subject to} \quad & s_{t_{end}} \in E
\end{aligned} \tag{2.7}
\]
2.4 Reinforcement Learning
AST solves the optimization problem in eq. (2.7) through reinforcement learning (RL) by letting
an RL agent A act as the adversary. The general process is shown in fig. 2.1. The reinforcement
learner passes the disturbance to the simulator, which may be treated by the reinforcement learner
as a black box. The simulator uses the disturbance to update the environment and the SUT. The
simulator returns a reward and some simulation information (see section 2.5). If the simulator is
treated as a black box, the simulation information may merely be an indicator of whether a trajectory
has ended in failure. If the simulator is not fully treated as a black box, the simulation information
may include part or all of the simulation state (or heuristics that depend on it), depending on how
much of the simulation state is exposed. Through repeated interactions with the simulator, the
reinforcement learner is optimized to choose disturbances that maximize reward. The process will
therefore return the most likely failure path when the reward function assigns higher rewards to
failure events and to higher-likelihood transitions.
2.4.1 Reward Function
In order to find the most likely failure, the agent tries to maximize the expected sum of rewards,
\[
\mathbb{E}\left[ \sum_{t=0}^{t_{max}} R(s_t, a_t) \right], \tag{2.8}
\]
where the reward function must be structured as follows:
\[
R(s_t, x_{t-1}, s_{t-1}) = h(s_t, x_{t-1}, s_{t-1}) +
\begin{cases}
R_E & s_t \in E \\
-R_{\bar{E}} & s_t \notin E,\ t = t_{max} \\
\rho_t & s_t \notin E,\ t < t_{max}
\end{cases} \tag{2.9}
\]
where the parameters are
• $h(s_t, x_{t-1}, s_{t-1})$: An optional training heuristic given at each timestep of the form $h(s_t, x_{t-1}, s_{t-1}) = \Phi(s_t) - \Phi(s_{t-1})$. When $\Phi$ is a potential function that smoothly measures the closeness to failure, $h(s_t, x_{t-1}, s_{t-1})$ is a difference of potential functions and will not change the optimal trajectory [64].

• $R_E$: A reward for trajectories that end in the event space $E$.

• $R_{\bar{E}}$: A penalty for trajectories that do not end in the event space $E$.

• $\rho_t$: The action likelihood reward. For direct disturbance control, $\rho_t = \log P(x_t \mid s_t)$. For seed-disturbance control, $\rho_t = \log P(x_t \mid s_t; \bar{x})$, where $\bar{x}$ is the pseudorandom seed output by the reinforcement learner (see section 2.5 for the differences between direct disturbance and seed-disturbance control, and for when each version is appropriate). In practice, $\rho_t$ may be replaced by a reward proportional to the log-probabilities.
Within eq. (2.9), there are three cases:

• $s_t \in E$: The trajectory has terminated because an event has been found. This is the goal, so the AST agent receives a reward.

• $s_t \notin E$, $t = t_{max}$: The trajectory has terminated by reaching the horizon $t_{max}$ without reaching an event. This is the least useful outcome, so the AST agent receives a penalty.

• $s_t \notin E$, $t < t_{max}$: The trajectory has not terminated, which is the case at most timesteps. The reward is generally the log-likelihood of the disturbance (a nonpositive quantity), which promotes likely actions.
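To make the structure concrete, the following Python sketch implements the reward cases of eq. (2.9); the argument names and the placeholder magnitudes for $R_E$ and $R_{\bar{E}}$ are illustrative assumptions, not values used elsewhere in this thesis.

```python
def ast_reward(in_event, at_horizon, log_prob_x, phi_t, phi_prev,
               r_event=1e4, r_no_event=1e4):
    """Reward following the structure of eq. (2.9).

    in_event:   whether s_t is in the event space E
    at_horizon: whether t == t_max
    log_prob_x: log P(x_t | s_t), the disturbance likelihood rho_t
    phi_t, phi_prev: potential function values for the heuristic term
    r_event, r_no_event: placeholder magnitudes for R_E and R_Ebar
    """
    heuristic = phi_t - phi_prev  # h = Phi(s_t) - Phi(s_{t-1})
    if in_event:
        return heuristic + r_event       # reward for reaching a failure
    if at_horizon:
        return heuristic - r_no_event    # penalty for running out of time
    return heuristic + log_prob_x        # promote likely disturbances
```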
2.4.2 Proofs of Desirable Properties
Under certain conditions, we can guarantee some desirable properties of the RL approach. For the
purposes of proving the propositions below, we will briefly introduce some common notation. Consider
the set of all possible failure trajectories $T_E$ and the set of all possible non-failure trajectories $T_{\bar{E}}$,
where $T_E \cap T_{\bar{E}} = \emptyset$ and $T_E \cup T_{\bar{E}} = T$, with $T$ being the set of all possible trajectories. Let
$\tau_x = (s_{0:t_{end}}, x_{0:t_{end}-1})$ be a trajectory in some set of trajectories, $\tau_x \in T_x$, where $\tau_E$ is a trajectory
that ends in failure ($s_{t_{end}} \in E$), so $\tau_E \in T_E$, and $\tau_{\bar{E}}$ is a trajectory that does not end in a failure
($s_{t_{end}} \notin E$), so $\tau_{\bar{E}} \in T_{\bar{E}}$. We have already defined $\rho_t = \log P(x_t \mid s_t)$, and we will further define the
likelihood of a trajectory as
likelihood of a trajectory as
\[
\rho_\tau = \sum_{(s_t, x_t) \in \tau} \log P(x_t \mid s_t) \tag{2.10}
\]
We will denote the most likely trajectory in a set with a star, such that $\forall \tau_x \in T_x,\ \rho_{\tau_x^*} \geq \rho_{\tau_x}$. We
will denote the least likely trajectory in a set with a prime, such that $\forall \tau_x \in T_x,\ \rho_{\tau_x} \geq \rho_{\tau_x'}$. We will
denote the trajectory that is the most likely of the trajectories that end closest to a failure with a
dagger, such that $\forall \tau_x \in T_x,\ \Phi(s_{t_{end}}^\dagger) \geq \Phi(s_{t_{end}})$, and if $\Phi(s_{t_{end}}^\dagger) = \Phi(s_{t_{end}})$, then $\rho^\dagger_{\bar{E}} > \rho_{\bar{E}}$, where
$\Phi(s)$ is a potential function that is a smooth measure of closeness to failure. We will also consider
the total sum of rewards that a trajectory receives from eq. (2.9):
\[
\begin{aligned}
G_\tau &= \sum_{(s_t, x_t) \in \tau} R(s_t, x_{t-1}, s_{t-1}) & \text{(2.11)} \\
&= R_E \mathbb{1}\{s_{t_{end}} \in E\} - R_{\bar{E}} \mathbb{1}\{s_{t_{end}} \notin E,\ t_{end} = t_{max}\} + \rho_\tau & \text{(2.12)}
\end{aligned}
\]
The most important property of the RL approach is that the optimal trajectory is the same as
that for eq. (2.7), which is to say that by maximizing eq. (2.8) we actually find the most likely failure.
Equation (2.9) shows that, of any two failure paths, the likelier path will receive a higher reward.
Similarly, of any two non-failure paths, the likelier path will receive a higher reward. Therefore, to
show that our RL approach yields the same trajectory as eq. (2.7), we only need to show that the
most likely failure path yields a higher reward than the most likely non-failure path.
Proposition 2.4.1. Consider the most likely trajectory that ends in failure, $\tau_E^*$. Let $\rho^*_{min} > -\rho_{\tau_E^*}$.
If $(R_E + R_{\bar{E}}) \geq \rho^*_{min}$, then $\tau_E^*$ is also the trajectory that maximizes eq. (2.9).
Proof. Because $\tau_E^* \in T_E$ and $\tau_{\bar{E}}^* \in T_{\bar{E}}$, $G_{\tau_E^*} = \rho_{\tau_E^*} + R_E$ and $G_{\tau_{\bar{E}}^*} = \rho_{\tau_{\bar{E}}^*} - R_{\bar{E}}$. We want to show
that the most likely failure receives a higher reward than the most likely non-failure, so
\[
\begin{aligned}
G_{\tau_E^*} &> G_{\tau_{\bar{E}}^*} & \text{(2.13)} \\
\rho_{\tau_E^*} + R_E &> \rho_{\tau_{\bar{E}}^*} - R_{\bar{E}} & \text{(2.14)} \\
R_E + R_{\bar{E}} &> \rho_{\tau_{\bar{E}}^*} - \rho_{\tau_E^*} & \text{(2.15)}
\end{aligned}
\]
$\rho_{\tau_{\bar{E}}^*}$ is a sum of log-probabilities, and therefore we know $\rho_{\tau_{\bar{E}}^*} \leq 0$, so $-\rho_{\tau_E^*} \geq \rho_{\tau_{\bar{E}}^*} - \rho_{\tau_E^*}$. In addition,
by definition $\rho^*_{min} > -\rho_{\tau_E^*}$ and, consequently, if $(R_E + R_{\bar{E}}) \geq \rho^*_{min}$, then we can easily show that
\[
\begin{aligned}
R_E + R_{\bar{E}} &\geq \rho^*_{min} & \text{(2.16)} \\
&> -\rho_{\tau_E^*} & \text{(2.17)} \\
&\geq \rho_{\tau_{\bar{E}}^*} - \rho_{\tau_E^*} & \text{(2.18)}
\end{aligned}
\]
Therefore, if $(R_E + R_{\bar{E}}) \geq \rho^*_{min}$, then $G_{\tau_E^*} > G_{\tau_{\bar{E}}^*}$.
There are two difficulties with applying the above proof in practice. The first difficulty is that
$\rho_{\tau_E^*}$ is not known ahead of time. However, $\rho^*_{min}$ can be set to a large enough number to ensure
that $\rho^*_{min} > -\rho_{\tau_E^*}$. The second, and more problematic, difficulty is that AST provides only an
approximate solution to the RL problem. The approximate solution might have converged to a local
optimum that is not the global optimum, in which case the failure found would have a lower
probability than that of the most likely failure $\tau_E^*$. In such cases, we would prefer the local optimum
to still be a failure, even if that failure is less likely than the most likely failure. With a slight
variation of proposition 2.4.1, we can show that, under certain conditions, all trajectories that end
in failure will result in a higher reward than any trajectory that does not end in failure.
Proposition 2.4.2. Consider the least likely trajectory that ends in failure, $\tau_E'$, and the most likely
trajectory that does not end in failure, $\tau_{\bar{E}}^*$. Let $\rho'_{min} > -\rho_{\tau_E'}$. If $(R_E + R_{\bar{E}}) \geq \rho'_{min}$, then eq. (2.9)
yields a higher reward for $\tau_E'$ than for $\tau_{\bar{E}}^*$.
Proof. Note that we can ignore the heuristic reward term without loss of generality (see corollary 2.4.1).
We want to show that the least likely failure will receive a higher reward than the most
likely non-failure, so
\[
\begin{aligned}
G_{\tau_E'} &> G_{\tau_{\bar{E}}^*} & \text{(2.19)} \\
\rho_{\tau_E'} + R_E &> \rho_{\tau_{\bar{E}}^*} - R_{\bar{E}} & \text{(2.20)} \\
R_E + R_{\bar{E}} &> \rho_{\tau_{\bar{E}}^*} - \rho_{\tau_E'} & \text{(2.21)}
\end{aligned}
\]
Because $\rho_{\tau_{\bar{E}}^*}$ is a sum of log-probabilities, we know $\rho_{\tau_{\bar{E}}^*} \leq 0$, so $-\rho_{\tau_E'} \geq \rho_{\tau_{\bar{E}}^*} - \rho_{\tau_E'}$. In addition, by
definition $\rho'_{min} > -\rho_{\tau_E'}$ and, consequently, if $(R_E + R_{\bar{E}}) \geq \rho'_{min}$, then we can easily show that
\[
\begin{aligned}
R_E + R_{\bar{E}} &\geq \rho'_{min} & \text{(2.22)} \\
&> -\rho_{\tau_E'} & \text{(2.23)} \\
&\geq \rho_{\tau_{\bar{E}}^*} - \rho_{\tau_E'} & \text{(2.24)}
\end{aligned}
\]
Therefore, if $(R_E + R_{\bar{E}}) \geq \rho'_{min}$, then $G_{\tau_E'} > G_{\tau_{\bar{E}}^*}$.
We have shown that, under these conditions, any failure trajectory receives a higher reward than any
non-failure trajectory, so the approximate solution to the RL problem will be the most likely failure
found. However, in practice we still have the difficulty of not knowing $\rho_{\tau_E'}$ ahead of time. We can
overcome this difficulty by setting a minimum threshold for the likelihood of failures that we are
interested in and considering any trajectory that is less likely than the threshold to be a non-failure.
This threshold is then a lower bound on $\rho_{\tau_E'}$, and $\rho'_{min}$ can be set accordingly.
Both of the above proofs ignored the heuristic terms h (st, xt−1, st−1), but it is easy to show that
the results still hold when a heuristic reward is used.
Corollary 2.4.1. If $h(s_t, x_{t-1}, s_{t-1}) = \Phi(s_t) - \Phi(s_{t-1})$, where $\Phi$ is a smooth measure of closeness to
failure, then proposition 2.4.1 and proposition 2.4.2 remain true when using a non-zero heuristic reward.
Proof. In eq. (2.16) and eq. (2.22), both proofs arrive at an inequality of the form
\[
R_E + R_{\bar{E}} \geq \rho \tag{2.25}
\]
where $\rho = \rho^*_{min}$ and $\rho = \rho'_{min}$, respectively. For the purpose of this proof we can ignore the specific $\rho$
terms and note that including a heuristic reward in either proof would have resulted in an inequality
of the form
\[
R_E + R_{\bar{E}} \geq \rho + \Phi(s_{\bar{E}}) - \Phi(s_E) \tag{2.26}
\]
Because $\Phi$ is a measure of closeness to failure, $\Phi(s_E) > \Phi(s_{\bar{E}})$, and, therefore, $\Phi(s_{\bar{E}}) - \Phi(s_E) < 0$.
Consequently, if $R_E$ and $R_{\bar{E}}$ are set such that eq. (2.25) is true, then eq. (2.26) is also true.
There are also situations in which, if we do not find a failure, we would like to return the
trajectory that was closest to being a failure. We can show that, under certain conditions, when we
do not find a failure, the trajectory that is closest to being a failure will receive the highest reward
(and if there are multiple trajectories that are equally close to failure, the most likely one will receive
the highest reward).
Proposition 2.4.3. Assume that the heuristic reward is $h(s_t, x_{t-1}, s_{t-1}) = \Phi(s_t) - \Phi(s_{t-1})$, where
$\Phi(s)$ is a potential function that is a smooth measure of closeness to failure. Further assume that
there are no trajectories that end in failure, so $T_E = \emptyset$ and $T_{\bar{E}} = T$. Consider the most likely of the
trajectories that end closest to failure, $\tau^\dagger_{\bar{E}}$. If $\Phi(s_{t_{end}}^\dagger) - \Phi(s_{t_{end}}) > -\rho^\dagger_{\bar{E}}$, then $\tau^\dagger_{\bar{E}}$ is the trajectory
in $T$ that maximizes eq. (2.9).
Proof. We want to show that $\tau^\dagger_{\bar{E}}$ receives a greater reward than any other trajectory, so
\[
\begin{aligned}
G_{\tau^\dagger_{\bar{E}}} &> G_{\tau_{\bar{E}}} & \text{(2.27)} \\
\rho^\dagger_{\bar{E}} - R_{\bar{E}} + \Phi(s_{t_{end}}^\dagger) - \Phi(s_0) &> \rho_{\bar{E}} - R_{\bar{E}} + \Phi(s_{t_{end}}) - \Phi(s_0) & \text{(2.28)} \\
\rho^\dagger_{\bar{E}} + \Phi(s_{t_{end}}^\dagger) &> \rho_{\bar{E}} + \Phi(s_{t_{end}}) & \text{(2.29)} \\
\Phi(s_{t_{end}}^\dagger) - \Phi(s_{t_{end}}) &> \rho_{\bar{E}} - \rho^\dagger_{\bar{E}} & \text{(2.30)}
\end{aligned}
\]
Note that in eq. (2.28), all the middle $\Phi(s_{1:t_{end}-1})$ terms cancel out because the heuristic forms a telescoping
sum. Because $\rho_{\bar{E}}$ is a sum of log-probabilities, we know $\rho_{\bar{E}} \leq 0$, and therefore
\[
-\rho^\dagger_{\bar{E}} \geq \rho_{\bar{E}} - \rho^\dagger_{\bar{E}} \tag{2.31}
\]
Consequently, $G_{\tau^\dagger_{\bar{E}}} > G_{\tau_{\bar{E}}}$ when
\[
\Phi(s_{t_{end}}^\dagger) - \Phi(s_{t_{end}}) > -\rho^\dagger_{\bar{E}} \tag{2.32}
\]
In the case in which $\Phi(s_{t_{end}}^\dagger) = \Phi(s_{t_{end}})$, we must have $\rho_{\bar{E}} - \rho^\dagger_{\bar{E}} < 0$, which is the case if, and only
if, $\tau^\dagger_{\bar{E}}$ is the more likely of the two trajectories.
As with the previous propositions, a difficulty arises in practice in that we do not know $-\rho^\dagger_{\bar{E}}$ ahead
of time. However, we can still use proposition 2.4.3 as a guideline for tuning the heuristic reward
in eq. (2.9). Our result shows that the gain in $\Phi$ from ending closer to a failure must be greater
than the magnitude of the trajectory's log-likelihood reward. As a consequence, we can see that the
more we care about returning the trajectory that is closest to failure, the more we should scale $\Phi$.
2.5 Methodology
The AST approach treats both the system under test and the simulator itself as a black box.
However, the reinforcement learner does need the following three access functions in order to interact
with the simulator:
• Initialize(S, s0): Resets S to a given initial state s0.
• Step(S, E, x): Steps the simulation in time by drawing the next state s′ based on disturbance x. The function returns ρ, the log-probability of the transition, and an indicator showing whether or not s′ is in E. The function may return additional simulation information for the reward function to use if the simulation state is partially or fully exposed. The simulation information may include the exposed portions of the simulation state itself, metrics based on the exposed portions of the simulation state, or a combination of both.

• IsTerminal(S, E): Returns true if the current state of the simulation is in E or if the horizon of the simulation tmax has been reached.

Figure 2.2: The AST process using direct disturbance control. The reinforcement learner controls
the simulator directly with disturbances, which are used by the reward function to calculate reward.
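In practice, these access functions might be wrapped in an interface such as the following sketch. The wrapped simulator object and its method names are hypothetical; only the three-function contract comes from the list above.

```python
class ASTSimulatorInterface:
    """Minimal wrapper exposing the three access functions AST needs.

    The wrapped `sim` object and its `reset`, `advance`, and `state`
    members are illustrative stand-ins, not a real API.
    """

    def __init__(self, sim, event_space, t_max):
        self.sim = sim
        self.event_space = event_space
        self.t_max = t_max
        self.t = 0

    def initialize(self, s0):
        """Reset the simulator S to the initial state s0."""
        self.t = 0
        self.sim.reset(s0)

    def step(self, x):
        """Advance one timestep under disturbance x.

        Returns the transition log-probability rho, an event indicator,
        and any exposed simulation information.
        """
        self.t += 1
        log_prob, info = self.sim.advance(x)
        return log_prob, self.sim.state in self.event_space, info

    def is_terminal(self):
        """True when an event occurred or the horizon t_max was reached."""
        return self.sim.state in self.event_space or self.t >= self.t_max
```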
Depending on the simulator design, AST has two different ways to control the simulation rollouts:
direct disturbance control and seed-disturbance control.
2.5.1 Direct Disturbance Control
Under direct disturbance control, the reinforcement learner directly outputs a disturbance vector that
is used by the simulator when updating to the next timestep, as shown in fig. 2.2. The disturbance is
used by the reward function, with any additional simulation information, to determine the reward.
At each timestep, the disturbance output may depend only on the previous disturbance, or it may
depend on simulation information if some of the simulator state is exposed.
2.5.2 Seed-Disturbance Control
In some simulators, it may not be feasible to allow an outside adversary to directly control the update
of a simulator. Under seed-disturbance control, AST instead controls stochasticity at each timestep
by setting the global random seed $\bar{x}$. When stochastic elements of the simulator are generated from
pseudorandom number generators, the outcome is determined by the global random seed. Therefore,
the disturbance $x$ is fully determined by $\bar{x}$, and the simulator is still deterministic with respect to
the reinforcement learner's output. The seed-disturbance control process is shown in fig. 2.3.
Figure 2.3: The AST process using seed-disturbance control. The reinforcement learner controls the
simulator by outputting a seed for the random number generators to use. The reward function uses
the transition likelihoods from the simulator to calculate reward.
The reward function cannot determine a reward directly from the seed $\bar{x}$ and must rely instead on the
simulator to provide it with the likelihood of the transition to the current simulator state.
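The defining property of seed-disturbance control, namely that an entire rollout is a deterministic function of the seed sequence, can be illustrated with a short sketch. The simulator methods here are hypothetical; only the per-timestep seeding pattern is the point.

```python
import numpy as np

def rollout_from_seeds(sim, seeds):
    """Replay a trajectory from a sequence of pseudorandom seeds.

    Because all stochasticity in the simulator is drawn from its
    pseudorandom number generators, fixing the seed at every timestep
    fixes the disturbances, so an identical seed sequence always
    reproduces an identical trajectory.
    """
    sim.reset()
    total_log_prob, event = 0.0, False
    for seed in seeds:
        np.random.seed(seed)             # set the global random seed
        log_prob, event = sim.advance()  # disturbances drawn internally
        total_log_prob += log_prob
        if event:
            break
    return total_log_prob, event
```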
2.6 Case Studies
We present three case studies in which an autonomous system needs to be validated. For each
scenario, we provide an example of how it could be formulated as an AST problem.
2.6.1 Cartpole with Disturbances
Problem
Cartpole is a classic test environment for continuous control algorithms [66]. The system under
test (SUT) is a neural network control policy trained by TRPO. The control policy controls the
horizontal force $\vec{F}$ applied to the cart, and the goal is to prevent the bar on top of the cart from
falling over.
The cartpole scenario from [67] is shown in fig. 2.4. The state $s = [x, \dot{x}, \theta, \dot{\theta}]$ represents the cart's
horizontal position and speed as well as the bar's angle and angular velocity. A failure occurs when
$|x| > x_{max}$ or $|\theta| > \theta_{max}$. The initial state is $s_0 = [0, 0, 0, 0]$.
Formulation
We define an event as the pole reaching some maximum rotation or the cart reaching some maximum
horizontal distance from the start position. The disturbance is $\delta \vec{F}$, the disturbance force applied to
the cart at each timestep. The reward function uses $R_E = 1\times10^4$ and $R_{\bar{E}} = 0$. There is also a
heuristic reward where $\Phi$ is 1000 times the normalized distance of the final state to the nearest failure
state. The choice of $\Phi$ encourages the reinforcement learner to push the SUT closer to failure. The
disturbance likelihood reward $\rho$ is set to the log of the probability density function of the natural
disturbance force distribution. See Koren, Ma, Corso, et al. [67].

Figure 2.4: Layout of the cartpole environment. A control policy tries to keep the bar from falling
over or the cart from moving too far horizontally by applying a control force to the cart [65].
Figure 2.5: Layout of the autonomous vehicle scenario. A vehicle approaches a crosswalk on a
neighborhood road as a single pedestrian attempts to walk across. Initial conditions are shown:
the ego vehicle starts at $(-35, 0)$ m with speed 11.2 m/s, and the pedestrian starts at $(0.0, -2)$ m
with speed 1.0 m/s.
2.6.2 Autonomous Vehicle at a Crosswalk
Problem
Autonomous vehicles must be able to safely interact with pedestrians. Consider an autonomous
vehicle approaching a crosswalk on a neighborhood road. There is a single pedestrian who is free to
move in any direction. The autonomous vehicle has imperfect sensors.
The autonomous vehicle scenario from Koren, Alsaif, Lee, et al. [46] is shown in fig. 2.5. The x-axis
is aligned with the center of the SUT's lane, with East being the positive x-direction. The y-axis is
aligned with the center of the crosswalk, with North being the positive y-direction. The pedestrian
is crossing from South to North. The simulator state $s = [s_{car,x}, s_{car,y}, s_{ped,x}, s_{ped,y}, v_{car,x}, v_{car,y}, v_{ped,x}, v_{ped,y}]$
represents the x and y position and velocity of both the vehicle and the pedestrian. The vehicle also
observes a noisy vector o = [srel,x, srel,y, vrel,x, vrel,y], which represents the position and velocity
of the pedestrian relative to the vehicle. The vehicle starts 35 m back from the crosswalk, with an
initial velocity of 11.2 m/s East. The pedestrian starts 2 m back from the edge of the road, with an
initial velocity of 1 m/s North. The autonomous vehicle policy is a modified version of the intelligent
driver model [68].
Formulation
We define an event as an overlap between the car and the pedestrian, which occurs when
$|s_{car,x} - s_{ped,x}| \leq 2.5$ and $|s_{car,y} - s_{ped,y}| \leq 1.4$. The disturbance vector controls both the motion of
the pedestrian and the scale and direction of the sensor noise. The reward function for this scenario
uses a penalty of $-1\times10^5$ for trajectories that do not end in failure and $R_E = 0$. There is also a
heuristic reward with $\Phi = 10000 \cdot \text{dist}(p_v, p_p)$, where $\text{dist}(p_v, p_p)$ is the distance between the
pedestrian and the SUT. This heuristic encourages the reinforcement learner to move the pedestrian
closer to the car in early iterations, which can significantly increase training speed. The reward
function also uses $\rho = M(x, \mu_x \mid s)$, the Mahalanobis distance function [69]. The Mahalanobis
distance is a generalization of distance to the mean for multivariate distributions. Using the
Mahalanobis distance results in a reward that is still proportional to the likelihood of a trajectory
but handles very small probabilities without exploding towards negative infinity. See Koren, Alsaif,
Lee, et al. [46].
2.6.3 Aircraft Collision Avoidance Software
Problem
The next-generation Airborne Collision Avoidance System (ACAS X) [70] gives instructions to pilots
to avoid collisions. We want to identify system failures in simulation to ensure the system is robust
enough to replace the Traffic Alert and Collision Avoidance System (TCAS) [71]. We are interested
in a number of different scenarios in which two or three aircraft are in the same airspace.
Formulation
The event will be a near mid-air collision (NMAC), which occurs when two planes pass within 100 vertical
feet and 500 horizontal feet of each other. The simulator is quite complicated, involving sensor,
aircraft, and pilot models. Consequently, it is too difficult to define or access the full simulator state.
Therefore, instead of trying to control the simulation state explicitly, we will use seed-disturbance
control, so the reinforcement learner will output seeds to the random number generators in the
simulator. The reward function for this scenario uses a penalty of $-1\times10^5$ for trajectories that do not
end in failure, $R_E = 0$, and a heuristic reward equivalent to the negative miss distance at the end of
the trajectory. While this heuristic is not technically of the form $h(s_t, x_{t-1}, s_{t-1}) = \Phi(s_t) - \Phi(s_{t-1})$,
it still works because the middle terms would cancel out in a summation over repeated steps, and it
is easier to implement. The reward function also uses $\rho_t = \log P(s_{t+1} \mid s_t)$, the log of the known
transition probability at each timestep. See Lee, Mengshoel, Saksena, et al. [53].
An example result from Lee, Mengshoel, Saksena, et al. [53] is shown in fig. 2.6. The planes need
to cross paths, and the validation method was able to find a rollout where pilot responses to the
ACAS X system led to an NMAC. AST was used to find a variety of different failures in ACAS X.
Figure 2.6: An example result from Lee, Mengshoel, Saksena, et al. [53], showing an NMAC identified
by AST. The left panel plots Position North (ft) against Position East (ft); the right panel plots
Altitude (ft) against Time (s). Note that the planes must be both vertically and horizontally near
each other to register as an NMAC.
2.7 Discussion
This chapter presented adaptive stress testing (AST), including the problem formulation, the reward
function format, and the methodology for both direct disturbance control and seed-disturbance
control. We were able to prove that the RL problem shares its optimal trajectory with the original
optimization problem in eq. (2.7). We
were also able to provide some useful guidelines on setting hyperparameters in the reward function, as
well as on how to design a heuristic reward that will not change the optimal trajectory. We grounded
this theory by presenting three examples of how to formulate the validation of an autonomous system
according to AST. The theory presented in this chapter underlies all of the work throughout the
rest of this thesis.
Chapter 3
Scalable Validation
Adaptive Stress Testing (AST) allows us to search the space of possible simulation rollouts for a
specific scenario to find the most likely failure. Consequently, we can avoid significant simplifications
or constraints that could undermine results when validating autonomous systems. However, many
autonomous systems act in state and action spaces that are both continuous and high-dimensional.
Therefore, we want an AST reinforcement learner that can also scale well to complex problems.
Prior work on AST [40] used Monte Carlo tree search, a reinforcement learning method that builds
a tree of state and action nodes to find the trajectory that yields the highest reward. During rollouts,
MCTS takes actions that balance exploration and exploitation, and then uses the rollout rewards
to track the expected value of different state-action pairs. Unfortunately, tree size can quickly
explode when dealing with continuous or high-dimensional state or action spaces. While there are
variations of MCTS that allow it to perform better with continuous or high-dimensional spaces (see
section 2.1.2), they often involve pruning branches that do not seem promising. Aggressive pruning
can cause MCTS to miss branches that would have led to better solutions, harming performance.
In addition, MCTS commits to the early part of a trajectory before it moves on to explore later
steps. Consequently, MCTS can struggle with exploitation, as it may converge to a trajectory that
is similar to an optimal solution but is slightly worse, especially in continuous spaces. In the case of
AST, suboptimal exploitation means that MCTS could yield an underestimate of the probability of
the most likely failure.
To address the limitations of the MCTS reinforcement learner, this chapter presents an AST
reinforcement learner that uses deep reinforcement learning (DRL). Instead of a tree, DRL represents
a policy with a neural network and then uses batches of rollouts to perform optimization. DRL has
already been shown to perform well on tasks with continuous or high-dimensional state or action
spaces such as Atari environments. In addition, numerical optimization allows DRL to have strong
exploitation. In fact, contrary to MCTS, DRL can struggle instead with exploration. To mitigate
the exploration issues, instead of having the DRL reinforcement learner output actions directly,
the reinforcement learner outputs the mean and standard deviation of a Gaussian
distribution. Actions are then sampled from the resulting distribution. Early in training, the
use of distributions can enhance exploration by adding randomness. However, through the course
of training, the reinforcement learner learns to reduce the standard deviations, so the policy still
converges to a distribution with little stochasticity.

Figure 3.1: The network architecture of the fully-connected DRL reinforcement learner. A number
of hidden layers (with 32 to 512 units) learn to map the simulation state $s_t$ to
$x_{t+1} = [\mu_{t+1}, \Sigma_{t+1}]$, the mean and diagonal covariance of a multivariate normal distribution. The
disturbance $x_{t+1}$ is then sampled from the distribution.
Background on Monte Carlo tree search can be found in section 2.1.2. Background on deep
reinforcement learning can be found in section 2.1.3.
3.1 Fully Connected Approach
The most straightforward way to apply DRL to AST is to represent the policy as a fully-connected
neural network, as shown in fig. 3.1. The fully connected neural network maps the simulation state
to the action distribution parameters through a series of fully connected hidden layers. Architecture
design choices like the number of layers, the size of the layers, and types of non-linearities can be
adjusted based on the complexity of the AST problem.
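As a concrete illustration, a minimal PyTorch sketch of such a policy network follows. The layer sizes match those listed in fig. 3.1, but their ordering, the activation function, the state-independent log-standard-deviation parameterization, and all names are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class FCGaussianPolicy(nn.Module):
    """Fully connected policy mapping simulation state to a Gaussian
    over disturbances, in the spirit of fig. 3.1."""

    def __init__(self, state_dim, disturbance_dim,
                 hidden=(512, 256, 128, 64, 32)):
        super().__init__()
        layers, last = [], state_dim
        for width in hidden:
            layers += [nn.Linear(last, width), nn.Tanh()]
            last = width
        self.body = nn.Sequential(*layers)
        self.mean = nn.Linear(last, disturbance_dim)
        # Diagonal covariance, parameterized by a learned log std.
        self.log_std = nn.Parameter(torch.zeros(disturbance_dim))

    def forward(self, state):
        mu = self.mean(self.body(state))
        dist = torch.distributions.Normal(mu, self.log_std.exp())
        return dist.rsample()  # sample a disturbance from N(mu, Sigma)
```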
The fully connected architecture has two key advantages. The first is simplicity. A fully connected
neural network is the easiest to implement and the most straightforward to optimize. The second
advantage is performance. By mapping directly from the simulation state, the network has all
the information it needs to force failures. Unfortunately, there is a major limitation as well. By
mapping directly from the simulation state, the fully connected architecture violates the black-box
assumption. In some cases, this may be acceptable, but complex autonomous systems are often
tested in complex simulators, where it may be difficult and time-consuming to get access to the full
simulation state. A different architecture is needed for such cases.

Figure 3.2: The network architecture of the recurrent DRL reinforcement learner. An LSTM learns
to map the previous disturbance $x_t$ and the previous hidden state $h_t$ to the next hidden state $h_{t+1}$
and to $x_{t+1} = [\mu_{t+1}, \Sigma_{t+1}]$, the mean and diagonal covariance of a multivariate normal distribution.
The disturbance $x_{t+1}$ is then sampled from the distribution.
3.2 Preserving the Black-box Simulator Assumption
For cases where the simulator must be treated as a black box, we can instead use a recurrent neural
network architecture, as shown in fig. 3.2. A recurrent neural network maps two inputs—the actual
input and the previous hidden state—to two outputs—the actual output and the next hidden state.
The hidden state is a learned latent approximation of the state space, which allows the architecture
to force failures without access to the simulator state. Instead, the reinforcement learner uses the
previous action as input and effectively learns to approximate a state space over the course of
training. While there are many flavors of recurrent networks, in this case we use an LSTM because
it has been shown to have strong performance on many types of sequential tasks.
The main advantage of the recurrent architecture is that it can treat the simulator as a black
box. By learning to map a sequence of actions to a hidden state, the reinforcement learner can find
failures without having access to the actual simulator state. Learning the hidden state is a harder
problem, so the recurrent architecture will usually be slower than the fully connected architecture
in cases where both can work. Implementing a recurrent neural network can be more complicated
than implementing a fully connected neural network as well. However, the recurrent architecture
still retains the main advantages of DRL over MCTS: scalability and strong exploitation.
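A corresponding sketch of the recurrent reinforcement learner follows; the hidden size and other details are illustrative assumptions, with only the input/output structure taken from fig. 3.2.

```python
import torch
import torch.nn as nn

class RecurrentGaussianPolicy(nn.Module):
    """LSTM policy mapping the previous disturbance and hidden state to
    a Gaussian over the next disturbance, in the spirit of fig. 3.2."""

    def __init__(self, disturbance_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTMCell(disturbance_dim, hidden_dim)
        self.mean = nn.Linear(hidden_dim, disturbance_dim)
        self.log_std = nn.Parameter(torch.zeros(disturbance_dim))

    def forward(self, prev_disturbance, hidden):
        # hidden is the (h, c) pair carried across timesteps; it acts as
        # a learned latent stand-in for the unobserved simulator state.
        h, c = self.lstm(prev_disturbance, hidden)
        dist = torch.distributions.Normal(self.mean(h), self.log_std.exp())
        return dist.rsample(), (h, c)
```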
3.3 Experiments
This section defines the set of scenarios and metrics that we will use to evaluate the performance
of the methods as well as the reward function used by the reinforcement learners. The simulator
design outlined here is the basis of the example AV simulator in the AST Toolbox (see chapter 8)
and will be used throughout this thesis.
3.3.1 Simulator Design
Figure 3.3: The modular simulator implementation. The simulator comprises actor dynamics, a
sensor module, a tracker module, and the SUT's path planner and dynamics. The modules of the
simulator can be easily swapped to test different scenarios, SUTs, or sensor configurations.
The system under test (SUT), the sensors, the tracker, the reinforcement learner, and the scenario
definition are separate components in the simulation framework. The simulator architecture is shown
in fig. 3.3. The reinforcement learner outputs the disturbances (see section 3.3.2) to the simulator;
they are used to update non-SUT actors controlled by AST. In our experiments, the only actors
are pedestrians. In order to have smooth trajectories, the disturbances control the pedestrians by
setting their acceleration at each timestep. The sensors receive the new actor states and output
measurements augmented with the noise from the disturbances. The measurements are filtered by
the tracker, which is an alpha-beta filter, and passed to the SUT. See section 3.3.2 for more details
on the implementation of the sensors and tracker. The SUT, which is the driving model, decides
how to maneuver the vehicle based on its observations. We use a modified version of the Intelligent
Driver Model (IDM) as the SUT [68] (see section 3.3.2). The SUT actions are used to update the
state of the vehicle. The simulator outputs the transition probability and event indicator to the
reward function. The current state of the simulator can be represented in different ways. If the state
of the simulator is fully observable, the simulator can provide its state or an autoencoder processed
version of the state [72]. Otherwise, the history of previous actions can be used to represent the
current state. The state representation, along with the reward of the previous step, are then input
to the reinforcement learner.
An advantage of this modular approach is that, if multiple IDM implementations were to be
compared, it would be easy to swap them out and compare the results. Entirely different SUTs can be
compared, or individual modules can be changed, for example by comparing how the SUT performs
with two different tracker modules. The modularity allows AST to be a useful benchmarking tool,
or a batch testing method for autonomous systems. An especially useful version of AST for system
or component comparisons is differential AST [73], which searches two simulators simultaneously to
maximize the difference in performance between the two SUTs.
The reinforcement learner can use two different procedures to generate the disturbances, as shown
in fig. 3.4. The direct disturbance control approach, used by DRL in this chapter, is for the agent
to output the parameters of a distribution at each timestep. Disturbances are then sampled from
the distribution. The seed-disturbance control approach, used by MCTS in this chapter, is to output
pseudorandom seeds, which are used to seed random number generators. The disturbances are then
sampled using these random number generators. The seed-disturbance control approach is useful for
large, obfuscated simulators that would be difficult to control with direct disturbances but already
use many random number generators.
The inputs to the reinforcement learners vary slightly. MCTS does not make use of the simulator’s
internal state, treating it entirely as a black box. Instead of making use of the simulator’s internal
state, the AST implementation of MCTS uses a history of the previous pseudorandom seeds as
the nodes in the tree [40]. In contrast, the fully-connected DRL reinforcement learner takes the
simulation state as input. The simulation state is
\[
s_{sim} = [s_{sim}^{(1)}, s_{sim}^{(2)}, \dots, s_{sim}^{(n)}]
\]
For the $i$th pedestrian,
\[
s_{sim}^{(i)} = [v_x^{(i)}, v_y^{(i)}, x^{(i)}, y^{(i)}]
\]
where

• $v_x^{(i)}, v_y^{(i)}$ are the x and y components of the relative velocity between the SUT and the $i$th pedestrian.

• $x^{(i)}, y^{(i)}$ are the x and y components of the relative position between the SUT and the $i$th pedestrian.
3.3.2 Problem Formulation
To evaluate the effectiveness of AST as applied to autonomous vehicles, we stress test a vehicle
in a set of scenarios at a pedestrian crosswalk, shown in fig. 3.5. The scenario is defined by a
single autonomous vehicle approaching a crosswalk. The road has two lanes to model a regular
neighborhood road, although there is no traffic in either direction for this specific example.

Figure 3.4: A comparison of the reinforcement learner methods. MCTS uses a seed to control a
random number generator. DRL outputs a distribution, which is then sampled. Both of these
methods produce disturbances.
The lanes are each 3.7 m wide and the crosswalk is 3.1 m wide, per California state regulations [74].
The Cartesian origin is set at the intersection of the central vertical axis of the crosswalk and the
central horizontal axis of the bottom lane, with the positive x direction following the direction of
the arrow in fig. 3.5, and positive y motion being towards the top lane of the street. We test
with different numbers of pedestrians as well as with different starting states. The state of the $i$th
pedestrian is
\[
s_{ped}^{(i)} = [v_x^{(i)}, v_y^{(i)}, x^{(i)}, y^{(i)}]
\]
where

• $v_x^{(i)}, v_y^{(i)}$ are the x and y components of the velocity of the $i$th pedestrian.

• $x^{(i)}, y^{(i)}$ are the x and y components of the position of the $i$th pedestrian.
We present data from each pedestrian for the three different variations of the scenario:

• 1 pedestrian, with initial state $s_{ped}^{(1)} = [0.0\ \text{m/s},\ 1.4\ \text{m/s},\ 0.0\ \text{m},\ -2.0\ \text{m}]$

• 1 pedestrian, with initial state $s_{ped}^{(1)} = [0.0\ \text{m/s},\ 1.4\ \text{m/s},\ 0.0\ \text{m},\ -4.0\ \text{m}]$

• 2 pedestrians, with initial states $s_{ped}^{(1)} = [0.0\ \text{m/s},\ 1.4\ \text{m/s},\ 0.0\ \text{m},\ -2.0\ \text{m}]$ and $s_{ped}^{(2)} = [0.0\ \text{m/s},\ -1.4\ \text{m/s},\ 0.0\ \text{m},\ 5.0\ \text{m}]$
Figure 3.5: The setup of the three crosswalk scenarios. (a) Overview of the general crosswalk
scenario, showing the ego vehicle at $(x_{car}, y_{car})$ with velocity $v_{car,x}$ and the $i$th pedestrian at
$(x_{ped}^{(i)}, y_{ped}^{(i)})$ with velocity $v_{ped,y}^{(i)}$. (b) Scenario 1: one pedestrian at $(0.0, -2.0)$ m moving at 1.4 m/s.
(c) Scenario 2: one pedestrian at $(0.0, -4.0)$ m moving at 1.4 m/s. (d) Scenario 3: pedestrians at
$(0.0, -2.0)$ m moving at 1.4 m/s and at $(0.0, 5.0)$ m moving at $-1.4$ m/s. (e) The initial pedestrian
configurations for the three different scenarios.
The scenario variations are shown in fig. 3.5. The first scenario (fig. 3.5b) was chosen as a basic
example to demonstrate AST. The second scenario (fig. 3.5c) was chosen to show that a different
initial condition leads to different collision trajectories. The third scenario (fig. 3.5d) shows the
scalability of AST by including more actors in the scenario.
Actor Dynamics
Both reinforcement learners use the same representation for disturbances. The disturbance vector
at each timestep is
\[
a_{env} = [a^{(1)}, a^{(2)}, \dots, a^{(n)}]
\]
where $n$ is the number of pedestrians. For the $i$th pedestrian,
\[
a^{(i)} = [a_x^{(i)}, a_y^{(i)}, \varepsilon_{v_x}^{(i)}, \varepsilon_{v_y}^{(i)}, \varepsilon_x^{(i)}, \varepsilon_y^{(i)}]
\]
where

• $a_x^{(i)}, a_y^{(i)}$ represent the x and y components of the $i$th pedestrian's acceleration, respectively.

• $\varepsilon_{v_x}^{(i)}, \varepsilon_{v_y}^{(i)}$ represent the noise injected into the SUT measurement of the components of the $i$th pedestrian's velocity $v_x^{(i)}$ and $v_y^{(i)}$, respectively.

• $\varepsilon_x^{(i)}, \varepsilon_y^{(i)}$ represent the noise injected into the SUT measurement of the x and y components of the $i$th pedestrian's position, respectively.
AST controls both the pedestrian motion and the sensor noise, which allows it to search over both
pedestrian actions and perception failures to find the most likely collision.
At each time step, the pedestrian samples a(i) (as mentioned above, the procedure of this sampling
differs slightly between reinforcement learners, but the representation of the action vector a(i) is the
same). To find the likelihood of a(i), a model of the expected pedestrian action vector is needed.
This model is a multivariate Gaussian distribution N (µa,Σ) where µa is a zero-vector, and Σ is
diagonal. Our pedestrian model is parameterized by σaLat, σaLon, and σnoise, which are the diagonal
elements of the covariance matrix and correspond to lateral acceleration, longitudinal acceleration,
and sensor noise, respectively. The values we use are: σaLat = 0.01, σaLon = 0.1, and σnoise = 0.1.
The acceleration parameters are designed to encourage the pedestrians to move across the street
with some lateral movement. The assumption of the mean action being the zero-vector implies
that, on average, pedestrians maintain their current speed and heading. In reality, this distribution
could depend on the location of the pedestrian, where the vehicle is, the attitude or attention of
the pedestrian, or other factors. Applying a more realistic pedestrian model is an avenue for future
work. The initial speed of the pedestrian is set to 1.5 m/s.
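A sketch of this pedestrian model and its log-likelihood is shown below; the ordering of the disturbance components, and which acceleration component is treated as lateral, are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Diagonal covariance entries for [a_x, a_y, eps_vx, eps_vy, eps_x, eps_y],
# using sigma_aLat = 0.01, sigma_aLon = 0.1, and sigma_noise = 0.1 as the
# diagonal elements of the covariance matrix, per the text.
cov_diag = np.array([0.01, 0.1, 0.1, 0.1, 0.1, 0.1])
pedestrian_model = multivariate_normal(mean=np.zeros(6),
                                       cov=np.diag(cov_diag))

def disturbance_log_likelihood(a):
    """Log-likelihood log p(a) of one pedestrian's disturbance vector."""
    return pedestrian_model.logpdf(a)
```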
Sensor and Tracker Models
The sensors of the SUT receive a vector of the actor states and output a vector of noisy measurements
\[
m = [m^{(1)}, m^{(2)}, \dots, m^{(n)}]
\]
For the $i$th pedestrian, $m^{(i)} = s_{ped}^{(i)} + \varepsilon^{(i)}$, where
\[
\varepsilon^{(i)} = [\varepsilon_{v_x}^{(i)}, \varepsilon_{v_y}^{(i)}, \varepsilon_x^{(i)}, \varepsilon_y^{(i)}]
\]
The measurements are passed to an alpha-beta tracker [75], which is parameterized by αtracker and
βtracker. The tracker returns filtered versions of the measurements as the SUT’s observations. We
use the values αtracker = 0.85 and βtracker = 0.005.
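A scalar alpha-beta filter update of the kind described might look like the following sketch; the update structure is the standard textbook form rather than the exact tracker implementation used here.

```python
def alpha_beta_update(x_est, v_est, measurement, dt,
                      alpha=0.85, beta=0.005):
    """One alpha-beta filter step for a tracked position and velocity.

    Predict forward with the current velocity estimate, then correct
    both estimates using the measurement residual.
    """
    x_pred = x_est + v_est * dt             # prediction step
    residual = measurement - x_pred         # innovation
    x_new = x_pred + alpha * residual       # position correction
    v_new = v_est + (beta / dt) * residual  # velocity correction
    return x_new, v_new
```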
System Under Test Model
The SUT is based on the Intelligent Driver Model [68]. The IDM is designed to stay in one lane and
safely follow traffic. To follow the rules around crosswalks, we set the desired velocity at 25 miles
per hour (11.2 m/s). If there is no vehicle in front of the IDM for it to follow, the model maintains
a desired velocity. We adapted the IDM for interacting with pedestrians by modifying it to treat
the closest pedestrian in the road as the target vehicle. The IDM then tries to follow a safe distance
behind the pedestrian based on the difference between their velocities, which results in the vehicle
stopping at the crosswalk since the pedestrian’s vx is negligible. Our modified IDM is obviously not a
safe model; as we will show, ignoring any pedestrian outside of the road makes the vehicle vulnerable
to being blindsided by people moving quickly from the curb. The goal of this chapter, however, is
to show that AST can effectively induce poor behavior in an autonomous driving algorithm, not to
present a safe algorithm. The SUT model receives a series of filtered observations
\[
o = [o^{(1)}, o^{(2)}, \dots, o^{(n)}]
\]
If there are pedestrians in the road, the SUT model uses the closest pedestrian to find
\[
s_{SUT} = [v_{oth}, s_{headway}]
\]
where

• $v_{oth}$ is the relative x velocity between the SUT and the closest pedestrian.

• $s_{headway}$ is the relative x distance between the SUT and the closest pedestrian.
These factors determine the acceleration of the SUT in the next time step.
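The modification can be sketched as follows. The IDM constants here are typical published values, not the tuned parameters of this SUT; only the idea of treating the closest in-road pedestrian as the lead vehicle reflects the model described above.

```python
import math

def idm_acceleration(v, v_rel, s_headway, v_des=11.2, a_max=1.4,
                     b=2.0, s_min=2.0, t_headway=1.5, delta=4):
    """Modified-IDM acceleration, treating the closest in-road
    pedestrian as the lead vehicle.

    v:         SUT speed; v_rel: closing speed toward the pedestrian
    s_headway: gap to the pedestrian (float('inf') for a free road)
    Constants are typical published IDM values, not the thesis tuning.
    """
    # Desired gap grows with speed and with the closing rate.
    s_star = s_min + v * t_headway + v * v_rel / (2 * math.sqrt(a_max * b))
    return a_max * (1 - (v / v_des) ** delta - (s_star / s_headway) ** 2)

# Free road: with no pedestrian in the road the gap term vanishes and
# the model relaxes toward its desired velocity of 11.2 m/s.
accel = idm_acceleration(v=9.0, v_rel=0.0, s_headway=float("inf"))
```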
Modified Reward Function
Our modified version of the AST reward (see section 2.4) is shown in eq. (3.1). As a proxy for the
probability of an action, we use the Mahalanobis distance [69], which is a measure of distance from
the mean generalized for multivariate continuous distributions. We use a large negative number as a
penalty for rollouts that do not end in a failure. In addition, the penalty at the end of a no-collision
case includes a heuristic reward that is scaled by the distance (dist) between the pedestrian and
the vehicle. The penalty encourages the pedestrian to end early trials closer to the vehicle, which
allows the reinforcement learner to find failures more quickly and leads to faster convergence. The
reward function is modified from the previous version of AST [40] as follows:
\[
R(s) =
\begin{cases}
0 & s \in E \\
-10000 - 1000 \times \text{dist}(p_v, p_p) & s \notin E,\ t \geq T \\
-\log(1 + M(a, \mu_a \mid s)) & s \notin E,\ t < T
\end{cases} \tag{3.1}
\]
where $M(a, \mu_a \mid s)$ is the Mahalanobis distance between the action $a$ and the expected action $\mu_a$
given the current state $s$. The distance between the vehicle position $p_v$ and the closest pedestrian
position $p_p$ is given by the function $\text{dist}(p_v, p_p)$.
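A sketch of eq. (3.1) as code follows; the argument names are hypothetical, and the diagonal-covariance Mahalanobis computation assumes the zero-mean pedestrian action model described above.

```python
import numpy as np

def crosswalk_reward(collision, at_horizon, a, cov_diag,
                     p_vehicle, p_pedestrian):
    """Reward following eq. (3.1). Assumes the zero-mean pedestrian
    action model with diagonal covariance cov_diag."""
    if collision:
        return 0.0
    if at_horizon:
        dist = np.linalg.norm(p_vehicle - p_pedestrian)
        return -10000.0 - 1000.0 * dist  # penalty shaped by distance
    # Mahalanobis distance of the action from the zero-mean model:
    # M = sqrt(a^T diag(cov)^-1 a) for a diagonal covariance.
    mahalanobis = np.sqrt(np.sum(a ** 2 / cov_diag))
    return -np.log(1.0 + mahalanobis)
```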
Metrics
We use two metrics to evaluate the AST algorithms. The first is the likelihood of the final collision
trajectory output by the system. The second metric is the number of calls to the step function. The
goal of the second metric is to compare the efficiency of the two AST reinforcement learners. Because
the implementations are separate, both wall-clock time and iteration counts are inappropriate for
comparison. The number of calls to the simulator update function (Step), which was the computational
bottleneck, is used instead. This metric is agnostic to the implementation hardware, the algorithm
used, and the run-time of updating the simulation.
Reinforcement Learners
For MCTS, the parameters that control how much of the state space is explored are the depth, the
horizon T , and the number of iterations. The depth and horizon are chosen to be equal so that the
search and rollout stages explore the same scenario. We experimented with different values for the
horizon (50, 75, 100) and found that 100 was the minimum horizon that is sufficiently long to cover
the scenario of interest. We used 2000 iterations. For additional detail on MCTS and DPW, see the
paper by Lee et al. [40].
For DRL, the results shown are obtained using a batch size of 4000. Experimentation showed
that reducing the batch size any further resulted in too much variance during the trials. We use a
step size of 0.1, and a discount factor of 1.0. The DRL approach is implemented using RLLAB [76].
3.4 Results
Table 3.1: Numerical results from both reinforcement learners. Reward without noise shows the
reward of the MCTS path if sensor noise was set to zero, to illustrate the difficulty that MCTS has
with eliminating noise. DRL is able to find a more probable path than MCTS with a large reduction
in calls to the Step function.

Scenario | MCTS Calls to Step | MCTS Reward | MCTS Reward w/o noise | DRL Calls to Step | DRL Reward
1 | 4.9×10^8 | −131 | −71 | 8×10^5 | −62
2 | 1.9×10^6 | −38 | −15 | 8×10^5 | −1.7
3 | 1.6×10^9 | −161 | −104 | 1×10^6 | −52
Figure 3.6: Pedestrian motion trajectories for each scenario and algorithm (MCTS in the top row,
DRL in the bottom row; axes are x (m) and y (m)). (a) Scenario 1. (b) Scenario 2. (c) Scenario 3.
The collision point is the point of contact between the vehicle and the pedestrian. In scenario 3,
pedestrian 1 does not collide with the vehicle.
The results show that both reinforcement learners are able to identify failure trajectories in an
autonomous vehicle scenario. MCTS and DRL identify several simulation rollouts where the vehicle
collides with the pedestrian. Table 3.1 shows the results for the three different scenarios. Both
methods successfully converge to a solution in a tractable number of simulator steps. AST is able
to find collisions by taking advantage of the modified IDM’s decision to ignore any pedestrian who
is not in the road. Although the likelihoods seem to vary greatly, much of this difference is due to
MCTS having negligible but non-zero noise that adds up over the long horizon. The likelihood of
pedestrian motion dominates that of sensor noise. Consequently, the noise should be very sparse,
and the DRL solution reflects this. MCTS, however, has difficulty driving the noise to true zero.
This difficulty results in very small numbers in the noise vector throughout the trajectory. While
this near-zero noise is not enough to affect the result of the trajectory, it does accumulate over the
long trajectory, resulting in a significant difference in the trajectory’s likelihood. We recomputed
the reward as if the noise were 0 as a reference, which is also shown in Table 3.1.
3.4.1 Performance
The number of calls to Step for MCTS is the number of calls required to find a collision in the
rollouts. Training could continue and possibly find better failures, at the cost of extra computation.
The MCTS reinforcement learner had high variance and therefore had to be run multiple
times to achieve consistent results. The rewards shown are the average over the failures found in 100 trials,
and the steps are the total steps taken across all trials.
Across all scenarios, DRL consistently converges to solutions with less than 1% of the number of
calls to Step required by MCTS despite the state and action spaces being very small. Theoretically,
the scalability advantages of DRL should be even more apparent in a higher-dimensional problem.
This advantage is supported by the fact that MCTS performs worse on the dual pedestrian scenario
than it does on both single pedestrian scenarios.
3.4.2 Trajectories
Figure 3.6 shows the pedestrian paths until the collision from each reinforcement learner for scenarios
1, 2, and 3, respectively. The start positions and collision positions, if the trajectory ends in collision,
are marked. In scenario 1, both reinforcement learners send a single pedestrian into the road, and
have the pedestrian move towards the vehicle to create a collision. However, the turn towards the
vehicle is much more pronounced in MCTS, where the pedestrian comes to a near stop, before
angling hard left and into the vehicle. DRL instead settles on a smoother path. The DRL path is
slightly more likely because less acceleration is needed. In scenario 2, both reinforcement learners
find failures with similar trajectories. The pedestrians start at a point from which their mean action
should create a collision. Both reinforcement learners identify this path quickly and the pedestrians
take very little action.
Elements of both scenarios recur in scenario 3, which shows the largest difference between the two
reinforcement learners. Because pedestrian 2 starts farther away from the vehicle, pedestrian 2 has the more likely path to being hit by the
vehicle, as in scenario 2. In both reinforcement learners, the second pedestrian takes actions similar
to each reinforcement learner’s respective scenario 1. However, there is a large difference between
the trajectories of pedestrian 1 from each reinforcement learner. In MCTS, pedestrian 1 maintains
their initial velocity and heading until shortly before the end of the trajectory, when they aggressively accelerate towards the other side of
the road. In DRL, pedestrian 1 takes a slight turn to the right, and then maintains their velocity
and heading from there. MCTS has less ability to minimize the effect of pedestrian 1 on the total
reward since using a single seed results in coupling the actions of the pedestrians. Hence, pedestrian
1 has a different and less optimal trajectory than their counterpart in DRL. In DRL, pedestrian 1
had a change of direction at first, causing pedestrian 2 to be closer to the vehicle. Then pedestrian 1
maintained a course with very little acceleration, minimizing that pedestrian’s effect on the reward.
In scenarios 1 and 3, the blame for the collision could fall on the pedestrian, which would not
inform any modifications to the SUT. However, in scenario 2, the blame is on the vehicle, since it
does not check for pedestrians approaching the crosswalk until the pedestrians are in the crosswalk,
which leaves the vehicle very little time to respond. The suggestion for avoiding a collision like
scenario 2 is to expand the sensing range of the IDM beyond the curb of the road. The reason
AST returns situations where the blame is not on the vehicle is that we define E, the subset of state
space that we are interested in, to be any collision. The kinds of collisions reported in scenarios 1
and 3 do not give the designer of the SUT any insight into how the SUT should be improved. The
solution is to redefine the space of events of interest E to be the subset of collisions in which the
responsibility for the collision falls on the SUT [47]. Such a definition of E requires formal models
of responsibility and blame in various road situations [77].
3.5 Discussion
This chapter demonstrated that we can use deep reinforcement learning to improve the efficiency
of adaptive stress testing. Deep reinforcement learning can find more likely failure scenarios than
Monte Carlo tree search, and it finds them more efficiently. Despite the improved performance,
the trajectories found by both algorithms are similar, which demonstrates consistency of results
across different reinforcement learners. The improved scalability can allow AST to be applied to
autonomous systems acting in continuous and high-dimensional spaces, but AST can still require
substantial computation. This cost can be prohibitive if we want to evaluate a scenario across
multiple similar initial conditions. In the next chapter, we present a slight modification to the
DRL architecture and training process that allows a single run of AST to generalize across initial
conditions.
Chapter 4

Generalizing across Initial Conditions
AST can identify the most likely failure for a given scenario, but what if we are interested in a
range of similar scenarios? For validation purposes, scenarios are often defined not by concrete
instantiations but as a class of scenarios where each initial condition parameter is within a certain
range. Validation would then involve testing across numerous instantiations from the scenario class,
and a system might have to be safe on every instantiation to pass the overall scenario class.
As an example, consider again the running crosswalk scenario. We might be interested in the case
where the vehicle starts with a velocity anywhere between 25 and 35 mph, or where the pedestrian
starts anywhere from 1 to 6 meters from the crosswalk entry, as shown in fig. 4.1. Currently, we
would have to run a new AST instance for each concrete instantiation we are interested in, as shown
in fig. 4.2a. While AST can make finding the most likely failure tractable, it is not necessarily fast,
and we certainly do not want to be required to run hundreds of AST instances for similar scenario
instantiations. Having to run a new AST instance for each instantiation is especially problematic
considering that each instantiation shares a lot of underlying similarities—conceptually, it is easy
to imagine that moving the pedestrian’s starting position a small distance back from the crosswalk
should not greatly change the resulting failure trajectory. Therefore, we should be able to create
an AST reinforcement learner that can generalize during training across the entire initial condition
space, as shown in fig. 4.2b, and it should be far more efficient than running a new instance for
each instantiation since the reinforcement learner will no longer be wasting information from similar
rollouts.
This chapter presents a slight modification to the DRL architecture that allows a single AST
reinforcement learner to generalize across a scenario class defined by an initial condition space. The
new architecture requires only a single instance of AST to be run, and initial conditions are sampled
38
CHAPTER 4. GENERALIZING ACROSS INITIAL CONDITIONS 39
ego vehicle pedestrian
Figure 4.1: An example of a scenario class for the crosswalk example. The ego vehicle and thepedestrian both have a range of initial conditions for their position and velocity. A concrete scenarioinstantiation could be created by sampling specific initial condition values from the ranges.
(a) The previous version of AST running over a space of initial conditions I. The space must be discretized, and then each initial condition s0 requires a separate instance of the DRL reinforcement learner and the simulator. In addition, the reinforcement learner requires the next simulator state st at each time-step.
(b) The new version of AST running over a space of initial conditions I. The continuous space is sampled at random at the start of each trajectory; therefore only one instance of the reinforcement learner is needed. The reinforcement learner does not need access to the simulation state because it maintains a hidden state ht at each time-step. The reinforcement learner instead uses the previous action at.
Figure 4.2: Contrasting the new and old AST architectures. The new reinforcement learner uses a recurrent architecture and is able to generalize across a continuous space of initial conditions with a single trained instance. These improvements allow AST to be used on problems that would have previously been intractable.
Figure 4.3: The network architecture of the generalized recurrent DRL reinforcement learner. An LSTM learns to map the input, a concatenation of the previous disturbance xt and the simulator's initial conditions s0, together with the previous hidden state ht to the next hidden state ht+1 and to [µt+1, Σt+1], the mean and diagonal covariance of a multivariate normal distribution. The disturbance xt+1 is then sampled from this distribution.
at the start of each rollout during training, as shown in fig. 4.2b. The output of AST is no longer the
most likely failure trajectory but is instead a policy that maps initial conditions to the corresponding
most likely failure. Therefore, if a designer were interested in a specific set of initial conditions, they
would first run AST across the whole space, and then they would feed initial conditions into the
trained policy to find the corresponding failures.
4.1 Modified Recurrent Architecture
We presented architectures in chapter 3 for both a fully-connected DRL reinforcement learner,
which we will now refer to as the FCDRL reinforcement learner, and for a discrete (non-generalized)
recurrent DRL reinforcement learner, which we will now refer to as the DRDRL reinforcement
learner. We now present an architecture for a generalized recurrent DRL reinforcement learner,
which we will refer to as the GRDRL reinforcement learner. The GRDRL architecture is shown in
fig. 4.3. The architecture is the same as the DRDRL architecture shown in fig. 3.2, except instead
of the input at each timestep being the previous action, the input is now a concatenation of the
previous action and the initial condition. Because the reinforcement learner sees only the sequence of
actions taken, it previously had no way to differentiate between two trajectories that take identical actions from different initial conditions, even though the outcomes could be very different. This slight modification gives the reinforcement learner enough information to differentiate
between trajectories from different initial conditions, so when the initial conditions of each rollout
are sampled at train time, the resulting policy learns to find the most likely failure across the entire
space of initial conditions. This problem is larger, and therefore harder, than finding a failure from
a single initial condition, so the generalized policy may take longer to converge than running AST
from a single scenario instance. However, the policy will learn during training certain behaviors
CHAPTER 4. GENERALIZING ACROSS INITIAL CONDITIONS 41
that apply across the entire space of initial conditions; for example, it will learn how to use noise
to fool the system-under-test’s perception system, so it should be far more efficient than running a
new AST instance from each initial condition.
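To make the modification concrete, the following is a minimal sketch of a GRDRL-style policy. The sketch is illustrative only: it uses PyTorch for brevity, the class and method names are hypothetical rather than taken from our implementation, and the dimensions follow the experiment setup in section 4.2.

```python
import torch
import torch.nn as nn

class GRDRLPolicy(nn.Module):
    """Recurrent AST policy: maps (x_t, s_0, h_t) to a distribution over x_{t+1}."""

    def __init__(self, action_dim=6, s0_dim=5, hidden_size=64):
        super().__init__()
        # The only change from the DRDRL architecture: the input is the
        # previous disturbance concatenated with the initial conditions s_0.
        self.lstm = nn.LSTMCell(action_dim + s0_dim, hidden_size)
        self.mean_head = nn.Linear(hidden_size, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # diagonal covariance

    def step(self, prev_action, s0, hidden=None):
        h, c = self.lstm(torch.cat([prev_action, s0], dim=-1), hidden)
        dist = torch.distributions.Normal(self.mean_head(h), self.log_std.exp())
        return dist, (h, c)  # sample dist to get the next disturbance x_{t+1}
```

At the start of each training rollout, s0 is sampled from the initial condition space and then held fixed, being fed to the policy at every timestep alongside the previous disturbance.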
4.2 Experiments
This section outlines the problem used in simulation to test AST, the hyperparameters of the DRL reinforcement learner, and the reward structure. For benchmarking purposes, we follow the experiment setup (simulation, pedestrian models, and SUT model) proposed in section 3.3. The problem has a 5-dimensional state space and a 6-dimensional action space, and is run for up to 50 timesteps of 0.1 s each.
4.2.1 Problem Formulation
Figure 4.4: The crosswalk scenario class. To instantiate a concrete scenario, the initial conditions s0,ped, s0,car, v0,ped, and v0,car are drawn from their respective ranges, defined in table 4.1.
Our experiment simulates a common neighborhood road driving scenario, shown in fig. 4.4. The
road has one lane in each direction. A pedestrian crosses at a marked crosswalk, from south to
north. The y origin is at the center of the crosswalk, and the x origin is where the crosswalk meets
the side of the road. The speed limit is 25 mph, which is 11.2 m/s.
The inputs to the GRDRL reinforcement learner include the initial state s0 = [s0,ped, s0,car, v0,ped, v0,car]
where
• s0,ped is the initial x, y position of the pedestrian,
• s0,car is the initial x position of the car,
• v0,ped is the initial y velocity of the pedestrian, and
• v0,car is the initial x velocity of the car.
Initial conditions are drawn from a continuous uniform distribution, with the supports shown in
table 4.1. Trajectory rollouts are instantiated by randomly sampling an initial condition from the
parameter ranges.
Table 4.1: The initial condition space. Initial conditions are drawn from a continuous uniform distribution defined by the supports below.
Variable    Min         Max
s0,ped,x    −1 m        1 m
s0,ped,y    −6 m        −2 m
s0,car      −43.75 m    −26.25 m
v0,ped      0 m/s       2 m/s
v0,car      8.34 m/s    13.96 m/s
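As an illustration, instantiating a rollout could look like the following sketch, which draws each parameter uniformly from the supports in table 4.1 (the variable names are illustrative):

```python
import numpy as np

# Supports copied from table 4.1; variable names are illustrative.
RANGES = {
    "s0_ped_x": (-1.0, 1.0),        # m
    "s0_ped_y": (-6.0, -2.0),       # m
    "s0_car": (-43.75, -26.25),     # m
    "v0_ped": (0.0, 2.0),           # m/s
    "v0_car": (8.34, 13.96),        # m/s
}

def sample_initial_conditions(rng):
    """Draw a concrete scenario instantiation uniformly from the supports."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

s0 = sample_initial_conditions(np.random.default_rng(0))
```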
4.2.2 Modified Reward Function
AST penalizes each step by the likelihood of the disturbances, as shown in eq. (4.1). Unlikely
actions have a higher cost, so the reinforcement learner is incentivized to take likelier actions, and
thereby find likelier failures. The Mahalanobis distance [69] is used as a proxy for the likelihood of an action; it is a measure of distance from the mean, generalized to multivariate continuous distributions. The penalty for failing to find a collision is controlled by α
and β. The penalty at the end of a no-collision case is scaled by the distance (dist) between the
pedestrian and the vehicle. The penalty encourages the pedestrian to end early trials closer to the
vehicle and leads to faster convergence. We use α = −1 × 10^5 and β = −1 × 10^4. The reward function is modified from the previous version of AST [40] as follows:
$$
R(s) =
\begin{cases}
0, & s \in E \\
-\alpha - \beta \times \mathrm{dist}(p_v, p_p), & s \notin E,\; t \geq T \\
-M(a, \mu_a, \Sigma_a \mid s), & s \notin E,\; t < T
\end{cases}
\tag{4.1}
$$
where M(a, µa,Σa | s) is the Mahalanobis distance between the action a and the expected action
µa given the covariance matrix Σa in the current state s. The distance between the vehicle position
pv and the closest pedestrian position pp is given by the function dist(pv,pp).
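A minimal sketch of how eq. (4.1) could be computed in practice follows; the function signature and helper names are hypothetical, and the constants are the α and β values given above.

```python
import numpy as np

ALPHA, BETA = -1e5, -1e4   # values from the text above

def mahalanobis(a, mu, cov):
    """Mahalanobis distance of action a from mean mu under covariance cov."""
    d = np.asarray(a) - np.asarray(mu)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def ast_reward(is_failure, t, horizon, a, mu, cov, p_vehicle, p_pedestrian):
    """Per-step reward, mirroring eq. (4.1)."""
    if is_failure:                       # s in E
        return 0.0
    if t >= horizon:                     # rollout ended without a failure
        dist = float(np.linalg.norm(np.asarray(p_vehicle) - np.asarray(p_pedestrian)))
        return -ALPHA - BETA * dist
    return -mahalanobis(a, mu, cov)      # per-step likelihood penalty
```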
4.2.3 Reinforcement Learners
The GRDRL reinforcement learner is compared against the DRDRL, FCDRL, and MCTS reinforcement learners. For the DRL reinforcement learners, the hidden layer size is 64. Training was done
with a batch size of 5 × 10^5 timesteps. The maximum trajectory length is 50, hence each batch has 1000 trajectories. The optimizer used a step size of 0.1 and a discount factor of 1.0.
4.3 Results
This section shows the performance of the new reinforcement learners on our running example. First,
the generalized reinforcement learner’s ability to train on the problem is compared to that of the
discrete recurrent reinforcement learner. Both reinforcement learners are then compared to baselines
to show their improvement.
4.3.1 Overall Performance
Figure 4.5: The Mahalanobis distance of the most likely failure found at each iteration for both architectures. The conservative discrete architecture runs each of the discrete reinforcement learners in sequential order. The optimistic discrete architecture runs each of the discrete reinforcement learners in a single batch.
The goal of AST is to understand failure modes by returning the most likely failure. An advantage
of the new architecture is its ability to search for the most likely failure from a space of initial
conditions while training a single network. Figure 4.5 demonstrates these benefits by showing the
cumulative maximum reward found by the DRDRL and GRDRL reinforcement learners at each
iteration. There are two estimates shown for the DRDRL architecture:
• Conservative (sequential): Each discrete AST reinforcement learner is run sequentially. This naive approach serves as a lower bound on the performance of the discrete architecture.
• Optimistic (batch): The AST reinforcement learners are updated as a batch. Each batch is assumed to still take 32 iterations, but the best reward of the best reinforcement learner is known after each update.
The generalized architecture outperforms the discrete architecture at every iteration. The gener-
alized version finds a collision sooner and converges to a solution after about 100 iterations, whereas
the discrete architecture is still training after 500 iterations. Furthermore, the generalized version
is able to find a trajectory that has a net Mahalanobis distance of −101.0. In contrast, the discrete version's most likely solution was −114.2. To put this in perspective, these results mean that the average timestep of the generalized version was 2.0 standard deviations from the mean disturbance, while the average timestep of the discrete version was 2.3 standard deviations from the mean disturbance. Over the entire space of initial conditions, the generalized architecture finds likelier failures in far fewer iterations than running the discrete architecture at discrete points.
4.3.2 Comparison to Baselines
Table 4.2: The aggregate results of the DRDRL and GRDRL reinforcement learners, as well as the MCTS and FCDRL reinforcement learners as baselines, on an autonomous driving scenario with a 5-dimensional initial condition space. Despite not having access to the simulator's internal state, the DRDRL reinforcement learner achieves results that are competitive with both baselines. However, the GRDRL reinforcement learner demonstrates a significant improvement over the other three reinforcement learners.
Metric                      MCTS      FCDRL     DRDRL     GRDRL Point   GRDRL Bin
Average Collision Reward    −192.92   −229.80   −236.25   −148.48       −133.86
Max Collision Reward        −145.80   −139.38   −125.51   −98.85        −91.67
Collisions Found            21        29        30        25            32
Collision Percentage        65.63     90.63     93.75     78.13         100
Table 4.2 shows the aggregate results of the new architectures as well as two baselines: the
fully-connected DRL (FCDRL) and a Monte Carlo tree search (MCTS) reinforcement learner. The
data was generated by dividing the 5-dimensional initial condition space into 2 bins per dimension, which resulted in 32 bins. Such a coarse discretization provides little confidence in our validation results, but the number of bins grows as b^5, where b is the number of bins per dimension. Using 3 bins per dimension, which hardly provides more confidence, would already require training 243 instances of AST. Even on a toy problem, running AST for a safe number of discrete points is intractable.
However, to demonstrate the performance benefits of the GRDRL architecture, we ran the MCTS,
FCDRL, and DRDRL reinforcement learners at the center-point of each of the 32 bins. The GRDRL
reinforcement learner was trained on the entire space of initial conditions and evaluated in two ways:
1) by executing the GRDRL reinforcement learner’s policy from the same 32 center-points the other
reinforcement learners were tested at, referred to as point evaluation, and 2) by sampling from each
bin in the initial condition space and keeping the best GRDRL solution, referred to as bin evaluation.
The GRDRL reinforcement learner far outperforms all baselines. When evaluating over the
entirety of each bin, the GRDRL reinforcement learner found collisions in every single bin, and
had by far the best average and maximum collision rewards. In particular, the maximum reward
demonstrates both the strength and the necessity of the new reinforcement learner architecture. The
most likely collision was not at one of the 32 points tested; hence a discretization approach does
not find the most likely trajectory. Surprisingly though, the GRDRL reinforcement learner also
outperforms the other reinforcement learners at the 32 center-points. Despite not being trained from
the center-points specifically, the GRDRL reinforcement learner has a better average and maximum
collision reward. The only degradation in performance was in collision percentage, although even
there the GRDRL reinforcement learner outperforms the MCTS reinforcement learner.
4.4 Discussion
This chapter presented a new architecture for AST to improve the validation of autonomous vehicles.
The new reinforcement learner treats the simulator as a black box and generalizes across a space
of initial conditions. The new architecture is able to converge to a more likely failure scenario in
fewer iterations than the discrete architecture. This architecture is essential for designers who are
interested not just in concrete scenario instantiations but in scenario classes defined by parameter
ranges. Running AST for each scenario instantiation would have been prohibitively expensive and
time-consuming. The new architecture can search the entire scenario class in a single run while still
finding more likely failures. However, this architecture is dependent on having a heuristic reward
signal to guide the reinforcement learner to failures. In the next chapter, we will introduce a new
reinforcement learner that can find failures without the use of heuristic rewards, even when horizons
are long.
Chapter 5
Heuristic Rewards
In section 2.4, we showed that the reward function was derived in such a way that the trajectory that
maximizes reward would be the most likely failure, if a failure exists. While the resulting reward
structure has useful properties—such as being proportional to the likelihood of failure—it is not
without drawbacks. The primary limitation is the problem of sparse rewards. The agent receives a penalty based on whether a failure is found only at the end of what could be a long trajectory, which makes it difficult to assign credit to upstream actions and learn how to force failures. While the agent does receive likelihood rewards at each step, these can actually be counterproductive: if a failure is unlikely, learning to take likelier actions can lead the agent away from finding failures. The penalty for not finding a failure is relatively large in order to make the agent prioritize finding failures over taking likely actions, but that incentive only takes effect after the agent actually finds a failure (and experiences the resulting lack of a massive penalty) in the first place, and the sparse reward structure makes finding that first failure hard.
A possible solution is to change the reward structure to make finding failures easier by using heuristic rewards, which are domain-specific components of the reward function designed using expert knowledge to provide the agent with a more useful reward signal. The reward function allows
for both per-step and per-trajectory heuristic rewards. For example, when validating an autonomous
vehicle on the crosswalk example we may use a heuristic that gives a penalty proportional to the
distance between the pedestrian and the vehicle. This gives the agent a reward signal to follow before
it finds failures, and it will learn to move the pedestrian closer and closer to the vehicle until it finds a
collision. If heuristic rewards are scaled properly, then once the agent finds failures the agent should
be focused on finding likelier failures, so that the end result is not changed. Unfortunately, creating
heuristic rewards may not always be feasible. Designers may not have access to domain experts, or
the domain might not have a clear way to guide the reinforcement learner towards failures.
AST must still be able to find failures in cases in which heuristic rewards are not available, but
current reinforcement learners are not appropriate for such cases. The DRL reinforcement learner is
Figure 5.1: The path to the first reward in the Atari 2600 version of Montezuma's Revenge [78]. The player must take the numbered steps in order, without dying, before getting the first key.
heavily dependent on having a useful reward signal and will struggle on even easy problems for which
there is no heuristic reward. The tree structure inherent in MCTS makes it somewhat more robust
to poor reward signals, but empirically it quickly loses the ability to perform well on validation tasks
with no heuristic rewards as the trajectory length increases. A new approach is needed for AST to
be able to validate systems acting in domains in which no heuristic rewards are readily available.
When an agent has to act in an environment without a clear or useful reward signal, that problem
is known as a hard-exploration problem. Within the domain of hard-exploration problems, the most
notorious benchmark problem is the Atari game Montezuma's Revenge. The starting position of Montezuma's Revenge is shown in fig. 5.1. In order to receive the first reward in Montezuma's Revenge, the player must
1. Go down the ladder.
2. Jump to the rope.
3. Jump to the next ladder.
4. Go down the ladder.
5. Move across the ground and wait for the skull to be in the correct position (the skull moves
back and forth along part of the floor).
6. Jump over the skull when it is in the correct position.
7. Move to the next ladder.
8. Climb the ladder.
9. Jump and get the first key.
When executing these moves, the player must avoid falling off a ledge, touching the skull, or touching
the bottoms of the non-solid portions of the center platform, or else they will die. If the player
successfully gets the key, they receive a small reward. The player must then retrace their steps and
then jump to the platform on the right, where they can touch the door to unlock it. Touching the
door without a key will kill the player. Finally, the player can leave the first room. This is a long,
complex sequence of steps the player must take in the proper order, without dying, before receiving
any reward signal (and that is just the first level). Validation using AST shares many difficulties with Montezuma's Revenge, such as the need to take a complex sequence of steps in the proper order before receiving a reward. Consequently, an algorithm that performs well on Montezuma's Revenge
could be of interest for AST.
In 2019, a new algorithm, go-explore, was released that set new records on Montezuma's Revenge. Go-explore was designed to address two major issues in using RL for hard-exploration problems: detachment and derailment.
• Detachment arises because intrinsic rewards are often consumable resources. The first time an agent reaches a state from which there are multiple paths with high intrinsic rewards, it can only search one. Due to maximum trajectory lengths, the agent may not be able to explore the entire path on the first rollout. Upon returning to the state at the start of the promising paths, it may instead explore one of the other high-reward paths, collecting a similar or greater reward. Consequently, it will have no memory of the partly finished exploration from its first rollout, nor any remaining intrinsic reward to guide it back to that point. The agent has become detached from the reward frontier.
• Derailment arises due to stochasticity being added during training rollouts to enhance explo-
ration. During an earlier rollout, the agent may have discovered a promising state that would
be beneficial to return to. However, stochasticity may be added to the actions of the agent,
which may prevent the agent from successfully returning to the promising state. If the path
back to the promising state is longer or more complex, then it is more likely that the stochastic
perturbations will derail the agent from its desired path. This derailment could prevent the
agent from ever returning to the promising state to explore.
Go-explore mitigates these issues through a two-phase approach. Phase 1 is a heuristic tree search
algorithm that uses deterministic restarts from randomly selected nodes of the tree to return to
promising exploration frontiers without suffering from detachment or derailment. Phase 2 trains a
robust neural-network policy by using the best trajectory found from phase 1 as an expert demon-
stration for the Backward Algorithm, a learning-from-demonstration (LfD) algorithm. This chapter
presents a go-explore AST reinforcement learner for cases where heuristic rewards are not available and search horizons are too long for MCTS to perform well. Since we are only interested in finding failures, we use only phase 1 of go-explore in this chapter; phase 2 has its own useful properties, though, as will be covered in section 6.1.
5.1 Go-Explore
Before explaining how go-explore was applied to validation tasks, this section presents background material explaining how go-explore works in general, with special attention paid to phase 1. For a deeper dive into phase 2, which uses the backward algorithm, see section 6.1.
5.1.1 Phase 1
Phase 1 is an explore-until-solved phase that uses a heuristic tree search and takes advantage of determinism for exploration. During phase 1, a pool of “cells” is maintained, which acts as the tree. A cell is a data structure containing information such as the reward and the trajectory taken to get to the cell. Cells are indexed by a possibly compressed mapping of the agent's state. During rollouts, every step yields a cell. If the cell is “unseen,” meaning its index is not in the pool, the cell is added to the pool. If the cell has already been seen, the new and old versions are compared; if the new cell has a higher reward or a shorter trajectory, it replaces the old cell. In some ways, the algorithm is similar to MCTS, but the two algorithms differ greatly in how rollouts are started.
In go-explore, rollouts are started from a cell randomly selected from the pool. Cells are randomly
selected according to some heuristic rules meant to bias the search towards promising exploration
frontiers, but any cell can be selected at any time. In contrast, MCTS uses backpropagation from
a series of rollouts to select the best option from a series of nodes at a timestep. After selection,
MCTS expands the tree from the selected node at the next timestep and repeats the process. MCTS
therefore tends to select a trajectory step by step, and generally does not go back to revisit nodes
that have been pruned or where a different node was already selected. Go-explore will start a rollout
from any cell at any time, potentially giving the algorithm more capability for exploration. In order
to balance exploration and exploitation, hyperparameters must be carefully chosen to ensure that
unseen areas are explored, but also in such a way that more promising cells are given more rollouts.
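As an illustrative sketch (not the implementation used in our experiments), the cell pool bookkeeping described above could look like the following:

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    reward: float                      # best cumulative reward seen at this cell
    trajectory: list = field(default_factory=list)  # actions leading to the cell
    value: float = 0.0                 # running value estimate (see eq. (5.3))
    times_chosen: int = 0
    times_chosen_since_improved: int = 0
    times_seen: int = 0

pool = {}  # cell index -> Cell

def update_pool(index, reward, trajectory):
    """Add an unseen cell; replace a seen cell if the new trajectory is
    better (higher reward) or equally good but shorter."""
    cell = pool.get(index)
    if cell is None:
        pool[index] = Cell(reward, list(trajectory))
    elif (reward > cell.reward or
          (reward == cell.reward and len(trajectory) < len(cell.trajectory))):
        cell.reward, cell.trajectory = reward, list(trajectory)
    pool[index].times_seen += 1
```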
As an example, consider using go-explore on the first room of Montezuma’s Revenge from fig. 5.1.
Perhaps on the first rollout the agent successfully makes it to the rope, but then makes an incorrect
move and falls to the floor below, resulting in the agent’s death and the end of the trajectory.
However, due to collected intrinsic rewards, the agent knows it wants to return to this point and
explore. Instead of the agent having to attempt to navigate back to that state while suffering
stochastic perturbations to its actions, go-explore will eventually sample the cell and start a rollout
by deterministically setting the game back to the exact game-state. Eventually, the agent will
get lucky and jump to the next platform, collecting more intrinsic reward, and proceeding further
along the correct trajectory to the key. The further along this trajectory the agent gets, the more
important and useful that deterministic resetting becomes, since the difficulty of returning to an
earlier point along the trajectory drastically increases.
5.1.2 Phase 2
Whereas phase 1 of go-explore returns a single trajectory, phase 2 returns a trained policy that is
robust to stochastic perturbations. In phase 2, a deep neural network agent is trained using the
backward algorithm [79]. The backward algorithm is an algorithm for hard exploration problems
that trains a deep neural network policy based on a single expert demonstration. The trajectory
returned from phase 1 is used as the expert demonstration. The properties of the backward algorithm
lead to a policy that does at least as well as the expert trajectory and may in fact improve upon the
expert, while learning how to overcome deviations from the expert trajectory.
5.2 Go-Explore for Black-Box Validation
Go-explore has shown promising results on hard-exploration benchmarks, but the algorithm does
not meet all of our design desiderata. In particular, the cells are indexed based on a down-sampled version of the environment state, but we would not have access to the state if the simulator were treated as a black box. Furthermore, cell scores for Montezuma's Revenge were based on a notion of “levels” (a measure of progression), which does not have a direct analogue in our validation task. Therefore, we must make changes to adapt both the cell structure and the cell selection algorithm
for our validation task.
5.2.1 Cell Structure
In go-explore, cells are indexed by a compressed representation of the environment state, which in our case would be the simulator state. However, according to our black-box assumption, we may not have access to the simulator state. To preserve the black-box assumption, we instead index cells by hashing a concatenation of the current step number $t$ and a discretized version of the previous action, $\tilde{a}_t$, so the index is $\mathrm{idx} = \mathrm{hash}(t, \tilde{a}_t)$. Therefore, similar actions taken at the same step of a rollout trajectory are treated as the same cell, which preserves the black-box assumption by eliminating the dependence on simulation state while still providing the algorithm with a useful way to group similar observations into cells.
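A sketch of this indexing scheme follows; the discretization grid size is an assumption made for illustration.

```python
import numpy as np

def cell_index(t, prev_action, grid=0.5):
    """idx = hash(t, a~_t): discretize the previous action so that similar
    actions at the same timestep map to the same cell."""
    a_tilde = tuple(np.round(np.asarray(prev_action) / grid).astype(int))
    return hash((t, a_tilde))
```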
5.2.2 Cell Selection
Two key changes need to be made for cell selection to work for validation tasks: 1) deterministic
resets and 2) cell scores.
Deterministic Resets
A key component of the go-explore algorithm is that the simulator is deterministically reset to the
exact simulation state of a cell when that cell is sampled to start a rollout. If the simulation state
is unavailable, though, this becomes trickier. Instead, the simulator must support deterministic
simulation. Each cell stores the trajectory of actions that were taken to reach that cell’s state.
When a cell is sampled to start a rollout, these actions are taken exactly to deterministically return
the simulator to the desired state. Simulation can then proceed as normal.
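A minimal sketch of such a reset, assuming a simulator interface with reset and step methods, follows:

```python
def reset_to_cell(simulator, cell, s0):
    """Deterministically return the simulator to a cell's state by replaying
    the cell's stored action trajectory from the initial conditions."""
    simulator.reset(s0)
    for action in cell.trajectory:
        simulator.step(action)   # deterministic simulation makes replay exact
    return simulator             # now at the cell's state; continue the rollout
```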
Cell Scores
When sampling a cell to start a rollout, a cell is sampled with probability proportional to its fitness
score. The fitness score is partially made up of “count subscores” for three attributes that represent
how often a cell has been interacted with: 1) the number of times a cell has been chosen to start a
rollout, 2) the number of times a cell has been visited, and 3) the number of times a cell has been
chosen since a rollout from that cell has resulted in the discovery of a new or improved cell. For
each of these three attributes, a count subscore for cell $c$ and attribute $a$ can be calculated as

$$
\mathrm{CntScore}(c, a) = w_a \left( \frac{1}{v(c, a) + \epsilon_1} \right)^{p_a} + \epsilon_2
\tag{5.1}
$$
where v(c, a) is the value of attribute a for cell c, and wa, pa, ε1, and ε2 are hyperparameters. The
total unnormalized fitness score is then
$$
\mathrm{CellScore}(c) = \mathrm{ScoreWeight}(c) \left( 1 + \sum_{a} \mathrm{CntScore}(c, a) \right)
\tag{5.2}
$$
When applied to Montezuma’s Revenge, the authors obtained better results by using a ScoreWeight
in eq. (5.2) that was based on what level of the game a cell was in. The ScoreWeight heuristic favored
sampling cells where the agent had progressed further within the game. Unfortunately, progression
does not have a direct analogue for validation tasks.
For validation tasks, we use the estimated value of a cell as the ScoreWeight. Similar to MCTS,
cells track an estimate of the value function. Anytime a new cell is added to the pool, or a cell is
updated, the value estimate for a particular cell $v_c$ is updated as

$$
v_c \leftarrow v_c + \frac{\left( r + \gamma v^{*}_{\mathrm{child}} \right) - v_c}{N}
\tag{5.3}
$$
When a cell is updated, the cell's parent also updates its value estimate. Value updates are therefore propagated all the way up the tree. The total unnormalized fitness score is calculated with ScoreWeight(c) = vc. Cells with high value estimates are selected more often.
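Putting eqs. (5.1) and (5.2) together with the value-based ScoreWeight, the fitness computation could be sketched as follows. The attribute names are illustrative, and the hyperparameter values are those later listed in section 5.3.3.

```python
EPS1, EPS2, P_A = 0.001, 0.00001, 0.5     # shared across attributes
WEIGHTS = {                                # w_a per attribute
    "times_chosen": 0.1,
    "times_chosen_since_improved": 0.0,
    "times_seen": 0.3,
}

def cnt_score(cell, attr):
    """Eq. (5.1): CntScore(c, a) = w_a * (1 / (v(c, a) + eps1))^p_a + eps2."""
    v = getattr(cell, attr)
    return WEIGHTS[attr] * (1.0 / (v + EPS1)) ** P_A + EPS2

def cell_score(cell):
    """Eq. (5.2) with ScoreWeight(c) = v_c, the cell's value estimate."""
    return cell.value * (1.0 + sum(cnt_score(cell, a) for a in WEIGHTS))
```

Cells would then be sampled to start rollouts with probability proportional to their normalized scores.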
5.3 Experiments
Figure 5.2: The layout of the crosswalk example scenario. The car approaches the road where a pedestrian is trying to cross. Initial conditions are shown, and values for s0,ped,y can be found in table 5.1.
5.3.1 Problem Description
The validation scenario consists of a vehicle approaching a crosswalk on a neighborhood road as
a pedestrian is trying to cross, as shown in fig. 5.2. The car is approaching at the speed limit of
25 mph (11.2 m/s). The vehicle, a modified version (see section 3.3) of the intelligent driver model
(IDM) [68], has noisy observations of the pedestrian’s position and velocity. The AST reinforcement
learner controls the simulation through a six-dimensional action vector, consisting of the x and
y components for three parameters: 1) the pedestrian acceleration, 2) the sensor noise on the
pedestrian position, and 3) the sensor noise on the pedestrian velocity. We treat the simulation as
a black box, so the AST agent has access to only the initial conditions and the history of previous
actions. From this general setup we generate three specific scenarios (easy, medium, and hard),
which are differentiated by the difficulty of finding a failure. The differences between the scenarios
include whether a reward heuristic was used (see section 5.3.2), the initial location of the pedestrian,
as well as the rollout horizon and timestep size. Pedestrian and vehicle location are measured from
the origin, which is located at the intersection of the center of the crosswalk and the center of the
vehicle’s lane. The scenario parameters are shown in table 5.1. The easy scenario is designed such
that the average action leads to a collision, so the maximum possible reward is known to be 0. The
medium and hard scenarios require unlikely actions to be taken to force a collision. They have
the same initial conditions, except the hard scenario has a timestep with half the duration of the
medium scenario’s timestep, and accordingly the hard scenario has double the maximum path length
of the medium scenario. The hard scenario demonstrates the effect of horizon length on exploration
difficulty.
Table 5.1: Parameters that define the easy, medium, and hard scenarios. Changing the pedestrian location results in failures being further from the average action, making exploration more difficult, whereas changing the horizon and timestep lengths makes exploration more complex.

Variable    Easy       Medium     Hard
β           1000       0          0
s0,ped,y    −4 m       −6 m       −6 m
T           50 steps   50 steps   100 steps
dt          0.10 s     0.10 s     0.05 s
5.3.2 Modified Reward Function
We make some modifications to the theoretical reward function shown in eq. (2.9) to allow practical
implementation:
$$
R(s) =
\begin{cases}
0, & s \in E \\
-\alpha - \beta \times \mathrm{dist}(p_v, p_p), & s \notin E,\; t \geq T \\
-M(a, \mu_a, \Sigma_a \mid s), & s \notin E,\; t < T
\end{cases}
\tag{5.4}
$$
where M(a, µa, Σa | s) is the Mahalanobis distance [69] between the action a and the expected action µa given the covariance matrix Σa in the current state s, and dist(pv, pp) is the distance between
the pedestrian and the vehicle at the end of the rollout. The latter reward is the domain-specific
heuristic reward that guides AST reinforcement learners by giving a lower penalty when the scenario
ends with a pedestrian closer to the car. While the heuristic reward in eq. (5.4) is not theoretically guaranteed to preserve the optimal policy [64], we find that it works well in practice and is easy
to implement, while also requiring less access to the simulation state. We use α = −1 × 10^5 and β = −1 × 10^4 for the easy scenario and, to disable the heuristic, β = 0 for the medium and hard scenarios.
5.3.3 Reinforcement Learners
For each experiment, the DRL, MCTS, and GE reinforcement learners were run for 100 iterations
each with a batch size of 500. Algorithm-specific hyperparameter settings are listed below.
Go-explore Phase 1
GE was run with hyperparameters similar to those used for Montezuma’s Revenge [54]. For the
count subscore attributes (times chosen, times chosen since improvement, and times seen), we set
wa equal to 0.1, 0, and 0.3, respectively. All attributes share ε1 = 0.001, ε2 = 0.00001, and pa = 0.5.
We always use a discount factor of 1.0. During rollouts, actions are sampled uniformly.
Deep Reinforcement Learning
The DRL reinforcement learner uses a Gaussian-LSTM trained with PPO and GAE. The LSTM has
a hidden layer size of 64 units, and uses peephole connections [80]. For PPO, we used a KL penalty
with factor 1.0 as well as a clipping range of 1.0. GAE uses a discount of 1.0 and λ = 1.0. There is
no entropy coefficient.
Monte Carlo Tree Search
We use MCTS with DPW (see section 2.1.2), where rollout actions are sampled uniformly from the
action space. The exploration constant was 100. The DPW parameters are set to k = 0.5 and
α = 0.5.
5.4 Results
The results of all three reinforcement learners in the easy scenario are shown in fig. 5.3. The easy
scenario is designed so that the likeliest actions lead to collision, and yet even the DRL reinforcement
learner was unable to achieve the optimal reward of 0. However, all three algorithms were able to
find failures quickly when given access to a heursitic reward. GE performed the worst, and both
GE and MCTS showed little improvement after finding their first failure. In contrast, the DRL
reinforcement learner continued to improve over the 100 iterations, ending with the best reward and
therefore the likeliest failure.
Figure 5.4 shows the results of the MCTS and GE reinforcement learners in the medium scenario.
Without a heuristic reward, the DRL reinforcement learner was unable to find a failure within 100
iterations. MCTS and GE, however, were still able to find failures. While there was no heuristic,
the horizon of the problem was still short, and we see that MCTS was still able to outperform GE.
While GE improved more over its first collision than MCTS did, the final collision found by GE was still less likely than even the first collision found by MCTS, and MCTS found a failure more quickly as well.
Figure 5.5 shows the results of GE in the hard scenario. The hard scenario had a longer horizon,
which prevented both DRL and MCTS from being able to find failures within 100 iterations. GE
was still able to find failures, however, and to improve the likelihood of the failure found over the
Figure 5.3: The reward of the most likely failure found at each iteration of the GE, DRL, and MCTS reinforcement learners in the easy scenario. Results are cropped to show results only when a failure was found.
course of training. In fact, when adjusting for the increased number of steps, the GE reinforcement
learner’s results in the hard scenario were very similar to its results in the medium scenario, showing
that GE is robust to longer-horizon problems.
5.5 Discussion
The results across the three scenarios illuminate the strengths and weaknesses of the three algorithms.
When a useful reward signal is present, the DRL reinforcement learner shows the highest ability
to find the most likely failure. However, without a heuristic, it quickly loses its ability to find
failures. In a no-heuristic setting, MCTS is able to find more likely failures than GE as long as
the problem’s horizon is short. However, in longer-horizon problems, GE is able to find failures
that MCTS cannot. The underlying principle of these differences lies in how the algorithms balance
exploration and exploitation. The tree search algorithms show better ability to explore, which is
why they can find failures without the use of heuristics. In contrast, DRL shows better ability to
exploit, which is why, if it finds failures, it finds likelier failures. Understanding the effects of the exploration/exploitation tradeoff raises an interesting question: could we achieve better results
with a two-phase algorithm that could combine the exploration ability of the tree search algorithms
with the exploitation ability of DRL? It turns out that we can, and we demonstrate the abilities of
robustification in the next chapter.
Figure 5.4: The reward of the most likely failure found at each iteration of the GE and MCTS reinforcement learners in the medium scenario. The DRL reinforcement learner was unable to find a failure. Results are cropped to show results only when a failure was found.
Figure 5.5: The reward of the most likely failure found at each iteration of the GE reinforcement learner in the hard scenario. The DRL and MCTS reinforcement learners were unable to find a failure.
Chapter 6
Robustification
From a theoretical standpoint, proposition 2.4.1 shows that the trajectory that maximizes the AST
reward function will be the most likely failure of a system for a given scenario. Unfortunately, since
AST uses RL, in practice the solution is an approximate one. AST may converge to a local optimum,
and there is little in the way of guarantees we can make on how close to the global optimum that
solution will be. In practice, it is not uncommon to see separate AST runs produce high variance
results. The inconsistency over identical runs raises concerns on the reliability of our validation
results after a single run of AST.
In chapter 5 we applied go-explore’s phase 1 to find failures on long-horizon validation problems
without the use of domain heuristics. Go-explore’s phase 2 has applications for AST as well, though.
Phase 2 takes the best trajectory from phase 1 and uses it as an expert demonstration for the
backward algorithm (BA), a learning-from-demonstration (LfD) algorithm. The BA allows us to
turn a single trajectory into a neural network policy that is robust to stochasticity. However,
importantly, the BA also allows the robust policy to improve upon the expert demonstration’s
results.
This chapter introduces a way to use the BA to produce improved and more consistent results
from AST. The BA requires only a trajectory as an expert demonstration; it is agnostic to how
that trajectory was produced. Consequently, we can use the BA to improve the results of any AST
reinforcement learner. Furthermore, the way that the BA’s training process introduces stochasticity
(see section 6.1) essentially makes the algorithm an efficient local search around similar trajectories.
The robustification phase can therefore be seen as a sort of hill-climbing phase to force better
convergence to consistent results. Using the BA in this way allows AST to provide results that have
less variance between identical runs but with a minimal amount of added compute.
6.1 The Backward Algorithm
The backward algorithm is an algorithm for hard exploration problems that trains a deep neural
network policy based on a single expert demonstration [79]. Given a trajectory $(s_t, a_t, r_t, s_{t+1})_{t=0}^{T}$ as the expert demonstration, training of the policy begins with episodes starting from $s_{\tau_1}$, where $\tau_1$ is near the end of the trajectory. Training proceeds until the agent receives as much or more reward than the expert from $s_{\tau_1}$, after which the starting point moves back along the expert demonstration to $s_{\tau_2}$, where $0 \leq \tau_2 \leq \tau_1 \leq T$. Training continues in this way until $s_{\tau_N} = s_0$. Diversity can be
introduced by starting episodes from a small set of timesteps around sτ , or by adding small random
perturbations to sτ when instantiating an episode. Training based on the episode can be done
with any relevant deep reinforcement learning algorithm that allows optimization from batches of
trajectories, such as PPO with GAE.
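A high-level sketch of this training loop follows. The rollout and optimization helpers are injected placeholders (standing in for, e.g., PPO with GAE), and for simplicity the start point moves back one step at a time rather than in the increments τ1, τ2, . . . described above.

```python
def backward_algorithm(expert, policy, env, rollout, train_on_batch,
                       batch_size=64):
    """Train `policy` from the demonstration `expert`, a list of
    (s, a, r, s_next) tuples. `rollout(env, policy, start_state)` returns a
    trajectory in the same format; `train_on_batch` is any batch RL update."""
    def ret(traj):
        return sum(r for (_, _, r, _) in traj)

    for tau in reversed(range(len(expert))):       # move the start point back
        target = ret(expert[tau:])                 # expert's return from s_tau
        while True:
            batch = [rollout(env, policy, expert[tau][0])
                     for _ in range(batch_size)]
            train_on_batch(policy, batch)
            if max(ret(traj) for traj in batch) >= target:
                break                              # matched or beat the expert
    return policy
```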
6.2 Robustification
The BA was designed to create a policy robust to stochastic noise while potentially improving upon
the expert demonstration. Within the context of AST, however, the BA acts as a hill-climbing step,
enabling AST to more consistently converge to a better local optimum. AST is designed to find the
most likely failure of a system in a given scenario even when the environment has high-dimensional
state and action spaces. While existing AST reinforcement learners can provide a good guess for
the most likely failure trajectory, the BA allows us to approximately search the space of similar
trajectories to find the best one. Conceptually, instead of searching the entire space of possible
rollouts, we are instead constraining the search to a significantly smaller space where we already
know a failure exists. Because it is a smaller space, we can search the space that is local to a known
failure more robustly with a reasonable amount of added computational cost, yielding better and
more consistent results. This idea is applicable to all of the reinforcement learners we have used on
AST so far, and we demonstrate this by using it to improve failures found on the scenarios from
chapter 5.
The BA has two key features that make it applicable to improving results given by other AST
reinforcement learners. The first, as covered in section 6.1, is that it can perform efficiently on hard
exploration problems. If the BA required orders of magnitude more compute than the AST rein-
forcement learners, the benefits would not be worth the cost. However, the BA is efficient enough to
justify its use in yielding more consistent AST results. The second key feature is that the BA can im-
prove upon the expert trajectory. Training starts from steps along the expert demonstration, which
limits the deviation between the agent’s actions and the expert trajectory. However, stochasticity
during training still allows a significant amount of deviation from the expert demonstration, which
allows the BA to discover trajectories that yield higher rewards than those of the expert trajectory.
While designed for robustification, these features allow us to instead use the BA as a hill-climbing
method.
A slight change was made to the BA in order to make it a better fit for AST robustification. In
the original BA paper, a policy was trained from a specific step of the expert demonstration until
it learned to do as well, or better, than the expert. In the validation tasks we are interested in,
compute may be too limited to be able to train for indefinite amounts of time. Instead, we modify
the BA to train for a small number of epochs at each step of the expert trajectory, which allows
the total number of iterations to be known and specified ahead of time. This modification did not
prevent the BA from improving upon the expert demonstration in any of our experiments.
6.3 Experiments
The problem description, reward function, and reinforcement learners are identical to those in sec-
tion 5.3. We take the best results of the DRL, MCTS, and GE reinforcement learners from section 5.4
and perform robustification using the BA. The BA was run for 100 iterations with a batch size of
5000, with the results reported as DRL+BA, MCTS+BA, and GE+BA, respectively. The BA rep-
resents the policy with a Gaussian-LSTM, and optimizes the policy with PPO and GAE. The LSTM
has a hidden layer size of 64 units, and uses peephole connections [80]. For PPO, we used a KL
penalty with factor 1.0 and a clipping range of 1.0. GAE uses a discount of 1.0 and λ = 1.0. There
is no entropy coefficient.
6.4 Results
The results of all three reinforcement learners in the easy scenario are shown in fig. 6.1. This scenario
was designed so that the likeliest actions lead to collision, and yet even the DRL reinforcement learner was still not near the optimal reward of 0. In contrast, adding robustification through the BA
resulted in finding failures that were significantly closer to optimal behavior. While GE significantly
improved with robustification, GE+BA was still far from the optimal solution. In contrast, both
MCTS+BA and DRL+BA were able to converge to results very near to 0.
Figure 6.2 shows the results of the non-DRL reinforcement learners in the medium scenario, as the
DRL reinforcement learner was unable to find a failure in that scenario. Adding a robustification
phase again improved both algorithms, and again the MCTS+BA reinforcement learner outper-
formed the GE+BA reinforcement learner. Note that taking the average action was not sufficient
to cause a crash in this scenario.
Figure 6.3 shows the results of GE and GE+BA in the hard scenario, as GE was the only
algorithm to find a failure. Despite the difficulty of the scenario, GE+BA was still able to improve
the results. In section 5.4 we saw that the performance of the GE reinforcement learner is robust
to changes in the horizon length. Similarly, when adjusting for the increased number of steps, the
Figure 6.1: The reward of the most likely failure found at each iteration of the GE, DRL, and MCTS reinforcement learners in the easy scenario, as well as GE+BA, DRL+BA, and MCTS+BA. The dashed lines indicate the respective scores after robustification of each reinforcement learner. Results are cropped to show results only when a failure was found.
GE+BA reinforcement learner’s results in the hard scenario were very similar to its results in the
medium scenario, showing that GE+BA is also robust to longer-horizon problems.
6.5 Discussion
The results across the three scenarios show that robustification consistently yields failures that are
more likely than those yielded by running a standard AST reinforcement learner alone. Even on
problems where a normal DRL reinforcement learner was unable to find any failures, the BA was
able to find likely failures. The power of using the BA in conjunction with another algorithm lies
in its ability to balance exploration and exploitation. The robustification phase applies the strength
of DRL—exploitation—to a domain that requires significantly less exploration since the first failure
was already found by a different algorithm. Consequently, these two-phase methods could also be
seen as an exploration phase plus an exploitation phase, a decomposition that results in a problem
that is easier to solve.
An open question is where validation tasks would fall within the difficulty spectrum presented in
this chapter. All three algorithms will certainly have cases where they are the best choice. However,
it remains to be seen whether one of the three algorithms ends up as the dominant strategy for
validation in the real world. In part, this is due to the added complexities of validating systems in
high-fidelity simulators—a challenge we address in the next chapter.
Figure 6.2: The reward of the most likely failure found at each iteration of the GE and MCTS reinforcement learners in the medium scenario, as well as GE+BA and MCTS+BA. The dashed lines indicate the respective scores after robustification of each reinforcement learner. The DRL reinforcement learner was unable to find a failure. Results are cropped to show results only when a failure was found.
Figure 6.3: The reward of the most likely failure found at each iteration of the GE reinforcement learner in the hard scenario, as well as GE+BA. The dashed line indicates the score after robustification of the GE reinforcement learner. The DRL and MCTS reinforcement learners were unable to find a failure.
Chapter 7
Validation in High-Fidelity
While performing validation in simulation may be necessary due to time and safety constraints,
simulation is clearly not a perfect recreation of reality. Some errors found when testing in simulation
may be real errors that can be replicated in real-world testing, or at least close to real-world errors,
but other errors may be spurious. A spurious failure is one that comes from a fault in the simulator’s
representation of reality, not from an actual fault in the system under test. Furthermore, some actual
faults may not be realizable at all if the simulator is not a good representation of the real world.
For that reason, industry has recently put significant amounts of effort and money towards the
development of high-fidelity simulators (hifi).
While there is no strict definition of what constitutes a hifi simulator, they are generally characterized by features such as advanced dynamics models, perception from graphics, and autonomous system software-in-the-loop simulation. These features allow far more accurate testing, but at a cost:
hifi simulators are also far slower and far more computationally expensive to run than low-fidelity
simulators (lofi). While we need methods like AST to help us capture the variability of real-world
scenarios during testing, the methods may require too many simulation rollouts to be able to run in
high-fidelity.
This chapter presents a way to make AST tractable in high fidelity by learning from simulation
rollouts run in less costly low-fidelity. The idea is to first run AST in low fidelity to find candidate
failures. These candidate failures might be failures that exist in hifi, or are close to failure in hifi,
but they also might be spurious errors. To address this potential problem, we use the backward
algorithm (BA) (see section 6.1), an algorithm for hard-exploration problems that learns a deep
neural network policy using a single expert demonstration. We use the candidate failure as the
expert demonstration to the BA. Doing so has two key advantages:
1. The BA can train a policy that outperforms the original expert, which means that we can
learn to take a low-fidelity failure and transform it into a similar high-fidelity failure.
2. The BA uses very short horizons early in training, which allows us to identify and reject
spurious errors in a computationally efficient way.
7.1 Validation in High-Fidelity
In order to find failures with fewer hifi steps, we will first learn lessons from running AST in lofi, where
simulation rollouts are much cheaper. By using the candidate failures as the expert demonstration
when learning with the BA, we can overcome two key problems with using failures from lofi.
The first problem is that a failure found in lofi may not correspond exactly to a failure in hifi.
For example, a trajectory might have to be slightly different if the dynamics change, or the noise
injection might have to change to adapt to a higher-quality perception model. The BA allows us to
efficiently adapt a lofi failure to a corresponding hifi failure. By using the lofi failure as an expert
demonstration, the BA biases its policy search towards similar policies, reducing the search space
and requiring fewer iterations. However, the BA still allows the learned policy to improve upon the
expert demonstration, which means that the policy can learn to force a failure in hifi even when the
candidate failure does not exist exactly in hifi.
The second problem is that a failure found in lofi may be a spurious failure and may therefore
not correspond to any failure at all in hifi. In such a case, it is important that we minimize the
computational cost of identifying and rejecting the spurious failure. The BA starts training with
truncated rollouts from the end of the expert demonstration. As a consequence of the shorter
trajectories, early epochs of the BA require far fewer simulation steps than running standard DRL.
We can reject a failure as spurious if the BA fails to find a failure from multiple consecutive steps
of an expert demonstration. While this may not always happen in early epochs, we are still able to
reject spurious failures with fewer simulation steps when using the BA.
A remaining problem is the case where a failure in high fidelity cannot be represented in low fidelity at all. In many cases, this may simply be a more extreme version of the first problem. While this will always remain a risk, in section 7.2 we show strong empirical results across a range of different test scenarios that should alleviate some of these concerns.
We made some changes to the BA in order to achieve the best performance on our validation
task. The original algorithm calls for training at each timestep of the expert demonstration until
the policy is as good or better than the original. First, we relax this rule and, instead, move sτ
back along the expert demonstration any time a failure is found within the current epoch. Second,
we add an additional constraint of a maximum number of epochs at each timestep. If the policy
reaches the maximum number of epochs without finding failures, training continues from the next
step of the expert demonstration. However, if sτ is moved back without the policy finding a failure
five consecutive times, the expert demonstration is rejected as a spurious error. We found that there
are times when the BA is unable to find failures in hifi in the early stages of training due to the lofi
Figure 7.1: Layout of the crosswalk example. A car approaches a crosswalk on a neighborhood road with one lane in each direction. A pedestrian is attempting to cross the street at the crosswalk. Initial conditions are shown.
trajectory being too dissimilar to any failure in hifi, so this change allows the BA more epochs to
train and find failures. In a similar vein, we also start training with τ > 0, and when moving back along the expert demonstration after a failure, we move τ back more than one step at a time.
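These modifications could be sketched as follows. The training helper and its interface are hypothetical, and the constants mirror the settings described in section 7.2 (start 10 steps back, move back 4 steps on success, reject after five consecutive start points without a failure).

```python
def adapt_lofi_failure(expert, policy, env, train_epoch,
                       start_offset=10, step_back=4,
                       max_epochs=20, max_misses=5):
    """Adapt a lofi candidate failure to hifi with the modified BA.
    `expert` is the lofi failure trajectory as (s, a, r, s_next) tuples;
    `train_epoch(policy, env, start_state)` runs one training epoch from the
    given start state and returns True if a hifi failure was found."""
    tau = max(len(expert) - start_offset, 0)
    misses = 0
    while tau >= 0:
        found = any(train_epoch(policy, env, expert[tau][0])
                    for _ in range(max_epochs))
        if found:
            misses = 0
            tau -= step_back   # failure found: move the start point further back
        else:
            misses += 1
            if misses >= max_misses:
                return None    # reject the candidate as a spurious lofi failure
            tau -= 1           # continue from the next demonstration step
    return policy
```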
7.2 Case Studies
To demonstrate the BA’s ability to adapt lofi failures to hifi failures, we constructed a series of case
studies that represent a variety of differences one might see between lofi and hifi simulators. These
case studies measure which types of fidelity differences the BA can handle well and which types it
will struggle with. Because the BA starts many epochs from points along the expert demonstration,
many rollouts will have shorter trajectory lengths than rollouts generated by the DRL reinforcement
learner. Therefore, a direct comparison of iterations to DRL would not be fair. Instead, we measure
performance in terms of the number of simulation steps, assuming this would be the bottleneck in
hifi simulators. Unless otherwise noted, all case studies share the following setup:
Simulation
Unless otherwise noted, the test scenario is simulated using the Python simulator presented in
section 3.3. In the test scenario, the system under test (SUT) is approaching a crosswalk on a
neighborhood road where a pedestrian is trying to cross, as shown in fig. 7.1. The pedestrian starts
1.9 m back from the center of the SUT’s lane, exactly at the edge of the street, and is moving
across the crosswalk with an initial velocity of 1.0 m/s. The SUT starts at 55 m away from the
crosswalk with an initial velocity of 11.2 m/s (25 mph), which is also the desired velocity. The SUT
is a modified version of the intelligent driver model (IDM) [68]. IDM is a lane following model that
calculates acceleration based on factors including the desired velocity, the headway to the vehicle in
front, and the IDM’s velocity relative to the vehicle in front. Our modified IDM ignores pedestrians
that are not in the street, but treats the pedestrian as a vehicle when it is in the street, which—due
to large differences in relative velocity—will cause the IDM to brake aggressively to avoid collision.
Simulation was performed with the AST Toolbox (github.com/sisl/AdaptiveStressTestingToolbox; see chapter 8).
Algorithms
To find collisions, AST was first run with a DRL reinforcement learner in each case study's low-fidelity version of the simulator. Once a collision was found, the backward algorithm was run using
the lofi failure as the expert demonstration. Results are shown both for instantiating the backward
algorithm’s policy from scratch and for loading the policy trained in lofi. Results are compared
against running AST with the DRL reinforcement learner from scratch in hifi. Optimization for all
methods is done with PPO and GAE, using a batch size of 5000, a learning rate of 1.0, a maximum
KL divergence of 1.0, and a discount factor of 1.0. The BA starts training 10 steps back from the
last step, and moves back 4 steps every time a failure is found during a batch of rollouts.
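For reference, these shared optimization settings can be grouped as in the following sketch; the keyword names are illustrative placeholders, not the exact garage or PPO argument names.

# Illustrative grouping of the shared hyperparameters above; the key names
# are placeholders rather than the toolbox's exact keyword arguments.
optimizer_config = dict(algorithm="PPO", advantage_estimator="GAE",
                        batch_size=5000, learning_rate=1.0,
                        max_kl_divergence=1.0, discount=1.0)
backward_algorithm_config = dict(start_steps_back=10, backtrack_on_failure=4)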
7.2.1 Case Study: Time Discretization
In this case study, the fidelity difference is time discretization and trajectory length. The lofi
simulator runs with a timestep of 0.5 seconds for 10 steps, while the hifi simulator runs with a
timestep of 0.1 seconds for 50 steps. This fidelity difference approximates skipping frames or steps
to reduce runtime. In order to get an expert demonstration of the correct length and discretization
in hifi, the lofi actions were repeated 5 times for each lofi step. The results are shown in table 7.1.
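As a sketch, this upsampling amounts to repeating each lofi action, assuming the actions are stored row-wise in an array:

# A minimal sketch of upsampling the lofi demonstration for hifi; the action
# values here are placeholders.
import numpy as np

lofi_actions = np.zeros((10, 2))                   # 10 lofi steps at dt = 0.5 s
hifi_actions = np.repeat(lofi_actions, 5, axis=0)  # 50 hifi steps at dt = 0.1 s
assert hifi_actions.shape == (50, 2)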
The hifi DRL baseline took 44 800 simulation steps to find a failure. The lofi reinforcement
learner was run for 5 epochs, finding a failure after 25 600 simulation steps. When instantiating a
policy from scratch, the BA was able to find a failure in hifi after 19 760 steps, 44.1 % of the DRL
baseline. The BA was able to find a failure even faster when the policy trained in lofi was loaded,
needing 15 230 steps to find a failure in hifi, 34.0 % of the DRL baseline and 77.1 % of the BA trained
from scratch.
7.2.2 Case Study: Dynamics
In this case study, the fidelity difference is in the precision of the simulator state. The lofi simulator
runs with every simulation state variable rounded to 1 decimal place, while the hifi simulator runs
with 32-bit variables. This fidelity difference approximates situations when simulators may have
differences in vehicle or environment dynamics. In order to get an expert demonstration with the
correct state variables, the lofi actions were run in hifi. The results are shown in table 7.2.
Table 7.1: The results of the time discretization case study. Load Lofi Policy indicates whether the policy of the BA was initialized from scratch or the weights were loaded from the policy trained in lofi. The BA was able to significantly reduce the number of hifi steps needed, and loading the lofi policy produced a further reduction.

Algorithm   Steps to Failure   Final Reward   Load Lofi Policy?   Lofi Steps   Percent of Hifi Steps
BA          19 760             −794.6         No                  25 600       44.1 %
BA          15 230             −745.6         Yes                 25 600       34.0 %
Hifi        44 800             −819.9         –                   –            –
The hifi DRL baseline took 46 800 simulation steps to find a failure. The lofi reinforcement
learner was run for 10 epochs, finding a failure after 57 200 simulation steps. When instantiating a
policy from scratch, the BA was able to find a failure in hifi after 13 320 steps, 28.5 % of the DRL
baseline. The BA was able to find a failure even faster when the policy trained in lofi was loaded,
needing just 2840 steps to find a failure in hifi, 6.1 % of the DRL baseline and 21.3 % of the BA
trained from scratch.
Table 7.2: The results of the dynamics case study. Load Lofi Policy indicates whether the policy of the BA was initialized from scratch or the weights were loaded from the policy trained in lofi. The BA was able to significantly reduce the number of hifi steps needed, and loading the lofi policy produced a further reduction.

Algorithm   Steps to Failure   Final Reward   Load Lofi Policy?   Lofi Steps   Percent of Hifi Steps
BA          13 320             −729.7         No                  57 200       28.5 %
BA          2840               −815.8         Yes                 57 200       6.1 %
Hifi        46 800             −819.3         –                   –            –
7.2.3 Case Study: Tracker
In this case study, the fidelity difference is that the tracker module of the SUT perception system is
turned off. Without the alpha-beta filter, the SUT calculates its acceleration at each timestep based
directly on the noisy measurement of pedestrian location and velocity at that timestep. This fidelity
difference approximates cases when hifi perception modules are turned off. Perception modules might be turned off in order to achieve faster runtimes. In order to get an expert demonstration with
the correct state variables, the lofi actions were run in the hifi simulator. The results are shown in
table 7.3.
The hifi DRL baseline took 44 800 simulation steps to find a failure. The lofi reinforcement
learner was run for 20 epochs, finding a failure after 112 000 simulation steps. When instantiating a
policy from scratch, the BA was able to find a failure in hifi after 18 600 steps, 41.5 % of the DRL
baseline. The BA was able to find a failure even faster when the policy trained in lofi was loaded,
needing just 2750 steps to find a failure in hifi, 6.1 % of the DRL baseline and 14.8 % of the BA
trained from scratch.
Table 7.3: The results of the tracker case study. Load Lofi Policy indicates whether the policy of the BA was initialized from scratch or the weights were loaded from the policy trained in lofi. The BA was able to significantly reduce the number of hifi steps needed, and loading the lofi policy produced a further reduction.

Algorithm   Steps to Failure   Final Reward   Load Lofi Policy?   Lofi Steps   Percent of Hifi Steps
BA          18 600             −777.3         No                  112 000      41.5 %
BA          2750               −785.7         Yes                 112 000      6.1 %
Hifi        44 800             −800.1         –                   –            –
7.2.4 Case Study: Perception
This case study is similar to the tracker case study in that it models a difference between the
perception systems of lofi and hifi simulators; however, in this case study the difference is far greater.
Here, the hifi simulator of the previous case studies is now the lofi simulator. The new hifi simulator
has a perception system that uses LIDAR measurements to create a dynamic occupancy grid map (DOGMa) [81]–[83]; our implementation was based on that of github.com/mitkina/EnvironmentPrediction. At each timestep, AST outputs the pedestrian acceleration and a single noise
parameter, which is added to the distance reading of each beam that detects an object. The SUT
has 30 beams with 180 degree coverage and a max detection distance of 100 m. The DOGMa particle
filter uses 10 000 consistent particles, 1000 newborn particles, a birth probability of 0.0, a particle
persistence probability of 1.0, and a discount factor of 1.0. Velocity and acceleration variance were
initialized to 12.0 and 2.0, respectively, and the process noise for position, velocity, and acceleration
was 0.1, 2.4, and 0.2, respectively.
This case study also starts with slightly different initial conditions. The pedestrian starting
location is now 2.0 m back from the edge of the road, while the vehicle starting location is only 45 m
from the crosswalk. The initial velocities are the same.
The difference in noise modeling means that the action vector lengths of the lofi and hifi simulators now differ. In order to get an expert demonstration, the lofi actions were run in hifi, with
the noise portion of the action vectors set to 0. Because the action vectors are of different lengths,
the reinforcement learner networks have different sizes as well, so it was not possible to load the lofi
policy for the BA in this case study.
The hifi DRL baseline took 135 000 simulation steps to find a failure. The lofi reinforcement
learner took 100 000 simulation steps to find a failure. The BA was able to find a failure in hifi after
only 6330 steps, a mere 4.7 % of the DRL baseline.
Table 7.4: The results of the perception case study. Due to differences in network sizes, resulting from different disturbance vector sizes, it was not possible to load a lofi policy in this case study. The BA was still able to significantly reduce the number of hifi steps needed.

Algorithm   Steps to Failure   Final Reward   Load Lofi Policy?   Lofi Steps   Percent of Hifi Steps
BA          6330               −385.8         No                  100 000      4.7 %
Hifi        135 000            −511.1         –                   –            –
7.2.5 Case Study: NVIDIA DriveSim
As a proof-of-concept, for the final case study we implemented the new AST algorithm on a hifi simulator from industry. NVIDIA's DriveSim is a hifi simulator that combines high-accuracy dynamics with features such as perception from graphics and software-in-the-loop simulation. An example rendering of an intersection in DriveSim is shown in fig. 7.2. After the AST Toolbox was connected with DriveSim, we simulated the standard crossing-pedestrian scenario with the modified IDM as the SUT. Here, the lofi simulator was the AST Toolbox simulator used for all previous case studies, and the lofi policy was trained for 265 450 steps. Using the BA, AST was able to find a failure in 4060 hifi steps,
which took only 10 hours to run. While the SUT was still just the modified IDM, these exciting
results show that the new approach makes it possible to find failures with AST on state-of-the-art
industry hifi simulators.
7.3 Discussion
Across every case study, a combination of running DRL in lofi and the BA in hifi was able to find
failures with significantly fewer hifi steps than just running DRL in hifi directly. Some of the fidelity
differences were quite extreme, but the BA was still able to find failures and to do so in fewer steps
than were needed by just running DRL in hifi directly. In fact, the most extreme example, the
perception case study, also had the most dramatic improvement in hifi steps needed to find failure.
These results show the power of the BA in adapting to fidelity differences and make the approach of
running AST in hifi significantly more computationally feasible. Further work could explore using
more traditional transfer learning and meta-learning approaches to save hifi simulation steps using
lofi or previous hifi simulation training results.

Figure 7.2: Example rendering of an intersection from NVIDIA's DriveSim simulator, an industry example of a high-fidelity simulator.
The approach of loading the lofi DRL policy had interesting results as well. In all the case
studies presented here, the policy loading approach was even faster than running the BA from
scratch, sometimes drastically so. However, throughout our work on this chapter we also observed
multiple cases where running the BA with a loaded policy did not result in finding failures at all,
whereas running the BA from scratch was still able to find failures in those cases. Furthermore,
there are cases, for instance the perception case study in section 7.2.4, where loading the lofi policy
is not even possible. Based on our experiences, loading the lofi policy is a good first step: it often
works, and when it works, works very well. However, if the BA fails to find a failure with a loaded
policy, then the BA should be run again from scratch, as running from scratch is a more robust
failure-finding method than loading the policy from lofi. Future work could focus on making the BA more robust when using a loaded lofi policy. The policy has a learned standard deviation network, and one reason the BA may sometimes fail with a loaded lofi policy is that, during training in lofi, the policy has already converged to small standard deviation outputs, leading to poor exploration.
Results might be improved by reinitializing the standard deviation network weights or by finding
other ways to boost exploration after a certain number of failed BA training epochs.
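As a sketch of that reinitialization idea, assuming a Gaussian policy object that stores a learned log standard deviation separately from its mean parameters:

# A minimal sketch of restoring exploration by resetting a converged standard
# deviation; GaussianPolicyStub is a hypothetical stand-in, not the toolbox API.
import numpy as np

class GaussianPolicyStub:
    def __init__(self, action_dim):
        self.mean_params = np.random.randn(action_dim, 8)  # learned mean head (kept)
        self.log_std = np.full(action_dim, -3.0)           # converged, near-deterministic

def boost_exploration(policy, init_log_std=0.0):
    # Reset only the standard-deviation parameters so the behavior transferred
    # from lofi is preserved while exploration recovers.
    policy.log_std = np.full_like(policy.log_std, init_log_std)

policy = GaussianPolicyStub(action_dim=6)
boost_exploration(policy)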
One final point is important to note: the goal of AST is to tractably find likely, and therefore
useful, failures in a system in simulation without constraints on actor behavior that can compromise
safety. AST is not a method whose goal is to estimate the total probability of failure. The hope of
this approach is that the majority of failures in hifi are also present in lofi, with additional spurious
errors, but it is certainly possible that there are some errors in hifi that have no close analog in lofi.
By biasing our search towards a likely failure found in lofi, we could actually hurt our ability to find
certain hifi failures. If our goal were to compute the total failure probability, such a bias could be
a critical flaw that might lead to significantly underestimating the likelihood of failure in certain
situations. However, such a bias is far less of a concern when we are instead merely looking to find
a likely and instructive failure. Indeed, the results bear this out, as the likelihoods of the failures
found by the BA were not just on par with the likelihoods of the failures found by running DRL in
hifi but were actually greater across all case studies.
This chapter, in conjunction with preceding chapters, represents a substantial step forward in
the theory of AST. However, this theoretical step is not worth much if it is not easily accessible for
use by system designers. In the next chapter we present the AST Toolbox, an open-source software
package that allows anyone to apply AST to validating their own autonomous systems.
Chapter 8
The AST Toolbox
Two common themes throughout this thesis are 1) the importance of validating that autonomous
systems will behave as expected prior to their deployment to ensure their safety and 2) the difficulty
of performing said validation. Poor system validation will lead to failures, and failures of safety-critical systems can lead to human injury or fatality. Such costly failures do not merely reflect on the
particular creators of the system at fault but can also lead to a lack of trust in autonomous systems
as a whole. Considering what is at stake, both human lives and societal trust in autonomous systems,
it is clear that safety must be a collaborative property, not a competitive one. Since validation is a
key step of creating safe systems, it is essential for validation methods to be available as open-source
and collaborative resources.
Fortunately, the research community has already created many good open-source options for
validating autonomous systems, as covered in section 8.1. However, there is a niche within the
ecosystem of open-source validation methods that remains unfilled. As of yet, there are no open-
source solutions for finding the most likely failure of a system. Additionally, while there are many
open-source algorithms for validation that treat the system under test as a black box, there are very
few options that can treat the entire simulator as a black box.
The algorithms presented throughout this thesis can fill the aforementioned niche, as AST both
finds the most likely failure and can treat the simulator as a black box. However, merely publishing
algorithm outlines or uploading project code is of limited use. Instead, validation methods should
be made available in a way that is both easily accessible and extensible. For that reason, we have
created the AST Toolbox, an open-source toolbox that allows system designers to apply AST to their
own validation problems. This chapter presents the AST Toolbox, an open-source package designed
to make AST easy to apply to any autonomous system. The Toolbox allows users to wrap their
simulator in a Python class that turns the validation problem into an OpenAI gym environment [84].
The Toolbox bundles the wrappers with garage [85], an open-source RL library that handles policy
creation and optimization. Examples and documentation enable users to validate any system in any
simulator, regardless of programming language, in a standardized and straightforward process.

Table 8.1: A feature comparison of the AST Toolbox with three existing software solutions for system validation and verification. The AST Toolbox is unique in two features: 1) being able to treat the entire simulator as a black box, and 2) returning the most likely failure.

Tool            Black-box SUT   Black-box Simulator   Returns Falsifying Trace   Returns Most Likely Failure   Formal Guarantees
S-TaLiRo [86]   X               ·                     X                          ·                             ·
Breach [5]      X               ·                     ·                          ·                             X¹
FalStar [41]    X               ·                     X                          ·                             ·
AST Toolbox     X               X                     X                          X                             ·
¹ For linear systems
8.1 Comparison to Existing Software
Table 8.1 compares the features of the AST Toolbox with those of existing software. Unlike other
approaches, AST generates falsifying traces for black-box simulators, returning the most likely fail-
ure. Formal methods often provide guarantees at the expense of computation, whereas AST is
designed with tractability in mind. The goal is to enable approximate verification of complex, high-
dimensional autonomous systems. AST does not replace traditional simulation testing approaches.
Instead, it is an additional validation process for developers to both discover and understand likely
failures relatively quickly, with reduced simulation time.
8.2 AST Toolbox Design
The AST Toolbox is a software package that provides a framework for using AST with any simulator,
facilitating the validation of autonomous agents. The toolbox has three major components: the
reinforcement learners, the simulator interface, and the reward function. The reinforcement learners
are algorithms for finding the most likely failure of the system under test. The AST simulator
interface provides a systematic way of wrapping a simulator to be used with the AST environment.
The reward function uses the standard AST reward structure together with heuristics to guide the
search process and incorporate domain expertise.
Figure 8.1: The AST Toolbox framework architecture. The core concepts of the method are shown, as well as their associated abstract classes: ASTEnv and ASTSpaces for the AST module, ASTSimulator and ASTReward for the wrapped simulator and reward function (connected through step(), reset(), getReward(), and getRewardInfo() calls), and RLAlgorithm and Policy for the reinforcement learner. ASTEnv combines the simulator and reward function in a gym environment. The reinforcement learner is implemented using the garage package.
8.2.1 Architecture Overview
The architecture is shown in fig. 8.1. The three core concepts of the AST method (simulator, rein-
forcement learner, and reward function) have abstract base classes associated with them. These base
classes provide interfaces that allow interaction with the AST module, represented by the ASTEnv
class. ASTEnv is a gym environment that interacts with a wrapped simulator ASTSimulator and
a reward function ASTReward. In conjunction with ASTSpaces, which are gym spaces, the AST
problem is encoded as a standard gym reinforcement learning problem. Many available open-source
reinforcement learning algorithms work with gym environments, but our reinforcement learners are
implemented using the garage framework. The reinforcement learner derives from the garage class
RLAlgorithm, and it uses both a Policy, such as a Gaussian LSTM, and an optimization method,
such as TRPO.
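As an illustration of this structure, a user-side wrapper might look like the following sketch. The method names echo fig. 8.1, but the toolbox's exact base classes and signatures may differ, and BlackBoxSim is a hypothetical stand-in for a user's simulator.

# A minimal sketch of wrapping a black-box simulator for use with ASTEnv.
class BlackBoxSim:
    def __init__(self):
        self.state = 0.0
    def apply(self, disturbance):
        self.state += disturbance
    def observe(self):
        return self.state
    def collided(self):
        return abs(self.state) > 1.0

class ExampleSimulatorWrapper:  # would derive from ASTSimulator in the toolbox
    def __init__(self, sim):
        self.sim = sim
    def reset(self):
        self.sim.state = 0.0
    def step(self, disturbance):
        # advance the wrapped simulator one timestep under the disturbance
        self.sim.apply(disturbance)
        return self.sim.observe(), self.sim.collided()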
8.2.2 Reinforcement Learners
The AST Toolbox is integrated with garage to provide easy access to efficient implementations of
reinforcement learners. The Toolbox comes with four algorithms already implemented:
• Deep reinforcement learning: The garage package has a range of deep reinforcement
learning algorithms and policies that can be used off the shelf, including different variations
of multi-layer perceptrons (MLP), gated recurrent units (GRU), and long short-term memory
(LSTM) networks (see section 2.1.3 for background). We have found the Gaussian LSTM
network with peepholes to be the most successful. Recurrent networks have the advantage
of not needing access to the simulation state, instead maintaining a hidden-state based on
the sequence of previous actions taken. A policy can, however, be defined to depend on the
simulation state, which can increase performance. For direct disturbance control, using a
Gaussian policy with a learned standard deviation introduces noise early on in training, which
improves exploration, but allows the network to reduce the noise over time, which results in
better exploitation. Note that for seed-disturbance control, the policy’s output (the seed) does
not have a smooth mapping to disturbances, so using a Gaussian policy would not make sense.
A Gaussian LSTM is used for both experiments in this chapter, and a Gaussian MLP is also
used on the cartpole task. Both tasks use garage’s implementation of PPO to optimize the
DRL reinforcement learner.
• Monte Carlo tree search: The toolbox offers two variants of MCTS (see section 2.1.2 for
background). The vanilla version is MCTS with UCB and DPW. The toolbox also offers a
variant called MCTS with blind value (MCTS-BV). MCTS-BV encourages exploring distinct
actions and therefore is more effective when the disturbance space is large and continuous. Note
that a separate MCTS random seeds (MCTSRS) algorithm is provided for seed-disturbance
control.
• Go-explore (Phase 1): The Toolbox provides an implementation of the tree-search phase
of go-explore (see section 5.1 for background). Go-explore has additional dependencies and must use a different environment class (GoExploreASTEnv) to account for the deterministic
resets. Different cell selection methods can be implemented by changing the Cell and CellPool
classes or by changing the optimize policy function of the GoExplore class. The downsample
function should also be overloaded or modified based on the application domain.
• The backward algorithm: The Toolbox also provides an implementation of the backward
algorithm (see section 6.1 for background), which can be used as phase 2 of go-explore, as
a general robustification phase (see chapter 6), or for transferring failures from low-fidelity
simulators to high-fidelity simulators (see chapter 7). The backward algorithm also requires
deterministic resets of the simulator, so GoExploreASTEnv must be used as the environment
class.
The Toolbox provides extensions of garage’s samplers, so parallel batch sampling and vectorized
sampling are both usually supported, not just for the algorithms listed above but also for most
custom policies or algorithms created by users.
8.2.3 Simulation Interface
The AST Toolbox includes a class template for an interface between the package and a general
simulator. The interface is a wrapper that implements the three necessary control functions for
AST, while still treating the simulator as a black box. Four specific simulator options are available:
• Open-loop vs. closed-loop: An open-loop simulator is one in which all of the disturbances
must be specified ahead of time, whereas a closed-loop simulator accepts online control at
each time-step. TRPO uses batch optimization, so the update steps for both simulators are
equivalent (see the sketch after this list).
• Fixed vs. sampled initial state: The reinforcement learners can be run from a fixed initial state.
Alternatively, AST reinforcement learners can generalize over a space of initial conditions by
sampling them during training.
• Black-box vs. white-box: A black-box simulator provides no access to the internal simulation
state, whereas a white-box simulator does. If the internal simulation state is accessible, using
it may boost performance.
• Exposed actions vs. random seed only: A simulator with exposed actions allows full program-
matic specification of simulation rollouts, allowing AST to control the environment directly
by outputting disturbance vectors. The alternative is to indirectly control disturbances by
allowing AST to control the seed of all the random number generators in a simulator, the
assumption being that the disturbances are subsequently sampled from said generators.
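The sketch below contrasts the first of these options; the simulator and policy interfaces are hypothetical stand-ins rather than the toolbox's actual classes.

# A minimal sketch of the open-loop vs. closed-loop rollout distinction.
def open_loop_rollout(sim, disturbances):
    # all disturbances are specified up front, then the simulator runs once
    sim.reset()
    return sim.simulate(disturbances)

def closed_loop_rollout(sim, policy, horizon):
    # the reinforcement learner selects each disturbance online
    sim.reset()
    obs, trajectory = sim.observe(), []
    for _ in range(horizon):
        disturbance = policy(obs)
        obs = sim.step(disturbance)
        trajectory.append((disturbance, obs))
    return trajectory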
8.2.4 Reward Structure
The reward function module follows the reward function presented in eq. (2.9). Instead of using
the log-probability of disturbances, the function uses the negative Mahalanobis distance [69]. The
reward still ends up proportional to the log-likelihood of the failure found, but the Mahalanobis
distance works better in practice as it does not explode to near-infinite numbers for disturbances
that have very low probability. If no heuristics are used, then the reward function does not require
any information from or access to the simulator beyond what is defined in section 2.5. However, the
reward function is able to accept additional information from the simulator to calculate a heuristic
reward bonus when such a reward bonus is applicable.
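A minimal sketch of this reward term follows; for a Gaussian disturbance model, the negative Mahalanobis distance tracks the log-likelihood while staying numerically well-behaved.

# A minimal sketch of the disturbance-likelihood reward; mu and cov describe
# the natural disturbance distribution, and the values below are placeholders.
import numpy as np

def mahalanobis(x, mu, cov):
    d = x - mu
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

disturbance = np.array([0.3, -0.1])
reward = -mahalanobis(disturbance, mu=np.zeros(2), cov=np.eye(2))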
8.3 Case Studies
8.3.1 Cartpole
Cartpole is a classic test environment for continuous control algorithms [66]. The state $s = [x, \dot{x}, \theta, \dot{\theta}]$ represents the cart's horizontal position and speed as well as the bar's angle and angular velocity. The system under test (SUT) is a neural network control policy trained by PPO. The control policy controls the horizontal force $\vec{F}$ applied to the cart, and the goal is to prevent the bar on top of the cart from falling over. The failure of the system is defined as $|x| > x_{\max}$ or $|\theta| > \theta_{\max}$. The initial state is $s_0 = [0, 0, 0, 0]$. Figure 8.2 shows the cartpole environment.
The reinforcement learner interacts with the simulator by applying a disturbance to the SUT's control force $\vec{F}$. At each time-step, the disturbance force $\delta\vec{F}$, given by the reinforcement learner output, and the control force $\vec{F}$, given by the control policy, are applied simultaneously to the cart.
Figure 8.2: Layout of the cartpole environment. A control policy tries to keep the bar from falling over, or the cart from moving too far horizontally, by applying a control force to the cart [65].
The reward function uses $R_{\bar{E}} = 1\times10^4$ and $R_E = 0$, and as a heuristic reward uses the normalized distance of the final state to the failure states, which is given by

$$f(s) = \min\left(\frac{|x - x_{\max}|}{x_{\max}}, \frac{|\theta - \theta_{\max}|}{\theta_{\max}}\right) \qquad (8.1)$$

The heuristic reward encourages the reinforcement learner to push the SUT closer to failure. The disturbance likelihood reward $\rho$ is set to the log of the probability density function of the natural disturbance force distribution, which is a Gaussian with zero mean and a standard deviation $\sigma = 0.8$. No per-step heuristic is used.
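In code, eq. (8.1) is a direct transcription:

# The normalized distance of a state to the failure states, per eq. (8.1);
# smaller values mean the SUT is closer to failure.
def distance_to_failure(x, theta, x_max, theta_max):
    return min(abs(x - x_max) / x_max, abs(theta - theta_max) / theta_max)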
Two reinforcement learners were compared in this experiment: MCTS and DRL. MCTS was
trained with k = 0.5, α = 0.5, and c = 10. Since the MCTS reinforcement learner does not
need the true simulation state, its performance is the same for both the black-box and white-box
simulator settings. The DRL reinforcement learner was tested under both white-box and black-box
settings. For the white-box setting, we used a multilayer perceptron network with hidden layer sizes
of 128, 64, and 32. The step size was set to 5.0. For the black-box setting, we used a Gaussian LSTM
network with 64 hidden units. The step size was set to 1.0. Both neural networks were trained using
the PPO algorithm with a batch size of 2000 and a discount factor γ = 0.99. We additionally added
a random search reinforcement learner as the baseline. All reinforcement learners were trained using
2×10^6 simulation steps. Each reinforcement learner was run for 10 trials using different random
seeds and the results were averaged. The hyperparameters used in this experiment were found
CHAPTER 8. THE AST TOOLBOX 77
empirically. All reinforcement learners used the closed-loop setting.
The best trajectory return found at each trial was recorded every 1×10^3 simulation steps. The
average of the best trajectory return over 10 trials is shown in fig. 8.3. Both MCTS and DRL rein-
forcement learners (with MLP and LSTM architectures) are able to find failure trajectories, whereas
the random-search baseline fails to do so in all trials. In this experiment, the DRL reinforcement
learners use significantly fewer simulation steps than the MCTS reinforcement learner to find failure
trajectories in all trials. The MLP reinforcement learner is also slightly more sample-efficient than
the LSTM since it has access to the true simulator state. The best rewards found by the MLP reinforcement learner, the LSTM reinforcement learner, and the MCTS reinforcement learner were −59.4,
−50.6, and −100.1, respectively. Surprisingly, the best reward found by the LSTM reinforcement
learner is slightly higher than that for the MLP reinforcement learner, but this is likely due to the
stochastic nature of the reinforcement learners.
Figure 8.3: Best return found up to each iteration, plotted against simulation step number (legend: Random Search; TRPO, White-Box; TRPO, Black-Box; MCTS). The value is averaged over 10 different trials. Both the MCTS and DRL reinforcement learners are able to find failures, but the DRL reinforcement learner is more computationally efficient.
8.3.2 Autonomous Vehicle
The autonomous vehicle task is a recreation of the autonomous driving experiment from section 3.3.
A pedestrian crosses a neighborhood road at a crosswalk as an autonomous vehicle approaches, as
shown in fig. 8.4. The x-axis is aligned with the edge of the road, with East being the positive
x-direction. The y-axis is aligned with the center of the cross-walk, with North being the positive
y-direction. The system under test is the autonomous driving policy, which is a modified version of
the Intelligent Driver Model [68].
The reward function for this scenario uses $R_{\bar{E}} = -1\times10^5$, $R_E = 0$, and a heuristic reward with $\Phi = 10000 \cdot \mathrm{dist}(p_v, p_p)$, where $\mathrm{dist}(p_v, p_p)$ is the distance between the pedestrian and the SUT. This heuristic encourages the reinforcement learner to move the pedestrian closer to the car
in early iterations, which can significantly increase training speeds. The reward function also uses
$\rho = M(x, \mu_x \mid s)$, which is the Mahalanobis distance function [69]. The Mahalanobis distance is
a generalization of distance to the mean for multivariate distributions. The pedestrian and noise
models used in this experiment are Gaussian, making the Mahalanobis distance proportional to
log-likelihood.

Figure 8.4: Layout of the autonomous vehicle scenario. A vehicle approaches a cross-walk on a neighborhood road as a single pedestrian attempts to walk across. Initial conditions are shown.
The reinforcement learner interacts with the simulator by controlling the pedestrian’s acceleration
and the noise on the sensors. At each time-step an acceleration (ax, ay) vector is used to move the
pedestrian. Gaussian noise is also added to the sensors at each time-step. The reinforcement learner
outputs the mean and diagonal covariance of a multivariate Gaussian distribution at each time-step,
from which the acceleration and noise vectors are sampled. Using a distribution adds controlled
stochasticity to the actions, which enhances exploration.
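A sketch of this action model, with placeholder values standing in for the network outputs:

# A minimal sketch of sampling the disturbance from the policy's Gaussian
# output; the mean and diagonal covariance values are placeholders.
import numpy as np

rng = np.random.default_rng(0)
mean = np.zeros(6)            # network-output mean (placeholder)
diag_cov = np.full(6, 0.01)   # network-output diagonal covariance (placeholder)
disturbance = rng.normal(mean, np.sqrt(diag_cov))
accel, sensor_noise = disturbance[:2], disturbance[2:]  # (ax, ay) and noise terms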
The SUT has an initial and target velocity of 11.2 m/s, with a target follow distance of 5 m. The
SUT starts −35.0 m from the cross-walk. The pedestrian starts −4.0 m behind the crosswalk with
an initial velocity of 1.0 m/s. The standard deviations of the pedestrian accelerations are 0.1 m/s² in the x-direction and 0.0 m/s² in the y-direction, and the standard deviation of all noise vectors is 0.1 m.
The autonomous vehicle experiment was run with the DRL reinforcement learner. Because
autonomous vehicles generally act in high-dimensional spaces, we do not use the MCTS reinforcement
learner on this problem. The DRL reinforcement learner used a batch size of 50 000, γ = 0.999, and
was optimized with a PPO step size of 1.0. The reinforcement learner was run for 5×10^6 simulation
steps.
The reward of the most likely trajectory found by the DRL reinforcement learner at each iteration
for the autonomous vehicle experiment is shown in fig. 8.5. The reinforcement learner was run for 4×10^6 steps. The maximum reward found at each iteration is shown as the Batch Max, while the
most likely collision found so far is shown as the Cumulative Max. The reinforcement learner quickly
converges to finding solutions in the range of −300 to −250, and the best solution found was −236, found after 2×10^5 steps. Running 4×10^6 steps required 7.3 minutes. A random-search baseline
was unable to find a single crash. The toolbox found a likely failure quickly and efficiently, despite
the space of possible actions being 6-dimensional and continuous.

Figure 8.5: Reward of the most likely failure found at each iteration, with (a) the full reward range and (b) zoomed to better show the variance in the DRL reinforcement learner results (legend: DRL Batch Max, DRL Cumulative Max, Random Batch Max, Random Cumulative Max). The Batch Max is the maximum per-iteration summed Mahalanobis distance. The Cumulative Max is the best Batch Max up to that iteration. The reinforcement learner finds the best solution by iteration 6 out of 80.
The average return of each iteration is shown in fig. 8.6. Again, both the per-iteration and
cumulative maximum averages are shown. The reinforcement learner seems to have converged by
4×10^5 steps to a range of −5.7×10^4 to −4.4×10^4, although there are slight improvements later. These slight improvements did not correspond with improvements in the most likely failure found. At 2×10^5 steps, the average reward is −2.2×10^5. The average reward being so large in magnitude indicates that
some of the trajectories in each batch were not leading to collisions at all. The policy sometimes
fails to find a collision because the DRL reinforcement learner samples actions stochastically from a
Gaussian distribution. Because the random-search baseline never found a collision, the reward was
consistently around −2.1×10^5.

Figure 8.6: The average return at each iteration (legend: DRL Batch Average, DRL Cumulative Max Average, Random Batch Average, Random Cumulative Max Average). The Batch Average is the average return from each trajectory in an iteration, while the Cumulative Max Average is the maximum Batch Average so far. The reinforcement learner is mostly converged by iteration 10, although there are slight improvements later. The large returns indicate that not every trajectory is ending in a collision.
8.3.3 Automatic Transmission
A common benchmark for falsification tools is a four-speed automatic transmission controller mod-
eled in Simulink [87]. The system under test (SUT) is the automatic transmission model also used
in the ARCH-COMP falsification competition [88]. The model takes three real-valued inputs: time
0 ≤ t ≤ 30, throttle percent 0 ≤ τ ≤ 100, and brake torque 0 ≤ β ≤ 325. The model outputs
two continuous states, speed v and RPM ω, and one discrete state g, the current gear. Failures of
the SUT are defined by violations of signal temporal logic (STL) specifications that encode system
requirements. The STL requirements commonly used were originally proposed by Hoxha, Abbas,
and Fainekos [87], and the parameters were selected for their difficulty by Ernst, Arcaini, Donze,
et al. [88]. We chose the following benchmark STL formulas to highlight behaviors of the different
reinforcement learners:
AT1: □[0,20] v < 120 (Speed is always below 120 between 0 and 20 seconds)
AT2: □[0,10] ω < 4750 (RPM is always below 4750 between 0 and 10 seconds)
The reinforcement learners interact with the simulator by selecting up to four input actions, each
a vector [t, τ, β]. The output vector [v, ω, g] for each simulator time step and the STL specifications
are used to calculate a robustness metric [4]. Robustness is a measure of the degree to which the
specification was satisfied. We use the negative robustness in our reward function to guide the search
towards failures (i.e., specification violations), computed using the stlcg Python package [89].
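As a sketch of the robustness computation for AT1, without relying on the stlcg package, the discrete-time semantics of □[0,20] v < 120 reduce to a minimum over the window:

# A minimal sketch of STL robustness for AT1 (always v < 120 on [0, 20] s),
# assuming a speed trace v sampled every dt seconds; negative robustness means
# the specification is violated, and its negation serves as the reward.
import numpy as np

def at1_robustness(v, dt, t_max=20.0, v_max=120.0):
    window = v[: int(t_max / dt) + 1]
    return float(np.min(v_max - window))

v = np.array([50.0, 80.0, 110.0, 125.0])  # placeholder trace at dt = 5 s
reward = -at1_robustness(v, dt=5.0)       # positive once the spec is violated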
Two reinforcement learners were compared in this experiment: MCTS and DRL. Both reinforce-
ment learners are run in the black-box simulator mode. MCTS was trained with k = 1, α = 0.7,
and c = 10 [58]. For the DRL reinforcement learner, we used an LSTM network [60] with 64 hidden
units and a step size of 1.0. The neural network was trained using the PPO algorithm with a batch
size of 40 and a discount factor γ = 1.0. A random search reinforcement learner was used as the
baseline. All reinforcement learners were trained using 1000 simulation steps. Each reinforcement
learner was run for 10 trials using different random seeds, and the results were averaged. The hyperparameters used in this experiment were found empirically, and all reinforcement learners used a closed-loop setting.

Figure 8.7: The results of the automatic transmission case study, averaged over 10 trials, for (a) the AT1 benchmark, □[0,20] v < 120, where only DRL finds failures, and (b) the AT2 benchmark, □[0,10] ω < 4750, where all find failures and DRL is most efficient (legend: PPO, MCTS, Random Search). The DRL reinforcement learner is able to outperform both MCTS and a random search baseline.
The best trajectory reward found at each trial was recorded every simulation step. The average of
the best trajectory reward over 10 trials is shown in fig. 8.7. For the more difficult AT1 benchmark,
only the DRL reinforcement learner was able to find failure trajectories within the allotted number
of steps, shown in fig. 8.7a. For the less difficult AT2 benchmark, all three reinforcement learners
were able to find failure trajectories, shown in fig. 8.7b, but MCTS found failures in only 9 out of
10 trials. The results highlight that the choice of reinforcement learner should depend on the
underlying problem: MCTS tends to work well in long-horizon problems with large state and action
spaces. In this experiment, the DRL reinforcement learner used fewer simulation steps than the
other reinforcement learners to find failure trajectories in all trials. The DRL reinforcement learner
found the first failure at iteration 226.0±157.7 and 271.9±269.8 for AT1 and AT2, respectively. The
MCTS reinforcement learner found the first failure at iteration 338.3± 314.9 for AT2, and random
search found the first failure at iteration 673.8± 547.3.
8.4 Discussion
This chapter introduced the AST Toolbox for validating autonomous systems. This open-source
package simplifies applying AST to autonomous system safety validation. The toolbox provides
wrappers to interface between simulators and the provided reinforcement learners, while also sup-
porting the implementation of new reinforcement learners. The AST approach uses reinforcement
learning to find the most likely failure of a system by treating the simulator as an agent acting in an
MDP. While the solutions are not guaranteed to be optimal, the method is tractable for an emerg-
ing class of safety-critical autonomous systems that act in high-dimensional spaces and experience
extremely low-probability failures. While other open-source software packages support falsification,
the AST Toolbox is distinct in its ability to find the most likely failure while treating simulators as a
black box. The aim of the AST Toolbox is to help make autonomous systems more reliable through
robust testing.
Chapter 9
Summary and Future Work
Autonomous systems are rapidly becoming ubiquitous in society, even in safety-critical applica-
tions. Going forward, validation through real-world testing alone will be infeasible for numerous
autonomous systems that must safely interact with people. It is critical that system designers have
access to a simulation-based validation method that produces robust results without requiring an
infeasible amount of compute to run. This thesis is a step towards developing approximate methods
of validating autonomous systems in simulation. This chapter summarizes the approach taken in
this thesis, reviews the contributions made, and proposes several areas of future research.
9.1 Summary
As autonomous systems continue to spread to safety-critical applications, it is essential that designers
have access to simulation-based validation methods. Validation is a key step towards system safety,
but, unfortunately, validation through real-world testing will often be infeasible. For example,
it would take 5 billion miles driven to ensure that a fleet of autonomous vehicles was as safe as
commercial airplanes, and that is assuming those 5 billion miles are perfectly representative of their
eventual application areas. Validation must, at least in part, be done through simulation.
However, validation through simulation has its own challenges. In order for the validation re-
sults to be useful, the simulation must be a high-fidelity representation of the real-world, and the
simulation rollouts must be able to capture the full variance of real-world scenarios. High-fidelity
simulators are slow and expensive to run, so we are constrained in the number of simulations we can
run. However, allowing the simulation scenarios to capture the variance of the real world results in
a massive space of possible simulation rollouts that must be searched for failures. An approach is
needed that can approximately search the simulation space for failures in an efficient enough manner
to be tractable but without sacrificing the ability to find failures.
Adaptive stress testing (AST) is one such possible method. AST formulates the problem of
finding the most likely failure as a Markov decision process, which can then be solved with standard
reinforcement learning approaches (see chapter 2). AST puts no constraints on actor behavior, which
allows it to capture the full variance of the real world. The resulting space of possible simulation
rollouts is massive, and, therefore, the reward function is designed to prioritize likely failures, which
allows AST to find likely failures through optimization. AST is no silver bullet, though, and has its
own issues that must be addressed.
This thesis contributes solutions to a number of AST’s limitations. Many autonomous systems
act in high-dimensional spaces, which leads to an exponential explosion in possible simulation roll-
outs. The scalability issue is compounded when initial conditions must be searched over as well, so
AST must be able to generalize across initial conditions after a single training run. The massive
space of possible simulation rollouts leads to high-variance results, so AST needs a way to produce
more robust results. When validating a system in simulation, the results are dependent on the simu-
lator being a highly accurate representation of the real world, otherwise failures found may not exist
in the real system, and failures in the real system may not exist in simulation. This dependency on
simulator accuracy necessitates the use of high-fidelity simulators, and thus AST must be computa-
tionally affordable enough to be run in high-fidelity. Finally, in order for designers to actually make
use of AST, the method must be made open-source and easily applicable.
9.2 Contributions
The first part of this thesis provides a general introduction to the challenges of validation in simu-
lation (chapter 1), as well as an overview of AST (chapter 2). The remainder of this thesis makes
the following contributions across three different categories:
• Scalable Reinforcement Learners
– A deep reinforcement learner for scalable AST: Many autonomous systems act
in continuous, high-dimensional state and action spaces, which leads to an exponential
explosion in the size of the simulation space that must be searched for failures. AST
must be able to scale to these massive spaces in a way that allows validation to still be
tractable. AST previously used a reinforcement learner based on Monte Carlo tree search
with double progressive widening. While the use of upper confidence bounds and double
progressive widening improves the results of MCTS on continuous and high-dimensional
spaces, the tree size can still quickly explode. In contrast, deep reinforcement learning
has already been shown to perform well on tasks with continuous or high-dimensional
state or action spaces. Chapter 3 presents a new reinforcement learner that uses deep
reinforcement learning to improve scalability to large simulation spaces.
– A go-explore reinforcement learner for AST without reward heuristics: Pre-
vious work has relied on the presence of heuristic rewards to speed up validation results.
Heuristic rewards are domain-specific rewards that are crafted through expert knowledge
or real world data to help guide a reinforcement learning agent to goals that may be hard
to find. However, it may not always be desirable or even feasible for AST to have access
to heuristic rewards, and without them, the problem becomes a hard-exploration domain.
Go-explore recently set new records on common hard-exploration benchmarks by taking
a two-phase approach: 1) A tree search that uses heuristic biases and deterministic resets
to efficiently find goal states and 2) A robustification phase that uses the backward algo-
rithm to produce a deep neural network policy that is robust to stochastic perturbations.
Chapter 5 presents a new reinforcement learner based on phase 1 of go-explore. This
tree-search reinforcement learner is shown to find failures without the use of heuristic
rewards on long-horizon problems where DRL and MCTS approaches both fail to find
failures.
• General AST Utility
– The ability to generalize across initial conditions in a single run of AST: In
real-world applications, system designers are often interested in scenario classes, where a
scenario class is a space of similar scenarios defined by parameter ranges. Specific scenar-
ios can then be instantiated by selecting concrete parameter values from the parameter
ranges. To perform validation over a scenario class with AST, a system designer would
have to generate a number of concrete scenario instantiations and run AST for each one.
For even a simple scenario class, a basic grid search could result in hundreds or even
thousands of different concrete scenario instantiations, and it would be infeasible to run
AST thousands of times for each scenario class when performing validation over a suite
of tests. However, concrete scenarios within a scenario class often have significant simi-
larities to each other. It should therefore be possible to run AST a single time and learn
to generalize across the entire space of initial conditions of the scenario class. Chapter 4
presents an AST architecture and training approach that allows AST to perform valida-
tion of a scenario class in a single run. We show that this approach is able to produce
better results with far less computational expense when compared to running AST from
a series of concrete scenario instantiations.
– Robust AST results through the use of the backward algorithm: Chapter 2
shows that the trajectory that maximizes the reward function in AST is the most likely
failure. However, in practice AST uses approximate reinforcement learning methods, so
the reinforcement learner may converge to a local optimum that is not the global opti-
mum. Because autonomous systems act in continuous and high-dimensional spaces, there
is a massive spread of local optima. Unfortunately, this can lead to large amounts of
variance in AST’s results, an inconsistency that can lead to unsafe test results. Chap-
ter 6 presents a robustification phase of AST based on go-explore’s use of the backward
algorithm in its phase 2. The robustification phase is able to significantly improve the
likelihood of failures found by the reinforcement learner, regardless of which algorithm
the reinforcement learner is using.
• Real-world Applicability
– High-fidelity tractability through learning in and transferring from low-fidelity:
This thesis presents a number of advancements in AST that improve the approach’s scala-
bility. However, AST still can require many thousands of iterations to converge to a likely
failure. Running this many simulation rollouts may be infeasible in high-fidelity simula-
tors, which are simulators with features such as advanced dynamics models, perception
from graphics, and autonomous system software-in-the-loop simulation. These features
are essential for simulation accuracy, but they can make high-fidelity simulators slow and
expensive to run. In order to make AST tractable in high-fidelity, chapter 7 presents a
method of first finding failures in low-fidelity simulation and then transferring the fail-
ures to high-fidelity. We show that this approach reduces the number of high-fidelity
simulation steps needed to find a failure, sometimes dramatically so.
– An open-source and easily applicable software toolbox: While this thesis rep-
resents a significant advancement in the ability to validate autonomous systems using
AST, such advancements are only useful if system designers can actually apply AST to
validating their autonomous systems. For the good of society, it is essential that safety
become a collaborative effort among autonomous system designers, not a competitive one.
Chapter 8 presents the AST Toolbox, an open-source software toolbox that enables sys-
tem designers to easily apply AST to their own system. Through a series of wrappers,
designers are able to connect the toolbox to their own simulator and system, regardless
of the implementation language. In addition, the Toolbox is connected with garage to
provide easy access to reinforcement learning methods. The Toolbox is a step towards
safety as a collaborative feature of autonomous systems.
9.3 Further Work
This thesis contributes towards the validation of autonomous systems in simulation by making
significant improvements to Adaptive Stress Testing. However, the work presented here still has
limitations. Much of the work was tested on a toy simulator from the AST Toolbox. While tests on
high-fidelity simulators yielded promising results, more work must be done to prove the applicability
of AST to high-fidelity validation. The models used for the actors within scenarios were often very
basic, resulting in some unrealistic behaviors. Learning models of actors from data could yield failures
that are more realistic, and therefore usually more useful for system designers trying to make their
system safer. Similarly, the most common system tested here was a modified version of a lane-
following model. While AST has been applied to complex real-world systems in the past [90]–[92], it
would be instructive to apply AST to validating a real autonomous vehicle. Addressing these issues
in future work would require significant time, money, and access, but could also yield significant
benefits. In addition to these practical considerations, there are several directions of theoretical
work that seem promising.
9.3.1 Full Generalization
Chapter 4 shows that AST can generalize across a scenario class. The theory underlying this
improvement is that similar scenarios share many commonalities that only need to be learned once
when generalizing, instead of having to be relearned for every concrete scenario. For example,
perhaps an object in a scenario creates an occlusion that can cause a collision. If AST learns to use the occlusion from one set of initial conditions, it should not have to relearn from scratch to use
the occlusion from a different set of initial conditions. However, this argument can be taken a step
further.
In truth, there are many commonalities across all scenarios. Lessons on how to manipulate
sensor noise, occlusions, actor actions, and other disturbances are instructive across all validation
scenarios. AST should be able to use data from past validation tests to make running the current
test much faster. This sort of generalization is in the realm of meta-learning, where an agent learns
how to learn. By incorporating meta-learning into AST, it is possible that we could develop an
intelligent testing agent that can use life-long learning to continue to quickly find failures across
different systems and scenarios.
9.3.2 Fault Injection in Vision Systems
Autonomous vehicles are equipped with a range of sensors, with LIDAR, RADAR, and cameras
being some of the most popular. These sensors provide the vehicle with computer vision, but that
vision is not perfect. Different sensors have different failure modes, such as a camera suffering from
lens flare when the sun is in a bad position, or a lidar beam not bouncing back from an object
because it was absorbed by something black. These failures can create uncertainty in the vehicle’s
state estimation, which can result in collisions. AST could find failures by injecting these faults into vision systems.
One advantage of this approach is that it would be straightforward to learn from data the
probability distribution over different types of failures. Some systems have failures that would be
easy to inject, such as a lidar beam not reflecting off an object. On the other hand, camera systems
could be much trickier to validate, though there has already been some promising work in this
area [44]. One possible approach is to use generative adversarial networks (GANs), which have
shown promising performance across a range of image tasks. For example, one common use of
GANs is style transfer tasks, where certain features are extracted from a base image and applied in
a realistic way to a target image. AST could use style transfer GANs to take camera images from
the real-world and inject faults, like adding lens flare, or increase scenario difficulty, like changing a
clear day to a snowy one.
9.3.3 Interpretability
Finding failures is not useful if system designers cannot understand those failures well enough to
address them. Thus, a key component of any validation method is the interpretability of the failures,
or how well a human can understand them. There has already been some exciting work done on
methods of automatically classifying or categorizing failures, work that is compatible with AST [93],
[94]. Especially promising is the potential for the classification system to feed back into the AST
method, which is to say that the interpretability work would not just classify failures found by AST but would actually allow designers to specify, in human-readable formats, which types of failures they are interested in, and AST would restrict its search to such failures, perhaps using reward
augmentation [47]. Further work could build on the existing frameworks, but the most immediate
step should be to build the interpretability features into the AST Toolbox.
Bibliography
[1] N. Kalra and S. M. Paddock, “Driving to safety: How many miles of driving would it take
to demonstrate autonomous vehicle reliability?” Transportation Research Part A: Policy and
Practice, vol. 94, no. Supplement C, pp. 182–193, 2016.
[2] P. Koopman, “The heavy tail safety ceiling,” in Automated and Connected Vehicle Systems
Testing Symposium, 2018.
[3] A. Corso, R. J. Moss, M. Koren, R. Lee, and M. J. Kochenderfer, “A survey of algorithms for
black-box safety validation,” arXiv preprint arXiv:2005.02979, 2020.
[4] G. E. Fainekos and G. J. Pappas, “Robustness of temporal logic specifications for continuous-
time signals,” Theoretical Computer Science, vol. 410, no. 42, pp. 4262–4291, 2009.
[5] A. Donze, “Breach, a toolbox for verification and parameter synthesis of hybrid systems,” in
Computer Aided Verification, Springer, 2010, pp. 167–170.
[6] H. Yang, “Dynamic programming algorithm for computing temporal logic robustness,” M.S.
thesis, Arizona State University, 2013.
[7] T. Dreossi, T. Dang, A. Donze, J. Kapinski, X. Jin, and J. V. Deshmukh, “Efficient guiding
strategies for testing of temporal properties of hybrid systems,” Springer, 2015, pp. 127–142.
[8] G. Ernst, S. Sedwards, Z. Zhang, and I. Hasuo, “Fast falsification of hybrid systems using
probabilistically adaptive input,” in International Conference on Quantitative Evaluation of
Systems (QEST), Springer, 2019, pp. 165–181.
[9] Y. V. Pant, H. Abbas, and R. Mangharam, “Smooth operator: Control using the smooth
robustness of temporal logic,” in IEEE Conference on Control Technology and Applications
(CCTA), IEEE, 2017, pp. 1235–1240.
[10] T. Akazaki, Y. Kumazawa, and I. Hasuo, “Causality-aided falsification,” Electronic Proceedings
in Theoretical Computer Science, vol. 257, 2017.
[11] H. Abbas, M. O’Kelly, and R. Mangharam, “Relaxed decidability and the robust semantics of
metric temporal logic,” 2017, pp. 217–225.
[12] H. Abbas, G. Fainekos, S. Sankaranarayanan, F. Ivančić, and A. Gupta, “Probabilistic tempo-
ral logic falsification of cyber-physical systems,” ACM Transactions on Embedded Computing
Systems (TECS), vol. 12, no. 2s, pp. 1–30, 2013.
[13] A. Aerts, B. T. Minh, M. R. Mousavi, and M. A. Reniers, “Temporal logic falsification of cyber-
physical systems: An input-signal-space optimization approach,” in IEEE International Con-
ference on Software Testing, Verification and Validation Workshops (ICSTW), IEEE, 2018,
pp. 214–223.
[14] Q. Zhao, B. H. Krogh, and P. Hubbard, “Generating test inputs for embedded control systems,”
IEEE Control Systems Magazine, vol. 23, no. 4, pp. 49–57, 2003.
[15] X. Zou, R. Alexander, and J. McDermid, “Safety validation of sense and avoid algorithms
using simulation and evolutionary search,” in International Conference on Computer Safety,
Reliability, and Security (SafeComp), Springer, 2014, pp. 33–48.
[16] S. Silvetti, A. Policriti, and L. Bortolussi, “An active learning approach to the falsification of
black box cyber-physical systems,” in International Conference on Integrated Formal Methods
(iFM), Springer, 2017, pp. 3–17.
[17] J. Deshmukh, M. Horvat, X. Jin, R. Majumdar, and V. S. Prabhu, “Testing cyber-physical
systems through Bayesian optimization,” ACM Transactions on Embedded Computing Systems,
vol. 16, no. 5s, Sep. 2017. doi: 10.1145/3126521.
[18] G. E. Mullins, P. G. Stankiewicz, R. C. Hawthorne, and S. K. Gupta, “Adaptive generation of
challenging scenarios for testing and evaluation of autonomous vehicles,” Journal of Systems
and Software, vol. 137, pp. 197–215, 2018.
[19] Y. Abeysirigoonawardena, F. Shkurti, and G. Dudek, “Generating adversarial driving scenarios
in high-fidelity simulators,” in IEEE International Conference on Robotics and Automation
(ICRA), IEEE, 2019, pp. 8271–8277.
[20] X. Yang, M. Egorov, A. Evans, S. Munn, and P. Wei, “Stress testing of UAS traffic management
decision making systems,” in AIAA AVIATION Forum, 2020, p. 2868.
[21] G. E. Fainekos and K. C. Giannakoglou, “Inverse design of airfoils based on a novel formulation
of the ant colony optimization method,” Inverse Problems in Engineering, vol. 11, no. 1, pp. 21–
38, 2003.
[22] J. M. Esposito, J. Kim, and V. Kumar, “Adaptive RRTs for validating hybrid robotic control
systems,” in Algorithmic Foundations of Robotics VI, Springer, 2004, pp. 107–121.
[23] J. Kim, J. M. Esposito, and V. Kumar, “An RRT-based algorithm for testing and validating
multi-robot controllers,” Moore School of Electrical Engineering GRASP Lab, Tech. Rep.,
2005.
[24] M. S. Branicky, M. M. Curtiss, J. Levine, and S. Morgan, “Sampling-based planning, control
and verification of hybrid systems,” IEEE Proceedings - Control Theory and Applications,
vol. 153, no. 5, pp. 575–590, 2006.
[25] T. Dang, A. Donzé, O. Maler, and N. Shalev, “Sensitive state-space exploration,” in IEEE
Conference on Decision and Control (CDC), IEEE, 2008, pp. 4049–4054.
[26] T. Nahhal and T. Dang, “Test coverage for continuous and hybrid systems,” in Computer
Aided Verification, Springer, 2007, pp. 449–462.
[27] E. Plaku, L. E. Kavraki, and M. Y. Vardi, “Hybrid systems: From verification to falsification
by combining motion planning and discrete search,” Formal Methods in System Design, vol. 34,
no. 2, pp. 157–182, 2009.
[28] C. E. Tuncali and G. Fainekos, “Rapidly-exploring random trees for testing automated vehi-
cles,” in IEEE Intelligent Transportation Systems Conference (ITSC), 2019, pp. 661–666.
[29] M. Koschi, C. Pek, S. Maierhofer, and M. Althoff, “Computationally efficient safety falsifi-
cation of adaptive cruise control systems,” in IEEE International Conference on Intelligent
Transportation Systems (ITSC), IEEE, 2019, pp. 2879–2886.
[30] A. Zutshi, S. Sankaranarayanan, J. V. Deshmukh, and J. Kapinski, “A trajectory splicing ap-
proach to concretizing counterexamples for hybrid systems,” in IEEE Conference on Decision
and Control (CDC), IEEE, 2013, pp. 3918–3925.
[31] A. Zutshi, J. V. Deshmukh, S. Sankaranarayanan, and J. Kapinski, “Multiple shooting, CEGAR-
based falsification for hybrid systems,” in International Conference on Embedded Software
(EMSOFT), 2014, pp. 1–10.
[32] Y. Kim and M. J. Kochenderfer, “Improving aircraft collision risk estimation using the cross-
entropy method,” Journal of Air Transportation, vol. 24, no. 2, pp. 55–62, 2016.
[33] M. O’Kelly, A. Sinha, H. Namkoong, R. Tedrake, and J. C. Duchi, “Scalable end-to-end au-
tonomous vehicle testing via rare-event simulation,” in Advances in Neural Information Pro-
cessing Systems (NeurIPS), 2018, pp. 9827–9838.
[34] D. Zhao, H. Lam, H. Peng, S. Bao, D. J. LeBlanc, K. Nobukawa, and C. S. Pan, “Accelerated
evaluation of automated vehicles safety in lane-change scenarios based on importance sampling
techniques,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 3, pp. 595–
607, 2016.
[35] Z. Huang, H. Lam, D. J. LeBlanc, and D. Zhao, “Accelerated evaluation of automated vehicles
using piecewise mixture models,” IEEE Transactions on Intelligent Transportation Systems,
vol. 19, no. 9, pp. 2845–2855, 2017.
[36] S. Sankaranarayanan and G. Fainekos, “Falsification of temporal properties of hybrid systems
using the cross-entropy method,” in Hybrid Systems: Computation and Control (HSCC), 2012,
pp. 125–134.
[37] J. Norden, M. O’Kelly, and A. Sinha, “Efficient black-box assessment of autonomous vehicle
safety,” arXiv preprint arXiv:1912.03618, 2019.
[38] A. Corso, R. Lee, and M. J. Kochenderfer, “Scalable autonomous vehicle safety validation
through dynamic programming and scene decomposition,” in IEEE International Conference
on Intelligent Transportation Systems (ITSC), IEEE, 2020.
[39] J. P. Chryssanthacopoulos, M. J. Kochenderfer, and R. E. Williams, “Improved Monte Carlo
sampling for conflict probability estimation,” in AIAA Non-Deterministic Approaches Con-
ference, Orlando, Florida, 2010. doi: 10.2514/6.2010-3012.
[40] R. Lee, M. J. Kochenderfer, O. J. Mengshoel, G. P. Brat, and M. P. Owen, “Adaptive
stress testing of airborne collision avoidance systems,” in Digital Avionics Systems Confer-
ence (DASC), 2015.
[41] Z. Zhang, G. Ernst, S. Sedwards, P. Arcaini, and I. Hasuo, “Two-layered falsification of hybrid
systems guided by Monte Carlo tree search,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2894–2905, 2018.
[42] M. Wicker, X. Huang, and M. Kwiatkowska, “Feature-guided black-box safety testing of deep
neural networks,” in International Conference on Tools and Algorithms for the Construction
and Analysis of Systems (TACAS), Springer, 2018, pp. 408–426.
[43] R. Delmas, T. Loquen, J. Boada-Bauxell, and M. Carton, “An evaluation of Monte-Carlo tree
search for property falsification on hybrid flight control laws,” in International Workshop on
Numerical Software Verification, Springer, 2019, pp. 45–59.
[44] K. D. Julian, R. Lee, and M. J. Kochenderfer, “Validation of image-based neural network
controllers through adaptive stress testing,” in IEEE International Conference on Intelligent
Transportation Systems (ITSC), 2020, pp. 1–7.
[45] T. Akazaki, S. Liu, Y. Yamagata, Y. Duan, and J. Hao, “Falsification of cyber-physical systems
using deep reinforcement learning,” in International Symposium on Formal Methods, Springer,
2018, pp. 456–465.
[46] M. Koren, S. Alsaif, R. Lee, and M. J. Kochenderfer, “Adaptive stress testing for autonomous
vehicles,” in IEEE Intelligent Vehicles Symposium, 2018.
[47] A. Corso, P. Du, K. Driggs-Campbell, and M. J. Kochenderfer, “Adaptive stress testing with
reward augmentation for autonomous vehicle validation,” in IEEE International Conference
on Intelligent Transportation Systems (ITSC), 2019, pp. 163–168.
[48] M. Koren and M. J. Kochenderfer, “Efficient autonomy validation in simulation with adap-
tive stress testing,” in IEEE International Conference on Intelligent Transportation Systems
(ITSC), 2019, pp. 4178–4183.
[49] V. Behzadan and A. Munir, “Adversarial reinforcement learning framework for benchmark-
ing collision avoidance mechanisms in autonomous vehicles,” IEEE Intelligent Transportation
Systems Magazine, 2019.
[50] S. Kuutti, S. Fallah, and R. Bowden, “Training adversarial agents to exploit weaknesses in
deep control policies,” in IEEE International Conference on Robotics and Automation (ICRA),
IEEE, 2020, pp. 108–114.
[51] M. Koren and M. J. Kochenderfer, “Adaptive stress testing without domain heuristics using
go-explore,” in IEEE International Conference on Intelligent Transportation Systems (ITSC),
IEEE, 2020.
[52] X. Qin, N. Arechiga, A. Best, and J. Deshmukh, “Automatic testing and falsification with
dynamically constrained reinforcement learning,” arXiv preprint arXiv:1910.13645, 2019.
[53] R. Lee, O. J. Mengshoel, A. Saksena, R. W. Gardner, D. Genin, J. Silbermann, M. Owen, and
M. J. Kochenderfer, “Adaptive stress testing: Finding likely failure events with reinforcement
learning,” Journal of Artificial Intelligence Research, vol. 69, pp. 1165–1201, 2020.
[54] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune, “Go-explore: A new approach
for hard-exploration problems,” arXiv preprint arXiv:1901.10995, 2019.
[55] M. J. Kochenderfer, Decision Making Under Uncertainty. MIT Press, 2015, ch. Model Uncer-
tainty, pp. 113–132.
[56] L. Kocsis and C. Szepesvári, “Bandit based Monte Carlo planning,” in European Conference
on Machine Learning (ECML), 2006.
[57] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S.
Tavener, D. Perez, S. Samothrakis, and S. Colton, “A survey of Monte Carlo tree search
methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1,
pp. 1–43, 2012.
[58] A. Couëtoux, J.-B. Hoock, N. Sokolovska, O. Teytaud, and N. Bonnard, “Continuous upper
confidence trees,” in Learning and Intelligent Optimization (LION), Springer, 2011, pp. 433–
445.
[59] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[60] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9,
no. 8, pp. 1735–1780, 1997.
[61] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align
and translate,” in International Conference on Learning Representations (ICLR), 2015.
[62] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimiza-
tion algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[63] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continu-
ous control using generalized advantage estimation,” in International Conference on Learning
Representations (ICLR), 2016.
[64] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory
and application to reward shaping,” in International Conference on Machine Learning (ICML),
vol. 99, 1999, pp. 278–287.
[65] Wikimedia Commons, Schematic drawing of an inverted pendulum on a cart, https://commons.wikimedia.org/wiki/File:Cart-pendulum.svg, 2012.
[66] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve
difficult learning control problems,” IEEE Transactions on Systems, Man, and Cybernetics,
no. 5, pp. 834–846, 1983.
[67] M. Koren, X. Ma, A. Corso, R. J. Moss, P. Du, K. Driggs-Campbell, and M. J. Kochenderfer,
“AST Toolbox: an adaptive stress testing framework for validation of autonomous systems,”
Journal of Open Source Software, 2021, In Review.
[68] M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states in empirical observations
and microscopic simulations,” Physical Review E, vol. 62, no. 2, pp. 1805–1824, Aug. 2000.
[69] P. C. Mahalanobis, “On the generalised distance in statistics,” Proceedings of the National
Institute of Sciences of India, vol. 2, no. 1, pp. 49–55, 1936.
[70] M. J. Kochenderfer, J. E. Holland, and J. P. Chryssanthacopoulos, “Next-generation airborne
collision avoidance system,” Massachusetts Institute of Technology, Lincoln Laboratory,
Lexington, MA, Tech. Rep., 2012.
[71] J. Kuchar and A. C. Drumm, “The traffic alert and collision avoidance system,” Lincoln
Laboratory Journal, vol. 16, no. 2, p. 277, 2007.
[72] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in International Conference
on Learning Representations (ICLR), 2014.
[73] R. Lee, O. Mengshoel, A. Saksena, R. Gardner, D. Genin, J. Brush, and M. J. Kochenderfer,
“Differential adaptive stress testing of airborne collision avoidance systems,” in AIAA Modeling
and Simulation Technologies Conference, 2018, p. 1923.
[74] California Department of Transportation, California manual on uniform traffic control devices,
2014, Revision 2.
[75] J. Sklansky, “Optimizing the dynamic parameters of a track-while-scan system,” RCA Review,
vol. 18, no. 2, pp. 163–185, Jun. 1957.
[76] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep rein-
forcement learning for continuous control,” in International Conference on Machine Learning
(ICML), 2016.
[77] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “On a formal model of safe and scalable
self-driving cars,” arXiv preprint arXiv:1708.06374, 2017.
[78] T. Peng, Uber AI beats Montezuma’s Revenge (video game), https://medium.com/syncedreview/uber-ai-beats-montezumas-revenge-video-game-dee33417a56e, 2018.
[79] T. Salimans and R. Chen, “Learning Montezuma’s Revenge from a single demonstration,”
arXiv preprint arXiv:1812.03381, 2018.
[80] F. A. Gers and J. Schmidhuber, “Recurrent nets that time and count,” in IEEE International
Joint Conference on Neural Networks, 2000, pp. 189–194.
[81] M. Itkina, K. Driggs-Campbell, and M. J. Kochenderfer, “Dynamic environment prediction in
urban scenes using recurrent representation learning,” in IEEE International Conference on
Intelligent Transportation Systems (ITSC), 2019.
[82] D. Nuss, S. Reuter, M. Thom, T. Yuan, G. Krehl, M. Maile, A. Gern, and K. Dietmayer, “A
random finite set approach for dynamic occupancy grid maps with real-time application,” The
International Journal of Robotics Research, vol. 37, no. 8, pp. 841–866, 2018.
[83] S. Hoermann, M. Bach, and K. Dietmayer, “Dynamic occupancy grid prediction for urban
autonomous driving: A deep learning approach with fully automatic labeling,” in IEEE Inter-
national Conference on Robotics and Automation (ICRA), 2018.
[84] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba,
OpenAI Gym, arXiv preprint arXiv:1606.01540, 2016.
[85] The garage contributors, Garage: A toolkit for reproducible reinforcement learning research,
https://github.com/rlworkgroup/garage, 2019.
[86] Y. Annapureddy, C. Liu, G. Fainekos, and S. Sankaranarayanan, “S-TaLiRo: A tool for tem-
poral logic falsification for hybrid systems,” in International Conference on Tools and Al-
gorithms for the Construction and Analysis of Systems (TACAS), Springer, 2011, pp. 254–257.
[87] B. Hoxha, H. Abbas, and G. Fainekos, “Benchmarks for temporal logic requirements for auto-
motive systems,” in ARCH14-15. 1st and 2nd International Workshop on Applied veRification
for Continuous and Hybrid Systems, ser. EPiC Series in Computing, vol. 34, 2015, pp. 25–30.
[88] G. Ernst, P. Arcaini, A. Donzé, G. Fainekos, L. Mathesen, G. Pedrielli, S. Yaghoubi, Y. Ya-
magata, and Z. Zhang, “ARCH-COMP 2019 category report: Falsification,” in International
Workshop on Applied Verification of Continuous and Hybrid Systems, ser. EPiC Series in
Computing, vol. 61, 2019, pp. 129–140.
[89] K. Leung, N. Arechiga, and M. Pavone, “Back-propagation through signal temporal logic spec-
ifications: Infusing logical structure into gradient-based methods,” in Workshop on Algorithmic
Foundations of Robotics, 2020.
[90] R. J. Moss, R. Lee, N. Visser, J. Hochwarth, J. G. Lopez, and M. J. Kochenderfer, “Adaptive
stress testing of trajectory predictions in flight management systems,” in Digital Avionics
Systems Conference (DASC), 2020, pp. 1–10.
[91] R. Lee, J. Puig-Navarro, A. K. Agogino, D. Giannakopoulou, O. J. Mengshoel, M. J. Kochen-
derfer, and B. D. Allen, “Adaptive stress testing of trajectory planning systems,” in AIAA
Scitech, 2019, p. 1454.
[92] R. Lee, O. J. Mengshoel, and M. J. Kochenderfer, “Adaptive stress testing of safety-critical
systems,” in Safe, Autonomous and Intelligent Vehicles, Springer, 2019, pp. 77–95.
[93] A. Corso and M. J. Kochenderfer, “Interpretable safety validation for autonomous vehicles,” in
IEEE International Conference on Intelligent Transportation Systems (ITSC), 2020, pp. 1–6.
[94] R. Lee, M. J. Kochenderfer, O. J. Mengshoel, and J. Silbermann, “Interpretable categorization
of heterogeneous time series data,” in International Conference on Data Mining, SIAM, 2018,
pp. 216–224.