APPROXIMATE METHODS FOR VALIDATING AUTONOMOUS
SYSTEMS IN SIMULATION
A DISSERTATION
SUBMITTED TO THE DEPARTMENT OF AERONAUTICS AND
ASTRONAUTICS
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
Mark Koren
May 2021
© 2021 by Mark Koren. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/pv383pd8838
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Mykel Kochenderfer, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
J Gerdes
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Dorsa Sadigh
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Ritchie Lee
Approved for the Stanford University Committee on Graduate Studies.
Stacey F. Bent, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
Abstract
Because of the safety-critical nature of many autonomous systems, validation is essential before
deployment. However, validation is difficult—most of these systems act in high-dimensional spaces
that make formal methods intractable, and failures are too rare to rely on physical testing. Instead,
systems must be validated approximately in simulation. How to perform this validation tractably
while still ensuring safety is an open problem.
One approach to validation is adaptive stress testing (AST), where finding the most-likely failure
in simulation is formulated as a Markov decision process (MDP). Reinforcement learning techniques
can then be used to validate a system through falsification. We are interested in validating agents
that act in large, continuous, and complex spaces. Consequently, it is almost always the case that
forcing a failure is possible. Optimizing to find the most likely failure improves the relevance of
the failures uncovered, and provides valuable information to designers. This thesis presents two
new techniques for solving the MDP to find failures: 1) a deep reinforcement learning (DRL) based
approach and 2) a go-explore (GE) based approach.
Scalability is key to efficiently validating an autonomous agent, for which large, continuous
state and action spaces lead to a dimensional explosion in possible scenario rollouts. This problem is
exacerbated by the fact that designers are often interested in a space of similar test scenarios starting
from slightly different initial conditions. Running a validation method many times from different
initial conditions could quickly become intractable. DRL has been shown to perform better than
traditional reinforcement learning techniques, such as Monte Carlo tree search (MCTS), on problems
with continuous state spaces. In addition to scalability advantages, DRL can use recurrent networks
to explicitly capture the sequential structure of a policy. This thesis presents a DRL reinforcement
learner for AST based on recurrent neural networks (RNNs). By using an RNN, the reinforcement
learner learns a policy that generalizes across initial conditions, while also providing the scalability
advantages of deep learning.
While DRL techniques scale well, they also rely on the existence of a constant reward signal
to guide the agent towards better solutions during training. For validation, domain experts can
sometimes provide a heuristic that will guide the reinforcement learner towards failures. However,
without such a heuristic, the problem becomes a hard-exploration problem. GE has shown state-of-the-art results on traditional hard-exploration benchmarks such as Montezuma's Revenge. This
thesis uses the tree search phase of go-explore to find failures without heuristics in domains where
DRL and MCTS do not find failures. In addition, this thesis shows that phase 2 of go-explore,
the backward algorithm, can often be used to improve the likelihood of failures found by any
reinforcement learning method, with or without heuristics.
Autonomous vehicles are an example of an autonomous system that acts in a large, continuous
state space. In addition, failures are rare events for autonomous vehicles, with some experts proposing that they will not be safe enough until they crash only once every $1.0 \times 10^9$ miles. Consequently,
validating the safety of autonomous systems generally requires the use of high-fidelity simulators
that adequately capture the variability of real-world scenarios. However, it is generally not feasible
to exhaustively search the space of simulation scenarios for failures. AST uses reinforcement learning to find the most likely failure of a system. This thesis
presents a way of using low-fidelity simulation rollouts—generally much cheaper and faster to generate—to reduce the number of high-fidelity simulation rollouts needed to find failures, which allows us to validate autonomous vehicles at scale in high-fidelity simulators.
As autonomous systems become more prevalent, and their development more widespread and
distributed, validation techniques likewise must become widely available. Towards that end, the
final contribution of this thesis is the AST Toolbox, an open-source Python package for applying
AST to any autonomous system. The Toolbox contains pre-implemented MCTS, DRL, and GE
reinforcement learners to allow designers to apply the work of this thesis to validating their own
systems. In addition, the Toolbox provides templates to simplify the process of wrapping the system
and simulator in a format that is conducive to reinforcement learning.
Acknowledgments
Finishing a PhD is never easy (and even if it was, I would never underrate my own achievement by
admitting so). However, I owe sincere thanks to many wonderful people for making my PhD process
as smooth and enjoyable as it could possibly be.
First and foremost, I have to thank my advisor, Mykel Kochenderfer. As one would expect, I owe
so much of my educational and professional development to his tutelage. However, I certainly didn’t
anticipate how influential he would be regarding my personal development. Mykel has an iron-clad
commitment to empathy and kindness, and he shows by his actions that being a good person and
being a successful person are not at all contradictory goals. His job was to help me become the
latter, and he did, but in truth, I value his example on being the former even more. Thank you,
Mykel.
I also owe gratitude to all the brilliant people who agreed to be on my committee. Dorsa
Sadigh’s faculty talk was my first exposure at Stanford to safety validation of autonomous vehicles
as a research area that promised to be fascinating and rewarding. Chris Gerdes and I became
acquainted when he was running a research group on the ethics of autonomous vehicles, and our
meetings were hugely influential in the direction in which I ended up taking my thesis. Ritchie
Lee wrote the first adaptive stress testing paper, and in many ways that paper became the basis
of my entire thesis. And while I didn’t personally work on formal methods, Clark Barrett’s work
in the area has always been something I followed closely, as the features of his work and my thesis
complement each other in wonderful ways. Thank you all for your incredibly useful feedback and
advice throughout this process.
I would also like to thank the people who helped put me on the path that led me to Stanford. At
the University of Alabama (roll tide!), John Baker, Jane Batson, Gary Cheng, Darren Evans-Young,
Paul Hubner, Amy Lang, and Shane Sharpe were all instrumental in preparing me to be a successful
researcher and engineer. Preceding them, I might not even have made it to Alabama if not for the
guidance I received at Fremd High School from Karen Clindaniel, Paul Hardy, and Michael Karasch,
nor would I be the person I am today without the influence of Steven Buenning and LoriAnne Frieri.
Finally, a quick shout-out to a person I can only describe as my “engineering uncle,” John Koepke.
Thank you all so much for making me who I am today.
I would be remiss if I didn’t thank the people and organizations who financially made my time
at Stanford a reality. Thank you to Stanford and the Moler family for funding me as the Dr. Cleve
B. Moler Fellow. Thank you to Uber ATG and NVIDIA, as well as all the amazing folks I worked
with there, for both the funding and the internship opportunities. Additionally, thank you to the
Toyota Research Institute for funding my final quarter.
Finally, thank you to all of my friends and family, especially my mom, my dad, and my brother.
It seems unfair, considering how much you all have done for me, to only devote a few sentences to
thanking you. Alas, I don’t have room to thank all of you individually as you deserve, so I will leave
it at this: whether in small or large ways, every single one of you has influenced me for the better,
and I thank you for that.
Contents
Abstract
Acknowledgments
1 Introduction
1.1 Challenges
1.2 The Crosswalk—A Running Example
1.3 Black-Box Safety Validation
1.3.1 Optimization
1.3.2 Path Planning
1.3.3 Importance Sampling
1.3.4 Reinforcement Learning
1.4 Adaptive Stress Testing
1.5 Contributions
1.6 Overview
2 Adaptive Stress Testing
2.1 Sequential Decision Making
2.1.1 Markov Decision Process
2.1.2 Monte Carlo Tree Search
2.1.3 Deep Reinforcement Learning
2.2 Preliminaries
2.3 Problem Formulation
2.4 Reinforcement Learning
2.4.1 Reward Function
2.4.2 Proofs of Desirable Properties
2.5 Methodology
2.5.1 Direct Disturbance Control
2.5.2 Seed-Disturbance Control
2.6 Case Studies
2.6.1 Cartpole with Disturbances
2.6.2 Autonomous Vehicle at a Crosswalk
2.6.3 Aircraft Collision Avoidance Software
2.7 Discussion
3 Scalable Validation
3.1 Fully Connected Approach
3.2 Preserving the Black-box Simulator Assumption
3.3 Experiments
3.3.1 Simulator Design
3.3.2 Problem Formulation
3.4 Results
3.4.1 Performance
3.4.2 Trajectories
3.5 Discussion
4 Generalizing across Initial Conditions
4.1 Modified Recurrent Architecture
4.2 Experiments
4.2.1 Problem Formulation
4.2.2 Modified Reward Function
4.2.3 Reinforcement Learners
4.3 Results
4.3.1 Overall Performance
4.3.2 Comparison to Baselines
4.4 Discussion
5 Heuristic Rewards
5.1 Go-Explore
5.1.1 Phase 1
5.1.2 Phase 2
5.2 Go-Explore for Black-Box Validation
5.2.1 Cell Structure
5.2.2 Cell Selection
5.3 Experiments
5.3.1 Problem Description
5.3.2 Modified Reward Function
5.3.3 Reinforcement Learners
5.4 Results
5.5 Discussion
6 Robustification
6.1 The Backward Algorithm
6.2 Robustification
6.3 Experiments
6.4 Results
6.5 Discussion
7 Validation in High-Fidelity
7.1 Validation in High-Fidelity
7.2 Case Studies
7.2.1 Case Study: Time Discretization
7.2.2 Case Study: Dynamics
7.2.3 Case Study: Tracker
7.2.4 Case Study: Perception
7.2.5 Case Study: NVIDIA DriveSim
7.3 Discussion
8 The AST Toolbox
8.1 Comparison to Existing Software
8.2 AST Toolbox Design
8.2.1 Architecture Overview
8.2.2 Reinforcement Learners
8.2.3 Simulation Interface
8.2.4 Reward Structure
8.3 Case Studies
8.3.1 Cartpole
8.3.2 Autonomous Vehicle
8.3.3 Automatic Transmission
8.4 Discussion
9 Summary and Future Work
9.1 Summary
9.2 Contributions
9.3 Further Work
9.3.1 Full Generalization
9.3.2 Fault Injection in Vision Systems
9.3.3 Interpretability
List of Tables
3.1 Numerical results from both reinforcement learners. Reward without noise shows the reward of the MCTS path if sensor noise was set to zero, to illustrate the difficulty that MCTS has with eliminating noise. DRL is able to find a more probable path than MCTS with a large reduction in calls to the Step function.
4.1 The initial condition space. Initial conditions are drawn from a continuous uniform distribution defined by the supports below.
4.2 The aggregate results of the DRDRL and GRDRL reinforcement learners, as well as the MCTS and FCDRL reinforcement learners as baselines, on an autonomous driving scenario with a 5-dimensional initial condition space. Despite not having access to the simulator’s internal state, the DRDRL reinforcement learner achieves results that are competitive with both baselines. However, the GRDRL reinforcement learner demonstrates a significant improvement over the other three reinforcement learners.
5.1 Parameters that define the easy, medium, and hard scenarios. Changing the pedestrian location results in failures being further from the average action, making exploration more difficult, whereas changing the horizon and timestep lengths makes exploration more complex.
7.1 The results of the time discretization case study. Load Lofi Policy indicates whether the policy of the BA was initialized from scratch or the weights were loaded from the policy trained in lofi. The BA was able to significantly reduce the number of hifi steps needed, and loading the lofi policy produced a further reduction.
7.2 The results of the dynamics case study. Load Lofi Policy indicates whether the policy of the BA was initialized from scratch or the weights were loaded from the policy trained in lofi. The BA was able to significantly reduce the number of hifi steps needed, and loading the lofi policy produced a further reduction.
7.3 The results of the tracker case study. Load Lofi Policy indicates whether the policy of the BA was initialized from scratch or the weights were loaded from the policy trained in lofi. The BA was able to significantly reduce the number of hifi steps needed, and loading the lofi policy produced a further reduction.
7.4 The results of the perception case study. Due to differences in network sizes, resulting from different disturbance vector sizes, it was not possible to load a lofi policy in this case study. The BA was still able to significantly reduce the number of hifi steps needed.
8.1 A feature comparison of the AST Toolbox with three existing software solutions for system validation and verification. The AST Toolbox is unique in two features: 1) being able to treat the entire simulator as a black box, and 2) returning the most likely failure.
List of Figures
1.1 A general layout of the running crosswalk example. The ego vehicle approaches a crosswalk where a pedestrian is trying to cross. The ego vehicle estimates the pedestrian location based on noisy observations.
1.2 A naive approach to validation. The pedestrian is constrained to cross the street in a straight line. Different pedestrian velocities, shown as different sized arrows, may be simulated.
1.3 A validation approach that captures the full variance of real-world possibilities. The pedestrian is unconstrained and can move in any direction, as shown by the blue circle.
1.4 The general validation problem. A model of the system under test interacts with an environment, both of which are contained in a simulator. An adversary perturbs the simulation with disturbances in an effort to force failures.
1.5 The AST methodology. The simulator is treated as a black box. The reinforcement learner interacts with the simulator through disturbances and receives a reward. Maximizing reward results in the most likely failure path.
1.6 A graphical overview of the chapters in this thesis, organized by the category of the contribution in each chapter.
2.1 The AST methodology. The simulator is treated as a black box. The reinforcement learner interacts with the simulator through disturbances and receives a reward. Maximizing reward results in the most likely failure path.
2.2 The AST process using direct disturbance control. The reinforcement learner controls the simulator directly with disturbances, which are used by the reward function to calculate reward.
2.3 The AST process using seed-disturbance control. The reinforcement learner controls the simulator by outputting a seed for the random number generators to use. The reward function uses the transition likelihoods from the simulator to calculate reward.
2.4 Layout of the cartpole environment. A control policy tries to keep the bar from falling over or the cart from moving too far horizontally by applying a control force to the cart [65].
2.5 Layout of the autonomous vehicle scenario. A vehicle approaches a crosswalk on a neighborhood road as a single pedestrian attempts to walk across. Initial conditions are shown.
2.6 An example result from Lee, Mengshoel, Saksena, et al. [53], showing an NMAC identified by AST. Note that the planes must be both vertically and horizontally near to each other to register as an NMAC.
3.1 The network architecture of the fully-connected DRL reinforcement learner. A number of hidden layers learn to map the simulation state $s_t$ to $x$, the mean and diagonal covariance of a multivariate normal distribution. The disturbance $x$ is then sampled from the distribution.
3.2 The network architecture of the recurrent DRL reinforcement learner. An LSTM learns to map the previous disturbance $x_t$ and the previous hidden state $h_t$ to the next hidden state $h_{t+1}$ and to $x$, the mean and diagonal covariance of a multivariate normal distribution. The disturbance $x_{t+1}$ is then sampled from the distribution.
3.3 The modular simulator implementation. The modules of the simulator can be easily swapped to test different scenarios, SUTs, or sensor configurations.
3.4 A comparison of the reinforcement learner methods. MCTS uses a seed to control a random number generator. DRL outputs a distribution, which is then sampled. Both of these methods produce disturbances.
3.5 The setup of the three crosswalk scenarios.
3.6 Pedestrian motion trajectories for each scenario and algorithm. The collision point is the point of contact between the vehicle and the pedestrian. In scenario 3, pedestrian 1 does not collide with the vehicle.
4.1 An example of a scenario class for the crosswalk example. The ego vehicle and the pedestrian both have a range of initial conditions for their position and velocity. A concrete scenario instantiation could be created by sampling specific initial condition values from the ranges.
4.2 Contrasting the new and old AST architectures. The new reinforcement learner uses a recurrent architecture and is able to generalize across a continuous space of initial conditions with a single trained instance. These improvements allow AST to be used on problems that would have previously been intractable.
4.3 The network architecture of the generalized recurrent DRL reinforcement learner. An LSTM learns to map the input—a concatenation of the previous disturbance $x_t$ and the simulator’s initial conditions $s_0$—and the previous hidden state $h_t$ to the next hidden state $h_{t+1}$ and to $x$, the mean and diagonal covariance of a multivariate normal distribution. The disturbance $x_{t+1}$ is then sampled from the distribution.
4.4 The crosswalk scenario class. To instantiate a concrete scenario, the initial conditions $s_{0,\text{ped}}$, $s_{0,\text{car}}$, $v_{0,\text{ped}}$, and $v_{0,\text{car}}$ are drawn from their respective ranges, defined in table 4.1.
4.5 The Mahalanobis distance of the most likely failure found at each iteration for both architectures. The conservative discrete architecture runs each of the discrete reinforcement learners in sequential order. The optimistic discrete architecture runs each of the discrete reinforcement learners in a single batch.
5.1 The path to the first reward in the Atari 2600 version of Montezuma’s Revenge [78]. The player must take the numbered steps in order, without dying, before getting the first key.
5.2 The layout of the crosswalk example scenario. The car approaches the road where a pedestrian is trying to cross. Initial conditions are shown, and values for $s_{0,\text{ped},y}$ can be found in table 5.1.
5.3 The reward of the most likely failure found at each iteration of the GE, DRL, and MCTS reinforcement learners in the easy scenario. Results are cropped to only show results when a failure was found.
5.4 The reward of the most likely failure found at each iteration of the GE and MCTS reinforcement learners in the medium scenario. The DRL reinforcement learner was unable to find a failure. Results are cropped to show results only when a failure was found.
5.5 The reward of the most likely failure found at each iteration of the GE reinforcement learner in the hard scenario. The DRL and MCTS reinforcement learners were unable to find a failure.
6.1 The reward of the most likely failure found at each iteration of the GE, DRL, and MCTS reinforcement learners in the easy scenario, as well as GE+BA, DRL+BA, and MCTS+BA. The dashed lines indicate the respective scores after robustification of each reinforcement learner. Results are cropped to show results only when a failure was found.
6.2 The reward of the most likely failure found at each iteration of the GE and MCTS reinforcement learners in the medium scenario, as well as GE+BA and MCTS+BA. The dashed lines indicate the respective scores after robustification of each reinforcement learner. The DRL reinforcement learner was unable to find a failure. Results are cropped to show results only when a failure was found.
6.3 The reward of the most likely failure found at each iteration of the GE reinforcement learner in the hard scenario, as well as GE+BA. The dashed line indicates the score after robustification of the GE reinforcement learner. The DRL and MCTS reinforcement learners were unable to find a failure.
7.1 Layout of the crosswalk example. A car approaches a crosswalk on a neighborhood road with one lane in each direction. A pedestrian is attempting to cross the street at the crosswalk. Initial conditions are shown.
7.2 Example rendering of an intersection from NVIDIA’s Drivesim simulator, an industry example of a high-fidelity simulator.
8.1 The AST Toolbox framework architecture. The core concepts of the method are shown, as well as their associated abstract classes. ASTEnv combines the simulator and reward function in a gym environment. The reinforcement learner is implemented using the garage package.
8.2 Layout of the cartpole environment. A control policy tries to keep the bar from falling over, or the cart from moving too far horizontally, by applying a control force to the cart [65].
8.3 Best return found up to each iteration. The value is averaged over 10 different trials. Both the MCTS and DRL reinforcement learners are able to find failures, but the DRL reinforcement learner is more computationally efficient.
8.4 Layout of the autonomous vehicle scenario. A vehicle approaches a crosswalk on a neighborhood road as a single pedestrian attempts to walk across. Initial conditions are shown.
8.5 Reward of the most likely failure found at each iteration. The Batch Max is the maximum per-iteration summed Mahalanobis distance. The Cumulative Max is the best Batch Max up to that iteration. The reinforcement learner finds the best solution by iteration 6 out of 80.
8.6 The average return at each iteration. The Batch Average is the average return from each trajectory in an iteration, while the Cumulative Max Average is the maximum Batch Average so far. The reinforcement learner is mostly converged by iteration 10, although there are slight improvements later. The large returns indicate that not every trajectory is ending in a collision.
8.7 The results of the automatic transmission case study, averaged over 10 trials. The DRL reinforcement learner is able to outperform both MCTS and a random search baseline.
Chapter 1
Introduction
Recent years have seen an explosion of work on solving real-world problems with autonomous systems.
Engineers have proposed using autonomous systems for applications such as delivering packages with
quadcopters, delivering food orders with robots, providing security, and growing food. Autonomous
systems are even being proposed in safety-critical domains, for example autonomous vehicles. In
order for autonomous systems to deliver on their promising potential across these domains, they
will need to be safe. Designing the systems to be safe is not enough; safety validation—the process
of proving that a system is as safe as it is purported or required to be—is essential. Unfortunately,
validating the safety of autonomous systems is also a serious challenge.
1.1 Challenges
For many autonomous systems, validation through real-world testing is infeasible. For example, it
has been estimated that a fleet of autonomous vehicles would have to drive over 5 billion miles to
demonstrate safety equivalent to that of commercial airplanes [1]. That estimate only gets worse
when we consider that the 5 billion miles must be representative of the types of driving the vehicles
will be expected to do in operation—testing only on sunny highways, for example, does not provide
any validation of how the car performs in a snowy mountain neighborhood. Worse, this testing might
have to be redone for every update in the self-driving vehicle’s software. Due to the infeasibility
of validation through real-world testing, many designers of autonomous systems are turning to
simulation.
Validation in simulation presents its own challenges, however. First, autonomous systems are
complex, and may include components like large deep-learning networks. As a consequence, it may
be impossible in many cases to use formal methods to prove safety guarantees. Second, many
autonomous systems act in continuous, high-dimensional state and action spaces. As a result, there
are far too many possible rollout trajectories to exhaustively simulate all possible outcomes. Third,
Figure 1.1: A general layout of the running crosswalk example. The ego vehicle approaches a crosswalk where a pedestrian is trying to cross. The ego vehicle estimates the pedestrian location based on noisy observations.
some safety-critical autonomous systems require high levels of safety. Due to this requirement, we
may be interested in finding rare or low-likelihood failures. The rarity of failures complicates the
use of approximate methods for validating autonomous systems. Fourth, in order to have confidence
in the results of validation, simulators must be highly accurate representations of the real world,
which may require features like simulated perception or software-in-the-loop (SiL) simulation. The
downside to these desirable features is computational cost—high-fidelity simulators are often slow
and expensive to run. Consequently, designers are often stuck with a tradeoff between the robustness
of validation and computational cost. Finally, autonomous systems are often large and complex,
and therefore access to internal state variables and processes may be limited and challenging to
implement. If testing is being performed by a third-party or government institution, internal state
access may also be impossible for legal reasons. Therefore, we want to be able to validate autonomous
systems in a way that requires limited or no access to the internal state of the system-under-test,
an approach known as black-box validation. To contextualize these challenges, we now present a
real-world example.
1.2 The Crosswalk—A Running Example
Consider the case of an autonomous vehicle approaching a crosswalk on a neighborhood road—a
scenario that will serve as a running example through this thesis. The basic crosswalk layout is
shown in fig. 1.1. The autonomous vehicle is approaching at 25 mph to 34 mph, and the pedestrian
is on the side of the road, near the crosswalk entrance. In order to make validation tractable, the
current approach might be to constrain the pedestrian to move in a straight line across the crosswalk,
as shown in fig. 1.2. Designers could then do a grid search over different pedestrian velocities and
establish the range of pedestrian velocities that their vehicle can safely handle. However, this
simplification is far too restrictive to guarantee safety.
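For concreteness, the naive sweep just described could be expressed in a few lines, sketched below under an assumed interface: `simulate` is a hypothetical black box that rolls out one crossing with the pedestrian walking straight across at a fixed speed and reports whether the vehicle avoided a collision.

```python
import numpy as np

def naive_grid_validation(simulate, velocities=np.linspace(0.5, 2.0, 16)):
    """Sweep straight-line pedestrian crossings over a grid of speeds (m/s)."""
    # Each rollout constrains the pedestrian to a straight line at speed v;
    # `simulate(v)` is assumed to return True when the vehicle avoids a collision.
    return {float(v): simulate(v) for v in velocities}
```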
In the real world, pedestrians can and will take a massively diverse range of trajectories while
Figure 1.2: A naive approach to validation. The pedestrian is constrained to cross the street in a straight line. Different pedestrian velocities, shown as different sized arrows, may be simulated.
Figure 1.3: A validation approach that captures the full variance of real-world possibilities. The pedestrian is unconstrained and can move in any direction, as shown by the blue circle.
crossing the street, as shown in fig. 1.3. While many of these trajectories are unlikely, they are not
so unlikely that we can neglect them completely. However, there are so many different instantiations
of unlikely pedestrian trajectories that we would never be able to simulate them all through brute
force methods. Furthermore, many failures are unlikely enough to make them hard to find, but
likely enough that—in aggregate—they can still result in a high likelihood of system failure, the so-called heavy-tail problem [2]. We need a way to approximately search the space of possible rollouts
to make validation tractable, while still preserving the black-box assumption introduced earlier. We
are therefore interested in the field of black-box safety validation.
1.3 Black-Box Safety Validation
The general field of safety validation contains numerous sub-fields. For example, the field of formal
verification focuses on extracting provable guarantees about a system. We are interested specifically
in the sub-field of black-box safety validation, which is the process of ensuring the safety of a system
Figure 1.4: The general validation problem. A model of the system under test interacts with an environment, both of which are contained in a simulator. An adversary perturbs the simulation with disturbances in an effort to force failures.
with limited or no access to the internal state of the system. In contrast, white-box safety validation
would be the process of ensuring the safety of a system with full access to the internal state of the
system.
The general problem setup for black-box validation is shown in fig. 1.4. A model $\mathcal{M}$ of the system under test (SUT) takes actions $a$ in, and receives observations $o$ from, the environment $\mathcal{E}$. Both the system and the environment are contained in a simulator. Simultaneously, an adversary $\mathcal{A}$ that might have access to the environment state $s$ outputs disturbances $x$. The disturbances control
the environment, and they are chosen by the adversary with the goal of forcing a failure in the SUT.
We briefly cover some of the existing methods of black-box validation (see Corso, Moss, Koren, et
al. [3] for an in-depth survey of black-box validation).
1.3.1 Optimization
One approach to efficiently finding failures is to define a cost function that measures the level of safety of the system over the duration of an environment rollout $(s_0, \ldots, s_T)$. Once a cost function is defined, the black-box validation problem can be solved as an optimization problem. While a cost function is often application-specific, much work has been done on creating cost functions using temporal logic expressions for a range of domains [4]–[11], and a range of algorithms have been applied to black-box validation optimization problems (a minimal sketch follows the list below):
• Simulated annealing is a stochastic global optimization method that can perform well on
problems with many local minima, and therefore has been shown to be effective for black-box
validation [12], [13].
• Genetic algorithms mimic a basic model of genetic evolution, and they have been used to solve
black-box validation problems [14], [15].
• Bayesian optimization methods select disturbances that are likely to lower the cost function
by building a probabilistic surrogate model of the cost function over the space of disturbance
trajectories. Bayesian optimization is designed to handle stochastic objective functions and
uncertainty and has been shown to perform well on validation problems [10], [16]–[20].
• Extended ant-colony optimization, a probabilistic technique for solving continuous optimization problems inspired by the way certain ant species leave pheromone traces while searching for food, has also been successfully used to perform black-box validation [21].
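As a rough illustration of the optimization framing (not a reproduction of any one cited method), the sketch below minimizes a hypothetical user-supplied `cost` function over a flattened disturbance trajectory using SciPy's simulated annealing routine:

```python
import numpy as np
from scipy.optimize import dual_annealing

def search_for_failure(cost, horizon, disturbance_dim, bound=3.0):
    """Minimize a safety cost over disturbance trajectories (sketch).

    `cost` is assumed to map a flattened disturbance trajectory x_{0:T-1}
    (length horizon * disturbance_dim) to a scalar safety measure over the
    induced rollout (s_0, ..., s_T); lower values indicate less safe rollouts.
    """
    bounds = [(-bound, bound)] * (horizon * disturbance_dim)
    result = dual_annealing(cost, bounds, seed=0)  # global stochastic search
    return result.x.reshape(horizon, disturbance_dim), result.fun
```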
1.3.2 Path Planning
Another approach to finding failures is to treat the problem as one of path planning: find the best trajectory through the state space, using disturbances as control inputs. The disturbance trajectory is sequentially built to reach $E$, the subset of the state space containing failures, from the initial state $s_0$. Rapidly-exploring random tree (RRT) is one of the most popular path planning
algorithms, and has been frequently applied to black-box safety validation [7], [22]–[29]. However,
other approaches like multiple shooting methods [30], [31] and Las Vegas tree search (LVTS) [8]
have also been explored.
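The sketch below gives the skeleton of an RRT-style search in this setting, under assumed interfaces: `simulator.replay(path)` deterministically replays a disturbance sequence and returns the resulting state, while `sample_state`, `sample_disturbance`, `distance`, and `in_failure_set` are hypothetical problem-specific helpers. It is a simplified variant that extends the nearest node with a random disturbance rather than the best control toward the sample:

```python
def rrt_failure_search(simulator, sample_state, sample_disturbance,
                       distance, in_failure_set, n_iters=10_000):
    """Minimal RRT over disturbance trajectories (sketch)."""
    tree = [(simulator.replay([]), [])]       # nodes: (state, disturbance path)
    for _ in range(n_iters):
        target = sample_state()               # random point in the state space
        _, path = min(tree, key=lambda node: distance(node[0], target))
        x = sample_disturbance()              # disturbance as a control input
        new_state = simulator.replay(path + [x])
        tree.append((new_state, path + [x]))
        if in_failure_set(new_state):         # reached the failure set E
            return path + [x]
    return None                               # no failure found within budget
```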
1.3.3 Importance Sampling
Many of the above methods are focused on finding failures in a system when it is hard to do so. A
different subset of black-box validation is interested in estimating the overall likelihood of system
failure. Importance sampling (IS) techniques are the most common approach to estimating the
failure probability of a system, and there is a broad library of existing work on how to best use IS
for black-box safety validation. In some cases, the extreme rarity of failures can prevent IS from
converging. In such cases, IS with adaptive sampling has been shown to successfully find failure
probabilities [32]–[36]. Validation problems can be high-dimensional; therefore, a non-parametric
version of IS that uses Markov chain Monte Carlo (MCMC) estimation to achieve better scalability
has been used to find failures in systems [37]. Another approach to finding failures is to combine
IS with sequential decision-making techniques to efficiently find the optimal importance sampling
policy for a specific state [38], [39].
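To make the core idea concrete (a toy example, not one of the cited adaptive methods), the sketch below estimates the rare-event probability $P(X > 4)$ for $X \sim \mathcal{N}(0, 1)$ by sampling from a proposal centered on the failure region and reweighting by the likelihood ratio:

```python
import numpy as np

def importance_sampling_estimate(n_samples=100_000, seed=0):
    """Estimate P(X > 4) for X ~ N(0, 1) using proposal q = N(4, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(4.0, 1.0, n_samples)         # sample from the proposal q
    # Importance weight w(x) = p(x) / q(x); the normalizing constants
    # cancel because both densities have unit variance.
    log_w = -0.5 * x**2 + 0.5 * (x - 4.0) ** 2
    return np.mean((x > 4.0) * np.exp(log_w))   # weighted failure indicator
```

Because nearly every proposal sample lands near the failure boundary, the estimator concentrates effort where naive Monte Carlo would almost never sample.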
1.3.4 Reinforcement Learning
Scalability is an issue with many of the above approaches to black-box safety validation, as we are
searching for failures in the space of disturbance trajectories, which scales exponentially with the
number of simulation timesteps. One approach, the one taken throughout this thesis, is to formulate
the problem of black-box safety validation as a Markov decision process (MDP) (see section 2.1.1).
Reinforcement learning (RL) techniques can then be used to find failures. The two most common
reinforcement learning approaches for finding failures are Monte Carlo tree search (MCTS) [40]–[44]
and deep reinforcement learning (DRL) [45]–[52]. In this thesis we will focus on an approach that
is capable of using both MCTS and DRL, or other RL algorithms, to find the most likely failure in
a system: adaptive stress testing (AST).
Figure 1.5: The AST methodology. The simulator is treated as a black box. The reinforcement learner interacts with the simulator through disturbances and receives a reward. Maximizing reward results in the most likely failure path.
1.4 Adaptive Stress Testing
In adaptive stress testing (AST), we formulate the problem of finding the most likely failure as a
Markov decision process [53]. Under this formulation, the problem can be approached with standard reinforcement learning (RL) techniques for validation. The process is shown in fig. 1.5. The reinforcement learner, which contains the RL agent, controls the simulation through disturbances.
The simulation should be deterministic with respect to the disturbances. The simulation updates
according to the disturbances and then outputs the likelihood of the timestep, and whether an event
of interest—in our application, a collision—occurred. The output from the simulator is then used
to calculate a reward, which is used by the reinforcement learner to improve the RL agent through
optimization. See chapter 2 for detailed coverage of AST.
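A minimal sketch of one such rollout is shown below; the simulator interface (`reset`, and `step` returning a per-timestep log-likelihood and event flag) and the reward shaping are hypothetical placeholders for the formulation detailed in chapter 2:

```python
def ast_rollout(simulator, learner):
    """Run one AST episode against a black-box simulator (sketch)."""
    state = simulator.reset()
    trajectory, done = [], False
    while not done:
        x = learner.act(state)                  # propose a disturbance
        state, log_likelihood, event, done = simulator.step(x)
        # Rewarding the log-likelihood of each disturbance means that
        # maximizing return favors the most likely path; reaching the
        # horizon without an event incurs a large penalty.
        reward = log_likelihood if (event or not done) else -1e4
        trajectory.append((x, reward))
    learner.update(trajectory)                  # any standard RL update
    return trajectory
```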
While vanilla AST provides a useful starting point, there are still unsolved issues that limit the cases in which we can use AST for validation. AST needs to be able to scale to high-dimensional problems, which may be an issue when using Monte Carlo tree search (MCTS). AST needs to be able to find failures on systems where finding failures may be difficult or require long search horizons.
AST needs to be efficient enough to work with high-fidelity simulations. Finally, AST needs to be
readily available for researchers and engineers to apply to their systems. The work presented in this
thesis focuses on addressing these unresolved issues.
1.5 Contributions
This thesis contributes to the field of approximate validation for autonomous systems by addressing
several limitations in AST. Contributions are highlighted in the chapters in which they are presented,
and chapter 9 summarizes the specific contributions made throughout this thesis. This section
provides a brief, high-level overview of the primary contributions. Figure 1.6 shows a graphical
representation of these contributions.
Figure 1.6: A graphical overview of the chapters in this thesis, organized by the category of the contribution in each chapter: AST utility (chapter 4, generalizing across initial conditions; chapter 6, robustification), scalability (chapter 3, scalable validation; chapter 5, heuristic rewards), and applicability (chapter 7, validation in high-fidelity; chapter 8, the AST Toolbox).
In order to be useful for validating autonomous systems, which often act in large, continuous
state and action spaces, AST reinforcement learners must be scalable. Monte Carlo tree search
may not provide the scalability needed for validating systems that act based on perception, like
autonomous vehicles. This thesis introduces a deep reinforcement learning (DRL) reinforcement
learner that is shown to have better performance and scalability. Furthermore, we show that with
slight modifications the recurrent neural-network architecture allows the DRL reinforcement learner
to generalize across initial conditions, adding an avenue to significant computational savings.
An issue for DRL approaches is their reliance on a consistent reward signal to guide them to their
goals during training. For validation tasks, heuristic rewards can be used to provide a useful signal
for the reinforcement learner by giving reward for intermediate steps that lead towards failures.
However, it may not always be possible or desirable to craft heuristic reward functions, and, without
the reward signal, DRL could struggle to find failures. MCTS may perform better, but can also
struggle on these types of hard-exploration domains if the trajectories get too long. This thesis
introduces a reinforcement learner based on go-explore [54], a state-of-the-art algorithm for hard-exploration problems, which can find failures when no heuristics are available and search horizons are long.
The accuracy required by validation tasks necessitates the use of high-fidelity (hifi) simulators
when moving away from real-world testing. However, hifi can be slow and expensive. While AST
allows us to search the space of possible rollouts, its use of RL means that finding a failure could still
take hundreds or thousands of iterations, which may be intractable for a hifi simulator. This thesis
presents a way of running AST in low-fidelity simulation, and then using the information acquired
to make running in hifi tractable.
As autonomous vehicles find widespread adoption in safety-critical applications, it is essential
for safety to become collaborative, not competitive. To facilitate this collaboration, safety validation
methods should be as open-source and transparent as possible. Towards that end, this thesis presents
the AST Toolbox, an open-source Python package that allows designers to easily apply AST to their
own system.
1.6 Overview
This thesis presents advancements in approximate methods for validating an autonomous system
in simulation. This chapter introduced the overall challenges of validation, explained how adaptive stress testing might alleviate those challenges, presented outstanding limitations in AST, and
outlined the contributions of this thesis. The remainder of this thesis proceeds as follows:
Chapter 2 provides a background on Adaptive Stress Testing. We show that the formulation of
the reward function provides the same optimal trajectory as our motivational optimization problem,
and we provide guidance on how to set certain reward parameters to achieve desirable behavior.
Chapter 2 also provides background on Markov decision processes and deep reinforcement learning.
Chapter 3 presents a new reinforcement learner for AST that uses deep reinforcement learning
(DRL). It is essential for validation methods to be scalable in order to be applicable to autonomous
systems that act in high-dimensional or continuous state and action spaces. The use of DRL improves
the scalability of AST, allowing it to be applied to such systems.
Chapter 4 provides a way to use a single run of AST to validate a system over a set of initial
conditions. Designers are often interested in a scenario class, which is defined by a parameter range.
Instead of having to run AST an infeasible number of times from different instantiations of the
scenario class, we show that AST can learn to generalize across the entire scenario class.
Chapter 5 presents a new reinforcement learner for AST that uses go-explore to handle long-horizon validation problems with no heuristic rewards. Heuristic rewards are domain-specific terms
in the reward function meant to guide RL agents to goal states. We show that the new reinforcement
learner can find failures even in cases where it is infeasible or undesirable to craft heuristic rewards.
Chapter 5 also presents background on go-explore.
Chapter 6 shows that the backward algorithm can be used to robustify the results of all existing
reinforcement learners. AST must search a massive space of possible simulation rollouts to find
failures, which can lead to significant variance in the final validation results. To avoid such variance,
which is unsafe, we show that the backward algorithm can be applied after training to get improved
results that are also more consistent. Chapter 6 also provides background on the backward algorithm.
Chapter 7 presents a way of using the backward algorithm to transfer failures found in low-fidelity
simulation to high-fidelity simulation. High-fidelity simulation is needed for validation because of its
accurate representation of the real world, but it is slow and expensive to run. We show that we can
first find failures in low-fidelity simulation, which is much cheaper to run, and then transfer them
to high-fidelity, significantly reducing the number of high-fidelity simulation steps needed.
Chapter 8 introduces the AST Toolbox, an open-source software toolbox that allows designers to
easily apply AST to their own systems. The Toolbox provides an environment that turns AST into
a standard OpenAI gym reinforcement learning environment. Off-the-shelf reinforcement learners,
like those provided by garage, can then be used to find failures. The designer needs only to write a
wrapper to interface with their simulator. Chapter 8 also covers existing software packages.
Chapter 9 concludes the thesis with a summary of the contributions and results as well as a brief
discussion of ideas for further research.
Chapter 2
Adaptive Stress Testing
When performing approximate validation of autonomous systems, we must find the right balance
between computational cost and thoroughness. This tradeoff is a focus within the field of black-box
safety validation. Black-box safety validation is a rich field with a variety of different approaches
(see section 1.3 for a brief overview of the existing approaches to black-box safety validation). This
thesis focuses in on a single approach, adaptive stress testing (AST), that has the unique feature
of returning the most likely failure of a system for a given scenario. As the rest of this thesis
contributes several advancements to validating autonomous systems based on AST, this chapter will
provide background on the latest formulation of AST.
2.1 Sequential Decision Making
2.1.1 Markov Decision Process
Adaptive stress testing (AST) frames the problem of finding the most likely failure as a Markov
decision process (MDP) [55]. In an MDP, an agent takes action $a$ while in state $s$ at each timestep. The agent may receive a reward from the environment according to the reward function $R(s, a)$. The agent then transitions to the next state $s'$ according to the transition probability $P(s' \mid a, s)$. Both the reward and transition functions may be deterministic or stochastic. The Markov assumption requires that the next state and reward be independent of the past history conditioned on the current state-action pair $(s, a)$. An agent's behavior is specified by a policy $\pi(s)$ that maps states to actions,
either stochastically or deterministically. An optimal policy is one that maximizes expected reward.
Reinforcement learning is one way to approximately optimize policies in large MDPs.
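As a small illustration of these definitions, the sketch below rolls out a policy in a generic MDP under a hypothetical interface, with `mdp.reward` implementing $R(s, a)$ and `mdp.transition` sampling $s' \sim P(s' \mid a, s)$:

```python
def rollout_return(mdp, policy, s0, horizon):
    """Accumulate the total reward of following a policy (sketch)."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)                  # pi(s): maps the state to an action
        total += mdp.reward(s, a)      # R(s, a), possibly stochastic
        s = mdp.transition(s, a)       # s' ~ P(s' | a, s); by the Markov
    return total                       # assumption, depends only on (s, a)
```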
2.1.2 Monte Carlo Tree Search
Monte Carlo tree search (MCTS) [56] has been successfully applied to a variety of problems, including
the game of Go, and has been demonstrated to perform well on large scale MDPs [57]. MCTS
incrementally builds a search tree where each node represents a state or action in the MDP. It
uses forward simulation to evaluate the return of state-action pairs. To balance exploration and
exploitation, each action in the tree is chosen according to its upper confidence bound (UCB)
evaluation:
$$a \leftarrow \underset{a}{\arg\max}\; Q(s, a) + c \sqrt{\frac{\log N(s)}{N(s, a)}} \tag{2.1}$$
where $Q(s, a)$ is the average return of choosing action $a$ at state $s$, $N(s)$ is the number of times that $s$ has been visited, $N(s, a)$ is the number of times that $a$ has been chosen as the next action at state $s$, and $c$ is a parameter that controls the exploration. UCB helps bias the search to focus on the most promising areas of the action space.
For problems with a large or continuous action space (as is common in the AST context), a
technique called double progressive widening (DPW) is used to control the branching factor of the
tree [58]. In DPW, the number of different actions tried at each state node, $|N(s, a)|$, is constrained by $|N(s, a)| < k N(s)^{\alpha}$. The parameters $k$ and $\alpha$ control the widening speed of the tree. Since the transition is deterministic in AST, no constraint on the number of different next states, $|N(s, a, s')|$, is needed [40].
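The sketch below shows how eq. (2.1) and the DPW criterion might combine during action selection; the node bookkeeping and the `sample_disturbance` helper are hypothetical, and disturbances are assumed hashable (e.g., tuples):

```python
import math

def select_action(node, sample_disturbance, c=1.0, k=1.0, alpha=0.5):
    """UCB action selection with double progressive widening (sketch).

    `node.n` holds N(s); `node.children` maps each tried action a to a
    dict with its mean return q = Q(s, a) and visit count n = N(s, a).
    """
    # DPW: admit a new action only while the child count is below k * N(s)^alpha.
    if len(node.children) < k * max(node.n, 1) ** alpha:
        a_new = sample_disturbance()
        node.children.setdefault(a_new, {"q": 0.0, "n": 0})

    def ucb(a):
        stats = node.children[a]
        if stats["n"] == 0:
            return math.inf                     # always try untested actions first
        return stats["q"] + c * math.sqrt(math.log(node.n) / stats["n"])

    return max(node.children, key=ucb)          # eq. (2.1)
```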
2.1.3 Deep Reinforcement Learning
In deep reinforcement learning (DRL), a policy is represented by a neural network [59]. Whereas
a feed-forward neural network maps an input to an output, we use a recurrent neural network
(RNN), which maps an input and a hidden state from the previous timestep to an output and an
updated hidden state. An RNN is naturally suited to sequential data due to the hidden state, which
is a learned latent representation of the current state. RNNs suffer from exploding or vanishing
gradients, a problem addressed by variations such as long short-term memory (LSTM) [60] or gated
recurrent unit (GRU) [61] networks.
There are many different algorithms for optimizing a policy network, proximal policy optimization
(PPO) [62] being one of the most popular. PPO is a policy-gradient method that updates
the network parameters to maximize a surrogate objective. Improvement in a policy is measured by an
advantage function, which can be estimated for a batch of rollout trajectories using methods such as
generalized advantage estimation (GAE) [63]. However, variance in the advantage estimate can lead
to poor performance if the policy changes too much in a single step. To prevent such a step leading
to a collapse in training, PPO can limit the step size in two ways: 1) by incorporating a penalty
proportional to the KL-divergence between the new and old policies or 2) by clipping the probability
ratio between the new and old policies, which bounds the surrogate objective when the policies
diverge too far.
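As an illustration of the clipping mechanism, the following is a minimal PyTorch sketch of PPO's clipped surrogate loss; the tensor names and the clip range of 0.2 are illustrative assumptions, not values used in this thesis.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped PPO surrogate loss (to be minimized).

    Clipping the probability ratio r = pi_new / pi_old to
    [1 - epsilon, 1 + epsilon] bounds how much a single update can
    change the policy.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the pessimistic (minimum) objective, then negate for descent.
    return -torch.min(unclipped, clipped).mean()
```

In practice, either the clipping variant or the KL-penalty variant can be used; the clipped form is often preferred because it avoids tuning a penalty coefficient.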
2.2 Preliminaries
A safety validation problem consists of a system under test (SUT), represented by a SUT model $\mathcal{M}$ with state $\mu \in M$, that is acting in an environment $\mathcal{E}$, as shown in fig. 1.4. The safety validation
problem evolves over a discrete time range t ∈ [0, . . . , tend], where tend ≤ tmax for some horizon
tmax. We use a subscript to denote a variable at time t (e.g., the SUT state at time t is µt) and
a subscript colon range to denote a sequence of a variable over a range of timesteps (e.g., the SUT
state path from time 0 to time t is µ0:t = [µ0, . . . , µt]). The SUT receives an observation o ∈ O from
the environment—which depends on the environment state z ∈ Z—and then takes action a ∈ A.
The SUT state and action depend on the environment observations:
\[
\mu_{t+1}, a_t = \mathcal{M}(o_{0:t}) \tag{2.2}
\]
The SUT and environment are both contained by a simulator S with state s ∈ S, where the
simulator state is the stacked SUT and environment states [µ, z]. An adversary A also interacts
with the simulator by producing disturbances x ∈ X. A disturbance can take many forms (see
section 2.6), but the disturbance vector must control all stochasticity within the environment. The
disturbance can affect both the environment state and the observation:
\[
z_{t+1}, o_{t+1} = \mathcal{E}(a_{0:t}, x_{0:t}) \tag{2.3}
\]
Since we assume that the disturbance controls all stochasticity in the environment, eq. (2.2) and
eq. (2.3) together mean that the simulator state is determined by the disturbance:
\[
s_{t+1} = \mathcal{S}(x_{0:t}) \tag{2.4}
\]
We assume disturbances are independent across time and distributed with known probability density
p (x | s). The disturbance model could be learned from data or generated from expert knowledge.
Finally, we define an event space E ⊂ S where an event of interest occurs. While this event space
can be arbitrarily defined, for validation tasks we focus on failure events. A trajectory is said to be
a failure when $s_{t_{end}} \in E$.
2.3 Problem Formulation
Finding the most likely failure of a system is a sequential decision-making problem. Given an event
space $E$, we want to find the most likely simulator path $(x_{0:t_{end}-1}, s_{0:t_{end}})$ that ends in the event
space by controlling the adversary's disturbances:
\[
\begin{aligned}
\underset{x_0, \dots, x_{t_{end}-1}}{\text{maximize}} \quad & P(s_0, x_0, \dots, x_{t_{end}-1}, s_{t_{end}}) \\
\text{subject to} \quad & s_{t_{end}} \in E
\end{aligned} \tag{2.5}
\]
where $P(s_0, x_0, \dots, x_{t_{end}-1}, s_{t_{end}})$ is the probability of a path in simulator $\mathcal{S}$.

Figure 2.1: The AST methodology. The simulator is treated as a black box. The reinforcement
learner interacts with the simulator through disturbances and receives a reward. Maximizing reward
results in the most likely failure path.
Because we assume the simulator is Markovian,
\[
P(s_0, x_0, \dots, x_{t_{end}-1}, s_{t_{end}}) = P(s_0) \prod_{t=0}^{t_{end}-1} P(s_{t+1} \mid x_t, s_t)\, P(x_t \mid s_t) \tag{2.6}
\]
By definition, the simulator is deterministic with respect to xt, so P (st+1 | xt, st) = 1, and we
assume P (s0) = 1, and therefore eq. (2.5) becomes
\[
\begin{aligned}
\underset{x_0, \dots, x_{t_{end}-1}}{\text{maximize}} \quad & \prod_{t=0}^{t_{end}-1} P(x_t \mid s_t) \\
\text{subject to} \quad & s_{t_{end}} \in E
\end{aligned} \tag{2.7}
\]
2.4 Reinforcement Learning
AST solves the optimization problem in eq. (2.7) through reinforcement learning (RL) by letting
an RL agent A act as the adversary. The general process is shown in fig. 2.1. The reinforcement
learner passes the disturbance to the simulator, which may be treated by the reinforcement learner
as a black box. The simulator uses the disturbance to update the environment and the SUT. The
simulator returns a reward and some simulation information (see section 2.5). If the simulator is
treated as a black box, the simulation information may merely be an indicator of whether a trajectory
has ended in failure. If the simulator is not fully treated as a black box, the simulation information
may include part or all of the simulation state (or heuristics that depend on it), depending on how
much of the simulation state is exposed. Through repeated interactions with the simulator, the
reinforcement learner is optimized to choose disturbances that maximize reward. The process will
therefore return the most likely failure path when the reward function assigns higher rewards to
failure events and to higher-likelihood transitions.
2.4.1 Reward Function
In order to find the most likely failure, the agent tries to maximize the expected sum of rewards,
\[
\mathbb{E}\left[ \sum_{t=0}^{t_{max}} R(s_t, a_t) \right], \tag{2.8}
\]
where the reward function must be structured as follows:
\[
R(s_t, x_{t-1}, s_{t-1}) = h(s_t, x_{t-1}, s_{t-1}) +
\begin{cases}
R_E & s_t \in E \\
-R_{\bar{E}} & s_t \notin E,\ t = t_{max} \\
\rho_t & s_t \notin E,\ t < t_{max}
\end{cases} \tag{2.9}
\]
where the parameters are
• $h(s_t, x_{t-1}, s_{t-1})$: An optional training heuristic given at each timestep of the form $h(s_t, x_{t-1}, s_{t-1}) = \Phi(s_t) - \Phi(s_{t-1})$. When $\Phi$ is a potential function that smoothly measures the closeness to failure, $h(s_t, x_{t-1}, s_{t-1})$ is a difference of potential functions and will not change the optimal trajectory [64].

• $R_E$: A reward for trajectories that end in the event space $E$.

• $R_{\bar{E}}$: A penalty for trajectories that do not end in the event space $E$.

• $\rho_t$: The action likelihood reward. For direct disturbance control, $\rho_t = \log P(x_t \mid s_t)$. For seed-disturbance control, $\rho_t = \log P(x_t \mid s_t; \bar{x})$, where $\bar{x}$ is the pseudorandom seed output by the reinforcement learner (see section 2.5 for the differences between direct disturbance and seed-disturbance control, and for when each version is appropriate). In practice, $\rho_t$ may be replaced by a reward proportional to the log-probabilities.
Within eq. (2.9), there are three cases:

• $s_t \in E$: The trajectory has terminated because an event has been found. This is the goal, so the AST agent receives a reward.

• $s_t \notin E$, $t = t_{max}$: The trajectory has terminated by reaching the horizon $t_{max}$ without reaching an event. This is the least useful outcome, so the AST agent receives a penalty.

• $s_t \notin E$, $t < t_{max}$: The trajectory has not terminated, which is the case at most timesteps. The reward is generally the log-likelihood of the disturbance (a nonpositive quantity), which promotes likely actions.
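To make the structure concrete, the following Python sketch implements the reward cases of eq. (2.9); the argument names and the placeholder magnitudes for $R_E$ and $R_{\bar{E}}$ are illustrative assumptions, not values used elsewhere in this thesis.

```python
def ast_reward(in_event, at_horizon, log_prob_x, phi_t, phi_prev,
               r_event=1e4, r_no_event=1e4):
    """Reward following the structure of eq. (2.9).

    in_event:   whether s_t is in the event space E
    at_horizon: whether t == t_max
    log_prob_x: log P(x_t | s_t), the disturbance likelihood rho_t
    phi_t, phi_prev: potential function values for the heuristic term
    r_event, r_no_event: placeholder magnitudes for R_E and R_Ebar
    """
    heuristic = phi_t - phi_prev  # h = Phi(s_t) - Phi(s_{t-1})
    if in_event:
        return heuristic + r_event       # reward for reaching a failure
    if at_horizon:
        return heuristic - r_no_event    # penalty for running out of time
    return heuristic + log_prob_x        # promote likely disturbances
```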
2.4.2 Proofs of Desirable Properties
Under certain conditions, we can guarantee some desirable properties of the RL approach. For the
purposes of proving the propositions below, we will briefly introduce some common notation. Consider
the set of all possible failure trajectories $T_E$ and the set of all possible non-failure trajectories $T_{\bar{E}}$,
where $T_E \cap T_{\bar{E}} = \emptyset$ and $T_E \cup T_{\bar{E}} = T$, with $T$ being the set of all possible trajectories. Let
$\tau_x = (s_{0:t_{end}}, x_{0:t_{end}-1})$ be a trajectory in some set of trajectories, $\tau_x \in T_x$, where $\tau_E$ is a trajectory
that ends in failure ($s_{t_{end}} \in E$), so $\tau_E \in T_E$, and $\tau_{\bar{E}}$ is a trajectory that does not end in a failure
($s_{t_{end}} \notin E$), so $\tau_{\bar{E}} \in T_{\bar{E}}$. We have already defined $\rho_t = \log P(x_t \mid s_t)$, and we will further define the
likelihood of a trajectory as
likelihood of a trajectory as
\[
\rho_\tau = \sum_{(s_t, x_t) \in \tau} \log P(x_t \mid s_t) \tag{2.10}
\]
We will denote the most likely trajectory in a set with a star, such that $\forall \tau_x \in T_x,\ \rho_{\tau_x^*} \geq \rho_{\tau_x}$. We
will denote the least likely trajectory in a set with a prime, such that $\forall \tau_x \in T_x,\ \rho_{\tau_x} \geq \rho_{\tau_x'}$. We will
denote the trajectory that is the most likely of the trajectories that end closest to a failure with a
dagger, such that $\forall \tau_x \in T_x,\ \Phi(s_{t_{end}}^\dagger) \geq \Phi(s_{t_{end}})$, and if $\Phi(s_{t_{end}}^\dagger) = \Phi(s_{t_{end}})$, then $\rho^\dagger_{\bar{E}} > \rho_{\bar{E}}$, where
$\Phi(s)$ is a potential function that is a smooth measure of closeness to failure. We will also consider
the total sum of rewards that a trajectory receives from eq. (2.9):
\[
\begin{aligned}
G_\tau &= \sum_{(s_t, x_t) \in \tau} R(s_t, x_{t-1}, s_{t-1}) & \text{(2.11)} \\
&= R_E \mathbb{1}\{s_{t_{end}} \in E\} - R_{\bar{E}} \mathbb{1}\{s_{t_{end}} \notin E,\ t_{end} = t_{max}\} + \rho_\tau & \text{(2.12)}
\end{aligned}
\]
The most important property of the RL approach is that the optimal trajectory is the same as
that for eq. (2.7), which is to say that by maximizing eq. (2.8) we actually find the most likely failure.
Equation (2.9) shows that, of any two failure paths, the likelier path will receive a higher reward.
Similarly, of any two non-failure paths, the likelier path will receive a higher reward. Therefore, to
show that our RL approach yields the same trajectory as eq. (2.7), we only need to show that the
most likely failure path yields a higher reward than the most likely non-failure path.
Proposition 2.4.1. Consider the most likely trajectory that ends in failure, $\tau_E^*$. Let $\rho^*_{min} > -\rho_{\tau_E^*}$.
If $(R_E + R_{\bar{E}}) \geq \rho^*_{min}$, then $\tau_E^*$ is also the trajectory that maximizes eq. (2.9).
Proof. Because $\tau_E^* \in T_E$ and $\tau_{\bar{E}}^* \in T_{\bar{E}}$, $G_{\tau_E^*} = \rho_{\tau_E^*} + R_E$ and $G_{\tau_{\bar{E}}^*} = \rho_{\tau_{\bar{E}}^*} - R_{\bar{E}}$. We want to show
that the most likely failure receives a higher reward than the most likely non-failure, so
\[
\begin{aligned}
G_{\tau_E^*} &> G_{\tau_{\bar{E}}^*} & \text{(2.13)} \\
\rho_{\tau_E^*} + R_E &> \rho_{\tau_{\bar{E}}^*} - R_{\bar{E}} & \text{(2.14)} \\
R_E + R_{\bar{E}} &> \rho_{\tau_{\bar{E}}^*} - \rho_{\tau_E^*} & \text{(2.15)}
\end{aligned}
\]
$\rho_{\tau_{\bar{E}}^*}$ is a sum of log-probabilities, and therefore we know $\rho_{\tau_{\bar{E}}^*} \leq 0$, so $-\rho_{\tau_E^*} \geq \rho_{\tau_{\bar{E}}^*} - \rho_{\tau_E^*}$. In addition,
by definition $\rho^*_{min} > -\rho_{\tau_E^*}$ and, consequently, if $(R_E + R_{\bar{E}}) \geq \rho^*_{min}$, then we can easily show that
\[
\begin{aligned}
R_E + R_{\bar{E}} &\geq \rho^*_{min} & \text{(2.16)} \\
&> -\rho_{\tau_E^*} & \text{(2.17)} \\
&\geq \rho_{\tau_{\bar{E}}^*} - \rho_{\tau_E^*} & \text{(2.18)}
\end{aligned}
\]
Therefore, if $(R_E + R_{\bar{E}}) \geq \rho^*_{min}$, then $G_{\tau_E^*} > G_{\tau_{\bar{E}}^*}$.
There are two difficulties with applying the above proof in practice. The first difficulty is that
$\rho_{\tau_E^*}$ is not known ahead of time. However, $\rho^*_{min}$ can be set to a large enough number to ensure
that $\rho^*_{min} > -\rho_{\tau_E^*}$. The second, and more problematic, difficulty is that AST provides only an
approximate solution to the RL problem. The approximate solution might have converged to a local
optimum that is not the global optimum, in which case the failure found would have a lower
probability than that of the most likely failure $\tau_E^*$. In such cases, we would prefer the local optimum
to still be a failure, even if that failure is less likely than the most likely failure. With a slight
variation of proposition 2.4.1, we can show that, under certain conditions, all trajectories that end
in failure will result in a higher reward than any trajectory that does not end in failure.
Proposition 2.4.2. Consider the least likely trajectory that ends in failure, $\tau_E'$, and the most likely
trajectory that does not end in failure, $\tau_{\bar{E}}^*$. Let $\rho'_{min} > -\rho_{\tau_E'}$. If $(R_E + R_{\bar{E}}) \geq \rho'_{min}$, then eq. (2.9)
yields a higher reward for $\tau_E'$ than for $\tau_{\bar{E}}^*$.
Proof. Note that we can ignore the heuristic reward term without loss of generality (see corollary 2.4.1).
We want to show that the least likely failure will receive a higher reward than the most
likely non-failure, so
\[
\begin{aligned}
G_{\tau_E'} &> G_{\tau_{\bar{E}}^*} & \text{(2.19)} \\
\rho_{\tau_E'} + R_E &> \rho_{\tau_{\bar{E}}^*} - R_{\bar{E}} & \text{(2.20)} \\
R_E + R_{\bar{E}} &> \rho_{\tau_{\bar{E}}^*} - \rho_{\tau_E'} & \text{(2.21)}
\end{aligned}
\]
Because $\rho_{\tau_{\bar{E}}^*}$ is a sum of log-probabilities, we know $\rho_{\tau_{\bar{E}}^*} \leq 0$, so $-\rho_{\tau_E'} \geq \rho_{\tau_{\bar{E}}^*} - \rho_{\tau_E'}$. In addition, by
definition $\rho'_{min} > -\rho_{\tau_E'}$ and, consequently, if $(R_E + R_{\bar{E}}) \geq \rho'_{min}$, then we can easily show that
\[
\begin{aligned}
R_E + R_{\bar{E}} &\geq \rho'_{min} & \text{(2.22)} \\
&> -\rho_{\tau_E'} & \text{(2.23)} \\
&\geq \rho_{\tau_{\bar{E}}^*} - \rho_{\tau_E'} & \text{(2.24)}
\end{aligned}
\]
Therefore, if $(R_E + R_{\bar{E}}) \geq \rho'_{min}$, then $G_{\tau_E'} > G_{\tau_{\bar{E}}^*}$.
We have shown that, under these conditions, any failure trajectory receives a higher reward than any
non-failure trajectory, so the approximate solution to the RL problem will be the most likely failure
found. However, in practice we still have the difficulty of not knowing $\rho_{\tau_E'}$ ahead of time. We can
overcome this difficulty by setting a minimum threshold for the likelihood of failures that we are
interested in and considering any trajectory that is less likely than the threshold to be a non-failure.
This threshold is then a lower bound on $\rho_{\tau_E'}$, and $\rho'_{min}$ can be set accordingly.
Both of the above proofs ignored the heuristic terms h (st, xt−1, st−1), but it is easy to show that
the results still hold when a heuristic reward is used.
Corollary 2.4.1. If $h(s_t, x_{t-1}, s_{t-1}) = \Phi(s_t) - \Phi(s_{t-1})$, where $\Phi$ is a smooth measure of closeness to
failure, then proposition 2.4.1 and proposition 2.4.2 remain true when using a non-zero heuristic reward.
Proof. In eq. (2.16) and eq. (2.22), both proofs arrive at an inequality of the form
\[
R_E + R_{\bar{E}} \geq \rho \tag{2.25}
\]
where $\rho = \rho^*_{min}$ and $\rho = \rho'_{min}$, respectively. For the purpose of this proof we can ignore the specific $\rho$
terms and note that including a heuristic reward in either proof would have resulted in an inequality
of the form
\[
R_E + R_{\bar{E}} \geq \rho + \Phi(s_{\bar{E}}) - \Phi(s_E) \tag{2.26}
\]
Because $\Phi$ is a measure of closeness to failure, $\Phi(s_E) > \Phi(s_{\bar{E}})$, and, therefore, $\Phi(s_{\bar{E}}) - \Phi(s_E) < 0$.
Consequently, if $R_E$ and $R_{\bar{E}}$ are set such that eq. (2.25) is true, then eq. (2.26) is also true.
There are also situations in which, if we do not find a failure, we would like to return the
trajectory that was closest to being a failure. We can show that, under certain conditions, when we
do not find a failure, the trajectory that is closest to being a failure will receive the highest reward
(and if there are multiple trajectories that are equally close to failure, the most likely one will receive
the highest reward).
Proposition 2.4.3. Assume that the heuristic reward is $h(s_t, x_{t-1}, s_{t-1}) = \Phi(s_t) - \Phi(s_{t-1})$, where
$\Phi(s)$ is a potential function that is a smooth measure of closeness to failure. Further assume that
there are no trajectories that end in failure, so $T_E = \emptyset$ and $T_{\bar{E}} = T$. Consider the most likely of the
trajectories that end closest to failure, $\tau^\dagger_{\bar{E}}$. If $\Phi(s_{t_{end}}^\dagger) - \Phi(s_{t_{end}}) > -\rho^\dagger_{\bar{E}}$, then $\tau^\dagger_{\bar{E}}$ is the trajectory
in $T$ that maximizes eq. (2.9).
Proof. We want to show that $\tau^\dagger_{\bar{E}}$ receives a greater reward than any other trajectory, so
\[
\begin{aligned}
G_{\tau^\dagger_{\bar{E}}} &> G_{\tau_{\bar{E}}} & \text{(2.27)} \\
\rho^\dagger_{\bar{E}} - R_{\bar{E}} + \Phi(s_{t_{end}}^\dagger) - \Phi(s_0) &> \rho_{\bar{E}} - R_{\bar{E}} + \Phi(s_{t_{end}}) - \Phi(s_0) & \text{(2.28)} \\
\rho^\dagger_{\bar{E}} + \Phi(s_{t_{end}}^\dagger) &> \rho_{\bar{E}} + \Phi(s_{t_{end}}) & \text{(2.29)} \\
\Phi(s_{t_{end}}^\dagger) - \Phi(s_{t_{end}}) &> \rho_{\bar{E}} - \rho^\dagger_{\bar{E}} & \text{(2.30)}
\end{aligned}
\]
Note that in eq. (2.28), all the middle $\Phi(s_{1:t_{end}-1})$ terms cancel out because the heuristic forms a telescoping
sum. Because $\rho_{\bar{E}}$ is a sum of log-probabilities, we know $\rho_{\bar{E}} \leq 0$, and therefore
\[
-\rho^\dagger_{\bar{E}} \geq \rho_{\bar{E}} - \rho^\dagger_{\bar{E}} \tag{2.31}
\]
Consequently, $G_{\tau^\dagger_{\bar{E}}} > G_{\tau_{\bar{E}}}$ when
\[
\Phi(s_{t_{end}}^\dagger) - \Phi(s_{t_{end}}) > -\rho^\dagger_{\bar{E}} \tag{2.32}
\]
In the case in which $\Phi(s_{t_{end}}^\dagger) = \Phi(s_{t_{end}})$, we must have $\rho_{\bar{E}} - \rho^\dagger_{\bar{E}} < 0$, which is the case if, and only
if, $\tau^\dagger_{\bar{E}}$ is the more likely of the two trajectories.
As with the previous propositions, a difficulty arises in practice in that we do not know $-\rho^\dagger_{\bar{E}}$ ahead
of time. However, we can still use proposition 2.4.3 as a guideline for tuning the heuristic reward
in eq. (2.9). Our result shows that the gain in $\Phi$ from ending closer to a failure must be greater
than the magnitude of the trajectory's log-likelihood reward. As a consequence, we can see that the
more we care about returning the trajectory that is closest to failure, the more we should scale $\Phi$.
2.5 Methodology
The AST approach treats both the system under test and the simulator itself as a black box.
However, the reinforcement learner does need the following three access functions in order to interact
with the simulator:
• Initialize(S, s0): Resets S to a given initial state s0.
• Step(S, E, x): Steps the simulation in time by drawing the next state s′ based on disturbance x. The function returns ρ, the log-probability of the transition, and an indicator showing whether or not s′ is in E. The function may return additional simulation information for the reward function to use if the simulation state is partially or fully exposed. The simulation information may include the exposed portions of the simulation state itself, metrics based on the exposed portions of the simulation state, or a combination of both.

• IsTerminal(S, E): Returns true if the current state of the simulation is in E or if the horizon of the simulation tmax has been reached.

Figure 2.2: The AST process using direct disturbance control. The reinforcement learner controls
the simulator directly with disturbances, which are used by the reward function to calculate reward.
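In practice, these access functions might be wrapped in an interface such as the following sketch. The wrapped simulator object and its method names are hypothetical; only the three-function contract comes from the list above.

```python
class ASTSimulatorInterface:
    """Minimal wrapper exposing the three access functions AST needs.

    The wrapped `sim` object and its `reset`, `advance`, and `state`
    members are illustrative stand-ins, not a real API.
    """

    def __init__(self, sim, event_space, t_max):
        self.sim = sim
        self.event_space = event_space
        self.t_max = t_max
        self.t = 0

    def initialize(self, s0):
        """Reset the simulator S to the initial state s0."""
        self.t = 0
        self.sim.reset(s0)

    def step(self, x):
        """Advance one timestep under disturbance x.

        Returns the transition log-probability rho, an event indicator,
        and any exposed simulation information.
        """
        self.t += 1
        log_prob, info = self.sim.advance(x)
        return log_prob, self.sim.state in self.event_space, info

    def is_terminal(self):
        """True when an event occurred or the horizon t_max was reached."""
        return self.sim.state in self.event_space or self.t >= self.t_max
```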
Depending on the simulator design, AST has two different ways to control the simulation rollouts:
direct disturbance control and seed-disturbance control.
2.5.1 Direct Disturbance Control
Under direct disturbance control, the reinforcement learner directly outputs a disturbance vector that
is used by the simulator when updating to the next timestep, as shown in fig. 2.2. The disturbance is
used by the reward function, with any additional simulation information, to determine the reward.
At each timestep, the disturbance output may depend only on the previous disturbance, or it may
depend on simulation information if some of the simulator state is exposed.
2.5.2 Seed-Disturbance Control
In some simulators, it may not be feasible to allow an outside adversary to directly control the update
of a simulator. Under seed-disturbance control, AST instead controls stochasticity at each timestep
by setting the global random seed $\bar{x}$. When stochastic elements of the simulator are generated from
pseudorandom number generators, the outcome is determined by the global random seed. Therefore,
the disturbance $x$ is fully determined by $\bar{x}$, and the simulator is still deterministic with respect to
the reinforcement learner's output. The seed-disturbance control process is shown in fig. 2.3.
Figure 2.3: The AST process using seed-disturbance control. The reinforcement learner controls the
simulator by outputting a seed for the random number generators to use. The reward function uses
the transition likelihoods from the simulator to calculate reward.
The reward function cannot determine a reward directly from the seed $\bar{x}$ and must rely instead on the
simulator to provide it with the likelihood of the transition to the current simulator state.
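The defining property of seed-disturbance control, namely that an entire rollout is a deterministic function of the seed sequence, can be illustrated with a short sketch. The simulator methods here are hypothetical; only the per-timestep seeding pattern is the point.

```python
import numpy as np

def rollout_from_seeds(sim, seeds):
    """Replay a trajectory from a sequence of pseudorandom seeds.

    Because all stochasticity in the simulator is drawn from its
    pseudorandom number generators, fixing the seed at every timestep
    fixes the disturbances, so an identical seed sequence always
    reproduces an identical trajectory.
    """
    sim.reset()
    total_log_prob, event = 0.0, False
    for seed in seeds:
        np.random.seed(seed)             # set the global random seed
        log_prob, event = sim.advance()  # disturbances drawn internally
        total_log_prob += log_prob
        if event:
            break
    return total_log_prob, event
```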
2.6 Case Studies
We present three case studies in which an autonomous system needs to be validated. For each
scenario, we provide an example of how it could be formulated as an AST problem.
2.6.1 Cartpole with Disturbances
Problem
Cartpole is a classic test environment for continuous control algorithms [66]. The system under
test (SUT) is a neural network control policy trained by TRPO. The control policy controls the
horizontal force $\vec{F}$ applied to the cart, and the goal is to prevent the bar on top of the cart from
falling over.
The cartpole scenario from [67] is shown in fig. 2.4. The state $s = [x, \dot{x}, \theta, \dot{\theta}]$ represents the cart's
horizontal position and speed as well as the bar's angle and angular velocity. A failure occurs when
$|x| > x_{max}$ or $|\theta| > \theta_{max}$. The initial state is $s_0 = [0, 0, 0, 0]$.
Formulation
We define an event as the pole reaching some maximum rotation or the cart reaching some maximum
horizontal distance from the start position. The disturbance is $\delta \vec{F}$, the disturbance force applied to
the cart at each timestep. The reward function uses $R_E = 1\times10^4$ and $R_{\bar{E}} = 0$. There is also a
heuristic reward where $\Phi$ is 1000 times the normalized distance of the final state to the nearest failure
state. The choice of $\Phi$ encourages the reinforcement learner to push the SUT closer to failure. The
disturbance likelihood reward $\rho$ is set to the log of the probability density function of the natural
disturbance force distribution. See Koren, Ma, Corso, et al. [67].

Figure 2.4: Layout of the cartpole environment. A control policy tries to keep the bar from falling
over or the cart from moving too far horizontally by applying a control force to the cart [65].
Figure 2.5: Layout of the autonomous vehicle scenario. A vehicle approaches a crosswalk on a
neighborhood road as a single pedestrian attempts to walk across. Initial conditions are shown:
the ego vehicle starts at $(-35, 0)$ m with speed 11.2 m/s, and the pedestrian starts at $(0.0, -2)$ m
with speed 1.0 m/s.
2.6.2 Autonomous Vehicle at a Crosswalk
Problem
Autonomous vehicles must be able to safely interact with pedestrians. Consider an autonomous
vehicle approaching a crosswalk on a neighborhood road. There is a single pedestrian who is free to
move in any direction. The autonomous vehicle has imperfect sensors.
The autonomous vehicle scenario from Koren, Alsaif, Lee, et al. [46] is shown in fig. 2.5. The x-axis
is aligned with the center of the SUT's lane, with East being the positive x-direction. The y-axis is
aligned with the center of the crosswalk, with North being the positive y-direction. The pedestrian
is crossing from South to North. The simulator state $s = [s_{car,x}, s_{car,y}, s_{ped,x}, s_{ped,y}, v_{car,x}, v_{car,y}, v_{ped,x}, v_{ped,y}]$
represents the x and y position and velocity of both the vehicle and the pedestrian. The vehicle also
observes a noisy vector o = [srel,x, srel,y, vrel,x, vrel,y], which represents the position and velocity
of the pedestrian relative to the vehicle. The vehicle starts 35 m back from the crosswalk, with an
initial velocity of 11.2 m/s East. The pedestrian starts 2 m back from the edge of the road, with an
initial velocity of 1 m/s North. The autonomous vehicle policy is a modified version of the intelligent
driver model [68].
Formulation
We define an event as an overlap between the car and the pedestrian, which occurs when
$|s_{car,x} - s_{ped,x}| \leq 2.5$ and $|s_{car,y} - s_{ped,y}| \leq 1.4$. The disturbance vector controls both the motion of
the pedestrian and the scale and direction of the sensor noise. The reward function for this scenario
uses a penalty of $-1\times10^5$ for trajectories that do not end in failure and $R_E = 0$. There is also a
heuristic reward with $\Phi = 10000 \cdot \text{dist}(p_v, p_p)$, where $\text{dist}(p_v, p_p)$ is the distance between the
pedestrian and the SUT. This heuristic encourages the reinforcement learner to move the pedestrian
closer to the car in early iterations, which can significantly increase training speed. The reward
function also uses $\rho = M(x, \mu_x \mid s)$, the Mahalanobis distance function [69]. The Mahalanobis
distance is a generalization of distance to the mean for multivariate distributions. Using the
Mahalanobis distance results in a reward that is still proportional to the likelihood of a trajectory
but handles very small probabilities without exploding towards negative infinity. See Koren, Alsaif,
Lee, et al. [46].
2.6.3 Aircraft Collision Avoidance Software
Problem
The next-generation Airborne Collision Avoidance System (ACAS X) [70] gives instructions to pilots
to avoid collisions. We want to identify system failures in simulation to ensure the system is robust
enough to replace the Traffic Alert and Collision Avoidance System (TCAS) [71]. We are interested
in a number of different scenarios in which two or three aircraft are in the same airspace.
Formulation
The event will be a near mid-air collision (NMAC), which occurs when two planes pass within 100 vertical
feet and 500 horizontal feet of each other. The simulator is quite complicated, involving sensor,
aircraft, and pilot models. Consequently, it is too difficult to define or access the full simulator state.
Therefore, instead of trying to control the simulation state explicitly, we will use seed-disturbance
control, so the reinforcement learner will output seeds to the random number generators in the
simulator. The reward function for this scenario uses a penalty of $-1\times10^5$ for trajectories that do not
end in failure, $R_E = 0$, and a heuristic reward equivalent to the negative miss distance at the end of
the trajectory. While this heuristic is not technically of the form $h(s_t, x_{t-1}, s_{t-1}) = \Phi(s_t) - \Phi(s_{t-1})$,
it still works because the middle terms would cancel out in a summation over repeated steps, and it
is easier to implement. The reward function also uses $\rho_t = \log P(s_{t+1} \mid s_t)$, the log of the known
transition probability at each timestep. See Lee, Mengshoel, Saksena, et al. [53].
An example result from Lee, Mengshoel, Saksena, et al. [53] is shown in fig. 2.6. The planes need
to cross paths, and the validation method was able to find a rollout where pilot responses to the
ACAS X system led to an NMAC. AST was used to find a variety of different failures in ACAS X.
Figure 2.6: An example result from Lee, Mengshoel, Saksena, et al. [53], showing an NMAC identified
by AST. The left panel plots Position North (ft) against Position East (ft); the right panel plots
Altitude (ft) against Time (s). Note that the planes must be both vertically and horizontally near
each other to register as an NMAC.
2.7 Discussion
This chapter presented adaptive stress testing (AST), including the problem formulation, the reward
function format, and the methodology for both direct disturbance control and seed-disturbance
control. We were able to prove that the RL problem shares its optimal trajectory with the original
optimization problem in eq. (2.7). We
were also able to provide some useful guidelines on setting hyperparameters in the reward function, as
well as on how to design a heuristic reward that will not change the optimal trajectory. We grounded
this theory by presenting three examples of how to formulate the validation of an autonomous system
according to AST. The theory presented in this chapter underlies all of the work throughout the
rest of this thesis.
Chapter 3
Scalable Validation
Adaptive Stress Testing (AST) allows us to search the space of possible simulation rollouts for a
specific scenario to find the most likely failure. Consequently, we can avoid significant simplifications
or constraints that could undermine results when validating autonomous systems. However, many
autonomous systems act in state and action spaces that are both continuous and high-dimensional.
Therefore, we want an AST reinforcement learner that can also scale well to complex problems.
Prior work on AST [40] used Monte Carlo tree search, a reinforcement learning method that builds
a tree of state and action nodes to find the trajectory that yields the highest reward. During rollouts,
MCTS takes actions that balance exploration and exploitation, and then uses the rollout rewards
to track the expected value of different state-action pairs. Unfortunately, tree size can quickly
explode when dealing with continuous or high-dimensional state or action spaces. While there are
variations of MCTS that allow it to perform better with continuous or high-dimensional spaces (see
section 2.1.2), they often involve pruning branches that do not seem promising. Aggressive pruning
can cause MCTS to miss branches that would have led to better solutions, harming performance.
In addition, MCTS commits to the early part of a trajectory before it moves on to explore later
steps. Consequently, MCTS can struggle with exploitation, as it may converge to a trajectory that
is similar to an optimal solution but is slightly worse, especially in continuous spaces. In the case of
AST, suboptimal exploitation means that MCTS could yield an underestimate of the probability of
the most likely failure.
To address the limitations of the MCTS reinforcement learner, this chapter presents an AST
reinforcement learner that uses deep reinforcement learning (DRL). Instead of a tree, DRL represents
a policy with a neural network and then uses batches of rollouts to perform optimization. DRL has
already been shown to perform well on tasks with continuous or high-dimensional state or action
spaces such as Atari environments. In addition, numerical optimization allows DRL to have strong
exploitation. In fact, contrary to MCTS, DRL can struggle instead with exploration. To mitigate
the exploration issues, instead of having the DRL reinforcement learner output actions directly,
the reinforcement learner outputs the mean and standard deviation of a Gaussian
distribution. Actions are then sampled from the resulting distribution. Early in training, the
use of distributions can enhance exploration by adding randomness. However, through the course
of training, the reinforcement learner learns to reduce the standard deviations, so the policy still
converges to a distribution with little stochasticity.

Figure 3.1: The network architecture of the fully-connected DRL reinforcement learner. A number
of hidden layers (with 32 to 512 units) learn to map the simulation state $s_t$ to
$x_{t+1} = [\mu_{t+1}, \Sigma_{t+1}]$, the mean and diagonal covariance of a multivariate normal distribution. The
disturbance $x_{t+1}$ is then sampled from the distribution.
Background on Monte Carlo tree search can be found in section 2.1.2. Background on deep
reinforcement learning can be found in section 2.1.3.
3.1 Fully Connected Approach
The most straightforward way to apply DRL to AST is to represent the policy as a fully-connected
neural network, as shown in fig. 3.1. The fully connected neural network maps the simulation state
to the action distribution parameters through a series of fully connected hidden layers. Architecture
design choices like the number of layers, the size of the layers, and types of non-linearities can be
adjusted based on the complexity of the AST problem.
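As a concrete illustration, a minimal PyTorch sketch of such a policy network follows. The layer sizes match those listed in fig. 3.1, but their ordering, the activation function, the state-independent log-standard-deviation parameterization, and all names are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class FCGaussianPolicy(nn.Module):
    """Fully connected policy mapping simulation state to a Gaussian
    over disturbances, in the spirit of fig. 3.1."""

    def __init__(self, state_dim, disturbance_dim,
                 hidden=(512, 256, 128, 64, 32)):
        super().__init__()
        layers, last = [], state_dim
        for width in hidden:
            layers += [nn.Linear(last, width), nn.Tanh()]
            last = width
        self.body = nn.Sequential(*layers)
        self.mean = nn.Linear(last, disturbance_dim)
        # Diagonal covariance, parameterized by a learned log std.
        self.log_std = nn.Parameter(torch.zeros(disturbance_dim))

    def forward(self, state):
        mu = self.mean(self.body(state))
        dist = torch.distributions.Normal(mu, self.log_std.exp())
        return dist.rsample()  # sample a disturbance from N(mu, Sigma)
```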
The fully connected architecture has two key advantages. The first is simplicity. A fully connected
neural network is the easiest to implement and the most straightforward to optimize. The second
advantage is performance. By mapping directly from the simulation state, the network has all
the information it needs to force failures. Unfortunately, there is a major limitation as well. By
mapping directly from the simulation state, the fully connected architecture violates the black-box
assumption. In some cases, this may be acceptable, but complex autonomous systems are often
tested in complex simulators, where it may be difficult and time-consuming to get access to the full
simulation state. A different architecture is needed for such cases.

Figure 3.2: The network architecture of the recurrent DRL reinforcement learner. An LSTM learns
to map the previous disturbance $x_t$ and the previous hidden state $h_t$ to the next hidden state $h_{t+1}$
and to $x_{t+1} = [\mu_{t+1}, \Sigma_{t+1}]$, the mean and diagonal covariance of a multivariate normal distribution.
The disturbance $x_{t+1}$ is then sampled from the distribution.
3.2 Preserving the Black-box Simulator Assumption
For cases where the simulator must be treated as a black box, we can instead use a recurrent neural
network architecture, as shown in fig. 3.2. A recurrent neural network maps two inputs—the actual
input and the previous hidden state—to two outputs—the actual output and the next hidden state.
The hidden state is a learned latent approximation of the state space, which allows the architecture
to force failures without access to the simulator state. Instead, the reinforcement learner uses the
previous action as input and effectively learns to approximate a state space over the course of
training. While there are many flavors of recurrent networks, in this case we use an LSTM because
it has been shown to have strong performance on many types of sequential tasks.
The main advantage of the recurrent architecture is that it can treat the simulator as a black
box. By learning to map a sequence of actions to a hidden state, the reinforcement learner can find
failures without having access to the actual simulator state. Learning the hidden state is a harder
problem, so the recurrent architecture will usually be slower than the fully connected architecture
in cases where both can work. Implementing a recurrent neural network can be more complicated
than implementing a fully connected neural network as well. However, the recurrent architecture
still retains the main advantages of DRL over MCTS: scalability and strong exploitation.
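A corresponding sketch of the recurrent reinforcement learner follows; the hidden size and other details are illustrative assumptions, with only the input/output structure taken from fig. 3.2.

```python
import torch
import torch.nn as nn

class RecurrentGaussianPolicy(nn.Module):
    """LSTM policy mapping the previous disturbance and hidden state to
    a Gaussian over the next disturbance, in the spirit of fig. 3.2."""

    def __init__(self, disturbance_dim, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTMCell(disturbance_dim, hidden_dim)
        self.mean = nn.Linear(hidden_dim, disturbance_dim)
        self.log_std = nn.Parameter(torch.zeros(disturbance_dim))

    def forward(self, prev_disturbance, hidden):
        # hidden is the (h, c) pair carried across timesteps; it acts as
        # a learned latent stand-in for the unobserved simulator state.
        h, c = self.lstm(prev_disturbance, hidden)
        dist = torch.distributions.Normal(self.mean(h), self.log_std.exp())
        return dist.rsample(), (h, c)
```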
3.3 Experiments
This section defines the set of scenarios and metrics that we will use to evaluate the performance
of the methods as well as the reward function used by the reinforcement learners. The simulator
design outlined here is the basis of the example AV simulator in the AST Toolbox (see chapter 8)
and will be used throughout this thesis.
3.3.1 Simulator Design
Figure 3.3: The modular simulator implementation. The simulator comprises actor dynamics, a
sensor module, a tracker module, and the SUT's path planner and dynamics. The modules of the
simulator can be easily swapped to test different scenarios, SUTs, or sensor configurations.
The system under test (SUT), the sensors, the tracker, the reinforcement learner, and the scenario
definition are separate components in the simulation framework. The simulator architecture is shown
in fig. 3.3. The reinforcement learner outputs the disturbances (see section 3.3.2) to the simulator;
they are used to update non-SUT actors controlled by AST. In our experiments, the only actors
are pedestrians. In order to have smooth trajectories, the disturbances control the pedestrians by
setting their acceleration at each timestep. The sensors receive the new actor states and output
measurements augmented with the noise from the disturbances. The measurements are filtered by
the tracker, which is an alpha-beta filter, and passed to the SUT. See section 3.3.2 for more details
on the implementation of the sensors and tracker. The SUT, which is the driving model, decides
how to maneuver the vehicle based on its observations. We use a modified version of the Intelligent
Driver Model (IDM) as the SUT [68] (see section 3.3.2). The SUT actions are used to update the
state of the vehicle. The simulator outputs the transition probability and event indicator to the
reward function. The current state of the simulator can be represented in different ways. If the state
of the simulator is fully observable, the simulator can provide its state or an autoencoder processed
version of the state [72]. Otherwise, the history of previous actions can be used to represent the
current state. The state representation, along with the reward of the previous step, are then input
to the reinforcement learner.
An advantage of this modular approach is that, if multiple IDM implementations were to be
compared, it would be easy to swap them out and compare the results. Entirely different SUTs can be
compared, or individual modules can be changed, for example by comparing how the SUT performs
with two different tracker modules. The modularity allows AST to be a useful benchmarking tool,
or a batch testing method for autonomous systems. An especially useful version of AST for system
or component comparisons is differential AST [73], which searches two simulators simultaneously to
maximize the difference in performance between the two SUTs.
The reinforcement learner can use two different procedures to generate the disturbances, as shown
in fig. 3.4. The direct disturbance control approach, used by DRL in this chapter, is for the agent
to output the parameters of a distribution at each timestep. Disturbances are then sampled from
the distribution. The seed-disturbance control approach, used by MCTS in this chapter, is to output
pseudorandom seeds, which are used to seed random number generators. The disturbances are then
sampled using these random number generators. The seed-disturbance control approach is useful for
large, obfuscated simulators that would be difficult to control with direct disturbances but already
use many random number generators.
The inputs to the reinforcement learners vary slightly. MCTS does not make use of the simulator’s
internal state, treating it entirely as a black box. Instead of making use of the simulator’s internal
state, the AST implementation of MCTS uses a history of the previous pseudorandom seeds as
the nodes in the tree [40]. In contrast, the fully-connected DRL reinforcement learner takes the
simulation state as input. The simulation state is
\[
s_{sim} = [s_{sim}^{(1)}, s_{sim}^{(2)}, \dots, s_{sim}^{(n)}]
\]
For the $i$th pedestrian,
\[
s_{sim}^{(i)} = [v_x^{(i)}, v_y^{(i)}, x^{(i)}, y^{(i)}]
\]
where

• $v_x^{(i)}, v_y^{(i)}$ are the x and y components of the relative velocity between the SUT and the $i$th pedestrian.

• $x^{(i)}, y^{(i)}$ are the x and y components of the relative position between the SUT and the $i$th pedestrian.
3.3.2 Problem Formulation
To evaluate the effectiveness of AST as applied to autonomous vehicles, we stress test a vehicle
in a set of scenarios at a pedestrian crosswalk, shown in fig. 3.5. The scenario is defined by a
single autonomous vehicle approaching a crosswalk. The road has two lanes to model a regular
neighborhood road, although there is no traffic in either direction for this specific example.

Figure 3.4: A comparison of the reinforcement learner methods. MCTS uses a seed to control a
random number generator. DRL outputs a distribution, which is then sampled. Both of these
methods produce disturbances.
The lanes are each 3.7 m wide and the crosswalk is 3.1 m wide, per California state regulations [74].
The Cartesian origin is set at the intersection of the central vertical axis of the crosswalk and the
central horizontal axis of the bottom lane, with the positive x direction following the direction of
the arrow in fig. 3.5, and positive y motion being towards the top lane of the street. We test
with different numbers of pedestrians as well as with different starting states. The state of the $i$th
pedestrian is
\[
s_{ped}^{(i)} = [v_x^{(i)}, v_y^{(i)}, x^{(i)}, y^{(i)}]
\]
where

• $v_x^{(i)}, v_y^{(i)}$ are the x and y components of the velocity of the $i$th pedestrian.

• $x^{(i)}, y^{(i)}$ are the x and y components of the position of the $i$th pedestrian.
We present data from each pedestrian for the three different variations of the scenario:

• 1 pedestrian, with initial state $s_{ped}^{(1)} = [0.0\ \text{m/s},\ 1.4\ \text{m/s},\ 0.0\ \text{m},\ -2.0\ \text{m}]$

• 1 pedestrian, with initial state $s_{ped}^{(1)} = [0.0\ \text{m/s},\ 1.4\ \text{m/s},\ 0.0\ \text{m},\ -4.0\ \text{m}]$

• 2 pedestrians, with initial states $s_{ped}^{(1)} = [0.0\ \text{m/s},\ 1.4\ \text{m/s},\ 0.0\ \text{m},\ -2.0\ \text{m}]$ and $s_{ped}^{(2)} = [0.0\ \text{m/s},\ -1.4\ \text{m/s},\ 0.0\ \text{m},\ 5.0\ \text{m}]$
Figure 3.5: The setup of the three crosswalk scenarios. (a) Overview of the general crosswalk
scenario, showing the ego vehicle at $(x_{car}, y_{car})$ with velocity $v_{car,x}$ and the $i$th pedestrian at
$(x_{ped}^{(i)}, y_{ped}^{(i)})$ with velocity $v_{ped,y}^{(i)}$. (b) Scenario 1: one pedestrian at $(0.0, -2.0)$ m moving at 1.4 m/s.
(c) Scenario 2: one pedestrian at $(0.0, -4.0)$ m moving at 1.4 m/s. (d) Scenario 3: pedestrians at
$(0.0, -2.0)$ m moving at 1.4 m/s and at $(0.0, 5.0)$ m moving at $-1.4$ m/s. (e) The initial pedestrian
configurations for the three different scenarios.
The scenario variations are shown in fig. 3.5. The first scenario (fig. 3.5b) was chosen as a basic
example to demonstrate AST. The second scenario (fig. 3.5c) was chosen to show that a different
initial condition leads to different collision trajectories. The third scenario (fig. 3.5d) shows the
scalability of AST by including more actors in the scenario.
Actor Dynamics
Both reinforcement learners use the same representation for disturbances. The disturbance vector
at each timestep is
\[
a_{env} = [a^{(1)}, a^{(2)}, \dots, a^{(n)}]
\]
where $n$ is the number of pedestrians. For the $i$th pedestrian,
\[
a^{(i)} = [a_x^{(i)}, a_y^{(i)}, \varepsilon_{v_x}^{(i)}, \varepsilon_{v_y}^{(i)}, \varepsilon_x^{(i)}, \varepsilon_y^{(i)}]
\]
where

• $a_x^{(i)}, a_y^{(i)}$ represent the x and y components of the $i$th pedestrian's acceleration, respectively.

• $\varepsilon_{v_x}^{(i)}, \varepsilon_{v_y}^{(i)}$ represent the noise injected into the SUT measurement of the components of the $i$th pedestrian's velocity $v_x^{(i)}$ and $v_y^{(i)}$, respectively.

• $\varepsilon_x^{(i)}, \varepsilon_y^{(i)}$ represent the noise injected into the SUT measurement of the x and y components of the $i$th pedestrian's position, respectively.
AST controls both the pedestrian motion and the sensor noise, which allows it to search over both
pedestrian actions and perception failures to find the most likely collision.
At each time step, the pedestrian samples a(i) (as mentioned above, the procedure of this sampling
differs slightly between reinforcement learners, but the representation of the action vector a(i) is the
same). To find the likelihood of a(i), a model of the expected pedestrian action vector is needed.
This model is a multivariate Gaussian distribution N (µa,Σ) where µa is a zero-vector, and Σ is
diagonal. Our pedestrian model is parameterized by σaLat, σaLon, and σnoise, which are the diagonal
elements of the covariance matrix and correspond to lateral acceleration, longitudinal acceleration,
and sensor noise, respectively. The values we use are: σaLat = 0.01, σaLon = 0.1, and σnoise = 0.1.
The acceleration parameters are designed to encourage the pedestrians to move across the street
with some lateral movement. The assumption of the mean action being the zero-vector implies
that, on average, pedestrians maintain their current speed and heading. In reality, this distribution
could depend on the location of the pedestrian, where the vehicle is, the attitude or attention of
the pedestrian, or other factors. Applying a more realistic pedestrian model is an avenue for future
work. The initial speed of the pedestrian is set to 1.5 m/s.
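A sketch of this pedestrian model and its log-likelihood is shown below; the ordering of the disturbance components, and which acceleration component is treated as lateral, are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Diagonal covariance entries for [a_x, a_y, eps_vx, eps_vy, eps_x, eps_y],
# using sigma_aLat = 0.01, sigma_aLon = 0.1, and sigma_noise = 0.1 as the
# diagonal elements of the covariance matrix, per the text.
cov_diag = np.array([0.01, 0.1, 0.1, 0.1, 0.1, 0.1])
pedestrian_model = multivariate_normal(mean=np.zeros(6),
                                       cov=np.diag(cov_diag))

def disturbance_log_likelihood(a):
    """Log-likelihood log p(a) of one pedestrian's disturbance vector."""
    return pedestrian_model.logpdf(a)
```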
Sensor and Tracker Models
The sensors of the SUT receive a vector of the actor states and output a vector of noisy measurements
\[
m = [m^{(1)}, m^{(2)}, \dots, m^{(n)}]
\]
For the $i$th pedestrian, $m^{(i)} = s_{ped}^{(i)} + \varepsilon^{(i)}$, where
\[
\varepsilon^{(i)} = [\varepsilon_{v_x}^{(i)}, \varepsilon_{v_y}^{(i)}, \varepsilon_x^{(i)}, \varepsilon_y^{(i)}]
\]
The measurements are passed to an alpha-beta tracker [75], which is parameterized by αtracker and
βtracker. The tracker returns filtered versions of the measurements as the SUT’s observations. We
use the values αtracker = 0.85 and βtracker = 0.005.
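A scalar alpha-beta filter update of the kind described might look like the following sketch; the update structure is the standard textbook form rather than the exact tracker implementation used here.

```python
def alpha_beta_update(x_est, v_est, measurement, dt,
                      alpha=0.85, beta=0.005):
    """One alpha-beta filter step for a tracked position and velocity.

    Predict forward with the current velocity estimate, then correct
    both estimates using the measurement residual.
    """
    x_pred = x_est + v_est * dt             # prediction step
    residual = measurement - x_pred         # innovation
    x_new = x_pred + alpha * residual       # position correction
    v_new = v_est + (beta / dt) * residual  # velocity correction
    return x_new, v_new
```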
System Under Test Model
The SUT is based on the Intelligent Driver Model [68]. The IDM is designed to stay in one lane and
safely follow traffic. To follow the rules around crosswalks, we set the desired velocity at 25 miles
per hour (11.2 m/s). If there is no vehicle in front of the IDM for it to follow, the model maintains
a desired velocity. We adapted the IDM for interacting with pedestrians by modifying it to treat
the closest pedestrian in the road as the target vehicle. The IDM then tries to follow a safe distance
behind the pedestrian based on the difference between their velocities, which results in the vehicle
stopping at the crosswalk since the pedestrian’s vx is negligible. Our modified IDM is obviously not a
safe model; as we will show, ignoring any pedestrian outside of the road makes the vehicle vulnerable
to being blindsided by people moving quickly from the curb. The goal of this chapter, however, is
to show that AST can effectively induce poor behavior in an autonomous driving algorithm, not to
present a safe algorithm. The SUT model receives a series of filtered observations
\[
o = [o^{(1)}, o^{(2)}, \dots, o^{(n)}]
\]
If there are pedestrians in the road, the SUT model uses the closest pedestrian to find
\[
s_{SUT} = [v_{oth}, s_{headway}]
\]
where

• $v_{oth}$ is the relative x velocity between the SUT and the closest pedestrian.

• $s_{headway}$ is the relative x distance between the SUT and the closest pedestrian.
These factors determine the acceleration of the SUT in the next time step.
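The modification can be sketched as follows. The IDM constants here are typical published values, not the tuned parameters of this SUT; only the idea of treating the closest in-road pedestrian as the lead vehicle reflects the model described above.

```python
import math

def idm_acceleration(v, v_rel, s_headway, v_des=11.2, a_max=1.4,
                     b=2.0, s_min=2.0, t_headway=1.5, delta=4):
    """Modified-IDM acceleration, treating the closest in-road
    pedestrian as the lead vehicle.

    v:         SUT speed; v_rel: closing speed toward the pedestrian
    s_headway: gap to the pedestrian (float('inf') for a free road)
    Constants are typical published IDM values, not the thesis tuning.
    """
    # Desired gap grows with speed and with the closing rate.
    s_star = s_min + v * t_headway + v * v_rel / (2 * math.sqrt(a_max * b))
    return a_max * (1 - (v / v_des) ** delta - (s_star / s_headway) ** 2)

# Free road: with no pedestrian in the road the gap term vanishes and
# the model relaxes toward its desired velocity of 11.2 m/s.
accel = idm_acceleration(v=9.0, v_rel=0.0, s_headway=float("inf"))
```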
Modified Reward Function
Our modified version of the AST reward (see section 2.4) is shown in eq. (3.1). As a proxy for the
probability of an action, we use the Mahalanobis distance [69], which is a measure of distance from
the mean generalized for multivariate continuous distributions. We use a large negative number as a
penalty for rollouts that do not end in a failure. In addition, the penalty at the end of a no-collision
case includes a heuristic reward that is scaled by the distance (dist) between the pedestrian and
the vehicle. The penalty encourages the pedestrian to end early trials closer to the vehicle, which
allows the reinforcement learner to find failures more quickly and leads to faster convergence. The
reward function is modified from the previous version of AST [40] as follows:
\[
R(s) =
\begin{cases}
0 & s \in E \\
-10000 - 1000 \times \text{dist}(p_v, p_p) & s \notin E,\ t \geq T \\
-\log(1 + M(a, \mu_a \mid s)) & s \notin E,\ t < T
\end{cases} \tag{3.1}
\]
where $M(a, \mu_a \mid s)$ is the Mahalanobis distance between the action $a$ and the expected action $\mu_a$
given the current state $s$. The distance between the vehicle position $p_v$ and the closest pedestrian
position $p_p$ is given by the function $\text{dist}(p_v, p_p)$.
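A sketch of eq. (3.1) as code follows; the argument names are hypothetical, and the diagonal-covariance Mahalanobis computation assumes the zero-mean pedestrian action model described above.

```python
import numpy as np

def crosswalk_reward(collision, at_horizon, a, cov_diag,
                     p_vehicle, p_pedestrian):
    """Reward following eq. (3.1). Assumes the zero-mean pedestrian
    action model with diagonal covariance cov_diag."""
    if collision:
        return 0.0
    if at_horizon:
        dist = np.linalg.norm(p_vehicle - p_pedestrian)
        return -10000.0 - 1000.0 * dist  # penalty shaped by distance
    # Mahalanobis distance of the action from the zero-mean model:
    # M = sqrt(a^T diag(cov)^-1 a) for a diagonal covariance.
    mahalanobis = np.sqrt(np.sum(a ** 2 / cov_diag))
    return -np.log(1.0 + mahalanobis)
```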
Metrics
We use two metrics to evaluate the AST algorithms. The first is the likelihood of the final collision
trajectory output by the system. The second metric is the number of calls to the step function. The
goal of the second metric is to compare the efficiency of the two AST reinforcement learners. Because
the implementations are separate, both wall-clock time and iteration counts are inappropriate for
comparison. The number of calls to the simulator update function (Step), which was the computational
bottleneck, is used instead. This metric is agnostic to the implementation hardware, the algorithm
used, and the run-time of updating the simulation.
Reinforcement Learners
For MCTS, the parameters that control how much of the state space is explored are the depth, the
horizon T , and the number of iterations. The depth and horizon are chosen to be equal so that the
search and rollout stages explore the same scenario. We experimented with different values for the
horizon (50, 75, 100) and found that 100 was the minimum horizon that is sufficiently long to cover
the scenario of interest. We used 2000 iterations. For additional detail on MCTS and DPW, see the
paper by Lee et al. [40].
For DRL, the results shown are obtained using a batch size of 4000. Experimentation showed
that reducing the batch size any further resulted in too much variance during the trials. We use a
step size of 0.1, and a discount factor of 1.0. The DRL approach is implemented using RLLAB [76].
3.4 Results
Table 3.1: Numerical results from both reinforcement learners. Reward without noise shows the
reward of the MCTS path if sensor noise was set to zero, to illustrate the difficulty that MCTS has
with eliminating noise. DRL is able to find a more probable path than MCTS with a large reduction
in calls to the Step function.

Scenario | MCTS Calls to Step | MCTS Reward | MCTS Reward w/o noise | DRL Calls to Step | DRL Reward
1 | 4.9×10^8 | −131 | −71 | 8×10^5 | −62
2 | 1.9×10^6 | −38 | −15 | 8×10^5 | −1.7
3 | 1.6×10^9 | −161 | −104 | 1×10^6 | −52
Figure 3.6: Pedestrian motion trajectories for each scenario and algorithm (MCTS in the top row,
DRL in the bottom row; axes are x (m) and y (m)). (a) Scenario 1. (b) Scenario 2. (c) Scenario 3.
The collision point is the point of contact between the vehicle and the pedestrian. In scenario 3,
pedestrian 1 does not collide with the vehicle.
The results show that both reinforcement learners are able to identify failure trajectories in an
autonomous vehicle scenario. MCTS and DRL identify several simulation rollouts where the vehicle
collides with the pedestrian. Table 3.1 shows the results for the three different scenarios. Both
methods successfully converge to a solution in a tractable number of simulator steps. AST is able
to find collisions by taking advantage of the modified IDM’s decision to ignore any pedestrian who
is not in the road. Although the likelihoods seem to vary greatly, much of this difference is due to
MCTS having negligible but non-zero noise that adds up over the long horizon. The likelihood of
pedestrian motion dominates that of sensor noise. Consequently, the noise should be very sparse,
and the DRL solution reflects this. MCTS, however, has difficulty driving the noise to true zero.
This difficulty results in very small numbers in the noise vector throughout the trajectory. While
this near-zero noise is not enough to affect the result of the trajectory, it does accumulate over the
long trajectory, resulting in a significant difference in the trajectory’s likelihood. We recomputed
the reward as if the noise were 0 as a reference, which is also shown in Table 3.1.
3.4.1 Performance
The number of calls to Step for MCTS is the number of calls required to find a collision in the
rollouts. Training could continue and possibly find better failures, at the cost of extra computation.
The MCTS reinforcement learner had high variance and therefore had to be run multiple
times to achieve consistent results. The rewards shown are the average over the failures found in 100 trials,
and the steps are the total steps taken across all trials.
Across all scenarios, DRL consistently converges to solutions with less than 1% of the number of
calls to Step required by MCTS despite the state and action spaces being very small. Theoretically,
the scalability advantages of DRL should be even more apparent in a higher-dimensional problem.
This advantage is supported by the fact that MCTS performs worse on the dual pedestrian scenario
than it does on both single pedestrian scenarios.
3.4.2 Trajectories
Figure 3.6 shows the pedestrian paths until the collision from each reinforcement learner for scenarios
1, 2, and 3, respectively. The start positions and collision positions, if the trajectory ends in collision,
are marked. In scenario 1, both reinforcement learners send a single pedestrian into the road, and
have the pedestrian move towards the vehicle to create a collision. However, the turn towards the
vehicle is much more pronounced in MCTS, where the pedestrian comes to a near stop, before
angling hard left and into the vehicle. DRL instead settles on a smoother path. The DRL path is
slightly more likely because less acceleration is needed. In scenario 2, both reinforcement learners
find failures with similar trajectories. The pedestrians start at a point from which their mean action
should create a collision. Both reinforcement learners identify this path quickly and the pedestrians
take very little action.
Elements of both scenarios recur in scenario 3, which shows the largest difference between the two
reinforcement learners. Because pedestrian 2 starts farther away from the vehicle, pedestrian 2 has the more likely path to being hit by the
vehicle, as in scenario 2. In both reinforcement learners, the second pedestrian takes actions similar
to each reinforcement learner’s respective scenario 1. However, there is a large difference between
the trajectories of pedestrian 1 from each reinforcement learner. In MCTS, pedestrian 1 maintains
their initial velocity and heading until shortly before the end of the trajectory, when they aggressively accelerate towards the other side of
the road. In DRL, pedestrian 1 takes a slight turn to the right, and then maintains their velocity
and heading from there. MCTS has less ability to minimize the effect of pedestrian 1 on the total
reward since using a single seed results in coupling the actions of the pedestrians. Hence, pedestrian
1 has a different and less optimal trajectory than their counterpart in DRL. In DRL, pedestrian 1
had a change of direction at first, causing pedestrian 2 to be closer to the vehicle. Then pedestrian 1
maintained a course with very little acceleration, minimizing that pedestrian’s effect on the reward.
In scenarios 1 and 3, the blame for the collision could fall on the pedestrian, which would not
inform any modifications to the SUT. However, in scenario 2, the blame is on the vehicle, since it
does not check for pedestrians approaching the crosswalk until the pedestrians are in the crosswalk,
which leaves the vehicle very little time to respond. The suggestion for avoiding a collision like
scenario 2 is to expand the sensing range of the IDM beyond the curb of the road. The reason
AST returns situations where the blame is not on the vehicle is that we define E, the subset of state
space that we are interested in, to be any collision. The kinds of collisions reported in scenarios 1
and 3 do not give the designer of the SUT any insight into how the SUT should be improved. The
solution is to redefine the space of events of interest E to be the subset of collisions in which the
responsibility for the collision falls on the SUT [47]. Such a definition of E requires formal models
of responsibility and blame in various road situations [77].
3.5 Discussion
This chapter demonstrated that we can use deep reinforcement learning to improve the efficiency
of adaptive stress testing. Deep reinforcement learning can find more likely failure scenarios than
Monte Carlo tree search, and it finds them more efficiently. Despite the improved performance,
the trajectories found by both algorithms are similar, which demonstrates consistency of results
across different reinforcement learners. The improved scalability can allow AST to be applied to
autonomous systems acting in continuous and high-dimensional spaces, but AST can still require
substantial computation. This cost can be prohibitive if we want to evaluate a scenario across
multiple similar initial conditions. In the next chapter, we present a slight modification to the
DRL architecture and training process that allows a single run of AST to generalize across initial
conditions.
Chapter 4

Generalizing across Initial Conditions
AST can identify the most likely failure for a given scenario, but what if we are interested in a
range of similar scenarios? For validation purposes, scenarios are often defined not by concrete
instantiations but as a class of scenarios where each initial condition parameter is within a certain
range. Validation would then involve testing across numerous instantiations from the scenario class,
and a system might have to be safe on every instantiation to pass the overall scenario class.
As an example, consider again the running crosswalk scenario. We might be interested in the case
where the vehicle starts with a velocity anywhere between 25 and 35 mph, or where the pedestrian
starts anywhere from 1 to 6 meters from the crosswalk entry, as shown in fig. 4.1. Currently, we
would have to run a new AST instance for each concrete instantiation we are interested in, as shown
in fig. 4.2a. While AST can make finding the most likely failure tractable, it is not necessarily fast,
and we certainly do not want to be required to run hundreds of AST instances for similar scenario
instantiations. Having to run a new AST instance for each instantiation is especially problematic
considering that each instantiation shares a lot of underlying similarities—conceptually, it is easy
to imagine that moving the pedestrian’s starting position a small distance back from the crosswalk
should not greatly change the resulting failure trajectory. Therefore, we should be able to create
an AST reinforcement learner that can generalize during training across the entire initial condition
space, as shown in fig. 4.2b, and it should be far more efficient than running a new instance for
each instantiation since the reinforcement learner will no longer be wasting information from similar
rollouts.
This chapter presents a slight modification to the DRL architecture that allows a single AST
reinforcement learner to generalize across a scenario class defined by an initial condition space. The
new architecture requires only a single instance of AST to be run, and initial conditions are sampled
38
CHAPTER 4. GENERALIZING ACROSS INITIAL CONDITIONS 39
ego vehicle pedestrian
Figure 4.1: An example of a scenario class for the crosswalk example. The ego vehicle and thepedestrian both have a range of initial conditions for their position and velocity. A concrete scenarioinstantiation could be created by sampling specific initial condition values from the ranges.
(a) The previous version of AST running over a space of initial conditions I. The space must be discretized, and then each initial condition s0 requires a separate instance of the DRL reinforcement learner and the simulator. In addition, the reinforcement learner requires the next simulator state st at each time-step.
(b) The new version of AST running over a space of initial conditions I. The continuous space is sampled at random at the start of each trajectory; therefore only one instance of the reinforcement learner is needed. The reinforcement learner does not need access to the simulation state because it maintains a hidden state ht at each time-step. The reinforcement learner instead uses the previous action at.
Figure 4.2: Contrasting the new and old AST architectures. The new reinforcement learner uses a recurrent architecture and is able to generalize across a continuous space of initial conditions with a single trained instance. These improvements allow AST to be used on problems that would have previously been intractable.
Figure 4.3: The network architecture of the generalized recurrent DRL reinforcement learner. An LSTM learns to map the input, a concatenation of the previous disturbance xt and the simulator's initial conditions s0, together with the previous hidden state ht to the next hidden state ht+1 and to [µt+1, Σt+1], the mean and diagonal covariance of a multivariate normal distribution. The disturbance xt+1 is then sampled from this distribution.
at the start of each rollout during training, as shown in fig. 4.2b. The output of AST is no longer the
most likely failure trajectory but is instead a policy that maps initial conditions to the corresponding
most likely failure. Therefore, if a designer were interested in a specific set of initial conditions, they
would first run AST across the whole space, and then they would feed initial conditions into the
trained policy to find the corresponding failures.
4.1 Modified Recurrent Architecture
We presented architectures in chapter 3 for both a fully-connected DRL reinforcement learner,
which we will now refer to as the FCDRL reinforcement learner, and for a discrete (non-generalized)
recurrent DRL reinforcement learner, which we will now refer to as the DRDRL reinforcement
learner. We now present an architecture for a generalized recurrent DRL reinforcement learner,
which we will refer to as the GRDRL reinforcement learner. The GRDRL architecture is shown in
fig. 4.3. The architecture is the same as the DRDRL architecture shown in fig. 3.2, except instead
of the input at each timestep being the previous action, the input is now a concatenation of the
previous action and the initial condition. Because the reinforcement learner sees only the sequence of
actions taken, it previously had no way to differentiate between two trajectories that take identical actions from different initial conditions, even though the outcomes could be very different. This slight modification gives the reinforcement learner enough information to differentiate
between trajectories from different initial conditions, so when the initial conditions of each rollout
are sampled at train time, the resulting policy learns to find the most likely failure across the entire
space of initial conditions. This problem is larger, and therefore harder, than finding a failure from
a single initial condition, so the generalized policy may take longer to converge than running AST
from a single scenario instance. However, the policy will learn during training certain behaviors
CHAPTER 4. GENERALIZING ACROSS INITIAL CONDITIONS 41
that apply across the entire space of initial conditions; for example, it will learn how to use noise
to fool the system-under-test’s perception system, so it should be far more efficient than running a
new AST instance from each initial condition.
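To make the modification concrete, the following is a minimal sketch of a GRDRL-style policy. The sketch is illustrative only: it uses PyTorch for brevity, the class and method names are hypothetical rather than taken from our implementation, and the dimensions follow the experiment setup in section 4.2.

```python
import torch
import torch.nn as nn

class GRDRLPolicy(nn.Module):
    """Recurrent AST policy: maps (x_t, s_0, h_t) to a distribution over x_{t+1}."""

    def __init__(self, action_dim=6, s0_dim=5, hidden_size=64):
        super().__init__()
        # The only change from the DRDRL architecture: the input is the
        # previous disturbance concatenated with the initial conditions s_0.
        self.lstm = nn.LSTMCell(action_dim + s0_dim, hidden_size)
        self.mean_head = nn.Linear(hidden_size, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # diagonal covariance

    def step(self, prev_action, s0, hidden=None):
        h, c = self.lstm(torch.cat([prev_action, s0], dim=-1), hidden)
        dist = torch.distributions.Normal(self.mean_head(h), self.log_std.exp())
        return dist, (h, c)  # sample dist to get the next disturbance x_{t+1}
```

At the start of each training rollout, s0 is sampled from the initial condition space and then held fixed, being fed to the policy at every timestep alongside the previous disturbance.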
4.2 Experiments
This section outlines the problem used in simulation to test AST, the hyperparameters of the DRL reinforcement learner, and the reward structure. For benchmarking purposes, we follow the experiment setup (simulation, pedestrian models, and SUT model) proposed in section 3.3. The problem has a 5-dimensional state space and a 6-dimensional action space, and is run for up to 50 timesteps of 0.1 s each.
4.2.1 Problem Formulation
Figure 4.4: The crosswalk scenario class. To instantiate a concrete scenario, the initial conditions s0,ped, s0,car, v0,ped, and v0,car are drawn from their respective ranges, defined in table 4.1.
Our experiment simulates a common neighborhood road driving scenario, shown in fig. 4.4. The
road has one lane in each direction. A pedestrian crosses at a marked crosswalk, from south to
north. The y origin is at the center of the crosswalk, and the x origin is where the crosswalk meets
the side of the road. The speed limit is 25 mph, which is 11.2 m/s.
The inputs to the GRDRL reinforcement learner include the initial state s0 = [s0,ped, s0,car, v0,ped, v0,car]
where
• s0,ped is the initial x, y position of the pedestrian,
• s0,car is the initial x position of the car,
• v0,ped is the initial y velocity of the pedestrian, and
• v0,car is the initial x velocity of the car.
Initial conditions are drawn from a continuous uniform distribution, with the supports shown in
table 4.1. Trajectory rollouts are instantiated by randomly sampling an initial condition from the
parameter ranges.
Table 4.1: The initial condition space. Initial conditions are drawn from a continuous uniform distribution defined by the supports below.
Variable    Min         Max
s0,ped,x    −1 m        1 m
s0,ped,y    −6 m        −2 m
s0,car      −43.75 m    −26.25 m
v0,ped      0 m/s       2 m/s
v0,car      8.34 m/s    13.96 m/s
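As an illustration, instantiating a rollout could look like the following sketch, which draws each parameter uniformly from the supports in table 4.1 (the variable names are illustrative):

```python
import numpy as np

# Supports copied from table 4.1; variable names are illustrative.
RANGES = {
    "s0_ped_x": (-1.0, 1.0),        # m
    "s0_ped_y": (-6.0, -2.0),       # m
    "s0_car": (-43.75, -26.25),     # m
    "v0_ped": (0.0, 2.0),           # m/s
    "v0_car": (8.34, 13.96),        # m/s
}

def sample_initial_conditions(rng):
    """Draw a concrete scenario instantiation uniformly from the supports."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

s0 = sample_initial_conditions(np.random.default_rng(0))
```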
4.2.2 Modified Reward Function
AST penalizes each step by the likelihood of the disturbances, as shown in eq. (4.1). Unlikely
actions have a higher cost, so the reinforcement learner is incentivized to take likelier actions, and
thereby find likelier failures. The Mahalanobis distance [69] is used as a proxy for the likelihood of an action; it is a measure of distance from the mean, generalized to multivariate continuous distributions. The penalty for failing to find a collision is controlled by α
and β. The penalty at the end of a no-collision case is scaled by the distance (dist) between the
pedestrian and the vehicle. The penalty encourages the pedestrian to end early trials closer to the
vehicle and leads to faster convergence. We use α = −1 × 10^5 and β = −1 × 10^4. The reward function is modified from the previous version of AST [40] as follows:
$$
R(s) =
\begin{cases}
0, & s \in E \\
-\alpha - \beta \times \mathrm{dist}(p_v, p_p), & s \notin E,\; t \geq T \\
-M(a, \mu_a, \Sigma_a \mid s), & s \notin E,\; t < T
\end{cases}
\tag{4.1}
$$
where M(a, µa,Σa | s) is the Mahalanobis distance between the action a and the expected action
µa given the covariance matrix Σa in the current state s. The distance between the vehicle position
pv and the closest pedestrian position pp is given by the function dist(pv,pp).
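A minimal sketch of how eq. (4.1) could be computed in practice follows; the function signature and helper names are hypothetical, and the constants are the α and β values given above.

```python
import numpy as np

ALPHA, BETA = -1e5, -1e4   # values from the text above

def mahalanobis(a, mu, cov):
    """Mahalanobis distance of action a from mean mu under covariance cov."""
    d = np.asarray(a) - np.asarray(mu)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def ast_reward(is_failure, t, horizon, a, mu, cov, p_vehicle, p_pedestrian):
    """Per-step reward, mirroring eq. (4.1)."""
    if is_failure:                       # s in E
        return 0.0
    if t >= horizon:                     # rollout ended without a failure
        dist = float(np.linalg.norm(np.asarray(p_vehicle) - np.asarray(p_pedestrian)))
        return -ALPHA - BETA * dist
    return -mahalanobis(a, mu, cov)      # per-step likelihood penalty
```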
4.2.3 Reinforcement Learners
The GRDRL reinforcement learner is compared against the DRDRL, FCDRL, and MCTS reinforcement learners. For the DRL reinforcement learners, the hidden layer size is 64. Training was done
with a batch size of 5 × 10^5 timesteps. The maximum trajectory length is 50, hence each batch has 1000 trajectories. The optimizer used a step size of 0.1 and a discount factor of 1.0.
4.3 Results
This section shows the performance of the new reinforcement learners on our running example. First,
the generalized reinforcement learner’s ability to train on the problem is compared to that of the
discrete recurrent reinforcement learner. Both reinforcement learners are then compared to baselines
to show their improvement.
4.3.1 Overall Performance
Figure 4.5: The Mahalanobis distance of the most likely failure found at each iteration for both architectures. The conservative discrete architecture runs each of the discrete reinforcement learners in sequential order. The optimistic discrete architecture runs each of the discrete reinforcement learners in a single batch.
The goal of AST is to understand failure modes by returning the most likely failure. An advantage
of the new architecture is its ability to search for the most likely failure from a space of initial
conditions while training a single network. Figure 4.5 demonstrates these benefits by showing the
cumulative maximum reward found by the DRDRL and GRDRL reinforcement learners at each
iteration. There are two estimates shown for the DRDRL architecture:
• Conservative (sequential): Each discrete AST reinforcement learner is run sequentially. This naive approach serves as a lower bound on the performance of the discrete architecture.
• Optimistic (batch): The AST reinforcement learners are updated as a batch. Each batch is assumed to still take 32 iterations, but the best reward of the best reinforcement learner is known after each update.
The generalized architecture outperforms the discrete architecture at every iteration. The gener-
alized version finds a collision sooner and converges to a solution after about 100 iterations, whereas
the discrete architecture is still training after 500 iterations. Furthermore, the generalized version
is able to find a trajectory that has a net Mahalanobis distance of −101.0. In contrast, the discrete version's most likely solution was −114.2. To put this in perspective, these results mean that the average timestep of the generalized version was 2.0 standard deviations from the mean disturbance, while the average timestep of the discrete version was 2.3 standard deviations from the mean disturbance. Over the entire space of initial conditions, the generalized architecture finds likelier failures in far fewer iterations than running the discrete architecture at discrete points.
4.3.2 Comparison to Baselines
Table 4.2: The aggregate results of the DRDRL and GRDRL reinforcement learners, as well as the MCTS and FCDRL reinforcement learners as baselines, on an autonomous driving scenario with a 5-dimensional initial condition space. Despite not having access to the simulator's internal state, the DRDRL reinforcement learner achieves results that are competitive with both baselines. However, the GRDRL reinforcement learner demonstrates a significant improvement over the other three reinforcement learners.
Metric                      MCTS      FCDRL     DRDRL     GRDRL Point   GRDRL Bin
Average Collision Reward    −192.92   −229.80   −236.25   −148.48       −133.86
Max Collision Reward        −145.80   −139.38   −125.51   −98.85        −91.67
Collisions Found            21        29        30        25            32
Collision Percentage        65.63     90.63     93.75     78.13         100
Table 4.2 shows the aggregate results of the new architectures as well as two baselines: the
fully-connected DRL (FCDRL) and a Monte Carlo tree search (MCTS) reinforcement learner. The
data was generated by dividing the 5-dimensional initial condition space into 2 bins per dimension, which resulted in 32 bins. Such a coarse discretization provides little confidence in our validation results, but the number of bins grows as b^5, where b is the number of bins per dimension. Using 3 bins per dimension, which hardly provides more confidence, would already require training 243 instances of AST. Even on a toy problem, running AST for a safe number of discrete points is intractable.
However, to demonstrate the performance benefits of the GRDRL architecture, we ran the MCTS,
FCDRL, and DRDRL reinforcement learners at the center-point of each of the 32 bins. The GRDRL
reinforcement learner was trained on the entire space of initial conditions and evaluated in two ways:
1) by executing the GRDRL reinforcement learner’s policy from the same 32 center-points the other
reinforcement learners were tested at, referred to as point evaluation, and 2) by sampling from each
bin in the initial condition space and keeping the best GRDRL solution, referred to as bin evaluation.
The GRDRL reinforcement learner far outperforms all baselines. When evaluating over the
entirety of each bin, the GRDRL reinforcement learner found collisions in every single bin, and
had by far the best average and maximum collision rewards. In particular, the maximum reward
demonstrates both the strength and the necessity of the new reinforcement learner architecture. The
most likely collision was not at one of the 32 points tested; hence a discretization approach does
not find the most likely trajectory. Surprisingly though, the GRDRL reinforcement learner also
outperforms the other reinforcement learners at the 32 center-points. Despite not being trained from
the center-points specifically, the GRDRL reinforcement learner has a better average and maximum
collision reward. The only degradation in performance was in collision percentage, although even
there the GRDRL reinforcement learner outperforms the MCTS reinforcement learner.
4.4 Discussion
This chapter presented a new architecture for AST to improve the validation of autonomous vehicles.
The new reinforcement learner treats the simulator as a black box and generalizes across a space
of initial conditions. The new architecture is able to converge to a more likely failure scenario in
fewer iterations than the discrete architecture. This architecture is essential for designers who are
interested not just in concrete scenario instantiations but in scenario classes defined by parameter
ranges. Running AST for each scenario instantiation would have been prohibitively expensive and
time-consuming. The new architecture can search the entire scenario class in a single run while still
finding more likely failures. However, this architecture is dependent on having a heuristic reward
signal to guide the reinforcement learner to failures. In the next chapter, we will introduce a new
reinforcement learner that can find failures without the use of heuristic rewards, even when horizons
are long.
Chapter 5
Heuristic Rewards
In section 2.4, we showed that the reward function was derived in such a way that the trajectory that
maximizes reward would be the most likely failure, if a failure exists. While the resulting reward
structure has useful properties—such as being proportional to the likelihood of failure—it is not
without drawbacks. The primary limitation is the problem of sparse rewards. The agent receives a penalty based on whether a failure is found only at the end of what could be a long trajectory, which makes it difficult to assign credit to upstream actions and learn how to force failures. While the agent does receive likelihood rewards at each step, these can actually be counterproductive: if a failure is unlikely, learning to take likelier actions can lead the agent away from finding failures. The penalty for not finding a failure is relatively large in order to make the agent prioritize finding failures over taking likely actions, but that incentive only takes effect after the agent actually finds a failure (and experiences the resulting lack of a massive penalty) in the first place, and the sparse reward structure makes finding that first failure hard.
A possible solution is to change the reward structure to make finding failures easier by using heuristic rewards, which are domain-specific components of the reward function designed using expert knowledge to provide the agent with a more useful reward signal. The reward function allows
for both per-step and per-trajectory heuristic rewards. For example, when validating an autonomous
vehicle on the crosswalk example we may use a heuristic that gives a penalty proportional to the
distance between the pedestrian and the vehicle. This gives the agent a reward signal to follow before
it finds failures, and it will learn to move the pedestrian closer and closer to the vehicle until it finds a
collision. If heuristic rewards are scaled properly, then once the agent finds failures the agent should
be focused on finding likelier failures, so that the end result is not changed. Unfortunately, creating
heuristic rewards may not always be feasible. Designers may not have access to domain experts, or
the domain might not have a clear way to guide the reinforcement learner towards failures.
AST must still be able to find failures in cases in which heuristic rewards are not available, but
current reinforcement learners are not appropriate for such cases. The DRL reinforcement learner is
Figure 5.1: The path to the first reward in the Atari 2600 version of Montezuma's Revenge [78]. The player must take the numbered steps in order, without dying, before getting the first key.
heavily dependent on having a useful reward signal and will struggle on even easy problems for which
there is no heuristic reward. The tree structure inherent in MCTS makes it somewhat more robust
to poor reward signals, but empirically it quickly loses the ability to perform well on validation tasks
with no heuristic rewards as the trajectory length increases. A new approach is needed for AST to
be able to validate systems acting in domains in which no heuristic rewards are readily available.
When an agent has to act in an environment without a clear or useful reward signal, that problem
is known as a hard-exploration problem. Within the domain of hard-exploration problems, the most
notorious benchmark problem is the Atari game Montezuma's Revenge. The starting position of Montezuma's Revenge is shown in fig. 5.1. In order to receive the first reward in Montezuma's Revenge, the player must
1. Go down the ladder.
2. Jump to the rope.
3. Jump to the next ladder.
4. Go down the ladder.
5. Move across the ground and wait for the skull to be in the correct position (the skull moves
back and forth along part of the floor).
6. Jump over the skull when it is in the correct position.
7. Move to the next ladder.
8. Climb the ladder.
9. Jump and get the first key.
When executing these moves, the player must avoid falling off a ledge, touching the skull, or touching
the bottoms of the non-solid portions of the center platform, or else they will die. If the player
successfully gets the key, they receive a small reward. The player must then retrace their steps and
then jump to the platform on the right, where they can touch the door to unlock it. Touching the
door without a key will kill the player. Finally, the player can leave the first room. This is a long,
complex sequence of steps the player must take in the proper order, without dying, before receiving
any reward signal (and that is just the first level). Validation using AST shares many difficulties with Montezuma's Revenge, such as the need to take a complex sequence of steps in the proper order before receiving a reward. Consequently, an algorithm that performs well on Montezuma's Revenge
could be of interest for AST.
In 2019, a new algorithm, go-explore, was released that set new records on Montezuma's Revenge. Go-explore was designed to address two major issues in using RL for hard-exploration problems: detachment and derailment.
• Detachment arises because intrinsic rewards are often consumable resources. The first time an agent reaches a state from which there are multiple paths with high intrinsic rewards, it can only search one. Due to maximum trajectory lengths, the agent may not be able to explore the entire path on the first rollout. Upon returning to the state at the start of the promising paths, it may instead explore one of the other high-reward paths, collecting a similar or greater reward. Consequently, it will have no memory of the partly finished exploration from its first rollout, nor any remaining intrinsic reward to guide it back to that point. The agent has become detached from the reward frontier.
• Derailment arises due to stochasticity being added during training rollouts to enhance explo-
ration. During an earlier rollout, the agent may have discovered a promising state that would
be beneficial to return to. However, stochasticity may be added to the actions of the agent,
which may prevent the agent from successfully returning to the promising state. If the path
back to the promising state is longer or more complex, then it is more likely that the stochastic
perturbations will derail the agent from its desired path. This derailment could prevent the
agent from ever returning to the promising state to explore.
Go-explore mitigates these issues through a two-phase approach. Phase 1 is a heuristic tree search
algorithm that uses deterministic restarts from randomly selected nodes of the tree to return to
promising exploration frontiers without suffering from detachment or derailment. Phase 2 trains a
robust neural-network policy by using the best trajectory found from phase 1 as an expert demon-
stration for the Backward Algorithm, a learning-from-demonstration (LfD) algorithm. This chapter
presents a go-explore AST reinforcement learner for cases where heuristic rewards are not available and search horizons are too long for MCTS to perform well. Since we are only interested in finding failures, we use only phase 1 of go-explore in this chapter; phase 2 has its own useful properties, though, as will be covered in section 6.1.
5.1 Go-Explore
Before explaining how go-explore was applied to validation tasks, this section presents background material explaining how go-explore works in general, with special attention paid to phase 1. For a deeper dive into phase 2, which uses the backward algorithm, see section 6.1.
5.1.1 Phase 1
Phase 1 is an explore-until-solved phase that uses a heuristic tree search and takes advantage of determinism for exploration. During phase 1, a pool of “cells” is maintained, which acts as the tree. A cell is a data structure containing information such as the reward and the trajectory taken to get to the cell. Cells are indexed by a possibly compressed mapping of the agent's state. During rollouts, every step yields a cell. If the cell is “unseen,” meaning its index is not in the pool, the cell is added to the pool. If the cell has already been seen, the new and old versions are compared; if the new cell has a higher reward or a shorter trajectory, it replaces the old cell. In some ways, the algorithm is similar to MCTS, but the two algorithms differ greatly in how rollouts are started.
In go-explore, rollouts are started from a cell randomly selected from the pool. Cells are randomly
selected according to some heuristic rules meant to bias the search towards promising exploration
frontiers, but any cell can be selected at any time. In contrast, MCTS uses backpropagation from
a series of rollouts to select the best option from a series of nodes at a timestep. After selection,
MCTS expands the tree from the selected node at the next timestep and repeats the process. MCTS
therefore tends to select a trajectory step by step, and generally does not go back to revisit nodes
that have been pruned or where a different node was already selected. Go-explore will start a rollout
from any cell at any time, potentially giving the algorithm more capability for exploration. In order
to balance exploration and exploitation, hyperparameters must be carefully chosen to ensure that
unseen areas are explored, but also in such a way that more promising cells are given more rollouts.
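As an illustrative sketch (not the implementation used in our experiments), the cell pool bookkeeping described above could look like the following:

```python
from dataclasses import dataclass, field

@dataclass
class Cell:
    reward: float                      # best cumulative reward seen at this cell
    trajectory: list = field(default_factory=list)  # actions leading to the cell
    value: float = 0.0                 # running value estimate (see eq. (5.3))
    times_chosen: int = 0
    times_chosen_since_improved: int = 0
    times_seen: int = 0

pool = {}  # cell index -> Cell

def update_pool(index, reward, trajectory):
    """Add an unseen cell; replace a seen cell if the new trajectory is
    better (higher reward) or equally good but shorter."""
    cell = pool.get(index)
    if cell is None:
        pool[index] = Cell(reward, list(trajectory))
    elif (reward > cell.reward or
          (reward == cell.reward and len(trajectory) < len(cell.trajectory))):
        cell.reward, cell.trajectory = reward, list(trajectory)
    pool[index].times_seen += 1
```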
As an example, consider using go-explore on the first room of Montezuma’s Revenge from fig. 5.1.
Perhaps on the first rollout the agent successfully makes it to the rope, but then makes an incorrect
move and falls to the floor below, resulting in the agent’s death and the end of the trajectory.
However, due to collected intrinsic rewards, the agent knows it wants to return to this point and
explore. Instead of the agent having to attempt to navigate back to that state while suffering
stochastic perturbations to its actions, go-explore will eventually sample the cell and start a rollout
by deterministically setting the game back to the exact game-state. Eventually, the agent will
get lucky and jump to the next platform, collecting more intrinsic reward, and proceeding further
along the correct trajectory to the key. The further along this trajectory the agent gets, the more
important and useful that deterministic resetting becomes, since the difficulty of returning to an
earlier point along the trajectory drastically increases.
5.1.2 Phase 2
Whereas phase 1 of go-explore returns a single trajectory, phase 2 returns a trained policy that is
robust to stochastic perturbations. In phase 2, a deep neural network agent is trained using the
backward algorithm [79]. The backward algorithm is an algorithm for hard exploration problems
that trains a deep neural network policy based on a single expert demonstration. The trajectory
returned from phase 1 is used as the expert demonstration. The properties of the backward algorithm
lead to a policy that does at least as well as the expert trajectory and may in fact improve upon the
expert, while learning how to overcome deviations from the expert trajectory.
5.2 Go-Explore for Black-Box Validation
Go-explore has shown promising results on hard-exploration benchmarks, but the algorithm does
not meet all of our design desiderata. In particular, the cells are indexed based on a down-sampled version of the environment state, but we would not have access to the state if the simulator were treated as a black box. Furthermore, cell scores for Montezuma's Revenge were based on a notion of “levels” (a measure of progression), which does not have a direct analogue in our validation task. Therefore, we must make changes to adapt both the cell structure and the cell selection algorithm
for our validation task.
5.2.1 Cell Structure
In go-explore, cells are indexed by a compressed representation of the environment state, which in our case would be the simulator state. However, according to our black-box assumption, we may not have access to the simulator state. To preserve the black-box assumption, we instead index cells by hashing a concatenation of the current step number $t$ and a discretized version of the previous action, $\tilde{a}_t$, so the index is $\mathrm{idx} = \mathrm{hash}(t, \tilde{a}_t)$. Therefore, similar actions taken at the same step of a rollout trajectory are treated as the same cell, which preserves the black-box assumption by eliminating the dependence on simulation state while still providing the algorithm with a useful way to group similar observations into cells.
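A sketch of this indexing scheme follows; the discretization grid size is an assumption made for illustration.

```python
import numpy as np

def cell_index(t, prev_action, grid=0.5):
    """idx = hash(t, a~_t): discretize the previous action so that similar
    actions at the same timestep map to the same cell."""
    a_tilde = tuple(np.round(np.asarray(prev_action) / grid).astype(int))
    return hash((t, a_tilde))
```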
5.2.2 Cell Selection
Two key changes need to be made for cell selection to work for validation tasks: 1) deterministic
resets and 2) cell scores.
Deterministic Resets
A key component of the go-explore algorithm is that the simulator is deterministically reset to the
exact simulation state of a cell when that cell is sampled to start a rollout. If the simulation state
is unavailable, though, this becomes trickier. Instead, the simulator must support deterministic
simulation. Each cell stores the trajectory of actions that were taken to reach that cell’s state.
When a cell is sampled to start a rollout, these actions are taken exactly to deterministically return
the simulator to the desired state. Simulation can then proceed as normal.
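A minimal sketch of such a reset, assuming a simulator interface with reset and step methods, follows:

```python
def reset_to_cell(simulator, cell, s0):
    """Deterministically return the simulator to a cell's state by replaying
    the cell's stored action trajectory from the initial conditions."""
    simulator.reset(s0)
    for action in cell.trajectory:
        simulator.step(action)   # deterministic simulation makes replay exact
    return simulator             # now at the cell's state; continue the rollout
```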
Cell Scores
When sampling a cell to start a rollout, a cell is sampled with probability proportional to its fitness
score. The fitness score is partially made up of “count subscores” for three attributes that represent
how often a cell has been interacted with: 1) the number of times a cell has been chosen to start a
rollout, 2) the number of times a cell has been visited, and 3) the number of times a cell has been
chosen since a rollout from that cell has resulted in the discovery of a new or improved cell. For
each of these three attributes, a count subscore for cell $c$ and attribute $a$ can be calculated as

$$
\mathrm{CntScore}(c, a) = w_a \left( \frac{1}{v(c, a) + \epsilon_1} \right)^{p_a} + \epsilon_2
\tag{5.1}
$$
where v(c, a) is the value of attribute a for cell c, and wa, pa, ε1, and ε2 are hyperparameters. The
total unnormalized fitness score is then
$$
\mathrm{CellScore}(c) = \mathrm{ScoreWeight}(c) \left( 1 + \sum_{a} \mathrm{CntScore}(c, a) \right)
\tag{5.2}
$$
When applied to Montezuma’s Revenge, the authors obtained better results by using a ScoreWeight
in eq. (5.2) that was based on what level of the game a cell was in. The ScoreWeight heuristic favored
sampling cells where the agent had progressed further within the game. Unfortunately, progression
does not have a direct analogue for validation tasks.
For validation tasks, we use the estimated value of a cell as the ScoreWeight. Similar to MCTS,
cells track an estimate of the value function. Anytime a new cell is added to the pool, or a cell is
updated, the value estimate for a particular cell $v_c$ is updated as

$$
v_c \leftarrow v_c + \frac{\left( r + \gamma v^{*}_{\mathrm{child}} \right) - v_c}{N}
\tag{5.3}
$$
When a cell is updated, the cell's parent also updates its value estimate. Value updates are therefore propagated all the way up the tree. The total unnormalized fitness score is calculated with ScoreWeight(c) = vc. Cells with high value estimates are selected more often.
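Putting eqs. (5.1) and (5.2) together with the value-based ScoreWeight, the fitness computation could be sketched as follows. The attribute names are illustrative, and the hyperparameter values are those later listed in section 5.3.3.

```python
EPS1, EPS2, P_A = 0.001, 0.00001, 0.5     # shared across attributes
WEIGHTS = {                                # w_a per attribute
    "times_chosen": 0.1,
    "times_chosen_since_improved": 0.0,
    "times_seen": 0.3,
}

def cnt_score(cell, attr):
    """Eq. (5.1): CntScore(c, a) = w_a * (1 / (v(c, a) + eps1))^p_a + eps2."""
    v = getattr(cell, attr)
    return WEIGHTS[attr] * (1.0 / (v + EPS1)) ** P_A + EPS2

def cell_score(cell):
    """Eq. (5.2) with ScoreWeight(c) = v_c, the cell's value estimate."""
    return cell.value * (1.0 + sum(cnt_score(cell, a) for a in WEIGHTS))
```

Cells would then be sampled to start rollouts with probability proportional to their normalized scores.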
5.3 Experiments
Figure 5.2: The layout of the crosswalk example scenario. The car approaches the road where a pedestrian is trying to cross. Initial conditions are shown, and values for s0,ped,y can be found in table 5.1.
5.3.1 Problem Description
The validation scenario consists of a vehicle approaching a crosswalk on a neighborhood road as
a pedestrian is trying to cross, as shown in fig. 5.2. The car is approaching at the speed limit of
25 mph (11.2 m/s). The vehicle, a modified version (see section 3.3) of the intelligent driver model
(IDM) [68], has noisy observations of the pedestrian’s position and velocity. The AST reinforcement
learner controls the simulation through a six-dimensional action vector, consisting of the x and
y components for three parameters: 1) the pedestrian acceleration, 2) the sensor noise on the
pedestrian position, and 3) the sensor noise on the pedestrian velocity. We treat the simulation as
a black box, so the AST agent has access to only the initial conditions and the history of previous
actions. From this general setup we generate three specific scenarios (easy, medium, and hard),
which are differentiated by the difficulty of finding a failure. The differences between the scenarios
include whether a reward heuristic was used (see section 5.3.2), the initial location of the pedestrian,
as well as the rollout horizon and timestep size. Pedestrian and vehicle location are measured from
the origin, which is located at the intersection of the center of the crosswalk and the center of the
vehicle’s lane. The scenario parameters are shown in table 5.1. The easy scenario is designed such
that the average action leads to a collision, so the maximum possible reward is known to be 0. The
medium and hard scenarios require unlikely actions to be taken to force a collision. They have
the same initial conditions, except the hard scenario has a timestep with half the duration of the
medium scenario’s timestep, and accordingly the hard scenario has double the maximum path length
of the medium scenario. The hard scenario demonstrates the effect of horizon length on exploration
difficulty.
Table 5.1: Parameters that define the easy, medium, and hard scenarios. Changing the pedestrian location results in failures being further from the average action, making exploration more difficult, whereas changing the horizon and timestep lengths makes exploration more complex.

Variable    Easy       Medium     Hard
β           1000       0          0
s0,ped,y    −4 m       −6 m       −6 m
T           50 steps   50 steps   100 steps
dt          0.10 s     0.10 s     0.05 s
5.3.2 Modified Reward Function
We make some modifications to the theoretical reward function shown in eq. (2.9) to allow practical
implementation:
$$
R(s) =
\begin{cases}
0, & s \in E \\
-\alpha - \beta \times \mathrm{dist}(p_v, p_p), & s \notin E,\; t \geq T \\
-M(a, \mu_a, \Sigma_a \mid s), & s \notin E,\; t < T
\end{cases}
\tag{5.4}
$$
where M(a, µa, Σa | s) is the Mahalanobis distance [69] between the action a and the expected action µa given the covariance matrix Σa in the current state s, and dist(pv, pp) is the distance between
the pedestrian and the vehicle at the end of the rollout. The latter reward is the domain-specific
heuristic reward that guides AST reinforcement learners by giving a lower penalty when the scenario
ends with a pedestrian closer to the car. While the heuristic reward in eq. (5.4) is not theoretically guaranteed to preserve the optimal policy [64], we find that it works well in practice and is easy
to implement, while also requiring less access to the simulation state. We use α = −1 × 10^5 and β = −1 × 10^4 for the easy scenario and, to disable the heuristic, β = 0 for the medium and hard scenarios.
5.3.3 Reinforcement Learners
For each experiment, the DRL, MCTS, and GE reinforcement learners were run for 100 iterations
each with a batch size of 500. Algorithm-specific hyperparameter settings are listed below.
Go-explore Phase 1
GE was run with hyperparameters similar to those used for Montezuma’s Revenge [54]. For the
count subscore attributes (times chosen, times chosen since improvement, and times seen), we set
wa equal to 0.1, 0, and 0.3, respectively. All attributes share ε1 = 0.001, ε2 = 0.00001, and pa = 0.5.
We always use a discount factor of 1.0. During rollouts, actions are sampled uniformly.
Deep Reinforcement Learning
The DRL reinforcement learner uses a Gaussian-LSTM trained with PPO and GAE. The LSTM has
a hidden layer size of 64 units, and uses peephole connections [80]. For PPO, we used a KL penalty
with factor 1.0 as well as a clipping range of 1.0. GAE uses a discount of 1.0 and λ = 1.0. There is
no entropy coefficient.
Monte Carlo Tree Search
We use MCTS with DPW (see section 2.1.2), where rollout actions are sampled uniformly from the
action space. The exploration constant was 100. The DPW parameters are set to k = 0.5 and
α = 0.5.
5.4 Results
The results of all three reinforcement learners in the easy scenario are shown in fig. 5.3. The easy
scenario is designed so that the likeliest actions lead to collision, and yet even the DRL reinforcement
learner was unable to achieve the optimal reward of 0. However, all three algorithms were able to
find failures quickly when given access to a heursitic reward. GE performed the worst, and both
GE and MCTS showed little improvement after finding their first failure. In contrast, the DRL
reinforcement learner continued to improve over the 100 iterations, ending with the best reward and
therefore the likeliest failure.
Figure 5.4 shows the results of the MCTS and GE reinforcement learners in the medium scenario.
Without a heuristic reward, the DRL reinforcement learner was unable to find a failure within 100
iterations. MCTS and GE, however, were still able to find failures. While there was no heuristic,
the horizon of the problem was still short, and we see that MCTS was still able to outperform GE.
While GE improved more over its first collision than MCTS did, the final collision found by GE was still less likely than even the first collision found by MCTS, and MCTS found a failure more quickly as well.
Figure 5.5 shows the results of GE in the hard scenario. The hard scenario had a longer horizon,
which prevented both DRL and MCTS from being able to find failures within 100 iterations. GE
was still able to find failures, however, and to improve the likelihood of the failure found over the
Figure 5.3: The reward of the most likely failure found at each iteration of the GE, DRL, and MCTS reinforcement learners in the easy scenario. Results are cropped to show results only when a failure was found.
course of training. In fact, when adjusting for the increased number of steps, the GE reinforcement
learner’s results in the hard scenario were very similar to its results in the medium scenario, showing
that GE is robust to longer-horizon problems.
5.5 Discussion
The results across the three scenarios illuminate the strengths and weaknesses of the three algorithms.
When a useful reward signal is present, the DRL reinforcement learner shows the highest ability
to find the most likely failure. However, without a heuristic, it quickly loses its ability to find
failures. In a no-heuristic setting, MCTS is able to find more likely failures than GE as long as
the problem’s horizon is short. However, in longer-horizon problems, GE is able to find failures
that MCTS cannot. The underlying principle of these differences lies in how the algorithms balance
exploration and exploitation. The tree search algorithms show better ability to explore, which is
why they can find failures without the use of heuristics. In contrast, DRL shows better ability to
exploit, which is why, if it finds failures, it finds likelier failures. Understanding the effects of the exploration/exploitation tradeoff raises an interesting question: could we achieve better results
with a two-phase algorithm that could combine the exploration ability of the tree search algorithms
with the exploitation ability of DRL? It turns out that we can, and we demonstrate the abilities of
robustification in the next chapter.
Figure 5.4: The reward of the most likely failure found at each iteration of the GE and MCTS reinforcement learners in the medium scenario. The DRL reinforcement learner was unable to find a failure. Results are cropped to show results only when a failure was found.
Figure 5.5: The reward of the most likely failure found at each iteration of the GE reinforcement learner in the hard scenario. The DRL and MCTS reinforcement learners were unable to find a failure.
Chapter 6
Robustification
From a theoretical standpoint, proposition 2.4.1 shows that the trajectory that maximizes the AST
reward function will be the most likely failure of a system for a given scenario. Unfortunately, since
AST uses RL, in practice the solution is an approximate one. AST may converge to a local optimum,
and there is little in the way of guarantees we can make on how close to the global optimum that
solution will be. In practice, it is not uncommon to see separate AST runs produce high variance
results. The inconsistency over identical runs raises concerns on the reliability of our validation
results after a single run of AST.
In chapter 5 we applied go-explore’s phase 1 to find failures on long-horizon validation problems
without the use of domain heuristics. Go-explore’s phase 2 has applications for AST as well, though.
Phase 2 takes the best trajectory from phase 1 and uses it as an expert demonstration for the
backward algorithm (BA), a learning-from-demonstration (LfD) algorithm. The BA allows us to
turn a single trajectory into a neural network policy that is robust to stochasticity. However,
importantly, the BA also allows the robust policy to improve upon the expert demonstration’s
results.
This chapter introduces a way to use the BA to produce improved and more consistent results
from AST. The BA requires only a trajectory as an expert demonstration; it is agnostic to how
that trajectory was produced. Consequently, we can use the BA to improve the results of any AST
reinforcement learner. Furthermore, the way that the BA’s training process introduces stochasticity
(see section 6.1) essentially makes the algorithm an efficient local search around similar trajectories.
The robustification phase can therefore be seen as a sort of hill-climbing phase to force better
convergence to consistent results. Using the BA in this way allows AST to provide results that have
less variance between identical runs but with a minimal amount of added compute.
6.1 The Backward Algorithm
The backward algorithm is an algorithm for hard exploration problems that trains a deep neural
network policy based on a single expert demonstration [79]. Given a trajectory $(s_t, a_t, r_t, s_{t+1})_{t=0}^{T}$ as the expert demonstration, training of the policy begins with episodes starting from $s_{\tau_1}$, where $\tau_1$ is near the end of the trajectory. Training proceeds until the agent receives as much or more reward than the expert from $s_{\tau_1}$, after which the starting point moves back along the expert demonstration to $s_{\tau_2}$, where $0 \leq \tau_2 \leq \tau_1 \leq T$. Training continues in this way until $s_{\tau_N} = s_0$. Diversity can be
introduced by starting episodes from a small set of timesteps around sτ , or by adding small random
perturbations to sτ when instantiating an episode. Training based on the episode can be done
with any relevant deep reinforcement learning algorithm that allows optimization from batches of
trajectories, such as PPO with GAE.
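A high-level sketch of this training loop follows. The rollout and optimization helpers are injected placeholders (standing in for, e.g., PPO with GAE), and for simplicity the start point moves back one step at a time rather than in the increments τ1, τ2, . . . described above.

```python
def backward_algorithm(expert, policy, env, rollout, train_on_batch,
                       batch_size=64):
    """Train `policy` from the demonstration `expert`, a list of
    (s, a, r, s_next) tuples. `rollout(env, policy, start_state)` returns a
    trajectory in the same format; `train_on_batch` is any batch RL update."""
    def ret(traj):
        return sum(r for (_, _, r, _) in traj)

    for tau in reversed(range(len(expert))):       # move the start point back
        target = ret(expert[tau:])                 # expert's return from s_tau
        while True:
            batch = [rollout(env, policy, expert[tau][0])
                     for _ in range(batch_size)]
            train_on_batch(policy, batch)
            if max(ret(traj) for traj in batch) >= target:
                break                              # matched or beat the expert
    return policy
```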
6.2 Robustification
The BA was designed to create a policy robust to stochastic noise while potentially improving upon
the expert demonstration. Within the context of AST, however, the BA acts as a hill-climbing step,
enabling AST to more consistently converge to a better local optimum. AST is designed to find the
most likely failure of a system in a given scenario even when the environment has high-dimensional
state and action spaces. While existing AST reinforcement learners can provide a good guess for
the most likely failure trajectory, the BA allows us to approximately search the space of similar
trajectories to find the best one. Conceptually, instead of searching the entire space of possible
rollouts, we are instead constraining the search to a significantly smaller space where we already
know a failure exists. Because it is a smaller space, we can search the space that is local to a known
failure more robustly with a reasonable amount of added computational cost, yielding better and
more consistent results. This idea is applicable to all of the reinforcement learners we have used on
AST so far, and we demonstrate this by using it to improve failures found on the scenarios from
chapter 5.
The BA has two key features that make it applicable to improving results given by other AST
reinforcement learners. The first, as covered in section 6.1, is that it can perform efficiently on hard
exploration problems. If the BA required orders of magnitude more compute than the AST rein-
forcement learners, the benefits would not be worth the cost. However, the BA is efficient enough to
justify its use in yielding more consistent AST results. The second key feature is that the BA can im-
prove upon the expert trajectory. Training starts from steps along the expert demonstration, which
limits the deviation between the agent’s actions and the expert trajectory. However, stochasticity
during training still allows a significant amount of deviation from the expert demonstration, which
allows the BA to discover trajectories that yield higher rewards than those of the expert trajectory.
While designed for robustification, these features allow us to instead use the BA as a hill-climbing
method.
A slight change was made to the BA in order to make it a better fit for AST robustification. In
the original BA paper, a policy was trained from a specific step of the expert demonstration until
it learned to do as well, or better, than the expert. In the validation tasks we are interested in,
compute may be too limited to be able to train for indefinite amounts of time. Instead, we modify
the BA to train for a small number of epochs at each step of the expert trajectory, which allows
the total number of iterations to be known and specified ahead of time. This modification did not
prevent the BA from improving upon the expert demonstration in any of our experiments.
6.3 Experiments
The problem description, reward function, and reinforcement learners are identical to those in sec-
tion 5.3. We take the best results of the DRL, MCTS, and GE reinforcement learners from section 5.4
and perform robustification using the BA. The BA was run for 100 iterations with a batch size of
5000, with the results reported as DRL+BA, MCTS+BA, and GE+BA, respectively. The BA rep-
resents the policy with a Gaussian-LSTM, and optimizes the policy with PPO and GAE. The LSTM
has a hidden layer size of 64 units, and uses peephole connections [80]. For PPO, we used a KL
penalty with factor 1.0 and a clipping range of 1.0. GAE uses a discount of 1.0 and λ = 1.0. There
is no entropy coefficient.
6.4 Results
The results of all three reinforcement learners in the easy scenario are shown in fig. 6.1. This scenario
was designed so that the likeliest actions lead to collision, and yet even the DRL reinforcement learner was still not near the optimal reward of 0. In contrast, adding robustification through the BA
resulted in finding failures that were significantly closer to optimal behavior. While GE significantly
improved with robustification, GE+BA was still far from the optimal solution. In contrast, both
MCTS+BA and DRL+BA were able to converge to results very near to 0.
Figure 6.2 shows the results of the non-DRL reinforcement learners in the medium scenario, as the
DRL reinforcement learner was unable to find a failure in that scenario. Adding a robustification
phase again improved both algorithms, and again the MCTS+BA reinforcement learner outper-
formed the GE+BA reinforcement learner. Note that taking the average action was not sufficient
to cause a crash in this scenario.
Figure 6.3 shows the results of GE and GE+BA in the hard scenario, as GE was the only
algorithm to find a failure. Despite the difficulty of the scenario, GE+BA was still able to improve
the results. In section 5.4 we saw that the performance of the GE reinforcement learner is robust
to changes in the horizon length. Similarly, when adjusting for the increased number of steps, the
Figure 6.1: The reward of the most likely failure found at each iteration of the GE, DRL, and MCTS reinforcement learners in the easy scenario, as well as GE+BA, DRL+BA, and MCTS+BA. The dashed lines indicate the respective scores after robustification of each reinforcement learner. Results are cropped to show results only when a failure was found.
GE+BA reinforcement learner’s results in the hard scenario were very similar to its results in the
medium scenario, showing that GE+BA is also robust to longer-horizon problems.
6.5 Discussion
The results across the three scenarios show that robustification consistently yields failures that are
more likely than those yielded by running a standard AST reinforcement learner alone. Even on
problems where a normal DRL reinforcement learner was unable to find any failures, the BA was
able to find likely failures. The power of using the BA in conjunction with another algorithm lies
in its ability to balance exploration and exploitation. The robustification phase applies the strength
of DRL—exploitation—to a domain that requires significantly less exploration since the first failure
was already found by a different algorithm. Consequently, these two-phase methods could also be
seen as an exploration phase plus an exploitation phase, a decomposition that results in a problem
that is easier to solve.
An open question is where validation tasks would fall within the difficulty spectrum presented in
this chapter. All three algorithms will certainly have cases where they are the best choice. However,
it remains to be seen whether one of the three algorithms ends up as the dominant strategy for
validation in the real world. In part, this is due to the added complexities of validating systems in
high-fidelity simulators—a challenge we address in the next chapter.
Figure 6.2: The reward of the most likely failure found at each iteration of the GE and MCTS reinforcement learners in the medium scenario, as well as GE+BA and MCTS+BA. The dashed lines indicate the respective scores after robustification of each reinforcement learner. The DRL reinforcement learner was unable to find a failure. Results are cropped to show results only when a failure was found.
Figure 6.3: The reward of the most likely failure found at each iteration of the GE reinforcement learner in the hard scenario, as well as GE+BA. The dashed line indicates the score after robustification of the GE reinforcement learner. The DRL and MCTS reinforcement learners were unable to find a failure.
Chapter 7
Validation in High-Fidelity
While performing validation in simulation may be necessary due to time and safety constraints,
simulation is clearly not a perfect recreation of reality. Some errors found when testing in simulation
may be real errors that can be replicated in real-world testing, or at least close to real-world errors,
but other errors may be spurious. A spurious failure is one that comes from a fault in the simulator’s
representation of reality, not from an actual fault in the system under test. Furthermore, some actual
faults may not be realizable at all if the simulator is not a good representation of the real world.
For that reason, industry has recently put significant amounts of effort and money towards the
development of high-fidelity simulators (hifi).
While there is no strict definition of what constitutes a hifi simulator, they are generally characterized by features such as advanced dynamics models, perception from graphics, and autonomous system software-in-the-loop simulation. These features allow far more accurate testing, but at a cost:
hifi simulators are also far slower and far more computationally expensive to run than low-fidelity
simulators (lofi). While we need methods like AST to help us capture the variability of real-world
scenarios during testing, the methods may require too many simulation rollouts to be able to run in
high-fidelity.
This chapter presents a way to make AST tractable in high fidelity by learning from simulation
rollouts run in less costly low-fidelity. The idea is to first run AST in low fidelity to find candidate
failures. These candidate failures might be failures that exist in hifi, or are close to failure in hifi,
but they also might be spurious errors. To address this potential problem, we use the backward
algorithm (BA) (see section 6.1), an algorithm for hard-exploration problems that learns a deep
neural network policy using a single expert demonstration. We use the candidate failure as the
expert demonstration to the BA. Doing so has two key advantages:
1. The BA can train a policy that outperforms the original expert, which means that we can
learn to take a low-fidelity failure and transform it into a similar high-fidelity failure.
2. The BA uses very short horizons early in training, which allows us to identify and reject
spurious errors in a computationally efficient way.
7.1 Validation in High-Fidelity
In order to find failures with fewer hifi steps, we will first learn lessons from running AST in lofi, where
simulation rollouts are much cheaper. By using the candidate failures as the expert demonstration
when learning with the BA, we can overcome two key problems with using failures from lofi.
The first problem is that a failure found in lofi may not correspond exactly to a failure in hifi.
For example, a trajectory might have to be slightly different if the dynamics change, or the noise
injection might have to change to adapt to a higher-quality perception model. The BA allows us to
efficiently adapt a lofi failure to a corresponding hifi failure. By using the lofi failure as an expert
demonstration, the BA biases its policy search towards similar policies, reducing the search space
and requiring fewer iterations. However, the BA still allows the learned policy to improve upon the
expert demonstration, which means that the policy can learn to force a failure in hifi even when the
candidate failure does not exist exactly in hifi.
The second problem is that a failure found in lofi may be a spurious failure and may therefore
not correspond to any failure at all in hifi. In such a case, it is important that we minimize the
computational cost of identifying and rejecting the spurious failure. The BA starts training with
truncated rollouts from the end of the expert demonstration. As a consequence of the shorter
trajectories, early epochs of the BA require far fewer simulation steps than running standard DRL.
We can reject a failure as spurious if the BA fails to find a failure from multiple consecutive steps
of an expert demonstration. While this may not always happen in early epochs, we are still able to
reject spurious failures with fewer simulation steps when using the BA.
A remaining problem is the case where a failure in high fidelity cannot be represented in low fidelity at all. In many cases, this may simply be a more extreme version of the first problem. While this will always remain a risk, in section 7.2 we show strong empirical results across a range of different test scenarios that should alleviate some of these concerns.
We made some changes to the BA in order to achieve the best performance on our validation
task. The original algorithm calls for training at each timestep of the expert demonstration until
the policy is as good or better than the original. First, we relax this rule and, instead, move sτ
back along the expert demonstration any time a failure is found within the current epoch. Second,
we add an additional constraint of a maximum number of epochs at each timestep. If the policy
reaches the maximum number of epochs without finding failures, training continues from the next
step of the expert demonstration. However, if sτ is moved back without the policy finding a failure
five consecutive times, the expert demonstration is rejected as a spurious error. We found that there
are times when the BA is unable to find failures in hifi in the early stages of training due to the lofi
Figure 7.1: Layout of the crosswalk example. A car approaches a crosswalk on a neighborhood road with one lane in each direction. A pedestrian is attempting to cross the street at the crosswalk. Initial conditions are shown.
trajectory being too dissimilar to any failure in hifi, so this change allows the BA more epochs to
train and find failures. In a similar vein, we also start training with τ > 0, and when moving back along the expert demonstration after a failure, we move τ back more than one step at a time.
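These modifications could be sketched as follows. The training helper and its interface are hypothetical, and the constants mirror the settings described in section 7.2 (start 10 steps back, move back 4 steps on success, reject after five consecutive start points without a failure).

```python
def adapt_lofi_failure(expert, policy, env, train_epoch,
                       start_offset=10, step_back=4,
                       max_epochs=20, max_misses=5):
    """Adapt a lofi candidate failure to hifi with the modified BA.
    `expert` is the lofi failure trajectory as (s, a, r, s_next) tuples;
    `train_epoch(policy, env, start_state)` runs one training epoch from the
    given start state and returns True if a hifi failure was found."""
    tau = max(len(expert) - start_offset, 0)
    misses = 0
    while tau >= 0:
        found = any(train_epoch(policy, env, expert[tau][0])
                    for _ in range(max_epochs))
        if found:
            misses = 0
            tau -= step_back   # failure found: move the start point further back
        else:
            misses += 1
            if misses >= max_misses:
                return None    # reject the candidate as a spurious lofi failure
            tau -= 1           # continue from the next demonstration step
    return policy
```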
7.2 Case Studies
To demonstrate the BA’s ability to adapt lofi failures to hifi failures, we constructed a series of case
studies that represent a variety of differences one might see between lofi and hifi simulators. These
case studies measure which types of fidelity differences the BA can handle well and which types it
will struggle with. Because the BA starts many epochs from points along the expert demonstration,
many rollouts will have shorter trajectory lengths than rollouts generated by the DRL reinforcement
learner. Therefore, a direct comparison of iterations to DRL would not be fair. Instead, we measure
performance in terms of the number of simulation steps, assuming this would be the bottleneck in
hifi simulators. Unless otherwise noted, all case studies share the following setup:
Simulation
Unless otherwise noted, the test scenario is simulated using the Python simulator presented in
section 3.3. In the test scenario, the system under test (SUT) is approaching a crosswalk on a
neighborhood road where a pedestrian is trying to cross, as shown in fig. 7.1. The pedestrian starts
1.9 m back from the center of the SUT’s lane, exactly at the edge of the street, and is moving
across the crosswalk with an initial velocity of 1.0 m/s. The SUT starts at 55 m away from the
crosswalk with an initial velocity of 11.2 m/s (25 mph), which is also the desired velocity. The SUT
is a modified version of the intelligent driver model (IDM) [68]. IDM is a lane following model that
calculates acceleration based on factors including the desired velocity, the headway to the vehicle in
front, and the IDM’s velocity relative to the vehicle in front. Our modified IDM ignores pedestrians
that are not in the street, but treats the pedestrian as a vehicle when it is in the street, which—due
to large differences in relative velocity—will cause the IDM to brake aggressively to avoid collision.
Simulation was performed with the AST Toolbox (github.com/sisl/AdaptiveStressTestingToolbox; see chapter 8).
Algorithms
To find collisions, AST was first run with a DRL reinforcement learner in each case study's low-fidelity version of the simulator. Once a collision was found, the backward algorithm was run using
the lofi failure as the expert demonstration. Results are shown both for instantiating the backward
algorithm’s policy from scratch and for loading the policy trained in lofi. Results are compared
against running AST with the DRL reinforcement learner from scratch in hifi. Optimization for all
methods is done with PPO and GAE, using a batch size of 5000, a learning rate of 1.0, a maximum
KL divergence of 1.0, and a discount factor of 1.0. The BA starts training 10 steps back from the
last step, and moves back 4 steps every time a failure is found during a batch of rollouts.
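For reference, these shared optimization settings can be grouped as in the following sketch; the keyword names are illustrative placeholders, not the exact garage or PPO argument names.

# Illustrative grouping of the shared hyperparameters above; the key names
# are placeholders rather than the toolbox's exact keyword arguments.
optimizer_config = dict(algorithm="PPO", advantage_estimator="GAE",
                        batch_size=5000, learning_rate=1.0,
                        max_kl_divergence=1.0, discount=1.0)
backward_algorithm_config = dict(start_steps_back=10, backtrack_on_failure=4)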
7.2.1 Case Study: Time Discretization
In this case study, the fidelity difference is time discretization and trajectory length. The lofi
simulator runs with a timestep of 0.5 seconds for 10 steps, while the hifi simulator runs with a
timestep of 0.1 seconds for 50 steps. This fidelity difference approximates skipping frames or steps
to reduce runtime. In order to get an expert demonstration of the correct length and discretization
in hifi, the lofi actions were repeated 5 times for each lofi step. The results are shown in table 7.1.
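As a sketch, this upsampling amounts to repeating each lofi action, assuming the actions are stored row-wise in an array:

# A minimal sketch of upsampling the lofi demonstration for hifi; the action
# values here are placeholders.
import numpy as np

lofi_actions = np.zeros((10, 2))                   # 10 lofi steps at dt = 0.5 s
hifi_actions = np.repeat(lofi_actions, 5, axis=0)  # 50 hifi steps at dt = 0.1 s
assert hifi_actions.shape == (50, 2)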
The hifi DRL baseline took 44 800 simulation steps to find a failure. The lofi reinforcement
learner was run for 5 epochs, finding a failure after 25 600 simulation steps. When instantiating a
policy from scratch, the BA was able to find a failure in hifi after 19 760 steps, 44.1 % of the DRL
baseline. The BA was able to find a failure even faster when the policy trained in lofi was loaded,
needing 15 230 steps to find a failure in hifi, 34.0 % of the DRL baseline and 77.1 % of the BA trained
from scratch.
7.2.2 Case Study: Dynamics
In this case study, the fidelity difference is in the precision of the simulator state. The lofi simulator
runs with every simulation state variable rounded to 1 decimal place, while the hifi simulator runs
with 32-bit variables. This fidelity difference approximates situations when simulators may have
differences in vehicle or environment dynamics. In order to get an expert demonstration with the
correct state variables, the lofi actions were run in hifi. The results are shown in table 7.2.
Table 7.1: The results of the time discretization case study. Load Lofi Policy indicates whether the policy of the BA was initialized from scratch or the weights were loaded from the policy trained in lofi. The BA was able to significantly reduce the number of hifi steps needed, and loading the lofi policy produced a further reduction.

Algorithm   Steps to Failure   Final Reward   Load Lofi Policy?   Lofi Steps   Percent of Hifi Steps
BA          19 760             −794.6         No                  25 600       44.1 %
BA          15 230             −745.6         Yes                 25 600       34.0 %
Hifi        44 800             −819.9         –                   –            –
The hifi DRL baseline took 46 800 simulation steps to find a failure. The lofi reinforcement
learner was run for 10 epochs, finding a failure after 57 200 simulation steps. When instantiating a
policy from scratch, the BA was able to find a failure in hifi after 13 320 steps, 28.5 % of the DRL
baseline. The BA was able to find a failure even faster when the policy trained in lofi was loaded,
needing just 2840 steps to find a failure in hifi, 6.1 % of the DRL baseline and 21.3 % of the BA
trained from scratch.
Table 7.2: The results of the dynamics case study. Load Lofi Policy indicates whether the policy of the BA was initialized from scratch or the weights were loaded from the policy trained in lofi. The BA was able to significantly reduce the number of hifi steps needed, and loading the lofi policy produced a further reduction.

Algorithm   Steps to Failure   Final Reward   Load Lofi Policy?   Lofi Steps   Percent of Hifi Steps
BA          13 320             −729.7         No                  57 200       28.5 %
BA          2840               −815.8         Yes                 57 200       6.1 %
Hifi        46 800             −819.3         –                   –            –
7.2.3 Case Study: Tracker
In this case study, the fidelity difference is that the tracker module of the SUT perception system is
turned off. Without the alpha-beta filter, the SUT calculates its acceleration at each timestep based
directly on the noisy measurement of pedestrian location and velocity at that timestep. This fidelity
difference approximates cases when hifi perception modules are turned off. Perception modules might be turned off in order to achieve faster runtimes. In order to get an expert demonstration with
the correct state variables, the lofi actions were run in the hifi simulator. The results are shown in
table 7.3.
The hifi DRL baseline took 44 800 simulation steps to find a failure. The lofi reinforcement
learner was run for 20 epochs, finding a failure after 112 000 simulation steps. When instantiating a
policy from scratch, the BA was able to find a failure in hifi after 18 600 steps, 41.5 % of the DRL
baseline. The BA was able to find a failure even faster when the policy trained in lofi was loaded,
needing just 2750 steps to find a failure in hifi, 6.1 % of the DRL baseline and 14.8 % of the BA
trained from scratch.
Table 7.3: The results of the tracker case study. Load Lofi Policy indicates whether the policy of the BA was initialized from scratch or the weights were loaded from the policy trained in lofi. The BA was able to significantly reduce the number of hifi steps needed, and loading the lofi policy produced a further reduction.

Algorithm   Steps to Failure   Final Reward   Load Lofi Policy?   Lofi Steps   Percent of Hifi Steps
BA          18 600             −777.3         No                  112 000      41.5 %
BA          2750               −785.7         Yes                 112 000      6.1 %
Hifi        44 800             −800.1         –                   –            –
7.2.4 Case Study: Perception
This case study is similar to the tracker case study in that it models a difference between the
perception systems of lofi and hifi simulators; however, in this case study the difference is far greater.
Here, the hifi simulator of the previous case studies is now the lofi simulator. The new hifi simulator
has a perception system that uses LIDAR measurements to create a dynamic occupancy grid map (DOGMa) [81]–[83]; our implementation was based on that of github.com/mitkina/EnvironmentPrediction. At each timestep, AST outputs the pedestrian acceleration and a single noise
parameter, which is added to the distance reading of each beam that detects an object. The SUT
has 30 beams with 180 degree coverage and a max detection distance of 100 m. The DOGMa particle
filter uses 10 000 consistent particles, 1000 newborn particles, a birth probability of 0.0, a particle
persistence probability of 1.0, and a discount factor of 1.0. Velocity and acceleration variance were
initialized to 12.0 and 2.0, respectively, and the process noise for position, velocity, and acceleration
was 0.1, 2.4, and 0.2, respectively.
This case study also starts with slightly different initial conditions. The pedestrian starting
location is now 2.0 m back from the edge of the road, while the vehicle starting location is only 45 m
from the crosswalk. The initial velocities are the same.
The difference in noise modeling means that the action vector lengths of the lofi and hifi simulators now differ. In order to get an expert demonstration, the lofi actions were run in hifi, with
the noise portion of the action vectors set to 0. Because the action vectors are of different lengths,
the reinforcement learner networks have different sizes as well, so it was not possible to load the lofi
policy for the BA in this case study.
The hifi DRL baseline took 135 000 simulation steps to find a failure. The lofi reinforcement
learner took 100 000 simulation steps to find a failure. The BA was able to find a failure in hifi after
only 6330 steps, a mere 4.7 % of the DRL baseline.
Table 7.4: The results of the perception case study. Due to differences in network sizes, resulting from different disturbance vector sizes, it was not possible to load a lofi policy in this case study. The BA was still able to significantly reduce the number of hifi steps needed.

Algorithm   Steps to Failure   Final Reward   Load Lofi Policy?   Lofi Steps   Percent of Hifi Steps
BA          6330               −385.8         No                  100 000      4.7 %
Hifi        135 000            −511.1         –                   –            –
7.2.5 Case Study: NVIDIA DriveSim
As a proof-of-concept, for the final case study we implemented the new AST algorithm on a hifi simulator from industry. NVIDIA's DriveSim is a hifi simulator that combines high-accuracy dynamics with features such as perception from graphics and software-in-the-loop simulation. An example rendering of an intersection in DriveSim is shown in fig. 7.2. After the AST Toolbox was connected with DriveSim, we simulated the standard crossing-pedestrian scenario with the modified IDM as the SUT. Here, the lofi simulator was the AST Toolbox simulator used for all previous case studies, and the lofi policy was trained for 265 450 steps. Using the BA, AST was able to find a failure in 4060 hifi steps,
which took only 10 hours to run. While the SUT was still just the modified IDM, these exciting
results show that the new approach makes it possible to find failures with AST on state-of-the-art
industry hifi simulators.
7.3 Discussion
Across every case study, a combination of running DRL in lofi and the BA in hifi was able to find
failures with significantly fewer hifi steps than just running DRL in hifi directly. Some of the fidelity
differences were quite extreme, but the BA was still able to find failures and to do so in fewer steps
than were needed by just running DRL in hifi directly. In fact, the most extreme example, the
perception case study, also had the most dramatic improvement in hifi steps needed to find failure.
These results show the power of the BA in adapting to fidelity differences and make the approach of
running AST in hifi significantly more computationally feasible. Further work could explore using
more traditional transfer learning and meta-learning approaches to save hifi simulation steps using
lofi or previous hifi simulation training results.

Figure 7.2: Example rendering of an intersection from NVIDIA's DriveSim simulator, an industry example of a high-fidelity simulator.
The approach of loading the lofi DRL policy had interesting results as well. In all the case
studies presented here, the policy loading approach was even faster than running the BA from
scratch, sometimes drastically so. However, throughout our work on this chapter we also observed
multiple cases where running the BA with a loaded policy did not result in finding failures at all,
whereas running the BA from scratch was still able to find failures in those cases. Furthermore,
there are cases, for instance the perception case study in section 7.2.4, where loading the lofi policy
is not even possible. Based on our experiences, loading the lofi policy is a good first step: it often
works, and when it works, works very well. However, if the BA fails to find a failure with a loaded
policy, then the BA should be run again from scratch, as running from scratch is a more robust
failure-finding method than loading the policy from lofi. Future work could focus on making the BA more robust when using a loaded lofi policy. The policy has a learned standard deviation network, and one reason the BA may sometimes fail with a loaded lofi policy is that, during training in lofi, the policy has already converged to small standard deviation outputs, leading to poor exploration.
Results might be improved by reinitializing the standard deviation network weights or by finding
other ways to boost exploration after a certain number of failed BA training epochs.
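As a sketch of that reinitialization idea, assuming a Gaussian policy object that stores a learned log standard deviation separately from its mean parameters:

# A minimal sketch of restoring exploration by resetting a converged standard
# deviation; GaussianPolicyStub is a hypothetical stand-in, not the toolbox API.
import numpy as np

class GaussianPolicyStub:
    def __init__(self, action_dim):
        self.mean_params = np.random.randn(action_dim, 8)  # learned mean head (kept)
        self.log_std = np.full(action_dim, -3.0)           # converged, near-deterministic

def boost_exploration(policy, init_log_std=0.0):
    # Reset only the standard-deviation parameters so the behavior transferred
    # from lofi is preserved while exploration recovers.
    policy.log_std = np.full_like(policy.log_std, init_log_std)

policy = GaussianPolicyStub(action_dim=6)
boost_exploration(policy)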
One final point is important to note: the goal of AST is to tractably find likely, and therefore
useful, failures in a system in simulation without constraints on actor behavior that can compromise
safety. AST is not a method whose goal is to estimate the total probability of failure. The hope of
this approach is that the majority of failures in hifi are also present in lofi, with additional spurious
errors, but it is certainly possible that there are some errors in hifi that have no close analog in lofi.
By biasing our search towards a likely failure found in lofi, we could actually hurt our ability to find
certain hifi failures. If our goal were to compute the total failure probability, such a bias could be
a critical flaw that might lead to significantly underestimating the likelihood of failure in certain
situations. However, such a bias is far less of a concern when we are instead merely looking to find
a likely and instructive failure. Indeed, the results bear this out, as the likelihoods of the failures
found by the BA were not just on par with the likelihoods of the failures found by running DRL in
hifi but were actually greater across all case studies.
This chapter, in conjunction with preceding chapters, represents a substantial step forward in
the theory of AST. However, this theoretical step is not worth much if it is not easily accessible for
use by system designers. In the next chapter we present the AST Toolbox, an open-source software
package that allows anyone to apply AST to validating their own autonomous systems.
Chapter 8
The AST Toolbox
Two common themes throughout this thesis are 1) the importance of validating that autonomous
systems will behave as expected prior to their deployment to ensure their safety and 2) the difficulty
of performing said validation. Poor system validation will lead to failures, and failures of safety-critical systems can lead to human injury or fatality. Such costly failures do not merely reflect on the
particular creators of the system at fault but can also lead to a lack of trust in autonomous systems
as a whole. Considering what is at stake, both human lives and societal trust in autonomous systems,
it is clear that safety must be a collaborative property, not a competitive one. Since validation is a
key step of creating safe systems, it is essential for validation methods to be available as open-source
and collaborative resources.
Fortunately, the research community has already created many good open-source options for
validating autonomous systems, as covered in section 8.1. However, there is a niche within the
ecosystem of open-source validation methods that remains unfilled. As of yet, there are no open-
source solutions for finding the most likely failure of a system. Additionally, while there are many
open-source algorithms for validation that treat the system under test as a black box, there are very
few options that can treat the entire simulator as a black box.
The algorithms presented throughout this thesis can fill the aforementioned niche, as AST both
finds the most likely failure and can treat the simulator as a black box. However, merely publishing
algorithm outlines or uploading project code is of limited use. Instead, validation methods should
be made available in a way that is both easily accessible and extensible. For that reason, we have
created the AST Toolbox, an open-source toolbox that allows system designers to apply AST to their
own validation problems. This chapter presents the AST Toolbox, an open-source package designed
to make AST easy to apply to any autonomous system. The Toolbox allows users to wrap their
simulator in a Python class that turns the validation problem into an OpenAI gym environment [84].
The Toolbox bundles the wrappers with garage [85], an open-source RL library that handles policy
creation and optimization. Examples and documentation enable users to validate any system in any
simulator, regardless of programming language, in a standardized and straightforward process.

Table 8.1: A feature comparison of the AST Toolbox with three existing software solutions for system validation and verification. The AST Toolbox is unique in two features: 1) being able to treat the entire simulator as a black box, and 2) returning the most likely failure.

Tool            Black-box SUT   Black-box Simulator   Returns Falsifying Trace   Returns Most Likely Failure   Formal Guarantees
S-TaLiRo [86]   X               ·                     X                          ·                             ·
Breach [5]      X               ·                     ·                          ·                             X¹
FalStar [41]    X               ·                     X                          ·                             ·
AST Toolbox     X               X                     X                          X                             ·
¹ For linear systems
8.1 Comparison to Existing Software
Table 8.1 compares the features of the AST Toolbox with those of existing software. Unlike other
approaches, AST generates falsifying traces for black-box simulators, returning the most likely fail-
ure. Formal methods often provide guarantees at the expense of computation, whereas AST is
designed with tractability in mind. The goal is to enable approximate verification of complex, high-
dimensional autonomous systems. AST does not replace traditional simulation testing approaches.
Instead, it is an additional validation process for developers to both discover and understand likely
failures relatively quickly, with reduced simulation time.
8.2 AST Toolbox Design
The AST Toolbox is a software package that provides a framework for using AST with any simulator,
facilitating the validation of autonomous agents. The toolbox has three major components: the
reinforcement learners, the simulator interface, and the reward function. The reinforcement learners
are algorithms for finding the most likely failure of the system under test. The AST simulator
interface provides a systematic way of wrapping a simulator to be used with the AST environment.
The reward function uses the standard AST reward structure together with heuristics to guide the
search process and incorporate domain expertise.
Figure 8.1: The AST Toolbox framework architecture. The core concepts of the method are shown, as well as their associated abstract classes: ASTEnv and ASTSpaces for the AST module, ASTSimulator and ASTReward for the wrapped simulator and reward function (connected through step(), reset(), getReward(), and getRewardInfo() calls), and RLAlgorithm and Policy for the reinforcement learner. ASTEnv combines the simulator and reward function in a gym environment. The reinforcement learner is implemented using the garage package.
8.2.1 Architecture Overview
The architecture is shown in fig. 8.1. The three core concepts of the AST method (simulator, rein-
forcement learner, and reward function) have abstract base classes associated with them. These base
classes provide interfaces that allow interaction with the AST module, represented by the ASTEnv
class. ASTEnv is a gym environment that interacts with a wrapped simulator ASTSimulator and
a reward function ASTReward. In conjunction with ASTSpaces, which are gym spaces, the AST
problem is encoded as a standard gym reinforcement learning problem. Many available open-source
reinforcement learning algorithms work with gym environments, but our reinforcement learners are
implemented using the garage framework. The reinforcement learner derives from the garage class
RLAlgorithm, and it uses both a Policy, such as a Gaussian LSTM, and an optimization method,
such as TRPO.
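As an illustration of this structure, a user-side wrapper might look like the following sketch. The method names echo fig. 8.1, but the toolbox's exact base classes and signatures may differ, and BlackBoxSim is a hypothetical stand-in for a user's simulator.

# A minimal sketch of wrapping a black-box simulator for use with ASTEnv.
class BlackBoxSim:
    def __init__(self):
        self.state = 0.0
    def apply(self, disturbance):
        self.state += disturbance
    def observe(self):
        return self.state
    def collided(self):
        return abs(self.state) > 1.0

class ExampleSimulatorWrapper:  # would derive from ASTSimulator in the toolbox
    def __init__(self, sim):
        self.sim = sim
    def reset(self):
        self.sim.state = 0.0
    def step(self, disturbance):
        # advance the wrapped simulator one timestep under the disturbance
        self.sim.apply(disturbance)
        return self.sim.observe(), self.sim.collided()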
8.2.2 Reinforcement Learners
The AST Toolbox is integrated with garage to provide easy access to efficient implementations of
reinforcement learners. The Toolbox comes with four algorithms already implemented:
• Deep reinforcement learning: The garage package has a range of deep reinforcement
learning algorithms and policies that can be used off the shelf, including different variations
of multi-layer perceptrons (MLP), gated recurrent units (GRU), and long short-term memory
(LSTM) networks (see section 2.1.3 for background). We have found the Gaussian LSTM
network with peepholes to be the most successful. Recurrent networks have the advantage
of not needing access to the simulation state, instead maintaining a hidden-state based on
the sequence of previous actions taken. A policy can, however, be defined to depend on the
simulation state, which can increase performance. For direct disturbance control, using a
Gaussian policy with a learned standard deviation introduces noise early on in training, which
improves exploration, but allows the network to reduce the noise over time, which results in
better exploitation. Note that for seed-disturbance control, the policy’s output (the seed) does
not have a smooth mapping to disturbances, so using a Gaussian policy would not make sense.
A Gaussian LSTM is used for both experiments in this chapter, and a Gaussian MLP is also
used on the cartpole task. Both tasks use garage’s implementation of PPO to optimize the
DRL reinforcement learner.
• Monte Carlo tree search: The toolbox offers two variants of MCTS (see section 2.1.2 for
background). The vanilla version is MCTS with UCB and DPW. The toolbox also offers a
variant called MCTS with blind value (MCTS-BV). MCTS-BV encourages exploring distinct
actions and therefore is more effective when the disturbance space is large and continuous. Note
that a separate MCTS random seeds (MCTSRS) algorithm is provided for seed-disturbance
control.
• Go-explore (Phase 1): The Toolbox provides an implementation of the tree-search phase
of go-explore (see section 5.1 for background). Go-explore has additional dependencies and must use a different environment class (GoExploreASTEnv) to account for the deterministic
resets. Different cell selection methods can be implemented by changing the Cell and CellPool
classes or by changing the optimize policy function of the GoExplore class. The downsample
function should also be overloaded or modified based on the application domain.
• The backward algorithm: The Toolbox also provides an implementation of the backward
algorithm (see section 6.1 for background), which can be used as phase 2 of go-explore, as
a general robustification phase (see chapter 6), or for transferring failures from low-fidelity
simulators to high-fidelity simulators (see chapter 7). The backward algorithm also requires
deterministic resets of the simulator, so GoExploreASTEnv must be used as the environment
class.
The Toolbox provides extensions of garage’s samplers, so parallel batch sampling and vectorized
sampling are both usually supported, not just for the algorithms listed above but also for most
custom policies or algorithms created by users.
8.2.3 Simulation Interface
The AST Toolbox includes a class template for an interface between the package and a general
simulator. The interface is a wrapper that implements the three necessary control functions for
AST, while still treating the simulator as a black box. Four specific simulator options are available:
• Open-loop vs. closed-loop: An open-loop simulator is one in which all of the disturbances
must be specified ahead of time, whereas a closed-loop simulator accepts online control at
each time-step. TRPO uses batch optimization, so the update steps for both simulators are
equivalent (see the sketch after this list).
• Fixed vs. sampled initial state: The reinforcement learners can be run from a fixed initial state.
Alternatively, AST reinforcement learners can generalize over a space of initial conditions by
sampling them during training.
• Black-box vs. white-box: A black-box simulator provides no access to the internal simulation
state, whereas a white-box simulator does. If the internal simulation state is accessible, using
it may boost performance.
• Exposed actions vs. random seed only: A simulator with exposed actions allows full program-
matic specification of simulation rollouts, allowing AST to control the environment directly
by outputting disturbance vectors. The alternative is to indirectly control disturbances by
allowing AST to control the seed of all the random number generators in a simulator, the
assumption being that the disturbances are subsequently sampled from said generators.
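The sketch below contrasts the first of these options; the simulator and policy interfaces are hypothetical stand-ins rather than the toolbox's actual classes.

# A minimal sketch of the open-loop vs. closed-loop rollout distinction.
def open_loop_rollout(sim, disturbances):
    # all disturbances are specified up front, then the simulator runs once
    sim.reset()
    return sim.simulate(disturbances)

def closed_loop_rollout(sim, policy, horizon):
    # the reinforcement learner selects each disturbance online
    sim.reset()
    obs, trajectory = sim.observe(), []
    for _ in range(horizon):
        disturbance = policy(obs)
        obs = sim.step(disturbance)
        trajectory.append((disturbance, obs))
    return trajectory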
8.2.4 Reward Structure
The reward function module follows the reward function presented in eq. (2.9). Instead of using
the log-probability of disturbances, the function uses the negative Mahalanobis distance [69]. The
reward still ends up proportional to the log-likelihood of the failure found, but the Mahalanobis
distance works better in practice as it does not explode to near-infinite numbers for disturbances
that have very low probability. If no heuristics are used, then the reward function does not require
any information from or access to the simulator beyond what is defined in section 2.5. However, the
reward function is able to accept additional information from the simulator to calculate a heuristic
reward bonus when such a reward bonus is applicable.
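A minimal sketch of this reward term follows; for a Gaussian disturbance model, the negative Mahalanobis distance tracks the log-likelihood while staying numerically well-behaved.

# A minimal sketch of the disturbance-likelihood reward; mu and cov describe
# the natural disturbance distribution, and the values below are placeholders.
import numpy as np

def mahalanobis(x, mu, cov):
    d = x - mu
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

disturbance = np.array([0.3, -0.1])
reward = -mahalanobis(disturbance, mu=np.zeros(2), cov=np.eye(2))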
8.3 Case Studies
8.3.1 Cartpole
Cartpole is a classic test environment for continuous control algorithms [66]. The state $s = [x, \dot{x}, \theta, \dot{\theta}]$ represents the cart's horizontal position and speed as well as the bar's angle and angular velocity. The system under test (SUT) is a neural network control policy trained by PPO. The control policy controls the horizontal force $\vec{F}$ applied to the cart, and the goal is to prevent the bar on top of the cart from falling over. The failure of the system is defined as $|x| > x_{\max}$ or $|\theta| > \theta_{\max}$. The initial state is $s_0 = [0, 0, 0, 0]$. Figure 8.2 shows the cartpole environment.
The reinforcement learner interacts with the simulator by applying a disturbance to the SUT's control force $\vec{F}$. At each time-step, the disturbance force $\delta\vec{F}$, given by the reinforcement learner output, and the control force $\vec{F}$, given by the control policy, are applied simultaneously to the cart.
Figure 8.2: Layout of the cartpole environment. A control policy tries to keep the bar from falling over, or the cart from moving too far horizontally, by applying a control force to the cart [65].
The reward function uses $R_{\bar{E}} = 1\times10^4$ and $R_E = 0$, and as a heuristic reward uses the normalized distance of the final state to the failure states, which is given by

$$f(s) = \min\left(\frac{|x - x_{\max}|}{x_{\max}}, \frac{|\theta - \theta_{\max}|}{\theta_{\max}}\right) \qquad (8.1)$$

The heuristic reward encourages the reinforcement learner to push the SUT closer to failure. The disturbance likelihood reward $\rho$ is set to the log of the probability density function of the natural disturbance force distribution, which is a Gaussian with zero mean and a standard deviation $\sigma = 0.8$. No per-step heuristic is used.
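In code, eq. (8.1) is a direct transcription:

# The normalized distance of a state to the failure states, per eq. (8.1);
# smaller values mean the SUT is closer to failure.
def distance_to_failure(x, theta, x_max, theta_max):
    return min(abs(x - x_max) / x_max, abs(theta - theta_max) / theta_max)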
Two reinforcement learners were compared in this experiment: MCTS and DRL. MCTS was
trained with k = 0.5, α = 0.5, and c = 10. Since the MCTS reinforcement learner does not
need the true simulation state, its performance is the same for both the black-box and white-box
simulator settings. The DRL reinforcement learner was tested under both white-box and black-box
settings. For the white-box setting, we used a multilayer perceptron network with hidden layer sizes
of 128, 64, and 32. The step size was set to 5.0. For the black-box setting, we used a Gaussian LSTM
network with 64 hidden units. The step size was set to 1.0. Both neural networks were trained using
the PPO algorithm with a batch size of 2000 and a discount factor γ = 0.99. We additionally added
a random search reinforcement learner as the baseline. All reinforcement learners were trained using
2×10^6 simulation steps. Each reinforcement learner was run for 10 trials using different random
seeds and the results were averaged. The hyperparameters used in this experiment were found
CHAPTER 8. THE AST TOOLBOX 77
empirically. All reinforcement learners used the closed-loop setting.
The best trajectory return found at each trial was recorded every 1×10^3 simulation steps. The
average of the best trajectory return over 10 trials is shown in fig. 8.3. Both MCTS and DRL rein-
forcement learners (with MLP and LSTM architectures) are able to find failure trajectories, whereas
the random-search baseline fails to do so in all trials. In this experiment, the DRL reinforcement
learners use significantly fewer simulation steps than the MCTS reinforcement learner to find failure
trajectories in all trials. The MLP reinforcement learner is also slightly more sample-efficient than
the LSTM since it has access to the true simulator state. The best rewards found by the MLP reinforcement learner, the LSTM reinforcement learner, and the MCTS reinforcement learner were −59.4,
−50.6, and −100.1, respectively. Surprisingly, the best reward found by the LSTM reinforcement
learner is slightly higher than that for the MLP reinforcement learner, but this is likely due to the
stochastic nature of the reinforcement learners.
Figure 8.3: Best return found up to each iteration, plotted against simulation step number (legend: Random Search; TRPO, White-Box; TRPO, Black-Box; MCTS). The value is averaged over 10 different trials. Both the MCTS and DRL reinforcement learners are able to find failures, but the DRL reinforcement learner is more computationally efficient.
8.3.2 Autonomous Vehicle
The autonomous vehicle task is a recreation of the autonomous driving experiment from section 3.3.
A pedestrian crosses a neighborhood road at a crosswalk as an autonomous vehicle approaches, as
shown in fig. 8.4. The x-axis is aligned with the edge of the road, with East being the positive
x-direction. The y-axis is aligned with the center of the cross-walk, with North being the positive
y-direction. The system under test is the autonomous driving policy, which is a modified version of
the Intelligent Driver Model [68].
The reward function for this scenario uses $R_{\bar{E}} = -1\times10^5$, $R_E = 0$, and a heuristic reward with $\Phi = 10000 \cdot \mathrm{dist}(p_v, p_p)$, where $\mathrm{dist}(p_v, p_p)$ is the distance between the pedestrian and the SUT. This heuristic encourages the reinforcement learner to move the pedestrian closer to the car
in early iterations, which can significantly increase training speeds. The reward function also uses
$\rho = M(x, \mu_x \mid s)$, which is the Mahalanobis distance function [69]. The Mahalanobis distance is
a generalization of distance to the mean for multivariate distributions. The pedestrian and noise
models used in this experiment are Gaussian, making the Mahalanobis distance proportional to
log-likelihood.

Figure 8.4: Layout of the autonomous vehicle scenario. A vehicle approaches a cross-walk on a neighborhood road as a single pedestrian attempts to walk across. Initial conditions are shown.
The reinforcement learner interacts with the simulator by controlling the pedestrian’s acceleration
and the noise on the sensors. At each time-step an acceleration (ax, ay) vector is used to move the
pedestrian. Gaussian noise is also added to the sensors at each time-step. The reinforcement learner
outputs the mean and diagonal covariance of a multivariate Gaussian distribution at each time-step,
from which the acceleration and noise vectors are sampled. Using a distribution adds controlled
stochasticity to the actions, which enhances exploration.
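A sketch of this action model, with placeholder values standing in for the network outputs:

# A minimal sketch of sampling the disturbance from the policy's Gaussian
# output; the mean and diagonal covariance values are placeholders.
import numpy as np

rng = np.random.default_rng(0)
mean = np.zeros(6)            # network-output mean (placeholder)
diag_cov = np.full(6, 0.01)   # network-output diagonal covariance (placeholder)
disturbance = rng.normal(mean, np.sqrt(diag_cov))
accel, sensor_noise = disturbance[:2], disturbance[2:]  # (ax, ay) and noise terms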
The SUT has an initial and target velocity of 11.2 m/s, with a target follow distance of 5 m. The
SUT starts −35.0 m from the cross-walk. The pedestrian starts −4.0 m behind the crosswalk with
an initial velocity of 1.0 m/s. The standard deviations of the pedestrian accelerations are 0.1 m/s² in the x-direction and 0.0 m/s² in the y-direction, and the standard deviation of all noise vectors is 0.1 m.
The autonomous vehicle experiment was run with the DRL reinforcement learner. Because
autonomous vehicles generally act in high-dimensional spaces, we do not use the MCTS reinforcement
learner on this problem. The DRL reinforcement learner used a batch size of 50 000, γ = 0.999, and
was optimized with a PPO step size of 1.0. The reinforcement learner was run for 5×10^6 simulation
steps.
The reward of the most likely trajectory found by the DRL reinforcement learner at each iteration
for the autonomous vehicle experiment is shown in fig. 8.5. The reinforcement learner was run for 4×10^6 steps. The maximum reward found at each iteration is shown as the Batch Max, while the
most likely collision found so far is shown as the Cumulative Max. The reinforcement learner quickly
converges to finding solutions in the range of −300 to −250, and the best solution found was −236, found after 2×10^5 steps. Running 4×10^6 steps required 7.3 minutes. A random-search baseline
was unable to find a single crash. The toolbox found a likely failure quickly and efficiently, despite
the space of possible actions being 6-dimensional and continuous.

Figure 8.5: Reward of the most likely failure found at each iteration, with (a) the full reward range and (b) zoomed to better show the variance in the DRL reinforcement learner results (legend: DRL Batch Max, DRL Cumulative Max, Random Batch Max, Random Cumulative Max). The Batch Max is the maximum per-iteration summed Mahalanobis distance. The Cumulative Max is the best Batch Max up to that iteration. The reinforcement learner finds the best solution by iteration 6 out of 80.
The average return of each iteration is shown in fig. 8.6. Again, both the per-iteration and
cumulative maximum averages are shown. The reinforcement learner seems to have converged by
4×10^5 steps to a range of −5.7×10^4 to −4.4×10^4, although there are slight improvements later. These slight improvements did not correspond with improvements in the most likely failure found. At 2×10^5 steps, the average reward is −2.2×10^5. The average reward being so large in magnitude indicates that
some of the trajectories in each batch were not leading to collisions at all. The policy sometimes
fails to find a collision because the DRL reinforcement learner samples actions stochastically from a
Gaussian distribution. Because the random-search baseline never found a collision, the reward was
consistently around −2.1×10^5.

Figure 8.6: The average return at each iteration (legend: DRL Batch Average, DRL Cumulative Max Average, Random Batch Average, Random Cumulative Max Average). The Batch Average is the average return from each trajectory in an iteration, while the Cumulative Max Average is the maximum Batch Average so far. The reinforcement learner is mostly converged by iteration 10, although there are slight improvements later. The large returns indicate that not every trajectory is ending in a collision.
8.3.3 Automatic Transmission
A common benchmark for falsification tools is a four-speed automatic transmission controller mod-
eled in Simulink [87]. The system under test (SUT) is the automatic transmission model also used
in the ARCH-COMP falsification competition [88]. The model takes three real-valued inputs: time
0 ≤ t ≤ 30, throttle percent 0 ≤ τ ≤ 100, and brake torque 0 ≤ β ≤ 325. The model outputs
two continuous states, speed v and RPM ω, and one discrete state g, the current gear. Failures of
the SUT are defined by violations of signal temporal logic (STL) specifications that encode system
requirements. The STL requirements commonly used were originally proposed by Hoxha, Abbas,
and Fainekos [87], and the parameters were selected for their difficulty by Ernst, Arcaini, Donze,
et al. [88]. We chose the following benchmark STL formulas to highlight behaviors of the different
reinforcement learners:
AT1: □[0,20] v < 120 (Speed is always below 120 between 0 and 20 seconds)
AT2: □[0,10] ω < 4750 (RPM is always below 4750 between 0 and 10 seconds)
The reinforcement learners interact with the simulator by selecting up to four input actions, each
a vector [t, τ, β]. The output vector [v, ω, g] for each simulator time step and the STL specifications
are used to calculate a robustness metric [4]. Robustness is a measure of the degree to which the
specification was satisfied. We use the negative robustness in our reward function to guide the search
towards failures (i.e., specification violations), computed using the stlcg Python package [89].
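As a sketch of the robustness computation for AT1, without relying on the stlcg package, the discrete-time semantics of □[0,20] v < 120 reduce to a minimum over the window:

# A minimal sketch of STL robustness for AT1 (always v < 120 on [0, 20] s),
# assuming a speed trace v sampled every dt seconds; negative robustness means
# the specification is violated, and its negation serves as the reward.
import numpy as np

def at1_robustness(v, dt, t_max=20.0, v_max=120.0):
    window = v[: int(t_max / dt) + 1]
    return float(np.min(v_max - window))

v = np.array([50.0, 80.0, 110.0, 125.0])  # placeholder trace at dt = 5 s
reward = -at1_robustness(v, dt=5.0)       # positive once the spec is violated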
Two reinforcement learners were compared in this experiment: MCTS and DRL. Both reinforce-
ment learners are run in the black-box simulator mode. MCTS was trained with k = 1, α = 0.7,
and c = 10 [58]. For the DRL reinforcement learner, we used an LSTM network [60] with 64 hidden
units and a step size of 1.0. The neural network was trained using the PPO algorithm with a batch
size of 40 and a discount factor γ = 1.0. A random search reinforcement learner was used as the
baseline. All reinforcement learners were trained using 1000 simulation steps. Each reinforcement
learner was run for 10 trials using different random seeds, and the results were averaged. The hyperparameters used in this experiment were found empirically, and all reinforcement learners used a closed-loop setting.

Figure 8.7: The results of the automatic transmission case study, averaged over 10 trials, for (a) the AT1 benchmark, □[0,20] v < 120, where only DRL finds failures, and (b) the AT2 benchmark, □[0,10] ω < 4750, where all find failures and DRL is most efficient (legend: PPO, MCTS, Random Search). The DRL reinforcement learner is able to outperform both MCTS and a random search baseline.
The best trajectory reward found at each trial was recorded every simulation step. The average of
the best trajectory reward over 10 trials is shown in fig. 8.7. For the more difficult AT1 benchmark,
only the DRL reinforcement learner was able to find failure trajectories within the allotted number
of steps, shown in fig. 8.7a. For the less difficult AT2 benchmark, all three reinforcement learners
were able to find failure trajectories, shown in fig. 8.7b, but MCTS found failures in only 9 out of
10 trials. The results highlight that the choice of reinforcement learner should depend on the
underlying problem: MCTS tends to work well in long-horizon problems with large state and action
spaces. In this experiment, the DRL reinforcement learner used fewer simulation steps than the
other reinforcement learners to find failure trajectories in all trials. The DRL reinforcement learner
found the first failure at iteration 226.0±157.7 and 271.9±269.8 for AT1 and AT2, respectively. The
MCTS reinforcement learner found the first failure at iteration 338.3± 314.9 for AT2, and random
search found the first failure at iteration 673.8± 547.3.
8.4 Discussion
This chapter introduced the AST Toolbox for validating autonomous systems. This open-source
package simplifies applying AST to autonomous system safety validation. The toolbox provides
wrappers to interface between simulators and the provided reinforcement learners, while also sup-
porting the implementation of new reinforcement learners. The AST approach uses reinforcement
learning to find the most likely failure of a system by treating the simulator as an agent acting in an
MDP. While the solutions are not guaranteed to be optimal, the method is tractable for an emerg-
ing class of safety-critical autonomous systems that act in high-dimensional spaces and experience
extremely low-probability failures. While other open-source software packages support falsification,
the AST Toolbox is distinct in its ability to find the most likely failure while treating simulators as a
black box. The aim of the AST Toolbox is to help make autonomous systems more reliable through
robust testing.
Chapter 9
Summary and Future Work
Autonomous systems are rapidly becoming ubiquitous in society, even in safety-critical applica-
tions. Going forward, validation through real-world testing alone will be infeasible for numerous
autonomous systems that must safely interact with people. It is critical that system designers have
access to a simulation-based validation method that produces robust results without requiring an
infeasible amount of compute to run. This thesis is a step towards developing approximate methods
of validating autonomous systems in simulation. This chapter summarizes the approach taken in
this thesis, reviews the contributions made, and proposes several areas of future research.
9.1 Summary
As autonomous systems continue to spread to safety-critical applications, it is essential that designers
have access to simulation-based validation methods. Validation is a key step towards system safety,
but, unfortunately, validation through real-world testing will often be infeasible. For example,
it would take 5 billion miles driven to ensure that a fleet of autonomous vehicles was as safe as
commercial airplanes, and that is assuming those 5 billion miles are perfectly representative of their
eventual application areas. Validation must, at least in part, be done through simulation.
However, validation through simulation has its own challenges. In order for the validation re-
sults to be useful, the simulation must be a high-fidelity representation of the real-world, and the
simulation rollouts must be able to capture the full variance of real-world scenarios. High-fidelity
simulators are slow and expensive to run, so we are constrained in the number of simulations we can
run. However, allowing the simulation scenarios to capture the variance of the real world results in
a massive space of possible simulation rollouts that must be searched for failures. An approach is
needed that can approximately search the simulation space for failures in an efficient enough manner
to be tractable but without sacrificing the ability to find failures.
Adaptive stress testing (AST) is one such possible method. AST formulates the problem of
finding the most likely failure as a Markov decision process, which can then be solved with standard
reinforcement learning approaches (see chapter 2). AST puts no constraints on actor behavior, which
allows it to capture the full variance of the real world. The resulting space of possible simulation
rollouts is massive, and, therefore, the reward function is designed to prioritize likely failures, which
allows AST to find likely failures through optimization. AST is no silver bullet, though, and has its
own issues that must be addressed.
This thesis contributes solutions to a number of AST’s limitations. Many autonomous systems
act in high-dimensional spaces, which leads to an exponential explosion in possible simulation roll-
outs. The scalability issue is compounded when initial conditions must be searched over as well, so
AST must be able to generalize across initial conditions after a single training run. The massive
space of possible simulation rollouts leads to high-variance results, so AST needs a way to produce
more robust results. When validating a system in simulation, the results are dependent on the simu-
lator being a highly accurate representation of the real world, otherwise failures found may not exist
in the real system, and failures in the real system may not exist in simulation. This dependency on
simulator accuracy necessitates the use of high-fidelity simulators, and thus AST must be computa-
tionally affordable enough to be run in high-fidelity. Finally, in order for designers to actually make
use of AST, the method must be made open-source and easily applicable.
9.2 Contributions
The first part of this thesis provides a general introduction to the challenges of validation in simu-
lation (chapter 1), as well as an overview of AST (chapter 2). The remainder of this thesis makes
the following contributions across three different categories:
• Scalable Reinforcement Learners
– A deep reinforcement learner for scalable AST: Many autonomous systems act
in continuous, high-dimensional state and action spaces, which leads to an exponential
explosion in the size of the simulation space that must be searched for failures. AST
must be able to scale to these massive spaces in a way that allows validation to still be
tractable. AST previously used a reinforcement learner based on Monte Carlo tree search
with double progressive widening. While the use of upper confidence bounds and double
progressive widening improves the results of MCTS on continuous and high-dimensional
spaces, the tree size can still quickly explode. In contrast, deep reinforcement learning
has already been shown to perform well on tasks with continuous or high-dimensional
state or action spaces. Chapter 3 presents a new reinforcement learner that uses deep
reinforcement learning to improve scalability to large simulation spaces.
– A go-explore reinforcement learner for AST without reward heuristics: Pre-
vious work has relied on the presence of heuristic rewards to speed up validation results.
Heuristic rewards are domain-specific rewards that are crafted through expert knowledge
or real world data to help guide a reinforcement learning agent to goals that may be hard
to find. However, it may not always be desirable or even feasible for AST to have access
to heuristic rewards, and without them, the problem becomes a hard-exploration domain.
Go-explore recently set new records on common hard-exploration benchmarks by taking
a two-phase approach: 1) A tree search that uses heuristic biases and deterministic resets
to efficiently find goal states and 2) A robustification phase that uses the backward algo-
rithm to produce a deep neural network policy that is robust to stochastic perturbations.
Chapter 5 presents a new reinforcement learner based on phase 1 of go-explore. This
tree-search reinforcement learner is shown to find failures without the use of heuristic
rewards on long-horizon problems where DRL and MCTS approaches both fail to find
failures.
• General AST Utility
– The ability to generalize across initial conditions in a single run of AST: In
real-world applications, system designers are often interested in scenario classes, where a
scenario class is a space of similar scenarios defined by parameter ranges. Specific scenar-
ios can then be instantiated by selecting concrete parameter values from the parameter
ranges. To perform validation over a scenario class with AST, a system designer would
have to generate a number of concrete scenario instantiations and run AST for each one.
For even a simple scenario class, a basic grid search could result in hundreds or even
thousands of different concrete scenario instantiations, and it would be infeasible to run
AST thousands of times for each scenario class when performing validation over a suite
of tests. However, concrete scenarios within a scenario class often have significant simi-
larities to each other. It should therefore be possible to run AST a single time and learn
to generalize across the entire space of initial conditions of the scenario class. Chapter 4
presents an AST architecture and training approach that allows AST to perform valida-
tion of a scenario class in a single run. We show that this approach is able to produce
better results with far less computational expense when compared to running AST from
a series of concrete scenario instantiations.
– Robust AST results through the use of the backward algorithm: Chapter 2
shows that the trajectory that maximizes the reward function in AST is the most likely
failure. However, in practice AST uses approximate reinforcement learning methods, so
the reinforcement learner may converge to a local optimum that is not the global opti-
mum. Because autonomous systems act in continuous and high-dimensional spaces, there
is a massive spread of local optima. Unfortunately, this can lead to large amounts of
variance in AST’s results, an inconsistency that can lead to unsafe test results. Chap-
ter 6 presents a robustification phase of AST based on go-explore’s use of the backward
algorithm in its phase 2. The robustification phase is able to significantly improve the
likelihood of failures found by the reinforcement learner, regardless of which algorithm
the reinforcement learner is using.
• Real-world Applicability
– High-fidelity tractability through learning in and transferring from low-fidelity:
This thesis presents a number of advancements in AST that improve the approach’s scala-
bility. However, AST still can require many thousands of iterations to converge to a likely
failure. Running this many simulation rollouts may be infeasible in high-fidelity simula-
tors, which are simulators with features such as advanced dynamics models, perception
from graphics, and autonomous system software-in-the-loop simulation. These features
are essential for simulation accuracy, but they can make high-fidelity simulators slow and
expensive to run. In order to make AST tractable in high-fidelity, chapter 7 presents a
method of first finding failures in low-fidelity simulation and then transferring the fail-
ures to high-fidelity. We show that this approach reduces the number of high-fidelity
simulation steps needed to find a failure, sometimes dramatically so.
– An open-source and easily applicable software toolbox: While this thesis rep-
resents a significant advancement in the ability to validate autonomous systems using
AST, such advancements are only useful if system designers can actually apply AST to
validating their autonomous systems. For the good of society, it is essential that safety
become a collaborative effort among autonomous system designers, not a competitive one.
Chapter 8 presents the AST Toolbox, an open-source software toolbox that enables sys-
tem designers to easily apply AST to their own system. Through a series of wrappers,
designers are able to connect the toolbox to their own simulator and system, regardless
of the implementation language. In addition, the Toolbox is connected with garage to
provide easy access to reinforcement learning methods. The Toolbox is a step towards
safety as a collaborative feature of autonomous systems.
9.3 Further Work
This thesis contributes towards the validation of autonomous systems in simulation by making
significant improvements to Adaptive Stress Testing. However, the work presented here still has
limitations. Much of the work was tested on a toy simulator from the AST Toolbox. While tests on
high-fidelity simulators yielded promising results, more work must be done to prove the applicability
of AST to high-fidelity validation. The models used for the actors within scenarios were often very
basic, resulting in some unrealistic behaviors. Learning models of actors from data could yield failures
that are more realistic, and therefore usually more useful for system designers trying to make their
system safer. Similarly, the most common system tested here was a modified version of a lane-
following model. While AST has been applied to complex real-world systems in the past [90]–[92], it
would be instructive to apply AST to validating a real autonomous vehicle. Addressing these issues
in future work would require significant time, money, and access, but could also yield significant
benefits. In addition to these practical considerations, there are several directions of theoretical
work that seem promising.
9.3.1 Full Generalization
Chapter 4 shows that AST can generalize across a scenario class. The theory underlying this
improvement is that similar scenarios share many commonalities that only need to be learned once
when generalizing, instead of having to be relearned for every concrete scenario. For example,
perhaps an object in a scenario creates an occlusion that can cause a collision. If AST learns to use the occlusion from one set of initial conditions, it should not have to relearn from scratch to use
the occlusion from a different set of initial conditions. However, this argument can be taken a step
further.
In truth, there are many commonalities across all scenarios. Lessons on how to manipulate
sensor noise, occlusions, actor actions, and other disturbances are instructive across all validation
scenarios. AST should be able to use data from past validation tests to make running the current
test much faster. This sort of generalization is in the realm of meta-learning, where an agent learns
how to learn. By incorporating meta-learning into AST, it is possible that we could develop an
intelligent testing agent that can use life-long learning to continue to quickly find failures across
different systems and scenarios.
9.3.2 Fault Injection in Vision Systems
Autonomous vehicles are equipped with a range of sensors, with LIDAR, RADAR, and cameras
being some of the most popular. These sensors provide the vehicle with computer vision, but that
vision is not perfect. Different sensors have different failure modes, such as a camera suffering from
lens flare when the sun is in a bad position, or a lidar beam not bouncing back from an object
because it was absorbed by something black. These failures can create uncertainty in the vehicle’s
state estimation, which can result in collisions. AST could find failures by injecting these faults into vision systems.
One advantage of this approach is that it would be straightforward to learn from data the
probability distribution over different types of failures. Some systems have failures that would be
easy to inject, such as a lidar beam not reflecting off an object. On the other hand, camera systems
could be much trickier to validate, though there has already been some promising work in this
area [44]. One possible approach is to use generative adversarial networks (GANs), which have
shown promising performance across a range of image tasks. For example, one common use of
GANs is style transfer tasks, where certain features are extracted from a base image and applied in
a realistic way to a target image. AST could use style transfer GANs to take camera images from
the real-world and inject faults, like adding lens flare, or increase scenario difficulty, like changing a
clear day to a snowy one.
9.3.3 Interpretability
Finding failures is not useful if system designers cannot understand those failures well enough to
address them. Thus, a key component of any validation method is the interpretability of the failures,
or how well a human can understand them. There has already been some exciting work done on
methods of automatically classifying or categorizing failures, work that is compatible with AST [93],
[94]. Especially promising is the potential for the classification system to feed back into the AST
method, which is to say that the interpretability work would not just classify failures found by AST but would actually allow designers to specify, in human-readable formats, which types of failures they are interested in, and AST would restrict its search to such failures, perhaps using reward
augmentation [47]. Further work could build on the existing frameworks, but the most immediate
step should be to build the interpretability features into the AST Toolbox.
Bibliography
[1] N. Kalra and S. M. Paddock, “Driving to safety: How many miles of driving would it take
to demonstrate autonomous vehicle reliability?” Transportation Research Part A: Policy and
Practice, vol. 94, no. Supplement C, pp. 182–193, 2016.
[2] P. Koopman, “The heavy tail safety ceiling,” in Automated and Connected Vehicle Systems
Testing Symposium, 2018.
[3] A. Corso, R. J. Moss, M. Koren, R. Lee, and M. J. Kochenderfer, “A survey of algorithms for
black-box safety validation,” arXiv preprint arXiv:2005.02979, 2020.
[4] G. E. Fainekos and G. J. Pappas, “Robustness of temporal logic specifications for continuous-
time signals,” Theoretical Computer Science, vol. 410, no. 42, pp. 4262–4291, 2009.
[5] A. Donze, “Breach, a toolbox for verification and parameter synthesis of hybrid systems,” in
Computer Aided Verification, Springer, 2010, pp. 167–170.
[6] H. Yang, “Dynamic programming algorithm for computing temporal logic robustness,” M.S.
thesis, Arizona State University, 2013.
[7] T. Dreossi, T. Dang, A. Donze, J. Kapinski, X. Jin, and J. V. Deshmukh, “Efficient guiding
strategies for testing of temporal properties of hybrid systems,” Springer, 2015, pp. 127–142.
[8] G. Ernst, S. Sedwards, Z. Zhang, and I. Hasuo, “Fast falsification of hybrid systems using
probabilistically adaptive input,” in International Conference on Quantitative Evaluation of
Systems (QEST), Springer, 2019, pp. 165–181.
[9] Y. V. Pant, H. Abbas, and R. Mangharam, “Smooth operator: Control using the smooth
robustness of temporal logic,” in IEEE Conference on Control Technology and Applications
(CCTA), IEEE, 2017, pp. 1235–1240.
[10] T. Akazaki, Y. Kumazawa, and I. Hasuo, “Causality-aided falsification,” Electronic Proceedings
in Theoretical Computer Science, vol. 257, 2017.
[11] H. Abbas, M. O’Kelly, and R. Mangharam, “Relaxed decidability and the robust semantics of
metric temporal logic,” 2017, pp. 217–225.
[12] H. Abbas, G. Fainekos, S. Sankaranarayanan, F. Ivančić, and A. Gupta, “Probabilistic tempo-
ral logic falsification of cyber-physical systems,” ACM Transactions on Embedded Computing
Systems (TECS), vol. 12, no. 2s, pp. 1–30, 2013.
[13] A. Aerts, B. T. Minh, M. R. Mousavi, and M. A. Reniers, “Temporal logic falsification of cyber-
physical systems: An input-signal-space optimization approach,” in IEEE International Con-
ference on Software Testing, Verification and Validation Workshops (ICSTW), IEEE, 2018,
pp. 214–223.
[14] Q. Zhao, B. H. Krogh, and P. Hubbard, “Generating test inputs for embedded control systems,”
IEEE Control Systems Magazine, vol. 23, no. 4, pp. 49–57, 2003.
[15] X. Zou, R. Alexander, and J. McDermid, “Safety validation of sense and avoid algorithms
using simulation and evolutionary search,” in International Conference on Computer Safety,
Reliability, and Security (SafeComp), Springer, 2014, pp. 33–48.
[16] S. Silvetti, A. Policriti, and L. Bortolussi, “An active learning approach to the falsification of
black box cyber-physical systems,” in International Conference on Integrated Formal Methods
(iFM), Springer, 2017, pp. 3–17.
[17] J. Deshmukh, M. Horvat, X. Jin, R. Majumdar, and V. S. Prabhu, “Testing cyber-physical
systems through Bayesian optimization,” ACM Transactions on Embedded Computing Systems,
vol. 16, no. 5s, Sep. 2017. doi: 10.1145/3126521.
[18] G. E. Mullins, P. G. Stankiewicz, R. C. Hawthorne, and S. K. Gupta, “Adaptive generation of
challenging scenarios for testing and evaluation of autonomous vehicles,” Journal of Systems
and Software, vol. 137, pp. 197–215, 2018.
[19] Y. Abeysirigoonawardena, F. Shkurti, and G. Dudek, “Generating adversarial driving scenarios
in high-fidelity simulators,” in IEEE International Conference on Robotics and Automation
(ICRA), IEEE, 2019, pp. 8271–8277.
[20] X. Yang, M. Egorov, A. Evans, S. Munn, and P. Wei, “Stress testing of UAS traffic management
decision making systems,” in AIAA AVIATION Forum, 2020, p. 2868.
[21] G. E. Fainekos and K. C. Giannakoglou, “Inverse design of airfoils based on a novel formulation
of the ant colony optimization method,” Inverse Problems in Engineering, vol. 11, no. 1, pp. 21–
38, 2003.
[22] J. M. Esposito, J. Kim, and V. Kumar, “Adaptive RRTs for validating hybrid robotic control
systems,” in Algorithmic Foundations of Robotics VI, Springer, 2004, pp. 107–121.
[23] J. Kim, J. M. Esposito, and V. Kumar, “An RRT-based algorithm for testing and validating
multi-robot controllers,” Moore School of Electrical Engineering GRASP Lab, Tech. Rep.,
2005.
[24] M. S. Branicky, M. M. Curtiss, J. Levine, and S. Morgan, “Sampling-based planning, control
and verification of hybrid systems,” IEEE Proceedings - Control Theory and Applications,
vol. 153, no. 5, pp. 575–590, 2006.
[25] T. Dang, A. Donzé, O. Maler, and N. Shalev, “Sensitive state-space exploration,” in IEEE
Conference on Decision and Control (CDC), IEEE, 2008, pp. 4049–4054.
[26] T. Nahhal and T. Dang, “Test coverage for continuous and hybrid systems,” in Computer
Aided Verification, Springer, 2007, pp. 449–462.
[27] E. Plaku, L. E. Kavraki, and M. Y. Vardi, “Hybrid systems: From verification to falsification
by combining motion planning and discrete search,” Formal Methods in System Design, vol. 34,
no. 2, pp. 157–182, 2009.
[28] C. E. Tuncali and G. Fainekos, “Rapidly-exploring random trees for testing automated vehi-
cles,” in IEEE Intelligent Transportation Systems Conference (ITSC), 2019, pp. 661–666.
[29] M. Koschi, C. Pek, S. Maierhofer, and M. Althoff, “Computationally efficient safety falsifi-
cation of adaptive cruise control systems,” in IEEE International Conference on Intelligent
Transportation Systems (ITSC), IEEE, 2019, pp. 2879–2886.
[30] A. Zutshi, S. Sankaranarayanan, J. V. Deshmukh, and J. Kapinski, “A trajectory splicing ap-
proach to concretizing counterexamples for hybrid systems,” in IEEE Conference on Decision
and Control (CDC), IEEE, 2013, pp. 3918–3925.
[31] A. Zutshi, J. V. Deshmukh, S. Sankaranarayanan, and J. Kapinski, “Multiple shooting, CEGAR-
based falsification for hybrid systems,” in International Conference on Embedded Software
(EMSOFT), 2014, pp. 1–10.
[32] Y. Kim and M. J. Kochenderfer, “Improving aircraft collision risk estimation using the cross-
entropy method,” Journal of Air Transportation, vol. 24, no. 2, pp. 55–62, 2016.
[33] M. O’Kelly, A. Sinha, H. Namkoong, R. Tedrake, and J. C. Duchi, “Scalable end-to-end au-
tonomous vehicle testing via rare-event simulation,” in Advances in Neural Information Pro-
cessing Systems (NeurIPS), 2018, pp. 9827–9838.
[34] D. Zhao, H. Lam, H. Peng, S. Bao, D. J. LeBlanc, K. Nobukawa, and C. S. Pan, “Accelerated
evaluation of automated vehicles safety in lane-change scenarios based on importance sampling
techniques,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 3, pp. 595–
607, 2016.
[35] Z. Huang, H. Lam, D. J. LeBlanc, and D. Zhao, “Accelerated evaluation of automated vehicles
using piecewise mixture models,” IEEE Transactions on Intelligent Transportation Systems,
vol. 19, no. 9, pp. 2845–2855, 2017.
[36] S. Sankaranarayanan and G. Fainekos, “Falsification of temporal properties of hybrid systems
using the cross-entropy method,” in Hybrid Systems: Computation and Control (HSCC), 2012,
pp. 125–134.
[37] J. Norden, M. O’Kelly, and A. Sinha, “Efficient black-box assessment of autonomous vehicle
safety,” arXiv preprint arXiv:1912.03618, 2019.
[38] A. Corso, R. Lee, and M. J. Kochenderfer, “Scalable autonomous vehicle safety validation
through dynamic programming and scene decomposition,” in IEEE International Conference
on Intelligent Transportation Systems (ITSC), IEEE, 2020.
[39] J. P. Chryssanthacopoulos, M. J. Kochenderfer, and R. E. Williams, “Improved Monte Carlo
sampling for conflict probability estimation,” in AIAA Non-Deterministic Approaches Con-
ference, Orlando, Florida, 2010. doi: 10.2514/6.2010-3012.
[40] R. Lee, M. J. Kochenderfer, O. J. Mengshoel, G. P. Brat, and M. P. Owen, “Adaptive
stress testing of airborne collision avoidance systems,” in Digital Avionics Systems Confer-
ence (DASC), 2015.
[41] Z. Zhang, G. Ernst, S. Sedwards, P. Arcaini, and I. Hasuo, “Two-layered falsification of hybrid
systems guided by Monte Carlo tree search,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2894–2905, 2018.
[42] M. Wicker, X. Huang, and M. Kwiatkowska, “Feature-guided black-box safety testing of deep
neural networks,” in International Conference on Tools and Algorithms for the Construction
and Analysis of Systems (TACAS), Springer, 2018, pp. 408–426.
[43] R. Delmas, T. Loquen, J. Boada-Bauxell, and M. Carton, “An evaluation of Monte-Carlo tree
search for property falsification on hybrid flight control laws,” in International Workshop on
Numerical Software Verification, Springer, 2019, pp. 45–59.
[44] K. D. Julian, R. Lee, and M. J. Kochenderfer, “Validation of image-based neural network
controllers through adaptive stress testing,” in IEEE International Conference on Intelligent
Transportation Systems (ITSC), 2020, pp. 1–7.
[45] T. Akazaki, S. Liu, Y. Yamagata, Y. Duan, and J. Hao, “Falsification of cyber-physical systems
using deep reinforcement learning,” in International Symposium on Formal Methods, Springer,
2018, pp. 456–465.
[46] M. Koren, S. Alsaif, R. Lee, and M. J. Kochenderfer, “Adaptive stress testing for autonomous
vehicles,” in IEEE Intelligent Vehicles Symposium, 2018.
[47] A. Corso, P. Du, K. Driggs-Campbell, and M. J. Kochenderfer, “Adaptive stress testing with
reward augmentation for autonomous vehicle validation,” in IEEE International Conference
on Intelligent Transportation Systems (ITSC), 2019, pp. 163–168.
[48] M. Koren and M. J. Kochenderfer, “Efficient autonomy validation in simulation with adap-
tive stress testing,” in IEEE International Conference on Intelligent Transportation Systems
(ITSC), 2019, pp. 4178–4183.
[49] V. Behzadan and A. Munir, “Adversarial reinforcement learning framework for benchmark-
ing collision avoidance mechanisms in autonomous vehicles,” IEEE Intelligent Transportation
Systems Magazine, 2019.
[50] S. Kuutti, S. Fallah, and R. Bowden, “Training adversarial agents to exploit weaknesses in
deep control policies,” in IEEE International Conference on Robotics and Automation (ICRA),
IEEE, 2020, pp. 108–114.
[51] M. Koren and M. J. Kochenderfer, “Adaptive stress testing without domain heuristics using
go-explore,” in IEEE International Conference on Intelligent Transportation Systems (ITSC),
IEEE, 2020.
[52] X. Qin, N. Arechiga, A. Best, and J. Deshmukh, “Automatic testing and falsification with
dynamically constrained reinforcement learning,” arXiv preprint arXiv:1910.13645, 2019.
[53] R. Lee, O. J. Mengshoel, A. Saksena, R. W. Gardner, D. Genin, J. Silbermann, M. Owen, and
M. J. Kochenderfer, “Adaptive stress testing: Finding likely failure events with reinforcement
learning,” Journal of Artificial Intelligence Research, vol. 69, pp. 1165–1201, 2020.
[54] A. Ecoffet, J. Huizinga, J. Lehman, K. O. Stanley, and J. Clune, “Go-explore: A new approach
for hard-exploration problems,” arXiv preprint arXiv:1901.10995, 2019.
[55] M. J. Kochenderfer, Decision Making Under Uncertainty. MIT Press, 2015, ch. Model Uncer-
tainty, pp. 113–132.
[56] L. Kocsis and C. Szepesvári, “Bandit based Monte Carlo planning,” in European Conference
on Machine Learning (ECML), 2006.
[57] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S.
Tavener, D. Perez, S. Samothrakis, and S. Colton, “A survey of Monte Carlo tree search
methods,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 1,
pp. 1–43, 2012.
[58] A. Couëtoux, J.-B. Hoock, N. Sokolovska, O. Teytaud, and N. Bonnard, “Continuous upper
confidence trees,” in Learning and Intelligent Optimization (LION), Springer, 2011, pp. 433–
445.
[59] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[60] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9,
no. 8, pp. 1735–1780, 1997.
[61] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align
and translate,” in International Conference on Learning Representations (ICLR), 2015.
[62] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimiza-
tion algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[63] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continu-
ous control using generalized advantage estimation,” in International Conference on Learning
Representations (ICLR), 2016.
[64] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory
and application to reward shaping,” in International Conference on Machine Learning (ICML),
vol. 99, 1999, pp. 278–287.
[65] Wikimedia Commons, Schematic drawing of an inverted pendulum on a cart, https://commons.wikimedia.org/wiki/File:Cart-pendulum.svg, 2012.
[66] A. G. Barto, R. S. Sutton, and C. W. Anderson, “Neuronlike adaptive elements that can solve
difficult learning control problems,” IEEE Transactions on Systems, Man, and Cybernetics,
no. 5, pp. 834–846, 1983.
[67] M. Koren, X. Ma, A. Corso, R. J. Moss, P. Du, K. Driggs-Campbell, and M. J. Kochenderfer,
“AST Toolbox: an adaptive stress testing framework for validation of autonomous systems,”
Journal of Open Source Software, 2021, In Review.
[68] M. Treiber, A. Hennecke, and D. Helbing, “Congested traffic states in empirical observations
and microscopic simulations,” Physical Review E, vol. 62, no. 2, pp. 1805–1824, Aug. 2000.
[69] P. C. Mahalanobis, “On the generalised distance in statistics,” Proceedings of the National
Institute of Sciences of India, vol. 2, no. 1, pp. 49–55, 1936.
[70] M. J. Kochenderfer, J. E. Holland, and J. P. Chryssanthacopoulos, “Next-generation airborne
collision avoidance system,” Massachusetts Institute of Technology, Lincoln Laboratory,
Lexington, MA, Tech. Rep., 2012.
[71] J. Kuchar and A. C. Drumm, “The traffic alert and collision avoidance system,” Lincoln
Laboratory Journal, vol. 16, no. 2, p. 277, 2007.
[72] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in International Conference
on Learning Representations (ICLR), 2014.
[73] R. Lee, O. Mengshoel, A. Saksena, R. Gardner, D. Genin, J. Brush, and M. J. Kochenderfer,
“Differential adaptive stress testing of airborne collision avoidance systems,” in AIAA Modeling
and Simulation Technologies Conference, 2018, p. 1923.
[74] California Department of Transportation, California manual on uniform traffic control devices,
2014, Revision 2.
[75] J. Sklansky, “Optimizing the dynamic parameters of a track-while-scan system,” RCA Review,
vol. 18, no. 2, pp. 163–185, Jun. 1957.
[76] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep rein-
forcement learning for continuous control,” in International Conference on Machine Learning
(ICML), 2016.
[77] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “On a formal model of safe and scalable
self-driving cars,” arXiv preprint arXiv:1708.06374, 2017.
[78] T. Peng, Uber AI beats Montezuma’s Revenge (video game), https://medium.com/syncedreview/uber-ai-beats-montezumas-revenge-video-game-dee33417a56e, 2018.
[79] T. Salimans and R. Chen, “Learning Montezuma’s Revenge from a single demonstration,”
arXiv preprint arXiv:1812.03381, 2018.
[80] F. A. Gers and J. Schmidhuber, “Recurrent nets that time and count,” in IEEE International
Joint Conference on Neural Networks, 2000, pp. 189–194.
[81] M. Itkina, K. Driggs-Campbell, and M. J. Kochenderfer, “Dynamic environment prediction in
urban scenes using recurrent representation learning,” in IEEE International Conference on
Intelligent Transportation Systems (ITSC), 2019.
[82] D. Nuss, S. Reuter, M. Thom, T. Yuan, G. Krehl, M. Maile, A. Gern, and K. Dietmayer, “A
random finite set approach for dynamic occupancy grid maps with real-time application,” The
International Journal of Robotics Research, vol. 37, no. 8, pp. 841–866, 2018.
[83] S. Hoermann, M. Bach, and K. Dietmayer, “Dynamic occupancy grid prediction for urban
autonomous driving: A deep learning approach with fully automatic labeling,” in IEEE Inter-
national Conference on Robotics and Automation (ICRA), 2018.
[84] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba,
OpenAI Gym, arXiv preprint arXiv:1606.01540, 2016.
[85] The garage contributors, Garage: A toolkit for reproducible reinforcement learning research,
https://github.com/rlworkgroup/garage, 2019.
[86] Y. Annapureddy, C. Liu, G. Fainekos, and S. Sankaranarayanan, “S-TaLiRo: A tool for tem-
poral logic falsification for hybrid systems,” in International Conference on Tools and Al-
gorithms for the Construction and Analysis of Systems (TACAS), Springer, 2011, pp. 254–257.
[87] B. Hoxha, H. Abbas, and G. Fainekos, “Benchmarks for temporal logic requirements for auto-
motive systems,” in ARCH14-15. 1st and 2nd International Workshop on Applied veRification
for Continuous and Hybrid Systems, ser. EPiC Series in Computing, vol. 34, 2015, pp. 25–30.
[88] G. Ernst, P. Arcaini, A. Donzé, G. Fainekos, L. Mathesen, G. Pedrielli, S. Yaghoubi, Y. Ya-
magata, and Z. Zhang, “ARCH-COMP 2019 category report: Falsification,” in International
Workshop on Applied Verification of Continuous and Hybrid Systems, ser. EPiC Series in
Computing, vol. 61, 2019, pp. 129–140.
[89] K. Leung, N. Arechiga, and M. Pavone, “Back-propagation through signal temporal logic spec-
ifications: Infusing logical structure into gradient-based methods,” in Workshop on Algorithmic
Foundations of Robotics, 2020.
[90] R. J. Moss, R. Lee, N. Visser, J. Hochwarth, J. G. Lopez, and M. J. Kochenderfer, “Adaptive
stress testing of trajectory predictions in flight management systems,” in Digital Avionics
Systems Conference (DASC), 2020, pp. 1–10.
[91] R. Lee, J. Puig-Navarro, A. K. Agogino, D. Giannakopoulou, O. J. Mengshoel, M. J. Kochen-
derfer, and B. D. Allen, “Adaptive stress testing of trajectory planning systems,” in AIAA
Scitech, 2019, p. 1454.
[92] R. Lee, O. J. Mengshoel, and M. J. Kochenderfer, “Adaptive stress testing of safety-critical
systems,” in Safe, Autonomous and Intelligent Vehicles, Springer, 2019, pp. 77–95.
[93] A. Corso and M. J. Kochenderfer, “Interpretable safety validation for autonomous vehicles,” in
IEEE International Conference on Intelligent Transportation Systems (ITSC), 2020, pp. 1–6.
[94] R. Lee, M. J. Kochenderfer, O. J. Mengshoel, and J. Silbermann, “Interpretable categorization
of heterogeneous time series data,” in International Conference on Data Mining, SIAM, 2018,
pp. 216–224.