

Risk Averse Robust Adversarial Reinforcement Learning

Extended Abstract

Xinlei Pan, UC Berkeley, Berkeley, CA

[email protected]

John Canny, UC Berkeley, Berkeley, CA

[email protected]

ABSTRACT

Robot controllers need to deal with a variety of uncertainty and still perform well. Reinforcement learning and deep networks often suffer from brittleness, e.g. difficulty in transferring from simulation to real environments. Recently, robust adversarial reinforcement learning (RARL) was developed, which allows the application of random and systematic perturbations by an adversary. A limitation of previous work is that only the expected control objective is optimized, i.e. there is no explicit modeling or optimization of risk. Thus it is not possible to control the probability of catastrophic events. In this paper we introduce risk-averse robust adversarial RL, using a risk-averse main agent and a risk-seeking adversary. We show through experiments in vehicle control that a risk-averse agent is better equipped to handle a risk-seeking adversary, and incurs fewer catastrophic outcomes.

KEYWORDS

Risk averse reinforcement learning, Adversarial learning, Robust reinforcement learning

ACM Reference Format:
Xinlei Pan and John Canny. 2018. Risk Averse Robust Adversarial Reinforcement Learning: Extended Abstract. In Proceedings of ACM Computer Science in Cars Symposium (CSCS'18). ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Reinforcement learning has demonstrated remarkable performance on a variety of sequential decision making tasks such as Go [6] and Atari games [2]. In order to use reinforcement-learning-trained agents in real world applications such as household robots or autonomous driving vehicles, the agents need a very high level of robustness and safety. Previous work on robust RL has used random perturbation of a limited number of parameters to change the dynamics [5]. In [1], gradient-based adversarial perturbations were introduced during training. In [4], an adversary agent was itself trained using reinforcement learning.


This approach is more general than [1], and includes both random and systematic adversaries that directly apply adversarial attacks to the input states and dynamics. However, these methods achieve only limited diversity of dynamics, which may not be diverse enough to resemble real-world variety. They also mainly consider the robustness of the model rather than both its robustness and its risk-averse behavior. In this paper, we consider the task of training risk-averse robust reinforcement learning policies. We model risk as the variance of value functions, inspired by [7]. In order to emphasize that the agent should be averse to extreme catastrophe, we design the reward function to be asymmetric. A robust policy should not only maximize the long term expected cumulative reward, but should also select actions with small variance of that expectation. Here, we use ensembles of Q-value networks to estimate the variance of Q values. The work in [3] proposes a similar technique to aid exploration. We consider a scenario with two agents acting in the same environment and competing with each other, and propose to train risk averse robust adversarial reinforcement learning policies (RARARL).

2 RISK AVERSE ROBUST ADVERSARIAL REINFORCEMENT LEARNING

Two Player Reinforcement Learning. We consider the environment as a Markov Decision Process (MDP) $M = \{S, A, R, P, \gamma\}$, where $S$ is the state space, $A$ is the action space, $R(s,a)$ is the reward function, $P(s'|s,a)$ is the state transition model, and $\gamma$ is the reward discount rate. There are two agents in this game. One agent is called the protagonist agent (denoted by $P$), which strives to learn a policy $\pi_P$ that maximizes the cumulative expected reward $\mathbb{E}_{\pi_P}[\sum_t \gamma^t r_t]$. The other agent is called the adversarial agent (denoted by $A$), which strives to learn a policy $\pi_A$ that thwarts the protagonist agent and minimizes the cumulative expected reward. In our implementation, the adversarial agent receives the negative of the environmental reward, $-r$. The protagonist agent should be risk averse, where we model risk as the variance of the value function (introduced below).
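To make the interaction concrete, the following is a minimal sketch (not the paper's implementation): it assumes a gym-style environment whose step() returns (state, reward, done, info) and hypothetical agents exposing act() and observe(); the frequency of adversary turns is left as a parameter.

```python
def run_episode(env, protagonist, adversary, adversary_period=10):
    """Alternate control between the two agents for one episode.

    The adversary acts once every `adversary_period` steps and is
    rewarded with -r, while the protagonist is rewarded with +r.
    """
    s = env.reset()
    done, t = False, 0
    while not done:
        adversary_turn = (t % adversary_period == adversary_period - 1)
        agent = adversary if adversary_turn else protagonist
        a = agent.act(s)
        s_next, r, done, _ = env.step(a)
        # Store the transition with the sign of the reward flipped for the adversary.
        agent.observe(s, a, -r if adversary_turn else r, s_next, done)
        s = s_next
        t += 1
```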


The value of action $a$ at state $s$ is therefore modified as $\hat{Q}_P(s,a) = Q_P(s,a) - \lambda\,\mathrm{Var}_m[Q_P^m(s,a)]$, where $\hat{Q}_P(s,a)$ is the modified Q function, $Q_P(s,a)$ is the original Q function, $\mathrm{Var}_m[Q_P^m(s,a)]$ is the variance of the Q function across $m$ different models, and $\lambda$ is a multiplicative constant, whose value we choose to be 0.1. The term $-\lambda\,\mathrm{Var}_m[Q_P^m(s,a)]$ is called the risk-averse term hereafter. In order for the adversarial agent to be risk seeking, so as to provide more challenges for the protagonist agent, its modified value function for action selection is $\hat{Q}_A(s,a) = Q_A(s,a) + \mu\,\mathrm{Var}_m[Q_A^m(s,a)]$, where $\mu$ is a multiplicative constant whose value is 1. The term $\mu\,\mathrm{Var}_m[Q_A^m(s,a)]$ is called the risk-seeking term hereafter. The action space of this agent is the same as that of the protagonist agent. The two agents take actions in turn during training.

Asymmetric Reward Design. In order to train risk averse agents, we propose an asymmetric reward function: good behavior receives only a small positive reward, while risky behavior receives a very negative reward, so that the trained agent becomes risk averse and robust.

Risk Modeling using Value Variance. The risk of a certain action can be modeled by estimating the variance of the value function across different models trained on different sets of data. Inspired by [3], we estimate the variance of the Q value function by training multiple Q-value networks together, and take the variance of their Q values as the risk measure. The training process for a single agent is the same as in [3].
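As a concrete illustration of this risk-adjusted action selection, the sketch below takes the ensemble mean as the Q estimate and adds (for the adversary) or subtracts (for the protagonist) a multiple of the across-ensemble variance; treating the mean as $Q(s,a)$ and the function names are assumptions rather than details given in this abstract.

```python
import numpy as np

def risk_adjusted_q(q_ensemble, state, coeff):
    """Return per-action values mean_m[Q^m(s, .)] + coeff * Var_m[Q^m(s, .)].

    q_ensemble: list of m callables, each mapping a state to a vector of
    per-action Q values (one bootstrapped head per callable).
    """
    q = np.stack([q_m(state) for q_m in q_ensemble])  # shape (m, |A|)
    return q.mean(axis=0) + coeff * q.var(axis=0)

def select_action(q_ensemble, state, coeff):
    # Greedy action with respect to the risk-adjusted values.
    return int(np.argmax(risk_adjusted_q(q_ensemble, state, coeff)))
```

For example, the protagonist would call select_action(q_ensemble_P, s, coeff=-0.1) to obtain risk-averse behavior, while the adversary would call select_action(q_ensemble_A, s, coeff=1.0) to obtain risk-seeking behavior.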

3 EXPERIMENTS

Simulation Environments. For discrete control, we used the TORCS car racing environment [8], specifically the Michigan Speedway track. The vehicle takes actions in a discrete space of 9 actions, which are combinations of moving ahead, accelerating, and turning left/right. The reward function is defined to be proportional to the speed along the road direction, and collisions are penalized as catastrophes. The catastrophe reward is much more negative than the normal reward.
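A sketch of such an asymmetric reward is given below; the field names and constants are placeholders chosen for illustration, not the exact values used in the experiments.

```python
def asymmetric_reward(speed_along_road, collided,
                      progress_scale=0.01, catastrophe_reward=-2.5):
    """Small positive reward for progress along the road direction,
    much more negative reward for a catastrophe (collision).
    All constants here are illustrative placeholders."""
    if collided:
        return catastrophe_reward
    return progress_scale * max(speed_along_road, 0.0)
```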

Baselines and Our Method. Vanilla DQN. The purpose of comparing with vanilla DQN is to show that models trained in one environment may overfit to that specific dynamics and fail to transfer to other dynamics. We denote this experiment as dqn.

Ensemble DQN. Ensemble DQN tends to be more robust than vanilla DQN since it aggregates the votes of all ensemble policies. However, without being trained on different dynamics, even Ensemble DQN may not work well under adversarial attacks or simple random dynamics changes. We denote this experiment as bsdqn.

Ensemble DQN with Random Perturbations, Without the Risk-Averse Term. We train only the protagonist agent and provide random perturbations. For comparison purposes, we do not include the variance-guided exploration term here; only the Q value function is used for choosing actions. We denote this experiment as bsdqnrand.

Ensemble DQN with Random Perturbations, With the Risk-Averse Term. We train only the protagonist agent and provide random perturbations. The protagonist agent selects actions based on its Q value function and the risk-averse term. We denote this experiment as bsdqnrandriskaverse.

Ensemble DQN with Adversarial Perturbations. This is to compare our model with [4]. To make a fair comparison, we also use Ensemble DQN to train the policy, while the variance term is not used as either a risk-averse or a risk-seeking term in either agent. We denote this experiment as bsdqnadv.

Our Method. We train both the protagonist agent and the adversarial agent with Ensemble DQN, and include the variance-guided exploration term, so the Q function and its variance across different models are used for action selection. We denote this experiment as bsdqnadvriskaverse.

Evaluation. First, testing without perturbations: we tested all trained models without perturbations. Second, testing with random perturbations: we tested the models by randomly taking 1 action for every 10 actions taken by the main agent. Third, testing with adversarial perturbations: random perturbations may not result in catastrophe for the vehicle, so we also tested all models against our trained adversarial agent, which takes 1 action for every 10 actions taken by the main agent. The results are shown in Figure 1.
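The three testing conditions can be sketched as follows: perturber=None corresponds to testing without attack, while passing a random policy or the trained adversary gives the other two conditions. The helper names and the gym-style interface are assumptions made for illustration.

```python
import numpy as np

def evaluate(env, protagonist, perturber=None, episodes=10, period=10):
    """Roll out the protagonist; if a perturber (random or adversarial
    policy) is given, it takes 1 action for every `period` steps."""
    returns = []
    for _ in range(episodes):
        s, done, t, ep_ret = env.reset(), False, 0, 0.0
        while not done:
            if perturber is not None and t % period == period - 1:
                a = perturber.act(s)    # perturbation step
            else:
                a = protagonist.act(s)  # main agent's action
            s, r, done, _ = env.step(a)
            ep_ret += r
            t += 1
        returns.append(ep_ret)
    return float(np.mean(returns))
```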

Figure 1: Testing all models without attack, with random attack, and with adversarial attack (from top to bottom). The reward is divided into distance-related reward (left) and progress-related reward (middle). The catastrophe reward per episode is shown in the last image.

In all result curves, a vertical blue line at 0.55 million steps marks the point at which perturbations, either random or adversarial, begin to be added during training.


4 CONCLUSION

We show that by introducing a notion of risk-averse behavior, training agents together with an adversarial agent results in significantly fewer catastrophes than training without adversarial attacks. The trained adversarial agent is also able to provide perturbations stronger than random perturbations.

REFERENCES

[1] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Li Fei-Fei, and Silvio Savarese. 2017. Adversarially Robust Policy Learning: Active Construction of Physically-Plausible Perturbations. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[2] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature 518, 7540 (2015), 529–533.

[3] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. 2016. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems. 4026–4034.

[4] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. 2017. Robust Adversarial Reinforcement Learning. In International Conference on Machine Learning (ICML).

[5] Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. 2017. EPOpt: Learning robust neural network policies using model ensembles. In International Conference on Learning Representations (ICLR).

[6] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529, 7587 (2016), 484–489.

[7] Aviv Tamar, Dotan Di Castro, and Shie Mannor. 2016. Learning the variance of the reward-to-go. Journal of Machine Learning Research 17, 13 (2016), 1–36.

[8] Bernhard Wymann, Eric Espié, Christophe Guionneau, Christos Dimitrakakis, Rémi Coulom, and Andrew Sumner. 2000. TORCS, the open racing car simulator. Software available at http://torcs.sourceforge.net (2000).
