Page 1:

http://lamda.nju.edu.cn

Off-Policy Imitation Learning from Observations

Page 2:

Objective of GAIL
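
The equation itself did not survive the extraction; for reference, the standard GAIL objective (Ho & Ermon, 2016) is

\min_{\pi} \max_{D} \; \mathbb{E}_{\pi}[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi)

i.e., an adversarial game whose saddle point matches the state-action occupancy \rho_\pi(s,a) to the expert occupancy \rho_E(s,a) (a Jensen-Shannon divergence minimization, up to the entropy term H(\pi)).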

Limitations of GAIL and the follow-up directions that address them:

- Requires expert actions => learning from observations only (BCO, GAILfO)
- On-policy optimization with TRPO (low sample efficiency) => off-policy algorithms instead of TRPO (SQIL, DAC), or offline reward estimation (PWIL)

Page 3:

Learning from Demonstration (LfD) vs. Learning from Observation (LfO)

Connection between LfD and LfO

Page 4:

Connection between LfD and LfO
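
The derivation on this slide is not recoverable from the extraction; a standard way to connect the two settings (and the statement the garbled page headers appear to reference) is the chain rule of the KL divergence applied to the joint occupancy over (s, a, s'). Since the agent and the expert act under the same dynamics T(s' \mid s, a),

D_{\mathrm{KL}}\big(\rho_\pi(s,a) \,\|\, \rho_E(s,a)\big) = D_{\mathrm{KL}}\big(\rho_\pi(s,s') \,\|\, \rho_E(s,s')\big) + \mathbb{E}_{\rho_\pi(s,s')}\big[ D_{\mathrm{KL}}\big(\pi(a \mid s,s') \,\|\, \pi_E(a \mid s,s')\big) \big]

so the LfD objective (left-hand side) upper-bounds the LfO objective (first term on the right). The gap is the inverse-dynamics disagreement D_{\mathrm{KL}}[\pi(a \mid s,s') \,\|\, \pi_E(a \mid s,s')], which cannot be optimized without knowing expert actions in a non-injective MDP, and which vanishes for deterministic, injective dynamics, where LfO and LfD coincide.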

Page 5:

Page 6:

surrogate expert buffer: 𝐷𝑆 = {(𝑠, 𝑎)𝑖}

perform AIRL on 𝐷𝑆

Similar “Surrogate Distribution”
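
As a concrete illustration (not necessarily the talk's exact procedure), a minimal sketch of one way to materialize such a surrogate buffer: relabel the expert's observation-only transitions with actions predicted by an inverse dynamics model trained on the agent's own experience, then run the AIRL-style update on the relabeled pairs. The names inverse_model and expert_obs below are assumptions made for the sketch.

import torch

# Assumed components (not from the slides):
#   inverse_model(s, s_next) -> a_hat   inverse dynamics model fit on agent rollouts
#   expert_obs                          expert (s, s') pairs; true actions are unavailable

def build_surrogate_buffer(inverse_model, expert_obs):
    """Build D_S = {(s, a)_i} by inferring an action for each expert transition."""
    surrogate = []
    with torch.no_grad():
        for s, s_next in expert_obs:
            a_hat = inverse_model(s, s_next)  # surrogate expert action
            surrogate.append((s, a_hat))
    return surrogate

# D_S then plays the role of an expert (s, a) buffer in the AIRL discriminator/policy updates.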

Page 7:

pseudo-reward
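
The exact definition on the slide is lost in the extraction; a common choice in discriminator-based LfO methods is an AIRL-style pseudo-reward over state transitions,

r(s, s') = \log D(s, s') - \log\big(1 - D(s, s')\big)

which at the discriminator's optimum equals the log density ratio \log\big(\rho_E(s,s') / \rho_\pi(s,s')\big).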

Page 8:

no on-policy expectation left: all remaining expectations can be estimated from off-policy data (expert transitions, the replay buffer, and initial states)
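
The intermediate steps are not recoverable here; the identity that DICE-style derivations use to remove the on-policy expectation is, for the normalized discounted occupancy d^\pi with initial state distribution \mu_0 and any test function Q,

\mathbb{E}_{(s,a) \sim d^\pi}\big[ Q(s,a) - \gamma\, \mathbb{E}_{s' \sim T(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}[Q(s',a')] \big] = (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(\cdot \mid s_0)}[Q(s_0,a_0)]

so, after a change of variables (and convex/Fenchel duality for the remaining nonlinear terms), every expectation can be taken under the expert and replay distributions and \mu_0 rather than under the current policy's occupancy.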

Page 9:

Implementation Details

optimize the density-ratio term with a GAN-style discriminator

additional regularization
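
A minimal sketch of the ratio-estimation step, assuming it follows the usual GAN recipe: a discriminator over (s, s') pairs whose optimal logit equals the log density ratio. The extracted text does not say which regularizer is meant, so a gradient penalty is included purely as one common option; all class and variable names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RatioDiscriminator(nn.Module):
    """Logit f(s, s'); under logistic training its optimum is log(rho_E / rho_pi)."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1)).squeeze(-1)

def discriminator_loss(disc, expert_s, expert_sn, agent_s, agent_sn, gp_coef=10.0):
    # GAN (logistic) loss: expert transitions are the positive class.
    logit_e = disc(expert_s, expert_sn)
    logit_a = disc(agent_s, agent_sn)
    gan = F.binary_cross_entropy_with_logits(logit_e, torch.ones_like(logit_e)) \
        + F.binary_cross_entropy_with_logits(logit_a, torch.zeros_like(logit_a))

    # Gradient penalty on interpolated inputs (one possible "additional regularization";
    # an assumption, not confirmed by the slides).
    alpha = torch.rand(expert_s.size(0), 1)
    mix_s = (alpha * expert_s + (1 - alpha) * agent_s).requires_grad_(True)
    mix_sn = (alpha * expert_sn + (1 - alpha) * agent_sn).requires_grad_(True)
    grads = torch.autograd.grad(disc(mix_s, mix_sn).sum(), [mix_s, mix_sn], create_graph=True)
    grad_norm = torch.cat(grads, dim=-1).norm(2, dim=-1)
    gp = ((grad_norm - 1.0) ** 2).mean()

    return gan + gp_coef * gp

The discriminator's logit f(s, s') then supplies the log-ratio (and hence the pseudo-reward) used in the main objective.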

Page 10:

Page 11:

Page 12:

Page 13:

End