Page 1:

http://lamda.nju.edu.cn

Off-Policy Imitation Learning from Observations

Page 2:

Objective of GAIL
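
The equation itself did not survive the extraction; for reference, the standard GAIL objective (Ho & Ermon, 2016) is

\min_{\pi} \max_{D} \; \mathbb{E}_{\pi}[\log D(s,a)] + \mathbb{E}_{\pi_E}[\log(1 - D(s,a))] - \lambda H(\pi)

i.e., an adversarial game whose saddle point matches the state-action occupancy \rho_\pi(s,a) to the expert occupancy \rho_E(s,a) (a Jensen-Shannon divergence minimization, up to the entropy term H(\pi)).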

Limitations of GAIL and the follow-up directions that address them:

- Requires expert actions => learning from observations only (BCO, GAILfO)
- On-policy optimization with TRPO (low sample efficiency) => off-policy algorithms instead of TRPO (SQIL, DAC), or offline reward estimation (PWIL)

Page 3:

Learning from Demonstration (LfD) vs. Learning from Observation (LfO)

Connection between LfD and LfO

Page 4:

Connection between LfD and LfO
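
The derivation on this slide is not recoverable from the extraction; a standard way to connect the two settings (and the statement the garbled page headers appear to reference) is the chain rule of the KL divergence applied to the joint occupancy over (s, a, s'). Since the agent and the expert act under the same dynamics T(s' \mid s, a),

D_{\mathrm{KL}}\big(\rho_\pi(s,a) \,\|\, \rho_E(s,a)\big) = D_{\mathrm{KL}}\big(\rho_\pi(s,s') \,\|\, \rho_E(s,s')\big) + \mathbb{E}_{\rho_\pi(s,s')}\big[ D_{\mathrm{KL}}\big(\pi(a \mid s,s') \,\|\, \pi_E(a \mid s,s')\big) \big]

so the LfD objective (left-hand side) upper-bounds the LfO objective (first term on the right). The gap is the inverse-dynamics disagreement D_{\mathrm{KL}}[\pi(a \mid s,s') \,\|\, \pi_E(a \mid s,s')], which cannot be optimized without knowing expert actions in a non-injective MDP, and which vanishes for deterministic, injective dynamics, where LfO and LfD coincide.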

Page 5:

Page 6:

surrogate expert buffer: 𝐷𝑆 = {(𝑠, 𝑎)𝑖}

perform AIRL on 𝐷𝑆

Similar “Surrogate Distribution”
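
As a concrete illustration (not necessarily the talk's exact procedure), a minimal sketch of one way to materialize such a surrogate buffer: relabel the expert's observation-only transitions with actions predicted by an inverse dynamics model trained on the agent's own experience, then run the AIRL-style update on the relabeled pairs. The names inverse_model and expert_obs below are assumptions made for the sketch.

import torch

# Assumed components (not from the slides):
#   inverse_model(s, s_next) -> a_hat   inverse dynamics model fit on agent rollouts
#   expert_obs                          expert (s, s') pairs; true actions are unavailable

def build_surrogate_buffer(inverse_model, expert_obs):
    """Build D_S = {(s, a)_i} by inferring an action for each expert transition."""
    surrogate = []
    with torch.no_grad():
        for s, s_next in expert_obs:
            a_hat = inverse_model(s, s_next)  # surrogate expert action
            surrogate.append((s, a_hat))
    return surrogate

# D_S then plays the role of an expert (s, a) buffer in the AIRL discriminator/policy updates.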

Page 7:

pseudo-reward
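
The exact definition on the slide is lost in the extraction; a common choice in discriminator-based LfO methods is an AIRL-style pseudo-reward over state transitions,

r(s, s') = \log D(s, s') - \log\big(1 - D(s, s')\big)

which at the discriminator's optimum equals the log density ratio \log\big(\rho_E(s,s') / \rho_\pi(s,s')\big).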

Page 8:

no on-policy expectation left: all remaining expectations can be estimated from off-policy data (expert transitions, the replay buffer, and initial states)
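
The intermediate steps are not recoverable here; the identity that DICE-style derivations use to remove the on-policy expectation is, for the normalized discounted occupancy d^\pi with initial state distribution \mu_0 and any test function Q,

\mathbb{E}_{(s,a) \sim d^\pi}\big[ Q(s,a) - \gamma\, \mathbb{E}_{s' \sim T(\cdot \mid s,a),\, a' \sim \pi(\cdot \mid s')}[Q(s',a')] \big] = (1-\gamma)\, \mathbb{E}_{s_0 \sim \mu_0,\, a_0 \sim \pi(\cdot \mid s_0)}[Q(s_0,a_0)]

so, after a change of variables (and convex/Fenchel duality for the remaining nonlinear terms), every expectation can be taken under the expert and replay distributions and \mu_0 rather than under the current policy's occupancy.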

Page 9:

Implementation Details

optimize the density-ratio term with a GAN-style discriminator

additional regularization
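
A minimal sketch of the ratio-estimation step, assuming it follows the usual GAN recipe: a discriminator over (s, s') pairs whose optimal logit equals the log density ratio. The extracted text does not say which regularizer is meant, so a gradient penalty is included purely as one common option; all class and variable names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RatioDiscriminator(nn.Module):
    """Logit f(s, s'); under logistic training its optimum is log(rho_E / rho_pi)."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1)).squeeze(-1)

def discriminator_loss(disc, expert_s, expert_sn, agent_s, agent_sn, gp_coef=10.0):
    # GAN (logistic) loss: expert transitions are the positive class.
    logit_e = disc(expert_s, expert_sn)
    logit_a = disc(agent_s, agent_sn)
    gan = F.binary_cross_entropy_with_logits(logit_e, torch.ones_like(logit_e)) \
        + F.binary_cross_entropy_with_logits(logit_a, torch.zeros_like(logit_a))

    # Gradient penalty on interpolated inputs (one possible "additional regularization";
    # an assumption, not confirmed by the slides).
    alpha = torch.rand(expert_s.size(0), 1)
    mix_s = (alpha * expert_s + (1 - alpha) * agent_s).requires_grad_(True)
    mix_sn = (alpha * expert_sn + (1 - alpha) * agent_sn).requires_grad_(True)
    grads = torch.autograd.grad(disc(mix_s, mix_sn).sum(), [mix_s, mix_sn], create_graph=True)
    grad_norm = torch.cat(grads, dim=-1).norm(2, dim=-1)
    gp = ((grad_norm - 1.0) ** 2).mean()

    return gan + gp_coef * gp

The discriminator's logit f(s, s') then supplies the log-ratio (and hence the pseudo-reward) used in the main objective.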

Page 10:

Page 11:

Page 12:

Page 13:

End