Off-Policy Imitation Learning from Observations
Objective of GAIL
- Requires expert actions => learning from observations (BCO, GAILfO)
- On-policy, hence low sample efficiency =>
  - off-policy algorithms rather than TRPO (SQIL, DAC)
  - offline reward estimation (PWIL)
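For reference, the objective referred to here is presumably the standard GAIL saddle-point problem (Ho & Ermon, 2016):

$$\min_{\pi}\max_{D}\;\mathbb{E}_{\pi}\big[\log D(s,a)\big] + \mathbb{E}_{\pi_E}\big[\log\big(1-D(s,a)\big)\big] - \lambda H(\pi)$$

where $D$ is a discriminator separating policy state-action pairs from expert pairs, $\pi_E$ is the expert policy, and $H(\pi)$ is the causal entropy. Both limitations above are visible in this form: the $\pi_E$ expectation needs expert actions, and the $\pi$ expectation needs fresh on-policy rollouts.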
Learning from Demonstration vs. Learning from Observation
Connection between LfD and LfO
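One standard way to state the connection (given here from the general LfO literature; the slide's own derivation is not preserved): LfD matches state-action occupancy measures, while LfO, lacking expert actions, matches state-transition occupancies,

$$\text{LfD:}\;\min_{\pi} D_{\mathrm{KL}}\big(\rho_{\pi}(s,a)\,\|\,\rho_{E}(s,a)\big) \qquad \text{LfO:}\;\min_{\pi} D_{\mathrm{KL}}\big(\rho_{\pi}(s,s')\,\|\,\rho_{E}(s,s')\big)$$

where $\rho$ denotes the discounted occupancy measure. By the chain rule of the KL divergence, the LfD objective equals the LfO objective plus the expected disagreement between the inverse dynamics of $\pi$ and of the expert, so the two coincide when their inverse dynamics agree.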
surrogate expert buffer: $D_s = \{(s, a)_i\}$
perform AIRL on $D_s$
Similar “Surrogate Distribution”
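A minimal sketch of this surrogate-buffer idea, assuming a learned inverse dynamics model is used to label the expert's state-only transitions with actions; the module layout, MSE loss, and function names below are illustrative assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn


class InverseDynamics(nn.Module):
    """Predicts the action that takes the agent from s to s'."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, s, s_next):
        return self.net(torch.cat([s, s_next], dim=-1))


def fit_inverse_dynamics(model, s, a, s_next, epochs=50, lr=3e-4):
    """Supervised regression on the agent's own (s, a, s') transitions."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        loss = ((model(s, s_next) - a) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model


@torch.no_grad()
def build_surrogate_buffer(model, expert_s, expert_s_next):
    """Label state-only expert transitions with inferred actions -> D_s."""
    inferred_a = model(expert_s, expert_s_next)
    return [(si, ai) for si, ai in zip(expert_s, inferred_a)]
```

An AIRL learner can then be run on $D_s$ as if it contained real expert actions; the quality of $D_s$ is bounded by how well the inverse dynamics model transfers to the expert's state distribution.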
pseudo-reward
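The slide only names the pseudo-reward; a common discriminator-induced choice in adversarial LfO (possibly not the exact form used here) is

$$\tilde r(s, s') = \log D(s, s') - \log\big(1 - D(s, s')\big)$$

which, at the discriminator's optimum, equals the log-ratio $\log\big(\rho_E(s,s')/\rho_\pi(s,s')\big)$ of expert to agent state-transition occupancies.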
no on-policy expectation left
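As a generic illustration of how on-policy expectations can be eliminated (an importance-weighting identity, not necessarily the exact transformation used on this slide): for any replay distribution $\rho_R$ with full support,

$$\mathbb{E}_{(s,s')\sim\rho_{\pi}}\big[f(s,s')\big] = \mathbb{E}_{(s,s')\sim\rho_{R}}\Big[\frac{\rho_{\pi}(s,s')}{\rho_{R}(s,s')}\,f(s,s')\Big]$$

so the objective can be evaluated on off-policy replay data once a distribution ratio of this kind is estimated (see Implementation Details below).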
Implementation Details
- optimize the ratio using GAN (sketched below)
- additional regularization
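A minimal sketch of the "optimize the ratio using GAN" step, assuming the ratio is recovered from the logit of a binary discriminator trained to separate expert transitions from replay-buffer transitions, with a gradient penalty standing in for the additional regularization; all names and hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransitionDiscriminator(nn.Module):
    """Binary classifier over (s, s') pairs: expert vs. replay."""

    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, s_next):
        # The logit of the optimal discriminator approximates
        # log(rho_expert(s, s') / rho_replay(s, s')).
        return self.net(torch.cat([s, s_next], dim=-1)).squeeze(-1)


def discriminator_step(disc, opt, expert_batch, replay_batch, gp_coef=10.0):
    """One GAN-style update; the gradient penalty is the extra regularizer.

    expert_batch and replay_batch are (s, s') tensor pairs of equal batch size.
    """
    es, esn = expert_batch
    rs, rsn = replay_batch
    expert_logits = disc(es, esn)
    replay_logits = disc(rs, rsn)
    loss = (
        F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
        + F.binary_cross_entropy_with_logits(replay_logits, torch.zeros_like(replay_logits))
    )

    # Gradient penalty on interpolated inputs (illustrative regularization).
    alpha = torch.rand(es.size(0), 1)
    mix_s = (alpha * es + (1 - alpha) * rs).requires_grad_(True)
    mix_sn = (alpha * esn + (1 - alpha) * rsn).requires_grad_(True)
    grads = torch.autograd.grad(disc(mix_s, mix_sn).sum(), [mix_s, mix_sn], create_graph=True)
    gp = sum(g.norm(2, dim=-1).sub(1.0).pow(2).mean() for g in grads)
    loss = loss + gp_coef * gp

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```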
End