Research Question

Standard policy gradient methods perform one gradient update per data sample. We propose a novel objective function that enables multiple epochs of minibatch updates: proximal policy optimization (PPO), which has some of the benefits of trust region policy optimization (TRPO) but is much simpler to implement.

Approach

Policy Gradient Methods

The most commonly used gradient estimator has the form

$\hat{g} = \hat{\mathbb{E}}_t \left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, \hat{A}_t \right]$
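
The estimator $\hat{g}$ is obtained by differentiating the surrogate objective $L^{PG}(\theta) = \hat{\mathbb{E}}_t \left[ \log \pi_{\theta}(a_t \mid s_t) \, \hat{A}_t \right]$. As a rough illustration, a minimal PyTorch sketch of that loss, assuming a discrete-action policy network (all names are illustrative, not from the paper):

```python
# A minimal sketch, assuming PyTorch and a discrete-action policy network
# `policy_net` that maps states to action logits (illustrative names only).
import torch
import torch.nn.functional as F

def policy_gradient_loss(policy_net, states, actions, advantages):
    """Surrogate loss whose autograd gradient is the estimator g_hat above."""
    logits = policy_net(states)                     # (batch, n_actions)
    log_probs = F.log_softmax(logits, dim=-1)       # log pi_theta(. | s_t)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(-1)).squeeze(-1)
    # The batch mean stands in for the empirical expectation E_hat_t;
    # negate so that a gradient-descent minimizer ascends the policy gradient.
    return -(log_pi_a * advantages).mean()
```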

Trust Region Methods

$\text{maximize}_{\theta} \ \hat{\mathbb{E}}_t \left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t \right]$

$\text{subject to} \quad \hat{\mathbb{E}}_t \left[ \text{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t), \pi_{\theta}(\cdot \mid s_t)\right] \right] \leq \delta.$
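
TRPO solves this constrained problem with a conjugate-gradient step followed by a line search, which is part of what PPO aims to simplify. As a rough illustration, a minimal PyTorch sketch of the two quantities involved, the importance-ratio surrogate and the mean KL that the constraint bounds, might look like this (categorical policy assumed; all names are illustrative, not from the paper):

```python
# A minimal sketch, assuming PyTorch and a discrete-action policy network.
# `old_log_probs` and `old_logits` are recorded under pi_theta_old when the
# data were collected; names are illustrative, not from the paper.
import torch
import torch.nn.functional as F

def surrogate_and_kl(policy_net, states, actions, advantages,
                     old_log_probs, old_logits):
    logits = policy_net(states)
    log_probs = F.log_softmax(logits, dim=-1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(-1)).squeeze(-1)

    # Importance ratio pi_theta / pi_theta_old inside the surrogate objective.
    ratio = torch.exp(log_pi_a - old_log_probs)
    surrogate = (ratio * advantages).mean()

    # Mean KL[pi_theta_old || pi_theta] over sampled states; TRPO requires
    # this quantity to stay below delta.
    old_dist = torch.distributions.Categorical(logits=old_logits)
    new_dist = torch.distributions.Categorical(logits=logits)
    mean_kl = torch.distributions.kl_divergence(old_dist, new_dist).mean()
    return surrogate, mean_kl
```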

Proximal Policy Optimization
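
The paper's central objective is the clipped surrogate. With the probability ratio $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$, the objective is

$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\ \text{clip}\left(r_t(\theta), 1 - \epsilon, 1 + \epsilon\right) \hat{A}_t \right) \right]$

which removes the incentive to move $r_t(\theta)$ outside $[1 - \epsilon, 1 + \epsilon]$ and can therefore be optimized with several epochs of minibatch SGD on the same batch of samples. A minimal PyTorch sketch of the corresponding loss, assuming precomputed log-probabilities (names are illustrative; $\epsilon = 0.2$ is the value used in the paper):

```python
# A minimal sketch of the clipped surrogate loss, assuming PyTorch and
# precomputed tensors; names are illustrative, not from the paper.
import torch

def ppo_clip_loss(log_pi_a, old_log_probs, advantages, epsilon=0.2):
    ratio = torch.exp(log_pi_a - old_log_probs)               # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the pessimistic (minimum) surrogate and negate it for a minimizer.
    return -torch.min(unclipped, clipped).mean()
```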

Experiments

IDC