Standard policy gradient methods perform one gradient update per data sample. We propose a novel objective function that enables multiple epochs of minibatch updates: proximal policy optimization (PPO), which has some of the benefits of trust region policy optimization (TRPO) but is much simpler to implement.
The most commonly used gradient estimator has the form
$\hat{g} = \hat{\mathbb{E}}_t \left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \hat{A}_t \right]$
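As a concrete illustration, the sketch below (assuming PyTorch and a categorical policy; `policy_logits`, `actions`, and `advantages` are hypothetical precomputed batch tensors) builds a loss whose gradient is this estimator:

```python
import torch

def pg_loss(policy_logits, actions, advantages):
    """Surrogate loss whose gradient matches the estimator g_hat.

    policy_logits: (T, num_actions) logits of pi_theta at the visited states
    actions:       (T,) actions a_t that were actually taken
    advantages:    (T,) advantage estimates A_hat_t (treated as constants)
    """
    dist = torch.distributions.Categorical(logits=policy_logits)
    log_probs = dist.log_prob(actions)              # log pi_theta(a_t | s_t)
    # Negated because optimizers minimize; detach() keeps A_hat_t constant.
    return -(log_probs * advantages.detach()).mean()
```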
In TRPO, a surrogate objective is maximized subject to a constraint on the size of the policy update:
$\text{maximize}_{\theta} \ \hat{\mathbb{E}}_t \left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t \right]$
$\text{subject to} \quad \hat{\mathbb{E}}_t \left[ \text{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t), \pi_{\theta}(\cdot \mid s_t)\right] \right] \leq \delta.$
The theory justifying TRPO actually suggests using a penalty instead of a constraint, i.e., solving the unconstrained optimization problem:
$\text{maximize}_{\theta} \; \hat{\mathbb{E}}_t \left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t - \beta \, \text{KL}\left[\pi_{\theta_{\text{old}}}(\cdot \mid s_t), \pi_{\theta}(\cdot \mid s_t)\right] \right]$
TRPO uses a hard constraint rather than a penalty because it is hard to choose a single value of $\beta$ that performs well across different problems, or even within a single problem, where the characteristics change over the course of learning.
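As a rough sketch of the penalized form (assuming PyTorch, categorical policies, and caller-supplied `old_logits`, `actions`, `advantages`, and `beta`; the names are illustrative, not from the paper):

```python
import torch

def kl_penalized_loss(new_logits, old_logits, actions, advantages, beta):
    """Negated KL-penalized surrogate objective (to be minimized)."""
    new_dist = torch.distributions.Categorical(logits=new_logits)
    old_dist = torch.distributions.Categorical(logits=old_logits.detach())
    # Probability ratio pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = torch.exp(new_dist.log_prob(actions) - old_dist.log_prob(actions))
    # Per-state KL[pi_theta_old || pi_theta]
    kl = torch.distributions.kl_divergence(old_dist, new_dist)
    return -(ratio * advantages.detach() - beta * kl).mean()
```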
Conservative Policy Iteration Objective
$L^{CPI}(\theta) = \hat{\mathbb{E}}_t \left[ \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t \right] = \hat{\mathbb{E}}_t \left[ r_t(\theta) \hat{A}_t \right]$, where $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ denotes the probability ratio.
Clipped Surrogate Objective: penalizes changes to the policy that move $r_t(θ)$ away from 1.
$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]$
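A minimal sketch of $L^{CLIP}$ (assuming PyTorch; `new_log_probs` and `old_log_probs` are log-probabilities of the taken actions under the current and old policies, and `eps` plays the role of $\epsilon$, e.g. 0.2):

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Negated L^CLIP (to be minimized)."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())       # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # The min makes the clipped objective a pessimistic bound on the
    # unclipped one, removing the incentive to move r_t far from 1.
    return -torch.min(unclipped, clipped).mean()
```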
Final PPO objective:
$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t \left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right]$
$L_t^{VF}(\theta)$: squared-error loss $(V_\theta(s_t) - V_t^{targ})^2$
$S[\pi_\theta](s_t)$: an entropy bonus. The bonus enters with a positive coefficient $c_2$, so maximizing the objective pushes the policy toward higher entropy (i.e., more uncertainty) at each state $s_t$, which discourages premature collapse to a deterministic policy and thereby encourages exploration.
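Putting the three terms together, a sketch of the combined loss, reusing `clipped_surrogate_loss` from above (here `c1` and `c2` stand for $c_1$ and $c_2$ with illustrative default values, and `entropy` is the per-state policy entropy $S[\pi_\theta](s_t)$, all assumed to be supplied by the caller):

```python
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages,
             values, value_targets, entropy, c1=0.5, c2=0.01):
    """Negated L^{CLIP+VF+S} (to be minimized)."""
    policy_loss = clipped_surrogate_loss(new_log_probs, old_log_probs, advantages)
    value_loss = ((values - value_targets) ** 2).mean()     # L^VF
    entropy_bonus = entropy.mean()                          # S[pi_theta](s_t)
    # Signs flip relative to the objective because this is a loss.
    return policy_loss + c1 * value_loss - c2 * entropy_bonus
```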
Truncated version of generalized advantage estimation:
One style of policy gradient implementation, popularized in the A3C paper, runs the policy for $T$ timesteps (where $T$ is much less than the episode length) and uses the collected samples for an update. It requires an advantage estimator that does not look beyond timestep $T$:
$\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T - t - 1} r_{T - 1} + \gamma^{T - t} V(s_T).$
Generalizing this choice, we can use a truncated version of generalized advantage estimation, which reduces to the estimator above when $\lambda = 1$:
$\hat{A}_t = \delta_t + (\gamma \lambda) \delta_{t+1} + \cdots + (\gamma \lambda)^{T - t - 1} \delta_{T - 1} = \delta_t + \gamma\lambda\hat{A}_{t+1},$ where
$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t).$
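A sketch of this estimator as a backward recursion (assuming PyTorch tensors; `rewards` holds $r_0,\dots,r_{T-1}$ and `values` holds $V(s_0),\dots,V(s_T)$ including the bootstrap value, with terminations inside the segment ignored for brevity):

```python
import torch

def truncated_gae(rewards, values, gamma=0.99, lam=0.95):
    """Truncated GAE: A_hat_t = delta_t + gamma * lam * A_hat_{t+1}.

    rewards: (T,)   rewards r_t
    values:  (T+1,) value estimates V(s_t), including the bootstrap V(s_T)
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```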
A proximal policy optimization (PPO) algorithm that uses fixed-length trajectory segments: each of $N$ (parallel) actors runs the policy for $T$ timesteps, the advantage estimates are computed on these $NT$ timesteps of data, and the surrogate loss is then optimized with minibatch SGD (or Adam) for $K$ epochs.
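A compressed sketch of the resulting training loop for the single-actor case ($N = 1$), reusing the helpers above. Here `collect_segment` and `policy.evaluate` are hypothetical stand-ins for rollout collection and for re-evaluating log-probabilities and entropy on stored data (they are not part of the paper), and `value_fn` is assumed to return a 1-D tensor of detached state values:

```python
import torch

def ppo_train(policy, value_fn, optimizer, collect_segment,
              iterations=1000, T=2048, K=10, minibatch_size=64):
    for _ in range(iterations):
        # 1. Run the current policy for a fixed-length segment of T timesteps.
        #    All returned tensors are assumed detached from the graph.
        states, actions, old_log_probs, rewards, values = collect_segment(
            policy, value_fn, T)

        # 2. Truncated-GAE advantages and value-function targets.
        advantages = truncated_gae(rewards, values)
        value_targets = advantages + values[:-1]

        # 3. K epochs of minibatch updates on the same segment of data.
        for _ in range(K):
            for idx in torch.randperm(T).split(minibatch_size):
                new_log_probs, entropy = policy.evaluate(states[idx], actions[idx])
                loss = ppo_loss(new_log_probs, old_log_probs[idx],
                                advantages[idx], value_fn(states[idx]),
                                value_targets[idx], entropy)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
```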