Proximal Policy Optimization (PPO) Algorithm
A Proximal Policy Optimization (PPO) Algorithm is a model-free, policy-based reinforcement learning algorithm (one that learns a policy function to determine the action to take in each environment state, rather than learning action values).
- Context:
- inputs:
- Initial policy parameters θ
- Environment Env with states s and actions a
- Number of iterations N
- Number of timesteps T per trajectory
- Number of optimization epochs E
- Minibatch size M
- Clipping parameter ε
- Learning rate α
- ...
- output(s):
- Optimized policy parameters θ
- Example(s):
- as expressed in (Schulman et al., 2017).
- …
- Counter-Example(s):
- a Q-Learning Algorithm (a value-based reinforcement learning algorithm).
- a Trust Region Policy Optimization (TRPO) Algorithm.
- See: Trust Region, Policy Gradient Method, OpenAI Reinforcement Learning Baseline, RLHF, PPO Update.
References
2023
INPUT:
    - Initial policy parameters θ
    - Environment Env with states s and actions a
    - Number of iterations N
    - Number of timesteps T per trajectory
    - Number of optimization epochs E
    - Minibatch size M
    - Clipping parameter ε
    - Learning rate α
OUTPUT:
    - Optimized policy parameters θ
ALGORITHM:
    θ_old = θ                                      # behavior policy used to collect data
    FOR i = 1 to N DO:
        # Collect trajectories
        FOR t = 1 to T DO:
            a_t ∼ π_θ(·|s_t)                       # sample action from the current policy (equal to π_{θ_old} at collection time)
            (s_{t+1}, r_t) = Env.step(a_t)         # execute action in the environment
        # Calculate advantages
        A_t = CALCULATE_ADVANTAGES(trajectory)
        FOR e = 1 to E DO:
            # Shuffle the collected data and create minibatches
            MINIBATCHES = CREATE_MINIBATCHES(trajectory, M)
            FOR each minibatch in MINIBATCHES DO:
                # Policy ratio between the new and old policies
                r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)
                # Clipped surrogate objective, averaged over the minibatch
                L(θ) = MEAN_t[ MIN(r_t(θ) * A_t, CLIP(r_t(θ), 1-ε, 1+ε) * A_t) ]
                # Policy update (gradient ascent on the surrogate objective)
                θ = θ + α * GRADIENT_OF(L(θ), θ)
                # Value function update (optional, when a learned critic is used)
                UPDATE_VALUE_FUNCTION(minibatch)
        θ_old = θ                                  # the optimized policy becomes the behavior policy for the next iteration
    END FOR
    RETURN θ
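For concreteness, the clipped surrogate objective and policy update in the listing above can be written compactly in an automatic-differentiation framework. The following is a minimal, illustrative PyTorch-style sketch (not taken from the cited sources); the names ppo_policy_loss, policy, and optimizer are assumptions for illustration.

    import torch

    def ppo_policy_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
        """Clipped surrogate objective L(θ), averaged over a minibatch.

        new_log_probs: log π_θ(a_t|s_t) for the minibatch (requires grad)
        old_log_probs: log π_{θ_old}(a_t|s_t), recorded when the data was collected
        advantages:    estimated advantages A_t for the same (s_t, a_t) pairs
        eps:           clipping parameter ε
        """
        ratio = torch.exp(new_log_probs - old_log_probs)             # r_t(θ)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        # Negate because optimizers minimize, while PPO ascends the surrogate.
        return -torch.min(unclipped, clipped).mean()

    # Hypothetical usage inside the minibatch loop of the listing above:
    # loss = ppo_policy_loss(policy.log_prob(states, actions),
    #                        old_log_probs, advantages, eps=0.2)
    # optimizer.zero_grad(); loss.backward(); optimizer.step()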
2022
- (Wikipedia, 2022) ⇒ https://en.wikipedia.org/wiki/Proximal_Policy_Optimization Retrieved:2022-12-18.
- Proximal Policy Optimization (PPO) is a family of model-free reinforcement learning algorithms developed at OpenAI in 2017. PPO algorithms are policy gradient methods, which means that they search the space of policies rather than assigning values to state-action pairs.
PPO algorithms have some of the benefits of trust region policy optimization (TRPO) algorithms, but they are simpler to implement, more general, and have better sample complexity. This is achieved by using a different objective function.
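The "different objective function" referred to above is the clipped surrogate objective of Schulman et al. (2017), which penalizes policy updates that move the probability ratio r_t(θ) outside the interval [1−ε, 1+ε]:

    L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}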
2022
- https://chat.openai.com/chat
- QUESTION: How does PPO differ from Q-learning?
- ANSWER: Proximal Policy Optimization (PPO) and Q-learning are both reinforcement learning algorithms, but they work in quite different ways.
Q-learning is a value-based reinforcement learning algorithm. It involves learning a value function that estimates the expected reward for an agent taking a given action in a given state. The goal of Q-learning is to find the optimal action-value function, which is the function that maximizes the expected reward for the agent. To do this, the Q-learning algorithm updates the action-value function based on the difference between the expected reward and the actual reward that the agent receives after taking an action. This process is repeated until the action-value function converges to the optimal solution.
PPO, on the other hand, is a policy-based reinforcement learning algorithm. It involves learning a policy function that determines the actions that an agent should take in a given state. The goal of PPO is to find the optimal policy, which is the policy that maximizes the expected reward for the agent. To do this, PPO uses a combination of gradient ascent and importance sampling to update the policy function. The policy function is updated based on the difference between the expected reward and the actual reward that the agent receives after taking an action, weighted by the probability of the action under the current policy. This process is repeated until the policy function converges to the optimal solution.
In summary, Q-learning is a value-based algorithm that learns a value function to determine the optimal action, while PPO is a policy-based algorithm that learns a policy function to determine the optimal action.
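To make the contrast concrete, the following is a minimal, illustrative Python sketch (not from the cited sources) of the two update rules: a tabular Q-learning temporal-difference update versus PPO's ratio-weighted, clipped per-sample surrogate term; all names and default values are hypothetical.

    import numpy as np

    # Q-learning: move the action-value estimate Q[s, a] toward a TD target.
    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        td_target = r + gamma * np.max(Q[s_next])      # best value reachable from the next state
        Q[s, a] += alpha * (td_target - Q[s, a])       # no explicit policy is stored
        return Q

    # PPO: weight the advantage by the (clipped) probability ratio of the new vs. old policy.
    def ppo_surrogate_term(new_prob, old_prob, advantage, eps=0.2):
        ratio = new_prob / old_prob                    # importance-sampling ratio r_t(θ)
        clipped = np.clip(ratio, 1 - eps, 1 + eps)
        return np.minimum(ratio * advantage, clipped * advantage)  # per-sample term of L(θ)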
2020
- (Wikipedia, 2020) ⇒ https://en.wikipedia.org/wiki/Model-free_(reinforcement_learning)#Key_model-free_reinforcement_learning_algorithms Retrieved:2020-12-10.
| Algorithm | Description | Model | Policy | Action Space | State Space | Operator |
|---|---|---|---|---|---|---|
| PPO | Proximal Policy Optimization | Model-Free | On-policy | Continuous | Continuous | Advantage |
2017
- https://openai.com/blog/openai-baselines-ppo/
- QUOTE: We’re releasing a new class of reinforcement learning algorithms, Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune. PPO has become the default reinforcement learning algorithm at OpenAI because of its ease of use and good performance.
2017
- (Schulman et al., 2017) ⇒ John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. (2017). “Proximal Policy Optimization Algorithms.” arXiv preprint arXiv:1707.06347
- ABSTRACT: We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard policy gradient methods perform one gradient update per data sample, we propose a novel objective function that enables multiple epochs of minibatch updates. The new methods, which we call proximal policy optimization (PPO), have some of the benefits of trust region policy optimization (TRPO), but they are much simpler to implement, more general, and have better sample complexity (empirically). Our experiments test PPO on a collection of benchmark tasks, including simulated robotic locomotion and Atari game playing, and we show that PPO outperforms other online policy gradient methods, and overall strikes a favorable balance between sample complexity, simplicity, and wall-time.