Proximal Policy Optimization Algorithms implementation

This project implements the algorithm from the PPO paper using Gymnasium environments, PyTorch, and Wandb for metrics visualization.

Overview

In this project, I implement PPO, a widely used policy gradient algorithm, from scratch in PyTorch. The goal is to train agents on control tasks provided by Gymnasium (e.g., CartPole-v1, LunarLander-v3, Acrobot-v1). Key highlights:

Policy and Value Networks: Actor (policy) and critic (value) networks built with PyTorch, either as separate networks or as a single shared network.
Clipped Surrogate Objective: Implementation of PPO’s clipped loss function for stable updates.
Advantage Estimation: Generalized Advantage Estimation (GAE) for variance reduction.
Logging & Visualization: Track training metrics (reward, loss) via Wandb.
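
To make the Advantage Estimation and Clipped Surrogate Objective highlights concrete, here is a minimal sketch of both computations. It is written in plain Python on lists so it runs without dependencies; it is not the repository's code, which operates on PyTorch tensors. The function names, argument names, and default values (gamma=0.99, lam=0.95, clip_ratio=0.2) are assumptions chosen for illustration.

```python
import math


def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout using GAE(lambda).

    rewards, values, and dones are equal-length lists for steps 0..T-1.
    dones[t] is True when the episode ended after the action at step t.
    last_value is the critic's estimate for the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors, reset at episode ends.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets for the critic: advantage plus the value baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_ratio=0.2):
    """Mean PPO clipped policy loss (the negated objective, to be minimized)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the rollout policy.
        ratio = math.exp(new_lp - old_lp)
        clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
        # Taking the minimum keeps the pessimistic (lower) objective value.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)
```

With a positive advantage, the loss stops improving once the ratio exceeds 1 + clip_ratio, which is what keeps each policy update close to the policy that collected the data.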

Features

Training Loop: Mini-batch updates with learning rate scheduling.
Checkpointing: Save and load model weights for reproducibility.
Wandb Integration: Visualize reward curves, loss components, and learning rate.
Configurable Hyperparameters: Easily adjust learning rate, clip ratio, batch size, and more.
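
The training loop features above can be sketched as follows. This is an illustration only: it assumes a linear learning rate decay and shuffled, non-overlapping mini-batches, which are common choices for PPO but may not match the schedule or batching used in this repository. The helper names linear_lr and minibatch_indices are hypothetical.

```python
import random


def linear_lr(initial_lr, update, total_updates):
    """Linearly decay the learning rate from initial_lr toward 0."""
    frac = 1.0 - update / total_updates
    return initial_lr * frac


def minibatch_indices(batch_size, minibatch_size, rng):
    """Yield shuffled index lists that together cover the rollout batch once."""
    indices = list(range(batch_size))
    rng.shuffle(indices)
    for start in range(0, batch_size, minibatch_size):
        yield indices[start:start + minibatch_size]


rng = random.Random(0)  # fixed seed so the shuffling is reproducible
for update in range(3):
    lr = linear_lr(3e-4, update, total_updates=3)
    for epoch in range(2):
        for mb in minibatch_indices(8, 4, rng):
            # A real loop would compute the PPO loss on the samples in mb
            # and take one optimizer step with learning rate lr.
            pass
```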

Installation

Clone the repository:

git clone https://github.com/oussamakharouiche/PPO-Implementation.git
cd PPO-Implementation

Create a virtual environment and install dependencies:

python3 -m venv ppo
source ppo/bin/activate
pip install -r requirements.txt

Usage

Create a config file if one does not already exist.

Train the PPO agent:

python3 ppo.py --config-path ./configs/cartpole_config.yaml

Evaluate the agent:

python3 evaluate.py --config-path ./configs/cartpole_config.yaml

Results

Results averaged over 100 evaluation runs:

Environment	Avg. Reward	Std. Dev
CartPole-v1	499.87	1.29
LunarLander-v3	275.54	36.18
Acrobot-v1	-81.00	20.74

Note: Results may vary based on random seed and hyperparameters.
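
For reference, the two table columns can be computed from a list of per-episode returns as sketched below. The function name summarize_returns is hypothetical, and the use of the population standard deviation is an assumption; the evaluation script may use the sample standard deviation instead.

```python
import statistics


def summarize_returns(episode_returns):
    """Return (mean, population standard deviation) of episode returns."""
    mean = statistics.fmean(episode_returns)
    std = statistics.pstdev(episode_returns)
    return mean, std
```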

Bibliography

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347.

Links

GitHub Repository: https://github.com/oussamakharouiche/PPO-Implementation