This is a PPO (Proximal Policy Optimization) agent trained from scratch on the LunarLander-v3 environment using PyTorch and Gymnasium.
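Before loading the model, you can sanity-check the environment interface; this small snippet (assuming the Box2D extra of Gymnasium is installed) confirms the observation and action dimensions that the networks below expect:

```python
import gymnasium as gym

# LunarLander-v3 needs the Box2D extra: pip install "gymnasium[box2d]"
env = gym.make("LunarLander-v3")

# These match the network sizes described below.
print(env.observation_space.shape)  # (8,)
print(env.action_space.n)           # 4
env.close()
```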
Actor: Input (8) → FC(64) → Tanh → FC(64) → Tanh → Output(4) → Softmax
Total Parameters: ~4,500

Critic: Input (8) → FC(64) → Tanh → FC(64) → Tanh → Output(1)
Total Parameters: ~4,700

Total Model Parameters: 9,221
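The critic is not required for inference, but if you want to inspect value estimates from the checkpoint, a sketch mirroring the architecture above could look like this (the `critic_state_dict` key is an assumption; check the checkpoint's keys before loading):

```python
import torch

class Critic(torch.nn.Module):
    """Value network: Input(8) -> FC(64) -> Tanh -> FC(64) -> Tanh -> Output(1)."""

    def __init__(self, state_dim, hidden_dim=64):
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(state_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, 1),
        )

    def forward(self, x):
        return self.network(x)

critic = Critic(state_dim=8)
print(sum(p.numel() for p in critic.parameters()))  # parameter count of this sketch

# checkpoint = torch.load("ppo_lunarlander.pth", map_location="cpu")
# critic.load_state_dict(checkpoint["critic_state_dict"])  # key name is an assumption
```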
The agent successfully solves LunarLander-v3, averaging a reward of 261.80 against the 200-point solved threshold (see the results table below). Example usage:
import torch
import gymnasium as gym
from huggingface_hub import hf_hub_download
# Download model
model_path = hf_hub_download(
    repo_id="aryannzzz/ppo-lunarlander-scratch",
    filename="ppo_lunarlander.pth"
)
# Define Actor Network
class Actor(torch.nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.network = torch.nn.Sequential(
            torch.nn.Linear(state_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, hidden_dim),
            torch.nn.Tanh(),
            torch.nn.Linear(hidden_dim, action_dim),
            torch.nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.network(x)
# Load model
checkpoint = torch.load(model_path, map_location='cpu')
actor = Actor(state_dim=8, action_dim=4)
actor.load_state_dict(checkpoint['actor_state_dict'])
actor.eval()
# Use the agent
env = gym.make('LunarLander-v3', render_mode='human')
state, _ = env.reset()
done = False
total_reward = 0
while not done:
    state_tensor = torch.FloatTensor(state).unsqueeze(0)
    with torch.no_grad():
        action_probs = actor(state_tensor)
    action = action_probs.argmax().item()  # Greedy action selection
    state, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated
print(f"Total Reward: {total_reward:.2f}")
env.close()
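The loop above always takes the greedy (argmax) action. To instead sample from the policy's categorical distribution, you can swap the action-selection lines inside the loop for something like this (reusing `actor` and `state_tensor` from above):

```python
from torch.distributions import Categorical

with torch.no_grad():
    action_probs = actor(state_tensor)
action = Categorical(probs=action_probs).sample().item()  # stochastic instead of argmax
```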
- `ppo_lunarlander.pth` - Complete model checkpoint (actor + critic + optimizer states)
- `ppo_training.png` - Training progress visualization

This model was trained using a custom PPO implementation: `04_ppo_from_scratch.py`

| Metric | Value |
|---|---|
| Average Reward | 261.80 |
| Solved Threshold | 200+ |
| Training Episodes | 2,000 |
| Training Time | ~12 minutes (GPU) |
| Parameters | 9,221 |
| Algorithm | PPO (Actor-Critic) |
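The average reward in the table can be estimated with a loop like the one below (reusing the `actor` loaded in the usage example). Whether the reported 261.80 used greedy or sampled actions, and over how many episodes, is not stated here, so the greedy policy and 100 episodes are assumptions:

```python
import gymnasium as gym
import torch

def evaluate(actor, n_episodes=100):
    """Run greedy rollouts and return the mean episodic reward."""
    env = gym.make("LunarLander-v3")
    rewards = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        done, episode_reward = False, 0.0
        while not done:
            with torch.no_grad():
                probs = actor(torch.FloatTensor(state).unsqueeze(0))
            state, reward, terminated, truncated, _ = env.step(probs.argmax().item())
            episode_reward += reward
            done = terminated or truncated
        rewards.append(episode_reward)
    env.close()
    return sum(rewards) / len(rewards)

print(f"Average reward: {evaluate(actor):.2f}")
```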
PPO is one of the most popular RL algorithms because it is simple to implement, trains stably thanks to its clipped surrogate objective, and performs well across a wide range of tasks; a minimal sketch of that objective follows.
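For reference, here is an illustrative PyTorch sketch of the clipped surrogate loss from the PPO paper (this is not the code in `04_ppo_from_scratch.py`; the 0.2 clip range is the paper's suggested default):

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective of Schulman et al. (2017), written as a loss to minimize."""
    # Probability ratio r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the elementwise minimum; negate to get a loss for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```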
If you use this model, please reference:
@misc{ppo-lunarlander-scratch,
  author = {aryannzzz},
  title = {PPO LunarLander Model from Scratch},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/aryannzzz/ppo-lunarlander-scratch}
}
Environment provided by Gymnasium (formerly OpenAI Gym). PPO algorithm originally introduced in:
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017).
Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.