PPO Agent playing LunarLander-v2

This is a trained model of a PPO agent playing LunarLander-v2 using the stable-baselines3 library.

Model Description

This model is a Proximal Policy Optimization (PPO) agent trained to solve the LunarLander-v2 environment from Gymnasium (the maintained successor to OpenAI Gym). The agent learns to land a lunar module by controlling its main engine and side thrusters while managing fuel consumption and landing precision.

Model Details

  • Algorithm: Proximal Policy Optimization (PPO)
  • Policy Network: Multi-Layer Perceptron (MlpPolicy)
  • Framework: Stable-Baselines3
  • Environment: LunarLander-v2 (Gymnasium)

Hyperparameters

The model was trained with the following PPO hyperparameters:

Parameter                        Value
Policy                           MlpPolicy
n_steps                          1024
batch_size                       64
n_epochs                         4
gamma (discount factor)          0.999
gae_lambda                       0.98
ent_coef (entropy coefficient)   0.01
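
Since Stable-Baselines3 stores these settings as attributes on the model object, they can be double-checked on the downloaded checkpoint. A minimal sketch, assuming the zip file from this repository is already in the working directory (see the Usage section below for downloading it):

from stable_baselines3 import PPO

# Load the checkpoint and print the hyperparameters it was saved with
model = PPO.load("ppo-LunarLander-v2.zip")
print(model.n_steps)     # 1024
print(model.batch_size)  # 64
print(model.n_epochs)    # 4
print(model.gamma)       # 0.999
print(model.gae_lambda)  # 0.98
print(model.ent_coef)    # 0.01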

Performance

Evaluation Results:

  • Mean Reward: 262.07 ± 21.35 (mean ± standard deviation over evaluation episodes; see the evaluation sketch below)

This performance indicates the agent has successfully learned to land the lunar module, as:

  • Rewards > 200 typically indicate successful landings
  • The positive mean reward shows consistent success across evaluation episodes
  • Low standard deviation suggests stable, reliable performance
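
The reported numbers can be reproduced with the evaluation helper that ships with Stable-Baselines3. A minimal sketch, assuming the model has been loaded as in the Usage section below and that 10 evaluation episodes are used (the episode count behind the reported figures is an assumption):

import gymnasium as gym
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Monitor records per-episode statistics; evaluate_policy warns if it is missing
eval_env = Monitor(gym.make("LunarLander-v2"))

# Deterministic evaluation, matching how the reported metrics were obtained
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")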

Environment Details

LunarLander-v2 Environment:

  • Observation Space: 8-dimensional continuous space (position, velocity, angle, angular velocity, leg contact)
  • Action Space: 4 discrete actions (do nothing, fire left engine, fire main engine, fire right engine)
  • Success Criteria: Land between the flags with minimal fuel consumption and impact velocity
  • Reward Range: Approximately -∞ to +∞ (typically -200 to +300 for meaningful episodes)
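
Both spaces can be inspected directly on a Gymnasium environment instance. A minimal sketch (the LunarLander environments require the Box2D extra, e.g. pip install "gymnasium[box2d]"):

import gymnasium as gym

env = gym.make("LunarLander-v2")
print(env.observation_space.shape)  # (8,) - position, velocity, angle, angular velocity, leg contacts
print(env.action_space.n)           # 4 - do nothing, left engine, main engine, right engine

obs, info = env.reset(seed=0)
print(obs.shape)  # (8,)
env.close()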

Training Information

  • Training Framework: Stable-Baselines3
  • Environment Wrapper: Monitor (for episode statistics tracking)
  • Vectorized Environment: DummyVecEnv
  • Render Mode: rgb_array (for video recording)
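
The wrappers listed above correspond to a setup along the following lines. This is a sketch of the assumed configuration, not the exact training script:

import gymnasium as gym
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import DummyVecEnv

def make_env():
    # rgb_array rendering so frames can be captured for a replay video
    env = gym.make("LunarLander-v2", render_mode="rgb_array")
    # Monitor tracks per-episode reward and length statistics
    return Monitor(env)

# DummyVecEnv runs the single environment in-process behind the vectorized API
vec_env = DummyVecEnv([make_env])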

Usage

import gymnasium as gym
from stable_baselines3 import PPO
from huggingface_sb3 import load_from_hub

# Download the checkpoint from the Hugging Face Hub
# (load_from_hub returns a local file path, not a model object)
checkpoint = load_from_hub(
    repo_id="Adilbai/ppo-LunarLander-v2",
    filename="ppo-LunarLander-v2.zip"
)
model = PPO.load(checkpoint)

# Create environment
env = gym.make("LunarLander-v2", render_mode="human")

# Run the trained agent
obs, info = env.reset()
for _ in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()

Model Architecture

The PPO agent uses a Multi-Layer Perceptron (MLP) policy network that:

  • Takes 8-dimensional observations as input
  • Outputs action probabilities for 4 discrete actions
  • Includes separate policy and value function heads
  • Uses the Stable-Baselines3 MlpPolicy defaults: two hidden layers of 64 units with Tanh activations (no policy_kwargs override appears in the listed hyperparameters)
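
The exact network can be printed from a loaded model; with the default MlpPolicy this shows the feature extractor plus separate actor and critic MLPs. A minimal sketch, assuming the model has been loaded as in the Usage section:

# Print the full ActorCriticPolicy module
print(model.policy)

# The output heads are also accessible individually
print(model.policy.action_net)  # linear layer producing 4 action logits
print(model.policy.value_net)   # linear layer producing the scalar value estimate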

Limitations and Considerations

  • Environment Specific: This model is specifically trained for LunarLander-v2 and may not generalize to other environments
  • Deterministic Evaluation: Performance metrics are based on deterministic policy evaluation
  • Sample Efficiency: PPO is an on-policy algorithm, so it is generally less sample-efficient than off-policy methods and may need a large number of environment steps to reach this level of performance

Training Reproducibility

To reproduce this model's training:

from stable_baselines3 import PPO
import gymnasium as gym

env = gym.make("LunarLander-v2")

model = PPO(
    policy='MlpPolicy',
    env=env,
    n_steps=1024,
    batch_size=64,
    n_epochs=4,
    gamma=0.999,
    gae_lambda=0.98,
    ent_coef=0.01,
    verbose=1
)

model.learn(total_timesteps=500000)  # Adjust based on your training duration
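
After training, the policy can be saved in the same zip format as the checkpoint hosted in this repository (the output filename below is an assumption):

# Saves a zip archive, e.g. ppo-LunarLander-v2.zip
model.save("ppo-LunarLander-v2")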

Citation

If you use this model, please cite:

@misc{ppo_lunarlander_2024,
  title={PPO Agent for LunarLander-v2},
  author={[Your Name]},
  year={2024},
  publisher={Hugging Face Hub},
  url={https://huggingface.co/Adilbai/ppo-LunarLander-v2}
}
