# stable-baselines3

> Production-ready reinforcement learning algorithms (PPO, SAC, DQN, TD3, DDPG, A2C) with a scikit-learn-like API — use for standard RL experiments, quick prototyping, and well-documented implementations with single-agent Gymnasium environments.

**Use case**: Train RL agents with production-ready algorithm implementations

**Canonical URL**: https://agentcookbooks.com/skills/stable-baselines3/

**Topics**: claude-code, skills, science, ml-libraries

**Trigger phrases**: "train an RL agent", "reinforcement learning with PPO", "SAC agent", "stable baselines", "Gymnasium environment"

**Source**: [K-Dense AI](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/stable-baselines3)

**License**: MIT

---

## What it does

`stable-baselines3` is a Claude Code skill from K-Dense AI's [scientific-agent-skills repo](https://github.com/K-Dense-AI/scientific-agent-skills). It turns Claude into a Stable Baselines3 expert covering the full algorithm suite — PPO, SAC, DQN, TD3, DDPG, and A2C — with the clean scikit-learn-like API (`model.learn()`, `model.predict()`), custom policy networks, environment vectorization, callback setup (EvalCallback, CheckpointCallback), and W&B/TensorBoard logging.

A session produces complete RL training code: environment setup, algorithm instantiation with hyperparameters, training loop, evaluation, and model serialization — ready to run on a standard Gymnasium-compatible environment.
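
As a concrete anchor, here is a minimal sketch of that API on Gymnasium's CartPole-v1 (the environment, hyperparameters, and timestep counts are illustrative choices, not recommendations from the skill):

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Environment setup and algorithm instantiation
env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, learning_rate=3e-4, verbose=1)

# Train and serialize
model.learn(total_timesteps=50_000)
model.save("ppo_cartpole")

# Roll out the trained policy to sanity-check behavior
obs, _ = env.reset()
for _ in range(1_000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```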

## When to use it

Reach for it when:

- You need a reliable, well-documented implementation of a standard RL algorithm for benchmarking or applied RL research
- You're prototyping a new environment and want a working baseline agent quickly to verify the environment is learnable
- You're teaching RL concepts and want clean, readable algorithm implementations with good diagnostics

When *not* to reach for it:

- High-performance parallel training, multi-agent systems, or custom vectorized environments — the upstream skill documentation points to `pufferlib` for these use cases
- Model-based RL or offline RL — SB3 covers online, model-free algorithms only

## Install

Copy the `SKILL.md` from K-Dense AI's [stable-baselines3 folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/stable-baselines3) into `.claude/skills/stable-baselines3/` in your project.

Once the file is in place, the skill triggers on phrases like "train an RL agent", "reinforcement learning with PPO", "SAC agent", "stable baselines", or "Gymnasium environment".

## What a session looks like

A typical session has three phases, sketched in code after the list:

1. **Environment and algorithm selection.** Describe the environment (Gymnasium ID or custom env description) and the task type (continuous action space → SAC or TD3; discrete → DQN or PPO). Claude selects the algorithm and proposes initial hyperparameters.
2. **Training setup.** Claude writes the training script with vectorized environments (`make_vec_env`), callback configuration (EvalCallback for performance tracking, CheckpointCallback for intermediate saves), and total timestep budget.
3. **Evaluation and logging.** Training runs with TensorBoard logging by default; Claude adds a post-training evaluation loop that runs the policy in the environment and produces episode reward statistics.
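
Putting the three phases together, a minimal sketch (environment, timestep budget, and output paths are placeholder choices, not tuned settings):

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import CheckpointCallback, EvalCallback
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

# Phases 1-2: vectorized training env, algorithm, callbacks, timestep budget
vec_env = make_vec_env("CartPole-v1", n_envs=4)
eval_env = Monitor(gym.make("CartPole-v1"))

callbacks = [
    EvalCallback(eval_env, eval_freq=5_000, best_model_save_path="./best_model/"),
    CheckpointCallback(save_freq=10_000, save_path="./checkpoints/"),
]

model = PPO("MlpPolicy", vec_env, tensorboard_log="./tb_logs/", verbose=1)
model.learn(total_timesteps=200_000, callback=callbacks)

# Phase 3: post-training evaluation with episode reward statistics
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=20)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
```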

## Receipts

**Where it works well:**
- Classic control and Atari environments where SB3's default hyperparameters are well-tuned and produce competitive results out of the box
- Environment debugging — a quick SB3 run confirms whether a custom Gymnasium environment is correctly implemented before investing in more complex training setups (see the sketch below)
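
For the environment-debugging case, the usual pattern is `check_env` followed by a short training run. A sketch, where `ToyEnv` is a stand-in for your own `gymnasium.Env` subclass:

```python
import gymnasium as gym
import numpy as np
from gymnasium import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

class ToyEnv(gym.Env):
    """Placeholder custom environment used only to illustrate the check."""
    def __init__(self):
        self.observation_space = spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)
        self._steps = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._steps = 0
        return np.zeros(3, dtype=np.float32), {}

    def step(self, action):
        self._steps += 1
        obs = self.observation_space.sample()
        reward = float(action)              # trivial reward, just for the demo
        terminated = self._steps >= 50
        return obs, reward, terminated, False, {}

env = ToyEnv()
check_env(env)  # raises or warns if the spaces, reset(), or step() contract is malformed

# A short PPO run then confirms the environment actually produces a learnable signal
PPO("MlpPolicy", env, verbose=1).learn(total_timesteps=10_000)
```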

**Where it backfires:**
- Very compute-intensive environments where SB3's Python-based environment stepping creates a CPU bottleneck — vectorized envs help but don't fully close the gap with compiled environments (see the `SubprocVecEnv` sketch after this list)
- Multi-agent environments require wrappers (e.g., PettingZoo → SB3 compatibility shim) that add complexity
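
For the CPU-bottleneck point, the usual mitigation is swapping the default in-process `DummyVecEnv` for `SubprocVecEnv`, so each environment copy steps in its own process. A sketch, with `n_envs` as an illustrative choice:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv

if __name__ == "__main__":  # subprocess workers need an import-safe entry point
    # One environment copy per process instead of the default in-process DummyVecEnv
    vec_env = make_vec_env("CartPole-v1", n_envs=8, vec_env_cls=SubprocVecEnv)
    model = PPO("MlpPolicy", vec_env, verbose=1)
    model.learn(total_timesteps=100_000)
```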

**Pattern that works:** start with PPO on any new problem — it's robust across action space types, requires less hyperparameter tuning than SAC or TD3, and is a reliable first baseline before switching to more sample-efficient off-policy algorithms.
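
Because the API is shared across algorithms, graduating from the PPO baseline to an off-policy algorithm is typically a one-class change. A sketch on a continuous-control task (Pendulum-v1 chosen for illustration; SAC requires a continuous, i.e. Box, action space):

```python
from stable_baselines3 import PPO, SAC

# First baseline: PPO works with both discrete and continuous action spaces
model = PPO("MlpPolicy", "Pendulum-v1", verbose=1)
model.learn(total_timesteps=100_000)

# Later: swap in the more sample-efficient SAC with the same call pattern
model = SAC("MlpPolicy", "Pendulum-v1", verbose=1)
model.learn(total_timesteps=100_000)
```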

## Source and attribution

Originally authored by [K-Dense Inc.](https://github.com/K-Dense-AI). The canonical SKILL.md lives in the [`stable-baselines3` folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/stable-baselines3) of their public scientific-agent-skills repository.

License: MIT. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner's perspective. For the formal spec and any updates, defer to the source repo.