stable-baselines3
Production-ready reinforcement learning algorithms (PPO, SAC, DQN, TD3, DDPG, A2C) with a scikit-learn-like API — use for standard RL experiments, quick prototyping, and well-documented implementations with single-agent Gymnasium environments.
Train RL agents with production-ready algorithm implementations
Trigger phrases
Phrases that activate this skill when typed to Claude Code:
"train an RL agent", "reinforcement learning with PPO", "SAC agent", "stable baselines", "Gymnasium environment"
What it does
stable-baselines3 is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into a Stable Baselines3 expert covering the full algorithm suite — PPO, SAC, DQN, TD3, DDPG, and A2C — with the clean scikit-learn-like API (model.learn(), model.predict()), custom policy networks, environment vectorization, callback setup (EvalCallback, CheckpointCallback), and W&B/TensorBoard logging.
A session produces complete RL training code: environment setup, algorithm instantiation with hyperparameters, training loop, evaluation, and model serialization — ready to run on a standard Gymnasium-compatible environment.
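A minimal sketch of that API shape, using CartPole-v1 as a stand-in environment; the hyperparameters and file paths are illustrative, not the skill's defaults:

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Any Gymnasium-compatible environment works; CartPole-v1 is a stand-in here.
env = gym.make("CartPole-v1")

# Instantiate, train, and serialize with the scikit-learn-like API.
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./tb_logs/")
model.learn(total_timesteps=50_000)
model.save("ppo_cartpole")

# Inference: predict() returns the action and the recurrent state (unused here).
obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)
```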
When to use it
Reach for it when:
- You need a reliable, well-documented implementation of a standard RL algorithm for benchmarking or applied RL research
- You’re prototyping a new environment and want a working baseline agent quickly to verify the environment is learnable
- You’re teaching RL concepts and want clean, readable algorithm implementations with good diagnostics
When not to reach for it:
- High-performance parallel training, multi-agent systems, or custom vectorized environments: the upstream skill documentation points to pufferlib for these use cases
- Model-based RL or offline RL: SB3 covers online, model-free algorithms only
Install
Copy the SKILL.md from K-Dense AI’s stable-baselines3 folder into .claude/skills/stable-baselines3/ in your project.
Trigger phrases: “train an RL agent”, “reinforcement learning with PPO”, “SAC agent”, “stable baselines”, “Gymnasium environment”.
What a session looks like
A typical session has three phases:
- Environment and algorithm selection. Describe the environment (Gymnasium ID or custom env description) and the task type (continuous action space → SAC or TD3; discrete → DQN or PPO). Claude selects the algorithm and proposes initial hyperparameters.
- Training setup. Claude writes the training script with vectorized environments (make_vec_env), callback configuration (EvalCallback for performance tracking, CheckpointCallback for intermediate saves), and a total timestep budget.
- Evaluation and logging. Training runs with TensorBoard logging by default; Claude adds a post-training evaluation loop that runs the policy in the environment and reports episode reward statistics. A sketch of the resulting script follows this list.
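Put together, the generated script usually looks something like the sketch below; the environment ID, callback frequencies, and timestep budget are illustrative assumptions rather than values fixed by the skill:

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback
from stable_baselines3.common.evaluation import evaluate_policy

# Vectorized training environments (4 parallel copies of the same env).
vec_env = make_vec_env("CartPole-v1", n_envs=4)

# Separate environment for periodic evaluation and post-training stats.
eval_env = gym.make("CartPole-v1")
eval_cb = EvalCallback(eval_env, best_model_save_path="./best_model/",
                       eval_freq=5_000, n_eval_episodes=10)
ckpt_cb = CheckpointCallback(save_freq=10_000, save_path="./checkpoints/")

model = PPO("MlpPolicy", vec_env, tensorboard_log="./tb_logs/")
model.learn(total_timesteps=200_000, callback=[eval_cb, ckpt_cb])

# Post-training evaluation: mean and standard deviation of episode reward.
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=20)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
```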
Receipts
Where it works well:
- Classic control and Atari environments where SB3’s default hyperparameters are well-tuned and produce competitive results out of the box
- Environment debugging — a quick SB3 run confirms whether a custom Gymnasium environment is correctly implemented before investing in more complex training setups
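For the environment-debugging case, a sanity check plus a short training run is usually enough. In this sketch, MyCustomEnv and its import path are placeholders for your own Gymnasium environment class:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

from my_project.envs import MyCustomEnv  # placeholder: your own env class

env = MyCustomEnv()

# Verifies observation/action spaces, reset/step signatures, and dtypes.
check_env(env, warn=True)

# A short PPO run: rising episode reward suggests the env is learnable.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=20_000)
```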
Where it backfires:
- Very compute-intensive environments where SB3’s Python-based environment stepping creates a CPU bottleneck — vectorized envs help but don’t fully close the gap with compiled environments
- Multi-agent environments require wrappers (e.g., PettingZoo → SB3 compatibility shim) that add complexity
Pattern that works: start with PPO on any new problem — it’s robust across action space types, requires less hyperparameter tuning than SAC or TD3, and is a reliable first baseline before switching to more sample-efficient off-policy algorithms.
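And a sketch of the second half of that pattern, swapping the PPO baseline for an off-policy algorithm on a continuous-control task; Pendulum-v1 and the hyperparameters are illustrative choices, not recommendations from the skill:

```python
import gymnasium as gym
from stable_baselines3 import SAC

# Pendulum-v1 stands in for any continuous-action environment.
env = gym.make("Pendulum-v1")

# SAC is off-policy and more sample-efficient than PPO on continuous actions,
# at the cost of more knobs (replay buffer size, warm-up steps, entropy tuning).
model = SAC("MlpPolicy", env, buffer_size=100_000, learning_starts=1_000, verbose=1)
model.learn(total_timesteps=30_000)
```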
Source and attribution
Originally authored by K-Dense Inc. The canonical SKILL.md lives in the stable-baselines3 folder of their public scientific-agent-skills repository.
License: MIT. Install, adapt, and redistribute with attribution preserved.
This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.