stable-baselines3
Production-ready reinforcement learning algorithms (PPO, SAC, DQN, TD3, DDPG, A2C) with a scikit-learn-like API — use for standard RL experiments, quick prototyping, and well-documented implementations with single-agent Gymnasium environments.
Train RL agents with production-ready algorithm implementations
Trigger phrases
Phrases that activate this skill when typed to Claude Code:
"train an RL agent", "reinforcement learning with PPO", "SAC agent", "stable baselines", "Gymnasium environment"
What it does
stable-baselines3 is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into a Stable Baselines3 expert covering the full algorithm suite — PPO, SAC, DQN, TD3, DDPG, and A2C — with the clean scikit-learn-like API (model.learn(), model.predict()), custom policy networks, environment vectorization, callback setup (EvalCallback, CheckpointCallback), and W&B/TensorBoard logging.
A session produces complete RL training code: environment setup, algorithm instantiation with hyperparameters, training loop, evaluation, and model serialization — ready to run on a standard Gymnasium-compatible environment.
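A minimal sketch of that API shape, using CartPole-v1 as a stand-in environment; the hyperparameters and file paths are illustrative, not the skill's defaults:

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Any Gymnasium-compatible environment works; CartPole-v1 is a stand-in here.
env = gym.make("CartPole-v1")

# Instantiate, train, and serialize with the scikit-learn-like API.
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log="./tb_logs/")
model.learn(total_timesteps=50_000)
model.save("ppo_cartpole")

# Inference: predict() returns the action and the recurrent state (unused here).
obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)
```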
When to use it
Reach for it when:
- You need a reliable, well-documented implementation of a standard RL algorithm for benchmarking or applied RL research
- You’re prototyping a new environment and want a working baseline agent quickly to verify the environment is learnable
- You’re teaching RL concepts and want clean, readable algorithm implementations with good diagnostics
When not to reach for it:
- High-performance parallel training, multi-agent systems, or custom vectorized environments: the upstream skill documentation points to pufferlib for these use cases
- Model-based RL or offline RL: SB3 covers online, model-free algorithms only
Install
Copy the SKILL.md from K-Dense AI’s stable-baselines3 folder into .claude/skills/stable-baselines3/ in your project.
Trigger phrases: “train an RL agent”, “reinforcement learning with PPO”, “SAC agent”, “stable baselines”, “Gymnasium environment”.
What a session looks like
A typical session has three phases:
- Environment and algorithm selection. Describe the environment (Gymnasium ID or custom env description) and the task type (continuous action space → SAC or TD3; discrete → DQN or PPO). Claude selects the algorithm and proposes initial hyperparameters.
- Training setup. Claude writes the training script with vectorized environments (make_vec_env), callback configuration (EvalCallback for performance tracking, CheckpointCallback for intermediate saves), and a total timestep budget.
- Evaluation and logging. Training runs with TensorBoard logging by default; Claude adds a post-training evaluation loop that runs the policy in the environment and reports episode reward statistics. A sketch of the resulting script follows this list.
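Put together, the generated script usually looks something like the sketch below; the environment ID, callback frequencies, and timestep budget are illustrative assumptions rather than values fixed by the skill:

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback, CheckpointCallback
from stable_baselines3.common.evaluation import evaluate_policy

# Vectorized training environments (4 parallel copies of the same env).
vec_env = make_vec_env("CartPole-v1", n_envs=4)

# Separate environment for periodic evaluation and post-training stats.
eval_env = gym.make("CartPole-v1")
eval_cb = EvalCallback(eval_env, best_model_save_path="./best_model/",
                       eval_freq=5_000, n_eval_episodes=10)
ckpt_cb = CheckpointCallback(save_freq=10_000, save_path="./checkpoints/")

model = PPO("MlpPolicy", vec_env, tensorboard_log="./tb_logs/")
model.learn(total_timesteps=200_000, callback=[eval_cb, ckpt_cb])

# Post-training evaluation: mean and standard deviation of episode reward.
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=20)
print(f"mean reward: {mean_reward:.1f} +/- {std_reward:.1f}")
```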
Receipts
Where it works well:
- Classic control and Atari environments where SB3’s default hyperparameters are well-tuned and produce competitive results out of the box
- Environment debugging — a quick SB3 run confirms whether a custom Gymnasium environment is correctly implemented before investing in more complex training setups
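For the environment-debugging case, a sanity check plus a short training run is usually enough. In this sketch, MyCustomEnv and its import path are placeholders for your own Gymnasium environment class:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

from my_project.envs import MyCustomEnv  # placeholder: your own env class

env = MyCustomEnv()

# Verifies observation/action spaces, reset/step signatures, and dtypes.
check_env(env, warn=True)

# A short PPO run: rising episode reward suggests the env is learnable.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=20_000)
```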
Where it backfires:
- Very compute-intensive environments where SB3’s Python-based environment stepping creates a CPU bottleneck — vectorized envs help but don’t fully close the gap with compiled environments
- Multi-agent environments require wrappers (e.g., PettingZoo → SB3 compatibility shim) that add complexity
Pattern that works: start with PPO on any new problem — it’s robust across action space types, requires less hyperparameter tuning than SAC or TD3, and is a reliable first baseline before switching to more sample-efficient off-policy algorithms.
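And a sketch of the second half of that pattern, swapping the PPO baseline for an off-policy algorithm on a continuous-control task; Pendulum-v1 and the hyperparameters are illustrative choices, not recommendations from the skill:

```python
import gymnasium as gym
from stable_baselines3 import SAC

# Pendulum-v1 stands in for any continuous-action environment.
env = gym.make("Pendulum-v1")

# SAC is off-policy and more sample-efficient than PPO on continuous actions,
# at the cost of more knobs (replay buffer size, warm-up steps, entropy tuning).
model = SAC("MlpPolicy", env, buffer_size=100_000, learning_starts=1_000, verbose=1)
model.learn(total_timesteps=30_000)
```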
Source and attribution
Originally authored by K-Dense Inc. The canonical SKILL.md lives in the stable-baselines3 folder of their public scientific-agent-skills repository.
License: MIT. Install, adapt, and redistribute with attribution preserved.
This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.