# pytorch-lightning

> Deep learning framework that organizes PyTorch code into LightningModules with Trainers for multi-GPU/TPU, data pipelines, callbacks, logging (W&B, TensorBoard), and distributed training (DDP, FSDP, DeepSpeed) for scalable neural network training.

**Use case**: Organize PyTorch training with multi-GPU and experiment logging

**Canonical URL**: https://agentcookbooks.com/skills/pytorch-lightning/

**Topics**: claude-code, skills, science, ml-libraries

**Trigger phrases**: "pytorch lightning training", "LightningModule", "multi-GPU training", "distributed training with PyTorch", "organize my training loop"

**Source**: [K-Dense AI](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/pytorch-lightning)

**License**: Apache-2.0

---

## What it does

`pytorch-lightning` is a Claude Code skill from K-Dense AI's [scientific-agent-skills repo](https://github.com/K-Dense-AI/scientific-agent-skills). It turns Claude into a PyTorch Lightning expert that structures training code into `LightningModule` and `LightningDataModule` classes, configures `Trainer` for multi-GPU (DDP), TPU, and distributed strategies (FSDP, DeepSpeed), and sets up callbacks (early stopping, model checkpointing) and loggers (Weights & Biases, TensorBoard).

A session produces production-ready training code organized around Lightning's abstractions (the `forward`, `training_step`, `validation_step`, and `configure_optimizers` hooks). Because the resulting code is hardware-agnostic, the same model scales from a single GPU to a cluster without changes.

## When to use it

Reach for it when:

- You have a PyTorch model and want to add multi-GPU training, checkpointing, and experiment logging without boilerplate
- You're scaling a training pipeline from a single machine to a distributed cluster and want strategy-agnostic model code
- You want reproducible experiments with automatic logging of hyperparameters, metrics, and model artifacts

When *not* to reach for it:

- Simple single-GPU research experiments where the Lightning overhead adds more structure than value — vanilla PyTorch is fine
- Hugging Face model fine-tuning — the Transformers `Trainer` API is better integrated with the Hub ecosystem

## Install

Copy the `SKILL.md` from K-Dense AI's [pytorch-lightning folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/pytorch-lightning) into `.claude/skills/pytorch-lightning/` in your project.


## What a session looks like

A typical session has three phases:

1. **Model structure translation.** Claude takes an existing PyTorch `nn.Module` (or description of one) and reorganizes it into a `LightningModule` with the required hooks, plus a `LightningDataModule` for the dataset.
2. **Trainer configuration.** Claude sets up the `Trainer` with the appropriate strategy (`ddp` for multi-GPU, `deepspeed` for large models), precision (`bf16-mixed`), and callbacks (EarlyStopping, ModelCheckpoint with the right metric).
3. **Logger integration.** W&B or TensorBoard logging is added with hyperparameter logging and metric tracking, so every experiment run has a complete record.

## Receipts

**Where it works well:**
- Converting ad-hoc PyTorch training loops to Lightning for multi-GPU scale — the structural refactor is mechanical and Claude handles it cleanly in one pass
- Experiment tracking setup with W&B — the Logger integration covers hyperparameter capture, metric curves, and model artifact versioning

**Where it backfires:**
- Very custom training loops with unusual gradient manipulation (gradient surgery, manual optimizer stepping) can be awkward to fit into Lightning's hooks
- Debugging distributed training issues requires understanding of the underlying DDP/NCCL mechanism that Lightning abstracts but doesn't fully hide

**Pattern that works:** compute a metrics dict in `training_step` and pass it to `self.log_dict()`. Lightning aggregates across batches and dispatches every entry to the attached logger, avoiding a separate `self.log()` call per metric. (Simply returning extra keys from `training_step` does not log them; only the `loss` key is consumed for backpropagation.)

## Source and attribution

Originally authored by [K-Dense Inc.](https://github.com/K-Dense-AI). The canonical SKILL.md lives in the [`pytorch-lightning` folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/pytorch-lightning) of their public scientific-agent-skills repository.

License: Apache-2.0. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner's perspective. For the formal spec and any updates, defer to the source repo.