pytorch-lightning
Deep learning framework that organizes PyTorch code into LightningModules with Trainers for multi-GPU/TPU, data pipelines, callbacks, logging (W&B, TensorBoard), and distributed training (DDP, FSDP, DeepSpeed) for scalable neural network training.
Organize PyTorch training with multi-GPU and experiment logging
Trigger phrases
Phrases that activate this skill when typed to Claude Code:
"pytorch lightning training", "LightningModule", "multi-GPU training", "distributed training with PyTorch", "organize my training loop"
What it does
pytorch-lightning is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into a PyTorch Lightning expert that structures training code into LightningModule and LightningDataModule classes, configures Trainer for multi-GPU (DDP), TPU, and distributed strategies (FSDP, DeepSpeed), and sets up callbacks (early stopping, model checkpointing) and loggers (Weights & Biases, TensorBoard).
A session produces production-ready training code organized around Lightning's abstractions (the forward, training_step, validation_step, and configure_optimizers methods), with hardware-agnostic model code that scales from a single GPU to a cluster unchanged. The sketch below shows the shape.
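A minimal sketch of that structure, assuming the Lightning 2.x `lightning.pytorch` namespace (older installs import `pytorch_lightning` instead); the toy model and hyperparameters are placeholders:

```python
import torch
from torch import nn
import lightning.pytorch as pl


class LitClassifier(pl.LightningModule):
    """Model logic lives here; hardware and loop logic live in the Trainer."""

    def __init__(self, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()  # records lr for loggers and checkpoints
        self.model = nn.Sequential(
            nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.model(x.flatten(1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", nn.functional.cross_entropy(self(x), y), prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
```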
When to use it
Reach for it when:
- You have a PyTorch model and want to add multi-GPU training, checkpointing, and experiment logging without boilerplate
- You’re scaling a training pipeline from a single machine to a distributed cluster and want strategy-agnostic model code
- You want reproducible experiments with automatic logging of hyperparameters, metrics, and model artifacts
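On the reproducibility point, Lightning ships a one-call seeding helper; a minimal sketch, assuming Lightning 2.x:

```python
import lightning.pytorch as pl

# Seeds Python, NumPy, and torch (and, with workers=True, DataLoader workers).
pl.seed_everything(42, workers=True)

# deterministic=True asks torch to prefer deterministic kernels where available.
trainer = pl.Trainer(deterministic=True)
```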
When not to reach for it:
- Simple single-GPU research experiments where the Lightning overhead adds more structure than value — vanilla PyTorch is fine
- Hugging Face model fine-tuning, where the Transformers `Trainer` API is better integrated with the Hub ecosystem
Install
Copy the SKILL.md from K-Dense AI’s pytorch-lightning folder into .claude/skills/pytorch-lightning/ in your project.
Trigger phrases: “pytorch lightning training”, “LightningModule”, “multi-GPU training”, “distributed training with PyTorch”.
What a session looks like
A typical session has three phases:
- Model structure translation. Claude takes an existing PyTorch `nn.Module` (or a description of one) and reorganizes it into a `LightningModule` with the required hooks, plus a `LightningDataModule` for the dataset.
- Trainer configuration. Claude sets up the `Trainer` with the appropriate strategy (`ddp` for multi-GPU, `deepspeed` for large models), precision (`bf16-mixed`), and callbacks (`EarlyStopping`, `ModelCheckpoint` with the right metric).
- Logger integration. W&B or TensorBoard logging is added with hyperparameter logging and metric tracking, so every experiment run has a complete record. The sketch after this list shows a typical result of the last two phases.
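A hedged sketch of that Trainer setup, reusing `LitClassifier` from the earlier sketch; the W&B project name, monitored metric, and device count are placeholders:

```python
import lightning.pytorch as pl
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint
from lightning.pytorch.loggers import WandbLogger

# Placeholder project; log_model=True versions checkpoints as W&B artifacts.
logger = WandbLogger(project="my-experiments", log_model=True)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                # drop to 1 without touching the model code
    strategy="ddp",           # or "fsdp" / "deepspeed_stage_2" for large models
    precision="bf16-mixed",
    max_epochs=50,
    logger=logger,
    callbacks=[
        EarlyStopping(monitor="val_loss", patience=5),
        ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=3),
    ],
)

# `dm` stands in for a LightningDataModule wrapping your dataset.
trainer.fit(LitClassifier(lr=1e-3), datamodule=dm)
```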
Receipts
Where it works well:
- Converting ad-hoc PyTorch training loops to Lightning for multi-GPU scale — the structural refactor is mechanical and Claude handles it cleanly in one pass
- Experiment tracking setup with W&B — the Logger integration covers hyperparameter capture, metric curves, and model artifact versioning
Where it backfires:
- Very custom training loops with unusual gradient manipulation (gradient surgery, manual optimizer stepping) can be awkward to fit into Lightning's hooks (see the escape-hatch sketch after this list)
- Debugging distributed training issues requires understanding of the underlying DDP/NCCL mechanism that Lightning abstracts but doesn’t fully hide
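For the custom-loop case, Lightning's escape hatch is manual optimization. A sketch of the shape it takes; the clipping line is a stand-in for whatever gradient surgery your loop actually does:

```python
import torch
from torch import nn
import lightning.pytorch as pl


class ManualOptModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt out of the automatic loop
        self.model = nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()
        x, y = batch
        loss = nn.functional.mse_loss(self.model(x), y)
        self.manual_backward(loss)  # plays nicely with AMP and DDP wrappers
        # Custom gradient manipulation goes here; clipping is a placeholder.
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
        opt.step()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```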
Pattern that works: compute every metric you want tracked inside training_step and log them together with a single self.log_dict() call; Lightning handles the per-step and per-epoch aggregation, eliminating a separate self.log() call for each metric.
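A minimal sketch of that pattern, written as a `training_step` that slots into a LightningModule like the one sketched earlier; the metric names are placeholders:

```python
def training_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = nn.functional.cross_entropy(logits, y)
    acc = (logits.argmax(dim=1) == y).float().mean()
    # One call logs the whole dict; Lightning aggregates per step and per epoch.
    self.log_dict({"train_loss": loss, "train_acc": acc}, on_step=True, on_epoch=True)
    return loss
```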
Source and attribution
Originally authored by K-Dense Inc. The canonical SKILL.md lives in the pytorch-lightning folder of their public scientific-agent-skills repository.
License: Apache-2.0. Install, adapt, and redistribute with attribution preserved.
This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.