pytorch-lightning

Deep learning framework that organizes PyTorch code into LightningModules and a Trainer that handles multi-GPU/TPU execution, data pipelines, callbacks, logging (W&B, TensorBoard), and distributed training (DDP, FSDP, DeepSpeed) for scalable neural network training.

Organize PyTorch training with multi-GPU and experiment logging

Source K-Dense AI
License Apache-2.0
First documented

Trigger phrases

Phrases that activate this skill when typed to Claude Code:

  • pytorch lightning training
  • LightningModule
  • multi-GPU training
  • distributed training with PyTorch
  • organize my training loop

What it does

pytorch-lightning is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into a PyTorch Lightning expert that structures training code into LightningModule and LightningDataModule classes, configures Trainer for multi-GPU (DDP), TPU, and distributed strategies (FSDP, DeepSpeed), and sets up callbacks (early stopping, model checkpointing) and loggers (Weights & Biases, TensorBoard).

A session produces production-ready training code organized around Lightning’s abstractions — the forward, training_step, validation_step, and configure_optimizers methods — with hardware-agnostic code that scales from a single GPU to a cluster without changes to the model code.

When to use it

Reach for it when:

  • You have a PyTorch model and want to add multi-GPU training, checkpointing, and experiment logging without boilerplate
  • You’re scaling a training pipeline from a single machine to a distributed cluster and want strategy-agnostic model code
  • You want reproducible experiments with automatic logging of hyperparameters, metrics, and model artifacts

When not to reach for it:

  • Simple single-GPU research experiments where the Lightning overhead adds more structure than value — vanilla PyTorch is fine
  • Hugging Face model fine-tuning — the Transformers Trainer API is better integrated with the Hub ecosystem

Install

Copy the SKILL.md from K-Dense AI’s pytorch-lightning folder into .claude/skills/pytorch-lightning/ in your project.


What a session looks like

A typical session has three phases:

  1. Model structure translation. Claude takes an existing PyTorch nn.Module (or description of one) and reorganizes it into a LightningModule with the required hooks, plus a LightningDataModule for the dataset.
  2. Trainer configuration. Claude sets up the Trainer with the appropriate strategy (ddp for multi-GPU, deepspeed for large models), precision (bf16-mixed), and callbacks (EarlyStopping, ModelCheckpoint with the right metric).
  3. Logger integration. W&B or TensorBoard logging is added with hyperparameter logging and metric tracking, so every experiment run has a complete record.

Receipts

Where it works well:

  • Converting ad-hoc PyTorch training loops to Lightning for multi-GPU scale — the structural refactor is mechanical and Claude handles it cleanly in one pass
  • Experiment tracking setup with W&B — the Logger integration covers hyperparameter capture, metric curves, and model artifact versioning

Where it backfires:

  • Very custom training loops with unusual gradient manipulation (gradient surgery, manual optimizer stepping) can be awkward to fit into Lightning’s hooks
  • Debugging distributed training issues requires understanding of the underlying DDP/NCCL mechanism that Lightning abstracts but doesn’t fully hide
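When the abstraction does leak, a common first step is to surface the NCCL and torch.distributed layers underneath Lightning's DDP strategy. These are standard PyTorch/NCCL environment variables, not Lightning flags — set them before launching the training script:

```shell
# Standard NCCL / torch.distributed debugging env vars (not Lightning-specific);
# export them in the shell that launches your Lightning training script.
export NCCL_DEBUG=INFO                 # per-rank NCCL init and transport logs
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra checks on collective calls
```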

Pattern that works: compute every metric you care about inside training_step and log them with a single self.log_dict() call — Lightning aggregates the whole dict and forwards it to the configured logger, eliminating a separate self.log() call per metric. (Note that Lightning does not auto-log extra keys in a returned dict; only the "loss" key in the return value is special.)

Source and attribution

Originally authored by K-Dense Inc. The canonical SKILL.md lives in the pytorch-lightning folder of their public scientific-agent-skills repository.

License: Apache-2.0. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.