pytorch-lightning
Deep learning framework that organizes PyTorch code into LightningModules with Trainers for multi-GPU/TPU, data pipelines, callbacks, logging (W&B, TensorBoard), and distributed training (DDP, FSDP, DeepSpeed) for scalable neural network training.
Organize PyTorch training with multi-GPU and experiment logging
Trigger phrases
Phrases that activate this skill when typed to Claude Code:
"pytorch lightning training", "LightningModule", "multi-GPU training", "distributed training with PyTorch", "organize my training loop"
What it does
pytorch-lightning is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into a PyTorch Lightning expert that structures training code into LightningModule and LightningDataModule classes, configures Trainer for multi-GPU (DDP), TPU, and distributed strategies (FSDP, DeepSpeed), and sets up callbacks (early stopping, model checkpointing) and loggers (Weights & Biases, TensorBoard).
A session produces production-ready training code organized around Lightning's abstractions (the forward, training_step, validation_step, and configure_optimizers methods), with hardware-agnostic model code that scales from a single GPU to a cluster unchanged. The sketch below shows the shape.
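A minimal sketch of that structure, assuming the Lightning 2.x `lightning.pytorch` namespace (older installs import `pytorch_lightning` instead); the toy model and hyperparameters are placeholders:

```python
import torch
from torch import nn
import lightning.pytorch as pl


class LitClassifier(pl.LightningModule):
    """Model logic lives here; hardware and loop logic live in the Trainer."""

    def __init__(self, lr: float = 1e-3):
        super().__init__()
        self.save_hyperparameters()  # records lr for loggers and checkpoints
        self.model = nn.Sequential(
            nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.model(x.flatten(1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", nn.functional.cross_entropy(self(x), y), prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
```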
When to use it
Reach for it when:
- You have a PyTorch model and want to add multi-GPU training, checkpointing, and experiment logging without boilerplate
- You’re scaling a training pipeline from a single machine to a distributed cluster and want strategy-agnostic model code
- You want reproducible experiments with automatic logging of hyperparameters, metrics, and model artifacts
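On the reproducibility point, Lightning ships a one-call seeding helper; a minimal sketch, assuming Lightning 2.x:

```python
import lightning.pytorch as pl

# Seeds Python, NumPy, and torch (and, with workers=True, DataLoader workers).
pl.seed_everything(42, workers=True)

# deterministic=True asks torch to prefer deterministic kernels where available.
trainer = pl.Trainer(deterministic=True)
```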
When not to reach for it:
- Simple single-GPU research experiments where the Lightning overhead adds more structure than value — vanilla PyTorch is fine
- Hugging Face model fine-tuning, where the Transformers `Trainer` API is better integrated with the Hub ecosystem
Install
Copy the SKILL.md from K-Dense AI’s pytorch-lightning folder into .claude/skills/pytorch-lightning/ in your project.
Trigger phrases: “pytorch lightning training”, “LightningModule”, “multi-GPU training”, “distributed training with PyTorch”.
What a session looks like
A typical session has three phases:
- Model structure translation. Claude takes an existing PyTorch `nn.Module` (or a description of one) and reorganizes it into a `LightningModule` with the required hooks, plus a `LightningDataModule` for the dataset.
- Trainer configuration. Claude sets up the `Trainer` with the appropriate strategy (`ddp` for multi-GPU, `deepspeed` for large models), precision (`bf16-mixed`), and callbacks (`EarlyStopping`, `ModelCheckpoint` with the right metric).
- Logger integration. W&B or TensorBoard logging is added with hyperparameter logging and metric tracking, so every experiment run has a complete record. The sketch after this list shows a typical result of the last two phases.
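A hedged sketch of that Trainer setup, reusing `LitClassifier` from the earlier sketch; the W&B project name, monitored metric, and device count are placeholders:

```python
import lightning.pytorch as pl
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint
from lightning.pytorch.loggers import WandbLogger

# Placeholder project; log_model=True versions checkpoints as W&B artifacts.
logger = WandbLogger(project="my-experiments", log_model=True)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,                # drop to 1 without touching the model code
    strategy="ddp",           # or "fsdp" / "deepspeed_stage_2" for large models
    precision="bf16-mixed",
    max_epochs=50,
    logger=logger,
    callbacks=[
        EarlyStopping(monitor="val_loss", patience=5),
        ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=3),
    ],
)

# `dm` stands in for a LightningDataModule wrapping your dataset.
trainer.fit(LitClassifier(lr=1e-3), datamodule=dm)
```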
Receipts
Where it works well:
- Converting ad-hoc PyTorch training loops to Lightning for multi-GPU scale — the structural refactor is mechanical and Claude handles it cleanly in one pass
- Experiment tracking setup with W&B — the Logger integration covers hyperparameter capture, metric curves, and model artifact versioning
Where it backfires:
- Very custom training loops with unusual gradient manipulation (gradient surgery, manual optimizer stepping) can be awkward to fit into Lightning's hooks (see the escape-hatch sketch after this list)
- Debugging distributed training issues requires understanding of the underlying DDP/NCCL mechanism that Lightning abstracts but doesn’t fully hide
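For the custom-loop case, Lightning's escape hatch is manual optimization. A sketch of the shape it takes; the clipping line is a stand-in for whatever gradient surgery your loop actually does:

```python
import torch
from torch import nn
import lightning.pytorch as pl


class ManualOptModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # opt out of the automatic loop
        self.model = nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()
        x, y = batch
        loss = nn.functional.mse_loss(self.model(x), y)
        self.manual_backward(loss)  # plays nicely with AMP and DDP wrappers
        # Custom gradient manipulation goes here; clipping is a placeholder.
        torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0)
        opt.step()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```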
Pattern that works: compute every metric you want tracked inside training_step and log them together with a single self.log_dict() call; Lightning handles the per-step and per-epoch aggregation, eliminating a separate self.log() call for each metric.
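A minimal sketch of that pattern, written as a `training_step` that slots into a LightningModule like the one sketched earlier; the metric names are placeholders:

```python
def training_step(self, batch, batch_idx):
    x, y = batch
    logits = self(x)
    loss = nn.functional.cross_entropy(logits, y)
    acc = (logits.argmax(dim=1) == y).float().mean()
    # One call logs the whole dict; Lightning aggregates per step and per epoch.
    self.log_dict({"train_loss": loss, "train_acc": acc}, on_step=True, on_epoch=True)
    return loss
```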
Source and attribution
Originally authored by K-Dense Inc. The canonical SKILL.md lives in the pytorch-lightning folder of their public scientific-agent-skills repository.
License: Apache-2.0. Install, adapt, and redistribute with attribution preserved.
This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.