# optimize-for-gpu

> GPU-accelerate Python code using CuPy, Numba CUDA, Warp, cuDF, cuML, cuGraph, and RAFT — use whenever you want to speed up NumPy, pandas, scikit-learn, NetworkX, or image processing workloads with NVIDIA GPU acceleration.

**Use case**: GPU-accelerate existing Python NumPy and pandas code

**Canonical URL**: https://agentcookbooks.com/skills/optimize-for-gpu/

**Topics**: claude-code, skills, science

**Trigger phrases**: "GPU accelerate this code", "speed up with CUDA", "CuPy instead of NumPy", "cuDF dataframe", "GPU optimization"

**Source**: [K-Dense AI](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/optimize-for-gpu)

**License**: MIT

---

## What it does

`optimize-for-gpu` is a Claude Code skill from K-Dense AI's [scientific-agent-skills repo](https://github.com/K-Dense-AI/scientific-agent-skills). It turns Claude into an NVIDIA RAPIDS and CUDA expert that identifies GPU acceleration opportunities in Python code and rewrites or supplements them with the appropriate GPU library: CuPy for NumPy arrays, Numba CUDA for custom kernels, cuDF for pandas DataFrames, cuML for scikit-learn, cuGraph for NetworkX, KvikIO for file I/O, cuCIM for image processing, and RAFT for vector similarity search.

A session produces a GPU-accelerated version of your Python code — with the same API contract where possible (CuPy mirrors NumPy, cuDF mirrors pandas, cuML mirrors scikit-learn) — and a note on which operations benefit most from GPU acceleration.
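
Because the GPU libraries mirror their CPU counterparts, the swap is often a one-line import change. A minimal sketch of the idea (the `xp` alias is a common community convention, not part of the skill; the code falls back to NumPy when CuPy isn't installed):

```python
import numpy as np

# CuPy mirrors the NumPy API, so identical code can target either backend.
try:
    import cupy as xp  # GPU arrays when CUDA + CuPy are available
except ImportError:
    xp = np            # CPU fallback

def normalize(a):
    """Column-wise z-score; the same code runs on CPU or GPU."""
    a = xp.asarray(a)
    return (a - a.mean(axis=0)) / a.std(axis=0)

result = normalize([[1.0, 2.0], [3.0, 4.0]])
```

The same pattern applies to cuDF (`import cudf as pd`-style swaps) and cuML, though DataFrame code tends to need more per-call review than array code.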

## When to use it

Reach for it when:

- You have CPU-bound NumPy, pandas, or scikit-learn code running on large arrays or DataFrames and have an NVIDIA GPU available
- You're building a vector similarity search pipeline and want RAFT-based GPU indexing and search instead of FAISS-CPU
- You have custom numerical algorithms in Python loops that are candidates for Numba CUDA kernel optimization

When *not* to reach for it:

- No NVIDIA GPU available — all RAPIDS libraries require CUDA
- Small datasets where the GPU transfer overhead exceeds the compute savings
- PyTorch or TensorFlow model training — those frameworks already use CUDA natively

## Install

Copy the `SKILL.md` from K-Dense AI's [optimize-for-gpu folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/optimize-for-gpu) into `.claude/skills/optimize-for-gpu/` in your project. RAPIDS itself requires an NVIDIA GPU with CUDA support; install it via conda from the `rapidsai` and `conda-forge` channels (e.g. `conda install -c rapidsai -c conda-forge rapids`), pinning the RAPIDS release and CUDA version that match your driver — the RAPIDS release selector lists the exact command for each combination.


## What a session looks like

A typical session has three phases:

1. **Code profiling and bottleneck identification.** Claude reviews the Python code, identifies the compute-heavy operations (matrix operations, groupby aggregations, pairwise distance computations), and determines which GPU library addresses each bottleneck.
2. **GPU rewrite.** Claude rewrites the bottleneck operations: `numpy` → `cupy` with an `import cupy as cp` swap, `pd.DataFrame` → `cudf.DataFrame`, `sklearn` estimators → `cuml` equivalents, `nx.` graph algorithms → `cugraph`. Operations that can't be GPU-accelerated are flagged.
3. **Integration and fallback.** Claude adds a CPU fallback path (try/except for CuPy import) so the code runs on CPU when no GPU is available, and adds explicit `cp.asnumpy()` calls at the boundary where GPU arrays need to be returned to CPU memory.
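
The fallback-and-boundary pattern from phase 3 can be sketched like this (a minimal illustration, not the skill's literal output; `to_cpu` and `pairwise_sq_dists` are hypothetical names):

```python
import numpy as np

try:
    import cupy as cp
    GPU = True
except ImportError:
    cp = np       # alias so the numerical code below is backend-agnostic
    GPU = False

def to_cpu(a):
    """Move a result back to host memory at the API boundary."""
    return cp.asnumpy(a) if GPU else np.asarray(a)

def pairwise_sq_dists(x):
    """Compute-heavy step: stays on the GPU when one is available."""
    x = cp.asarray(x)
    diff = x[:, None, :] - x[None, :, :]
    return to_cpu((diff ** 2).sum(axis=-1))  # cross the boundary once

d = pairwise_sq_dists(np.array([[0.0, 0.0], [3.0, 4.0]]))
```

The key design choice is crossing the GPU→CPU boundary exactly once, at the return value, rather than converting after every intermediate step.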

## Receipts

**Where it works well:**
- NumPy matrix operations on large arrays (>1M elements) — CuPy is a near drop-in replacement, and compute-bound array operations see substantial speedups
- cuDF DataFrame operations on large CSVs — the groupby and aggregation speedup is significant for datasets in the hundreds-of-MB range

**Where it backfires:**
- Operations with large GPU-CPU data transfers between steps negate the compute speedup — the skill doesn't always identify when a mixed CPU/GPU pipeline is slower than pure CPU
- RAPIDS version pinning is strict; library version mismatches between cuDF, cuML, and the CUDA driver version are a common source of environment failures

**Pattern that works:** profile the CPU baseline first with `cProfile` or `line_profiler` — only rewrite the top 2–3 bottleneck operations to GPU; rewriting everything introduces unnecessary transfer overhead and debugging complexity.
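
A minimal version of that profile-first step, using only the standard library (function names are illustrative):

```python
import cProfile
import io
import pstats
import numpy as np

def heavy_step(a):
    # If this dominates the profile, it's the GPU-rewrite candidate.
    return a @ a.T

def pipeline():
    a = np.random.default_rng(0).normal(size=(300, 300))
    return heavy_step(a).sum()

profiler = cProfile.Profile()
total = profiler.runcall(pipeline)

# Report only the few most expensive calls; rewrite just those.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

Only the functions at the top of `report` are worth porting to CuPy or cuDF; everything else stays on the CPU, keeping transfer overhead and debugging surface small.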

## Source and attribution

Originally authored by [K-Dense Inc.](https://github.com/K-Dense-AI). The canonical SKILL.md lives in the [`optimize-for-gpu` folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/optimize-for-gpu) of their public scientific-agent-skills repository.

License: MIT. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner's perspective. For the formal spec and any updates, defer to the source repo.