optimize-for-gpu

GPU-accelerate Python code using CuPy, Numba CUDA, Warp, cuDF, cuML, cuGraph, and RAFT — use whenever you want to speed up NumPy, pandas, scikit-learn, NetworkX, or image processing workloads with NVIDIA GPU acceleration.

GPU-accelerate existing Python NumPy and pandas code

Source K-Dense AI
License MIT
First documented

Trigger phrases

Phrases that activate this skill when typed into Claude Code:

  • GPU accelerate this code
  • speed up with CUDA
  • CuPy instead of NumPy
  • cuDF dataframe
  • GPU optimization

What it does

optimize-for-gpu is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into an NVIDIA RAPIDS and CUDA expert that identifies GPU acceleration opportunities in Python code and rewrites or supplements them with the appropriate GPU library: CuPy for NumPy arrays, Numba CUDA for custom kernels, cuDF for pandas DataFrames, cuML for scikit-learn, cuGraph for NetworkX, KvikIO for file I/O, cuCIM for image processing, and RAFT for vector similarity search.

A session produces a GPU-accelerated version of your Python code — with the same API contract where possible (CuPy mirrors NumPy, cuDF mirrors pandas, cuML mirrors scikit-learn) — and a note on which operations benefit most from GPU acceleration.
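The "same API contract" claim can be seen in a minimal sketch. This is an illustration, not the skill's output: the `standardize` function is an invented example, and the try/except import lets the same code run on CPU-only machines (falling back to NumPy) or on a GPU when CuPy is installed.

```python
import numpy as np

try:
    import cupy as cp  # GPU backend, if CuPy is installed
    xp = cp
except ImportError:
    xp = np  # CPU fallback: NumPy exposes the same array API

def standardize(a):
    """Column-wise z-score; runs unchanged on NumPy or CuPy arrays."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

x = xp.arange(12, dtype=xp.float64).reshape(4, 3)
z = standardize(x)
# Convert back to host memory only at the boundary (no-op on the CPU path).
z_host = cp.asnumpy(z) if xp is not np else z
```

Because CuPy mirrors the NumPy API, the only GPU-specific code is the import and the final host-side conversion.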

When to use it

Reach for it when:

  • You have CPU-bound NumPy, pandas, or scikit-learn code running on large arrays or DataFrames and have an NVIDIA GPU available
  • You’re building a vector similarity search pipeline and want RAFT-based GPU indexing and search instead of FAISS-CPU
  • You have custom numerical algorithms in Python loops that are candidates for Numba CUDA kernel optimization

When not to reach for it:

  • No NVIDIA GPU available — all RAPIDS libraries require CUDA
  • Small datasets where the GPU transfer overhead exceeds the compute savings
  • PyTorch or TensorFlow model training — those frameworks already use CUDA natively

Install

Copy the SKILL.md from K-Dense AI’s optimize-for-gpu folder into .claude/skills/optimize-for-gpu/ in your project. RAPIDS requires an NVIDIA GPU with CUDA support; install via conda install -c rapidsai -c conda-forge -c nvidia rapids, pinning the RAPIDS release and CUDA version to match your driver.


What a session looks like

A typical session has three phases:

  1. Code profiling and bottleneck identification. Claude reviews the Python code, identifies the compute-heavy operations (matrix operations, groupby aggregations, pairwise distance computations), and determines which GPU library addresses each bottleneck.
  2. GPU rewrite. Claude rewrites the bottleneck operations: numpy → cupy with an import cupy as cp swap, pd.DataFrame → cudf.DataFrame, sklearn estimators → cuml equivalents, nx graph algorithms → cugraph. Operations that can’t be GPU-accelerated are flagged.
  3. Integration and fallback. Claude adds a CPU fallback path (try/except for CuPy import) so the code runs on CPU when no GPU is available, and adds explicit cp.asnumpy() calls at the boundary where GPU arrays need to be returned to CPU memory.
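The fallback-and-boundary pattern from phase 3 can be sketched as follows. This is a minimal illustration, not the skill's literal output: the pairwise-distance function is an invented example, and the code assumes NumPy is available as the CPU fallback.

```python
import numpy as np

try:
    import cupy as cp  # phase-3 pattern: guard the GPU import
    GPU = True
except ImportError:
    cp = None
    GPU = False

def pairwise_sq_dists(points):
    """Pairwise squared Euclidean distances, on GPU when available.

    Always returns a NumPy array: the cp.asnumpy() call marks the single
    GPU-to-CPU transfer boundary, so callers never see device arrays.
    """
    xp = cp if GPU else np
    p = xp.asarray(points, dtype=xp.float64)
    sq = (p * p).sum(axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * (p @ p.T)
    return cp.asnumpy(d) if GPU else d

pts = np.array([[0.0, 0.0], [3.0, 4.0]])
d = pairwise_sq_dists(pts)  # d[0, 1] == 25.0
```

Keeping the conversion in one place, at the return, is what prevents the scattered device-to-host copies that erase GPU speedups.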

Receipts

Where it works well:

  • NumPy matrix operations on large arrays (>1M elements) — CuPy is a near drop-in replacement and delivers substantial speedups on compute-bound operations
  • cuDF DataFrame operations on large CSVs — the groupby and aggregation speedup is significant for datasets in the hundreds-of-MB range
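The cuDF case above can be exercised with a backend-agnostic sketch. This is an assumption-laden illustration, not the skill's output: it imports cuDF when present and falls back to pandas otherwise, relying on cuDF's pandas-compatible groupby API, and the tiny inline DataFrame stands in for the large CSVs the bullet describes.

```python
try:
    import cudf as xdf  # GPU DataFrames, if RAPIDS is installed
except ImportError:
    import pandas as xdf  # CPU fallback: cuDF mirrors the pandas API

df = xdf.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "val": [1.0, 2.0, 3.0, 4.0, 5.0],
})
# The same groupby/aggregation code runs on either backend.
agg = df.groupby("key").val.sum()
```

At the hundreds-of-MB scale the bullet describes, the aggregation is the same line of code; only the import changes.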

Where it backfires:

  • Operations with large GPU-CPU data transfers between steps negate the compute speedup — the skill doesn’t always identify when a mixed CPU/GPU pipeline is slower than pure CPU
  • RAPIDS version pinning is strict; library version mismatches between cuDF, cuML, and the CUDA driver version are a common source of environment failures

Pattern that works: profile the CPU baseline first with cProfile or line_profiler — only rewrite the top 2–3 bottleneck operations to GPU; rewriting everything introduces unnecessary transfer overhead and debugging complexity.
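The profile-first step can be sketched with the standard library alone. The `baseline` function is an invented stand-in for your CPU code; the point is that the cProfile report ranks functions by cumulative time, so you rewrite only the top entries.

```python
import cProfile
import io
import pstats

import numpy as np

def baseline(n=200_000):
    a = np.random.default_rng(0).random((n, 8))
    norms = np.linalg.norm(a, axis=1)      # vectorized: usually cheap
    total = sum(float(x) for x in norms)   # Python-level loop: a likely bottleneck
    return total

prof = cProfile.Profile()
prof.enable()
baseline()
prof.disable()

out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()  # top 5 functions by cumulative time
```

Only the functions that dominate this report are worth porting to CuPy or a Numba kernel; the rest stays on the CPU.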

Source and attribution

Originally authored by K-Dense Inc. The canonical SKILL.md lives in the optimize-for-gpu folder of their public scientific-agent-skills repository.

License: MIT. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.