optimize-for-gpu
GPU-accelerate Python code using CuPy, Numba CUDA, Warp, cuDF, cuML, cuGraph, and RAFT — use whenever you want to speed up NumPy, pandas, scikit-learn, NetworkX, or image processing workloads with NVIDIA GPU acceleration.
GPU-accelerate existing Python NumPy and pandas code
Trigger phrases
Phrases that activate this skill when typed to Claude Code:
- "GPU accelerate this code"
- "speed up with CUDA"
- "CuPy instead of NumPy"
- "cuDF dataframe"
- "GPU optimization"
What it does
optimize-for-gpu is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into an NVIDIA RAPIDS and CUDA expert that identifies GPU acceleration opportunities in Python code and rewrites or supplements them with the appropriate GPU library: CuPy for NumPy arrays, Numba CUDA for custom kernels, cuDF for pandas DataFrames, cuML for scikit-learn, cuGraph for NetworkX, KvikIO for file I/O, cuCIM for image processing, and RAFT for vector similarity search.
A session produces a GPU-accelerated version of your Python code — with the same API contract where possible (CuPy mirrors NumPy, cuDF mirrors pandas, cuML mirrors scikit-learn) — and a note on which operations benefit most from GPU acceleration.
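Because CuPy mirrors the NumPy API, the "same API contract" idea can be sketched with a module-dispatch pattern: the function body is identical on CPU and GPU, and only the array module (`xp`) changes. This is an illustrative sketch, not the skill's literal output; `normalize_rows` is a hypothetical function, and the code falls back to NumPy when CuPy or a CUDA device is absent.

```python
import numpy as np

try:
    import cupy as cp
    cp.cuda.runtime.getDeviceCount()  # raises if no CUDA driver/device is present
    xp = cp
except Exception:
    cp = None
    xp = np  # transparent CPU fallback: same code path, NumPy arrays

def normalize_rows(a):
    """Row-normalize a matrix; identical source for NumPy and CuPy arrays."""
    norms = xp.linalg.norm(a, axis=1, keepdims=True)
    return a / norms

m = xp.arange(1, 13, dtype=xp.float64).reshape(3, 4)
unit = normalize_rows(m)
# Explicitly cross the GPU -> CPU boundary before handing data back to callers
host = cp.asnumpy(unit) if cp is not None else unit
```

On a machine without CuPy this runs entirely on NumPy; on a CUDA machine the same `normalize_rows` executes on the GPU unchanged.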
When to use it
Reach for it when:
- You have CPU-bound NumPy, pandas, or scikit-learn code running on large arrays or DataFrames and have an NVIDIA GPU available
- You’re building a vector similarity search pipeline and want RAFT-based GPU indexing and search instead of FAISS-CPU
- You have custom numerical algorithms in Python loops that are candidates for Numba CUDA kernel optimization
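The "Python loops as Numba CUDA candidates" case looks roughly like this: an element-wise loop becomes a `@cuda.jit` kernel indexed by `cuda.grid(1)`. The SAXPY example below is hypothetical, chosen for brevity, and it keeps a vectorized NumPy path so it still runs where Numba or a GPU is unavailable.

```python
import numpy as np

try:
    from numba import cuda
    HAVE_GPU = cuda.is_available()
except ImportError:
    HAVE_GPU = False

if HAVE_GPU:
    @cuda.jit
    def saxpy_kernel(a, x, y, out):
        i = cuda.grid(1)  # one thread per array element
        if i < x.size:
            out[i] = a * x[i] + y[i]

def saxpy(a, x, y):
    """out = a*x + y, on the GPU when a CUDA device is present."""
    if HAVE_GPU:
        d_out = cuda.device_array_like(x)
        threads = 256
        blocks = (x.size + threads - 1) // threads
        saxpy_kernel[blocks, threads](a, cuda.to_device(x), cuda.to_device(y), d_out)
        return d_out.copy_to_host()
    return a * x + y  # vectorized NumPy fallback

x = np.arange(5, dtype=np.float64)
y = np.ones(5)
z = saxpy(2.0, x, y)  # [1, 3, 5, 7, 9]
```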
When not to reach for it:
- No NVIDIA GPU available — all RAPIDS libraries require CUDA
- Small datasets where the GPU transfer overhead exceeds the compute savings
- PyTorch or TensorFlow model training — those frameworks already use CUDA natively
Install
Copy the SKILL.md from K-Dense AI's optimize-for-gpu folder into `.claude/skills/optimize-for-gpu/` in your project. RAPIDS installation requires an NVIDIA GPU with CUDA support; install via `conda install -c rapidsai -c conda-forge rapids`.
What a session looks like
A typical session has three phases:
- Code profiling and bottleneck identification. Claude reviews the Python code, identifies the compute-heavy operations (matrix operations, groupby aggregations, pairwise distance computations), and determines which GPU library addresses each bottleneck.
- GPU rewrite. Claude rewrites the bottleneck operations: `numpy` → `cupy` with an `import cupy as cp` swap, `pd.DataFrame` → `cudf.DataFrame`, `sklearn` estimators → `cuml` equivalents, `nx` graph algorithms → `cugraph`. Operations that can't be GPU-accelerated are flagged.
- Integration and fallback. Claude adds a CPU fallback path (a try/except around the CuPy import) so the code still runs when no GPU is available, and adds explicit `cp.asnumpy()` calls at the boundary where GPU arrays need to be returned to CPU memory.
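The boundary step above can be condensed into a small helper that always returns host-memory arrays. This is a sketch of the pattern, not the skill's exact output; `to_host` is a hypothetical name.

```python
import numpy as np

try:
    import cupy as cp
except ImportError:
    cp = None  # CPU-only environment

def to_host(a):
    """Return a NumPy array regardless of where `a` lives.

    GPU arrays are copied across the PCIe bus via cp.asnumpy();
    host arrays pass through unchanged."""
    if cp is not None and isinstance(a, cp.ndarray):
        return cp.asnumpy(a)
    return np.asarray(a)

# Call at the pipeline boundary, right before returning results to callers
host = to_host(np.linspace(0.0, 1.0, 5))
```

Keeping the conversion in one place makes the GPU/CPU boundary explicit and easy to audit for unintended back-and-forth transfers.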
Receipts
Where it works well:
- NumPy matrix operations on large arrays (>1M elements): CuPy is a near drop-in replacement, and the speedup for compute-bound array operations is substantial
- cuDF DataFrame operations on large CSVs — the groupby and aggregation speedup is significant for datasets in the hundreds-of-MB range
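The groupby case is where the pandas/cuDF API mirroring pays off most directly: the call is identical in both libraries. A minimal sketch, with a hypothetical toy dataset and a pandas fallback when cuDF is not installed:

```python
try:
    import cudf as xdf  # GPU DataFrame library mirroring the pandas API
except ImportError:
    import pandas as xdf  # identical call signature for this operation

df = xdf.DataFrame({
    "store": ["a", "a", "b", "b", "b"],
    "sales": [10.0, 20.0, 5.0, 5.0, 15.0],
})
# groupby/aggregate is among the operations with the largest reported GPU speedups
totals = df.groupby("store")["sales"].sum()
```

On real workloads the same two lines scale from this toy frame to the hundreds-of-MB CSVs mentioned above, read via `cudf.read_csv` instead of `pandas.read_csv`.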
Where it backfires:
- Operations with large GPU-CPU data transfers between steps negate the compute speedup — the skill doesn’t always identify when a mixed CPU/GPU pipeline is slower than pure CPU
- RAPIDS version pinning is strict; library version mismatches between cuDF, cuML, and the CUDA driver version are a common source of environment failures
Pattern that works: profile the CPU baseline first with cProfile or line_profiler — only rewrite the top 2–3 bottleneck operations to GPU; rewriting everything introduces unnecessary transfer overhead and debugging complexity.
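The profile-first step can be as simple as wrapping the workload in `cProfile` and sorting by cumulative time; the function names (`hot_loop`, `cheap_setup`, `workload`) below are hypothetical stand-ins for your own code.

```python
import cProfile
import io
import pstats

def hot_loop(n=300):
    # O(n^2) pure-Python loop: the kind of bottleneck worth a GPU rewrite
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

def cheap_setup():
    # Cheap work that would NOT repay a GPU port
    return list(range(100))

def workload():
    for _ in range(10):
        hot_loop()
        cheap_setup()

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()  # hot_loop dominates; only it is a GPU candidate
```

Only the top two or three entries in `report` are worth porting; everything below them is transfer overhead waiting to happen.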
Source and attribution
Originally authored by K-Dense Inc. The canonical SKILL.md lives in the optimize-for-gpu folder of their public scientific-agent-skills repository.
License: MIT. Install, adapt, and redistribute with attribution preserved.
This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.