dask
Distributed computing for larger-than-RAM pandas/NumPy workflows — use when you need to scale existing pandas/NumPy code beyond memory or across clusters, with parallel file processing, distributed ML, and seamless pandas API compatibility.
Scale pandas and NumPy workflows beyond available RAM
Trigger phrases
Phrases that activate this skill when typed to Claude Code:
- "my data is too big for pandas"
- "dask dataframe"
- "parallel processing with dask"
- "distributed computation"
- "out-of-core computation"
What it does
dask is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into a Dask expert for scaling Python data workflows — Dask DataFrames for larger-than-RAM tabular data, Dask Arrays for chunked NumPy computations, the Dask distributed scheduler for cluster parallelism, and Dask-ML for distributed machine learning. Dask’s API mirrors pandas and NumPy closely, minimizing rewrite effort when scaling existing code.
A session produces Dask-based Python code that runs the same logical operations as pandas/NumPy code but on data that exceeds available RAM, with guidance on chunk sizing and scheduler selection.
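To give a flavor of that output, here is a minimal sketch; the file pattern, column names, and aggregation are placeholders for illustration, not anything the skill prescribes.

```python
# Minimal sketch: file pattern, column names, and the aggregation are placeholders.
import dask.dataframe as dd

# pandas would need the full dataset in RAM; Dask reads it lazily in partitions
# while keeping the same DataFrame API.
df = dd.read_csv("data/events-*.csv")        # lazy: builds a task graph, reads nothing yet
daily = (
    df[df["status"] == "ok"]                 # same boolean-mask filtering as pandas
    .groupby("date")["value"]
    .mean()
)
result = daily.compute()                     # execute the graph; returns a pandas Series
print(result.head())
```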
When to use it
Reach for it when:
- Your data doesn’t fit in RAM but you want to keep the pandas-style API rather than rewriting to Spark or SQL
- You need to process a directory of large files in parallel (reading, filtering, aggregating) on a single machine or a cluster
- You’re scaling an existing pandas-based ML preprocessing pipeline to larger data without changing the model code
When not to reach for it:
- Data fits in RAM and speed is the goal: use polars for much better single-machine performance
- You need an interactive analytics database: consider DuckDB or similar
Install
Copy the SKILL.md from K-Dense AI’s dask folder into .claude/skills/dask/ in your project.
Trigger phrases: “my data is too big for pandas”, “dask dataframe”, “parallel processing with dask”, “out-of-core computation”.
What a session looks like
A typical session has three phases:
- Scale assessment. Describe your data size, file format, and the operations you need. Claude recommends whether a Dask DataFrame, Dask Array, or Dask Delayed is the right abstraction and advises on chunk size.
- Code migration or generation. Claude writes or adapts the pandas/NumPy code to Dask equivalents, with .compute() calls placed at the final output step rather than at intermediate steps (see the sketch after this list).
- Scheduler configuration. Claude sets up the appropriate scheduler (local multi-threaded, local multi-process, or distributed cluster) and adds progress monitoring with the Dask dashboard.
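A hedged sketch of those last two phases; the worker counts, file pattern, and column names are assumptions for illustration, not values the skill fixes.

```python
# Sketch: worker counts, file pattern, and column names are illustrative assumptions.
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

cluster = LocalCluster(n_workers=4, threads_per_worker=2)   # local multi-process scheduler
client = Client(cluster)
print(client.dashboard_link)                 # URL of the live task-graph and memory dashboard

df = dd.read_parquet("data/*.parquet")       # lazy read
cleaned = df.dropna(subset=["value"])        # still lazy
summary = cleaned.groupby("key")["value"].sum()

# A single .compute() at the final output step lets Dask optimize and fuse the
# whole graph instead of materializing intermediate results.
result = summary.compute()
```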
Receipts
Where it works well:
- Reading and filtering 50+ GB CSV or Parquet file collections where chunked reads allow the full dataset to be processed without loading everything into memory
- Embarrassingly parallel file processing (e.g., convert a folder of 10,000 CSV files to Parquet) where Dask’s task graph trivially parallelizes the loop
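The second pattern above typically looks like the following sketch; the directory names and compression choice are illustrative, and writing Parquet assumes pyarrow or fastparquet is installed.

```python
# Sketch of the folder-conversion pattern; paths and compression are placeholders.
import glob
import os

import dask
import pandas as pd

@dask.delayed
def csv_to_parquet(path, out_dir="parquet_out"):
    df = pd.read_csv(path)                   # each individual file fits in RAM
    out = os.path.join(out_dir, os.path.basename(path).replace(".csv", ".parquet"))
    df.to_parquet(out, compression="snappy")
    return out

os.makedirs("parquet_out", exist_ok=True)
tasks = [csv_to_parquet(p) for p in glob.glob("raw_csv/*.csv")]
written = dask.compute(*tasks)               # one independent task per file, run in parallel
```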
Where it backfires:
- Operations that require full data shuffles (groupby with aggregation across all chunks, joins without aligned partitions) incur significant overhead compared to in-memory pandas on the same data
- Debugging Dask task graphs when something goes wrong requires understanding deferred execution in a way that pandas users often find unintuitive
Pattern that works: profile the bottleneck operation in pandas on a 10% sample first; if the operation vectorizes cleanly (filter, map, column-wise transform), Dask scales it well — if it requires global state (ranking, mode), the overhead may not be worth it.
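A sketch of that sample-first check; the file path, the 10% figure, and the column names are placeholders. The point is to time the pandas version on a representative slice before committing to Dask.

```python
# Sketch: file path and column names are placeholders for the sample-first check.
import time

import pandas as pd

sample = pd.read_csv("data/events-0001.csv")                        # one representative shard
t0 = time.perf_counter()
out = sample[sample["value"] > 0].groupby("key")["value"].mean()    # vectorizes cleanly
print(f"{time.perf_counter() - t0:.2f}s on {len(sample):,} rows")
# If this chain is the bottleneck and scales roughly linearly, the same code on a
# Dask DataFrame is a good fit; global-state operations (rank, mode) will not be.
```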
Source and attribution
Originally authored by K-Dense Inc. The canonical SKILL.md lives in the dask folder of their public scientific-agent-skills repository.
License: MIT. Install, adapt, and redistribute with attribution preserved.
This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.