# dask

> Distributed computing for larger-than-RAM pandas/NumPy workflows — use when you need to scale existing pandas/NumPy code beyond memory or across clusters, with parallel file processing, distributed ML, and a pandas-compatible API.

**Use case**: Scale pandas and NumPy workflows beyond available RAM

**Canonical URL**: https://agentcookbooks.com/skills/dask/

**Topics**: claude-code, skills, science, data-science

**Trigger phrases**: "my data is too big for pandas", "dask dataframe", "parallel processing with dask", "distributed computation", "out-of-core computation"

**Source**: [K-Dense AI](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/dask)

**License**: MIT

---

## What it does

`dask` is a Claude Code skill from K-Dense AI's [scientific-agent-skills repo](https://github.com/K-Dense-AI/scientific-agent-skills). It turns Claude into a Dask expert for scaling Python data workflows — Dask DataFrames for larger-than-RAM tabular data, Dask Arrays for chunked NumPy computations, the Dask distributed scheduler for cluster parallelism, and Dask-ML for distributed machine learning. Dask's API mirrors pandas and NumPy closely, minimizing rewrite effort when scaling existing code.

A session produces Dask-based Python code that runs the same logical operations as pandas/NumPy code but on data that exceeds available RAM, with guidance on chunk sizing and scheduler selection.

## When to use it

Reach for it when:

- Your data doesn't fit in RAM but you want to keep the pandas-style API rather than rewriting to Spark or SQL
- You need to process a directory of large files in parallel (reading, filtering, aggregating) on a single machine or a cluster
- You're scaling an existing pandas-based ML preprocessing pipeline to larger data without changing the model code

When *not* to reach for it:

- Data fits in RAM and speed is the goal — use `polars` for much better single-machine performance
- You need an interactive analytics database — consider DuckDB or similar

## Install

Copy the `SKILL.md` from K-Dense AI's [dask folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/dask) into `.claude/skills/dask/` in your project.

Once installed, the skill activates on phrases like "my data is too big for pandas", "dask dataframe", "parallel processing with dask", and "out-of-core computation".

## What a session looks like

A typical session has three phases:

1. **Scale assessment.** Describe your data size, file format, and the operations you need. Claude recommends whether a Dask DataFrame, Dask Array, or Dask Delayed is the right abstraction and advises on chunk size.
2. **Code migration or generation.** Claude writes or adapts the pandas/NumPy code to Dask equivalents, with `.compute()` calls placed at the final output step rather than intermediate steps.
3. **Scheduler configuration.** Claude sets up the appropriate scheduler — local multi-threaded, local multi-process, or distributed cluster — and adds progress monitoring with the Dask dashboard.

## Receipts

**Where it works well:**
- Reading and filtering 50+ GB CSV or Parquet file collections where chunked reads allow the full dataset to be processed without loading everything into memory
- Embarrassingly parallel file processing (e.g., convert a folder of 10,000 CSV files to Parquet) where Dask's task graph trivially parallelizes the loop
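The embarrassingly parallel case in the list above is typically written with `dask.delayed`. In this self-contained sketch, the hypothetical `process_one` stands in for "read one CSV, transform, write Parquet" and just squares a number so the example runs anywhere:

```python
import dask
from dask import delayed

@delayed
def process_one(item):
    # Placeholder for per-file work: read CSV, transform, write Parquet.
    return item * item

tasks = [process_one(i) for i in range(10)]  # one lazy task per "file"
results = dask.compute(*tasks)               # executes the tasks in parallel
```

Because no task depends on another, the task graph is a flat fan-out and Dask parallelizes it with no shuffle overhead.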

**Where it backfires:**
- Operations that require full data shuffles (groupby with aggregation across all chunks, joins without aligned partitions) incur significant overhead compared to in-memory pandas on the same data
- Debugging Dask task graphs when something goes wrong requires understanding deferred execution in a way that pandas users often find unintuitive

**Pattern that works:** profile the bottleneck operation in pandas on a 10% sample first; if the operation vectorizes cleanly (filter, map, column-wise transform), Dask scales it well — if it requires global state (ranking, mode), the overhead may not be worth it.
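The sampling pattern above can be sketched in a few lines of plain pandas; the linear extrapolation is a rough assumption that holds for per-row (vectorized) operations but not for shuffles:

```python
import time
import pandas as pd

df = pd.DataFrame({"v": range(100_000)})
sample = df.sample(frac=0.1, random_state=0)  # 10% sample

t0 = time.perf_counter()
_ = sample["v"] * 2  # candidate operation: column-wise, so Dask-friendly
elapsed = time.perf_counter() - t0

# Rough full-data estimate; only valid for operations that scale linearly.
estimated_full = elapsed * 10
```

If the sampled run is already slow or the operation needs global state, that is the signal to reconsider before adding Dask's scheduling overhead on top.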

## Source and attribution

Originally authored by [K-Dense Inc.](https://github.com/K-Dense-AI). The canonical SKILL.md lives in the [`dask` folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/dask) of their public scientific-agent-skills repository.

License: MIT. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner's perspective. For the formal spec and any updates, defer to the source repo.