# umap-learn

> UMAP dimensionality reduction — fast nonlinear manifold learning for 2D/3D visualization, clustering preprocessing with HDBSCAN, supervised and parametric UMAP for high-dimensional data including single-cell, image embeddings, and text representations.

**Use case**: Reduce high-dimensional data to 2D/3D with UMAP

**Canonical URL**: https://agentcookbooks.com/skills/umap-learn/

**Topics**: claude-code, skills, science, bioinformatics

**Trigger phrases**: "UMAP dimensionality reduction", "visualize high-dimensional data", "embed this dataset", "UMAP clustering", "reduce dimensions with UMAP"

**Source**: [K-Dense AI](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/umap-learn)

**License**: MIT

---

## What it does

`umap-learn` is a Claude Code skill from K-Dense AI's [scientific-agent-skills repo](https://github.com/K-Dense-AI/scientific-agent-skills). It turns Claude into a UMAP expert for dimensionality reduction and visualization — covering standard unsupervised UMAP for 2D/3D embedding, supervised UMAP with label information to improve class separation, parametric UMAP for out-of-sample projection, and UMAP as a preprocessing step for HDBSCAN density-based clustering.

A session produces Python code that takes a high-dimensional array (single-cell expression, image embeddings, text vectors, tabular features) and returns a 2D or 3D embedding suitable for visualization or downstream clustering.

## When to use it

Reach for it when:

- You have high-dimensional data and want a 2D visualization that preserves neighborhood structure better than PCA
- You're preprocessing high-dimensional features for HDBSCAN or other density-based clustering
- You need out-of-sample projection — embedding new data points into an existing UMAP space (parametric UMAP)

When *not* to reach for it:

- Small datasets (<500 points) where PCA or t-SNE is sufficient and UMAP's graph construction overhead isn't justified
- When linear interpretability of the axes matters — UMAP axes are not interpretable like PCA loadings

## Install

Copy the `SKILL.md` from K-Dense AI's [umap-learn folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/umap-learn) into `.claude/skills/umap-learn/` in your project.

Once installed, the skill activates on trigger phrases such as "UMAP dimensionality reduction", "visualize high-dimensional data", "embed this dataset", and "UMAP clustering".

## What a session looks like

A typical session has three phases:

1. **Data and goal specification.** Describe the data type, dimensionality, and downstream goal (visualization, clustering, or projection). Claude selects appropriate `n_neighbors` and `min_dist` parameters — larger `n_neighbors` for global structure, smaller `min_dist` for tighter clusters.
2. **UMAP fitting.** Claude generates the UMAP fit code with a fixed random seed for reproducibility, optional metric selection (cosine for text/image embeddings, euclidean for normalized data), and appropriate data preprocessing (scaling, PCA initialization for speed).
3. **Visualization and downstream use.** The 2D embedding is plotted with matplotlib/seaborn, colored by available labels or cluster assignments. For clustering use cases, Claude adds HDBSCAN fitting on the UMAP embedding.

## Receipts

**Where it works well:**
- Single-cell RNA-seq visualization — UMAP on PCA-reduced expression data is the de facto standard in the field; Scanpy's `sc.tl.umap` wraps umap-learn internally
- Text embedding visualization — running UMAP on sentence-transformer or word2vec embeddings reveals semantic clusters that are invisible in the raw 768-dimensional space

**Where it backfires:**
- UMAP is non-deterministic across runs without a fixed seed, and different seeds can produce qualitatively different global layouts on the same data — always set `random_state`
- Comparing UMAP embeddings across conditions or time points requires the parametric variant or aligned embedding approaches; standard UMAP embeddings are not directly comparable

**Pattern that works:** initialize UMAP from PCA coordinates for large datasets (`init='pca'` in recent umap-learn releases, or pass a precomputed PCA array to `init` on any version) — it converges faster and produces more reproducible global structure than the default spectral initialization.

## Source and attribution

Originally authored by [K-Dense Inc.](https://github.com/K-Dense-AI). The canonical SKILL.md lives in the [`umap-learn` folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/umap-learn) of their public scientific-agent-skills repository.

License: MIT. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner's perspective. For the formal spec and any updates, defer to the source repo.