umap-learn
UMAP dimensionality reduction — fast nonlinear manifold learning for 2D/3D visualization, clustering preprocessing with HDBSCAN, supervised and parametric UMAP for high-dimensional data including single-cell, image embeddings, and text representations.
Reduce high-dimensional data to 2D/3D with UMAP
Trigger phrases
Phrases that activate this skill when typed to Claude Code:
- "UMAP dimensionality reduction"
- "visualize high-dimensional data"
- "embed this dataset"
- "UMAP clustering"
- "reduce dimensions with UMAP"
What it does
umap-learn is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into a UMAP expert for dimensionality reduction and visualization — covering standard unsupervised UMAP for 2D/3D embedding, supervised UMAP with label information to improve class separation, parametric UMAP for out-of-sample projection, and UMAP as a preprocessing step for HDBSCAN density-based clustering.
A session produces Python code that takes a high-dimensional array (single-cell expression, image embeddings, text vectors, tabular features) and returns a 2D or 3D embedding suitable for visualization or downstream clustering.
When to use it
Reach for it when:
- You have high-dimensional data and want a 2D visualization that preserves neighborhood structure better than PCA
- You’re preprocessing high-dimensional features for HDBSCAN or other density-based clustering
- You need out-of-sample projection — embedding new data points into an existing UMAP space (parametric UMAP)
When not to reach for it:
- Small datasets (<500 points) where PCA or t-SNE is sufficient and UMAP’s graph construction overhead isn’t justified
- When linear interpretability of the axes matters — UMAP axes are not interpretable like PCA loadings
Install
Copy the SKILL.md from K-Dense AI’s umap-learn folder into .claude/skills/umap-learn/ in your project.
Trigger phrases: “UMAP dimensionality reduction”, “visualize high-dimensional data”, “embed this dataset”, “UMAP clustering”.
What a session looks like
A typical session has three phases:
- Data and goal specification. Describe the data type, dimensionality, and downstream goal (visualization, clustering, or projection). Claude selects appropriate `n_neighbors` and `min_dist` parameters: larger `n_neighbors` emphasizes global structure, smaller `min_dist` yields tighter clusters.
- UMAP fitting. Claude generates the UMAP fit code with a fixed random seed for reproducibility, optional metric selection (cosine for text/image embeddings, euclidean for normalized data), and appropriate data preprocessing (scaling, PCA initialization for speed).
- Visualization and downstream use. The 2D embedding is plotted with matplotlib/seaborn, colored by available labels or cluster assignments. For clustering use cases, Claude adds HDBSCAN fitting on the UMAP embedding.
Receipts
Where it works well:
- Single-cell RNA-seq visualization — UMAP on PCA-reduced expression data is the de facto standard in the field, and Scanpy uses umap-learn internally for its UMAP step
- Text embedding visualization — running UMAP on sentence-transformer or word2vec embeddings reveals semantic clusters that are invisible in the raw 768-dimensional space
Where it backfires:
- UMAP is non-deterministic across runs without a fixed seed, and different seeds can produce qualitatively different global layouts on the same data; always set `random_state`
- Comparing UMAP embeddings across conditions or time points requires the parametric variant or aligned embedding approaches; standard UMAP embeddings are not directly comparable
Pattern that works: always initialize UMAP with init='pca' for large datasets; it converges faster and produces more reproducible global structure than the default spectral initialization.
Source and attribution
Originally authored by K-Dense Inc. The canonical SKILL.md lives in the umap-learn folder of their public scientific-agent-skills repository.
License: MIT. Install, adapt, and redistribute with attribution preserved.
This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.