arboreto
High-performance gene regulatory network inference using GENIE3 and GRNBoost2 algorithms on single-cell or bulk expression data.
Infer gene regulatory networks at scale
Trigger phrases
Phrases that activate this skill when typed to Claude Code:
- gene regulatory network
- GRNBoost2
- GENIE3
- transcription factor inference
What it does
arboreto is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into a gene regulatory network inference specialist, guiding the use of arboreto’s distributed GRNBoost2 and GENIE3 implementations to reconstruct transcription factor–target gene relationships from expression matrices.
The output of a session is a ranked edge list (TF → target gene, importance score) derived from tree-ensemble regressors: gradient boosting for GRNBoost2, random forests or extra-trees for GENIE3. The list is ready to feed into downstream network analysis tools like SCENIC or networkx. Arboreto runs on Dask for parallelism, so it scales across cores or a cluster without rewriting the core algorithm.
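To make the shape of that edge list concrete, here is a minimal standard-library sketch. It assumes the three-column layout arboreto returns (TF, target, importance); the gene names, scores, and helper functions (`read_edges`, `top_n_edges`, `to_adjacency`) are hypothetical and only illustrate one way to trim the list before handing it to a graph tool.

```python
import csv
import io
from collections import defaultdict

# Hypothetical edge list in a TF / target / importance layout.
# Gene names and scores are made up for illustration.
EDGES_TSV = """TF\ttarget\timportance
TF_A\tgene1\t12.5
TF_B\tgene1\t3.1
TF_A\tgene2\t8.0
TF_B\tgene3\t0.4
"""


def read_edges(handle):
    """Parse a TF/target/importance TSV into (tf, target, score) tuples."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["TF"], row["target"], float(row["importance"])) for row in reader]


def top_n_edges(edges, n):
    """Keep the n highest-scoring edges, strongest first."""
    return sorted(edges, key=lambda edge: edge[2], reverse=True)[:n]


def to_adjacency(edges):
    """Group edges into {tf: {target: score}}."""
    adjacency = defaultdict(dict)
    for tf, target, score in edges:
        adjacency[tf][target] = score
    return dict(adjacency)


edges = read_edges(io.StringIO(EDGES_TSV))
strongest = top_n_edges(edges, n=2)
network = to_adjacency(strongest)
```

With pandas and networkx installed, the same hand-off could likely be done with `networkx.from_pandas_edgelist(df, "TF", "target", edge_attr="importance", create_using=networkx.DiGraph)`.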
When to use it
Reach for it when:
- You have bulk or single-cell expression data and want to infer which transcription factors drive target gene expression
- You need a scalable GRN method: GRNBoost2 is arboreto's faster alternative to GENIE3 on large gene sets
- You are building a SCENIC pipeline and need the arboreto step to produce the co-expression adjacencies (the TF-target edge list)
When not to reach for it:
- You need causal rather than correlative network inference — arboreto is association-based
- Your dataset is very small (fewer than a few hundred cells/samples); simpler correlation methods may suffice
Install
Copy the SKILL.md from scientific-skills/arboreto into .claude/skills/arboreto/.
The skill activates on trigger phrases including “gene regulatory network”, “GRNBoost2”, and “transcription factor inference”.
What a session looks like
A typical session has three phases:
- Input preparation. Claude shapes your expression matrix into the required format (genes as columns, cells/samples as rows) and loads a transcription factor list from a reference database or user-supplied file.
- GRN inference. Claude configures a Dask LocalCluster (or remote cluster) and runs GRNBoost2, setting the seed parameter for reproducibility and num_workers to match available cores.
- Output handling. Claude saves the resulting adjacency DataFrame as a TSV, filters to a top-N edges threshold, and optionally converts to a networkx graph for downstream topology analysis.
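The three phases above can be sketched as a single function. This is a rough illustration under stated assumptions, not the skill's own code: the function name `infer_network`, its parameters, and the file layout (a tab-separated matrix whose first column holds cell IDs, plus a plain-text TF list with one name per line) are hypothetical. The worker count is passed to Dask's `LocalCluster` as `n_workers` here, which is one plausible way to match available cores.

```python
def infer_network(expression_tsv, tf_list_path, output_tsv, n_workers=4, seed=777):
    """Run GRNBoost2 on a cells-by-genes TSV and write the ranked edge list."""
    # Imports are local so this module loads without the heavy dependencies.
    import pandas as pd
    from arboreto.algo import grnboost2
    from arboreto.utils import load_tf_names
    from distributed import Client, LocalCluster

    # Input preparation: rows are cells/samples, columns are genes.
    # Assumption: the first column of the file holds cell/sample IDs.
    expression = pd.read_csv(expression_tsv, sep="\t", index_col=0)
    tf_names = load_tf_names(tf_list_path)

    # GRN inference on a local Dask cluster sized to the available cores.
    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
    client = Client(cluster)
    try:
        network = grnboost2(
            expression_data=expression,
            tf_names=tf_names,
            client_or_address=client,
            seed=seed,  # fixed seed for reproducible importance scores
        )
    finally:
        # Release the workers even if inference fails.
        client.close()
        cluster.close()

    # Output handling: one TF, target, importance row per edge.
    network.to_csv(output_tsv, sep="\t", index=False)
    return network
```

Because Dask starts worker processes, the call is best made from under an `if __name__ == "__main__":` guard in a script, for example `infer_network("expression.tsv", "tfs.txt", "network.tsv")`, where all three paths are placeholders.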
Receipts
Honest reporting on what arboreto handles well and where it falls short:
Where it works well:
- Scaling GRN inference across a multi-core workstation without cluster overhead
- Producing reproducible results when seed is fixed — important for comparative experiments
- Serving as the first step in a SCENIC pipeline, where the adjacency output feeds directly into pySCENIC
Where it backfires:
- Very large gene sets (>20,000 genes) with many cells can require substantial memory even with Dask parallelism
- The resulting network is correlation-based; downstream biological interpretation requires additional filtering and validation
Pattern that works: pre-filter to highly variable genes before running arboreto — restricting to the top 2,000–5,000 HVGs dramatically cuts runtime with minimal information loss for TF inference.
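A minimal sketch of that pre-filtering step follows. It ranks genes by plain variance across cells, which is a simple stand-in for a proper HVG method (dedicated tools such as scanpy's `highly_variable_genes` also account for the mean-variance trend). The helper names and the toy matrix are hypothetical, and it uses only the standard library to stay self-contained.

```python
from statistics import pvariance


def top_variable_genes(expression, gene_names, k):
    """Return the k gene names with the highest variance across cells.

    expression is a list of rows (one per cell), each holding one value per gene.
    """
    columns = zip(*expression)  # one tuple of values per gene
    variances = {gene: pvariance(values) for gene, values in zip(gene_names, columns)}
    ranked = sorted(gene_names, key=lambda gene: variances[gene], reverse=True)
    return ranked[:k]


def keep_genes(expression, gene_names, keep):
    """Subset the matrix to the chosen genes, preserving original column order."""
    wanted = set(keep)
    idx = [i for i, gene in enumerate(gene_names) if gene in wanted]
    subset = [[row[i] for i in idx] for row in expression]
    return subset, [gene_names[i] for i in idx]


# Toy data: 4 cells x 4 genes (values are made up).
genes = ["g1", "g2", "g3", "g4"]
matrix = [
    [1.0, 5.0, 0.0, 2.0],
    [1.0, 0.0, 0.0, 2.5],
    [1.0, 9.0, 4.0, 2.0],
    [1.0, 2.0, 8.0, 2.5],
]

hvgs = top_variable_genes(matrix, genes, k=2)
filtered, filtered_genes = keep_genes(matrix, genes, hvgs)
```

With the expression matrix in a pandas DataFrame (genes as columns), the equivalent selection would be roughly `expr.loc[:, expr.var().nlargest(k).index]`. Depending on the analysis, it may also make sense to keep the transcription factors of interest in the matrix even when they are not among the most variable genes.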
Source and attribution
Originally authored by K-Dense, Inc. The canonical SKILL.md lives in the arboreto folder of the scientific-agent-skills repository.
License: MIT. Install, adapt, and redistribute with attribution preserved.
This page documents the skill from a practitioner’s perspective. For the formal spec and updates, defer to the source repo.