arboreto

High-performance gene regulatory network inference using GENIE3 and GRNBoost2 algorithms on single-cell or bulk expression data.

Infer gene regulatory networks at scale

Source K-Dense AI
License MIT

Trigger phrases

Phrases that activate this skill when typed to Claude Code:

  • gene regulatory network
  • GRNBoost2
  • GENIE3
  • transcription factor inference

What it does

arboreto is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into a gene regulatory network inference specialist, guiding the use of arboreto’s distributed GRNBoost2 and GENIE3 implementations to reconstruct transcription factor–target gene relationships from expression matrices.

The output of a session is a ranked edge list (TF → target gene, importance score) derived from tree-based regression ensembles (gradient boosting for GRNBoost2, random forests for GENIE3), ready to feed into downstream network analysis tools like SCENIC or networkx. Arboreto runs on Dask for parallelism, so it scales across cores or a cluster without rewriting the core algorithm.
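The edge list is a plain pandas DataFrame, so downstream filtering needs no special tooling. A minimal sketch of handling a toy result (the TF/target/importance columns match arboreto's documented output; the gene names and scores here are invented for illustration):

```python
import pandas as pd

# Toy adjacency in arboreto's output format; importance values are invented
edges = pd.DataFrame({
    "TF":         ["Gata1", "Gata1", "Spi1",  "Spi1"],
    "target":     ["Klf1",  "Zfpm1", "Cebpa", "Klf1"],
    "importance": [12.4,     3.1,     8.7,     0.9],
})

# Rank by importance and keep the strongest edges for downstream analysis
top_edges = edges.sort_values("importance", ascending=False).head(3)
print(top_edges["TF"].tolist())  # → ['Gata1', 'Spi1', 'Gata1']
```

The same DataFrame can be written to TSV or handed to a graph library without further conversion.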

When to use it

Reach for it when:

  • You have bulk or single-cell expression data and want to infer which transcription factors drive target gene expression
  • You need a scalable GRN method that runs faster than naive GENIE3 on large gene sets
  • You are building a SCENIC pipeline and need the arboreto step to produce the co-expression adjacency matrix

When not to reach for it:

  • You need causal rather than correlative network inference — arboreto is association-based
  • Your dataset is very small (fewer than a few hundred cells/samples); simpler correlation methods may suffice

Install

Copy the SKILL.md from scientific-skills/arboreto into .claude/skills/arboreto/.

The skill activates on trigger phrases including “gene regulatory network”, “GRNBoost2”, and “transcription factor inference”.
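Assuming you cloned the scientific-agent-skills repository into the current directory (the source path is illustrative; adjust it to your checkout):

```shell
# Create the skill directory and copy the spec into place if the repo is present
mkdir -p .claude/skills/arboreto
if [ -f scientific-skills/arboreto/SKILL.md ]; then
  cp scientific-skills/arboreto/SKILL.md .claude/skills/arboreto/
fi
```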

What a session looks like

A typical session has three phases:

  1. Input preparation. Claude shapes your expression matrix into the required format (genes as columns, cells/samples as rows) and loads a transcription factor list from a reference database or user-supplied file.
  2. GRN inference. Claude configures a Dask LocalCluster (or remote cluster), sizing the worker count to the available cores, and runs GRNBoost2 with a fixed seed for reproducibility.
  3. Output handling. Claude saves the resulting adjacency DataFrame as a TSV, filters it to the top-N edges by importance, and optionally converts it to a networkx graph for downstream topology analysis.
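The three phases above can be sketched end to end. This is an illustrative wrapper, not the skill's own code: `grnboost2` and `load_tf_names` are arboreto's documented entry points, while the file paths, the `top_n` cutoff, and the seed value are assumptions.

```python
import pandas as pd

def infer_grn(expr_path, tf_path, out_path, seed=777, top_n=10_000):
    """Run GRNBoost2 on a cells-x-genes TSV and write a ranked edge list."""
    # Deferred imports so this sketch only needs arboreto/dask when actually run
    from arboreto.algo import grnboost2
    from arboreto.utils import load_tf_names
    from distributed import LocalCluster, Client

    # Phase 1: input preparation -- genes as columns, cells/samples as rows
    expr = pd.read_csv(expr_path, sep="\t", index_col=0)
    tf_names = load_tf_names(tf_path)

    # Phase 2: GRN inference on a local Dask cluster; fixed seed for reproducibility
    cluster = LocalCluster()
    client = Client(cluster)
    try:
        network = grnboost2(expression_data=expr, tf_names=tf_names,
                            client_or_address=client, seed=seed)
    finally:
        client.close()
        cluster.close()

    # Phase 3: output handling -- keep the top-N edges by importance
    top = network.sort_values("importance", ascending=False).head(top_n)
    top.to_csv(out_path, sep="\t", index=False)
    return top
```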

Receipts

Honest reporting on what arboreto handles well and where it falls short:

Where it works well:

  • Scaling GRN inference across a multi-core workstation without cluster overhead
  • Producing reproducible results when seed is fixed — important for comparative experiments
  • Serving as the first step in a SCENIC pipeline, where the adjacency output feeds directly into pySCENIC

Where it backfires:

  • Very large gene sets (>20,000 genes) with many cells can require substantial memory even with Dask parallelism
  • The resulting network is correlation-based; downstream biological interpretation requires additional filtering and validation

Pattern that works: pre-filter to highly variable genes before running arboreto — restricting to the top 2,000–5,000 HVGs dramatically cuts runtime with minimal information loss for TF inference.
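A simple variance-based version of that pre-filter (scanpy's `highly_variable_genes` is the more standard route; this pandas-only sketch with an assumed cutoff of 2,000 genes shows the idea):

```python
import numpy as np
import pandas as pd

def filter_hvg(expr: pd.DataFrame, n_top: int = 2000) -> pd.DataFrame:
    """Keep the n_top most variable genes (columns) before GRN inference."""
    variances = expr.var(axis=0)
    keep = variances.nlargest(n_top).index
    return expr[keep]

# Toy cells-x-genes matrix standing in for a real expression DataFrame
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.poisson(2.0, size=(100, 5000)),
                    columns=[f"gene{i}" for i in range(5000)])

hvg = filter_hvg(expr)
print(hvg.shape)  # → (100, 2000)
```

The filtered frame drops straight into the inference step; only the gene (column) axis shrinks, so cells and TF lists are unaffected apart from TFs that fall below the variance cutoff.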

Source and attribution

Originally authored by K-Dense, Inc. The canonical SKILL.md lives in the arboreto folder of the scientific-agent-skills repository.

License: MIT. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner’s perspective. For the formal spec and updates, defer to the source repo.