# molfeat

> Molecular featurization for ML with 100+ featurizers — ECFP, MACCS, physicochemical descriptors, and pretrained models (ChemBERTa) to convert SMILES to feature vectors for QSAR and molecular machine learning pipelines.

**Use case**: Convert SMILES to ML-ready features with 100+ featurizers

**Canonical URL**: https://agentcookbooks.com/skills/molfeat/

**Topics**: claude-code, skills, science, cheminformatics

**Trigger phrases**: "featurize molecules for ML", "ECFP fingerprints", "ChemBERTa embeddings", "molecular features for QSAR", "molfeat featurizer"

**Source**: [K-Dense AI](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/molfeat)

**License**: Apache-2.0

---

## What it does

`molfeat` is a Claude Code skill from K-Dense AI's [scientific-agent-skills repo](https://github.com/K-Dense-AI/scientific-agent-skills). It turns Claude into a molfeat expert for molecular featurization — covering 100+ featurizers under a unified API: circular fingerprints (ECFP2/4/6, Morgan), MACCS keys, RDKit physicochemical descriptors, pretrained molecular language models (ChemBERTa, MolBERT), graph features, and pharmacophore fingerprints — all converting SMILES strings to numpy arrays or DataFrames ready for ML pipelines.

A session produces Python code that takes SMILES inputs and returns feature matrices in the format needed for scikit-learn, PyTorch, or any other ML framework — without assembling RDKit or model code manually.
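A minimal sketch of that core loop, assuming only `pip install molfeat` and using the `"ecfp"` calculator name from molfeat's built-in registry (the SMILES list is a placeholder):

```python
import numpy as np
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # ethanol, benzene, aspirin

calc = FPCalculator("ecfp")                    # ECFP fingerprint calculator
featurizer = MoleculeTransformer(calc, n_jobs=-1)
X = np.array(featurizer(smiles))               # one fingerprint row per molecule
print(X.shape)
```

Swapping `"ecfp"` for another calculator name keeps the rest of the pipeline unchanged, which is the point of the unified API.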

## When to use it

Reach for it when:

- You want to compare multiple featurization approaches (fingerprints vs. descriptors vs. pretrained embeddings) on the same task with a consistent API
- You're building a QSAR model and need to convert SMILES to features for scikit-learn or other tabular ML
- You want pretrained molecular language model embeddings (ChemBERTa) without managing the Hugging Face model code yourself

When *not* to reach for it:

- Full end-to-end molecular ML with training loop and benchmarks — use `deepchem`
- Low-level fingerprint computation where you need RDKit parameter control — use `rdkit` directly (a quick sketch of what that looks like follows this list)
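For contrast, a hypothetical look at the RDKit-direct route, where Morgan fingerprint parameters such as radius, bit length, and chirality handling are set explicitly rather than through a wrapper:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
# Explicit control over radius, bit length, and chirality flags
fp = AllChem.GetMorganFingerprintAsBitVect(
    mol, radius=3, nBits=4096, useChirality=True
)
print(fp.GetNumOnBits())
```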

## Install

Copy the `SKILL.md` from K-Dense AI's [molfeat folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/molfeat) into `.claude/skills/molfeat/` in your project. Install the library with `pip install molfeat`. Pretrained model featurizers require extra dependencies (PyTorch, `transformers`) that are only needed when those featurizers are loaded.

Once installed, the skill triggers on phrases like "featurize molecules for ML", "ECFP fingerprints", "ChemBERTa embeddings", and "molecular features for QSAR".

## What a session looks like

A typical session has three phases (a code sketch of the last two follows the list):

1. **Featurizer selection.** Describe the downstream ML task and any computational constraints. Claude recommends an appropriate featurizer — ECFP4 for fast fingerprint-based models, ChemBERTa for transfer learning from large chemical datasets — and explains the trade-offs.
2. **Featurization.** Claude initializes the featurizer and runs it on the SMILES list, handling invalid molecules gracefully and returning a feature matrix. For pretrained models, Claude adds the GPU/CPU device configuration.
3. **Pipeline integration.** The feature matrix is formatted for the downstream framework — a numpy array for scikit-learn, a torch.Tensor for PyTorch — and the featurizer object is serialized so the same featurization can be applied to new data at inference time.
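A hedged sketch of phases 2 and 3, using the `PretrainedHFTransformer` class and `ChemBERTa-77M-MLM` model name from molfeat's pretrained-model API; the SMILES, target values, and output filename are placeholders:

```python
import joblib
import numpy as np
from molfeat.trans.pretrained import PretrainedHFTransformer
from sklearn.ensemble import RandomForestRegressor

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
y = np.array([0.2, 1.7, 1.4, 0.6])  # placeholder target values

# Phase 2: pretrained ChemBERTa embeddings. Weights are fetched from the
# molfeat model store on first use, so the first run needs network access.
featurizer = PretrainedHFTransformer(kind="ChemBERTa-77M-MLM", notation="smiles", dtype=float)
X = np.array(featurizer(smiles))  # one embedding vector per molecule

# Phase 3: feed the matrix to scikit-learn and persist the featurizer so
# inference-time molecules get exactly the same featurization.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
joblib.dump(featurizer, "featurizer.joblib")  # generic pickling of the featurizer
```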

## Receipts

**Where it works well:**
- Side-by-side comparison of multiple featurizers on the same QSAR task — the unified API means switching from ECFP to ChemBERTa is one line, making benchmark comparisons straightforward
- ChemBERTa embeddings for chemical space visualization — the pretrained embeddings capture semantic molecular similarity that circular fingerprints miss for structurally diverse compounds

**Where it backfires:**
- Pretrained model featurizers download large model weights on first use; in offline environments or on slow networks this can fail without a clear warning
- Not all 100+ featurizers are equally well-tested — less common featurizers may have undocumented edge cases with unusual molecular inputs

**Pattern that works:** always include ECFP4 as a baseline featurizer when benchmarking — it's fast, well understood, and competitive enough that a more complex featurizer has to beat it clearly to justify the extra complexity.
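A sketch of that benchmarking loop under the same assumptions as above (placeholder SMILES and targets; `"ecfp"` and `"maccs"` are calculator names from molfeat's `FPCalculator` registry):

```python
import numpy as np
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data: replace with your own SMILES and measured targets.
smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]
y = np.array([-0.8, -0.7, -0.9, 0.9, 1.1, -1.3])

for name in ["ecfp", "maccs"]:  # ECFP baseline first, then the challenger
    featurizer = MoleculeTransformer(FPCalculator(name), n_jobs=-1)
    X = np.array(featurizer(smiles))
    score = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=3).mean()
    print(f"{name}: mean CV R^2 = {score:.3f}")
```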

## Source and attribution

Originally authored by [K-Dense Inc.](https://github.com/K-Dense-AI). The canonical SKILL.md lives in the [`molfeat` folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/molfeat) of their public scientific-agent-skills repository.

License: Apache-2.0. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner's perspective. For the formal spec and any updates, defer to the source repo.