molfeat

Molecular featurization for ML with 100+ featurizers — ECFP, MACCS, physicochemical descriptors, and pretrained models (ChemBERTa) — converting SMILES to feature vectors for QSAR and molecular machine learning pipelines.

Convert SMILES to ML-ready features with 100+ featurizers

Source K-Dense AI
License Apache-2.0
First documented

Trigger phrases

Phrases that activate this skill when typed to Claude Code:

  • featurize molecules for ML
  • ECFP fingerprints
  • ChemBERTa embeddings
  • molecular features for QSAR
  • molfeat featurizer

What it does

molfeat is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into a molfeat expert for molecular featurization — covering 100+ featurizers under a unified API: circular fingerprints (ECFP2/4/6, Morgan), MACCS keys, RDKit physicochemical descriptors, pretrained molecular language models (ChemBERTa, MolBERT), graph features, and pharmacophore fingerprints — all converting SMILES strings to numpy arrays or DataFrames ready for ML pipelines.

A session produces Python code that takes SMILES inputs and returns feature matrices in the format needed for scikit-learn, PyTorch, or any other ML framework — without assembling RDKit or model code manually.

When to use it

Reach for it when:

  • You want to compare multiple featurization approaches (fingerprints vs. descriptors vs. pretrained embeddings) on the same task with a consistent API
  • You’re building a QSAR model and need to convert SMILES to features for scikit-learn or other tabular ML
  • You want pretrained molecular language model embeddings (ChemBERTa) without managing the Hugging Face model code yourself

When not to reach for it:

  • Full end-to-end molecular ML with training loop and benchmarks — use deepchem
  • Low-level fingerprint computation where you need RDKit parameter control — use rdkit directly

Install

Copy the SKILL.md from K-Dense AI’s molfeat folder into .claude/skills/molfeat/ in your project. Install via pip install molfeat. Pretrained model featurizers require additional dependencies (PyTorch, transformers) specified when the featurizer is loaded.


What a session looks like

A typical session has three phases:

  1. Featurizer selection. Describe the downstream ML task and any computational constraints. Claude recommends an appropriate featurizer — ECFP4 for fast fingerprint-based models, ChemBERTa for transfer learning from large chemical datasets — and explains the trade-offs.
  2. Featurization. Claude initializes the featurizer and runs it on the SMILES list, handling invalid molecules gracefully and returning a feature matrix. For pretrained models, Claude adds the GPU/CPU device configuration.
  3. Pipeline integration. The feature matrix is formatted for the downstream framework — a numpy array for scikit-learn, a torch.Tensor for PyTorch — with the featurizer object serialized for applying the same featurization to new data during inference.

Receipts

Where it works well:

  • Side-by-side comparison of multiple featurizers on the same QSAR task — the unified API means switching from ECFP to ChemBERTa is one line, making benchmark comparisons straightforward
  • ChemBERTa embeddings for chemical space visualization — the pretrained embeddings capture semantic molecular similarity that circular fingerprints miss for structurally diverse compounds

Where it backfires:

  • Pretrained model featurizers download large model weights on first use; in offline environments or on slow networks this can fail without a clear warning
  • Not all 100+ featurizers are equally well-tested — less common featurizers may have undocumented edge cases with unusual molecular inputs

Pattern that works: always include ECFP4 as a baseline featurizer when benchmarking — it’s fast, well-understood, and competitive enough that models built on more complex featurizers should be clearly better to justify the additional complexity.

Source and attribution

Originally authored by K-Dense Inc. The canonical SKILL.md lives in the molfeat folder of their public scientific-agent-skills repository.

License: Apache-2.0. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.