molfeat
Molecular featurization for ML with 100+ featurizers — ECFP, MACCS, physicochemical descriptors, and pretrained models (ChemBERTa) to convert SMILES to feature vectors for QSAR and molecular machine learning pipelines.
Convert SMILES to ML-ready features with 100+ featurizers
Trigger phrases
Phrases that activate this skill when typed to Claude Code:
- "featurize molecules for ML"
- "ECFP fingerprints"
- "ChemBERTa embeddings"
- "molecular features for QSAR"
- "molfeat featurizer"
What it does
molfeat is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into a molfeat expert for molecular featurization — covering 100+ featurizers under a unified API: circular fingerprints (ECFP2/4/6, Morgan), MACCS keys, RDKit physicochemical descriptors, pretrained molecular language models (ChemBERTa, MolBERT), graph features, and pharmacophore fingerprints — all converting SMILES strings to numpy arrays or DataFrames ready for ML pipelines.
A session produces Python code that takes SMILES inputs and returns feature matrices in the format needed for scikit-learn, PyTorch, or any other ML framework — without assembling RDKit or model code manually.
When to use it
Reach for it when:
- You want to compare multiple featurization approaches (fingerprints vs. descriptors vs. pretrained embeddings) on the same task with a consistent API
- You’re building a QSAR model and need to convert SMILES to features for scikit-learn or other tabular ML
- You want pretrained molecular language model embeddings (ChemBERTa) without managing the Hugging Face model code yourself
When not to reach for it:
- Full end-to-end molecular ML with a training loop and benchmarks — use deepchem
- Low-level fingerprint computation where you need RDKit parameter control — use rdkit directly
Install
Copy the SKILL.md from K-Dense AI’s molfeat folder into .claude/skills/molfeat/ in your project. Install via pip install molfeat. Pretrained model featurizers require additional dependencies (PyTorch, transformers) specified when the featurizer is loaded.
What a session looks like
A typical session has three phases:
- Featurizer selection. Describe the downstream ML task and any computational constraints. Claude recommends an appropriate featurizer — ECFP4 for fast fingerprint-based models, ChemBERTa for transfer learning from large chemical datasets — and explains the trade-offs.
- Featurization. Claude initializes the featurizer and runs it on the SMILES list, handling invalid molecules gracefully and returning a feature matrix. For pretrained models, Claude adds the GPU/CPU device configuration.
- Pipeline integration. The feature matrix is formatted for the downstream framework — a numpy array for scikit-learn, a torch.Tensor for PyTorch — with the featurizer object serialized for applying the same featurization to new data during inference.
Receipts
Where it works well:
- Side-by-side comparison of multiple featurizers on the same QSAR task — the unified API means switching from ECFP to ChemBERTa is one line, making benchmark comparisons straightforward
- ChemBERTa embeddings for chemical space visualization — the pretrained embeddings capture semantic molecular similarity that circular fingerprints miss for structurally diverse compounds
Where it backfires:
- Pretrained model featurizers download large model weights on first use; this can fail in offline environments or slow network conditions without warning
- Not all 100+ featurizers are equally well-tested — less common featurizers may have undocumented edge cases with unusual molecular inputs
Pattern that works: always include ECFP4 as a baseline featurizer when benchmarking — it’s fast, well-understood, and competitive enough that models built on more complex featurizers should be clearly better to justify the additional complexity.
Source and attribution
Originally authored by K-Dense Inc. The canonical SKILL.md lives in the molfeat folder of their public scientific-agent-skills repository.
License: Apache-2.0. Install, adapt, and redistribute with attribution preserved.
This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.