datamol

Pythonic wrapper around RDKit with a simplified interface and sensible defaults. Preferred for standard drug discovery work, including SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, and parallel processing.

Drug discovery cheminformatics with a clean Pythonic RDKit wrapper

Source: K-Dense AI

License: Apache-2.0

First documented: 2026-04-28

Science: Cheminformatics

Trigger phrases

Phrases that activate this skill when typed to Claude Code:

standardize SMILES
datamol fingerprints
molecular clustering
drug-like properties
batch molecular processing

What it does

datamol is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It gives Claude working knowledge of datamol, the Pythonic wrapper around RDKit that makes standard drug discovery workflows cleaner and less error-prone: SMILES parsing with automatic sanitization, molecular standardization, descriptor computation, fingerprint generation, Tanimoto similarity clustering, 3D conformer generation, and parallel processing across compound libraries.

A session produces concise Python code for common cheminformatics operations that would otherwise need verbose RDKit boilerplate. datamol reduces these to one-liners with sensible defaults and still returns native rdkit.Chem.Mol objects, so the results remain compatible with plain RDKit code.

When to use it

Reach for it when:

You’re doing standard drug discovery cheminformatics (SMILES parsing, standardization, descriptor computation, clustering) and want clean, Pythonic code rather than verbose RDKit boilerplate
You need to process a compound library in parallel — datamol’s dm.parallelized() wraps joblib for easy multiprocessing
You want tolerant error handling: dm.to_mol() returns None for a SMILES string it cannot parse instead of raising, so a batch can continue past invalid entries rather than crashing on the first one
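The None-on-failure behaviour described above can be turned into an explicit split between usable and rejected inputs. The following is a minimal sketch, not the skill's own code: partition_smiles and its parse parameter are hypothetical names introduced here, and the only datamol call assumed is dm.to_mol(), which returns None for a SMILES string it cannot parse.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


def partition_smiles(
    smiles: Iterable[str],
    parse: Optional[Callable[[str], Any]] = None,
) -> Tuple[List[Tuple[str, Any]], List[str]]:
    """Split inputs into (smiles, mol) pairs that parsed and SMILES that did not."""
    if parse is None:
        # Deferred import so the helper can be exercised without RDKit installed.
        import datamol as dm

        parse = dm.to_mol  # assumption: returns None when parsing fails

    valid: List[Tuple[str, Any]] = []
    failed: List[str] = []
    for smi in smiles:
        mol = parse(smi)
        if mol is None:
            failed.append(smi)  # keep the raw string for later inspection
        else:
            valid.append((smi, mol))
    return valid, failed
```

With datamol installed, calling partition_smiles(["CCO", "C1CC("]) should place the second string, which has an unclosed ring and branch, in the failed list. Keeping the rejected strings rather than silently dropping them makes it easier to report how much of a library was lost at parse time.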

When not to reach for it:

Non-standard sanitization, unusual valence handling, or algorithm-level control — use rdkit directly
Molecular ML with featurization and training — use deepchem or molfeat

Install

Copy the SKILL.md from K-Dense AI’s datamol folder into .claude/skills/datamol/ in your project. Install the library itself with pip install datamol or conda install -c conda-forge datamol.


What a session looks like

A typical session has three phases:

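Input loading. Provide a list of SMILES strings or an SDF file path. Claude parses SMILES with dm.to_mol() (SDF files can be read with dm.read_sdf()) and applies dm.standardize_mol(), which by default normalizes functional groups and reionizes charges. Salt or fragment stripping and tautomer canonicalization are separate calls in datamol, such as dm.keep_largest_fragment() and dm.canonical_tautomer().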
Core operation. The requested computation runs: dm.descriptors.batch_compute_many_descriptors() for descriptor DataFrames, dm.to_fp() for fingerprint arrays, dm.cluster_mols() for Butina clustering, or dm.conformers.generate() for 3D structures.
Output. Results are returned as pandas DataFrames or numpy arrays compatible with scikit-learn and other downstream tools. Invalid molecules are flagged in the output with a mol_valid boolean column.
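The loading and output phases above can be sketched as follows. This is an illustration under stated assumptions, not the code the skill actually emits: standardize_smiles_or_none and build_records are hypothetical helper names, the sequence of datamol calls (dm.to_mol, dm.sanitize_mol, dm.standardize_mol, dm.to_smiles) is one plausible ordering, and the mol_valid key mirrors the column name mentioned in the text.

```python
from typing import Callable, Dict, Iterable, List, Optional


def standardize_smiles_or_none(smi: str) -> Optional[str]:
    """Parse, sanitize, and standardize one SMILES; return None on failure."""
    import datamol as dm  # deferred so this module imports without RDKit

    mol = dm.to_mol(smi)
    if mol is None:
        return None
    mol = dm.sanitize_mol(mol)  # assumption: None when sanitization fails
    if mol is None:
        return None
    mol = dm.standardize_mol(mol)
    return dm.to_smiles(mol)


def build_records(
    smiles: Iterable[str],
    standardize: Callable[[str], Optional[str]] = standardize_smiles_or_none,
) -> List[Dict[str, object]]:
    """One record per input, flagged with mol_valid instead of being dropped."""
    records: List[Dict[str, object]] = []
    for smi in smiles:
        try:
            std = standardize(smi)
        except Exception:
            std = None  # treat any raised error as an invalid molecule
        records.append(
            {"smiles": smi, "standard_smiles": std, "mol_valid": std is not None}
        )
    return records
```

The returned list of dicts can be passed directly to pandas.DataFrame(records), which keeps invalid rows visible in the table rather than changing its length.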

Receipts

Where it works well:

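Library standardization before any ML or analysis: dm.standardize_mol() applies normalization, reionization, and (with uncharge=True) charge neutralization in one call, steps that manual RDKit code tends to handle inconsistently. Fragment selection is a separate call (dm.keep_largest_fragment()), and isotope handling should be checked explicitly rather than assumed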
Parallel descriptor computation on large compound libraries — dm.parallelized() gives multicore speedup with one additional argument and proper error collection
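A parallel descriptor run along the lines described above might look like the sketch below. It assumes dm.parallelized(fn, inputs, n_jobs=...) and dm.descriptors.compute_many_descriptors(mol) as the datamol calls. The names descriptors_or_none, run_serial, and compute_descriptors are hypothetical, and the runner parameter is added here only so a plain loop can be swapped in.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional


def descriptors_or_none(smi: str) -> Optional[Dict[str, Any]]:
    """Worker: descriptor dict for one SMILES, or None if it cannot be parsed."""
    import datamol as dm  # deferred; each worker process imports it on first use

    mol = dm.to_mol(smi)
    if mol is None:
        return None
    return dm.descriptors.compute_many_descriptors(mol)


def run_serial(fn: Callable, inputs: List[Any], n_jobs: int = 1) -> List[Any]:
    """Plain loop with the same call shape; n_jobs is accepted but ignored."""
    return [fn(x) for x in inputs]


def compute_descriptors(
    smiles: Iterable[str],
    worker: Callable[[str], Any] = descriptors_or_none,
    runner: Optional[Callable] = None,
    n_jobs: int = -1,
) -> List[Any]:
    """Map worker over smiles, in parallel by default."""
    if runner is None:
        import datamol as dm

        runner = dm.parallelized  # joblib-backed map over the inputs
    return runner(worker, list(smiles), n_jobs=n_jobs)
```

Passing runner=run_serial runs the same worker in the calling process, which can make a stack trace or a stray None easier to trace before switching back to the parallel default.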

Where it backfires:

datamol’s high-level interface can make silent failures harder to debug: a failed molecular operation returns None rather than raising, so the output needs explicit filtering for None
Less flexible than RDKit for any operation that requires access to the low-level C++ API or non-default algorithm parameters

Pattern that works: always run dm.sanitize_mol() before dm.to_fp() on external compound data — unsanitized molecules produce fingerprints that look valid but are computed on a corrupt molecular graph.
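The sanitize-then-fingerprint pattern can be wrapped in a small guard. This is a sketch under the assumption that dm.sanitize_mol() returns None when sanitization fails and that dm.to_fp() accepts an fp_type argument. The function name fingerprint_or_none is hypothetical.

```python
from typing import Any, Optional


def fingerprint_or_none(mol: Any, fp_type: str = "ecfp") -> Optional[Any]:
    """Sanitize first; fingerprint only molecules that survive sanitization."""
    if mol is None:
        return None

    import datamol as dm  # deferred import; needs datamol and RDKit at call time

    clean = dm.sanitize_mol(mol)  # assumption: None when sanitization fails
    if clean is None:
        return None
    return dm.to_fp(clean, fp_type=fp_type)  # array-like fingerprint
```

Returning None for molecules that fail sanitization keeps corrupt graphs out of the fingerprint matrix; callers still need to filter those entries before stacking the results into an array.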

Source and attribution

Originally authored by K-Dense Inc. The canonical SKILL.md lives in the datamol folder of their public scientific-agent-skills repository.

License: Apache-2.0. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.