datamol

Pythonic wrapper around RDKit with a simplified interface and sensible defaults. Preferred for standard drug discovery tasks, including SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, and parallel processing.

Drug discovery cheminformatics with a clean Pythonic RDKit wrapper

Source K-Dense AI
License Apache-2.0
First documented

Trigger phrases

Phrases that activate this skill when typed to Claude Code:

  • standardize SMILES
  • datamol fingerprints
  • molecular clustering
  • drug-like properties
  • batch molecular processing

What it does

datamol is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It gives Claude working knowledge of datamol, the Pythonic wrapper around RDKit that makes standard drug discovery workflows cleaner and less error-prone. Covered operations are SMILES parsing with automatic sanitization, molecular standardization, descriptor computation, fingerprint generation, Tanimoto similarity clustering, 3D conformer generation, and parallel processing across compound libraries.

A session produces concise Python code for common cheminformatics operations that would otherwise need verbose RDKit boilerplate. datamol reduces many of these to one-liners with sensible defaults and still returns native rdkit.Chem.Mol objects, so the results stay compatible with plain RDKit code.

When to use it

Reach for it when:

  • You’re doing standard drug discovery cheminformatics (SMILES parsing, standardization, descriptor computation, clustering) and want clean, Pythonic code rather than verbose RDKit boilerplate
  • You need to process a compound library in parallel — datamol’s dm.parallelized() wraps joblib for easy multiprocessing
  • You want forgiving error handling: dm.to_mol() returns None for a SMILES it cannot parse, so a batch keeps running instead of crashing on the first invalid input
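
The last two points combine naturally: parse a library in parallel, then separate what parsed from what did not. The following is a minimal sketch, not the skill’s own code. parse_library is a hypothetical helper name, and the parallel_map and to_mol parameters are assumptions added only so the logic can be run with stand-ins; left unset, they resolve to dm.parallelized and dm.to_mol.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple


def parse_library(
    smiles: Sequence[str],
    n_jobs: int = -1,
    parallel_map: Optional[Callable] = None,
    to_mol: Optional[Callable] = None,
) -> Tuple[List[Tuple[str, Any]], List[str]]:
    """Parse SMILES in parallel; return (parsed pairs, failed SMILES)."""
    if parallel_map is None or to_mol is None:
        # Lazy import so the helper can be exercised with stand-ins
        # on a machine where datamol is not installed.
        import datamol as dm

        parallel_map = parallel_map or dm.parallelized
        to_mol = to_mol or dm.to_mol

    # One result per input, in input order; unparsable entries are None.
    mols = parallel_map(to_mol, list(smiles), n_jobs=n_jobs)

    parsed = [(s, m) for s, m in zip(smiles, mols) if m is not None]
    failed = [s for s, m in zip(smiles, mols) if m is None]
    return parsed, failed
```

With datamol installed, parse_library(["CCO", "c1ccccc1", "not_a_smiles"]) would be expected to return two parsed pairs and report the third string as failed. Keeping the failed inputs, rather than silently dropping them, makes it easier to audit a library afterwards.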

When not to reach for it:

  • Non-standard sanitization, unusual valence handling, or algorithm-level control — use rdkit directly
  • Molecular ML with featurization and training — use deepchem or molfeat

Install

Copy the SKILL.md from K-Dense AI’s datamol folder into .claude/skills/datamol/ in your project. Install the datamol library itself with pip install datamol or conda install -c conda-forge datamol.


What a session looks like

A typical session has three phases:

  1. Input loading. Provide a list of SMILES strings or an SDF file path. Claude parses SMILES with dm.to_mol() (SDF files are read with dm.read_sdf()) and applies dm.standardize_mol() to normalize functional groups and charge states. Salt stripping and tautomer canonicalization are not part of dm.standardize_mol()’s defaults and need their own steps.
  2. Core operation. The requested computation runs: dm.descriptors.batch_compute_many_descriptors() for descriptor DataFrames, dm.to_fp() for fingerprint arrays, dm.cluster_mols() for Butina clustering, or dm.conformers.generate() for 3D structures.
  3. Output. Results are returned as pandas DataFrames or numpy arrays compatible with scikit-learn and other downstream tools. Invalid molecules are flagged in the output with a mol_valid boolean column.
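
The three phases can be sketched in miniature as follows. This is an illustration under stated assumptions rather than the code the skill emits: build_records is a hypothetical helper, the dm parameter exists only so a stand-in object can replace the datamol module, and the per-molecule call dm.descriptors.compute_many_descriptors() is used here in place of the batch variant to keep the loop explicit. The mol_valid key mirrors the column described in step 3.

```python
from typing import Any, Dict, List, Sequence


def build_records(smiles: Sequence[str], dm: Any = None) -> List[Dict[str, Any]]:
    """Parse, standardize, and describe each SMILES; flag failures."""
    if dm is None:
        # Lazy import; pass a stand-in object to run without datamol.
        import datamol as dm

    records: List[Dict[str, Any]] = []
    for smi in smiles:
        # Phase 1: input loading and standardization.
        mol = dm.to_mol(smi)
        if mol is not None:
            mol = dm.standardize_mol(mol)

        if mol is None:
            # Keep the row so the caller can see which inputs failed.
            records.append({"smiles": smi, "mol_valid": False})
            continue

        # Phase 2: core operation (descriptors as a plain dict).
        record = {
            "smiles": smi,
            "mol_valid": True,
            "standard_smiles": dm.to_smiles(mol),
        }
        record.update(dm.descriptors.compute_many_descriptors(mol))

        # Phase 3: one output row per input molecule.
        records.append(record)
    return records
```

Passing the result to pandas.DataFrame(records) yields one row per input; rows for invalid molecules carry missing values in the descriptor columns.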

Receipts

Where it works well:

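  • Library standardization before any ML or analysis: dm.standardize_mol() bundles RDKit’s normalization, reionization, and optional charge neutralization (uncharge=True) behind one call, steps that hand-written RDKit code tends to apply inconsistently. Fragment selection is a separate call (dm.keep_largest_fragment()), and isotope removal also needs its own step
  • Parallel descriptor computation on large compound libraries: dm.parallelized() spreads the work across cores through its n_jobs argument, provided the mapped function handles failed molecules itself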

Where it backfires:

  • datamol’s high-level interface can make failures harder to debug: a failed operation returns None rather than raising, so the output has to be checked for None and filtered explicitly
  • Less flexible than RDKit for any operation that requires access to the low-level C++ API or non-default algorithm parameters

Pattern that works: always run dm.sanitize_mol() before dm.to_fp() on external compound data. Unsanitized molecules can produce fingerprints that look valid but are computed on a corrupt molecular graph.
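
A minimal sketch of that pattern, assuming the input is a list of already-loaded molecules that may contain None entries. fingerprints_after_sanitize is a hypothetical helper name, and the dm parameter is an assumption added so a stand-in can replace the datamol module; the only datamol calls used are dm.sanitize_mol() and dm.to_fp().

```python
from typing import Any, List, Sequence, Tuple


def fingerprints_after_sanitize(
    mols: Sequence[Any], dm: Any = None
) -> Tuple[List[Tuple[int, Any]], List[int]]:
    """Return ((input index, fingerprint) pairs, skipped input indices)."""
    if dm is None:
        # Lazy import; pass a stand-in object to run without datamol.
        import datamol as dm

    fingerprints: List[Tuple[int, Any]] = []
    skipped: List[int] = []
    for idx, mol in enumerate(mols):
        # sanitize_mol returns None when the molecule cannot be sanitized.
        clean = dm.sanitize_mol(mol) if mol is not None else None
        if clean is None:
            skipped.append(idx)
            continue
        # Fingerprint only molecules that passed sanitization.
        fingerprints.append((idx, dm.to_fp(clean)))
    return fingerprints, skipped
```

Returning the original index with each fingerprint keeps the output aligned with the input library even after some molecules are skipped.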

Source and attribution

Originally authored by K-Dense Inc. The canonical SKILL.md lives in the datamol folder of their public scientific-agent-skills repository.

License: Apache-2.0. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.