# datamol

> Pythonic wrapper around RDKit with a simplified interface and sensible defaults. Preferred for standard drug discovery tasks, including SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, and parallel processing.

**Use case**: Drug discovery cheminformatics with a clean Pythonic RDKit wrapper

**Canonical URL**: https://agentcookbooks.com/skills/datamol/

**Topics**: claude-code, skills, science, cheminformatics

**Trigger phrases**: "standardize SMILES", "datamol fingerprints", "molecular clustering", "drug-like properties", "batch molecular processing"

**Source**: [K-Dense AI](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/datamol)

**License**: Apache-2.0

---

## What it does

`datamol` is a Claude Code skill from K-Dense AI's [scientific-agent-skills repo](https://github.com/K-Dense-AI/scientific-agent-skills). It turns Claude into an expert in datamol, the Pythonic wrapper around RDKit that makes standard drug discovery workflows cleaner and less error-prone: SMILES parsing with automatic sanitization, molecular standardization, descriptor computation, fingerprint generation, clustering by Tanimoto similarity, 3D conformer generation, and parallel processing across compound libraries.

A session produces concise Python code for the common cheminformatics operations that would otherwise need verbose RDKit boilerplate. datamol reduces these to one-liners with sensible defaults while still returning native `rdkit.Chem.Mol` objects, so the results can be passed straight to RDKit functions.

## When to use it

Reach for it when:

- You're doing standard drug discovery cheminformatics (SMILES parsing, standardization, descriptor computation, clustering) and want clean, Pythonic code rather than verbose RDKit boilerplate
- You need to process a compound library in parallel — datamol's `dm.parallelized()` wraps joblib for easy multiprocessing
- You want automatic error handling: parsing calls such as `dm.to_mol()` return `None` for molecules that fail, so a batch keeps running rather than crashing on the first invalid SMILES

When *not* to reach for it:

- Non-standard sanitization, unusual valence handling, or algorithm-level control — use `rdkit` directly
- Molecular ML with featurization and training — use `deepchem` or `molfeat`

## Install

Copy the `SKILL.md` from K-Dense AI's [datamol folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/datamol) into `.claude/skills/datamol/` in your project. Then install the datamol library itself with `pip install datamol` or `conda install -c conda-forge datamol`.

Trigger phrases: "standardize SMILES", "datamol fingerprints", "molecular clustering", "batch molecular processing".

## What a session looks like

A typical session has three phases:

1. **Input loading.** Provide a list of SMILES strings or an SDF file path. Claude parses SMILES with `dm.to_mol()` (or reads SDF files with `dm.read_sdf()`) and applies `dm.standardize_mol()`, which covers functional-group normalization and reionization, with optional metal disconnection and uncharging. Salt removal and tautomer canonicalization are separate calls in datamol, so ask for them explicitly if your library needs them.
2. **Core operation.** The requested computation runs: `dm.descriptors.batch_compute_many_descriptors()` for descriptor DataFrames, `dm.to_fp()` for fingerprint arrays, `dm.cluster_mols()` for Butina clustering, or `dm.conformers.generate()` for 3D structures.
3. **Output.** Results are returned as pandas DataFrames or NumPy arrays compatible with scikit-learn and other downstream tools. Invalid molecules are flagged in the output with a `mol_valid` boolean column.

## Receipts

**Where it works well:**
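- Library standardization before any ML or analysis: `dm.standardize_mol()` bundles normalization, reionization, and optional charge neutralization (`uncharge=True`) into one call, steps that hand-written RDKit code tends to apply inconsistently. Fragment selection and isotope handling are separate steps
- Parallel descriptor computation on large compound libraries: `dm.parallelized()` spreads the work across cores by setting a single `n_jobs` argument, and failures surface as `None` entries when the mapped function returns `None` rather than raising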

**Where it backfires:**
- datamol's high-level interface can make failures harder to debug, because an operation that fails may simply return `None`; you have to filter the output for `None` explicitly
- Less flexible than RDKit for any operation that requires access to the low-level C++ API or non-default algorithm parameters

**Pattern that works:** always run `dm.sanitize_mol()` before `dm.to_fp()` on external compound data — unsanitized molecules produce fingerprints that look valid but are computed on a corrupt molecular graph.

## Source and attribution

Originally authored by [K-Dense Inc.](https://github.com/K-Dense-AI). The canonical SKILL.md lives in the [`datamol` folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/datamol) of their public scientific-agent-skills repository.

License: Apache-2.0. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner's perspective. For the formal spec and any updates, defer to the source repo.