# scikit-learn

> Machine learning in Python — use for supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction), model evaluation, hyperparameter tuning, preprocessing, and building ML pipelines with comprehensive algorithm reference.

**Use case**: Build and evaluate machine learning pipelines in Python

**Canonical URL**: https://agentcookbooks.com/skills/scikit-learn/

**Topics**: claude-code, skills, science, ml-libraries

**Trigger phrases**: "train a classifier", "scikit-learn pipeline", "cross-validation", "hyperparameter tuning", "fit this ML model"

**Source**: [K-Dense AI](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/scikit-learn)

**License**: MIT

---

## What it does

`scikit-learn` is a Claude Code skill from K-Dense AI's [scientific-agent-skills repo](https://github.com/K-Dense-AI/scientific-agent-skills). It turns Claude into a scikit-learn expert covering the full supervised and unsupervised ML toolkit — classification (random forest, SVM, gradient boosting, logistic regression), regression, clustering (k-means, DBSCAN, hierarchical), dimensionality reduction (PCA, t-SNE, UMAP), preprocessing pipelines, cross-validation, and hyperparameter search (GridSearchCV, RandomizedSearchCV, Optuna integration).
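Of the hyperparameter search tools listed above, `RandomizedSearchCV` is a representative example. This is a minimal sketch on a synthetic dataset — the search space and scoring choice are illustrative assumptions, not prescribed by the skill:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic binary-classification data, just for illustration
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 300),  # sampled, not enumerated like GridSearchCV
        "max_depth": randint(2, 12),
    },
    n_iter=10,           # try 10 random candidates
    cv=5,                # 5-fold cross-validation per candidate
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Swapping in `GridSearchCV` with an explicit `param_grid` exhausts the grid instead of sampling it; the fitted `search` object exposes the refit best estimator as `search.best_estimator_`.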

A session produces complete, runnable ML code: a Pipeline object that chains preprocessing and the model, cross-validation evaluation with the appropriate metrics, and either the best model artifact or a hyperparameter search setup.
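In shape, the output looks something like this minimal sketch — a scaler chained to a model inside a `Pipeline`, scored with cross-validation (the dataset and estimator choices here are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),    # preprocessing step
    ("clf", LogisticRegression()),  # model step
])

# 5-fold cross-validated ROC AUC on the whole pipeline
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())
```

The pipeline is a single serializable artifact: `joblib.dump(pipe.fit(X, y), "model.joblib")` captures preprocessing and model together.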

## When to use it

Reach for it when:

- You need a standard ML model (not deep learning) trained on tabular data with reliable performance baselines
- You're building a preprocessing + model pipeline that needs to be reproducible and easy to serialize
- You want cross-validated performance metrics and feature importances, not just a single train/test split

When *not* to reach for it:

- Deep learning on images, text, or sequences — use `transformers` or `pytorch-lightning`
- Graph-structured data — use `torch-geometric`
- Model explainability after fitting — combine with `shap`

## Install

Copy the `SKILL.md` from K-Dense AI's [scikit-learn folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/scikit-learn) into `.claude/skills/scikit-learn/` in your project.

Trigger phrases: "train a classifier", "scikit-learn pipeline", "cross-validation", "hyperparameter tuning", "fit this ML model".


## What a session looks like

A typical session has three phases:

1. **Task and data description.** Specify the ML task (binary classification, multi-class, regression), describe the feature types (numerical, categorical, text, mixed), and indicate any class imbalance or missing data concerns.
2. **Pipeline construction.** Claude writes a `sklearn.pipeline.Pipeline` with appropriate preprocessing steps (imputation, scaling, encoding) followed by the model, and sets up a cross-validation loop with the right metric (ROC AUC, F1, RMSE).
3. **Evaluation and interpretation.** Cross-validated metrics are computed, a confusion matrix or regression residuals are plotted, and feature importances are extracted where the model supports them. Claude flags if the results suggest overfitting or class imbalance problems.
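For mixed feature types, phase 2 typically yields a `ColumnTransformer` routing numerical and categorical columns through separate preprocessing branches. A hedged sketch on a toy frame (the column names and data are invented for illustration):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy mixed-type data with a missing numeric value
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29, 37, 45, 23],
    "city": ["NY", "SF", "NY", "LA", "SF", "NY", "LA", "SF"],
})
y = [0, 1, 0, 1, 0, 1, 1, 0]

preprocess = ColumnTransformer([
    # numeric branch: impute missing values, then scale
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age"]),
    # categorical branch: one-hot encode, tolerating unseen categories
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
scores = cross_val_score(pipe, df, y, cv=2, scoring="f1")
print(scores)
```

Because imputation and scaling live inside the pipeline, they are refit on each training fold, which is what keeps cross-validation honest.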

## Receipts

**Where it works well:**
- Tabular classification problems where gradient boosting (`HistGradientBoostingClassifier`, or XGBoost via its scikit-learn-compatible wrapper) consistently produces strong baselines — Claude's pipeline code handles missing values and mixed types cleanly
- Preprocessing pipelines that need to generalize from training to test data without leakage — the Pipeline object enforces this correctly

**Where it backfires:**
- Very large datasets where scikit-learn's in-memory computation is slow — Dask-ML provides distributed wrappers for some estimators
- Custom loss functions or architectures — most scikit-learn estimators don't support them without significant workarounds

**Pattern that works:** fit a dummy classifier (predicting majority class) as your baseline before any real model — it takes one line and immediately tells you if the problem has a trivial baseline you need to beat.
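The baseline pattern above can be sketched in a few lines with `DummyClassifier` (the imbalanced toy dataset is an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced toy problem: roughly 90% of samples in one class
X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)

# The one-line baseline: always predict the majority class
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5)
print(baseline.mean())
```

On data this skewed the dummy scores close to 0.9 accuracy, which is exactly the point: a real model that only reaches 0.9 has learned nothing.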

## Source and attribution

Originally authored by [K-Dense Inc.](https://github.com/K-Dense-AI). The canonical SKILL.md lives in the [`scikit-learn` folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/scikit-learn) of their public scientific-agent-skills repository.

License: MIT. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner's perspective. For the formal spec and any updates, defer to the source repo.