# exploratory-data-analysis

> Perform comprehensive exploratory data analysis on scientific data files across 200+ file formats — automatically detecting file type and generating detailed markdown reports with format-specific analysis, quality metrics, and downstream recommendations.

**Use case**: Auto-detect and profile any scientific data file format

**Canonical URL**: https://agentcookbooks.com/skills/exploratory-data-analysis/

**Topics**: claude-code, skills, science, data-science

**Trigger phrases**: "analyze this data file", "explore this dataset", "what's in this file", "data quality report", "profile this dataset"

**Source**: [K-Dense AI](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/exploratory-data-analysis)

**License**: MIT

---

## What it does

`exploratory-data-analysis` is a Claude Code skill from K-Dense AI's [scientific-agent-skills repo](https://github.com/K-Dense-AI/scientific-agent-skills). It turns Claude into a data profiler for 200+ scientific file formats, from standard CSV/Parquet to domain-specific ones like HDF5, DICOM, FASTQ, SDF (chemical structures), and mass spectrometry data. It detects the format automatically and produces a comprehensive markdown EDA report with quality metrics and recommendations for downstream analysis.

A session produces a structured EDA report: file format detection, data dimensions, missing value analysis, distribution summaries, format-specific quality metrics (e.g., read depth for FASTQ, SMILES validity for SDF), and recommended next analysis steps.
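To make the tabular parts of that report concrete, here is a minimal stdlib-only sketch of the kind of checks it covers (dimensions, missing-value counts, numeric ranges). This is an illustration of the report's contents, not the skill's actual implementation; the missing-value tokens and sample data are assumptions.

```python
import csv
import io

def profile_tabular(text):
    """Return dimensions, per-column missing counts, and numeric ranges."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = rows[0].keys() if rows else []
    report = {"n_rows": len(rows), "n_cols": len(columns), "columns": {}}
    for col in columns:
        values = [r[col] for r in rows]
        # Illustrative missing-value tokens; a real profiler uses a larger set
        missing = sum(1 for v in values if v in ("", "NA", "NaN"))
        numeric = []
        for v in values:
            try:
                numeric.append(float(v))
            except ValueError:
                pass
        summary = {"missing": missing}
        if numeric:
            summary["min"], summary["max"] = min(numeric), max(numeric)
        report["columns"][col] = summary
    return report

sample = "id,dose,unit\n1,0.5,mg\n2,,mg\n3,1.5,mg\n"
print(profile_tabular(sample))
```

The full skill goes much further (distributions, format-specific metrics), but the shape of the output is similar: one summary per column plus dataset-level dimensions.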

## When to use it

Reach for it when:

- You receive an unfamiliar data file and need to understand its structure before writing any analysis code
- You want a systematic data quality report before fitting models or running statistical tests
- You're handed a domain-specific scientific format (bioinformatics, proteomics, microscopy) and need format-appropriate quality checks

When *not* to reach for it:

- You already know your data structure and want to build specific visualizations — use `seaborn` or `matplotlib`
- You want a statistical model fit — use `statsmodels` or `scikit-learn`

## Install

Copy the `SKILL.md` from K-Dense AI's [exploratory-data-analysis folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/exploratory-data-analysis) into `.claude/skills/exploratory-data-analysis/` in your project.

Once installed, trigger it with phrases like "analyze this data file", "explore this dataset", "what's in this file", "data quality report", or "profile this dataset".

## What a session looks like

A typical session has three phases:

1. **File intake and format detection.** Provide the file path or upload the file. Claude identifies the format, appropriate parser, and domain context (genomics, chemistry, tabular, imaging) from the file extension and header.
2. **Format-specific profiling.** The appropriate analysis runs: for tabular data, standard distributional summaries; for FASTQ, read quality scores and adapter contamination; for SDF/MOL, SMILES validity and descriptor distributions; for HDF5, dataset hierarchy and shape summary.
3. **Report and recommendations.** A markdown report is generated with all quality metrics flagged, missing data patterns visualized, and recommended downstream analysis steps tailored to the data type and detected quality issues.
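Phase 1 can be sketched as a combination of magic-byte and extension checks. The format table below is a tiny illustrative subset, not the skill's actual registry (the HDF5 and gzip signatures are real magic bytes; the domain labels are assumptions).

```python
# Magic-byte signatures (real values for these formats)
MAGIC_BYTES = {
    b"\x89HDF\r\n\x1a\n": ("HDF5", "hierarchical"),
    b"\x1f\x8b": ("gzip", "compressed"),
    b"%PDF": ("PDF", "document"),
}

# Extension fallbacks with assumed domain labels
EXTENSION_MAP = {
    ".fastq": ("FASTQ", "genomics"),
    ".sdf": ("SDF", "chemistry"),
    ".csv": ("CSV", "tabular"),
    ".dcm": ("DICOM", "imaging"),
}

def detect_format(filename, header):
    """Magic bytes win over the extension when both are present."""
    for magic, info in MAGIC_BYTES.items():
        if header.startswith(magic):
            return info
    for ext, info in EXTENSION_MAP.items():
        if filename.lower().endswith(ext):
            return info
    return ("unknown binary", "generic")
```

Checking magic bytes before the extension matters because scientific files are often misnamed or compressed; the extension alone is only a hint.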

## Receipts

**Where it works well:**
- Unfamiliar tabular datasets where the column semantics aren't obvious: automatic type detection and missing-value pattern analysis surface data entry issues immediately
- Bioinformatics files (FASTQ, BAM, VCF) where format-specific quality metrics (mapping rate, coverage depth, variant call quality) need checking before downstream analysis

**Where it backfires:**
- Very large files (>10 GB) can be slow to profile: the skill doesn't automatically sample for performance, so check the file size first and sample manually if needed
- Proprietary vendor formats with undocumented binary structure fall outside the 200+ supported formats and return a generic binary report
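For the large-file caveat, one workaround is to gate profiling on file size and read only a head sample past a threshold. A minimal sketch; the threshold and sample size here are arbitrary choices, not skill parameters:

```python
import os
import tempfile
from itertools import islice

def read_for_profiling(path, size_limit=10 * 1024**3, sample_lines=100_000):
    """Return all lines for small files, a head sample for large ones."""
    sampled = os.path.getsize(path) > size_limit
    with open(path) as fh:
        lines = list(islice(fh, sample_lines)) if sampled else fh.readlines()
    return lines, sampled

# Demo on a small temp file (well under the 10 GB threshold)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("a,b\n1,2\n3,4\n")
lines, sampled = read_for_profiling(tmp.name)
os.unlink(tmp.name)
```

A head sample is the simplest option; for files where early rows aren't representative (e.g. sorted data), reservoir sampling would be a better fit.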

**Pattern that works:** run EDA first on any new dataset before writing analysis code — catching data quality issues at the profiling stage saves far more time than discovering them mid-analysis.

## Source and attribution

Originally authored by [K-Dense Inc.](https://github.com/K-Dense-AI). The canonical SKILL.md lives in the [`exploratory-data-analysis` folder](https://github.com/K-Dense-AI/scientific-agent-skills/tree/main/scientific-skills/exploratory-data-analysis) of their public scientific-agent-skills repository.

License: MIT. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner's perspective. For the formal spec and any updates, defer to the source repo.