exploratory-data-analysis

Perform comprehensive exploratory data analysis on scientific data files across 200+ file formats — automatically detecting file type and generating detailed markdown reports with format-specific analysis, quality metrics, and downstream recommendations.

Auto-detect and profile any scientific data file format

Source K-Dense AI
License MIT
First documented

Trigger phrases

Phrases that activate this skill when typed to Claude Code:

  • analyze this data file
  • explore this dataset
  • what's in this file
  • data quality report
  • profile this dataset

What it does

exploratory-data-analysis is a Claude Code skill from K-Dense AI’s scientific-agent-skills repo. It turns Claude into a data profiler that handles 200+ scientific file formats — from standard CSV/Parquet to domain-specific formats like HDF5, DICOM, FASTQ, SDF (chemical), and mass spectrometry formats — automatically detecting the format and producing a comprehensive markdown EDA report with quality metrics and recommendations for downstream analysis.

A session produces a structured EDA report: file format detection, data dimensions, missing value analysis, distribution summaries, format-specific quality metrics (e.g., read depth for FASTQ, SMILES validity for SDF), and recommended next analysis steps.
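The per-column summaries in that report can be pictured with a minimal standard-library sketch; the function name `profile_csv` and the specific metrics (missing count, unique count, most common value) are illustrative assumptions, not the skill's actual implementation, which covers far more formats and metrics.

```python
# Hedged sketch of a per-column summary like the one in the EDA report.
# Standard library only; the real skill's profiling is much richer.
import csv
from collections import Counter

def profile_csv(path: str) -> dict:
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    report = {"n_rows": len(rows), "columns": {}}
    for col in (rows[0].keys() if rows else []):
        values = [r[col] for r in rows]
        report["columns"][col] = {
            # Empty strings are how csv surfaces missing cells.
            "missing": sum(1 for v in values if v in ("", None)),
            "n_unique": len(set(values)),
            "most_common": Counter(values).most_common(1)[0][0] if values else None,
        }
    return report
```

A report generator would then render this dict as the markdown tables the skill emits.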

When to use it

Reach for it when:

  • You receive an unfamiliar data file and need to understand its structure before writing any analysis code
  • You want a systematic data quality report before fitting models or running statistical tests
  • You’re handed a domain-specific scientific format (bioinformatics, proteomics, microscopy) and need format-appropriate quality checks

When not to reach for it:

  • You already know your data structure and want to build specific visualizations — use seaborn or matplotlib
  • You want a statistical model fit — use statsmodels or scikit-learn

Install

Copy the SKILL.md from K-Dense AI’s exploratory-data-analysis folder into .claude/skills/exploratory-data-analysis/ in your project.
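Concretely, the copy step looks something like the sketch below; the `SRC` path is an assumption about where you cloned the scientific-agent-skills repo, so adjust it to your checkout.

```shell
# Sketch of the install step. SRC is an assumed local clone path -- adjust it.
SRC="./scientific-agent-skills/exploratory-data-analysis/SKILL.md"
DEST=".claude/skills/exploratory-data-analysis"
mkdir -p "$DEST"
if [ -f "$SRC" ]; then
  cp "$SRC" "$DEST/SKILL.md"
fi
```

Run it from your project root so the `.claude/skills/` directory lands where Claude Code expects it.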



What a session looks like

A typical session has three phases:

  1. File intake and format detection. Provide the file path or upload the file. Claude identifies the format, appropriate parser, and domain context (genomics, chemistry, tabular, imaging) from the file extension and header.
  2. Format-specific profiling. The appropriate analysis runs: for tabular data, standard distributional summaries; for FASTQ, read quality scores and adapter contamination; for SDF/MOL, SMILES validity and descriptor distributions; for HDF5, dataset hierarchy and shape summary.
  3. Report and recommendations. A markdown report is generated with all quality metrics flagged, missing data patterns visualized, and recommended downstream analysis steps tailored to the data type and detected quality issues.
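Phase 1 can be sketched as a small dispatcher; the extension map, the magic-byte table, and the function name `detect_format` are illustrative assumptions rather than the skill's actual detection logic (the HDF5 signature and Parquet `PAR1` header bytes shown are the real published magic numbers for those formats).

```python
# Minimal sketch of format detection by extension plus header bytes (phase 1).
from pathlib import Path

EXTENSION_MAP = {
    ".csv": "tabular", ".parquet": "tabular",
    ".fastq": "genomics", ".fq": "genomics",
    ".sdf": "chemistry", ".mol": "chemistry",
    ".h5": "hierarchical", ".hdf5": "hierarchical",
    ".dcm": "imaging",
}

MAGIC_BYTES = {
    b"\x89HDF\r\n\x1a\n": "hierarchical",  # HDF5 file signature
    b"PAR1": "tabular",                    # Parquet header
}

def detect_format(path: str) -> str:
    p = Path(path)
    # Header bytes take priority over a possibly misleading extension.
    try:
        header = p.read_bytes()[:8]
        for magic, kind in MAGIC_BYTES.items():
            if header.startswith(magic):
                return kind
    except FileNotFoundError:
        pass
    return EXTENSION_MAP.get(p.suffix.lower(), "unknown")
```

The detected category would then select which phase-2 profiler runs.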

Receipts

Where it works well:

  • Unfamiliar tabular datasets where the column semantics aren’t obvious — the automatic type detection and missing-value pattern analysis surface data entry issues immediately
  • Bioinformatics files (FASTQ, BAM, VCF) where format-specific quality metrics (mapping rate, coverage depth, variant call quality) need checking before downstream analysis
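One of the FASTQ metrics mentioned above, mean per-read quality, is easy to sketch: FASTQ stores one quality character per base, and under the common Phred+33 encoding the score is the character's code point minus 33. The function names here are hypothetical; a real check would also handle multi-line records and adapter contamination.

```python
# Sketch of mean per-read Phred quality from a FASTQ file (Phred+33 encoding).
def mean_phred(quality_line: str) -> float:
    return sum(ord(c) - 33 for c in quality_line) / len(quality_line)

def fastq_mean_qualities(path: str) -> list:
    with open(path) as f:
        lines = f.read().splitlines()
    # A FASTQ record is 4 lines: header, sequence, '+', quality string.
    return [mean_phred(lines[i]) for i in range(3, len(lines), 4)]
```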

Where it backfires:

  • Very large files (>10GB) may be slow to profile without sampling — the skill doesn’t automatically sample for performance, so large file EDA may require a file size check first
  • Proprietary vendor formats with undocumented binary structure fall outside the 200+ supported formats and return a generic binary report
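The large-file caveat above suggests guarding the profiling step yourself. This is one possible guard, not part of the skill: the 10 GB threshold mirrors the text, and head-of-file sampling is an assumed strategy (it can bias summaries if the file is sorted).

```python
# Illustrative size guard: full read for small files, head sample otherwise.
import itertools
import os

SIZE_LIMIT = 10 * 1024 ** 3  # 10 GB, matching the caveat above

def rows_for_profiling(path: str, sample_rows: int = 100_000) -> list:
    if os.path.getsize(path) <= SIZE_LIMIT:
        with open(path) as f:
            return f.readlines()
    # Too big for a full pass: profile only the first sample_rows lines.
    with open(path) as f:
        return list(itertools.islice(f, sample_rows))
```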

Pattern that works: run EDA first on any new dataset before writing analysis code — catching data quality issues at the profiling stage saves far more time than discovering them mid-analysis.

Source and attribution

Originally authored by K-Dense Inc. The canonical SKILL.md lives in the exploratory-data-analysis folder of their public scientific-agent-skills repository.

License: MIT. Install, adapt, and redistribute with attribution preserved.

This page documents the skill from a practitioner’s perspective. For the formal spec and any updates, defer to the source repo.