
Crawled – currently not indexed isn't always a bug

Search Console flagged 22 URLs as “Crawled – currently not indexed.” The instinct is to treat that as a problem: Google found these pages and decided they weren’t worth keeping. But every one of the 22 was a .md URL — and that changes the reading entirely. This isn’t a quality verdict on the pages; it’s the duplicate-content protection on their raw-markdown alternates doing exactly what it’s supposed to.

What the 22 URLs had in common

Every entry was a raw-markdown alternate — /skills/<slug>.md or /blog/<slug>.md — never an HTML page (examples from the report: seo-sxo.md, handoff.md, caveman.md, tdd.md). The site serves a chrome-free markdown twin of every page for AI ingestion, advertised in <head> via <link rel="alternate" type="text/markdown">. One header check explains why Google won’t index them:

$ curl -sI https://agentcookbooks.com/skills/seo-sxo.md
HTTP/1.1 200 OK
Content-Type: text/markdown; charset=utf-8
X-Robots-Tag: noindex

Meanwhile the actual HTML pages (/skills/seo-sxo/, /blog/<slug>/) carry no robots directive, are indexable, and are the only form present in sitemap-0.xml — zero .md URLs are in the sitemap. Google was never asked to index the markdown; it found it on its own and correctly declined.
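
The header check above can be scripted if you want to run it across many URLs. Here is a minimal Python sketch of the interpretation step only: given the value of one X-Robots-Tag header, decide whether it tells a crawler not to index the response. The function name blocks_indexing and the VALUED_DIRECTIVES set are made up for this example, and the parsing is a simplification (one header value at a time, optional bot prefix), not a full implementation of Google's rules.

```python
# Directives that carry their own "name: value" form, so a colon after them
# is not a crawler prefix. (Illustrative subset, assumed for this sketch.)
VALUED_DIRECTIVES = {
    "unavailable_after",
    "max-snippet",
    "max-image-preview",
    "max-video-preview",
}


def blocks_indexing(x_robots_tag: str, bot: str = "googlebot") -> bool:
    """Return True if one X-Robots-Tag value tells `bot` not to index."""
    value = x_robots_tag.strip().lower()

    # A value may be scoped to a single crawler, e.g. "googlebot: noindex".
    scope, sep, rest = value.partition(":")
    scope = scope.strip()
    if sep and "," not in scope and scope not in VALUED_DIRECTIVES:
        if scope != bot:
            return False  # addressed to a different crawler
        value = rest

    directives = {d.strip() for d in value.split(",")}
    # "none" is shorthand for "noindex, nofollow".
    return "noindex" in directives or "none" in directives
```

For the response shown above, blocks_indexing("noindex") returns True, which is the whole reason the .md URL lands in the report. An HTML page that sends no X-Robots-Tag header at all has nothing to pass in, so treat a missing header as "no directive".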

Why Google crawls them but won’t index them

The chain: Googlebot follows the rel="alternate" link in the page head → fetches the .md → reads X-Robots-Tag: noindex → drops it from the index → files it under “Crawled – currently not indexed.” That’s the intended outcome. The markdown twins exist to be crawlable by AI assistants (clean text for citation, no nav, no CSS) but not indexed by Google, because indexing them would duplicate the canonical HTML page. The noindex header is the thing standing between “useful AI-ingestion surface” and “duplicate-content mess.” Seeing them in that bucket means it’s working.
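
The first link in that chain is discovery through the head tag. As a rough illustration of what a crawler sees, here is a small Python sketch that pulls markdown alternates out of a page's HTML using the standard library's html.parser. The class and function names are invented for this example, and the sample markup in the usage note is hypothetical; the site's real head may differ.

```python
from html.parser import HTMLParser


class MarkdownAlternateFinder(HTMLParser):
    """Collect href values of <link rel="alternate" type="text/markdown">."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_markdown = attr.get("type", "").lower() == "text/markdown"
        if "alternate" in rels and is_markdown and attr.get("href"):
            self.hrefs.append(attr["href"])


def markdown_alternates(html: str) -> list:
    """Return every markdown alternate URL advertised in the given HTML."""
    finder = MarkdownAlternateFinder()
    finder.feed(html)
    return finder.hrefs
```

Fed a hypothetical head such as `<link rel="alternate" type="text/markdown" href="/skills/seo-sxo.md">`, it returns a list containing that one href. Any crawler that parses link tags can find the .md the same way, which is why the URL gets crawled without ever being in the sitemap.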

What you shouldn’t “fix”

The label tempts a fix, and two tempting ones both backfire:

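  • Disallow: /*.md in robots.txt, placed in the User-agent: * group, would stop Google crawling them. It would also stop GPTBot, ClaudeBot, and PerplexityBot, killing the AI-citation path the markdown files exist for. And because a blocked URL is never fetched, Google would no longer see the X-Robots-Tag: noindex header at all. The design is crawlable-but-noindexed; a crawl block breaks half of it.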
  • Removing the rel="alternate" head link would cut Google’s discovery of the .md — and also remove the signal AI search tools use to find the markdown in the first place.
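
To see how broad the first "fix" is, it helps to test the pattern against real paths. The sketch below is a hand-rolled Python matcher for robots.txt path patterns, assuming the usual wildcard semantics (* matches any run of characters, a trailing $ anchors the end, and everything else is a prefix match). The function name rule_matches is invented for this example. It only evaluates one pattern against one path; it does not parse a robots.txt file or choose between user-agent groups.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the given URL path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]

    # Escape literal pieces and let each "*" match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"

    # re.match anchors at the start; without "$" this is a prefix match.
    return re.match(regex, path) is not None
```

With this matcher, rule_matches("/*.md", "/skills/seo-sxo.md") is True and rule_matches("/*.md", "/skills/seo-sxo/") is False, so the HTML pages are untouched. But the rule says nothing about which crawler is asking: in a User-agent: * group it applies to every bot without its own group, which is how it takes out the AI crawlers along with Googlebot.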

The reframe that matters: in this report, watch whether any HTML pages show up, not whether the .md alternates do. An HTML page sitting in “Crawled – currently not indexed” is a real signal worth chasing (thin content, duplication, or low quality). A .md alternate sitting there is the goal. The count climbs by roughly one per new page as the site grows, which is also expected and not a sign of decay.
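
That triage is easy to automate against an export of the report. Here is a minimal Python sketch that splits a list of reported URLs into the expected .md alternates and everything else. The function name triage and the bucket names are made up for this example, and the rule "path ends in .md" is an assumption that fits this site's URL layout; adjust it if your alternates live somewhere else.

```python
from urllib.parse import urlsplit


def triage(report_urls):
    """Split reported URLs into expected .md alternates and pages to investigate."""
    expected, investigate = [], []
    for url in report_urls:
        # Look at the path only, so query strings and fragments don't interfere.
        path = urlsplit(url).path
        if path.endswith(".md"):
            expected.append(url)
        else:
            investigate.append(url)
    return {"expected": expected, "investigate": investigate}
```

If the "investigate" list is empty, as it would have been for the 22 URLs here, there is nothing to chase. The moment an HTML URL shows up in it, that page deserves a closer look.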

The trap hiding underneath: a stale sitemap stalls discovery

There’s a related issue worth acting on. Right after publishing three new posts, the sitemap-0.xml served at the edge still listed 272 URLs, while the origin copy listed 275. The three new HTML pages were missing because the sitemap is served from Cloudflare’s edge cache, which a deploy doesn’t purge. Googlebot reads the stale copy, so the new pages can’t be discovered through the sitemap until the cache is purged.
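
A post-deploy check can catch this before Googlebot does. The Python sketch below compares two copies of a sitemap and lists the URLs present at the origin but missing at the edge. It assumes you already have both XML documents as strings (for example, the file from your build output and the body returned by the public URL); the function names are invented for this example, and fetching is deliberately left out.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_locs(xml_text: str) -> set:
    """Return the set of <loc> URLs found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.iter(f"{SITEMAP_NS}loc")
        if loc.text
    }


def missing_from_edge(origin_xml: str, edge_xml: str) -> list:
    """List URLs the origin sitemap has but the edge copy lacks."""
    return sorted(sitemap_locs(origin_xml) - sitemap_locs(edge_xml))
```

An empty result means the edge copy is current. A non-empty one, like the three posts in the 272 versus 275 case above, is the cue to purge the cached sitemap. The same sitemap_locs helper can also confirm the earlier claim that no .md URL is listed, by checking that no entry ends in .md.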

That’s the asymmetry to internalize: the “Crawled – currently not indexed” .md noise is benign and needs nothing; a stale sitemap that omits your newest pages is the actual problem and needs a purge. Read the markdown alternates as confirmation the system works — then go make sure the sitemap Google is reading is the one you just shipped.