# Crawled – currently not indexed isn't always a bug

> 22 URLs hit Search Console's 'Crawled – currently not indexed', and all of them were noindex'd markdown alternates. Here is when that bucket means your noindex is working rather than a defect.

**Canonical URL**: https://agentcookbooks.com/blog/gsc-crawled-not-indexed-noindex-alternates/

**Published**: 2026-05-29

**Tags**: claude-code, seo

---

Search Console flagged 22 URLs as "Crawled – currently not indexed." The instinct is to treat that as a problem: Google found these pages and decided they weren't worth keeping. But every one of the 22 was a `.md` URL — and that changes the reading entirely. This isn't a quality verdict on the pages; it's the duplicate-content protection on their raw-markdown alternates doing exactly what it's supposed to.

## What the 22 URLs had in common

Every entry was a raw-markdown alternate — `/skills/<slug>.md` or `/blog/<slug>.md` — never an HTML page (examples from the report: `seo-sxo.md`, `handoff.md`, `caveman.md`, `tdd.md`). The site serves a chrome-free markdown twin of every page for AI ingestion, advertised in `<head>` via `<link rel="alternate" type="text/markdown">`. One header check explains why Google won't index them:

```
$ curl -sI https://agentcookbooks.com/skills/seo-sxo.md
HTTP/1.1 200 OK
Content-Type: text/markdown; charset=utf-8
X-Robots-Tag: noindex
```

Meanwhile the actual HTML pages (`/skills/seo-sxo/`, `/blog/<slug>/`) carry no robots directive, are indexable, and are the only form present in `sitemap-0.xml` — zero `.md` URLs are in the sitemap. Google was never *asked* to index the markdown; it found it on its own and correctly declined.
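The same header check can be scripted so it runs across many URLs instead of one `curl` at a time. The following is a minimal sketch, not the site's actual tooling: `is_noindexed` and `fetch_headers` are hypothetical helper names, and the parser only handles the plain comma-separated form of `X-Robots-Tag` (it ignores user-agent-scoped values such as `googlebot: noindex`).

```python
from typing import Dict, Mapping
from urllib.request import Request, urlopen


def is_noindexed(headers: Mapping[str, str]) -> bool:
    """Return True if an X-Robots-Tag header carries a noindex (or none) directive.

    Header names are compared case-insensitively. Only the plain
    comma-separated form is handled; user-agent-scoped values are ignored.
    """
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        directives = {d.strip().lower() for d in value.split(",")}
        if "noindex" in directives or "none" in directives:
            return True
    return False


def fetch_headers(url: str) -> Dict[str, str]:
    """Send a HEAD request and return the response headers (needs network access)."""
    with urlopen(Request(url, method="HEAD"), timeout=10) as response:
        return dict(response.headers.items())
```

On a setup like the one described here, `is_noindexed(fetch_headers(url))` should be true for each `.md` alternate and false for each HTML page. If either result flips, that is the case worth investigating.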

## Why Google crawls them but won't index them

The chain: Googlebot follows the `rel="alternate"` link in the page head → fetches the `.md` → reads `X-Robots-Tag: noindex` → drops it from the index → files it under "Crawled – currently not indexed." That's the intended outcome. The markdown twins exist to be **crawlable by AI assistants** (clean text for citation, no nav, no CSS) but **not indexed by Google**, because indexing them would duplicate the canonical HTML page. The `noindex` header is the thing standing between "useful AI-ingestion surface" and "duplicate-content mess." Seeing them in that bucket means it's working.
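The first link in that chain, discovery through the head link, can be reproduced with the standard library. This is a rough sketch of how a crawler might pick up the alternate; `MarkdownAlternateFinder` and `markdown_alternates` are illustrative names, and the root-relative `href` in the usage note is an assumption about how the site writes the link.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class MarkdownAlternateFinder(HTMLParser):
    """Collect href values of <link rel="alternate" type="text/markdown"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute names arrive lowercased; valueless attributes arrive as None.
        attr = {name: (value or "") for name, value in attrs}
        rel_tokens = attr.get("rel", "").lower().split()
        if "alternate" in rel_tokens and attr.get("type", "").lower() == "text/markdown":
            self.hrefs.append(attr.get("href", ""))


def markdown_alternates(html: str, page_url: str) -> List[str]:
    """Return absolute URLs of the markdown alternates advertised by a page."""
    finder = MarkdownAlternateFinder()
    finder.feed(html)
    return [urljoin(page_url, href) for href in finder.hrefs if href]
```

Feeding it a page whose head contains `<link rel="alternate" type="text/markdown" href="/skills/seo-sxo.md">` yields the absolute `.md` URL, which is the URL a crawler would go on to fetch and find marked `noindex`.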

## What you shouldn't "fix"

The label tempts a fix, and two tempting ones both backfire:

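- **`Disallow: /*.md` in robots.txt** (under `User-agent: *`) would stop Google crawling them, and it would stop GPTBot, ClaudeBot, and PerplexityBot too, killing the AI-citation path the markdown files exist for. It also hides the `noindex` header from Google, since a crawler that can't fetch a URL can't read its response headers. The design is *crawlable-but-noindexed*; a crawl block breaks the first half and blinds Google to the second.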
- **Removing the `rel="alternate"` head link** would cut Google's discovery of the `.md` — and also remove the signal AI search tools use to find the markdown in the first place.
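To see how broad the robots.txt rule in the first bullet is, here is a small path matcher. It is a simplified reading of the RFC 9309 wildcard rules (`*` matches any run of characters, a trailing `$` anchors the end, anything else is a prefix match), not a full robots.txt parser: it ignores user-agent groups, `Allow` precedence, and percent-encoding. `rule_matches` is a hypothetical helper name.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check one robots.txt path pattern against a URL path.

    '*' matches any run of characters, a trailing '$' anchors the match
    at the end of the path, and otherwise the pattern only needs to match
    a prefix of the path.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, giving prefix semantics.
    return re.match(regex, path) is not None
```

Under this reading, `/*.md` matches `/skills/seo-sxo.md` but not `/skills/seo-sxo/`. Because the match is prefix-based, it would also catch a path such as `/blog/notes.mdx`, whereas `/*.md$` would not. A rule placed under `User-agent: *` applies to every crawler that has no more specific group of its own, which is why the block would not be limited to Googlebot.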

The reframe that matters: in this report, watch whether the **HTML** pages get indexed, not whether the `.md` alternates don't. An HTML page sitting in "Crawled – currently not indexed" is a real signal worth chasing (thin content, duplication, or low quality). A `.md` alternate sitting there is the goal. The count climbs by roughly one per new page as the site grows, which is also expected, not decay.
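That triage rule is easy to apply to an exported URL list. A minimal sketch, assuming the report has already been exported to a plain list of URLs; `triage` and the bucket names are made up for illustration.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def triage(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Split 'Crawled - currently not indexed' URLs into expected and suspicious.

    Paths ending in .md are treated as the noindex'd markdown alternates;
    everything else is treated as an HTML page that deserves a closer look.
    """
    buckets: Dict[str, List[str]] = {
        "expected_md_alternates": [],
        "html_to_investigate": [],
    }
    for url in urls:
        # Look at the path only, so query strings and fragments don't interfere.
        path = urlsplit(url).path
        key = "expected_md_alternates" if path.endswith(".md") else "html_to_investigate"
        buckets[key].append(url)
    return buckets
```

An empty `html_to_investigate` list is the healthy result. Anything that lands there is the part of the report worth acting on.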

## The trap hiding underneath: a stale sitemap stalls discovery

There's a related issue worth acting on. Right after publishing three new posts, `sitemap-0.xml` at the edge still listed **272** URLs. The three new HTML pages (origin count: 275) weren't in it, because the sitemap is served from [Cloudflare's edge cache, which a deploy doesn't purge](/blog/cloudflare-pages-purge-cache-after-deploy/). Googlebot reads the stale sitemap, so the new pages can't be *discovered* through it until the cache is purged.
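The 272-versus-275 gap can be caught automatically by diffing the sitemap you built against the one the edge serves. A minimal sketch, assuming you already have both documents as text (how you obtain the origin copy depends on your build and is not covered here); `sitemap_urls` and `missing_from_edge` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import Set

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> Set[str]:
    """Return the set of <loc> URLs in a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def missing_from_edge(origin_xml: str, edge_xml: str) -> Set[str]:
    """URLs present in the freshly built sitemap but absent from the served one."""
    return sitemap_urls(origin_xml) - sitemap_urls(edge_xml)
```

A non-empty result after a deploy means the edge is still serving the old sitemap and a cache purge is due.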

That's the asymmetry to internalize: the "Crawled – currently not indexed" `.md` noise is benign and needs nothing; a stale sitemap that omits your newest pages is the actual problem and needs a purge. Read the markdown alternates as confirmation the system works — then go make sure the sitemap Google is reading is the one you just shipped.