# MarkItDown read the 19MB PDF WebFetch wouldn't

> WebFetch caps fetches at 10MB, so a 19.5MB system-card PDF stayed unread. MarkItDown converted it in 66s — and recovered the number we couldn't verify.

**Canonical URL**: https://agentcookbooks.com/blog/markitdown-19mb-pdf-webfetch-limit/

**Published**: 2026-05-31

**Tags**: claude-code, documentation, context-engineering

---

Two days ago a [15-agent fact-check workflow](/blog/opus-4-8-dynamic-workflows-fact-check/) hit a wall it couldn't climb: one Opus 4.8 claim — that the model "fails to raise important events to the user only 3.7% of the time" — lived in Anthropic's system-card PDF, and **the verifier couldn't open it because WebFetch caps fetched content at 10MB.** The PDF is 19.5MB. So the claim shipped marked *unverified*, with a note that the comparison baseline "isn't pinned down without the primary." That's the kind of dead end that feels permanent. It isn't. [MarkItDown](https://github.com/microsoft/markitdown) — Microsoft's MIT-licensed document-to-Markdown converter — turned the same 124-page PDF into **64,335 words of clean Markdown in 66 seconds** on a laptop, fully offline, and the number came back quotable from the primary source *with* the denominator the workflow couldn't find. The cookbook lesson: WebFetch is a page reader with a content-size ceiling, not a document reader. When a source blows past the ceiling, a local doc→markdown pass is the missing step.

## The wall

The fact-check's adversarial verifier did everything right. It found the claim, traced it to the system card, and tried to fetch the card, only to get back, verbatim, that `WebFetch exceeded its 10MB content limit`. With the primary unreadable, every corroboration traced back to the same secondary blogs quoting the same card, so the honest verdict was `single-source-only`, with the explicit caveat that the multi-fold-drop comparison ("vs Opus 4.7? vs Mythos Preview?") couldn't be settled. Fifteen agents and ~1.1M tokens, blocked by a file-size limit.

WebFetch's 10MB cap is reasonable: it exists so a single fetch can't blow up a context window. But it rules WebFetch out for exactly the documents you most want to cite: model cards, datasheets, filings, decks. Those arrive as PDF, and they're big.
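A pre-flight size check makes that failure mode explicit instead of leaving it to surface mid-verification. The sketch below is illustrative only: `over_fetch_cap` and `FETCH_CAP_BYTES` are hypothetical names, and the cap is assumed to be 10 MiB (10 × 1024 × 1024 bytes), since the error message only says "10MB". It checks a file that is already on disk.

```bash
# Assumption: "10MB" means 10 MiB. Adjust if the real ceiling differs.
FETCH_CAP_BYTES=$((10 * 1024 * 1024))

# Exit 0 if FILE is larger than the cap, 1 if it fits, 2 if FILE is missing.
# Usage: over_fetch_cap FILE [CAP_BYTES]
over_fetch_cap() {
  local file="$1" cap="${2:-$FETCH_CAP_BYTES}" size
  [ -f "$file" ] || return 2
  size=$(( $(wc -c < "$file") ))   # arithmetic expansion strips wc's padding
  [ "$size" -gt "$cap" ]
}

# Example:
# if over_fetch_cap opus-4-8-system-card.pdf; then echo "convert locally"; fi
```

A 20,430,397-byte file is over the cap whether "10MB" is read as 10,000,000 or 10,485,760 bytes, so the assumption doesn't change the outcome here.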

## The fix, locally

MarkItDown converts PDF / DOCX / PPTX / XLSX / HTML / EPUB / CSV and more into structure-preserving Markdown — headings, lists, and tables, not flat text dumps — specifically so the output is clean to hand to an LLM. I ran it in an isolated venv in a throwaway tooling directory (not in this site's repo — the Astro build stays Python-free), with only the PDF extra, which is fully offline:

```bash
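# Isolated venv. Scripts/ is the Windows layout; on macOS/Linux use .venv/bin/ instead.
python -m venv .venv
# The [pdf] extra alone is enough for PDF conversion.
.venv/Scripts/python -m pip install "markitdown[pdf]"
curl -sL -o opus-4-8-system-card.pdf "https://cdn.sanity.io/.../<hash>.pdf"   # Anthropic CDN
# Convert; -o writes the Markdown to a file.
.venv/Scripts/markitdown opus-4-8-system-card.pdf -o opus-4-8-system-card.md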
```

I deliberately did **not** enable MarkItDown's image-OCR or audio-transcription paths — those call an LLM (OpenAI-compatible) and send content out. Plain document conversion is local and offline, which is the whole point here: read a file you already have, don't ship it anywhere.

## What it cost

The receipt, measured:

- **Input:** 20,430,397 bytes (~19.5MB), PDF 1.4, 124+ pages: the Opus 4.8 system card. Nearly 2× WebFetch's ceiling.
- **Convert:** 66 seconds on the `pdfminer.six` backend, exit code 0.
- **Output:** 6,355 lines / **64,335 words** / 434,861 characters (~425KB) of Markdown.
- **Tables survived** as Markdown pipe tables — e.g. the honesty figure also appears in a comparison row, `| Without thinking | 10.8% | 3.7% | 3.1% | 2.5% |`, not mangled into prose.

About one second per two pages. Slow next to a web fetch, instant next to "this number is unknowable."
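Those ratios are easy to re-derive from the measured figures. A small sketch using `awk` for the floating-point math; `receipt_math` is a hypothetical helper, the page count is taken as 124, and the cap is assumed to be 10 MiB:

```bash
# Re-derive the headline ratios from the measured figures in the receipt.
receipt_math() {
  awk 'BEGIN {
    bytes = 20430397; cap = 10 * 1024 * 1024   # cap: assumed 10 MiB
    secs  = 66;       pages = 124; words = 64335
    printf "size:  %.1f MiB\n",      bytes / 1048576
    printf "ratio: %.2fx the cap\n", bytes / cap
    printf "pace:  %.2f s/page\n",   secs / pages
    printf "rate:  %.0f words/s\n",  words / secs
  }'
}

receipt_math
# size:  19.5 MiB
# ratio: 1.95x the cap
# pace:  0.53 s/page
# rate:  975 words/s
```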

## The payoff: the number came back

Here is the sentence the workflow couldn't reach, lifted straight from the converted Markdown:

> Claude Opus 4.8 fails to raise the important events to the user only 3.7% of the time, down 5-fold from Mythos Preview, which misleads the user 27.6% of the time in this scenario, and down almost as much from Opus 4.7.

That doesn't just confirm the 3.7%. It **pins the denominator the fact-check flagged as unpinned**: the multi-fold drop is measured against *Mythos Preview* (which misleads 27.6% of the time), and separately "almost as much" against Opus 4.7 — so both secondary framings were half-right and neither was complete. A claim that fifteen agents had to leave at `unverified` became a one-laptop primary-source quote, with its missing context attached, in about a minute.
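Once the card is Markdown, recovering the sentence is a plain text search. A minimal sketch, where `find_claim` is a hypothetical helper and the filename is the one written by the conversion command earlier:

```bash
# Print every line of FILE containing the literal string NEEDLE, prefixed
# with its line number. -F disables regex, so "3.7%" cannot match "3x7%".
# Usage: find_claim FILE NEEDLE
find_claim() {
  grep -n -F -- "$2" "$1"
}

# Example:
# find_claim opus-4-8-system-card.md '3.7%'
```

Both the prose sentence and the pipe-table row should surface, since each contains the literal `3.7%`.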

## The gotchas

Four worth knowing, all hit firsthand:

- **`stderr` floods with `Could not get FontBBox from font descriptor because None cannot be parsed as 4 floats`.** These are `pdfminer` font-descriptor warnings, one per offending glyph. Non-fatal — exit 0, text clean — but enough of them to look like a crash. Redirect with `2>/dev/null` or you'll misread success as failure.
- **On Python 3.14, `pip install` prints `WARNING: Cache entry deserialization failed, entry ignored` repeatedly.** Harmless pip-cache noise on a very new interpreter; the install completes.
- **`markitdown[all]` is heavy** (Azure, ONNX, and more) and risks missing wheels on a bleeding-edge Python. `markitdown[pdf]` is enough for the PDF path and installs clean.
- **OCR and audio transcription are neither free nor offline** — they need an `llm_client`/`llm_model`. If you only want text-bearing PDFs and Office files, you never touch that path.
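
Since the warnings can make a clean run look like a crash, it can help to judge the conversion by exit code and output size rather than by what scrolls past. The wrapper below is a sketch, not part of MarkItDown: `convert_checked` and the `MARKITDOWN_BIN` override are hypothetical names, and the only thing assumed about the CLI is the `markitdown <input> -o <output>` invocation shown earlier. It logs `stderr` to a file instead of discarding it, so a real error stays recoverable.

```bash
# Run the converter, keep stderr in a log, and report success from the
# exit code plus a non-empty output file.
# Usage: convert_checked INPUT OUTPUT [LOGFILE]
convert_checked() {
  local src="$1" out="$2" log="${3:-markitdown.stderr.log}" status warnings
  "${MARKITDOWN_BIN:-markitdown}" "$src" -o "$out" 2>"$log"
  status=$?
  if [ "$status" -ne 0 ] || [ ! -s "$out" ]; then
    echo "conversion failed (exit $status); see $log" >&2
    return 1
  fi
  # Count is informational only; "|| true" because grep exits 1 on zero matches.
  warnings=$(grep -c 'FontBBox' "$log" || true)
  echo "ok: $out ($warnings FontBBox warnings in $log)"
}

# Example:
# convert_checked opus-4-8-system-card.pdf opus-4-8-system-card.md
```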

## Where it fits, and where it doesn't

MarkItDown isn't a WebFetch replacement; it's the step *after* you've got the file. It reads from disk, so you still download the source first (a plain `curl` cleared the 10MB issue that WebFetch couldn't, because the limit is on WebFetch's returned content, not on the network). For sources that are already Markdown — GitHub repos, docs in `.md` — it adds nothing. Its slot is precisely the one that bit the fact-check: a real document, too big or too binary for a fetch, whose contents you need to quote.
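
That routing can be sketched as a lookup on the source's extension. Everything here is illustrative: `route_source` and the three route names are made up for this example, the extension list is a subset of the formats named earlier, and a real router would also need to weigh file size.

```bash
# Suggest a handling route from a path or URL's extension alone.
# The three route names are made up for this sketch.
route_source() {
  local name
  name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$name" in
    *.md|*.markdown)                   echo "read-as-is" ;;
    *.pdf|*.docx|*.pptx|*.xlsx|*.epub) echo "download-then-convert" ;;
    *)                                 echo "fetch-normally" ;;
  esac
}

# Example:
# route_source opus-4-8-system-card.pdf   # prints download-then-convert
```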

That slot recurs often enough that the wiki's [`markitdown`](/skills/markitdown/) skill now carries this run as its firsthand receipt — so the next "couldn't open the PDF" is a one-line convert, not a research dead end.

## Takeaway

When a verification stalls on "couldn't open the source," check whether it stalled on *size or format*, not on availability. A 10MB fetch ceiling is a limit of the tool, not a verdict on whether the source is knowable. Download it, convert it, re-read it. The number was there the whole time; it was just about 9.5MB past where the fetcher would look.