MarkItDown read the 19MB PDF WebFetch wouldn't

2026-05-31

claude-code, documentation, context-engineering

Two days ago a 15-agent fact-check workflow hit a wall it couldn’t climb: one Opus 4.8 claim — that the model “fails to raise important events to the user only 3.7% of the time” — lived in Anthropic’s system-card PDF, and the verifier couldn’t open it because WebFetch caps fetched content at 10MB. The PDF is 19.5MB. So the claim shipped marked unverified, with a note that the comparison baseline “isn’t pinned down without the primary.” That’s the kind of dead end that feels permanent. It isn’t. MarkItDown — Microsoft’s MIT-licensed document-to-Markdown converter — turned the same 124-page PDF into 64,335 words of clean Markdown in 66 seconds on a laptop, fully offline, and the number came back quotable from the primary source with the denominator the workflow couldn’t find. The cookbook lesson: WebFetch is a page reader with a content-size ceiling, not a document reader. When a source blows past the ceiling, a local doc→markdown pass is the missing step.

The wall

The fact-check’s adversarial verifier did everything right. It found the claim, traced it to the system card, tried to fetch the card, and got back an error that the content exceeded WebFetch’s 10MB limit. With the primary unreadable, every corroboration traced back to the same secondary blogs quoting the same card, so the honest verdict was single-source-only, with the explicit caveat that the multi-fold-drop comparison (“vs Opus 4.7? vs Mythos Preview?”) couldn’t be settled. Fifteen agents and ~1.1M tokens, blocked by a file-size limit.

WebFetch’s 10MB cap is reasonable: it exists so a single fetch can’t blow up a context window. But it rules WebFetch out for exactly the documents you most want to cite: model cards, datasheets, filings, decks. Those arrive as PDF, and they’re big.
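A minimal sketch of the size check implied here, useful for deciding up front whether a source will clear a capped fetcher. The constant and function names are hypothetical, and treating 10MB as 10 × 1024 × 1024 bytes is an assumption; the post does not say how WebFetch counts megabytes.

```python
from pathlib import Path

# Assumption: the ceiling is 10 * 1024 * 1024 bytes. The exact byte count
# WebFetch enforces is not stated in the post.
FETCH_CEILING_BYTES = 10 * 1024 * 1024


def over_ceiling(size_bytes: int, ceiling: int = FETCH_CEILING_BYTES) -> bool:
    """Return True when a source is too large for a size-capped fetcher."""
    return size_bytes > ceiling


def overage_ratio(size_bytes: int, ceiling: int = FETCH_CEILING_BYTES) -> float:
    """Return how many times larger than the ceiling the source is."""
    return size_bytes / ceiling


def file_over_ceiling(path: str) -> bool:
    """Apply the same check to a file that is already on disk."""
    return over_ceiling(Path(path).stat().st_size)
```

With the system card's 20,430,397 bytes, `over_ceiling` returns `True` under either a decimal or a binary reading of "10MB".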

The fix, locally

MarkItDown converts PDF / DOCX / PPTX / XLSX / HTML / EPUB / CSV and more into structure-preserving Markdown — headings, lists, and tables, not flat text dumps — specifically so the output is clean to hand to an LLM. I ran it in an isolated venv in a throwaway tooling directory (not in this site’s repo — the Astro build stays Python-free), with only the PDF extra, which is fully offline:

python -m venv .venv
# .venv/Scripts/ is the Windows venv layout; on macOS/Linux the same tools live in .venv/bin/
.venv/Scripts/python -m pip install "markitdown[pdf]"
curl -sL -o opus-4-8-system-card.pdf "https://cdn.sanity.io/.../<hash>.pdf"   # Anthropic CDN
.venv/Scripts/markitdown opus-4-8-system-card.pdf -o opus-4-8-system-card.md

I deliberately did not enable MarkItDown’s image-OCR or audio-transcription paths — those call an LLM (OpenAI-compatible) and send content out. Plain document conversion is local and offline, which is the whole point here: read a file you already have, don’t ship it anywhere.
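The same conversion can also be driven from Python rather than the CLI. The sketch below assumes MarkItDown's documented `MarkItDown().convert(path)` call and the `text_content` attribute on its result. The `convert_to_markdown` wrapper and its `converter` parameter are hypothetical, added only so the function can be exercised without the package installed.

```python
from pathlib import Path


def convert_to_markdown(src: str, dest: str, converter=None) -> int:
    """Convert a local document to Markdown; return the characters written.

    `converter` is any object with a MarkItDown-style `.convert(path)` method
    whose result exposes `.text_content` (a hypothetical injection point).
    """
    if converter is None:
        # Imported lazily; requires `pip install "markitdown[pdf]"`.
        from markitdown import MarkItDown

        # No llm_client / llm_model is passed, matching the offline-only setup.
        converter = MarkItDown()

    result = converter.convert(src)
    Path(dest).write_text(result.text_content, encoding="utf-8")
    return len(result.text_content)
```

Called as `convert_to_markdown("opus-4-8-system-card.pdf", "opus-4-8-system-card.md")`, this would mirror the CLI line above, reading from disk and writing to disk with nothing sent elsewhere.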

What it cost

The receipt, measured:

Input: 20,430,397 bytes (~19.5MB), PDF 1.4, 124+ pages — the Opus 4.8 system card. Roughly 2× over WebFetch’s ceiling.
Convert: 66 seconds on the pdfminer.six backend, exit code 0.
Output: 6,355 lines / 64,335 words / 434,861 characters (~425KB) of Markdown.
Tables survived as Markdown pipe tables — e.g. the honesty figure also appears in a comparison row, | Without thinking | 10.8% | 3.7% | 3.1% | 2.5% |, not mangled into prose.

About one second per two pages. Slow next to a web fetch, instant next to “this number is unknowable.”
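The receipt's rounded figures can be re-derived from the raw numbers. This is only an arithmetic check: it assumes exactly 124 pages and roughly one byte per output character.

```python
INPUT_BYTES = 20_430_397
PAGES = 124            # assumption: the receipt says "124+"
CONVERT_SECONDS = 66
OUTPUT_CHARS = 434_861

input_mib = INPUT_BYTES / 2**20             # bytes -> MiB
seconds_per_page = CONVERT_SECONDS / PAGES  # conversion cost per page
output_kib = OUTPUT_CHARS / 1024            # assumes ~1 byte per character

print(f"{input_mib:.1f} MiB in, {output_kib:.0f} KiB out, {seconds_per_page:.2f} s/page")
# prints: 19.5 MiB in, 425 KiB out, 0.53 s/page
```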

The payoff: the number came back

Here is the sentence the workflow couldn’t reach, lifted straight from the converted Markdown:

Claude Opus 4.8 fails to raise the important events to the user only 3.7% of the time, down 5-fold from Mythos Preview, which misleads the user 27.6% of the time in this scenario, and down almost as much from Opus 4.7.

That doesn’t just confirm the 3.7%. It pins the denominator the fact-check flagged as unpinned: the multi-fold drop is measured against Mythos Preview (which misleads 27.6% of the time), and separately “almost as much” against Opus 4.7 — so both secondary framings were half-right and neither was complete. A claim that fifteen agents had to leave at unverified became a one-laptop primary-source quote, with its missing context attached, in about a minute.
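Once the card is Markdown, finding the figure is plain text search. A minimal sketch with hypothetical helper names, one to locate every line that mentions a figure and one to split a pipe-table row into cells:

```python
def find_lines(markdown: str, needle: str) -> list[tuple[int, str]]:
    """Return (1-based line number, stripped line) for each line containing needle."""
    return [
        (number, line.strip())
        for number, line in enumerate(markdown.splitlines(), start=1)
        if needle in line
    ]


def parse_pipe_row(line: str) -> list[str]:
    """Split a Markdown pipe-table row into its cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]
```

Against the real output this would be called as `find_lines(Path("opus-4-8-system-card.md").read_text(encoding="utf-8"), "3.7%")`, which should surface both the prose sentence and the comparison-table row quoted earlier.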

The gotchas

Four worth knowing, all hit firsthand:

stderr floods with “Could not get FontBBox from font descriptor because None cannot be parsed as 4 floats.” These are pdfminer font-descriptor warnings, one per offending glyph. Non-fatal (exit 0, text clean), but enough of them to look like a crash. Redirect with 2>/dev/null or you’ll misread success as failure.
On Python 3.14, pip install prints “WARNING: Cache entry deserialization failed, entry ignored” repeatedly. Harmless pip-cache noise on a very new interpreter; the install completes.
markitdown[all] is heavy (Azure, ONNX, and more) and risks missing wheels on a bleeding-edge Python. markitdown[pdf] is enough for the PDF path and installs clean.
OCR and audio transcription are neither free nor offline — they need an llm_client/llm_model. If you only want text-bearing PDFs and Office files, you never touch that path.
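The first gotcha comes down to judging a run by its exit code rather than its stderr volume. A minimal sketch, assuming the converter is launched as a subprocess; `run_quietly` is a hypothetical helper, and the noisy command is a stand-in rather than MarkItDown itself.

```python
import subprocess
import sys


def run_quietly(command: list[str]) -> tuple[bool, int]:
    """Run command; return (succeeded, number of stderr lines).

    Success is decided by the exit code alone, so a flood of non-fatal
    warnings on stderr is counted but never treated as failure.
    """
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
    )
    return completed.returncode == 0, len(completed.stderr.splitlines())


# Stand-in for a noisy converter: 500 warning lines on stderr, exit code 0.
noisy = [sys.executable, "-c", "import sys; sys.stderr.write('warning\\n' * 500)"]
ok, warning_lines = run_quietly(noisy)
```

For the real conversion the command list would be `[".venv/Scripts/markitdown", "opus-4-8-system-card.pdf", "-o", "opus-4-8-system-card.md"]`, the same invocation shown earlier.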

Where it fits, and where it doesn’t

MarkItDown isn’t a WebFetch replacement; it’s the step after you’ve got the file. It reads from disk, so you still download the source first (a plain curl cleared the 10MB issue that WebFetch couldn’t, because the limit is on WebFetch’s returned content, not on the network). For sources that are already Markdown — GitHub repos, docs in .md — it adds nothing. Its slot is precisely the one that bit the fact-check: a real document, too big or too binary for a fetch, whose contents you need to quote.
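The download step does not have to be curl. A minimal standard-library sketch that streams a URL to disk in chunks, so the file never has to fit in memory or in a context window; `download` is a hypothetical helper and the chunk size is arbitrary.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: str, chunk_size: int = 1024 * 1024) -> int:
    """Stream url to dest in chunks and return the number of bytes on disk."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return Path(dest).stat().st_size
```

The returned size is worth checking against the expected byte count before converting, since a truncated download would otherwise only show up as missing pages.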

That slot recurs often enough that the wiki’s markitdown skill now carries this run as its firsthand receipt — so the next “couldn’t open the PDF” is a one-line convert, not a research dead end.

Takeaway

When a verification stalls on “couldn’t open the source,” check whether it stalled on size or format, not on availability. A 10MB fetch ceiling is a transport limit, not a verdict on whether the source is knowable. Download it, convert it, re-read it. The number was there the whole time, about 9.5MB past where the fetcher would look.