Contextractor extracts clean, readable content from any web page, stripping away navigation, ads, and boilerplate to leave just the text you need. Its Markdown output typically runs 80–90% fewer tokens than the raw HTML, so it is cheap to feed to an LLM. It is free and open source, so you can download it and run it yourself. Use it to build LLM training datasets, RAG pipelines, or research corpora.

It uses a Rust port of Trafilatura, which scores the highest F1 (0.966) on the ScrapingHub benchmark, with Crawlee handling the crawling. The Rust port keeps Trafilatura's heuristic extraction core and adds ML page-type routing, so each page gets a matching extraction strategy. It runs at compiled Rust speed, with no Python and no GPU.

Run it as a hosted Apify actor, or self-host the npm CLI or npm library. Every channel uses the same engine and options.

Getting started

The easiest way to try Contextractor is a single npx command that extracts a page straight to your terminal. It needs no browser install (you can add one later) and no API key, because you host it yourself:

npx contextractor extract-one https://example.com/ --crawler-type cheerio

--crawler-type cheerio fetches over plain HTTP, so no headless browser is downloaded. Need a whole-site crawl or specific formats? Use the playground to build a more advanced command visually, then copy it.
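To extract a handful of pages without a full crawl, the same command can be wrapped in a small shell loop. The sketch below is an illustration, not part of Contextractor: it assumes extract-one writes the extracted text to standard output (as the "straight to your terminal" behavior suggests), and the helper names url_to_filename and extract_all are hypothetical.

```shell
# Hypothetical helper: turn a URL into a safe file name.
# Strips the scheme and one trailing slash, replaces anything outside
# [A-Za-z0-9._-] with "_", and appends ".txt".
url_to_filename() {
  printf '%s\n' "$1" | sed \
    -e 's#^[A-Za-z][A-Za-z0-9+.-]*://##' \
    -e 's#/$##' \
    -e 's#[^A-Za-z0-9._-]#_#g' \
    -e 's#$#.txt#'
}

# Hypothetical helper: extract each URL into its own file under a directory.
# Usage: extract_all OUT_DIR URL [URL...]
extract_all() {
  out_dir="$1"
  shift
  mkdir -p "$out_dir"
  for url in "$@"; do
    # Same command as shown above; assumes the extracted text goes to stdout.
    npx contextractor extract-one "$url" --crawler-type cheerio \
      > "$out_dir/$(url_to_filename "$url")" \
      || echo "extraction failed: $url" >&2
  done
}
```

For example, `extract_all pages https://example.com/ https://example.org/` would write one file per URL into the pages directory. For anything larger than a few pages, the whole-site crawl built in the playground is likely the better fit.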
