Web content extraction tool

Open-source Python tool for extracting clean text from web pages.

Contextractor playground

Preview extraction results, adjust Trafilatura settings, and generate ready-to-run commands.

What is Contextractor?

Contextractor is an open-source Python tool that extracts clean, readable content from web pages, stripping away navigation, ads, and boilerplate to leave just the text you need.

It uses Trafilatura, the highest-rated open-source content extraction library (F1 score 0.958), and is well suited to building LLM training datasets, RAG pipelines, and research applications.
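For readers unfamiliar with the metric, the sketch below shows how an F1 score for text extraction could be computed. It is only an illustration: this page does not say how the 0.958 figure was measured, so the token-level comparison and the function name `token_f1` are assumptions, not the benchmark's actual method.

```python
from collections import Counter


def token_f1(extracted: str, reference: str) -> float:
    """Token-level F1 between extracted text and a hand-labelled reference.

    Assumption: tokens are lowercase whitespace-separated words. Real
    benchmarks may tokenize or score differently.
    """
    ext = Counter(extracted.lower().split())
    ref = Counter(reference.lower().split())

    # Multiset intersection: tokens that appear in both texts.
    overlap = sum((ext & ref).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(ext.values())  # share of the output that is correct
    recall = overlap / sum(ref.values())     # share of the reference that was found
    return 2 * precision * recall / (precision + recall)


# One leftover navigation word ("menu") lowers precision but not recall.
score = token_f1("menu the quick brown fox", "the quick brown fox")
print(round(score, 3))
```

A score of 1.0 means the extracted text matches the reference exactly under this tokenization; leftover boilerplate lowers precision, and missing content lowers recall.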

Install via PyPI or NPM, run with Docker, or scale on Apify. Use the Playground to configure settings, preview results, and generate commands. The source code is available on GitHub.

What is Trafilatura?

Trafilatura is an open-source Python library that extracts the main content from web pages (article text, headings, and metadata) while stripping navigation, ads, sidebars, and footers. It uses a heuristic pipeline with fallback algorithms and consistently scores highest in independent extraction benchmarks. Contextractor uses Trafilatura as its extraction engine and adds a web interface and API on top of it.
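To make the idea of a heuristic pipeline with a fallback concrete, here is a deliberately simplified sketch that uses only the Python standard library. It is not Trafilatura's algorithm: the tag lists, the `min_chars` threshold, and the names `extract_text` and `_TextCollector` are all assumptions chosen for illustration. The primary heuristic keeps text inside `<article>` or `<main>`; if that yields too little, the fallback keeps all text outside obvious boilerplate tags.

```python
from html.parser import HTMLParser

# Hypothetical tag lists for this toy example only.
BOILERPLATE_TAGS = {"nav", "aside", "footer", "script", "style"}
MAIN_TAGS = {"article", "main"}


class _TextCollector(HTMLParser):
    """Collects text nodes while tracking main and boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # nesting depth inside boilerplate tags
        self.main_depth = 0   # nesting depth inside <article>/<main>
        self.main_text = []   # text found inside main containers
        self.all_text = []    # every text node outside boilerplate

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1
        elif tag in MAIN_TAGS:
            self.main_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in MAIN_TAGS and self.main_depth:
            self.main_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        self.all_text.append(text)
        if self.main_depth:
            self.main_text.append(text)


def extract_text(html: str, min_chars: int = 20) -> str:
    """Try the primary heuristic first, then fall back to a looser one."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()

    primary = "\n".join(parser.main_text)
    if len(primary) >= min_chars:
        return primary
    # Fallback: no usable <article>/<main>, so keep all non-boilerplate text.
    return "\n".join(parser.all_text)


page = (
    "<body><nav>Home | About</nav>"
    "<article><h1>Title</h1><p>This is the main article text.</p></article>"
    "<footer>Copyright</footer></body>"
)
print(extract_text(page))
```

A production extractor has to handle far messier markup than this; the sketch only shows why a second, looser pass is useful when the first heuristic finds nothing.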
