Open-source Python tool for extracting clean text from web pages.
Contextractor playground
Preview extraction results, adjust Trafilatura settings, and generate ready-to-run commands. Install via PyPI or NPM, run with Docker, or scale on Apify.
Contextractor is an open-source Python tool that extracts clean, readable content from any web page — stripping away navigation, ads, and boilerplate to leave just the text you need.
It uses Trafilatura, the highest-rated open-source content extraction library (F1 score 0.958). Ideal for building LLM training datasets, RAG pipelines, and research applications.
Install via PyPI or NPM, run with Docker, or scale on Apify. Use the Playground to configure settings, preview results, and generate commands. Source code on GitHub.
Trafilatura is an open-source Python library that extracts the main content from web pages (article text, headings, and metadata) while stripping navigation, ads, sidebars, and footers. It uses a heuristic pipeline with fallback algorithms and consistently scores highest in independent extraction benchmarks. Contextractor uses Trafilatura as its extraction engine and puts a web interface and API on top of it.
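To give a feel for the extraction step, here is a minimal sketch that calls Trafilatura directly from Python. It is not Contextractor's own code: the helper name `extract_main_text` and the `SAMPLE_HTML` page are made up for illustration, and the option values are just one plausible configuration. It assumes the `trafilatura` package is installed.

```python
from typing import Optional

# Hypothetical sample page: navigation and footer boilerplate around one article.
SAMPLE_HTML = """
<html>
  <head><title>Example article</title></head>
  <body>
    <nav><a href="/">Home</a> <a href="/about">About</a></nav>
    <article>
      <h1>Example article</h1>
      <p>This paragraph stands in for the main content of the page. A real page
      would carry several paragraphs of running text here, which is what an
      extractor is meant to keep.</p>
      <p>A second paragraph gives the extractor a little more text to work with,
      so the article body is clearly longer than the surrounding boilerplate.</p>
    </article>
    <footer>Copyright notice and footer links</footer>
  </body>
</html>
"""


def extract_main_text(html: str) -> Optional[str]:
    """Return the main text of an HTML document, or None if nothing was extracted.

    Illustrative helper (not part of Contextractor or Trafilatura).
    """
    # Imported inside the function so this module can still be loaded
    # on machines where trafilatura is not installed.
    import trafilatura

    # extract() takes the raw HTML string and returns plain text by default.
    # The two flags below are an assumed configuration: skip user comments,
    # keep table contents.
    return trafilatura.extract(
        html,
        include_comments=False,
        include_tables=True,
    )


if __name__ == "__main__":
    text = extract_main_text(SAMPLE_HTML)
    # Very short pages may yield None, so handle that case explicitly.
    print(text if text is not None else "No main content extracted.")
```

For a live page, `trafilatura.fetch_url(url)` can download the HTML first, and its result can then be passed to the same helper.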
Apify also has a very generous Creator plan (limited to running your own actors) that costs just $1/month (billed $6 semi-annually) and includes a one-time $500 platform credit for your first 6 months, with up to 32 GB RAM and 32 concurrent actor runs.