Contextractor — content extraction tool

Library

Guides and references for Contextractor — the web content extractor built on the Rust port of Trafilatura, Crawlee, and Playwright. These articles cover the save formats (text, Markdown, JSON, cleaned HTML, and raw page source) and how extraction compares to running a headless browser.

Skip the Headless Browser — When Extraction Beats Playwright

Most scraping projects default to Playwright or Selenium when a plain HTTP request would do. HTTP-based content extraction handles 50-200 pages per second on a single core — headless browsers manage 3-5. This article walks through when you actually need a browser and when you're burning RAM for nothing, with a decision tree and resource benchmarks to settle the question.
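One rough way to make that call in code is to fetch the page over plain HTTP first and check whether the response already contains enough visible text. The sketch below is a hypothetical heuristic, not the article's decision tree: the function name `needs_browser` and the 500-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts visible text characters, ignoring script/style/noscript bodies."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def needs_browser(html: str, min_visible_chars: int = 500) -> bool:
    """Return True when static HTML carries too little text to extract from.

    The threshold is an assumed value; tune it against your own pages.
    """
    counter = _TextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars < min_visible_chars
```

A page that returns True here is a candidate for a headless browser; everything else can stay on the cheaper HTTP path.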

HTML to Markdown for AI — Comparing 8 Conversion Approaches

Converting HTML to Markdown for LLM consumption isn't one problem — it's four. Rule-based converters like Turndown faithfully transform markup but keep all the boilerplate. Content extractors like Trafilatura strip the noise first, cutting token counts by 90%+. ML models like Jina's ReaderLM-v2 produce the cleanest output but need a GPU. Full-service APIs handle JavaScript rendering and anti-bot measures on top.

Trafilatura: High-Accuracy Web Content Extraction

Trafilatura is an open-source library that extracts the main content from web pages — article text, headings, and metadata — while stripping navigation, ads, sidebars, and footers. It uses a heuristic pipeline with fallback algorithms and is consistently one of the top-rated extractors in independent benchmarks. Contextractor runs the Rust port of Trafilatura via a napi-rs binding. The Rust port scores the highest F1 (0.966) on the ScrapingHub article set, ahead of the Go port (0.960) and the original Python implementation (0.958), while keeping the same heuristics.

HTML explained

HTML is what web pages are made of — a tree of nested elements that browsers render into the pages you see. Tim Berners-Lee created the first version with 18 tags at CERN in 1991. Today the WHATWG living standard defines over 100 elements. Contextractor outputs cleaned, extracted HTML — and a separate original option saves the raw page source before extraction, giving you the unmodified markup to process however you need.

JSON explained

JSON grew from JavaScript's object literal notation in Douglas Crockford's garage to the most widely used data interchange format on the internet. Contextractor's JSON output wraps the extracted text alongside metadata fields (title, author, date, site name, source URL) in a single structured object, ready for pipelines that need machine-parseable fields without regex.
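As a minimal sketch of consuming such an object, the snippet below parses one record into a typed structure. The key names (`text`, `title`, `author`, `date`, `siteName`, `sourceUrl`) are assumptions made for illustration; check the actual output for the real field names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedPage:
    """One extracted page. Field names are assumed, not taken from the tool."""

    text: str
    title: Optional[str] = None
    author: Optional[str] = None
    date: Optional[str] = None
    site_name: Optional[str] = None
    source_url: Optional[str] = None


def parse_page(raw: str) -> ExtractedPage:
    """Parse one JSON record; metadata keys are optional, text is required."""
    data = json.loads(raw)
    return ExtractedPage(
        text=data["text"],
        title=data.get("title"),
        author=data.get("author"),
        date=data.get("date"),
        site_name=data.get("siteName"),    # hypothetical key
        source_url=data.get("sourceUrl"),  # hypothetical key
    )


# Hypothetical record, written by hand for this example.
sample = json.dumps(
    {
        "text": "Body of the article.",
        "title": "Example article",
        "date": "2024-01-01",
        "sourceUrl": "https://example.com/post",
    }
)
page = parse_page(sample)
```

Missing metadata comes back as `None` rather than raising, so downstream code can filter on the fields it actually needs.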

Markdown explained

Markdown started as a Perl script by John Gruber in 2004 and became the default format for technical writing, documentation, and LLM pipelines. Its lightweight syntax preserves headings, lists, and links with minimal token overhead — roughly 10% more than plain text. Contextractor outputs Markdown by default because LLMs handle it natively, trained on millions of GitHub READMEs and Stack Overflow posts.

Content Formats for LLMs — Choosing What to Feed Your AI Pipeline

Plain text, Markdown, cleaned HTML, and JSON each carry different structural signals into your AI pipeline, and each costs a different number of tokens. Markdown adds roughly 10% token overhead compared with plain text while preserving headings and lists, making it the default for most LLM work. Cleaned HTML can outperform plain text for table-heavy RAG tasks, and JSON is the natural fit when your pipeline needs structured metadata fields.

Structured Data Extraction from HTML

CSS selectors and XPath extract structured data from HTML for fractions of a penny per page, but break when sites redesign. LLM-powered extraction adapts to any layout but costs 100-1000x more at scale. A hybrid pipeline — content extraction first, then LLM structuring on clean text — gets the best of both approaches while cutting LLM costs by 99%.
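The saving in the hybrid pipeline comes from sending fewer tokens to the LLM, so it can be estimated with simple arithmetic. The figures used below (50,000 tokens for a raw page, 500 after extraction, 1 USD per million tokens) are hypothetical inputs picked to illustrate the calculation, not measurements from the article.

```python
def llm_cost_usd(pages: int, tokens_per_page: int, usd_per_million_tokens: float) -> float:
    """Input-token cost of sending `pages` documents to an LLM."""
    return pages * tokens_per_page * usd_per_million_tokens / 1_000_000


def savings_fraction(raw_tokens: int, clean_tokens: int) -> float:
    """Share of LLM input cost removed by extracting content first."""
    return 1 - clean_tokens / raw_tokens


# Hypothetical inputs: one million pages, raw HTML vs. extracted text.
raw_cost = llm_cost_usd(1_000_000, 50_000, 1.0)
clean_cost = llm_cost_usd(1_000_000, 500, 1.0)
saved = savings_fraction(50_000, 500)
```

Under these assumed numbers the extracted text costs one hundredth of the raw HTML; the real ratio depends on how much boilerplate your pages carry.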

Heuristic vs. ML-Powered Extraction — Trafilatura vs. Jina ReaderLM

Trafilatura uses a multi-stage heuristic pipeline with fallback algorithms: no ML, no GPU, single-digit milliseconds per page. Jina's ReaderLM-v2 is a 1.54B-parameter transformer trained specifically on HTML-to-Markdown conversion; it offers better structural fidelity but requires a GPU and runs orders of magnitude slower. The SIGIR 2023 benchmark found heuristic extractors still outperform neural models on content extraction, though ReaderLM-v2 excels at preserving tables, nested lists, and document formatting that heuristics tend to flatten.

Trafilatura vs. Readability vs. Newspaper4k

Trafilatura, readability-lxml, and Newspaper4k are Python's three main open-source content extraction libraries, but they don't do the same thing. Trafilatura leads on F1 accuracy (0.958) with seven output formats and a fallback extraction chain. Newspaper4k is built for news articles with built-in NLP. readability-lxml gives you cleaned HTML and nothing else.

Plain text explained

Plain text is the simplest output format — just the extracted words with no markup, no formatting, no structural hints. It evolved from 7-bit ASCII through decades of competing code pages until UTF-8 unified everything. Contextractor's plain text output is ideal for embedding pipelines and classification tasks where every token should carry semantic meaning, not formatting syntax.

Cookie Consent Handling for Web Scrapers

Cookie consent banners inject dialog markup into the DOM, contaminate extracted text with "accept cookies" boilerplate, and can block entire pages behind consent walls. Handling them requires a two-layer approach: network-level blocking with filter lists like EasyList Cookie (via @ghostery/adblocker-playwright) to prevent CMP scripts from loading, and DOM-level interaction with tools like autoconsent for anything that slips through. Contextractor combines both layers in its crawler: network blocking through @ghostery/adblocker-playwright, plus DOM-level stripping of residual consent containers before extraction. Both are toggled by the closeCookieModals setting, and together they cover the majority of consent dialogs without per-site configuration.
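The DOM-level half of that approach can be sketched as a pass that drops any element whose id or class looks consent-related, along with everything nested inside it. This is an illustration only, not Contextractor's implementation: the marker list is a hypothetical stand-in for real filter rules, comments and doctypes are discarded, and the depth counting assumes the banner markup closes its tags properly.

```python
from html.parser import HTMLParser

# Hypothetical markers; a real filter list is far more specific.
MARKERS = ("cookie", "consent")
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class ConsentStripper(HTMLParser):
    """Re-emits HTML while skipping subtrees that look like consent dialogs."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a removed subtree

    def _is_consent(self, attrs) -> bool:
        return any(
            name in ("id", "class") and value
            and any(marker in value.lower() for marker in MARKERS)
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth += 1
        elif self._is_consent(attrs):
            if tag not in VOID:
                self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth and not self._is_consent(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag not in VOID:
                self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_consent(html: str) -> str:
    parser = ConsentStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Running this before content extraction keeps "accept cookies" text out of the result even when a banner's markup survives network-level blocking.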