
WebMainBench


WebMainBench is a high-precision benchmark for evaluating web main content extraction. It provides:

  • A 7,809-page, 100% human-annotated evaluation dataset covering 5,434 unique domains, 150 TLDs, and 46 languages.
  • A 545-sample subset with manually calibrated ground-truth markdown (groundtruth_content), enabling fine-grained metric evaluation across text, code, formula, and table dimensions.
  • A unified evaluation toolkit (webmainbench) that scores extractors with both ROUGE-N and content-type-specific edit-distance metrics.

WebMainBench is introduced in the paper Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM and serves as the primary benchmark for the MinerU-HTML project.

Architecture

WebMainBench Architecture

Core Modules:

| Module | Description |
| --- | --- |
| `data` | Dataset loading, saving, and sample management |
| `extractors` | Unified interface for content extractors and a factory registry |
| `metrics` | Edit-distance, TEDS, and ROUGE metric implementations |
| `evaluator` | Orchestrates extraction, scoring, and report generation |

Dataset Statistics

The full dataset (7,809 samples) is annotated at the HTML tag level through a rigorous 3-round process (annotator → reviewer → senior inspector).

Language Distribution (Top 10 of 46)

| Language | Count | % |
| --- | ---: | ---: |
| English | 6,711 | 85.09 |
| Chinese | 716 | 9.08 |
| Spanish | 61 | 0.77 |
| German | 51 | 0.65 |
| Japanese | 48 | 0.61 |
| Russian | 45 | 0.57 |
| French | 36 | 0.46 |
| Italian | 22 | 0.28 |
| Korean | 20 | 0.25 |
| Portuguese | 17 | 0.22 |

TLD Distribution (Top 10 of 150)

| TLD | Count | % |
| --- | ---: | ---: |
| .com | 4,550 | 57.69 |
| .org | 816 | 10.35 |
| .cn | 459 | 5.82 |
| .net | 318 | 4.03 |
| .uk | 235 | 2.98 |
| .edu | 180 | 2.28 |
| .de | 101 | 1.28 |
| .au | 94 | 1.19 |
| .ru | 69 | 0.87 |
| .gov | 59 | 0.75 |

Page Style & Difficulty

Pages are classified by GPT-5 into styles (Article, Content Listing, Forum, etc.) and assigned difficulty levels (Simple / Mid / Hard) based on DOM structural complexity, text distribution sparsity, content-type diversity, and link density.

Evaluation Metrics

WebMainBench supports two complementary evaluation protocols:

ROUGE-N F1 (primary metric from the paper)

All extracted content is converted to canonical Markdown via html2text, then scored with ROUGE-N (N=5, jieba tokenization). This is the metric reported in the Dripper paper.
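The scoring above can be sketched in plain Python. This is a simplified illustration of ROUGE-N F1 over pre-tokenized sequences; the actual toolkit additionally runs html2text conversion and jieba tokenization, which are omitted here:

```python
from collections import Counter

def rouge_n_f1(pred_tokens, gt_tokens, n=5):
    """ROUGE-N F1: n-gram overlap between prediction and ground truth."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred_ngrams = ngrams(pred_tokens, n)
    gt_ngrams = ngrams(gt_tokens, n)
    if not pred_ngrams or not gt_ngrams:
        return 0.0
    # Counter intersection takes the minimum count of each shared n-gram.
    overlap = sum((pred_ngrams & gt_ngrams).values())
    precision = overlap / sum(pred_ngrams.values())
    recall = overlap / sum(gt_ngrams.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With N=5, any page shorter than five tokens yields no n-grams and scores 0, which is one reason the metric is computed on full-page Markdown rather than on short fragments.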

Fine-Grained Edit-Distance Metrics (from this toolkit)

Computed on the 545-sample subset with manually calibrated groundtruth_content:

| Metric | Formula | Description |
| --- | --- | --- |
| `overall` | arithmetic mean of the five sub-metrics | Composite quality score |
| `text_edit` | 1 − edit_dist / max(len_pred, len_gt) | Plain-text similarity |
| `code_edit` | same, on code blocks only | Code content similarity |
| `formula_edit` | same, on formulas only | Formula content similarity |
| `table_edit` | same, on table text only | Table content similarity |
| `table_TEDS` | 1 − tree_edit_dist / max(nodes_pred, nodes_gt) | Table structure similarity |

All scores are in [0, 1]; higher is better.
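The edit-distance formula maps directly to code. A minimal sketch, assuming Levenshtein distance as the underlying edit distance (the usual choice for such metrics):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance, O(len(a)*len(b))."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(pred: str, gt: str) -> float:
    """1 - edit_dist / max(len_pred, len_gt); both empty counts as a perfect match."""
    if not pred and not gt:
        return 1.0
    return 1 - levenshtein(pred, gt) / max(len(pred), len(gt))
```

Because the distance is divided by the longer of the two strings, the score is always in [0, 1]: identical strings score 1.0, and completely disjoint strings of equal length score 0.0.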

Leaderboard

ROUGE-N F1 on Full Dataset (7,809 samples)

Results from the Dripper paper (Table 2):

| Extractor | Mode | All | Simple | Mid | Hard |
| --- | --- | ---: | ---: | ---: | ---: |
| DeepSeek-V3.2* | Html+MD | 0.9098 | 0.9415 | 0.9104 | 0.8771 |
| GPT-5* | Html+MD | 0.9024 | 0.9382 | 0.9042 | 0.8638 |
| Gemini-2.5-Pro* | Html+MD | 0.8979 | 0.9345 | 0.8978 | 0.8610 |
| Dripper_fallback | Html+MD | 0.8925 | 0.9325 | 0.8958 | 0.8477 |
| Dripper (0.6B) | Html+MD | 0.8779 | 0.9205 | 0.8804 | 0.8313 |
| magic-html | Html+MD | 0.7138 | 0.7857 | 0.7121 | 0.6434 |
| Readability | Html+MD | 0.6543 | 0.7415 | 0.6550 | 0.5652 |
| Trafilatura | Html+MD | 0.6402 | 0.7309 | 0.6417 | 0.5466 |
| Resiliparse | TEXT | 0.6290 | 0.7140 | 0.6323 | 0.5388 |

* Frontier models used as drop-in replacements within the Dripper pipeline.

Fine-Grained Metrics on 545-Sample Subset

| Extractor | Version | overall | text_edit | code_edit | formula_edit | table_edit | table_TEDS |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: |
| mineru-html | 4.1.1 | 0.8256 | 0.8621 | 0.9093 | 0.9399 | 0.6780 | 0.7388 |
| magic-html | 0.1.5 | 0.5141 | 0.7791 | 0.4117 | 0.7204 | 0.2611 | 0.3984 |
| trafilatura (md) | 2.0.0 | 0.3858 | 0.6887 | 0.1305 | 0.6242 | 0.1653 | 0.3203 |
| resiliparse | 0.14.5 | 0.2954 | 0.7381 | 0.0641 | 0.6747 | 0.0000 | 0.0000 |
| trafilatura (txt) | 2.0.0 | 0.2657 | 0.7126 | 0.0000 | 0.6162 | 0.0000 | 0.0000 |

Contributions of new extractor results are welcome — open a PR!

Quick Start

Installation

```bash
pip install webmainbench

# Or install from source
git clone https://github.com/opendatalab/WebMainBench.git
cd WebMainBench
pip install -e .
```

Download the Dataset

The dataset is hosted on Hugging Face: opendatalab/WebMainBench

```python
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="opendatalab/WebMainBench",
    repo_type="dataset",
    filename="WebMainBench_545.jsonl",
    local_dir="data/",
)
```

Run an Evaluation

```python
from webmainbench import DataLoader, Evaluator, ExtractorFactory

dataset = DataLoader.load_jsonl("data/WebMainBench_545.jsonl")
extractor = ExtractorFactory.create("trafilatura")

evaluator = Evaluator(llm_config={
    "use_llm": True,
    "llm_base_url": "https://api.openai.com/v1",
    "llm_api_key": "sk-xxxxxxxxxxxx",
    "llm_model": "gpt-4o",
})
result = evaluator.evaluate(dataset, extractor)

print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
```

If you don't need LLM-enhanced content splitting (for formula/table/code extraction), disable it explicitly:

```python
evaluator = Evaluator(llm_config={"use_llm": False})
```

Compare Multiple Extractors

```python
extractors = ["trafilatura", "resiliparse", "magic-html"]
results = evaluator.compare_extractors(dataset, extractors)

for name, result in results.items():
    print(f"{name}: {result.overall_metrics['overall']:.4f}")
```

A complete example is available at examples/multi_extractor_compare.py.

Dataset Format

Each JSONL line represents one web page:

```json
{
  "track_id": "0b7f2636-d35f-40bf-9b7f-94be4bcbb396",
  "url": "https://example.com/page",
  "html": "<html>...<h1 cc-select=\"true\">Title</h1>...</html>",
  "main_html": "<h1>Title</h1><p>Body text...</p>",
  "convert_main_content": "# Title\n\nBody text...",
  "groundtruth_content": "# Title\n\nBody text...",
  "meta": {
    "language": "en",
    "style": "Article",
    "level": "mid",
    "table": [],
    "code": ["interline"],
    "equation": ["inline"]
  }
}
```
| Field | Description |
| --- | --- |
| `track_id` | Unique sample identifier (UUID) |
| `url` | Original page URL |
| `html` | Full page HTML; human-annotated regions carry `cc-select="true"` |
| `main_html` | Ground-truth HTML subtree pruned from `html` (available for all 7,809 samples) |
| `convert_main_content` | Markdown converted from `main_html` via html2text (available for all 7,809 samples) |
| `groundtruth_content` | Manually calibrated ground-truth markdown (available for the 545-sample subset) |
| `meta.language` | Language code (`en`, `zh`, `es`, `de`, `ja`, `ko`, `ru`, …; 46 languages) |
| `meta.style` | Page style: `Article`, `Content Listing`, `Forum_or_Article_with_commentsection`, `Other` |
| `meta.level` | Complexity: `simple`, `mid`, `hard` |
| `meta.table` | Table types: `[]`, `["data"]`, `["layout"]`, `["data", "layout"]` |
| `meta.code` | Code types: `[]`, `["inline"]`, `["interline"]`, `["inline", "interline"]` |
| `meta.equation` | Formula types: `[]`, `["inline"]`, `["interline"]`, `["inline", "interline"]` |
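Because each line is self-describing JSON, subsets can be selected on these meta fields with the standard library alone. A small sketch (the field names follow the schema above; `load_samples` and `filter_samples` are illustrative helpers, not part of the toolkit's API):

```python
import json

def load_samples(path):
    """Read a WebMainBench JSONL file into a list of dicts, one per page."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def filter_samples(samples, level=None, style=None):
    """Keep samples whose meta.level / meta.style match the given values."""
    out = []
    for s in samples:
        meta = s.get("meta", {})
        if level and meta.get("level") != level:
            continue
        if style and meta.get("style") != style:
            continue
        out.append(s)
    return out

# e.g. hard forum pages only:
# hard_forums = filter_samples(load_samples("data/WebMainBench_545.jsonl"),
#                              level="hard",
#                              style="Forum_or_Article_with_commentsection")
```

This is useful for reproducing the per-difficulty columns of the leaderboard, which are simply the same metric computed on the `simple`, `mid`, and `hard` slices.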

Supported Extractors

| Extractor | Package | Output |
| --- | --- | --- |
| mineru-html | MinerU-HTML | HTML → Markdown |
| trafilatura | trafilatura | Markdown or plain text |
| resiliparse | resiliparse | Plain text |
| magic-html | magic-html | HTML |
| Custom | Inherit from `BaseExtractor` | Any |

Advanced Usage

Custom Extractor

```python
from webmainbench.extractors import BaseExtractor, ExtractionResult, ExtractorFactory

class MyExtractor(BaseExtractor):
    def _setup(self):
        pass

    def _extract_content(self, html, url=None):
        content = your_extraction_logic(html)
        return ExtractionResult(content=content, content_list=[], success=True)

ExtractorFactory.register("my-extractor", MyExtractor)
```

Custom Metric

```python
from webmainbench.metrics import BaseMetric, MetricResult

class CustomMetric(BaseMetric):
    def _setup(self):
        pass

    def _calculate_score(self, predicted, groundtruth, **kwargs):
        score = your_scoring_logic(predicted, groundtruth)
        return MetricResult(metric_name=self.name, score=score, details={})

evaluator.metric_calculator.add_metric("custom", CustomMetric("custom"))
```

Output Files

After evaluation, the following files are generated in results/:

| File | Description |
| --- | --- |
| `leaderboard.csv` | Per-extractor overall and per-metric scores |
| `evaluation_results.json` | Full evaluation details with metadata |
| `dataset_with_results.jsonl` | Original samples enriched with extraction outputs |

Project Structure

```text
webmainbench/
├── data/           # Dataset loading and saving
├── extractors/     # Extractor implementations and factory
├── metrics/        # Metric implementations and calculator
├── evaluator/      # Orchestrates extraction + scoring
└── utils/          # Logging and helper functions
```

Citation

If you use WebMainBench in your research, please cite the Dripper paper:

```bibtex
@misc{liu2025dripper,
    title   = {Dripper: Token-Efficient Main HTML Extraction with a Lightweight LM},
    author  = {Mengjie Liu and Jiahui Peng and Pei Chu and Jiantao Qiu and Ren Ma and He Zhu and Rui Min and Lindong Lu and Wenchang Ning and Linfeng Hou and Kaiwen Liu and Yuan Qu and Zhenxiang Li and Chao Xu and Zhongying Tu and Wentao Zhang and Conghui He},
    year    = {2025},
    eprint  = {2511.23119},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL},
    url     = {https://arxiv.org/abs/2511.23119},
}
```

License

This project is licensed under the Apache License 2.0 — see LICENSE for details.
