Wikiextractor | Awesome Repository

Wikiextractor is a Wikipedia dump parser and dataset preprocessor that extracts plain text and metadata from MediaWiki database dumps. It converts these archives into structured document files or line-delimited JSON objects for use in text corpora and machine learning datasets.

The utility includes a MediaWiki template expander that resolves template placeholders into their full text. It can also extract specific individual pages from a full archive without processing the entire dataset.
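To illustrate what template expansion involves, here is a minimal sketch in Python. It is not Wikiextractor's own expander: the TEMPLATES table, the expand_templates function, and the two template names are hypothetical, and only positional parameters such as {{{1}}} are handled. The idea is to replace innermost {{name|arg}} calls first and repeat until nothing changes, so nested calls resolve from the inside out.

```python
import re

# Hypothetical template definitions. In a real dump these would come from
# the Template: pages, not from a hard-coded table.
TEMPLATES = {
    "lang": "{{{2}}} ({{{1}}})",
    "nobr": "{{{1}}}",
}

# An innermost template call: {{ ... }} with no braces inside it.
INNERMOST = re.compile(r"\{\{([^{}]*)\}\}")
# A positional parameter inside a template body, e.g. {{{1}}}.
PARAM = re.compile(r"\{\{\{(\d+)\}\}\}")


def expand_once(match):
    """Expand a single innermost template call."""
    name, *args = [part.strip() for part in match.group(1).split("|")]
    body = TEMPLATES.get(name)
    if body is None:
        return ""  # assumption: unknown templates are simply dropped

    def fill(param):
        index = int(param.group(1)) - 1
        return args[index] if 0 <= index < len(args) else ""

    return PARAM.sub(fill, body)


def expand_templates(text, max_passes=10):
    """Repeat innermost expansion until the text stops changing."""
    for _ in range(max_passes):
        expanded = INNERMOST.sub(expand_once, text)
        if expanded == text:
            break
        text = expanded
    return text


result = expand_templates("The word {{lang|fr|{{nobr|bonjour}}}} is a greeting.")
print(result)
```

Real MediaWiki templates also use named parameters, parser functions, and recursion limits, none of which this sketch attempts.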

The system handles large-scale data processing through stream-based XML parsing and regex-based markup stripping to produce clean text. Extracted data is organized via document sharding and exported as JSON containing article IDs, revision IDs, URLs, titles, and body text.
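The pipeline described above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the tool's implementation: the sample dump omits the XML namespace that real MediaWiki exports carry, the JSON key names and the url_base value are placeholders, and the two regular expressions cover only links and bold/italic quotes.

```python
import io
import json
import re
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump; real exports are far larger and namespaced.
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>Example</title>
    <id>12</id>
    <revision>
      <id>345</id>
      <text>'''Example''' is a [[test page|page]] about [[parsing]].</text>
    </revision>
  </page>
</mediawiki>"""


def strip_markup(wikitext):
    """Rough regex cleanup; real wikitext needs many more rules."""
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", wikitext)
    # Remove runs of two or more apostrophes (bold and italic markers).
    text = re.sub(r"'{2,}", "", text)
    return text.strip()


def iter_documents(stream, url_base="https://example.org/wiki?curid="):
    """Yield one dict per <page> as soon as its closing tag is parsed."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            page_id = elem.findtext("id")
            yield {
                "id": page_id,
                "revid": elem.findtext("revision/id"),
                "url": url_base + page_id,  # placeholder URL scheme
                "title": elem.findtext("title"),
                "text": strip_markup(elem.findtext("revision/text") or ""),
            }
            elem.clear()  # release the finished page's children and text


def to_json_lines(stream):
    """Serialize each document as one JSON object per line."""
    return [json.dumps(doc) for doc in iter_documents(stream)]


lines = to_json_lines(io.BytesIO(SAMPLE_DUMP))
print("\n".join(lines))
```

Because iterparse emits each page as soon as its end tag is read, memory use depends on the size of a single article rather than the whole archive.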

Features

Wikipedia Tools - Converts Wikipedia database dumps into plain text files containing individual documents for each article.
Corpus Preprocessing - Cleans MediaWiki markup from raw dumps to create a text-only corpus for training machine learning models.
Stream-Based Parsing - Processes large database dumps sequentially using an event-driven parser to avoid loading entire files into memory.
Wikipedia Content Expansion - Replaces MediaWiki template placeholders with full content to ensure accurate text representation of articles.
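
Splitting extracted documents across several output files can be sketched as a size-based grouping step. The function name shard_documents and the max_bytes threshold are hypothetical, and the sketch groups documents in memory rather than writing files, so it shows only the splitting logic.

```python
def shard_documents(documents, max_bytes):
    """Group documents into shards whose UTF-8 size stays within max_bytes.

    A document larger than max_bytes still gets a shard of its own.
    """
    shards, current, current_size = [], [], 0
    for doc in documents:
        size = len(doc.encode("utf-8"))
        # Start a new shard when adding this document would exceed the limit.
        if current and current_size + size > max_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.append(doc)
        current_size += size
    if current:
        shards.append(current)
    return shards


docs = ["a" * 40, "b" * 40, "c" * 40, "d" * 200]
shards = shard_documents(docs, max_bytes=100)
print([len(shard) for shard in shards])
```

Documents are never split across shards here, which keeps every article intact at the cost of slightly uneven shard sizes.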
