Wikiextractor is a Wikipedia dump parser and dataset preprocessor that extracts plain text and metadata from MediaWiki database dumps. It converts these archives into structured document files or line-delimited JSON objects suitable for building text corpora and machine learning datasets.
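As a rough illustration of how a downstream consumer might read line-delimited JSON, the sketch below parses one object per line. The key names `id`, `title`, and `text` and the sample records are assumptions made for this example; check them against the files your run actually produces.

```python
import io
import json
from typing import Iterator, TextIO


def iter_documents(stream: TextIO) -> Iterator[dict]:
    """Yield one dict per non-empty line of a line-delimited JSON stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample standing in for one extracted output file.
sample = io.StringIO(
    '{"id": "1", "title": "Example A", "text": "First body."}\n'
    '{"id": "2", "title": "Example B", "text": "Second body."}\n'
)
titles = [doc["title"] for doc in iter_documents(sample)]
```

Because each line is an independent JSON object, a corpus can be consumed lazily, one document at a time, without loading a whole file into memory.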
The utility includes a MediaWiki template expander that resolves template calls into the text they produce. It can also extract specific individual pages from a full archive without processing the entire dataset.
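To give a feel for what template expansion involves, here is a deliberately small sketch. It is not the tool's own expander: the `TEMPLATES` table, the `expand` function, and the handling of unknown templates are all hypothetical, and real MediaWiki templates (named parameters, parser functions, defaults) are far richer. The sketch only handles positional `{{{n}}}` parameters and resolves nested calls from the inside out.

```python
import re

# Hypothetical template definitions: name -> body with {{{n}}} positional parameters.
TEMPLATES = {
    "greet": "Hello, {{{1}}}!",
    "bold": "'''{{{1}}}'''",
}

# Matches an innermost template call, i.e. one with no braces inside it.
CALL = re.compile(r"\{\{([^{}]*)\}\}")
# Matches a positional parameter placeholder such as {{{1}}}.
PARAM = re.compile(r"\{\{\{(\d+)\}\}\}")


def expand(text: str, max_passes: int = 10) -> str:
    """Expand template calls repeatedly until the text stops changing."""

    def replace_call(match: re.Match) -> str:
        name, *args = [part.strip() for part in match.group(1).split("|")]
        body = TEMPLATES.get(name)
        if body is None:
            return ""  # assumption: unknown templates are dropped

        def fill(param: re.Match) -> str:
            index = int(param.group(1))
            return args[index - 1] if 1 <= index <= len(args) else ""

        return PARAM.sub(fill, body)

    for _ in range(max_passes):  # bound the work in case of runaway input
        expanded = CALL.sub(replace_call, text)
        if expanded == text:
            break
        text = expanded
    return text


print(expand("{{greet|{{bold|World}}}}"))  # Hello, '''World'''!
```

Expanding innermost calls first keeps the regular expression simple, at the cost of one pass per nesting level.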
The system handles large dumps through stream-based XML parsing and strips markup with regular expressions to produce clean text. Extracted documents are sharded across output files and exported as JSON records containing the article ID, revision ID, URL, title, and body text.
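The pipeline described above can be sketched end to end in a few dozen lines. This is an illustrative approximation under stated assumptions, not the tool's implementation: `iter_pages`, `strip_markup`, and `shard_documents` are hypothetical names, the regular expressions cover only simple templates, links, and bold/italic quotes, the sample dump is invented, and shards are split by document count and kept in memory (a real run would write each shard to its own file).

```python
import io
import json
import re
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Drop the XML namespace prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Stream (id, title, wikitext) tuples without loading the whole dump."""
    page_id = title = text = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "id" and page_id is None:
            page_id = elem.text  # the first <id> inside <page> is the page id
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield page_id, title, text
            page_id = title = text = None
            elem.clear()  # release the subtree that was just processed


def strip_markup(wikitext: str) -> str:
    """Remove a small subset of wiki markup with regular expressions."""
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)  # simple, non-nested templates
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)  # keep link label
    text = re.sub(r"'{2,}", "", text)  # bold and italic quote runs
    return text.strip()


def shard_documents(docs, docs_per_shard: int):
    """Group JSON lines into shards of at most docs_per_shard documents."""
    shards, current = [], []
    for doc in docs:
        current.append(json.dumps(doc, ensure_ascii=False))
        if len(current) == docs_per_shard:
            shards.append("\n".join(current))
            current = []
    if current:
        shards.append("\n".join(current))
    return shards


# Invented two-page dump used only to exercise the sketch.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Alpha</title><id>1</id>
    <revision><id>100</id><text>'''Alpha''' is a [[letter|Greek letter]].{{stub}}</text></revision>
  </page>
  <page><title>Beta</title><id>2</id>
    <revision><id>200</id><text>Beta follows [[Alpha]].</text></revision>
  </page>
</mediawiki>"""

docs = [
    {"id": pid, "title": title, "text": strip_markup(raw)}
    for pid, title, raw in iter_pages(io.BytesIO(SAMPLE))
]
shards = shard_documents(docs, docs_per_shard=1)
```

Parsing with `iterparse` and clearing each page after it is emitted keeps memory use roughly proportional to one page rather than to the whole dump, which is the point of a stream-based approach.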