# attardi/wikiextractor

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/attardi-wikiextractor).**

3,970 stars · 1,006 forks · Python · agpl-3.0

## Links

- GitHub: https://github.com/attardi/wikiextractor
- awesome-repositories: https://awesome-repositories.com/repository/attardi-wikiextractor.md

## Description

WikiExtractor is a Wikipedia dump parser and dataset preprocessor. It extracts plain text and metadata from MediaWiki database dumps and writes the result as structured document files or line-delimited JSON, for use in text corpora and machine learning datasets.
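As an illustration of consuming the line-delimited JSON output, the sketch below reads one record per line. The key names (`id`, `revid`, `url`, `title`, `text`) and the sample record are assumptions made for this example and should be checked against the files the tool actually produces.

```python
import io
import json


def read_articles(lines):
    """Yield one dict per non-empty line of line-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical sample record; the key names are assumed for illustration.
sample = io.StringIO(
    '{"id": "12", "revid": "345", '
    '"url": "https://en.wikipedia.org/wiki?curid=12", '
    '"title": "Anarchism", "text": "Anarchism is a political philosophy."}\n'
)

articles = list(read_articles(sample))
print(articles[0]["title"], len(articles[0]["text"]))
```

Reading lazily, one line at a time, keeps memory use flat even when a shard holds many articles.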

The utility includes a MediaWiki template expander that resolves template placeholders into their full text. It can also extract individual pages from a full archive without processing the entire dump.
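The idea behind template expansion can be sketched with a toy resolver. This is not the project's expander: the `TEMPLATES` store, the `expand` function, and the `max_depth` limit are hypothetical, and the sketch ignores template parameters and parser functions. It substitutes innermost `{{name}}` calls repeatedly until the text stops changing.

```python
import re

# Hypothetical template store: name -> body. Real MediaWiki templates also
# take parameters and parser functions, which this sketch ignores.
TEMPLATES = {
    "greeting": "Hello, {{name}}!",
    "name": "world",
}

# Matches an innermost {{name}} call: no braces or pipes inside.
TEMPLATE_CALL = re.compile(r"\{\{([^{}|]+)\}\}")


def expand(text, templates, max_depth=10):
    """Substitute {{name}} calls until none remain or max_depth is reached."""
    for _ in range(max_depth):
        expanded = TEMPLATE_CALL.sub(
            lambda m: templates.get(m.group(1).strip(), ""), text
        )
        if expanded == text:
            break  # fixed point: nothing left to expand
        text = expanded
    return text


result = expand("{{greeting}} See {{missing}}.", TEMPLATES)
print(result)
```

Unknown templates expand to an empty string here, and the depth limit guards against templates that reference each other in a cycle.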

To handle large dumps, the tool parses the XML as a stream and strips markup with regular expressions to produce clean text. Output is split into shards and can be exported as JSON records containing the article ID, revision ID, URL, title, and body text.
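A minimal sketch of the two techniques named above, stream parsing and regex-based stripping, might look like the following. It uses the standard library's `xml.etree.ElementTree.iterparse` as a stand-in and does not claim to mirror the project's own parser. The sample dump, `strip_markup`, and `iter_pages` are hypothetical, the three patterns cover only a small part of wiki markup, and real dumps declare an XML namespace that this sample omits.

```python
import io
import re
import xml.etree.ElementTree as ET

# Tiny hypothetical dump; real dumps carry an XML namespace and more fields.
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>Example</title>
    <id>1</id>
    <revision>
      <id>100</id>
      <text>'''Example''' is a [[test page|page]] about [[wikis]].</text>
    </revision>
  </page>
</mediawiki>"""


def strip_markup(wikitext):
    """Remove a few common wiki constructs with regular expressions."""
    # [[target|label]] -> label
    text = re.sub(r"\[\[[^\[\]|]*\|([^\[\]]*)\]\]", r"\1", wikitext)
    # [[target]] -> target
    text = re.sub(r"\[\[([^\[\]|]*)\]\]", r"\1", text)
    # Bold and italic quote runs
    text = re.sub(r"'{2,}", "", text)
    return text


def iter_pages(stream):
    """Yield one record per <page>, freeing each element after use."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            yield {
                "id": elem.findtext("id"),
                "revid": elem.findtext("revision/id"),
                "title": elem.findtext("title"),
                "text": strip_markup(elem.findtext("revision/text") or ""),
            }
            elem.clear()  # drop the subtree so memory does not grow


pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP)))
print(pages[0])
```

Because each page element is cleared once its record has been built, memory use stays roughly proportional to the largest single page rather than to the whole dump.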

## Tags

### Part of an Awesome List

- [Wikipedia Tools](https://awesome-repositories.com/f/awesome-lists/devtools/wikipedia-tools.md) — Converts Wikipedia database dumps into plain text files containing individual documents for each article. ([source](https://github.com/attardi/wikiextractor/wiki/File-Format))
- [Stream-Based Parsing](https://awesome-repositories.com/f/awesome-lists/data/html-and-xml-parsing/xml-parsing/stream-based-parsing.md) — Processes large database dumps sequentially using an event-driven parser to avoid loading entire files into memory.
- [Wikipedia Content Expansion](https://awesome-repositories.com/f/awesome-lists/devtools/wikipedia-tools/wikipedia-content-expansion.md) — Replaces MediaWiki template placeholders with full content to ensure accurate text representation of articles. ([source](https://github.com/attardi/wikiextractor#readme))
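
For the plain text output, a reader might look like the sketch below. It assumes each article is wrapped in a `<doc id="..." url="..." title="...">` element with exactly those attributes in that order, which should be verified against the linked File-Format page. The sample content and the `parse_docs` helper are hypothetical.

```python
import re

# Sample in the assumed <doc ...> ... </doc> layout (hypothetical content).
SAMPLE = '''<doc id="1" url="https://example.org/wiki?curid=1" title="First">
First article text.
</doc>
<doc id="2" url="https://example.org/wiki?curid=2" title="Second">
Second article text.
</doc>
'''

DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<text>.*?)\n</doc>',
    re.DOTALL,
)


def parse_docs(content):
    """Return one dict per <doc> block found in the given string."""
    return [m.groupdict() for m in DOC_RE.finditer(content)]


docs = parse_docs(SAMPLE)
print([d["title"] for d in docs])
```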

### Artificial Intelligence & ML

- [Corpus Preprocessing](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/machine-learning-datasets/natural-language-processing-datasets/corpus-preprocessing.md) — Cleans MediaWiki markup from raw dumps to create a text-only corpus for training machine learning models.

### Content Management & Publishing

- [Plain Text Extraction](https://awesome-repositories.com/f/content-management-publishing/plain-text-persistence/document-text-extractors/plain-text-extraction.md) — Parses database backup files to remove markup and save cleaned plain text into manageable files. ([source](https://github.com/attardi/wikiextractor#readme))
- [Wiki Template Resolvers](https://awesome-repositories.com/f/content-management-publishing/text-template-processing/wiki-template-resolvers.md) — Resolves complex MediaWiki templates into their full text representation during the extraction process.
- [Page Content Retrievals](https://awesome-repositories.com/f/content-management-publishing/page-content-retrievals.md) — Enables the retrieval and cleaning of individual Wikipedia articles from a full archive.

### Data & Databases

- [Database Dump Parsers](https://awesome-repositories.com/f/data-databases/database-dump-parsers.md) — Extracts plain text and metadata from MediaWiki database dumps into structured document files.
- [Markup Stripping](https://awesome-repositories.com/f/data-databases/markup-stripping.md) — Uses regular expression patterns to strip wiki-specific formatting and metadata to produce clean plain text.
- [Direct-Access Data Extraction](https://awesome-repositories.com/f/data-databases/direct-access-data-extraction.md) — Implements a mechanism to isolate specific articles from a dump file by skipping to the relevant byte offset.
- [Single-Record Extraction](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-extraction/single-file-extraction/single-record-extraction.md) — Isolates and processes specific pages from a larger dump file instead of the entire dataset. ([source](https://github.com/attardi/wikiextractor/blob/master/README.md))
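
The byte-offset idea can be illustrated with a small sketch: record where each page starts, then seek straight to one page instead of scanning the whole file. This is an assumption-laden toy, not the project's mechanism. `build_index` and `read_page` are hypothetical helpers, and the sample dump puts each page on a single line for simplicity.

```python
import io

# Hypothetical miniature dump with one page per line.
DUMP = (
    b"<mediawiki>\n"
    b"<page><title>A</title><text>alpha</text></page>\n"
    b"<page><title>B</title><text>beta</text></page>\n"
    b"</mediawiki>\n"
)


def build_index(stream):
    """Map each page title to the byte offset of its opening <page> tag."""
    index = {}
    stream.seek(0)
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        if line.startswith(b"<page>"):
            start = line.index(b"<title>") + len(b"<title>")
            end = line.index(b"</title>")
            index[line[start:end].decode("utf-8")] = offset
    return index


def read_page(stream, offset):
    """Seek to a known offset and read until the closing </page> tag."""
    stream.seek(offset)
    chunks = []
    for line in stream:
        chunks.append(line)
        if b"</page>" in line:
            break
    return b"".join(chunks).decode("utf-8")


stream = io.BytesIO(DUMP)
index = build_index(stream)
page_b = read_page(stream, index["B"])
print(index, page_b)
```

Building the index still costs one full pass, so this only pays off when the index is saved and reused for many lookups.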

### Software Engineering & Architecture

- [Recursive Template Resolution](https://awesome-repositories.com/f/software-engineering-architecture/recursive-variable-expansion/recursive-template-resolution.md) — Resolves complex MediaWiki template placeholders by recursively fetching and inserting the corresponding content.
