1 repository
Techniques for retrieving specific records from large archives using byte offsets to avoid full file scanning.
Distinct from PDF Page Extraction: this category covers byte-offset based access to large archive files, whereas related categories focus on web pages or PDFs.
Explore 1 GitHub repository matching data & databases · Direct-Access Data Extraction.
WikiExtractor is a Wikipedia dump parser and dataset preprocessor designed to extract plain text and metadata from MediaWiki database dumps. It converts these archives into structured document files or line-delimited JSON objects for use in text corpora and machine learning datasets. The utility includes a MediaWiki template expander that resolves template placeholders into their full text representation. It can also extract individual pages from a full archive without processing the entire dataset.
Implements a mechanism to isolate specific articles from a dump file by skipping to the relevant byte offset.
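The general idea of byte-offset access can be sketched as follows. This is a minimal illustration, not the repository's actual implementation: the tab-separated record layout and the helper names `build_offset_index` and `read_record` are assumptions made for the example. One pass records where each record starts, and later lookups seek straight to that position instead of rescanning the file.

```python
import io


def build_offset_index(stream):
    """Map each record key to the byte offset where its line starts.

    Assumes (hypothetically) one record per line, formatted as
    "<key>\t<body>\n". Real dump formats differ.
    """
    index = {}
    offset = stream.tell()
    # readline() returns b"" at end of stream, which stops the loop.
    for line in iter(stream.readline, b""):
        key = line.split(b"\t", 1)[0].decode("utf-8")
        index[key] = offset
        offset = stream.tell()
    return index


def read_record(stream, index, key):
    """Jump to the stored offset and read only that one record."""
    stream.seek(index[key])
    line = stream.readline()
    _, _, body = line.rstrip(b"\n").partition(b"\t")
    return body.decode("utf-8")


# A small in-memory stand-in for a large archive file opened in binary mode.
archive = io.BytesIO(
    b"Alpha\tfirst article\n"
    b"Beta\tsecond article\n"
    b"Gamma\tthird article\n"
)

index = build_offset_index(archive)
beta_text = read_record(archive, index, "Beta")
```

With a real file, the index would typically be built once and saved, so that each later lookup costs a single seek and a short read regardless of archive size.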