Postlight Parser is a command-line tool that extracts the main article content from a web page URL and returns clean, structured data (title, author, date, excerpt, and lead image) with ads and clutter stripped away. It uses a readability-based heuristic that scores HTML elements on text density and structural cues to identify the article body. It can also parse a pre-fetched HTML string directly instead of fetching the URL.
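To make the scoring idea concrete, here is a minimal Python sketch of a readability-style heuristic. It is not the parser's own code: the candidate tags, hint words, weight, and function names are all assumptions chosen for illustration. Each container is scored by the text it directly holds, discounted by link density, and nudged by class/id hints.

```python
import re
from html.parser import HTMLParser

# All names and weights below are hypothetical, chosen only for illustration.
CANDIDATE_TAGS = {"div", "article", "section", "main"}
POSITIVE_HINTS = {"article", "content", "post", "body", "entry"}
NEGATIVE_HINTS = {"ad", "ads", "sidebar", "footer", "nav", "comment", "promo"}
HINT_WEIGHT = 25


class CandidateScorer(HTMLParser):
    """Collects text and link-text lengths for each container element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # containers that are currently open
        self.finished = []   # containers that have been closed
        self.link_depth = 0  # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        elif tag in CANDIDATE_TAGS:
            attr_map = dict(attrs)
            label = f"{attr_map.get('class') or ''} {attr_map.get('id') or ''}".strip()
            # Split "post-content" into {"post", "content"} so hints match whole words.
            tokens = set(re.split(r"[\s_-]+", label.lower())) - {""}
            bonus = HINT_WEIGHT * (len(tokens & POSITIVE_HINTS) - len(tokens & NEGATIVE_HINTS))
            self.stack.append({"tag": tag, "label": label, "text": 0, "link": 0, "bonus": bonus})

    def handle_endtag(self, tag):
        if tag == "a":
            self.link_depth = max(0, self.link_depth - 1)
        elif tag in CANDIDATE_TAGS and self.stack:
            self.finished.append(self.stack.pop())

    def handle_data(self, data):
        chars = len(data.strip())
        if chars and self.stack:
            node = self.stack[-1]  # credit text to the innermost open container
            node["text"] += chars
            if self.link_depth:
                node["link"] += chars


def score(node):
    """Text length discounted by link density, plus the class/id hint bonus."""
    if node["text"] == 0:
        return 0.0
    link_density = node["link"] / node["text"]
    return node["text"] * (1 - link_density) + node["bonus"]


def pick_article_body(html):
    """Return the highest-scoring container, or None if there are no candidates."""
    scorer = CandidateScorer()
    scorer.feed(html)
    scorer.close()
    return max(scorer.finished, key=score, default=None)


SAMPLE = """
<div class="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<article class="post-content">
  <p>Parsers of this kind favour long runs of prose over short, link-heavy blocks.</p>
  <p>A second paragraph with one <a href="/ref">reference</a> keeps link density low.</p>
</article>
<div class="sidebar ad"><a href="/buy">Buy now</a></div>
"""

best = pick_article_body(SAMPLE)
print(best["tag"], best["label"])
```

In the sample, the navigation and sidebar blocks consist entirely of link text, so their density term drops to zero and only the negative hint bonus remains, while the article block keeps most of its text length.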
The tool distinguishes itself through a modular architecture that supports domain-specific extractor overrides, allowing custom JavaScript modules to be loaded at runtime for particular domains to replace the generic extraction routine. It also enables CSS selector field injection, letting users extend parsed results with additional custom data by specifying selectors for single or multiple matches via CLI flags. Custom HTTP headers can be attached to every page request for authentication or site-specific requirements.
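The override mechanism can be pictured as a lookup table keyed by domain, with the generic routine as the fallback. The Python sketch below is an assumption-laden illustration, not the tool's actual API: `add_extractor`, `select_extractor`, the returned dictionary shape, and the subdomain fallback rule are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical registry mapping a domain to the callable that handles it.
REGISTRY = {}


def generic_extractor(html):
    """Stand-in for the generic, heuristic-based extraction routine."""
    return {"extractor": "generic", "content": html}


def add_extractor(domain, extractor):
    """Register a custom extractor for one domain."""
    REGISTRY[domain.lower()] = extractor


def select_extractor(url):
    """Pick the most specific registered extractor, else the generic one.

    Assumption: subdomains fall back to their parent domain, so
    blog.example.com is tried first and example.com second.
    """
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    for i in range(max(1, len(parts) - 1)):
        candidate = ".".join(parts[i:])
        if candidate in REGISTRY:
            return REGISTRY[candidate]
    return generic_extractor


def example_com_extractor(html):
    """Hypothetical site-specific routine that replaces the generic one."""
    return {"extractor": "example.com", "content": html.strip()}


add_extractor("example.com", example_com_extractor)

print(select_extractor("https://blog.example.com/post/1")("  <p>Hi</p>  "))
print(select_extractor("https://other.org/article")("<p>Hi</p>"))
```

Keeping the registry as plain data means an override is loaded once at startup and the generic path is untouched for every other domain.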
The parser offers a multi-format output pipeline that converts extracted HTML content into Markdown or plain text through a post-processing step before returning the result. All extraction functionality is exposed through a command-line interface that accepts a URL along with format and header flags, so no code needs to be written.
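The post-processing step can be sketched as a single conversion function applied to the extracted HTML just before it is returned. The Python below is a simplified illustration under stated assumptions: `convert`, its `fmt` parameter, and the small set of supported tags are hypothetical, and a real converter would cover far more of HTML.

```python
import re
from html.parser import HTMLParser

BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li"}


class _Renderer(HTMLParser):
    """Flattens a small HTML subset; markdown=True keeps headings, emphasis, links."""

    def __init__(self, markdown):
        super().__init__()
        self.markdown = markdown
        self.out = []
        self.hrefs = []  # href values of the <a> elements currently open

    def handle_starttag(self, tag, attrs):
        if not self.markdown:
            return
        if tag in ("h1", "h2", "h3"):
            self.out.append("#" * int(tag[1]) + " ")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a":
            self.hrefs.append(dict(attrs).get("href") or "")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.out.append("\n\n")  # blank line between blocks
        if not self.markdown:
            return
        if tag in ("strong", "b"):
            self.out.append("**")
        elif tag in ("em", "i"):
            self.out.append("*")
        elif tag == "a" and self.hrefs:
            self.out.append(f"]({self.hrefs.pop()})")

    def handle_data(self, data):
        # Collapse whitespace runs but keep single spaces between inline elements.
        self.out.append(re.sub(r"\s+", " ", data))


def convert(html, fmt="html"):
    """Post-process extracted HTML into 'html', 'markdown', or 'text' (hypothetical names)."""
    if fmt == "html":
        return html
    if fmt not in ("markdown", "text"):
        raise ValueError(f"unsupported format: {fmt}")
    renderer = _Renderer(markdown=(fmt == "markdown"))
    renderer.feed(html)
    renderer.close()
    blocks = [block.strip() for block in "".join(renderer.out).split("\n\n")]
    return "\n\n".join(block for block in blocks if block)


EXTRACTED = '<h2>Title</h2><p>Hello <em>world</em>, see <a href="https://example.com">this</a>.</p>'

print(convert(EXTRACTED, "markdown"))
print(convert(EXTRACTED, "text"))
```

Because conversion runs after extraction, the extraction logic only ever has to produce HTML, and each output format is an independent final stage.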