# postlight/parser

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/postlight-parser).**

5,786 stars · 527 forks · JavaScript · Apache-2.0

## Links

- GitHub: https://github.com/postlight/parser
- Homepage: https://reader.postlight.com
- awesome-repositories: https://awesome-repositories.com/repository/postlight-parser.md

## Topics

`jest` `labs` `mercury` `mercury-parser` `node` `parser` `parser-library` `rollup`

## Description

Postlight Parser is a JavaScript library and command-line tool that extracts the main article content from a web page URL. It returns clean structured data, including the title, author, publication date, excerpt, and lead image, while stripping away ads and clutter. It uses a readability-based heuristic that scores HTML elements on text density and structural cues to identify the article body. It can also parse a pre-fetched HTML string directly instead of fetching the URL itself.
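To make the scoring idea concrete, here is a minimal sketch of a readability-style heuristic. It is not Postlight Parser's actual algorithm: the function names, the weights, and the pre-summarized candidate objects (`tag`, `text`, `linkText`) are all assumptions chosen for illustration.

```javascript
// Toy content scorer. NOT the library's real algorithm; weights are arbitrary.
// Each candidate is a hypothetical, pre-extracted summary of one HTML element.
function scoreCandidate({ tag, text, linkText }) {
  const length = text.length;
  if (length === 0) return 0;

  // Link density: share of the element's text that sits inside links.
  const linkDensity = Math.min(linkText.length / length, 1);

  // Text density cue: longer, comma-rich text looks more like prose.
  const commas = (text.match(/,/g) || []).length;
  let score = Math.min(Math.floor(length / 100), 3) + commas;

  // Structural cue: favor containers that usually hold articles.
  if (tag === 'article' || tag === 'main') score += 5;
  if (tag === 'nav' || tag === 'footer' || tag === 'aside') score -= 5;

  // Heavily linked blocks (menus, related-story lists) lose points.
  return score - linkDensity * 10;
}

// Return the candidate with the highest score.
function pickArticleBody(candidates) {
  return candidates.reduce((best, candidate) =>
    scoreCandidate(candidate) > scoreCandidate(best) ? candidate : best
  );
}

const candidates = [
  { tag: 'nav', text: 'Home News Sports Opinion', linkText: 'Home News Sports Opinion' },
  {
    tag: 'article',
    text: 'The council met on Tuesday, debated the budget, and voted late in the evening.',
    linkText: '',
  },
  { tag: 'aside', text: 'Related: budget history, council members', linkText: 'budget history, council members' },
];

const best = pickArticleBody(candidates);
console.log(best.tag);
```

A real implementation would walk the DOM and propagate scores to parent elements, but the three signals shown (text length, link ratio, and tag-based structural hints) are the ones the description names.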

The tool distinguishes itself through a modular architecture that supports domain-specific extractor overrides, allowing custom JavaScript modules to be loaded at runtime for particular domains to replace the generic extraction routine. It also enables CSS selector field injection, letting users extend parsed results with additional custom data by specifying selectors for single or multiple matches via CLI flags. Custom HTTP headers can be attached to every page request for authentication or site-specific requirements.
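The options above map onto CLI flags. The sketch below assembles an argument list for the `postlight-parser` command. The helper `buildParserArgs` is hypothetical, and the flag spellings (`--format`, `--header.<name>`, `--extend`, `--extend-list`, `--add-extractor`) are taken from the project README as best understood here, so check them against the version you have installed.

```javascript
// Hypothetical helper: builds an argv array for the `postlight-parser` CLI.
// Flag names are assumed from the project README; verify before relying on them.
function buildParserArgs({ url, format, headers = {}, extend = {}, extendList = {}, extractor }) {
  const args = [url];

  // Output format, e.g. "html", "markdown", or "text".
  if (format) args.push(`--format=${format}`);

  // One flag per custom HTTP header sent with the page request.
  for (const [name, value] of Object.entries(headers)) {
    args.push(`--header.${name}=${value}`);
  }

  // Extra fields filled from the first element matching a CSS selector.
  for (const [field, selector] of Object.entries(extend)) {
    args.push('--extend', `${field}=${selector}`);
  }

  // Extra fields filled from every element matching a CSS selector.
  for (const [field, selector] of Object.entries(extendList)) {
    args.push('--extend-list', `${field}=${selector}`);
  }

  // Path to a custom extractor module loaded at runtime.
  if (extractor) args.push('--add-extractor', extractor);

  return args;
}

const args = buildParserArgs({
  url: 'https://example.com/post',
  format: 'markdown',
  headers: { Cookie: 'session=abc' },
  extend: { credit: 'p:last-child em' },
  extendList: { categories: '.meta__tags-list a' },
  extractor: './my-extractor.js',
});
console.log(args.join(' '));
```

Passing the array to `child_process.execFile('postlight-parser', args, callback)` avoids shell quoting problems with selectors that contain spaces. The URL, cookie value, selectors, and extractor path are placeholders.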

The parser offers a multi-format output pipeline: a post-processing step converts the extracted HTML content into Markdown or plain text before the result is returned. All extraction functionality is available through a command-line interface that accepts URL, format, and header flags, so no code has to be written to use it.
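The post-processing step can be pictured as a small dispatcher that runs after extraction. This is a toy sketch, not the library's converter: the function names are hypothetical, the regex-based conversions cover only a handful of tags on a single line, and production code would use a proper HTML parser.

```javascript
// Toy post-processing step. Real converters handle far more HTML than this.
function decodeEntities(s) {
  // Decode &amp; last so "&amp;lt;" becomes "&lt;" rather than "<".
  return s.replace(/&lt;/g, '<').replace(/&gt;/g, '>').replace(/&amp;/g, '&');
}

function toText(html) {
  const stripped = html
    .replace(/<\/(p|h[1-6]|li|div)>/gi, '\n') // block endings become line breaks
    .replace(/<[^>]+>/g, '');                 // drop every remaining tag
  return decodeEntities(stripped).replace(/\n{2,}/g, '\n').trim();
}

function toMarkdown(html) {
  const converted = html
    .replace(/<h1[^>]*>(.*?)<\/h1>/gi, '# $1\n\n')
    .replace(/<(strong|b)>(.*?)<\/\1>/gi, '**$2**')
    .replace(/<a [^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/gi, '[$2]($1)')
    .replace(/<p[^>]*>(.*?)<\/p>/gi, '$1\n\n')
    .replace(/<[^>]+>/g, ''); // drop anything not handled above
  return decodeEntities(converted).trim();
}

// Choose the output representation after the article body has been extracted.
function formatContent(html, contentType = 'html') {
  switch (contentType) {
    case 'html':
      return html;
    case 'markdown':
      return toMarkdown(html);
    case 'text':
      return toText(html);
    default:
      throw new Error(`Unsupported content type: ${contentType}`);
  }
}

const html =
  '<h1>Title</h1><p>Read <a href="https://example.com">this</a> &amp; <strong>that</strong>.</p>';

console.log(formatContent(html, 'markdown'));
console.log(formatContent(html, 'text'));
```

Keeping conversion as a separate final step means the extraction logic only ever has to produce HTML, and new output formats can be added without touching it.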

## Tags

### Content Management & Publishing

- [Web Article Extraction](https://awesome-repositories.com/f/content-management-publishing/web-article-extraction.md) — Extracts main article content, title, author, and metadata from any web page URL while stripping ads and clutter.
- [Heuristic Extraction Methods](https://awesome-repositories.com/f/content-management-publishing/full-text-content-extraction/heuristic-extraction-methods.md) — Identifies main content by scoring HTML elements on text density, link ratio, and structural cues to isolate the article body.
- [Web-to-Markdown Conversions](https://awesome-repositories.com/f/content-management-publishing/markdown-conversions/web-to-markdown-conversions.md) — Returns the extracted article body as GitHub-flavored Markdown instead of raw HTML. ([source](https://github.com/postlight/parser/blob/main/README.md))
- [Markdown to Plain Text Converters](https://awesome-repositories.com/f/content-management-publishing/markdown-to-plain-text-converters.md) — Returns the extracted article body as GitHub-flavored Markdown or plain text instead of the default HTML. ([source](https://cdn.jsdelivr.net/gh/postlight/parser@main/README.md))
- [Multi-Format Output Converters](https://awesome-repositories.com/f/content-management-publishing/multi-format-output-converters.md) — Returns extracted article body as HTML, Markdown, or plain text based on a user-specified output format.

### Development Tools & Productivity

- [HTML Parsing Command Line Tools](https://awesome-repositories.com/f/development-tools-productivity/html-parsing-command-line-tools.md) — Runs the extraction from a terminal and outputs the structured result directly to stdout. ([source](https://github.com/postlight/parser/blob/main/README.md))
- [Web Content Parser CLI](https://awesome-repositories.com/f/development-tools-productivity/web-content-parser-cli.md) — Ships a terminal-based tool for retrieving and parsing web page content with format and header flags.

### User Interface & Experience

- [Command Line Interface Design](https://awesome-repositories.com/f/user-interface-experience/command-line-interface-design.md) — Exposes all extraction functionality through a command-line tool that accepts URL, format, and header flags.
- [Customizable HTML Parsers](https://awesome-repositories.com/f/user-interface-experience/customizable-html-parsers.md) — Accepts pre-fetched HTML strings and custom site extractors loaded at runtime for domain-specific parsing.
- [CSS Selector Field Injections](https://awesome-repositories.com/f/user-interface-experience/field-customization/custom-data-fields/css-selector-field-injections.md) — Extends the parsed result with custom data by specifying CSS selectors for single or multiple matches via CLI flags. ([source](https://github.com/postlight/parser))

### Part of an Awesome List

- [Custom Extractor Implementations](https://awesome-repositories.com/f/awesome-lists/data/document-parsing-and-extraction/document-text-extractors/custom-extractor-implementations.md) — Applies a user-defined extractor script to a URL during parsing without modifying the library's source code. ([source](https://cdn.jsdelivr.net/gh/postlight/parser@main/README.md))

### Data & Databases

- [CSS Selector](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-transformation/data-parsing-extraction/field-extractors/css-selector.md) — Adds user-defined fields to parsed output by specifying CSS selectors for single or multiple matches.

### Software Engineering & Architecture

- [Extractors](https://awesome-repositories.com/f/software-engineering-architecture/integration-extensibility/extensibility/plugin-architectures/domain-specific/extractors.md) — Loads custom JavaScript modules at runtime for particular domains to replace the generic extraction routine.
- [Site-Specific Extractors](https://awesome-repositories.com/f/software-engineering-architecture/site-specific-extractors.md) — Registers a custom extraction script for a specific domain at runtime without modifying the library's source code.
- [CSS Selector Data Extractors](https://awesome-repositories.com/f/software-engineering-architecture/syntax-query-definitions/css-selector-engines/css-selector-data-extractors.md) — Extends parsed results with additional fields by evaluating user-supplied CSS selectors against the page DOM.
