30 open-source projects similar to postlight/parser, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Parser alternative.
Defuddle is a command line web parser and content extractor designed to isolate the primary article body from web pages and convert the result into standardized markdown. It functions as a content cleaner that removes layout clutter, such as sidebars and headers, to retrieve the main text and associated metadata. The tool provides a terminal interface that processes content from remote URLs, local files, or piped HTML streams. It supports custom content targeting, allowing users to specify CSS selectors to manually define the main content area when automatic detection is insufficient. The sy
Readability is a JavaScript library designed for web content extraction. It functions as a DOM parsing utility and article metadata extractor that isolates the primary text of a webpage by removing clutter such as advertisements and navigation bars. The library employs a heuristic-based content detector to predict if a webpage contains a parseable article before performing full extraction. It uses a parsing workflow to convert complex HTML documents into a simplified format, facilitating the implementation of distraction-free reader views. The tool covers several capability areas, including
markdown-clipper is a browser extension that converts website content into markdown files for offline storage and personal knowledge bases. It functions as a content extractor and HTML to markdown converter that removes layout clutter to isolate primary text. The tool includes a specific integration for sending clipped web content directly into vaults and folders within the Obsidian note-taking application. It also supports batch processing to convert all open browser tabs into individual markdown files. The extension covers a broad range of extraction capabilities, including capturing selec
Pup is a command line tool for extracting and filtering data from HTML documents using CSS selectors. It functions as a parser and selector engine that isolates specific elements based on tags, IDs, classes, and attributes. The project provides utilities for converting selected HTML nodes into plain text, attribute values, or structured JSON objects. It includes a markup formatter that corrects missing tags and applies consistent indentation to improve the readability of HTML documents. The tool handles the retrieval of text content and attributes through a CSS selector engine, supporting co
Spider is a web-based platform designed for automated data extraction, providing a centralized framework to collect, process, and route structured information from websites. It functions as a comprehensive pipeline that manages the entire lifecycle of data gathering, from initial configuration to final storage in external databases or message queues. The platform distinguishes itself through a visual configuration interface that allows users to define extraction rules and manage scraping templates without writing custom code. It supports both static and dynamic content retrieval by integratin
Percollate is a command-line tool for converting web pages and RSS feeds into structured files. It functions as a web content converter, static document generator, and page bundler that transforms online content into PDF, EPUB, HTML, or Markdown formats. The tool creates self-contained documents by embedding external images as encoded data URLs and applying custom HTML templates and CSS stylesheets. It can combine multiple web URLs or feed entries into a single digital book featuring a generated table of contents and hyperlinked index. Additional capabilities include the decomposition of Ato
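The image-embedding step mentioned in this entry can be illustrated with a minimal Python sketch. It is not Percollate's own implementation: the helper names `to_data_url` and `inline_image`, the sample bytes, and the example URL are all assumptions made for illustration.

```python
import base64


def to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_image(html: str, src: str, data: bytes, mime_type: str) -> str:
    """Replace one external image reference with an embedded data URL."""
    return html.replace(f'src="{src}"', f'src="{to_data_url(data, mime_type)}"')


# Hypothetical image bytes; a real tool would download them from `src`.
fake_png = b"\x89PNG\r\n\x1a\n"
page = '<img src="https://example.com/a.png" alt="a">'
embedded = inline_image(page, "https://example.com/a.png", fake_png, "image/png")
```

Because the image travels inside the markup, the resulting document has no external dependency and can be opened offline.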
Trafilatura is a Python library and command-line tool for extracting clean, structured text and metadata from web pages. It downloads HTML content, identifies the main body of text, and strips away navigation, ads, and other boilerplate, returning the core article content along with fields like title, author, date, and URL. The tool can also extract user comments and test whether a page contains extractable text, making it a general-purpose web text extraction library. What distinguishes Trafilatura from simpler extractors is its configurable extraction pipeline, which offers high-speed, high
This project is a Node.js web scraping framework designed to automate data extraction through a programmatic workflow of requests, parsing, and document interaction. It functions as a headless web crawler, an HTTP request manager, and a DOM parser and extractor. The framework distinguishes itself by combining a JavaScript execution engine to interact with dynamic content and a hybrid selection system that utilizes both CSS and XPath selectors. It includes specialized middleware for proxy rotation and cookie-jar session management to maintain authenticated states and manage automated traffic.
python-goose is a Python library for web scraping and content extraction. It functions as an HTML boilerplate remover and article parser designed to isolate primary text and metadata from web pages by stripping away navigation, layout noise, and non-essential elements. The tool features multilingual processing capabilities, utilizing language-specific stop-word analyzers to identify and extract primary content across different languages. It also identifies and collects embedded media, including source URLs and embed codes for lead images and videos associated with an article. The library cov
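A rough Python sketch of how stop-word counts can separate article prose from navigation text is shown here. It is an assumption-laden illustration, not python-goose's actual algorithm: the tiny `STOP_WORDS` lists and the function names are hypothetical.

```python
import re

# Tiny illustrative stop-word lists; real analyzers ship far larger ones.
STOP_WORDS = {
    "en": {"the", "a", "of", "and", "to", "in", "is", "that", "it", "was"},
    "es": {"el", "la", "de", "y", "que", "en", "un", "es", "se", "los"},
}


def stop_word_count(text: str, lang: str) -> int:
    """Count how many words in `text` are stop words for `lang`."""
    words = re.findall(r"\w+", text.lower())
    return sum(1 for word in words if word in STOP_WORDS[lang])


def pick_main_block(blocks: list[str], lang: str) -> str:
    """Return the block with the most stop words, a proxy for running prose."""
    return max(blocks, key=lambda block: stop_word_count(block, lang))


blocks = [
    "Home | About | Contact | Login",
    "The council said that the bridge was closed in March "
    "and that it is to reopen in the spring.",
]
main = pick_main_block(blocks, "en")
```

Menus and link lists contain few function words, so even this crude count tends to favor sentences over layout noise.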
Wechatsync is a multi-platform content synchronizer and cross-platform publishing tool. It extracts articles from webpages and distributes them to multiple social media and blogging platforms simultaneously. The system utilizes a web content extractor with reader-mode logic to strip advertisements and navigation elements from source pages. The project employs a markdown content pipeline that converts extracted web content into a standardized format for editing before redistribution. It features an automated media migrator that performs host-to-host image migration, downloading images from sou
pterm is a Go terminal UI library used to build rich command-line interfaces. It provides toolsets for terminal data visualization, operation progress tracking, interactive user input, and structured logging. The library distinguishes itself through a comprehensive set of visual tools, including a framework for interactive terminal prompts such as selection menus and confirmation dialogs, and a specialized system for rendering bar charts, heatmaps, and tree structures. It also includes a structured terminal logger capable of producing leveled, colorful system messages. The project covers bro
ReadYou is a self-hosted reading application and RSS feed aggregator that centralizes content from multiple web sources. It functions as a full-text RSS reader, extracting the complete body text from web pages to provide a distraction-free reading experience. The application includes specialized accessibility and speed tools, such as a bionic reading mode that uses pattern-based text highlighting to guide the eye and a text-to-speech system for audio content consumption. The project covers comprehensive subscription management through OPML import and export, feed categorization, and keyword-
yargs is a command-line interface framework and argument parser for Node.js. It translates raw command-line strings into structured JavaScript objects, providing a toolkit for building terminal applications with nested sub-commands, dedicated handlers, and a structured user interface. The framework distinguishes itself through automated help text generation, which constructs formatted usage menus and instructions based on registered metadata. It also provides shell completion generation for Bash and Zsh and uses string-distance algorithms to offer typo correction suggestions when invalid inpu
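Typo correction based on string distance can be sketched in Python as follows. Levenshtein distance is used here as one common string-distance measure; whether yargs uses this exact measure is not stated in the description, and the `suggest` helper, the `max_distance` threshold, and the command names are hypothetical.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(unknown: str, known: list[str], max_distance: int = 2) -> Optional[str]:
    """Return the closest known command, or None if nothing is close enough."""
    best = min(known, key=lambda name: levenshtein(unknown, name))
    return best if levenshtein(unknown, best) <= max_distance else None


# Hypothetical registered commands.
commands = ["install", "init", "publish"]
hint = suggest("instal", commands)
```

The distance threshold keeps the tool from proposing an unrelated command when the input resembles nothing that is registered.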
spectre.console is a .NET console UI library and command-line interface framework used to build rich user interfaces for console applications. It functions as a console text markup processor and terminal layout engine to provide advanced text styling and structured data organization. The library enables the creation of visually organized interfaces using a system of tables, grids, and panels. It utilizes a dedicated markup language to apply colors and styles to terminal output, allowing for the rendering of formatted text and complex visual structures.
react-blessed is a React renderer for the blessed library that enables the construction of interactive command-line interfaces using a component-based architecture. It functions as a terminal user interface framework that maps a virtual component tree to a terminal environment, allowing React's declarative state management to control blessed terminal widgets and layout nodes. The system supports the integration of custom renderers through a dedicated creation function to change how nodes are instantiated. It provides a mechanism to retrieve original terminal library objects through references
This project is a comprehensive resource directory for web data extraction, providing a curated collection of tools and libraries for parsing data, automating browsers, and managing network operations. It serves as a guide for extracting structured information from HTML, XML, JSON, and PDF formats. The toolkit focuses on advanced data collection strategies, including headless browser automation to interact with JavaScript and a suite of network utilities for DNS resolution and WebSocket connections. It specifically covers methods for bypassing bot protections through proxy pool management, us
This project is an LLM knowledge base builder and personal knowledge management tool. It is a desktop application designed to transform diverse documents into a persistent, interlinked wiki through LLM analysis and incremental ingestion. The system distinguishes itself with a knowledge graph visualizer that uses community detection algorithms to map relationships between concepts and identify topical clusters. It features a hybrid retrieval system that combines keyword matching, vector embeddings, and graph relevance to locate information. The platform covers a wide range of capabilities inc
bilibili-api is a Bilibili API wrapper and content scraper designed for programmatically accessing video metadata, user profiles, and content data. It functions as an anti-bot crawler framework and a WebSocket live chat client for retrieving platform information and real-time interaction data. The project incorporates tools to bypass anti-crawling measures and rate limits through the use of proxies and TLS fingerprint spoofing. It also includes logic for mapping and converting various video and content identifiers to ensure consistent data retrieval across different endpoints. Its capability
htmlq is a suite of command-line utilities for querying and extracting data from HTML documents using CSS selectors. It functions as a query language tool for HTML structures and attributes, providing a way to retrieve specific information from documents via the terminal. The tool provides capabilities for extracting text content, specific HTML attributes, and document fragments. It includes an HTML document formatter for cleaning and reformatting output with consistent indentation, as well as utilities for stripping tags to isolate plain text. The software handles structural HTML processing
goquery is a Go HTML parsing library and CSS selector engine used to isolate and retrieve specific text or attributes from HTML documents. It functions as an HTML DOM manipulator that converts raw HTML strings into a structured tree for programmatic navigation and search. The library provides a fluent interface for chaining selection and filtering operations and utilizes a wrapper-based abstraction to simplify data extraction and manipulation of nodes. It employs an iterator-based processing mechanism to apply operations to every node within a matched selection. Its primary capabilities cove
This project is a collection of resources and utilities for macOS terminal customization, providing a set of color schemes, a palette previewer, and theme conversion tools. Its primary purpose is to manage the visual appearance of the default macOS Terminal application. The toolkit includes a theme converter for transforming color scheme files between different terminal application formats. It also features a previewer that uses escape sequences to generate visual representations of color palettes for validation. The system covers broader capabilities for command line interface visual design
WeChatDownload is a content archiving tool for WeChat Official Accounts that enables automated batch downloading of articles, comments, collections, and embedded media assets. It extracts account identifiers and session keys from a single shared article link, then iterates through paginated article lists to retrieve all historical content without requiring separate login credentials. The tool distinguishes itself through its comprehensive capture capabilities, including comment threads, reply chains, and entire article collections alongside the main content. It provides granular control over
so-novel is a web novel downloader and scraping engine designed to extract structured text from websites and convert it into electronic book formats. It functions as a multi-interface content extractor, providing a shared backend accessible via a web-based management dashboard, a terminal user interface, and a command line interface. The system utilizes a rule-driven approach for data extraction, using CSS selectors and XPath rules defined in external configuration files to map web elements to specific data fields. To maintain access to content, it includes a proxy-routed request pipeline to
colors.js is a Node.js terminal color library and console text styling tool. It serves as an ANSI escape code wrapper, providing a high-level API to apply foreground and background colors, styles, and decorative patterns to console output. The library includes a terminal output formatter capable of creating specialized visual effects, such as rainbow and zebra patterns. It employs a mechanism to automatically detect terminal color support or allow for manual overrides of visual styling. The tool covers a broad range of text formatting, including text emphasis attributes like bold, italic, un
scrape-it is a Node.js web scraper and HTML parser designed to extract structured data from websites and HTML files. It functions as a web data extraction tool that retrieves specific information from DOM elements and converts web content into usable data fields. The tool uses CSS selectors to target specific data points and employs schema-driven data mapping to organize unstructured web text into a consistent format. It supports custom value transformation to convert raw extracted strings into specific data formats. The system provides capabilities for web data extraction and automated cont
This project is a collection of Python implementations for web scraping, network traffic interception, data analysis, and sentiment analysis. It provides methods for extracting structured data from websites and mobile application interfaces. The collection includes tools for capturing and analyzing network packets from mobile applications to identify hidden internal API endpoints. It also features scripts for evaluating the emotional tone and public perception of text data. The project covers data manipulation and transformation of large datasets, as well as the generation of charts and grap
This project is an interactive tutorial generator and static site generator that transforms source documents, such as Markdown and Google Docs, into structured instructional guides. It functions as a documentation conversion tool that compiles source content into static HTML assets and metadata for distribution to public or private audiences. The system utilizes a custom element UI framework to embed interactive instructional components using standard HTML custom elements, removing the need for external JavaScript frameworks. It supports multi-format content export, allowing a single source o
Quarto is an open-source scientific and technical publishing system built on Pandoc that converts Markdown and Jupyter notebooks into a wide range of output formats. It functions as a multi-format document converter, a reproducible research platform, a static site generator for technical content, and an interactive dashboard builder, all within a single framework. The system is distinguished by its ability to produce HTML, PDF, Word, ePub, and slide decks from a single Markdown source, while embedding executable code blocks in Python, R, Julia, or Observable for dynamic, reproducible document
This is an ANSI terminal color library and console output manager used for applying colors and text attributes to command line interface output. It functions as a terminal text styler and RGB color formatter, generating the escape codes necessary for foreground and background styling. The project supports 24-bit true-color RGB mapping for precise color rendering in compatible terminal emulators. It enables the creation of colorized text fragments that can be embedded into larger blocks of text or applied as global styles that persist across subsequent output streams. The library covers broad
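The escape codes behind 24-bit foreground and background styling can be illustrated with a short Python sketch. This is not the library's API: the function names `fg_rgb`, `bg_rgb`, and `colorize` are hypothetical, while the escape sequences themselves follow the widely supported `38;2;r;g;b` and `48;2;r;g;b` SGR forms.

```python
from typing import Optional, Tuple

RGB = Tuple[int, int, int]
RESET = "\x1b[0m"  # SGR 0: clear all colors and attributes


def fg_rgb(r: int, g: int, b: int) -> str:
    """Escape code that sets a 24-bit foreground color."""
    return f"\x1b[38;2;{r};{g};{b}m"


def bg_rgb(r: int, g: int, b: int) -> str:
    """Escape code that sets a 24-bit background color."""
    return f"\x1b[48;2;{r};{g};{b}m"


def colorize(text: str, fg: Optional[RGB] = None, bg: Optional[RGB] = None) -> str:
    """Wrap `text` in color codes and reset styling afterwards."""
    prefix = ""
    if fg is not None:
        prefix += fg_rgb(*fg)
    if bg is not None:
        prefix += bg_rgb(*bg)
    return f"{prefix}{text}{RESET}"


# A colorized fragment embedded in a larger, unstyled line.
print("status: " + colorize("ok", fg=(0, 200, 80)) + " (3 files)")
```

Appending the reset code makes the fragment self-contained; omitting it would let the style persist across subsequent output, which corresponds to the global-style behavior the description mentions.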