# ericchiang/pup

8,427 stars · 264 forks · HTML · MIT

## Links

- GitHub: https://github.com/ericchiang/pup
- awesome-repositories: https://awesome-repositories.com/repository/ericchiang-pup.md

## Description

Pup is a command-line tool for extracting and filtering data from HTML documents using CSS selectors. It parses the document and uses a selector engine to isolate elements by tag, ID, class, or attribute.
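
To make the idea of selecting by tag, ID, class, and attribute concrete, here is a minimal Python sketch built on the standard library's `html.parser`. It is not pup's implementation (pup is a separate program with its own engine). The names `parse_selector`, `matches`, and `select` are hypothetical, and the sketch only handles a single compound selector such as `a.external[href]`, with no combinators.

```python
import re
from html.parser import HTMLParser

# Hypothetical tokenizer for ONE compound selector: tag, #id, .class, [attr], [attr=value].
TOKEN = re.compile(
    r'(?P<tag>^[a-zA-Z][\w-]*)'
    r'|#(?P<id>[\w-]+)'
    r'|\.(?P<cls>[\w-]+)'
    r'|\[(?P<attr>[\w-]+)(?:=(?P<val>[^\]]+))?\]'
)


def parse_selector(selector):
    """Turn a compound selector string into a dict of requirements."""
    spec = {"tag": None, "id": None, "classes": set(), "attrs": {}}
    for m in TOKEN.finditer(selector):
        if m.group("tag"):
            spec["tag"] = m.group("tag").lower()
        elif m.group("id"):
            spec["id"] = m.group("id")
        elif m.group("cls"):
            spec["classes"].add(m.group("cls"))
        elif m.group("attr"):
            val = m.group("val")
            # None means "attribute must be present, any value".
            spec["attrs"][m.group("attr")] = val.strip("\"'") if val else None
    return spec


def matches(spec, tag, attrs):
    """Check one start tag against the parsed requirements."""
    attrs = dict(attrs)
    if spec["tag"] and spec["tag"] != tag:
        return False
    if spec["id"] and attrs.get("id") != spec["id"]:
        return False
    classes = set((attrs.get("class") or "").split())
    if not spec["classes"] <= classes:
        return False
    for name, want in spec["attrs"].items():
        if name not in attrs:
            return False
        if want is not None and attrs[name] != want:
            return False
    return True


class _Collector(HTMLParser):
    def __init__(self, selector):
        super().__init__()
        self.spec = parse_selector(selector)
        self.found = []

    def handle_starttag(self, tag, attrs):
        if matches(self.spec, tag, attrs):
            self.found.append((tag, dict(attrs)))


def select(html, selector):
    """Return (tag, attrs) pairs for every element matching the selector."""
    collector = _Collector(selector)
    collector.feed(html)
    collector.close()
    return collector.found
```

Calling `select(page, "a.external[href]")` on a string of HTML returns every `a` element that carries the `external` class and an `href` attribute, in document order.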

The project provides utilities for converting selected HTML nodes into plain text, attribute values, or structured JSON objects. It includes a markup formatter that corrects missing tags and applies consistent indentation to improve the readability of HTML documents.
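
The three output forms mentioned above (plain text, attribute values, JSON) can be illustrated with a small Python sketch using only the standard library. This is an assumption-laden stand-in, not pup's code: the tree shape, the JSON layout, and the helper names `parse`, `find_all`, and `text_of` are all invented for illustration and do not reflect pup's actual output format.

```python
import json
from html.parser import HTMLParser

# Elements that never have a closing tag or children.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Builds a nested dict tree: {"tag", "attrs", "children"}."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#root", "attrs": {}, "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1]["children"].append(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_all(node, tag):
    """Yield every descendant element with the given tag, in document order."""
    for child in node["children"]:
        if isinstance(child, dict):
            if child["tag"] == tag:
                yield child
            yield from find_all(child, tag)


def text_of(node):
    """Concatenate the text of a node and all of its descendants."""
    parts = [c if isinstance(c, str) else text_of(c) for c in node["children"]]
    return " ".join(p for p in parts if p)


if __name__ == "__main__":
    page = '<ul><li><a href="/a">First <b>link</b></a></li><li><a href="/b">Second</a></li></ul>'
    links = list(find_all(parse(page), "a"))
    print([text_of(n) for n in links])          # plain text per node
    print([n["attrs"].get("href") for n in links])  # one attribute per node
    print(json.dumps(links, indent=2))          # structured JSON
```

The same selected nodes feed all three outputs, which is the point of the design: selection happens once, and the output form is chosen afterwards.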

Text content and attributes are retrieved through the CSS selector engine, which supports complex selectors and combinators. Character encoding is either detected automatically or taken from a specified charset, so that extracted text is decoded correctly.
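
As a rough illustration of "specified charset, otherwise detect", the following Python sketch decodes raw HTML bytes. The detection strategy here (byte-order mark, then a `<meta>` charset declaration in the first 1024 bytes, then a UTF-8 fallback) is a simplified assumption and is not taken from pup's source; `decode_html` is a hypothetical name.

```python
import codecs
import re

# Matches both <meta charset="..."> and the older http-equiv form.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?\s*([\w-]+)', re.IGNORECASE)


def decode_html(raw, charset=None):
    """Decode HTML bytes using an explicit charset, or a simple guess."""
    if charset is None:
        if raw.startswith(codecs.BOM_UTF8):
            charset = "utf-8-sig"  # strips the byte-order mark while decoding
        else:
            found = META_CHARSET.search(raw[:1024])
            charset = found.group(1).decode("ascii") if found else "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name: fall back instead of failing the extraction.
        return raw.decode("utf-8", errors="replace")
```

An explicit `charset` argument always wins over detection, which mirrors the usual reason for offering one: declarations inside a page can be missing or wrong.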

## Tags

### Development Tools & Productivity

- [CLI Data Extractors](https://awesome-repositories.com/f/development-tools-productivity/cli-data-extractors.md) — Provides a command line interface for extracting specific elements from HTML using CSS selectors. ([source](https://github.com/ericchiang/pup/blob/master/parse.go))
- [Command Line HTML Parsers](https://awesome-repositories.com/f/development-tools-productivity/command-line-html-parsers.md) — Provides a terminal-based tool for extracting specific data from HTML via CSS selectors.
- [HTML Parsing Command Line Tools](https://awesome-repositories.com/f/development-tools-productivity/html-parsing-command-line-tools.md) — Ships a command line interface for retrieving and filtering specific elements from HTML documents.
- [HTML Formatters](https://awesome-repositories.com/f/development-tools-productivity/html-formatters.md) — Includes a markup formatter that corrects missing tags and applies consistent indentation to improve HTML readability.

### Software Engineering & Architecture

- [CSS Selector Engines](https://awesome-repositories.com/f/software-engineering-architecture/syntax-query-definitions/css-selector-engines.md) — Implements a CSS-style selector engine for isolating specific HTML elements based on tags, IDs, classes, and attributes.
- [CSS Selector Data Extractors](https://awesome-repositories.com/f/software-engineering-architecture/syntax-query-definitions/css-selector-engines/css-selector-data-extractors.md) — Uses precise CSS selectors to filter and collect specific elements from HTML documents. ([source](https://github.com/ericchiang/pup/blob/master/pup.go))
- [DOM to JSON Serialization](https://awesome-repositories.com/f/software-engineering-architecture/abstract-syntax-tree-tools/internal-tree-representations/json-serializers/dom-to-json-serialization.md) — Transforms selected HTML elements into machine-readable JSON objects for interoperability.

### Artificial Intelligence & ML

- [CSS Selector-Based Element Isolation](https://awesome-repositories.com/f/artificial-intelligence-ml/metadata-extraction/element/extraction-element-filters/css-selector-based-element-isolation.md) — Supports complex CSS selectors and combinators to precisely isolate specific HTML elements. ([source](https://github.com/ericchiang/pup/blob/master/README.md))

### Part of an Awesome List

- [HTML Parsing](https://awesome-repositories.com/f/awesome-lists/data/html-parsing.md) — Provides a parser that constructs a DOM by automatically correcting malformed HTML markup.
- [HTML and XML Processing](https://awesome-repositories.com/f/awesome-lists/devtools/html-and-xml-processing.md) — Queries HTML pages with CSS selectors.

### Business & Productivity Software

- [Selector-Based Element Isolation](https://awesome-repositories.com/f/business-productivity-software/tag-filtering-systems/html-element-filters/selector-based-element-isolation.md) — Filters HTML documents to find specific elements based on tags, IDs, classes, and attributes. ([source](https://github.com/ericchiang/pup#readme))

### User Interface & Experience

- [DOM Element Filtering](https://awesome-repositories.com/f/user-interface-experience/css-selectors/dom-element-filtering.md) — Implements a CSS selector engine to isolate specific HTML elements within a document.
- [PDF and HTML Content Extraction](https://awesome-repositories.com/f/user-interface-experience/html-content-processing/pdf-and-html-content-extraction.md) — Extracts all text content from selected HTML nodes and their children. ([source](https://github.com/ericchiang/pup/blob/master/README.md))
- [Markup Cleaning](https://awesome-repositories.com/f/user-interface-experience/html-content-processing/markup-cleaning.md) — Includes a utility to fix missing tags and apply consistent indentation to HTML documents. ([source](https://github.com/ericchiang/pup/blob/master/README.md))

### Web Development

- [Web Scraping](https://awesome-repositories.com/f/web-development/web-scraping.md) — Provides utilities for extracting structured data from websites and online sources using CSS selectors.
- [HTML Data Attribute Extraction](https://awesome-repositories.com/f/web-development/html-data-attribute-extraction.md) — Retrieves the values of specific attribute keys from all selected HTML nodes. ([source](https://github.com/ericchiang/pup/blob/master/README.md))
- [Terminal Text Formatting](https://awesome-repositories.com/f/web-development/renderer-output-customizers/renderer-output-customizers/terminal-text-formatting.md) — Applies indentation and character escaping to make HTML output readable in a terminal.

### Data & Databases

- [HTML to Text JSON Converters](https://awesome-repositories.com/f/data-databases/text-to-json-converters/html-to-text-json-converters.md) — Converts selected HTML nodes into plain text, attribute values, or structured objects. ([source](https://github.com/ericchiang/pup#readme))

### DevOps & Infrastructure

- [HTML Formatting](https://awesome-repositories.com/f/devops-infrastructure/configuration-management/application-settings-management/formatting-rule-definitions/html-formatting.md) — Controls the visual presentation of HTML data through indentation and colorization. ([source](https://github.com/ericchiang/pup#readme))

### Security & Cryptography

- [Noise Reduction Filtering](https://awesome-repositories.com/f/security-cryptography/application-and-system-security/browser-security/content-filtering-blocking/content-filtering/html-content-filters/noise-reduction-filtering.md) — Isolates specific HTML nodes based on tags and classes to filter out noise from webpages.
