Pup is a command-line tool for extracting and filtering data from HTML documents using CSS selectors. It acts as a parser and selector engine that isolates specific elements by tag, ID, class, or attribute.
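To make the idea of selecting by tag, ID, or class concrete, here is a minimal sketch in Python using only the standard library. It is not pup's implementation; the names `SimpleSelector` and `select` are hypothetical, and only a single simple selector (`tag`, `#id`, or `.class`) is supported.

```python
from html.parser import HTMLParser


class SimpleSelector(HTMLParser):
    """Collect start tags matching one simple selector: tag, #id, or .class.

    Hypothetical helper for illustration only; a real selector engine
    also handles attribute selectors, combinators, and pseudo-classes.
    """

    def __init__(self, selector):
        super().__init__()
        self.selector = selector
        self.matches = []  # list of (tag, attrs dict) in document order

    def _is_match(self, tag, attrs):
        sel = self.selector
        if sel.startswith("#"):
            return attrs.get("id") == sel[1:]
        if sel.startswith("."):
            # class may hold several space-separated names, or be valueless
            return sel[1:] in (attrs.get("class") or "").split()
        return tag == sel

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._is_match(tag, attrs):
            self.matches.append((tag, attrs))


def select(html, selector):
    """Return (tag, attrs) pairs for every element matching the selector."""
    parser = SimpleSelector(selector)
    parser.feed(html)
    parser.close()
    return parser.matches
```

Calling `select(doc, "a")` would return every anchor element in `doc`, while `select(doc, ".nav")` would return only the elements whose class list contains `nav`.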
The project provides utilities for converting selected HTML nodes into plain text, attribute values, or structured JSON objects. It includes a markup formatter that corrects missing tags and applies consistent indentation to improve the readability of HTML documents.
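The three output forms mentioned above (plain text, attribute values, and JSON) can be sketched roughly as follows. This is an illustration under stated assumptions, not pup's code: `LinkExtractor` and `extract_links` are hypothetical names, the sketch only looks at `<a>` elements, and the JSON shape shown is an arbitrary choice rather than pup's actual output format.

```python
import json
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Record each <a> element with its attributes and text content."""

    def __init__(self):
        super().__init__()
        self.nodes = []       # finished elements
        self._current = None  # the <a> element currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"tag": tag, "attrs": dict(attrs), "text": ""}

    def handle_data(self, data):
        # Text inside nested inline tags is still attributed to the open <a>.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.nodes.append(self._current)
            self._current = None


def extract_links(html):
    """Return a list of dicts describing every <a> element in the input."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.nodes


def as_text(nodes):
    """Plain-text view: the stripped text content of each node."""
    return [n["text"].strip() for n in nodes]


def as_attr(nodes, name):
    """Attribute view: the value of one attribute per node (None if absent)."""
    return [n["attrs"].get(name) for n in nodes]


def as_json(nodes):
    """Structured view: the nodes serialized as a JSON array."""
    return json.dumps(nodes, indent=1)
```

The same list of selected nodes feeds all three views, which mirrors the idea that selection and output formatting are separate steps.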
Text content and attributes are retrieved through the CSS selector engine, which supports compound selectors and combinators for targeting nested elements. Pup also handles character encoding, either by detecting it automatically or by using a specified charset, so that extracted text renders correctly.
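The difference between a specified charset and automatic detection can be sketched as below. The detection strategy here (byte-order mark, then a `<meta>` charset declaration, then a UTF-8 fallback) is an assumption chosen for illustration; pup's actual detection logic is not described in this text, and `decode_html` is a hypothetical name.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_html(raw, charset=None):
    """Decode raw HTML bytes using an explicit charset or a simple guess."""
    if charset is None:
        if raw.startswith(b"\xef\xbb\xbf"):
            # UTF-8 byte-order mark; "utf-8-sig" strips it while decoding
            charset = "utf-8-sig"
        else:
            # Assumption: the declaration appears near the top of the document
            match = META_CHARSET.search(raw[:1024])
            charset = match.group(1).decode("ascii") if match else "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name: fall back to UTF-8 rather than failing
        return raw.decode("utf-8", errors="replace")
```

Passing `charset` explicitly skips detection entirely, which is the safer option when the encoding is known and the document declares it incorrectly or not at all.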