htmlq is a suite of command-line utilities for querying and extracting data from HTML documents using CSS selectors. It acts as a query tool for HTML elements and attributes, letting you retrieve specific information from a document directly in the terminal.
The tool provides capabilities for extracting text content, specific HTML attributes, and document fragments. It includes an HTML document formatter for cleaning and reformatting output with consistent indentation, as well as utilities for stripping tags to isolate plain text.
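To make the ideas of attribute extraction and tag stripping concrete, here is a minimal sketch using Python's standard `html.parser` module. It is not htmlq's implementation and does not use CSS selectors. The class name `LinkAndTextExtractor` and the function `extract` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Illustrative only: collects href values from <a> tags and all text data."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.hrefs = []        # values of href attributes found on <a> tags
        self.text_parts = []   # raw text fragments, in document order

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


def extract(html):
    """Return (list of href values, plain text with tags stripped)."""
    parser = LinkAndTextExtractor()
    parser.feed(html)
    parser.close()
    # Join the fragments as-is, then collapse runs of whitespace to single spaces.
    text = " ".join("".join(parser.text_parts).split())
    return parser.hrefs, text


if __name__ == "__main__":
    sample = '<p>See <a href="https://example.com/docs">the docs</a> for details.</p>'
    hrefs, text = extract(sample)
    print(hrefs)
    print(text)
```

The sketch returns the attribute values and the stripped text separately, mirroring the distinction the description draws between attribute extraction and plain-text extraction.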
The software processes HTML structure through stream-based parsing, recursive tree traversal, and node filtering, which removes unwanted elements before the final extraction step. These capabilities support automated document analysis and data collection for web scraping.
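The combination of stream-based parsing and node filtering can be sketched as follows, again with Python's standard `html.parser` as a stand-in rather than htmlq's own code. The names `FilteringTextExtractor`, `extract_text`, and `skip_tags` are hypothetical, and the sketch assumes the filtered elements always have closing tags. It covers only streaming and filtering, not recursive tree traversal.

```python
from html.parser import HTMLParser


class FilteringTextExtractor(HTMLParser):
    """Illustrative only: collects text while ignoring selected elements."""

    def __init__(self, skip_tags=("script", "style")):
        super().__init__(convert_charrefs=True)
        self.skip_tags = frozenset(skip_tags)
        self.skip_depth = 0    # greater than 0 while inside a filtered element
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside every filtered element.
        if self.skip_depth == 0:
            self.text_parts.append(data)


def extract_text(chunks, skip_tags=("script", "style")):
    """Feed the document piece by piece and return the filtered plain text."""
    parser = FilteringTextExtractor(skip_tags)
    for chunk in chunks:
        parser.feed(chunk)   # the parser keeps its state between chunks
    parser.close()
    # Fragments are joined as-is so a word split across chunks stays whole;
    # whitespace runs are then collapsed to single spaces.
    return " ".join("".join(parser.text_parts).split())


if __name__ == "__main__":
    chunks = [
        "<html><body><h1>Ti",
        "tle</h1>\n<script>var x = 1;</script>\n",
        "<p>Body text</p></body></html>",
    ]
    print(extract_text(chunks))
```

Because the parser is fed incrementally, the whole document never has to be held in memory at once, and the depth counter lets filtered elements be nested without text leaking through. Text from adjacent elements with no whitespace between them is concatenated directly, which a fuller implementation might handle differently.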