# mgdm/htmlq

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/mgdm-htmlq).**

7,552 stars · 129 forks · Rust · MIT

## Links

- GitHub: https://github.com/mgdm/htmlq
- awesome-repositories: https://awesome-repositories.com/repository/mgdm-htmlq.md

## Description

htmlq is a command-line tool for querying HTML documents and extracting data from them with CSS selectors. It works much like jq does for JSON: HTML is read from a file or standard input, and the elements that match the selector are written to standard output.

The tool can print matching elements as HTML fragments, print only their text content with the tags stripped (`--text`), or print the value of a single attribute (`--attribute`). It can also pretty-print its output with consistent indentation (`--pretty`).
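To make the text and attribute extraction modes concrete, here is a minimal sketch of the same idea using Python's standard `html.parser` module. It is not htmlq's implementation (htmlq is written in Rust), and it matches elements by tag name only, as a stand-in for a real CSS selector engine. The names `TagExtractor` and `extract` are hypothetical.

```python
from html.parser import HTMLParser


class TagExtractor(HTMLParser):
    """Collect one attribute's values and the text of every element with a given tag.

    Assumption: matching is by tag name only, not by full CSS selector.
    """

    def __init__(self, tag, attribute):
        super().__init__()
        self.tag = tag
        self.attribute = attribute
        self.attribute_values = []  # one entry per matching element that has the attribute
        self.texts = []             # one entry per outermost matching element
        self._depth = 0             # greater than 0 while inside a matching element
        self._buffer = []           # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self._depth == 0:
                self._buffer = []
            self._depth += 1
            value = dict(attrs).get(self.attribute)
            if value is not None:
                self.attribute_values.append(value)

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.texts.append("".join(self._buffer).strip())

    def handle_data(self, data):
        # Text inside nested tags (for example <b>) is kept; the tags are dropped.
        if self._depth > 0:
            self._buffer.append(data)


def extract(html, tag, attribute):
    """Return (attribute values, text contents) for all elements named `tag`."""
    parser = TagExtractor(tag, attribute)
    parser.feed(html)
    parser.close()
    return parser.attribute_values, parser.texts


if __name__ == "__main__":
    page = '<p>Intro</p><a href="/docs">Read <b>the</b> docs</a><a href="/blog">Blog</a>'
    links, texts = extract(page, "a", "href")
    print(links)
    print(texts)
```

The two returned lists correspond roughly to what the attribute and text output modes would print for a selector that matches every `a` element.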

Internally, the input is parsed as a stream into a document tree, which is traversed recursively to find the nodes that match the selector. Unwanted elements can be filtered out of the tree before the final extraction. These capabilities support automated document analysis and web scraping.
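The parse, filter, and traverse steps described above can be sketched with Python's standard library. This is an illustration under stated assumptions rather than htmlq's actual code: the `Node`, `TreeBuilder`, `parse`, `remove_nodes`, and `text_of` names are hypothetical, unwanted nodes are identified by tag name instead of by CSS selector, and the tree builder does not implement the full HTML5 parsing rules.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A tree node; children are Node instances or plain text strings."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Turn the parser's event stream into a tree rooted at a synthetic node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index].tag == tag:
                del self._stack[index:]
                break

    def handle_data(self, data):
        self._stack[-1].children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def remove_nodes(node, unwanted_tags):
    """Recursively drop every descendant element whose tag is unwanted."""
    kept = []
    for child in node.children:
        if isinstance(child, Node):
            if child.tag in unwanted_tags:
                continue
            remove_nodes(child, unwanted_tags)
        kept.append(child)
    node.children = kept


def text_of(node):
    """Recursively concatenate the text found under a node."""
    return "".join(
        text_of(child) if isinstance(child, Node) else child
        for child in node.children
    )


if __name__ == "__main__":
    page = ("<div><style>p{color:red}</style>"
            "<p>Hello <b>world</b></p><script>track()</script></div>")
    tree = parse(page)
    remove_nodes(tree, {"script", "style"})
    print(text_of(tree))
```

Filtering before extraction matters here because the text of `script` and `style` elements would otherwise be mixed into the extracted output.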

## Tags

### Software Engineering & Architecture

- [CSS Selector Data Extractors](https://awesome-repositories.com/f/software-engineering-architecture/syntax-query-definitions/css-selector-engines/css-selector-data-extractors.md) — Extracts specific elements and text from HTML documents using precise CSS selector patterns. ([source](https://github.com/mgdm/htmlq#readme))
- [CSS Selector Engines](https://awesome-repositories.com/f/software-engineering-architecture/syntax-query-definitions/css-selector-engines.md) — Implements a CSS selector engine for querying and isolating elements within HTML document structures.
- [DOM Node Manipulators](https://awesome-repositories.com/f/software-engineering-architecture/trees/tree-node-reorderers/dom-node-manipulators.md) — Provides utilities to modify the DOM tree by removing unwanted nodes before final data extraction.

### Content Management & Publishing

- [Plain Text Converters](https://awesome-repositories.com/f/content-management-publishing/markdown-to-rich-text-parsers/plain-text-converters.md) — Strips HTML tags and formatting to extract clean, plain-text content from selected elements.

### Data & Databases

- [Text Extraction](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-extraction.md) — Isolates and retrieves plain text content by stripping all HTML tags from elements.
- [Automated HTML Document Analysis](https://awesome-repositories.com/f/data-databases/automated-html-document-analysis.md) — Enables programmatic extraction of specific information from large sets of HTML files using CSS patterns.
- [Web Data Extraction](https://awesome-repositories.com/f/data-databases/web-data-extraction.md) — Supports automated web scraping by extracting specific text and attributes using CSS selectors.

### Development Tools & Productivity

- [Command Line HTML Parsers](https://awesome-repositories.com/f/development-tools-productivity/command-line-html-parsers.md) — Provides a command-line interface for parsing and querying HTML files without needing external scripts.

### Web Development

- [Data Extractions](https://awesome-repositories.com/f/web-development/dom-element-selectors/data-extractions.md) — Targets and retrieves specific HTML fragments and text from files using CSS selectors. ([source](https://github.com/mgdm/htmlq/blob/master/README.md))
- [HTML Data Attribute Extraction](https://awesome-repositories.com/f/web-development/html-data-attribute-extraction.md) — Retrieves the values of specific HTML attributes from elements matching a CSS selector. ([source](https://github.com/mgdm/htmlq/blob/master/README.md))
- [DOM Query Languages](https://awesome-repositories.com/f/web-development/html-dom-manipulators/dom-query-languages.md) — Implements a query language for navigating and retrieving data from HTML document trees.
- [Web Page Content Cleaning](https://awesome-repositories.com/f/web-development/web-page-content-cleaning.md) — Cleans web pages by removing non-essential elements to isolate core text content.

### Artificial Intelligence & ML

- [Extraction Element Filters](https://awesome-repositories.com/f/artificial-intelligence-ml/metadata-extraction/element/extraction-element-filters.md) — Filters and removes unwanted HTML elements from the document structure to clean output. ([source](https://github.com/mgdm/htmlq#readme))

### Part of an Awesome List

- [Stream-Based Parsing](https://awesome-repositories.com/f/awesome-lists/data/html-and-xml-parsing/xml-parsing/stream-based-parsing.md) — Uses stream-based parsing to process HTML input efficiently into a traversable tree structure.
- [HTML and XML Processing](https://awesome-repositories.com/f/awesome-lists/devtools/html-and-xml-processing.md) — Query HTML documents using CSS selectors.

### DevOps & Infrastructure

- [HTML Formatting](https://awesome-repositories.com/f/devops-infrastructure/configuration-management/application-settings-management/formatting-rule-definitions/html-formatting.md) — Formats HTML output with consistent indentation and structural cleaning. ([source](https://github.com/mgdm/htmlq/blob/master/README.md))

### Scientific & Mathematical Computing

- [DOM Tree Traversers](https://awesome-repositories.com/f/scientific-mathematical-computing/recursive-tree-traversal-algorithms/dom-tree-traversers.md) — Employs recursive tree traversal to navigate nested HTML hierarchies and isolate matching nodes.
