Jsoup | Awesome Repository

Jsoup is a Java library for parsing, extracting, and manipulating HTML and XML content. It represents a document as a hierarchical tree of nodes, so elements, attributes, and text can be navigated and modified programmatically. The library also serves as a web scraping toolkit: it can fetch remote content over HTTP and HTTPS and maintain HTTP sessions for automated form interaction.
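As a minimal sketch of the parse, navigate, and modify cycle described above, the following example parses an HTML string, reads the title and body text, and rewrites a link. The class name ParseAndModify, the helper retarget, and the sample markup are hypothetical and exist only for illustration; the jsoup calls used are Jsoup.parse, Document.title, Element.selectFirst, Element.attr, and Element.text.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseAndModify {

    // Hypothetical helper: parses the markup and points the first link
    // that has an href attribute at a new target.
    static Document retarget(String html, String newHref) {
        Document doc = Jsoup.parse(html);            // build the node tree
        Element link = doc.selectFirst("a[href]");   // null if no such element
        if (link != null) {
            link.attr("href", newHref);              // modify an attribute in place
        }
        return doc;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head>"
                + "<body><p>See <a href='/old'>the docs</a></p></body></html>";
        Document doc = retarget(html, "https://example.org/docs");
        System.out.println(doc.title());                        // document title
        System.out.println(doc.body().text());                  // visible text of the body
        System.out.println(doc.selectFirst("a").attr("href"));  // updated link target
    }
}
```

The example works on an in-memory string so that it runs without network access. Fetching a live page would go through Jsoup.connect(url).get() instead of Jsoup.parse.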

The library distinguishes itself through its fault-tolerant parsing, which reconstructs a valid document structure from malformed or non-standard markup. It uses CSS-style selector syntax to query and traverse the document tree, which makes it straightforward to locate specific nodes. It also includes a sanitizer that filters untrusted HTML against a configurable safelist, preventing cross-site scripting while preserving safe content.
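The three behaviours above can be sketched together in a short example: parsing markup with unclosed tags, querying the result with a CSS selector, and sanitizing untrusted input. The class and method names (SelectAndClean, paragraphs, sanitize) and the sample strings are hypothetical. The sketch assumes a jsoup version that provides org.jsoup.safety.Safelist (older releases named this class Whitelist) and uses the predefined Safelist.basic() profile rather than a custom one.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;
import org.jsoup.select.Elements;

public class SelectAndClean {

    // Parses possibly malformed markup and returns every <p> element.
    static Elements paragraphs(String html) {
        return Jsoup.parse(html).select("p");
    }

    // Removes anything not permitted by the basic safelist
    // (simple text formatting tags; no scripts or event handlers).
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        // Neither <p> is closed; the parser still yields two paragraphs.
        Elements ps = paragraphs("<p>One<p>Two");
        System.out.println(ps.size());
        System.out.println(ps.get(1).text());

        // The script element and the onclick attribute are dropped.
        String dirty = "<p onclick=\"steal()\">Hi <b>bold</b>"
                + "<script>alert(1)</script></p>";
        System.out.println(sanitize(dirty));
    }
}
```

Safelist also offers other predefined profiles, such as Safelist.none() and Safelist.relaxed(), and can be extended with calls such as addTags when the defaults are too strict or too loose.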

The project covers a broad range of document processing tasks, including incremental stream parsing for memory-efficient handling of large files and serialization to formatted HTML or plain text. Parser behaviour is configurable, for example whether tag and attribute names keep their original case, so that output can match the requirements of a given standard or document type. To integrate with external tools, the library can convert a parsed document into a W3C DOM document.
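Two of the capabilities mentioned above, serialization settings and conversion to a W3C DOM, can be sketched as follows. The class and method names (ExportExample, compactBody, toW3c) are hypothetical. The sketch assumes the W3CDom helper in org.jsoup.helper and its fromJsoup method, and it leaves stream parsing aside.

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.jsoup.nodes.Document;

public class ExportExample {

    // Serializes the body without pretty-printing, so no indentation
    // or line breaks are added to the output.
    static String compactBody(String html) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().prettyPrint(false);
        return doc.body().html();
    }

    // Converts a jsoup document into an org.w3c.dom.Document that can be
    // passed to XPath, XSLT, or other W3C DOM based tooling.
    static org.w3c.dom.Document toW3c(String html) {
        Document doc = Jsoup.parse(html);
        return new W3CDom().fromJsoup(doc);
    }

    public static void main(String[] args) {
        String html = "<p>a</p><p>b</p>";
        System.out.println(compactBody(html));

        org.w3c.dom.Document w3c = toW3c(html);
        System.out.println(w3c.getDocumentElement().getNodeName());
        System.out.println(w3c.getElementsByTagName("p").getLength());
    }
}
```

The fully qualified org.w3c.dom.Document avoids a name clash with org.jsoup.nodes.Document, which is a common stumbling block when both models appear in one class.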

Features

HTML Document Transformation - Converts raw HTML strings and streams into a structured document object model.
HTML Parsers - Provides a comprehensive library for parsing, extracting, and manipulating HTML content using DOM traversal and CSS selectors.
HTML Allowlists - Cleans untrusted HTML content against a configurable allowlist to prevent cross-site scripting vulnerabilities.
In-Memory DOM Representations - Models web content as a hierarchical tree of nodes to enable programmatic navigation and manipulation.