# jhy/jsoup

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/jhy-jsoup).**

11,340 stars · 2,278 forks · Java · MIT

## Links

- GitHub: https://github.com/jhy/jsoup
- Homepage: https://jsoup.org
- awesome-repositories: https://awesome-repositories.com/repository/jhy-jsoup.md

## Topics

`css` `css-selectors` `dom` `html` `java` `java-html-parser` `jsoup` `parser` `xml` `xpath`

## Description

jsoup is a Java library for parsing, extracting, and manipulating HTML and XML. It represents a document as a hierarchical tree of elements, attributes, and text that can be navigated and modified programmatically. It also works as a web-scraping toolkit: it can fetch remote content over HTTP and manage HTTP sessions for automated form interaction.
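
As a rough illustration of the parse, navigate, and modify workflow described above, here is a minimal sketch. It assumes jsoup is on the classpath (Maven coordinates `org.jsoup:jsoup`); the class name, method names, sample markup, and user-agent string are hypothetical and chosen only for this example.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseAndModify {

    // Parse an HTML string, rewrite the first <h1>, and return the body markup.
    static String retitle(String html, String newHeading) {
        Document doc = Jsoup.parse(html);
        Element heading = doc.selectFirst("h1"); // null when no <h1> exists
        if (heading != null) {
            heading.text(newHeading);   // replace the element's text
            heading.addClass("title");  // add a class attribute
        }
        return doc.body().html();
    }

    // Fetch a remote page over HTTP and return its <title>.
    // Not called in main, because it needs network access.
    static String fetchTitle(String url) throws IOException {
        Document doc = Jsoup.connect(url)
                .userAgent("example-bot/1.0") // hypothetical user agent
                .timeout(10_000)              // milliseconds
                .get();
        return doc.title();
    }

    public static void main(String[] args) {
        System.out.println(retitle("<h1>Old</h1><p>Body text</p>", "New"));
    }
}
```

The exact whitespace of the serialized output depends on the document's output settings, so treat the printed markup as indicative rather than byte-exact.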

The library stands out for its fault-tolerant tokenization, which reconstructs a valid document structure from malformed or non-standard markup. It uses CSS-style selector syntax to query and traverse the document tree. It also includes a security utility that filters untrusted HTML against a configurable safelist, preventing cross-site scripting while preserving safe content.
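
The following sketch shows how selector queries and safelist cleaning might look in practice. It assumes jsoup 1.14.2 or later (where the `Safelist` class is available) on the classpath; the class name, method names, and sample inputs are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class SelectAndClean {

    // Collect the href of every link, even when closing tags are omitted.
    static List<String> linkTargets(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("a[href]")) { // CSS attribute selector
            hrefs.add(link.attr("href"));
        }
        return hrefs;
    }

    // Keep only the tags and attributes allowed by the basic safelist.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // The <li> elements and the <ul> are never closed in this input.
        String sloppy = "<ul><li><a href=\"/a\">first</a><li><a href=\"/b\">second</a>";
        System.out.println(linkTargets(sloppy));

        System.out.println(sanitize("<p>Hi <script>alert(1)</script><b>there</b></p>"));
    }
}
```

`Safelist.basic()` is one of several predefined safelists; a stricter or more permissive policy can be chosen depending on how much formatting the application needs to preserve.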

The project also supports incremental stream parsing for memory-efficient handling of large files, and serialization to formatted HTML or plain text. Parser behavior is configurable, so parsing can be matched to specific standards and document requirements. To integrate with external tools, parsed documents can be converted into W3C DOM documents.
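
Two of these capabilities can be sketched briefly: parsing with the XML parser instead of the HTML one, and converting a parsed document to a W3C DOM document. The example assumes jsoup on the classpath; the class name, method names, and sample data are hypothetical, and stream parsing is not shown.

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class XmlAndW3c {

    // Parse input with XML rules and read the first <item><name> value.
    static String firstItemName(String xml) {
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        Element name = doc.selectFirst("item > name");
        return name == null ? null : name.text();
    }

    // Convert a jsoup document into an org.w3c.dom.Document for other tools.
    static org.w3c.dom.Document toW3c(String html) {
        Document doc = Jsoup.parse(html);
        return new W3CDom().fromJsoup(doc);
    }

    public static void main(String[] args) {
        String xml = "<items><item><name>Alpha</name></item>"
                + "<item><name>Beta</name></item></items>";
        System.out.println(firstItemName(xml));

        org.w3c.dom.Document w3c = toW3c("<p>Hello</p>");
        System.out.println(w3c.getElementsByTagName("p").getLength());
    }
}
```

The resulting `org.w3c.dom.Document` can then be handed to standard Java XML tooling such as XPath evaluators or XSLT transformers.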

## Tags

### Content Management & Publishing

- [HTML Document Transformation](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing-tools/markup-and-structure-parsers/html-document-transformation.md) — Converts raw HTML strings and streams into a structured document object model. ([source](https://jsoup.org/apidocs/))
- [Hierarchical Document Models](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/hierarchical-document-models.md) — Represents web content as a hierarchical tree of elements, text, and comments for programmatic access. ([source](https://jsoup.org/apidocs/org/jsoup/nodes/package-summary))

### Graphics & Multimedia

- [HTML Parsers](https://awesome-repositories.com/f/graphics-multimedia/media-production-suites/media-management-production/media-management-systems/data-parsing-conversion/html-parsers.md) — Provides a comprehensive library for parsing, extracting, and manipulating HTML content using DOM traversal and CSS selectors.

### Security & Cryptography

- [HTML Allowlists](https://awesome-repositories.com/f/security-cryptography/security/utilities/allowlist-management/html-allowlists.md) — Cleans untrusted HTML content against a strict allow-list to prevent security vulnerabilities.
- [HTML Content Filters](https://awesome-repositories.com/f/security-cryptography/application-and-system-security/browser-security/content-filtering-blocking/content-filtering/html-content-filters.md) — Filters untrusted HTML against a configurable safelist to prevent cross-site scripting while preserving safe content.

### User Interface & Experience

- [In-Memory DOM Representations](https://awesome-repositories.com/f/user-interface-experience/dom-manipulation-libraries/in-memory-dom-representations.md) — Models web content as a hierarchical tree of nodes to enable programmatic navigation and manipulation.
- [HTML Content Processing](https://awesome-repositories.com/f/user-interface-experience/html-content-processing/html-content-processing.md) — Parses and integrates raw HTML strings into structured document models for programmatic access.
- [CSS Selectors](https://awesome-repositories.com/f/user-interface-experience/css-selectors.md) — Uses CSS-style selector syntax to efficiently locate and traverse specific nodes within a document tree.

### Web Development

- [Web Scraping](https://awesome-repositories.com/f/web-development/web-scraping.md) — Provides robust utilities for extracting structured data from websites and online sources.
- [Form Submission Clients](https://awesome-repositories.com/f/web-development/form-submission-clients.md) — Automates web form interaction by extracting fields and managing session state for data entry.
- [Form Processing](https://awesome-repositories.com/f/web-development/form-processing.md) — Extracts form fields and controls to simplify automated data retrieval and submission. ([source](https://jsoup.org/apidocs/org/jsoup/nodes/package-summary))
- [Element Attributes](https://awesome-repositories.com/f/web-development/element-attributes.md) — Provides methods to inspect, modify, and extract attributes from HTML elements. ([source](https://jsoup.org/apidocs/org/jsoup/nodes/package-summary))
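
The form-related entries above can be illustrated with a small sketch that reads the submittable fields of the first form in a page. It assumes jsoup on the classpath; the class name, method name, base URI, and sample form are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.FormElement;

public class FormFields {

    // Return the name/value pairs the first form in the page would submit.
    static Map<String, String> fields(String html) {
        // The base URI is a placeholder used to resolve relative form actions.
        Document doc = Jsoup.parse(html, "https://example.com/");
        List<FormElement> forms = doc.select("form").forms();
        Map<String, String> out = new LinkedHashMap<>();
        if (forms.isEmpty()) {
            return out; // no form on the page
        }
        for (Connection.KeyVal field : forms.get(0).formData()) {
            out.put(field.key(), field.value());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<form action=\"/login\" method=\"post\">"
                + "<input type=\"text\" name=\"user\" value=\"alice\">"
                + "<input type=\"hidden\" name=\"token\" value=\"abc\">"
                + "</form>";
        System.out.println(fields(html));
    }
}
```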

### Development Tools & Productivity

- [Remote Content Fetchers](https://awesome-repositories.com/f/development-tools-productivity/remote-repository-managers/remote-content-fetchers.md) — Retrieves and processes remote web content via standard protocols for scraping and data extraction tasks. ([source](https://jsoup.org/apidocs/org/jsoup/helper/package-summary))
- [Web Scraping](https://awesome-repositories.com/f/development-tools-productivity/web-scraping.md) — Offers a toolkit for fetching remote web content, managing HTTP sessions, and cleaning untrusted HTML input.

### Software Engineering & Architecture

- [Document Object Models](https://awesome-repositories.com/f/software-engineering-architecture/document-object-models.md) — Models web content as a hierarchical tree of nodes to enable programmatic navigation and structural modification.
- [CSS Selector Engines](https://awesome-repositories.com/f/software-engineering-architecture/syntax-query-definitions/css-selector-engines.md) — Locates specific nodes within a document structure using CSS-style selector syntax. ([source](https://jsoup.org/apidocs/org/jsoup/select/package-summary))
- [Fault-Tolerant Architectures](https://awesome-repositories.com/f/software-engineering-architecture/fault-tolerant-architectures.md) — Implements fault-tolerant tokenization to reconstruct valid document structures from malformed or non-standard markup.

### Data & Databases

- [XML Parsers](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-serialization/xml-parsers.md) — Processes XML input using specific rules to ensure accurate structure for non-HTML data formats. ([source](https://jsoup.org/apidocs/org/jsoup/parser/package-summary))
- [Tree Traversal Engines](https://awesome-repositories.com/f/data-databases/tree-traversal-engines.md) — Enables recursive navigation and inspection of hierarchical document structures. ([source](https://jsoup.org/apidocs/org/jsoup/nodes/package-summary))
- [Document Parsing Engines](https://awesome-repositories.com/f/data-databases/document-parsing-engines.md) — Processes input incrementally to build document structures efficiently without loading entire files into memory. ([source](https://jsoup.org/apidocs/org/jsoup/parser/package-summary))
- [Incremental Data Streaming](https://awesome-repositories.com/f/data-databases/incremental-data-streaming.md) — Supports incremental stream parsing to handle large files with a reduced memory footprint.