parse5 is a WHATWG HTML parser and serializer for Node.js. It parses HTML strings into a document tree and serializes such trees back into valid HTML strings, following the algorithms defined by the HTML Living Standard. The project also functions as a streaming HTML processor, using incremental parsing to handle large documents in chunks. It includes an HTML5-compliant tokenizer that uses a state-machine approach to break input into tokens according to the specification. The toolset covers HTML document parsing, serialization, and real-time rewriting via streams. These capabiliti
SwiftSoup is a cross-platform HTML processing library for Swift that converts raw HTML or XML strings and files into a structured document object model. It provides the core infrastructure to parse web content into a traversable tree, enabling programmatic access to page elements across iOS, macOS, and Linux. The library features a CSS selector engine for data extraction and a whitelist-based sanitization system to remove unsafe tags and attributes from user-submitted content. It optimizes repetitive document queries through memoized query caching. The project covers DOM manipulation for upd
Nokogiri is an XML and HTML parsing library that builds navigable document trees from strings, files, or URLs using native C parsers for speed and standards compliance. It provides a CSS selector engine that translates CSS3 selectors into XPath expressions for querying nodes, an XPath query interface with namespace support, a document manipulation toolkit for modifying parsed documents, XSD schema validation, and XSLT transformation capabilities. The library wraps libxml2 and libxslt C libraries with Ruby bindings for high-performance parsing, and integrates Google's Gumbo parser for standard
jsdom is a Node.js implementation of web standards that functions as a headless browser emulator. It provides a JavaScript execution environment and an HTML and XML parser to simulate a browser environment on the server side, implementing various web APIs and W3C standards. The project distinguishes itself by providing a sandboxed runtime for executing scripts embedded in HTML or external files. It includes specialized polyfills for the Canvas API and manages session state through HTTP cookie management. Its broader capabilities cover network interaction via request interception and resource
htmlparser2 is a collection of tools for high-performance markup parsing, DOM manipulation, and incremental stream processing. It functions as an HTML and XML parser that converts markup strings into structured object trees, alongside a streaming markup parser designed for memory-efficient processing of large documents.
The main features of fb55/htmlparser2 are: HTML and XML Parsing, DOM Tree Construction, DOM Manipulation, Streaming Parsers, XML and HTML Document Parsers, State-Machine Parsers, Stream-Based Text Processing, Node Querying.
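The streaming, event-driven parsing and DOM tree construction described above can be sketched in a few lines. This is not htmlparser2's API: it is a minimal illustration of the same technique using Python's standard-library html.parser module, and the names TreeBuilder, parse_chunks, and VOID_ELEMENTS are hypothetical helpers introduced only for this example. Input arrives in arbitrary chunks, the parser emits start-tag, end-tag, and text events, and a stack turns those events into a nested tree.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, assumed for this sketch).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class TreeBuilder(HTMLParser):
    """Builds a nested dict tree from parser events (illustrative only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = {"tag": "#document", "attrs": {}, "children": []}
        self._stack = [self.root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self._stack[-1]["children"].append(node)
        if tag not in VOID_ELEMENTS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i]["tag"] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        children = self._stack[-1]["children"]
        # Text may arrive split across chunks, so merge adjacent text nodes.
        if children and isinstance(children[-1], str):
            children[-1] += data
        else:
            children.append(data)


def parse_chunks(chunks):
    """Feed markup incrementally and return the root of the resulting tree."""
    builder = TreeBuilder()
    for chunk in chunks:
        builder.feed(chunk)  # incomplete tags are buffered until more input arrives
    builder.close()
    return builder.root
```

Calling parse_chunks with a document split at arbitrary points, even in the middle of a tag or a text run, should yield the same tree as parsing the whole string at once. That property is what makes a streaming parser suitable for large documents. A production parser would also handle comments, error recovery, and implied end tags, all of which this sketch omits.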
Open-source alternatives to fb55/htmlparser2 include:
inikulin/parse5: parse5 is a WHATWG HTML parser and serializer for Node.js. It transforms HTML strings into a document object model and…
scinfu/swiftsoup: SwiftSoup is a cross-platform HTML processing library for Swift that converts raw HTML or XML strings and files into a…
sparklemotion/nokogiri: Nokogiri is an XML and HTML parsing library that builds navigable document trees from strings, files, or URLs using…
tmpvar/jsdom: jsdom is a Node.js implementation of web standards that functions as a headless browser emulator. It provides a…
martinblech/xmltodict: xmltodict is a Python library that provides bidirectional serialization between XML documents and dictionaries. It…
symfony/dom-crawler: This project is an HTML and XML DOM parser designed for loading and navigating the structure of web documents to…