30 open-source projects similar to tmpvar/jsdom, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best jsdom alternative.
jsdom is a Node.js DOM implementation that functions as a headless browser emulator and virtual browser environment. It provides a pure JavaScript implementation of web standards, acting as a web standards polyfill that simulates the window and document objects within a non-browser runtime. The project implements W3C and WHATWG specifications to provide a programmatic environment for parsing HTML and manipulating content. It serves as an HTML parser and serializer, allowing for the transformation of HTML strings into document structures and the export of those structures back into text. The
This project is an HTML and XML DOM parser designed for loading and navigating the structure of web documents to extract specific data points. It functions as a web scraping utility that provides a system for locating precise elements using a CSS and XPath selector engine. The library includes a URI resolver that converts relative links found in documents into absolute addresses using a base URI. It provides a set of tools for retrieving text, attributes, and media sources from parsed content. The toolset covers document hierarchy traversal, selector-based filtering, and text extraction with
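The base-URI resolution described in this entry can be illustrated with a minimal sketch in Python's standard library. It is not this project's own API; the base URI, the link values, and the `absolutize` helper are assumptions made up for the example.

```python
from urllib.parse import urljoin

# Hypothetical base URI of the parsed document (an assumption for this sketch).
BASE_URI = "https://example.com/docs/guide/index.html"


def absolutize(links, base_uri=BASE_URI):
    """Resolve each possibly-relative link against the base URI."""
    return [urljoin(base_uri, link) for link in links]


# Relative, document-relative, root-relative, and already-absolute links.
resolved = absolutize(["../api/", "img/logo.png", "/about", "https://other.example/x"])
```

Links that are already absolute pass through unchanged, so the same helper can be applied to every `href` or `src` value without checking first.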
htmlparser2 is a collection of tools for high-performance markup parsing, DOM manipulation, and incremental stream processing. It functions as an HTML and XML parser that converts markup strings into structured object trees, alongside a streaming markup parser designed for memory-efficient processing of large documents. The project includes a DOM manipulation library for querying, modifying, and serializing document object model trees. It also provides a web feed parser to extract structured metadata and entries from RSS, RDF, and Atom feeds. The library covers broad capabilities in data par
parse5 is a WHATWG HTML parser and serializer for Node.js. It transforms HTML strings into a document object model and converts those trees back into valid HTML strings, following the logic defined by the HTML Living Standard. The project functions as a streaming HTML processor, using incremental parsing to handle large documents in chunks. It includes an HTML5 compliant tokenizer that uses a state-machine approach to break input into tokens according to official web specifications. The toolset covers HTML document parsing, serialization, and real-time rewriting via streams. These capabiliti
Nokogiri is an XML and HTML parsing library that builds navigable document trees from strings, files, or URLs using native C parsers for speed and standards compliance. It provides a CSS selector engine that translates CSS3 selectors into XPath expressions for querying nodes, an XPath query interface with namespace support, a document manipulation toolkit for modifying parsed documents, XSD schema validation, and XSLT transformation capabilities. The library wraps libxml2 and libxslt C libraries with Ruby bindings for high-performance parsing, and integrates Google's Gumbo parser for standard
SwiftSoup is a cross-platform HTML processing library for Swift that converts raw HTML or XML strings and files into a structured document object model. It provides the core infrastructure to parse web content into a traversable tree, enabling programmatic access to page elements across iOS, macOS, and Linux. The library features a CSS selector engine for data extraction and a whitelist-based sanitization system to remove unsafe tags and attributes from user-submitted content. It optimizes repetitive document queries through memoized query caching. The project covers DOM manipulation for upd
Cheerio is an HTML and XML parsing library and server-side DOM implementation. It functions as a markup manipulation tool and CSS selector engine, allowing users to parse, query, and modify HTML or XML documents in non-browser environments. The project provides a DOM-like tree representation of markup strings, enabling programmatic addition, removal, and modification of elements and attributes. It features a prototype-based plugin system that allows the extension of core functionality by adding custom methods to the document prototype. The library covers a broad range of capabilities includi
goquery is a Go HTML parsing library and CSS selector engine used to isolate and retrieve specific text or attributes from HTML documents. It functions as an HTML DOM manipulator that converts raw HTML strings into a structured tree for programmatic navigation and search. The library provides a fluent interface for chaining selection and filtering operations and utilizes a wrapper-based abstraction to simplify data extraction and manipulation of nodes. It employs an iterator-based processing mechanism to apply operations to every node within a matched selection. Its primary capabilities cove
This project is a component testing framework and utility designed for testing React components. It functions as a DOM testing library that allows for the verification of rendered output and component functionality without accessing internal implementation details. The library focuses on behavior driven development by simulating user interactions within a virtual DOM environment. It utilizes implementation-agnostic querying to locate elements via accessible roles and labels, ensuring that the interface is verified from the perspective of the user. The toolset covers frontend integration test
AngleSharp is an HTML5 DOM parser and web scraping framework designed to parse HTML5, SVG, and MathML documents into a W3C compliant document object model. It functions as a programmatic HTML generator and a CSS selector engine for querying and locating specific elements within a DOM. The project provides tools for simulating browser environments to automate web interactions, navigate URLs, and submit forms. It includes a dedicated HTML and CSS minifier to reduce the file size of web assets by removing unnecessary characters. The library supports HTML DOM manipulation and the extraction of s
MechanicalSoup is a Python web automation library designed to simulate browser behavior. It functions as a toolkit for web scraping and automation, providing an HTML parsing engine and an HTTP session manager to interact with websites programmatically. The library enables headless web interaction by mimicking a real user session. It manages persistent state through cookie handling and automatic redirect following, allowing for programmatic website navigation and the simulation of complex browser interactions. Its capabilities cover automated form population and submission using CSS selectors
Mechanize is a Ruby library for web browser automation and headless browser emulation. It allows for programmatically navigating websites and simulating human behavior without a graphical user interface. The library provides an automated interface for populating and submitting web forms, including text fields, checkboxes, and file uploads. It manages stateful sessions by automatically storing and sending cookies across multiple requests to maintain user authentication and identity. Additional capabilities include web data scraping, the ability to download remote web content, and the maintena
Fetch-mock is a testing utility designed to isolate application code from external network dependencies by intercepting and overriding outgoing traffic. It functions as a network request interceptor that captures calls made via the Fetch API, allowing developers to simulate server responses and verify application behavior without requiring a live backend infrastructure. The library distinguishes itself through a unified interface that provides consistent network interception logic across diverse runtime environments, including browsers, service workers, and server-side platforms. By replacing
This project is a reference library and collection of practical code samples for building browser extensions using WebExtensions APIs. It provides implementation guides and functional examples for core extension components, including content scripts, background processes, and browser action popups. The repository focuses on demonstrating specific implementation patterns for browser UI customization and web page manipulation. It includes samples for creating sidebars, context menus, and options pages, as well as techniques for injecting scripts and styles to alter DOM elements and page appeara
This repository contains the HTML specification, which defines the core standards for web page structuring, content organization, and document rendering. It establishes the fundamental algorithms for state-machine-based tokenization, tree construction for the document object model, and origin-based security isolation. The specification provides a framework for defining custom elements with independent lifecycles and registries. It also details the requirements for cross-document communication, session history management, and the synchronization of interface properties with content attributes.
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and pass challenges such as those served by Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
PhantomJS is a scriptable, headless browser engine based on WebKit that provides a programmatic interface for automating web page interactions. It operates without a graphical user interface, allowing for the execution of JavaScript to navigate pages, manipulate the document object model, and perform functional testing of web applications. The tool distinguishes itself by providing low-level control over the browser rendering lifecycle and network stack. It enables real-time interception and modification of network traffic, alongside the ability to generate visual snapshots and document expor
pydoll is a Chrome DevTools Protocol automation library and headless browser controller used for web data extraction and parallel browser automation. It controls Chromium-based browsers via direct WebSocket connections, allowing it to manage isolated browser contexts and tabs while bypassing the overhead and detection associated with WebDriver. The project features an anti-bot evasion framework that mimics natural human behavior, including mouse movements generated via Bezier curves and variable typing patterns. It provides specialized stealth capabilities to bypass behavioral analysis and au
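As a rough illustration of mouse movement along a Bezier curve, the sketch below generates a curved path between two screen points. It is not pydoll's implementation; the function names, the `jitter` parameter, and the choice of a cubic curve with two randomized control points are all assumptions for the example.

```python
import random


def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    x = u**3 * p0[0] + 3 * u**2 * t * p1[0] + 3 * u * t**2 * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u**2 * t * p1[1] + 3 * u * t**2 * p2[1] + t**3 * p3[1]
    return (x, y)


def mouse_path(start, end, steps=20, jitter=80.0, seed=None):
    """Return steps + 1 points along a randomly curved path from start to end."""
    rng = random.Random(seed)
    dx, dy = end[0] - start[0], end[1] - start[1]
    # Hypothetical control points: one third and two thirds of the way along
    # the straight line, each pushed off the line by a random offset.
    c1 = (start[0] + dx / 3 + rng.uniform(-jitter, jitter),
          start[1] + dy / 3 + rng.uniform(-jitter, jitter))
    c2 = (start[0] + 2 * dx / 3 + rng.uniform(-jitter, jitter),
          start[1] + 2 * dy / 3 + rng.uniform(-jitter, jitter))
    return [cubic_bezier(start, c1, c2, end, i / steps) for i in range(steps + 1)]


path = mouse_path((0, 0), (300, 200), steps=20, seed=1)
```

The path always begins at `start` and ends at `end`; only the intermediate points vary with the seed.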
X-ray is a headless browser web scraper and HTML content crawler designed to extract structured data from websites. It functions as a stream-based data scraper and structured data extractor, using selectors to retrieve text and attributes from HTML as nested objects or arrays. The project includes a request rate controller to manage network traffic through concurrency limits, throttles, and timeouts. It handles dynamic website scraping by rendering JavaScript via a headless browser and performs automated website crawling using breadth-first link following and pagination management. The syste
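Breadth-first link following with a page limit can be sketched as below. This is a generic illustration rather than X-ray's API; `crawl_bfs`, the `get_links` callback, and the in-memory `SITE` graph that stands in for real HTTP requests are assumptions for the example.

```python
from collections import deque


def crawl_bfs(start, get_links, max_pages=10):
    """Visit pages in breadth-first order, stopping after max_pages pages."""
    seen = {start}          # URLs already queued, to avoid revisiting
    queue = deque([start])  # FIFO queue gives breadth-first order
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical site: each key is a page, each value the links found on it.
SITE = {"/": ["/a", "/b"], "/a": ["/c", "/"], "/b": ["/c", "/d"], "/c": [], "/d": []}

visited = crawl_bfs("/", lambda url: SITE.get(url, []), max_pages=4)
```

In a real crawler, `get_links` would fetch the page and extract its links, and a throttle or concurrency limit would sit around that call.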
This project serves as an agentic browser controller, providing a programmatic bridge that enables autonomous software agents to navigate web pages and interact with document elements. It functions as a browser automation protocol, facilitating headless browser operations and automated web interactions to perform repetitive tasks and end-to-end testing without manual human input. The system distinguishes itself by utilizing the Chrome DevTools Protocol to establish a bidirectional communication channel with the browser engine. This allows for protocol-based remote control, where external appl
This project is a high-performance headless browser engine designed for scalable web automation, data extraction, and AI agent integration. It provides a specialized environment that allows autonomous agents and testing frameworks to interact with web content through standardized remote control protocols. By executing pages in a lightweight, headless state, the engine minimizes resource consumption while maintaining the ability to perform complex navigation and dynamic content rendering. The platform distinguishes itself through deep integration with AI-centric communication layers and advanc
Puppeteer is a JavaScript library for programmatically controlling Chrome and Firefox through the Chrome DevTools Protocol or the WebDriver BiDi protocol. It launches and manages browser instances—typically without a visible user interface—to automate interactions with web pages, enabling navigation, clicking, typing, and data extraction entirely through code. The library distinguishes itself through deep integration with the Chromium embedding layer, allowing fine-grained process configuration with custom flags, permissions, and sandbox policies. It maintains multiple concurrent command stre
This project is a build-time tool that converts single-page application routes into static HTML files. It functions as a Webpack build plugin that uses a headless browser to execute JavaScript and capture the final DOM state as static markup to improve search engine optimization and initial page load speeds. The system provides precise control over the capture process through custom render triggers, allowing HTML generation to be delayed until a specific DOM element appears, a custom event fires, or a timer expires. It also supports global state injection, which embeds JSON-serializable data
Chromeless is a serverless deployment of Chrome and a programmable interface for automating headless browser interactions. It functions as a web page rendering engine and browser orchestrator, enabling the execution of automation tasks within an AWS Lambda environment. The project specializes in managing browser state, cookies, and viewport settings across remote Chrome instances. It provides tools for generating screenshots, PDFs, and raw text exports from rendered web pages. The system supports dynamic web interaction, including form filling, element clicking, and the execution of custom J
Camoufox is a Firefox-based stealth automation browser designed to evade detection during automated browsing. It combines a fingerprint randomization engine that generates thousands of unique device attributes per session, native-level API interception to spoof WebRTC, WebGL, media, and other fingerprintable properties, and human behavior simulation that moves the cursor along natural, distance-aware trajectories. The browser is compiled from source with build-time stealth patches and runs headlessly via a lightweight virtual display buffer, making it suitable for web scraping, automated testi
Taiko is a browser automation framework and web end-to-end testing library used to perform programmatic user actions and verify application behavior. It functions as a headless browser testing tool capable of simulating real interactions and asserting page states in Chromium and Firefox. The project includes a browser interaction recorder that captures live actions and exports them as executable JavaScript automation scripts. It also serves as a web accessibility auditor, analyzing pages to detect accessibility violations and ensure compliance with inclusive design standards. The framework c
Reader is an AI data ingestion pipeline and web content parser designed to convert websites and documents into clean markdown for use with large language models. It functions as a headless browser content extractor and web-to-markdown converter, transforming URLs and PDF files into structured text formats while removing irrelevant web clutter. The system optimizes retrieval augmented generation by acting as a search optimizer that retrieves web results and applies re-ranking to improve context relevance. It further enhances content accessibility by using vision models to generate descriptive
This project is a browser rendering service and headless Chrome PDF generator built on Puppeteer. It functions as a backend tool for converting web pages and raw HTML content into PDF documents and screenshots. The service distinguishes itself through browser session control, allowing for the injection of session cookies and the configuration of navigation timeouts to handle authenticated pages. It also includes viewport-based layout scaling to adjust browser dimensions and device scale factors during the rendering process. The broader capability surface covers HTML content export and automa
gpt-crawler is a web scraping utility designed to extract website content and convert it into structured text files for use as AI model knowledge bases. It functions as a data generator that crawls specified web addresses to produce the knowledge files required for building custom GPTs, grounding large language models, and providing context to AI agents. The system transforms raw HTML into clean Markdown text to reduce token usage and improve readability for AI models. It utilizes token-aware content chunking and output file size limitations to ensure generated datasets remain compatible with
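The idea of token-aware chunking can be sketched in a few lines. This is not gpt-crawler's code: the `chunk_by_tokens` name is hypothetical, and whitespace-separated words stand in for model tokens, which is an assumption (a real tokenizer would count differently).

```python
def chunk_by_tokens(text, max_tokens=50):
    """Split text into chunks of at most max_tokens whitespace-separated words."""
    words = text.split()  # assumption: one word approximates one token
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


chunks = chunk_by_tokens("a b c d e", max_tokens=2)
```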
EyeWitness is a web infrastructure mapper and reconnaissance tool designed to automate the visual mapping of exposed web services. It functions as a headless browser screenshotter and HTTP reconnaissance utility that captures visual evidence and extracts server headers from lists of web targets. The system identifies server technologies and audits for common default administrative credentials to map an organization's external attack surface. It generates searchable HTML security reports that combine screenshots, page source code, and categorized analysis results for vulnerability assessment.