30 open-source projects similar to zipstack/unstract, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Unstract alternative.
docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas. The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
ZenML is an extensible machine learning orchestration framework designed to manage the end-to-end lifecycle of data pipelines and AI agent workflows. It functions as a durable orchestrator that executes machine learning tasks as directed acyclic graphs, ensuring that every step is containerized for consistent performance across local, cloud, and hybrid infrastructure. By decoupling pipeline code from underlying compute and storage backends, the platform allows developers to define infrastructure-agnostic stacks that remain portable across diverse environments. The project distinguishes itself
Surya is a document processing platform designed to transform unstructured files into structured, machine-readable data. It provides a comprehensive suite of tools for text recognition, layout analysis, and reading order detection, enabling the conversion of PDFs and images into formats such as JSON, HTML, or Markdown. The platform is built to handle complex document workflows, offering capabilities for data extraction, document segmentation, and automated form completion. The platform distinguishes itself through a robust pipeline-based architecture that allows users to chain analysis tasks
Parsr is an unstructured data extractor and document parsing pipeline that converts raw files and images into cleaned, machine-readable formats. It functions as a document layout analyzer and a pipeline for extracting structured data and labels using large language models. The system includes a document parsing visualizer, providing a graphical interface to upload documents and inspect the resulting structured data output. The project covers document digitization workflows, including layout analysis to detect headings, tables, and lists, and automated data entry through the cleaning and enri
Unstructured is an enterprise-grade data orchestration engine designed to transform raw, unstructured files into structured, machine-readable formats. It functions as a comprehensive platform for document ingestion, partitioning, and enrichment, specifically engineered to prepare complex data for retrieval-augmented generation and agentic AI workflows. The platform distinguishes itself through its sophisticated document processing strategies, which combine rule-based extraction with vision-language models to handle diverse file layouts, tables, and images. It provides a modular architecture t
Mage AI is a Python-based data pipeline orchestrator and self-hosted integrated development environment for data pipelines. It is designed for building, scheduling, and monitoring data workflows using a block-based pipeline design and interactive notebook interface. The platform distinguishes itself by integrating generative AI capabilities, allowing users to connect large language model providers via API to incorporate artificial intelligence into automated data streams. It also functions as an Apache Spark data processor, managing the kernels and infrastructure required for high-volume analytics and larg
This project is a Model Context Protocol server that enables artificial intelligence assistants to interact directly with Microsoft Excel files. It functions as a bridge, allowing external systems to read, write, and modify spreadsheet data through a standardized interface. By supporting both direct file manipulation and headless application automation, the server provides a comprehensive utility for programmatic workbook management. The server distinguishes itself by combining data processing capabilities with a visual rendering pipeline. It can generate image snapshots of specific spreadshe
MoviePilot is a self-hosted media orchestrator and NAS media library automator. It coordinates workflows between downloaders, metadata scrapers, and file systems to automate the discovery, downloading, renaming, and organization of movie and television content. The system functions as an LLM media management agent, allowing users to control subscriptions, searches, and file organization through conversational text commands. It also acts as a Model Context Protocol server, exposing internal media management tools via a standardized interface for external AI clients and agents. The project inc
This project is a comprehensive suite of AI tools and frameworks, featuring an LLM multi-agent orchestrator, an autonomous agent runtime, and a stateful application framework. It provides the infrastructure to build and manage specialized AI agents capable of coordinating complex tasks through graph-based workflows and shared state. The system is distinguished by its implementation of the Model Context Protocol, allowing for standardized resource discovery and communication between AI clients and servers. It further includes an AI-powered documentation generator designed to analyze source cod
Model Context Protocol is a standardized framework for connecting large language models to external data sources and executable tools. It enables the creation of a universal interface where servers expose tools, resources, and prompts that can be discovered and utilized by various AI clients. The protocol utilizes a JSON-RPC message system that is transport-agnostic, supporting both standard input/output for local processes and HTTP with server-sent events for remote connections. It emphasizes security and control by delegating model sampling to the client to keep API keys secure from servers
This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer. The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
Qodo Cover is an engineering governance platform and AI-powered assistant designed for automated code review and unit test generation. It utilizes an abstract syntax tree codebase knowledge graph to map dependencies and architectural relationships, allowing it to analyze pull requests and enforce organizational coding standards. The system distinguishes itself through a multi-agent analysis pipeline that performs architectural reasoning and identifies bugs beyond the immediate diff. It features a Model Context Protocol server to expose codebase intelligence to external tools and can automatic
Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines. The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex q
Ever Gauzy is an integrated business management suite providing an ERP and CRM framework for professional services automation. It functions as a multi-tenant SaaS platform that combines time tracking, billing, and human resource management into a unified system. The project is distinguished by its headless architecture, utilizing a REST and GraphQL API gateway to expose business operations. It features a Model Context Protocol server that allows AI assistants to interact with system data and execute functional tools for automated business workflows. The platform covers a broad operational su
Deepagents is an LLM agent orchestration platform and stateful application server designed for deploying and managing AI agents built with computational graphs. It provides a containerized runtime environment that handles agent execution, state persistence, and the versioning of AI assistants. The platform distinguishes itself through deep integration with the Model Context Protocol, allowing agents to function as servers that expose tools and capabilities to external clients. It features a sophisticated observability suite for capturing execution traces, performing LLM-based evaluations agai
Odysseus is a self-hosted AI workspace and autonomous agent framework designed for deploying and managing large language models. It serves as a centralized platform for orchestrating agentic tasks, utilizing a Model Context Protocol server to connect AI models to external system utilities, browser automation, and local hardware. The system distinguishes itself through a retrieval-augmented generation (RAG) knowledge base that uses vector stores and local embeddings to provide persistent semantic memory. It further integrates AI-driven communication management to triage email i
WeKnora is a multi-tenant retrieval-augmented generation (RAG) knowledge platform and autonomous AI agent framework. It transforms raw documents into queryable knowledge bases and integrates large language models with vector databases to provide grounded AI responses. The system also functions as a Model Context Protocol (MCP) tool server, exposing knowledge search and agentic capabilities to external AI clients. The platform distinguishes itself through an autonomous agent framework that utilizes iterative reasoning, tool calling, and web search to solve multi-step tasks. It implements a sta
Faraday is a vulnerability management platform and security tool aggregator designed to centralize security findings from multiple scanners into a single dashboard. It utilizes a relational security database to catalog hosts, services, and security flaws, enabling users to track remediation and analyze organizational risk. The platform distinguishes itself through a plugin-based system that normalizes diverse security tool outputs into a unified data model. It supports deep integration with a wide array of scanners and CLI tools, intercepting shell command output or parsing report files to ag
Arize Phoenix is an LLM observability platform and evaluation framework designed to capture execution traces and monitor large language model applications. It serves as a prompt management system for versioning and testing templates, and as a self-hosted AI operations infrastructure for managing telemetry and experiments. The platform differentiates itself through a specialized embedding visualization tool used to detect data drift and optimize vector search. It provides a comprehensive evaluation suite that utilizes judge-based evaluators and ground-truth datasets to score model outputs, and
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
The AWS Cloud Development Kit is an infrastructure-as-code framework that enables developers to define and provision cloud resources using familiar programming languages. By utilizing construct-based synthesis, it translates high-level, object-oriented code into declarative templates, allowing for the automated management of complex cloud environments through a centralized, code-driven control plane. The framework distinguishes itself through its ability to model infrastructure as a dependency-aware resource graph, ensuring that components are provisioned and updated in the correct order. It
This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content. The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document st
Langextract is a framework designed to transform unstructured text into structured, machine-readable data using language model orchestration. It provides a high-performance pipeline that processes large volumes of narrative text by utilizing parallel execution and sequential extraction passes. The library is built to handle complex data extraction tasks, including specialized support for clinical information and medical entity relationship recognition. The project distinguishes itself through a plugin-based architecture that supports both local hardware execution and cloud-hosted model endpoi
This project is a Model Context Protocol server that functions as an automation tool for 3D design software. It acts as a bridge between creative applications and external intelligence agents, enabling users to manipulate geometry, materials, and lighting through natural language instructions. The tool distinguishes itself by providing a standardized interface for remote command execution and scene data exchange. By utilizing a protocol-based communication layer, it allows external models to query viewport status and object properties, facilitating automated decision-making and real-time scen
mcp-agent is a framework for building AI agents that integrate with Model Context Protocol servers to execute tools and access data. It functions as a multi-agent orchestrator and protocol-compliant server, enabling the creation of agents that can discover and invoke tools from connected external servers. The project distinguishes itself through a durable workflow engine that supports long-running tasks capable of pausing, resuming, and surviving restarts. It implements complex orchestration patterns, including iterative evaluator-optimizer loops, hierarchical workflow nesting, and specialist
This project is an MCP browser automation server that connects large language models to headless cloud browsers. It functions as an autonomous web workflow engine and an LLM web agent interface, enabling the translation of natural language instructions into browser actions and structured data retrieval. The system distinguishes itself through a managed headless browser cloud API that supports concurrent Chromium sessions with integrated stealth modes, CAPTCHA solving, and proxy traffic routing. It utilizes self-healing element selection to maintain automation resilience when page structures c
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It
Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured Markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures. The system utilizes large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables across multiple pages. It employs model-based layout analysis to predict block types and bounding boxes, ensuring a more precise conversion of document elements. Capabi
This project provides a Model Context Protocol server that enables autonomous agents to interact with and manage automation workflows. It functions as an integration layer, allowing language models to discover, build, test, and deploy complex automation sequences through natural language instructions and structured schema-based communication. The platform distinguishes itself by offering granular control over automation logic, including the ability to perform surgical, incremental patches to specific workflow nodes rather than replacing entire structures. It supports multi-instance connectivi