# PDF and Table Extraction Tools

> Search results for `parse PDFs and tables for LLM ingestion` on awesome-repositories.com. 117 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/parse-pdfs-and-tables-for-llm-ingestion

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/parse-pdfs-and-tables-for-llm-ingestion).**

## Results

- [cinnamon/kotaemon](https://awesome-repositories.com/repository/cinnamon-kotaemon.md) (25,139 ⭐) — Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines.

The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex q
- [parse-community/parse-server](https://awesome-repositories.com/repository/parse-community-parse-server.md) (21,403 ⭐) — Parse Server is a backend-as-a-service solution and Node.js framework that provides a ready-to-use REST and GraphQL API for mobile and web applications. It functions as a core backend infrastructure for managing database schemas, user authentication, and API routing.

The system distinguishes itself with a real-time data engine that pushes database updates to clients via WebSockets and a GraphQL server that automatically generates schemas based on application data models. It also features an adapter-based storage layer that abstracts interactions with various cloud and local backends.

The pla
- [camel-ai/camel](https://awesome-repositories.com/repository/camel-ai-camel.md) (17,253 ⭐) — This project is a comprehensive framework for building and managing autonomous agent systems. It provides a unified architecture for orchestrating multi-agent societies, where specialized agents collaborate through roleplay to decompose and solve complex tasks. The system integrates language models with external environments, enabling agents to perform real-world actions through a standardized tool-calling abstraction layer.

The framework distinguishes itself through its focus on iterative reasoning and data reliability. It employs automated feedback loops to refine agent outputs and self-eva
- [mendableai/firecrawl-mcp-server](https://awesome-repositories.com/repository/mendableai-firecrawl-mcp-server.md) (6,602 ⭐) — This project is a Model Context Protocol server that connects large language models to web scraping and crawling tools. It functions as a bridge, allowing LLM clients to utilize a web crawling engine and scraping utilities to extract and process web data.

The server integrates a markdown web converter that transforms dynamic web pages and PDF documents into clean markdown to optimize consumption by AI models. It also provides a browser automation interface for controlling headless sessions and bypassing access restrictions.

The system covers broad capabilities including large-scale website d
- [jsfenfen/parsing-prickly-pdfs](https://awesome-repositories.com/repository/jsfenfen-parsing-prickly-pdfs.md) (63 ⭐) — Resources and worksheet for the NICAR 2016 workshop of the same name. Instructors: Jacob Fenton (jsfenfen@gmail.com) and Jeremy Singer-Vine (jsvine@gmail.com).
- [microsoft/markitdown](https://awesome-repositories.com/repository/microsoft-markitdown.md) (154,485 ⭐) — This project is an AI-powered document processing engine designed to transform diverse file formats into structured Markdown. By leveraging multimodal language models, it performs complex layout analysis and semantic text extraction, allowing for the conversion of both unstructured files and scanned images into machine-readable content.

The toolkit distinguishes itself through a modular, plugin-based architecture that orchestrates multi-stage extraction pipelines. Users can steer the parsing behavior by injecting custom instructions, enabling the system to adapt to domain-specific document st
- [3lf/llm-for-humans](https://awesome-repositories.com/repository/3lf-llm-for-humans.md) (75 ⭐) — LLM tutorial for humans / Teaching LLM in Persian
- [nashsu/llm_wiki](https://awesome-repositories.com/repository/nashsu-llm-wiki.md) (12,563 ⭐) — This project is an LLM knowledge base builder and personal knowledge management tool. It is a desktop application designed to transform diverse documents into a persistent, interlinked wiki through LLM analysis and incremental ingestion.

The system distinguishes itself with a knowledge graph visualizer that uses community detection algorithms to map relationships between concepts and identify topical clusters. It features a hybrid retrieval system that combines keyword matching, vector embeddings, and graph relevance to locate information.

The platform covers a wide range of capabilities inc
- [vikparuchuri/marker](https://awesome-repositories.com/repository/vikparuchuri-marker.md) (36,164 ⭐) — Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures.

The system utilizes large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables across multiple pages. It employs model-based layout analysis to predict block types and bounding boxes, ensuring a more precise conversion of document elements.

Capabi
- [briland/llm-security-and-privacy](https://awesome-repositories.com/repository/briland-llm-security-and-privacy.md) (54 ⭐) — LLM security and privacy
- [pathwaycom/pathway](https://awesome-repositories.com/repository/pathwaycom-pathway.md) (62,959 ⭐) — Pathway is a high-performance data processing framework designed for building unified batch and streaming pipelines. It functions as an orchestrator for complex data transformations, utilizing a differential dataflow engine to process updates incrementally. By treating static datasets and continuous event streams with identical logic, the platform ensures exactly-once processing semantics and consistent results across diverse data sources.

The framework distinguishes itself through its specialized support for real-time artificial intelligence and retrieval-augmented generation. It features in
- [llmware-ai/llmware](https://awesome-repositories.com/repository/llmware-ai-llmware.md) (14,838 ⭐) — llmware is a Python framework for AI agent orchestration and model management, designed to coordinate multi-model workflows and autonomous agents. It provides a unified model catalog and standardized interface to execute specialized language models for complex research, analysis, and structured data generation.

The project distinguishes itself through its heavy emphasis on local execution and quantized inference, allowing models to run on private infrastructure using CPU, GPU, and NPU acceleration via runtimes like ONNX and OpenVino. It features a specialized ability to translate natural lang
- [overleaf/overleaf](https://awesome-repositories.com/repository/overleaf-overleaf.md) (17,853 ⭐) — This project is a web-based collaborative editor and scientific document management system designed for LaTeX. It provides a centralized environment for writing, editing, and compiling academic manuscripts, enabling multiple users to work on the same project simultaneously through real-time synchronization.

The platform distinguishes itself by treating documents as version-controlled repositories, allowing for granular history tracking and bidirectional synchronization with external version control systems. It features a secure, containerized compilation pipeline that isolates build processes
- [parse-community/parse-sdk-flutter](https://awesome-repositories.com/repository/parse-community-parse-sdk-flutter.md) (587 ⭐) — The Dart/Flutter SDK for Parse Platform
- [gitbookio/gitbook](https://awesome-repositories.com/repository/gitbookio-gitbook.md) (28,902 ⭐) — Gitbook is a documentation-as-code platform designed for centralized technical knowledge management. It functions as a knowledge management system that synchronizes documentation files directly with version control repositories, allowing teams to maintain content alongside their source code.

The platform distinguishes itself through an integrated artificial intelligence layer that provides context-aware search assistance and automated content suggestions. By utilizing block-based content modeling, it enables the construction of structured, modular documentation that can be compiled into stati
- [anionex/banana-slides](https://awesome-repositories.com/repository/anionex-banana-slides.md) (12,060 ⭐) — Banana-slides is a generative AI workflow engine designed to automate the creation and refinement of professional slide decks. By leveraging large language models, the platform transforms raw text, structured outlines, and existing documents into visual presentations. It functions as an automated tool that orchestrates the entire lifecycle of a presentation, from initial content generation and layout design to final export.

The system distinguishes itself through a modular provider abstraction that allows users to integrate various artificial intelligence services for content and image synthe
- [tpn/pdfs](https://awesome-repositories.com/repository/tpn-pdfs.md) (9,828 ⭐) — This project is a digital document repository and technical PDF library. It serves as a computer science reference archive designed to store a curated collection of academic papers, specifications, and manuals focused on computing and software engineering.

The archive functions as an engineering knowledge base for technical research archiving. It manages a structured library of documents to preserve institutional knowledge and ensure technical documentation remains accessible.

The system employs a curated content pipeline and metadata-driven indexing to organize materials. Documents are mana
- [infiniflow/ragflow](https://awesome-repositories.com/repository/infiniflow-ragflow.md) (82,922 ⭐) — This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execute multi-step reasoning workflows. By integrating document intelligence with advanced retrieval pipelines, the platform enables the creation of grounded, verifiable responses supported by traceable citations.

The platform distinguishes itself through deep document understanding and sophisticated know
- [xitanggg/open-resume](https://awesome-repositories.com/repository/xitanggg-open-resume.md) (8,460 ⭐) — Open-resume is an ATS-friendly resume builder and browser-based document editor designed for creating professional resumes with a focus on applicant tracking system readability. It functions as a resume template engine that allows users to construct structured documents while keeping all personal data stored locally in the browser to ensure privacy and data ownership.

The project features a PDF resume parser that extracts professional information from existing files to automatically populate new templates. It also includes ATS compatibility testing to verify how effectively automated tracking
- [psecio/parse](https://awesome-repositories.com/repository/psecio-parse.md) (382 ⭐) — Parse: A Static Security Scanner
- [alam00000/bentopdf](https://awesome-repositories.com/repository/alam00000-bentopdf.md) (11,550 ⭐) — BentoPDF is a browser-based document toolkit designed for local-first PDF manipulation, conversion, and metadata management. By executing all file processing tasks directly within the browser sandbox, the application ensures that sensitive data remains on the user's device and is never uploaded to or stored on external servers.

The platform distinguishes itself through a modular architecture that supports dynamic remote script loading and the integration of external processing engines. Users can extend the core functionality by connecting third-party libraries, which are executed as compiled
- [wojtekmaj/react-pdf](https://awesome-repositories.com/repository/wojtekmaj-react-pdf.md) (10,920 ⭐) — React-pdf is a library of components designed to integrate document viewing and interaction into web applications. It provides a standardized interface for parsing and displaying portable document format files directly within a browser environment, supporting input from local files, remote web addresses, and encoded data strings.

The library renders document content onto HTML5 canvas elements to ensure consistent visual display across browsers. To maintain interface responsiveness during document processing, it offloads parsing tasks to background threads. It also implements a layered approac
- [zhiburt/tabled](https://awesome-repositories.com/repository/zhiburt-tabled.md) (2,337 ⭐) — An easy-to-use library for pretty-printing tables of Rust structs and enums.
- [pdfminer/pdfminer.six](https://awesome-repositories.com/repository/pdfminer-pdfminer-six.md) (6,906 ⭐) — pdfminer.six is a programmatic tool for extracting text, layout information, and metadata from PDF documents into machine-readable formats. It functions as a document parser that converts internal PDF objects and structures into accessible data objects for analysis.

The project includes utilities for decrypting RC4 and AES encrypted files to enable content extraction. It also provides a layout analyzer to identify fonts, colors, and text locations to determine the organizational structure of pages.

The system covers a broad range of extraction capabilities, including the retrieval of embedde
- [typelevel/cats-parse](https://awesome-repositories.com/repository/typelevel-cats-parse.md) (244 ⭐) — A parsing library for the cats ecosystem
- [adhikasp/mcp-git-ingest](https://awesome-repositories.com/repository/adhikasp-mcp-git-ingest.md) (312 ⭐) — A Model Context Protocol (MCP) server that helps read GitHub repository structure and important files.
- [pathwaycom/llm-app](https://awesome-repositories.com/repository/pathwaycom-llm-app.md) (59,341 ⭐) — This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows.

The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
- [jsvine/pdfplumber](https://awesome-repositories.com/repository/jsvine-pdfplumber.md) (9,732 ⭐) — pdfplumber is a PDF data extraction library and layout analysis tool used to retrieve text, tables, and geometric objects from PDF files using precise coordinate-based analysis. It functions as a layout analyzer and table parser that identifies the bounding boxes and visual coordinates for every character and image on a page.

The library distinguishes itself through visual debugging capabilities, allowing users to render PDF pages as images and draw annotations to verify the position of extracted data. It employs line and intersection analysis to identify cell structures and convert unstructu
- [pymupdf/pymupdf](https://awesome-repositories.com/repository/pymupdf-pymupdf.md) (9,086 ⭐) — PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents.

The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines.

It
- [table-library/react-table-library](https://awesome-repositories.com/repository/table-library-react-table-library.md) (795 ⭐) — React Table Library
- [open-circle/valibot](https://awesome-repositories.com/repository/open-circle-valibot.md) (8,769 ⭐) — Valibot is a modular, type-safe schema library for validating and parsing structural data in TypeScript environments.
- [honojs/hono](https://awesome-repositories.com/repository/honojs-hono.md) (30,994 ⭐) — Hono is a lightweight web framework built on Web Standard APIs that executes across JavaScript runtimes including Cloudflare Workers, Deno, Bun, and Node.js.
- [sindresorhus/parse-json](https://awesome-repositories.com/repository/sindresorhus-parse-json.md) (372 ⭐) — Parse JSON with more helpful errors
- [pdf2htmlex/pdf2htmlex](https://awesome-repositories.com/repository/pdf2htmlex-pdf2htmlex.md) (5,412 ⭐) — pdf2htmlEX is a PDF to HTML converter that transforms documents into web pages while preserving the original layout, fonts, and formatting. It functions as a layout engine and text extractor, mapping PDF coordinate data to HTML and CSS to maintain visual fidelity.

The tool converts PDF content into searchable and selectable native HTML text by embedding original document fonts. It maintains document interactivity by preserving internal links, bookmarks, and outlines, converting them into functional web navigation.

The conversion process supports flexible output structures, allowing documents
- [tomlazar/table](https://awesome-repositories.com/repository/tomlazar-table.md) (52 ⭐) — Pretty, colorful tables in Go with less effort
- [materializeinc/materialize](https://awesome-repositories.com/repository/materializeinc-materialize.md) (6,314 ⭐) — Materialize is a streaming SQL database that continuously ingests live data from external sources and incrementally maintains materialized views, providing consistent, up-to-date query results through any PostgreSQL-compatible client. It combines a change data capture platform for MySQL and PostgreSQL with a Kafka stream ingestion engine, a real-time materialized view engine, and a PostgreSQL-compatible query engine into a single system that processes data as it arrives.

The platform distinguishes itself through its ability to maintain correct, incrementally updated SQL views across joins fro
- [parsely/streamparse](https://awesome-repositories.com/repository/parsely-streamparse.md) (1,506 ⭐) — Run Python in Apache Storm topologies. Pythonic API, CLI tooling, and a topology DSL.
- [duckdb/duckdb](https://awesome-repositories.com/repository/duckdb-duckdb.md) (38,805 ⭐) — DuckDB is an in-process analytical database engine designed to run directly within an application process. As a zero-dependency, embedded system, it provides enterprise-grade SQL data processing capabilities without the overhead of managing a dedicated database server. It is built to handle complex analytical and aggregation tasks by storing and retrieving information in columns, allowing for high-performance relational data manipulation.

The engine distinguishes itself through a columnar vectorized execution model that maximizes CPU cache efficiency during query operations. It employs adapti
- [tabulapdf/tabula](https://awesome-repositories.com/repository/tabulapdf-tabula.md) (7,425 ⭐) — Tabula is a PDF table extraction tool and data scraper designed to isolate tabular structures within text-based PDF files. It functions as a converter that transforms these layouts into structured CSV or spreadsheet formats for data recovery and analysis.

The project provides both a visual interface for manually selecting table areas and a headless command-line interface. This dual approach allows for a choice between manual data recovery via visual-area selection and the integration of table extraction into automated data pipelines.

The extraction process utilizes Java-based PDF parsing and
- [tesseract-ocr/tesseract](https://awesome-repositories.com/repository/tesseract-ocr-tesseract.md) (74,751 ⭐) — Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into desktop, mobile, or server-side applications. By utilizing long short-term memory networks, the engine provides robust text extraction across more than one hundred languages and dozens of scripts.

The project distinguishes itself through a sophisticated document layout analysis f
- [scjangra/table-nvim](https://awesome-repositories.com/repository/scjangra-table-nvim.md) (79 ⭐) — A markdown table editor for Neovim that formats the table as you type.
- [aishwaryanr/awesome-generative-ai-guide](https://awesome-repositories.com/repository/aishwaryanr-awesome-generative-ai-guide.md) (24,755 ⭐) — This project is a community-driven knowledge repository and technical learning resource focused on the field of generative artificial intelligence. It serves as a centralized hub for developers and practitioners to access curated research, tutorials, and foundational concepts necessary for building and deploying modern artificial intelligence applications.

The platform distinguishes itself through a collaborative, distributed contribution model that aggregates diverse learning materials into a structured, searchable knowledge base. It covers a wide range of specialized topics, including retri
- [opendatalab/pdf-extract-kit](https://awesome-repositories.com/repository/opendatalab-pdf-extract-kit.md) (9,724 ⭐) — PDF-Extract-Kit is a document extraction toolkit designed to convert PDF documents into structured formats such as Markdown, HTML, and LaTeX. It functions as a multi-stage parsing framework that combines a document layout analyzer, a formula recognition engine, an OCR text extractor, and a table extraction system.

The project focuses on recovering complex document elements by translating images of mathematical formulas and tabular structures into editable source code. It utilizes model-driven layout analysis to identify structural elements in reports and textbooks while ignoring noise like wa
- [iamkun/dayjs](https://awesome-repositories.com/repository/iamkun-dayjs.md) (48,662 ⭐) — Day.js is a lightweight utility for parsing, validating, and manipulating date objects. It provides a fluent, chainable interface that allows for complex time calculations and transformations to be performed through a sequence of readable method calls. By utilizing an immutable wrapper pattern, the library ensures data integrity by creating new instances for every operation rather than modifying existing objects.

The project is distinguished by a minimalist core abstraction that maintains a small footprint by offloading non-essential features to an optional, modular plugin system. This archit
- [tj/terminal-table](https://awesome-repositories.com/repository/tj-terminal-table.md) (1,573 ⭐) — Ruby ASCII Table Generator, simple and feature rich.
- [aslagle/reactive-table](https://awesome-repositories.com/repository/aslagle-reactive-table.md) (327 ⭐) — A reactive table for Meteor, using Blaze.
- [docling-project/docling](https://awesome-repositories.com/repository/docling-project-docling.md) (61,674 ⭐) — Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types.

The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy t
- [openhands/openhands](https://awesome-repositories.com/repository/openhands-openhands.md) (77,330 ⭐) — OpenHands is an autonomous agent framework designed for software engineering workflows. It provides a modular platform for orchestrating AI agents that reason, plan, and execute tasks within isolated, containerized development environments. By integrating with standard version control and development tools, the system enables agents to autonomously navigate codebases, implement features, and resolve issues through iterative reasoning and tool execution.

The platform distinguishes itself through a model-agnostic orchestrator that connects diverse language models to a unified tool registry. It
- [forem/forem](https://awesome-repositories.com/repository/forem-forem.md) (22,726 ⭐) — Forem is an open-source platform designed for building and managing technical communities. It functions as a social publishing engine that enables members to share long-form content, participate in threaded discussions, and engage through social interactions. The platform provides tools for organizations to maintain branded profiles, host community hackathons, and facilitate collaborative learning through structured educational tracks.

Beyond its social features, Forem integrates advanced capabilities for AI agent workflow orchestration and codebase knowledge graphing. It allows developers to
- [breezedeus/pix2text](https://awesome-repositories.com/repository/breezedeus-pix2text.md) (3,012 ⭐) — Pix2Text is an optical character recognition system and document conversion tool designed to transform images and PDFs into Markdown. It functions as a multilingual OCR engine supporting over 80 languages, a LaTeX formula recognizer for mathematical notations, and a parser integrated with vision language models.

The project utilizes a hybrid pipeline to separate plain text from mathematical formulas and tabular structures within a single pass. It converts recognized formulas into LaTeX expressions and transforms detected tables and layouts into structured Markdown formatting.

The system incl

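Several of the results above describe recovering table cells from ruling lines and their intersections (pdfplumber) or emitting detected tables as Markdown (Marker, Pix2Text). The following is a minimal, standard-library sketch of that general idea, not the implementation used by any of those projects. The function names `cells_from_rulings` and `to_markdown`, the `(x, y, text)` word format, and the assumption that `y` grows downward from the top of the page are all hypothetical choices made for illustration.

```python
from bisect import bisect_right


def cells_from_rulings(xs, ys, words):
    """Assign words to the grid cells bounded by ruling lines.

    xs: x-coordinates of vertical rulings (hypothetical input format).
    ys: y-coordinates of horizontal rulings, with y growing downward.
    words: iterable of (x, y, text) tuples, one anchor point per word.
    Returns a list of rows, each a list of cell strings.
    """
    xs, ys = sorted(set(xs)), sorted(set(ys))
    n_cols, n_rows = len(xs) - 1, len(ys) - 1
    grid = [[[] for _ in range(n_cols)] for _ in range(n_rows)]
    # Visit words top-to-bottom, then left-to-right, so text inside a
    # cell is joined in reading order.
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        col = bisect_right(xs, x) - 1
        row = bisect_right(ys, y) - 1
        # Words outside the outermost rulings are ignored.
        if 0 <= col < n_cols and 0 <= row < n_rows:
            grid[row][col].append(text)
    return [[" ".join(cell) for cell in row] for row in grid]


def to_markdown(rows):
    """Render rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""

    def line(cells):
        # Escape pipes so cell text cannot break the table syntax.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([line(header), separator, *map(line, body)])


if __name__ == "__main__":
    # Toy page: three vertical and four horizontal rulings form a 3x2 grid.
    rulings_x = [0, 50, 100]
    rulings_y = [0, 10, 20, 30]
    page_words = [
        (5, 2, "Item"), (55, 2, "Qty"),
        (5, 12, "Red"), (20, 12, "pen"), (55, 12, "3"),
        (5, 22, "Ink"), (55, 22, "1"),
    ]
    print(to_markdown(cells_from_rulings(rulings_x, rulings_y, page_words)))
```

Real PDFs rarely have such clean rulings. The libraries listed above handle merged cells, borderless tables, and scanned pages, which this sketch does not attempt.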