What are the best open-source alternatives to Markitdown?

30 open-source projects similar to microsoft/markitdown, ranked by shared features. Top picks: docling-project/docling, VikParuchuri/marker, quivrhq/megaparse, microsoft/unilm, getomni-ai/zerox, allenai/olmocr, infiniflow/ragflow, DS4SD/docling, bytedance/Dolphin, opendatalab/MinerU.

Is docling-project/docling a good alternative to Markitdown?

Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elem…

Is vikparuchuri/marker a good alternative to Markitdown?

Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout stru…

Is quivrhq/megaparse a good alternative to Markitdown?

Megaparse is a document parsing tool and RAG data preprocessor designed to convert PDFs, Word documents, and presentations into clean text formats. It functions as a vision-based document extractor that recovers high-fidelity information from images and complex layouts to optimize data for large la…

Is microsoft/unilm a good alternative to Markitdown?

This project is a comprehensive framework and toolkit for developing, optimizing, and deploying transformer-based models across multimodal, document intelligence, and natural language processing tasks. It provides a unified neural architecture that processes text, vision, audio, and document layout…

Is getomni-ai/zerox a good alternative to Markitdown?

Zerox is a multimodal document parser and OCR tool that uses vision models to convert PDF files and images into structured Markdown text. It functions as a visual layout extraction engine, leveraging large multimodal models to digitize documents while maintaining their original structural formattin…

Is allenai/olmocr a good alternative to Markitdown?

Olmocr is a distributed document processing framework designed to convert PDF and image files into structured markdown. It functions as a vision-based document parser that utilizes multimodal neural networks to interpret complex visual layouts and translate them into standardized text representatio…

Is infiniflow/ragflow a good alternative to Markitdown?

This project is a comprehensive retrieval-augmented generation platform designed for building, managing, and deploying knowledge-based AI applications. It provides a unified environment for organizing datasets, configuring conversational chat assistants, and developing autonomous agents that execut…

Is ds4sd/docling a good alternative to Markitdown?

Docling is a multimodal content converter and document parser designed to transform PDFs, Office files, and HTML into structured Markdown or JSON for generative AI applications. It functions as an OCR document processor and a PDF layout analyzer that extracts tables, charts, and hierarchical struct…

Is bytedance/dolphin a good alternative to Markitdown?

Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It functions as an LLM document parser, utilizing vision-language models to simultaneously predict spatial layout and text content.…

Is opendatalab/mineru a good alternative to Markitdown?

MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural…

Open-source alternatives to Markitdown

30 open-source projects similar to microsoft/markitdown, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Markitdown alternative.

docling-project/docling
Stars: 61,674. Language: Python. Topics: ai, convert, document-parser.
Docling is a modular framework designed for document parsing, layout analysis, and structured data extraction. It transforms unstructured files and web content into a unified, hierarchical data model that preserves the spatial and semantic relationships between text, tables, images, and layout elements. By normalizing diverse input formats into a consistent internal representation, the library enables uniform processing across various document types. The project distinguishes itself through a schema-driven approach that maps document regions to strongly-typed objects, ensuring data accuracy t…

VikParuchuri/marker
Stars: 36,164. Language: Python.
Marker is an LLM-powered document parser and OCR pipeline designed to convert PDFs and unstructured files into structured markdown, JSON, and HTML. It functions as a data preprocessor that transforms complex documents into machine-readable formats while preserving tables, equations, and layout structures. The system utilizes large language models to refine OCR accuracy, clean mathematical notation, and merge fragmented tables across multiple pages. It employs model-based layout analysis to predict block types and bounding boxes, ensuring a more precise conversion of document elements. Capabi…

quivrhq/megaparse
Stars: 7,389. Language: Python.
Megaparse is a document parsing tool and RAG data preprocessor designed to convert PDFs, Word documents, and presentations into clean text formats. It functions as a vision-based document extractor that recovers high-fidelity information from images and complex layouts to optimize data for large language model ingestion. The system employs multimodal AI and vision models to perform schema-preserving parsing, which maintains structural hierarchies such as tables and headers. It utilizes lossless structural transformation to turn layout-heavy binary files into text sequences while preserving th…

Open-source alternatives to Markitdown

docling-project/docling

VikParuchuri/marker

quivrhq/megaparse

microsoft/unilm

getomni-ai/zerox

allenai/olmocr

infiniflow/ragflow

DS4SD/docling

bytedance/Dolphin

opendatalab/MinerU

Zipstack/unstract

Stirling-Tools/Stirling-PDF

ucbepic/docetl

adithya-s-k/omniparse

Unstructured-IO/unstructured

kreuzberg-dev/kreuzberg

pymupdf/PyMuPDF

unclecode/crawl4ai

codexu/note-gen

jpmens/jo

huggingface/datatrove

jazzband/tablib

mikefarah/yq

Kozea/WeasyPrint

ekzhu/datasketch

lm-sys/llm-decontaminator

datalab-to/chandra

ConardLi/easy-dataset

argilla-io/distilabel

katanaml/sparrow