What are the best open-source alternatives to LabelLLM?

30 open-source projects similar to opendatalab/labelllm, ranked by shared features. Top picks: lm-sys/llm-decontaminator, raznem/parsera, ds4sd/docling, katanaml/sparrow, opendatalab/mineru, quivrhq/megaparse, catchthetornado/pdf-extract-api, datalab-to/chandra, getomni-ai/zerox, jf-tech/omniparser.

Is lm-sys/llm-decontaminator a good alternative to LabelLLM?

Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"

Is raznem/parsera a good alternative to LabelLLM?

Lightweight library for scraping web-sites with LLMs

Is ds4sd/docling a good alternative to LabelLLM?

Docling is a multimodal content converter and document parser designed to transform PDFs, Office files, and HTML into structured Markdown or JSON for generative AI applications. It functions as an OCR document processor and a PDF layout analyzer that extracts tables, charts, and hierarchical struct…

Is katanaml/sparrow a good alternative to LabelLLM?

Sparrow is an LLM document extraction platform and vision-based inference engine designed to convert images and PDFs into validated structured data. It functions as an agentic workflow orchestrator that chains classification, extraction, and validation tasks into multi-step pipelines. The system d…

Is opendatalab/mineru a good alternative to LabelLLM?

MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural…

Is quivrhq/megaparse a good alternative to LabelLLM?

Megaparse is a document parsing tool and RAG data preprocessor designed to convert PDFs, Word documents, and presentations into clean text formats. It functions as a vision-based document extractor that recovers high-fidelity information from images and complex layouts to optimize data for large la…

Is catchthetornado/pdf-extract-api a good alternative to LabelLLM?

catchthetornado/pdf-extract-api is an open-source alternative to LabelLLM.

Is datalab-to/chandra a good alternative to LabelLLM?

Chandra is a document processing platform that converts images, PDFs, Word documents, spreadsheets, and other formats into structured output such as HTML, Markdown, or JSON while preserving layout. It can also extract specific data fields from invoices, contracts, or reports using user-defined JSO…

Is getomni-ai/zerox a good alternative to LabelLLM?

Zerox is a multimodal document parser and OCR tool that uses vision models to convert PDF files and images into structured Markdown text. It functions as a visual layout extraction engine, leveraging large multimodal models to digitize documents while maintaining their original structural formattin…

Is jf-tech/omniparser a good alternative to LabelLLM?

omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.


Open-source alternatives to LabelLLM

30 open-source projects similar to opendatalab/labelllm, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best LabelLLM alternative.

lm-sys/llm-decontaminator
324 stars
Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"
Python

raznem/parsera
1,279 stars
Lightweight library for scraping web-sites with LLMs
Python, ai, ai-scraping, data-extraction

DS4SD/docling
62,172 stars
Docling is a multimodal content converter and document parser designed to transform PDFs, Office files, and HTML into structured Markdown or JSON for generative AI applications. It functions as an OCR document processor and a PDF layout analyzer that extracts tables, charts, and hierarchical structures while preserving the original page layout. The system operates as a local-first inference engine, allowing for the processing of sensitive data in air-gapped environments without external network connectivity. It can also be deployed as an API or a Model Context Protocol server to provide parsi
Python

katanaml/sparrow
5,162 stars
Sparrow is an LLM document extraction platform and vision-based inference engine designed to convert images and PDFs into validated structured data. It functions as an agentic workflow orchestrator that chains classification, extraction, and validation tasks into multi-step pipelines. The system distinguishes itself through a backend-agnostic inference layer that manages models across local GPUs, Apple Silicon, and cloud providers. It employs coordinate-based visual grounding to map extracted text to precise bounding box coordinates and utilizes hint-based model steering to guide attention an
Python, agentic-ai, computer-vision, documentai

Open-source alternatives to LabelLLM

lm-sys/llm-decontaminator

raznem/parsera

DS4SD/docling

katanaml/sparrow

opendatalab/MinerU

quivrhq/megaparse

CatchTheTornado/pdf-extract-api

datalab-to/chandra

getomni-ai/zerox

jf-tech/omniparser

MinishLab/semhash

opendatalab/DocLayout-YOLO

OpenDCAI/DataFlow

pdf2htmlEX/pdf2htmlEX

599yongyang/DatasetLoom

bytedance/Dolphin

chatdoc-com/OCRFlux

ConardLi/easy-dataset

ekzhu/datasketch

funstory-ai/BabelDOC

huggingface/datatrove

huggingface/llm-swarm

microsoft/markitdown

mikefarah/yq

modelscope/data-juicer

modelscope/easydistill

argilla-io/distilabel

opendatalab/PDF-Extract-Kit

allenai/olmocr

refuel-ai/autolabel