Autolabel

Label, clean and enrich text datasets with LLMs.

Features

Data Processing - Automated labeling, cleaning, and enrichment of text datasets.

allenai/olmocr

GitHub stars: 17,396

olmOCR is a distributed document processing framework designed to convert PDF and image files into structured markdown. It functions as a vision-based document parser that utilizes multimodal neural networks to interpret complex visual layouts and translate them into standardized text representations. The system operates as a remote inference orchestrator, offloading heavy document analysis tasks to external servers or cloud APIs to minimize local computational requirements. By employing a stateless worker architecture, it decouples document ingestion from inference, allowing for the distribu

argilla-io/distilabel

GitHub stars: 3,277

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.

599yongyang/DatasetLoom

GitHub stars: 0

bytedance/Dolphin

GitHub stars: 8,820

Dolphin is a multimodal layout analyzer and image-to-structure converter that transforms photographed or digital document images into machine-readable structured data. It functions as an LLM document parser, utilizing vision-language models to simultaneously predict spatial layout and text content. The system is designed as a concurrent document processor, employing parallel document parsing to process multiple elements across distributed compute nodes. This high-throughput approach reduces the total time required to convert large volumes of images into structured formats. The project covers

refuel-ai/autolabel

Open-source alternatives to Autolabel

allenai/olmocr

argilla-io/distilabel

599yongyang/DatasetLoom

bytedance/Dolphin
