2 repos
Tools that convert internal document representations into structured formats like JSON for downstream consumption.
2 GitHub repositories matching data & databases · Structured Data Exporters.
Tesseract is a neural network-based optical character recognition engine designed to convert scanned images and digital documents into machine-readable, searchable text. It functions as both a command-line utility for automating large-scale digitization workflows and a cross-platform library that can be embedded into d
Produces structured results in JSON or XML formats to facilitate integration with external data processing and layout analysis tools.
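As a rough illustration of feeding structured OCR results to downstream tools, the sketch below converts word-level rows from Tesseract's tab-separated (TSV) output into JSON. It assumes the TSV text has already been produced (for example by running the CLI with the `tsv` output config) and that it carries the usual header row; the function name `tsv_words_to_json`, the `min_conf` threshold, and the sample data are hypothetical and exist only for this example.

```python
import csv
import io
import json

def tsv_words_to_json(tsv_text: str, min_conf: float = 0.0) -> str:
    """Convert Tesseract-style TSV rows into a JSON document of words.

    Assumes a header row with columns such as left, top, width, height,
    conf, and text. Rows without text (page, block, and line rows) are skipped.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        conf = float(row["conf"])
        # Structural rows carry conf == -1 and no text; drop them.
        if not text or conf < min_conf:
            continue
        words.append({
            "text": text,
            "conf": conf,
            "bbox": {
                "left": int(row["left"]),
                "top": int(row["top"]),
                "width": int(row["width"]),
                "height": int(row["height"]),
            },
        })
    return json.dumps({"words": words}, indent=2)

# Hypothetical sample data shaped like Tesseract TSV output.
HEADER = ["level", "page_num", "block_num", "par_num", "line_num",
          "word_num", "left", "top", "width", "height", "conf", "text"]
ROWS = [
    ["1", "1", "0", "0", "0", "0", "0", "0", "640", "480", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "36", "92", "60", "30", "96.5", "Hello"],
    ["5", "1", "1", "1", "1", "2", "110", "92", "70", "30", "91.0", "world"],
]
SAMPLE_TSV = "\n".join("\t".join(r) for r in [HEADER] + ROWS) + "\n"

if __name__ == "__main__":
    print(tsv_words_to_json(SAMPLE_TSV))
```

Keeping the bounding box and confidence next to each word lets a layout analysis step filter or regroup words without re-running OCR.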
MinerU is a document parsing pipeline designed to transform unstructured files into machine-readable, structured data. It utilizes deep learning models to perform layout analysis, identifying document regions and extracting complex content such as mathematical expressions. By combining these neural network inferences w
Exports parsing results as structured JSON files to facilitate deeper data analysis through automated scripts.
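A minimal sketch of the kind of automated script such a JSON export enables is shown below. It assumes the export is a list of content blocks with `type`, `text`, `text_level`, and `page_idx` fields; these field names, the helper `summarize_blocks`, and the sample data are assumptions made for illustration and should be checked against the files a given MinerU version actually writes.

```python
import json
from collections import Counter

def summarize_blocks(blocks):
    """Count blocks by type and collect heading text.

    Assumes each block is a dict with a "type" key, and that heading
    blocks are text blocks carrying a truthy "text_level" value.
    """
    counts = Counter(block.get("type", "unknown") for block in blocks)
    headings = [
        block["text"]
        for block in blocks
        if block.get("type") == "text" and block.get("text_level")
    ]
    return {"counts": dict(counts), "headings": headings}

# Hypothetical export contents; a real script would read them from a file
# with json.load(open(path, encoding="utf-8")).
SAMPLE_EXPORT = json.dumps([
    {"type": "text", "text": "Introduction", "text_level": 1, "page_idx": 0},
    {"type": "text", "text": "Body paragraph.", "page_idx": 0},
    {"type": "equation", "text": "$$E = mc^2$$", "page_idx": 0},
    {"type": "table", "page_idx": 1},
])

if __name__ == "__main__":
    summary = summarize_blocks(json.loads(SAMPLE_EXPORT))
    print(json.dumps(summary, indent=2))
```

Using `dict.get` for every optional field keeps the script tolerant of blocks that omit keys, which is useful when the exact schema is not guaranteed.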