# tabulapdf/tabula

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/tabulapdf-tabula).**

7,425 stars · 697 forks · CSS · MIT

## Links

- GitHub: https://github.com/tabulapdf/tabula
- Homepage: http://tabula.technology
- awesome-repositories: https://awesome-repositories.com/repository/tabulapdf-tabula.md

## Topics

`csv` `excel` `pdf` `scraping` `tables`

## Description

Tabula is a PDF table extraction tool and data scraper that isolates tables in text-based PDF files, meaning PDFs that carry an embedded text layer rather than scanned page images. It converts those tables into CSV or spreadsheet files for data recovery and analysis.

The project provides both a visual interface for manually selecting table areas and a headless command-line interface. Users can therefore recover data by hand through visual area selection, or integrate table extraction into automated data pipelines.
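As a rough illustration of the pipeline use case, the sketch below wraps a command-line call in Python. It assumes the command-line tool is the separately distributed tabula-java jar and that its `--pages`, `--area`, `--format`, and `--outfile` options behave as that project documents them; none of this is stated on this page. The jar path, file names, and both function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_tabula_command(jar_path, pdf_path, out_path, pages="all", area=None):
    """Build the argument list for one extraction run.

    Assumption: the flags below match the tabula-java command-line jar.
    `area` is an optional (top, left, bottom, right) tuple in page units.
    """
    command = [
        "java", "-jar", str(jar_path),
        "--pages", str(pages),
        "--format", "CSV",
        "--outfile", str(out_path),
    ]
    if area is not None:
        top, left, bottom, right = area
        command += ["--area", f"{top},{left},{bottom},{right}"]
    command.append(str(pdf_path))
    return command


def extract_tables(jar_path, pdf_path, out_path, pages="all", area=None):
    """Run the extraction and return the path of the CSV that was written."""
    command = build_tabula_command(jar_path, pdf_path, out_path, pages, area)
    # check=True raises CalledProcessError if the tool exits with an error
    subprocess.run(command, check=True)
    return Path(out_path)
```

Keeping command construction separate from execution makes the pipeline step easy to test without Java or a PDF on hand; a call such as `extract_tables("tabula.jar", "report.pdf", "report.csv", pages="1")` would then run the real tool.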

The extraction process uses Java-based PDF parsing and pattern-based row detection to identify table boundaries. Once the boundaries are known, the tool extracts the text inside them by coordinate and serializes it as comma-separated values.
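The three steps described above (coordinate-based selection, row detection, CSV serialization) can be sketched in a few lines of Python. This is a simplified illustration, not Tabula's actual Java implementation: the `TextChunk` type, the function names, and the fixed `y_tolerance` threshold are all assumptions made for the example, and real input would come from a PDF parser rather than a hand-written list.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class TextChunk:
    """A piece of text and the page position of its top-left corner (hypothetical type)."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def within_area(chunks, top, left, bottom, right):
    """Keep only the chunks whose position falls inside the selected rectangle."""
    return [c for c in chunks if left <= c.x <= right and top <= c.y <= bottom]


def group_into_rows(chunks, y_tolerance=2.0):
    """Group chunks into rows when their vertical positions nearly coincide."""
    rows = []
    for chunk in sorted(chunks, key=lambda c: (c.y, c.x)):
        # Compare against the first chunk of the current row (assumed heuristic)
        if rows and abs(chunk.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(chunk)
        else:
            rows.append([chunk])
    # Order the cells of each row from left to right
    return [sorted(row, key=lambda c: c.x) for row in rows]


def rows_to_csv(rows):
    """Serialize rows of chunks as CSV text, quoting fields where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([chunk.text for chunk in row])
    return buffer.getvalue()


if __name__ == "__main__":
    page = [
        TextChunk("Annual report", 40.0, 10.0),  # outside the selected area
        TextChunk("Price", 200.0, 50.0),
        TextChunk("Item", 40.0, 50.4),
        TextChunk("Widget, large", 40.0, 70.0),
        TextChunk("9.50", 200.0, 70.2),
    ]
    table = within_area(page, top=40.0, left=0.0, bottom=100.0, right=300.0)
    print(rows_to_csv(group_into_rows(table)))
```

Run as a script, the demo drops the page title, prints a header row `Item,Price`, and then the row `"Widget, large",9.50`; the `csv` module quotes the first field because it contains a comma.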

## Tags

### Content Management & Publishing

- [Table Extraction Utilities](https://awesome-repositories.com/f/content-management-publishing/documentation-knowledge-management/pdf-structural-elements/table-extraction-utilities.md) — Provides a visual interface for isolating and converting tabular data from PDF structural elements. ([source](https://cdn.jsdelivr.net/gh/tabulapdf/tabula@master/README.md))
- [Automated Data Extraction](https://awesome-repositories.com/f/content-management-publishing/content-processing-transformation/document-processing-conversion/document-processing/data-extraction-analysis/automated-data-extraction.md) — Offers a command-line tool and language bindings for automated extraction of tabular PDF data. ([source](https://cdn.jsdelivr.net/gh/tabulapdf/tabula@master/README.md))
- [PDF to CSV Converters](https://awesome-repositories.com/f/content-management-publishing/pdf-to-html-converters/pdf-to-csv-converters.md) — Transforms PDF table layouts into structured CSV or spreadsheet formats.

### Part of an Awesome List

- [PDF Tools](https://awesome-repositories.com/f/awesome-lists/productivity/pdf-tools.md) — Provides a specialized tool for extracting tabular data from text-based PDFs into spreadsheets.

### Data & Databases

- [PDF Spatial Layout Parsers](https://awesome-repositories.com/f/data-databases/document-parsing-engines/web-document-parsing/visual-layout-parsing/pdf-spatial-layout-parsers.md) — Utilizes Java-based parsing to decompose PDF structures while preserving spatial layout information.
- [PDF Parsers](https://awesome-repositories.com/f/data-databases/pdf-parsers.md) — Acts as a PDF parser that identifies and isolates table structures for data recovery.
- [Tabular Row Detection](https://awesome-repositories.com/f/data-databases/tabular-row-detection.md) — Algorithmically identifies table rows by analyzing the vertical spacing and alignment of text elements.
- [Data Extraction Pipelines](https://awesome-repositories.com/f/data-databases/data-extraction-pipelines.md) — Enables the integration of PDF table extraction into automated data processing workflows.
- [CSV Serialization](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/data-serialization/json-serializers/csv-serialization.md) — Converts extracted tabular text into comma-separated value formats for spreadsheet compatibility.
- [PDF Coordinate Extraction](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-extraction/coordinate-based-extraction/pdf-coordinate-extraction.md) — Extracts character data based on precise X and Y coordinates defined by detected table boundaries.

### Development Tools & Productivity

- [Command Line Interfaces](https://awesome-repositories.com/f/development-tools-productivity/command-line-interfaces.md) — Provides a headless command-line interface for integrating table extraction into automated pipelines.
- [PDF Command Line Utilities](https://awesome-repositories.com/f/development-tools-productivity/pdf-command-line-utilities.md) — Ships a command-line utility for automating the retrieval of tables from PDF files.

### User Interface & Experience

- [Tabular Data Extraction](https://awesome-repositories.com/f/user-interface-experience/html-content-processing/pdf-and-html-content-extraction/tabular-data-extraction.md) — Specializes in the extraction of structured table data from PDF document formats.
- [PDF Table Area Selection](https://awesome-repositories.com/f/user-interface-experience/selectable-lists/multiple-selections/canvas-area-selections/pdf-table-area-selection.md) — Provides a visual interface for manually defining table coordinates for data extraction.

### Business & Productivity Software

- [Manual Tabular Data Recovery](https://awesome-repositories.com/f/business-productivity-software/manual-tabular-data-recovery.md) — Lets users select table areas by hand to recover data without retyping it.
