# apache/tika

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/apache-tika).**

3,572 stars · 909 forks · Java · apache-2.0

## Links

- GitHub: https://github.com/apache/tika
- Homepage: https://tika.apache.org/
- awesome-repositories: https://awesome-repositories.com/repository/apache-tika.md

## Topics

`content` `extraction` `java` `metadata` `tika`

## Description

Tika is a content analysis toolkit and Java library for detecting and extracting metadata and text from over a thousand different file types. It functions as a universal document text extractor and metadata extraction engine, converting complex files into plain text or XHTML.

Tika's MIME type detector identifies document formats from magic bytes, file names, and metadata, and the detected type determines which parser handles the file. Tika also acts as an OCR integration gateway, connecting to external text recognition tools to extract content from image files.
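
To make the magic-byte idea concrete, here is a minimal sketch in plain Java. It is not Tika's implementation: the `MagicSniffer` class and its three-entry signature table are hypothetical, and it only compares leading bytes, ignoring file names and metadata.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Tika: maps leading "magic" bytes to a media type. */
final class MagicSniffer {
    /** One known byte prefix and the media type it implies. */
    private static final class Signature {
        final byte[] prefix;
        final String mediaType;

        Signature(byte[] prefix, String mediaType) {
            this.prefix = prefix;
            this.mediaType = mediaType;
        }
    }

    // Checked in order; a real detector carries far more entries and priorities.
    private static final List<Signature> SIGNATURES = new ArrayList<>();
    static {
        SIGNATURES.add(new Signature("%PDF-".getBytes(StandardCharsets.US_ASCII), "application/pdf"));
        SIGNATURES.add(new Signature(new byte[] {(byte) 0x89, 'P', 'N', 'G'}, "image/png"));
        SIGNATURES.add(new Signature(new byte[] {'P', 'K', 3, 4}, "application/zip"));
    }

    /** Returns the first matching media type, or a generic binary type if nothing matches. */
    static String detect(byte[] header) {
        for (Signature signature : SIGNATURES) {
            if (startsWith(header, signature.prefix)) {
                return signature.mediaType;
            }
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `MagicSniffer.detect(...)` on the first bytes of a file yields a media type that a dispatcher could use to pick a parser. A ZIP signature alone cannot tell a plain archive from a document format packaged as ZIP, which is why inspecting container contents is a separate step.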

The project covers a broad range of extraction and analysis capabilities, including digital asset metadata retrieval, email archive processing for formats like PST and mbox, and natural language detection. It further supports automated document parsing, recursive archive unpacking, and text content analysis through integrations for sentiment classification and named entity recognition.

## Tags

### Data & Databases

- [Content Extraction](https://awesome-repositories.com/f/data-databases/content-extraction.md) — Provides a unified interface for retrieving raw text and metadata from a vast variety of document types. ([source](https://tika.apache.org/))
- [MIME Type Detection Engines](https://awesome-repositories.com/f/data-databases/mime-type-detection-engines.md) — Determines file formats by analyzing magic bytes, filenames, and metadata using a sequence of prioritized detectors.
- [Content Analysis Toolkits](https://awesome-repositories.com/f/data-databases/content-analysis-toolkits.md) — Ships a comprehensive Java library for detecting and extracting metadata and text from over a thousand different file formats.
- [Content Type Detection](https://awesome-repositories.com/f/data-databases/content-type-detection.md) — Automates the identification of media types by analyzing byte patterns, filenames, and metadata. ([source](https://tika.apache.org/3.3.1/detection.html))
- [Document Parsing Engines](https://awesome-repositories.com/f/data-databases/document-parsing-engines.md) — Converts unstructured documents from complex formats into structured plain text or XHTML.
- [Media Metadata Extraction](https://awesome-repositories.com/f/data-databases/media-metadata-extraction.md) — Retrieves descriptive attributes and system metadata from images, audio, video, and scientific formats. ([source](https://tika.apache.org/2.8.0/index.html))
- [MIME Type Detection](https://awesome-repositories.com/f/data-databases/mime-type-detection.md) — Identifies the format and media type of a file by analyzing byte patterns and file metadata.
- [MIME Type Detectors](https://awesome-repositories.com/f/data-databases/mime-type-detectors.md) — Includes a specialized detector that identifies document formats using magic bytes and metadata to determine the correct media type.
- [Container Format Inspection](https://awesome-repositories.com/f/data-databases/container-format-inspection.md) — Analyzes files wrapped in common container formats to determine the specific document type residing inside. ([source](https://tika.apache.org/3.3.1/detection.html))
- [Detection Pipeline Configurations](https://awesome-repositories.com/f/data-databases/content-type-detection/detection-pipeline-configurations.md) — Allows users to define active detectors and their execution sequence to determine the MIME type of a document. ([source](https://tika.apache.org/3.3.1/configuring.html))
- [Structural Event Streaming](https://awesome-repositories.com/f/data-databases/data-processing-pipelines/stream-processing-systems/data-streaming/structured-event-streams/structural-event-streaming.md) — Converts documents into a stream of events to preserve structural elements without loading the full file into memory. ([source](https://tika.apache.org/3.3.1/api/))
- [Memory-Efficient Streaming](https://awesome-repositories.com/f/data-databases/text-processing-utilities/text-extraction/text-segmentation/memory-efficient-streaming.md) — Processes extracted text in small segments via custom handlers to minimize memory usage for large documents. ([source](https://tika.apache.org/3.3.1/examples.html))
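
The event-streaming and chunked-handler entries above can be illustrated with a SAX handler that forwards text in bounded pieces. This is a sketch under assumptions: `ChunkingHandler` and its `chunk` helper are hypothetical names, and the JDK's own SAX parser stands in for a document parser that emits XHTML events.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: forwards character events to a sink in chunks of at most maxChars. */
final class ChunkingHandler extends DefaultHandler {
    private final int maxChars;
    private final Consumer<String> sink;
    private final StringBuilder buffer = new StringBuilder();

    ChunkingHandler(int maxChars, Consumer<String> sink) {
        if (maxChars < 1) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        this.maxChars = maxChars;
        this.sink = sink;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only one chunk is buffered at a time, so memory use does not grow with document size.
        for (int i = start; i < start + length; i++) {
            buffer.append(ch[i]);
            if (buffer.length() == maxChars) {
                flush();
            }
        }
    }

    @Override
    public void endDocument() {
        flush(); // emit whatever is left in the final, possibly shorter, chunk
    }

    private void flush() {
        if (buffer.length() > 0) {
            sink.accept(buffer.toString());
            buffer.setLength(0);
        }
    }

    /** Stand-in driver: runs XHTML through the JDK SAX parser and collects the chunks. */
    static List<String> chunk(String xhtml, int maxChars) {
        List<String> chunks = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)),
                new ChunkingHandler(maxChars, chunks::add));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
        return chunks;
    }
}
```

The `chunk` helper collects everything into a list only for demonstration. In a streaming setup the sink would index, write, or discard each chunk as it arrives.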

### Part of an Awesome List

- [Document Text Extractors](https://awesome-repositories.com/f/awesome-lists/data/document-parsing-and-extraction/document-text-extractors.md) — Converts complex binary files like PDFs and Office documents into plain text or XHTML for downstream processing.
- [Stream-Based Parsing](https://awesome-repositories.com/f/awesome-lists/data/html-and-xml-parsing/xml-parsing/stream-based-parsing.md) — Implements event-driven parsing of large documents to extract text while minimizing memory consumption.
- [Custom Extractor Implementations](https://awesome-repositories.com/f/awesome-lists/data/document-parsing-and-extraction/document-text-extractors/custom-extractor-implementations.md) — Provides interfaces for implementing custom logic to extract text and metadata from unsupported file formats. ([source](https://tika.apache.org/3.3.1/parser_guide.html))
- [Visual Content Analysis](https://awesome-repositories.com/f/awesome-lists/media/audio-and-video-analysis/visual-content-analysis.md) — Identifies objects and visual elements within images and videos by integrating with external recognition frameworks. ([source](https://tika.apache.org/3.3.1/formats.html))
- [Email Text Extraction](https://awesome-repositories.com/f/awesome-lists/more/text-extraction-and-ocr/email-text-extraction.md) — Extracts individual messages and attachments from mailbox formats such as mbox, PST, and MSG. ([source](https://tika.apache.org/3.3.1/formats.html))
- [OCR Integration Gateways](https://awesome-repositories.com/f/awesome-lists/more/text-extraction-and-ocr/ocr-integration-gateways.md) — Provides an interface for connecting to external text recognition tools to extract text from image files.

### Content Management & Publishing

- [General Metadata Engines](https://awesome-repositories.com/f/content-management-publishing/document-metadata-extraction/general-metadata-engines.md) — Implements a framework for retrieving embedded properties and descriptive attributes from diverse media and document formats.
- [Multi-Format Archive Extraction Commands](https://awesome-repositories.com/f/content-management-publishing/content-archiving/web-content-archivers/session-data-archivers/remote-archive-extraction/archive-extraction/multi-format-archive-extraction-commands.md) — Decompresses various packaging formats like Zip, Tar, and RAR to extract nested document streams. ([source](https://tika.apache.org/3.3.1/formats.html))
- [Email Archive Processing](https://awesome-repositories.com/f/content-management-publishing/email-archive-processing.md) — Extracts individual messages and embedded attachments from mailbox formats like PST and mbox.
- [Parsing Behavior Configurations](https://awesome-repositories.com/f/content-management-publishing/parsing-behavior-configurations.md) — Enables control over parser selection and priority via configuration files to override default extraction behaviors. ([source](https://tika.apache.org/3.3.1/configuring.html))

### Development Tools & Productivity

- [Digital Asset Metadata Retrieval](https://awesome-repositories.com/f/development-tools-productivity/digital-asset-metadata-retrieval.md) — Pulls descriptive attributes and system properties from images, audio, video, and scientific file formats.
- [Magic Byte File Identification](https://awesome-repositories.com/f/development-tools-productivity/magic-byte-file-identification.md) — Identifies file types by analyzing header magic bytes to determine the correct parser for extraction. ([source](https://tika.apache.org/3.3.1/gettingstarted.html))
- [Recursive Document Unpacking](https://awesome-repositories.com/f/development-tools-productivity/archive-management/archive-importers/recursive-archive-traversers/recursive-document-unpacking.md) — Decompresses nested container formats to extract and parse embedded document streams and attachments.
- [Embedded File Extraction](https://awesome-repositories.com/f/development-tools-productivity/embedded-file-extraction.md) — Retrieves and saves embedded files and attachments found within documents to a local directory. ([source](https://tika.apache.org/3.3.1/gettingstarted.html))
- [External Process Delegation](https://awesome-repositories.com/f/development-tools-productivity/external-process-delegation.md) — Delegates complex tasks like OCR and translation to external third-party binary tools.
- [Recursive Batch Processing](https://awesome-repositories.com/f/development-tools-productivity/recursive-batch-processing.md) — Automates the extraction of content and metadata for all files within a directory using multi-threading. ([source](https://tika.apache.org/3.3.1/gettingstarted.html))
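
Recursive unpacking can be sketched with the JDK's `java.util.zip` classes alone. This is only an illustration of the technique, not how Tika does it: `NestedZipWalker` is a hypothetical class, it handles ZIP only, and it recognises nested archives by the `.zip` file extension rather than by content detection.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical walker: lists leaf entries of a ZIP, descending into nested ZIP entries. */
final class NestedZipWalker {
    /** Returns paths such as "inner.zip!/note.txt" for every non-archive entry. */
    static List<String> list(InputStream in) {
        List<String> paths = new ArrayList<>();
        try {
            walk(new ZipInputStream(in), "", paths);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return paths;
    }

    private static void walk(ZipInputStream zip, String prefix, List<String> out) throws IOException {
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            String path = prefix + entry.getName();
            if (path.endsWith(".zip")) {
                // The outer stream is positioned at this entry's data, so it can be wrapped
                // directly. The inner stream is deliberately not closed, because closing it
                // would also close the outer stream.
                walk(new ZipInputStream(zip), path + "!/", out);
            } else {
                out.add(path);
            }
        }
    }

    /** Demo helper: builds a single-entry ZIP archive in memory. */
    static byte[] zipOf(String name, byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(data);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Because each nested archive is read straight from the enclosing stream, nothing is written to disk and no entry has to be fully buffered before its children are visited.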

### Graphics & Multimedia

- [Optical Character Recognition](https://awesome-repositories.com/f/graphics-multimedia/optical-character-recognition.md) — Integrates with external OCR tools to extract text from image files and scanned PDFs. ([source](https://tika.apache.org/3.3.1/formats.html))

### Artificial Intelligence & ML

- [Named Entity Recognition](https://awesome-repositories.com/f/artificial-intelligence-ml/named-entity-recognition.md) — Extracts specific named entities such as people and organizations from text using external recognition services. ([source](https://tika.apache.org/3.3.1/formats.html))
- [Language Identification](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-entity-extraction/language-identification.md) — Analyzes text within documents to identify the natural language used when metadata is missing. ([source](https://tika.apache.org/3.3.1/api/))
- [Natural Language Processing Analysis](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing-analysis.md) — Integrates OCR and language detection to perform linguistic analysis and process text from images.
- [Memory-Efficient Chunking](https://awesome-repositories.com/f/artificial-intelligence-ml/natural-language-processing/text-tokenization/text-chunks/memory-efficient-chunking.md) — Processes extracted content in small segments via custom handlers to maintain a low memory footprint.
- [Sentiment Analysis Tools](https://awesome-repositories.com/f/artificial-intelligence-ml/sentiment-analysis-tools.md) — Classifies the emotional tone of documents by processing text through integrated natural language analysis tools. ([source](https://tika.apache.org/3.3.1/formats.html))

### Programming Languages & Runtimes

- [Encoding Extraction](https://awesome-repositories.com/f/programming-languages-runtimes/character-encoding-utilities/encoding-extraction.md) — Automatically determines the character encoding of plain text documents to ensure accurate extraction. ([source](https://tika.apache.org/3.3.1/formats.html))

### Software Engineering & Architecture

- [Parser Mapping Interfaces](https://awesome-repositories.com/f/software-engineering-architecture/parser-mapping-interfaces.md) — Maps detected MIME types to specialized parser classes through a unified interface for consistent text extraction.
