What are the best open-source alternatives to Tika?

30 open-source projects similar to apache/tika, ranked by shared features. Top picks: kreuzberg-dev/kreuzberg, pymupdf/PyMuPDF, sindresorhus/file-type, ahupp/python-magic, shengqiangzhang/examples-of-web-crawlers, axa-group/nlp.js, deanmalmgren/textract, markdown-it/markdown-it, protectai/llm-guard, stanfordnlp/CoreNLP.

Is kreuzberg-dev/kreuzberg a good alternative to Tika?

Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for b…

Is pymupdf/PyMuPDF a good alternative to Tika?

PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-perfo…

Is sindresorhus/file-type a good alternative to Tika?

file-type is a binary file type detector that identifies file extensions and MIME types by analyzing magic numbers and signature bytes in binary data. It functions as a magic number parser and MIME type resolver, mapping binary signatures to standardized media type strings. The project is an exten…

Is ahupp/python-magic a good alternative to Tika?

python-magic is a C-binding wrapper that provides a Python interface for the libmagic system library. It functions as a file signature analyzer and MIME type detector, identifying file formats by comparing header bytes against a database of known binary signatures. The library enables the identifi…
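
The signature matching described above can be sketched in a few lines of plain Python. This is a minimal illustration only, not python-magic's or file-type's actual API: the `SIGNATURES` table and the `sniff_mime` helper are hypothetical names, and the table covers only a handful of well-known formats where real signature databases hold thousands of entries.

```python
# Hypothetical helper for illustration; not part of python-magic or file-type.
# Small subset of well-known file signatures ("magic numbers").
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),  # also the container for .docx, .xlsx, .epub
]


def sniff_mime(header: bytes, default: str = "application/octet-stream") -> str:
    """Return the MIME type whose signature prefixes `header`, else `default`."""
    for magic, mime in SIGNATURES:
        if header.startswith(magic):
            return mime
    return default


def sniff_file(path: str) -> str:
    """Read only the first bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_mime(fh.read(16))
```

Calling `sniff_mime(b"%PDF-1.7\n")` returns `"application/pdf"`. Note that a prefix match alone cannot tell a plain ZIP archive from an Office document, which is one reason dedicated detectors inspect more than the first few bytes.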

Is shengqiangzhang/examples-of-web-crawlers a good alternative to Tika?

This project is a collection of Python scripts and tools designed for web scraping, browser automation, and large-scale data extraction. It provides a set of implementations for retrieving information from websites and private APIs, including tools for multimedia downloading and social media data a…

Is axa-group/nlp.js a good alternative to Tika?

nlp.js is a JavaScript natural language processing library and development framework used to build natural language understanding engines. It provides a toolkit for creating local machine learning models for intent classification and acts as a multilingual text processor that detects languages and…

Is deanmalmgren/textract a good alternative to Tika?

Textract is a multi-format text extraction tool and parser. It provides a unified interface to extract plain text from a variety of sources, including documents, images, and audio files. The system functions as a document content parser for PDFs and spreadsheets, an image text extractor using opti…

Is markdown-it/markdown-it a good alternative to Tika?

markdown-it is a token-based Markdown compiler and CommonMark-compliant parser that converts structured plaintext markup into HTML. It functions as an extensible markup processor designed to transform text into browser-ready content while managing security and preventing cross-site scripting. The…

Is protectai/llm-guard a good alternative to Tika?

LLM Guard is a security firewall and guardrail framework designed to scan and sanitize inputs and outputs for large language models. It functions as a proxy gateway and security layer to block prompt injections, toxicity, and sensitive data leakage while ensuring that model interactions remain comp…

Is stanfordnlp/CoreNLP a good alternative to Tika?

CoreNLP is a Java natural language processing library designed to convert raw human language text into structured data. It utilizes a suite of linguistic annotators to analyze text through a pipeline, extracting grammatical structures, sentiment, and linguistic patterns. The project includes a cor…


Open-source alternatives to Tika

30 open-source projects similar to apache/tika, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Tika alternative.

kreuzberg-dev/kreuzberg
8,527 stars
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo…
Rust, document-intelligence, elixir, ffi

pymupdf/PyMuPDF
9,086 stars
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It…
Python, data-science, epub, extract-data

sindresorhus/file-type
4,297 stars
file-type is a binary file type detector that identifies file extensions and MIME types by analyzing magic numbers and signature bytes in binary data. It functions as a magic number parser and MIME type resolver, mapping binary signatures to standardized media type strings. The project is an extensible file format identifier that allows for the addition of custom detector plugins to recognize uncommon or non-binary file formats. The engine supports binary format identification across various data sources, including buffers and data streams. It utilizes a supported format registry and provide…
JavaScript

Open-source alternatives to Tika

kreuzberg-dev/kreuzberg

pymupdf/PyMuPDF

sindresorhus/file-type

ahupp/python-magic

shengqiangzhang/examples-of-web-crawlers

axa-group/nlp.js

deanmalmgren/textract

markdown-it/markdown-it

protectai/llm-guard

stanfordnlp/CoreNLP

llmware-ai/llmware

whatwg/html

JimLiu/baoyu-skills

nlptown/nlp-notebooks

ownthink/KnowledgeGraphData

shekhargulati/52-technologies-in-2016

codelucas/newspaper

cheeriojs/cheerio

mstamy2/PyPDF2

FreeTubeApp/FreeTube

BuilderIO/gpt-crawler

browseros-ai/BrowserOS

box/spout

apify/crawlee-python

py-pdf/pypdf

quivrhq/megaparse

iOfficeAI/OfficeCLI

opendatalab/PDF-Extract-Kit

google/magika

the-paperless-project/paperless