30 open-source projects similar to apache/tika, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Tika alternative.
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
PyMuPDF is a comprehensive PDF manipulation library and document analysis tool. It serves as a text extraction tool, OCR engine, and image converter, providing a programmatic interface to edit, merge, split, and optimize PDF and Office documents. The project distinguishes itself through high-performance capabilities, including the use of C-bindings for low-level manipulation and parallelized page processing to accelerate workloads. It provides specialized conversion paths, such as transforming PDF content into Markdown for retrieval-augmented generation and large language model pipelines. It
file-type is a binary file type detector that identifies file extensions and MIME types by analyzing magic numbers and signature bytes in binary data. It functions as a magic number parser and MIME type resolver, mapping binary signatures to standardized media type strings. The project is an extensible file format identifier that allows for the addition of custom detector plugins to recognize uncommon or non-binary file formats. The engine supports binary format identification across various data sources, including buffers and data streams. It utilizes a supported format registry and provide
python-magic is a C-binding wrapper that provides a Python interface for the libmagic system library. It functions as a file signature analyzer and MIME type detector, identifying file formats by comparing header bytes against a database of known binary signatures. The library enables the identification of file types from both file paths and raw data buffers. It supports custom file signature matching through the injection of user-provided magic databases, allowing for the detection of specialized or proprietary formats. The project covers binary data analysis and MIME type mapping to transl
This project is a collection of Python scripts and tools designed for web scraping, browser automation, and large-scale data extraction. It provides a set of implementations for retrieving information from websites and private APIs, including tools for multimedia downloading and social media data archiving. The toolset includes specialized mechanisms for bypassing anti-scraping measures through IP proxy pool rotation and multi-threaded crawlers. It also features capabilities for simulating browser sessions to handle authentication, intercepting session cookies, and decrypting network payloads
nlp.js is a JavaScript natural language processing library and development framework used to build natural language understanding engines. It provides a toolkit for creating local machine learning models for intent classification and acts as a multilingual text processor that detects languages and normalizes text across various dialects. The framework distinguishes itself by supporting local execution on both servers and mobile devices, enabling chatbot functionality without an internet connection. It features a specialized system for conversational slot filling to collect mandatory informati
Textract is a multi-format text extraction tool and parser. It provides a unified interface to extract plain text from a variety of sources, including documents, images, and audio files. The system functions as a document content parser for PDFs and spreadsheets, an image text extractor using optical character recognition, and a speech-to-text transcriber for audio recordings.
markdown-it is a token-based Markdown compiler and CommonMark-compliant parser that converts structured plaintext markup into HTML. It functions as an extensible markup processor designed to transform text into browser-ready content while managing security and preventing cross-site scripting. The project is distinguished by a modular plugin system that allows for the extension of parsing capabilities and the addition of custom syntax, such as footnotes, tables, or emojis. It utilizes a two-stage tokenization process to break documents into structural tokens before rendering them into final HT
LLM Guard is a security firewall and guardrail framework designed to scan and sanitize inputs and outputs for large language models. It functions as a proxy gateway and security layer to block prompt injections, toxicity, and sensitive data leakage while ensuring that model interactions remain compliant with organizational policies. The system distinguishes itself through a modular scanner pipeline that utilizes local model orchestration to eliminate external network dependencies. It supports real-time security filtering via streaming chunk analysis and implements a fail-fast execution model
CoreNLP is a Java natural language processing library designed to convert raw human language text into structured data. It utilizes a suite of linguistic annotators to analyze text through a pipeline, extracting grammatical structures, sentiment, and linguistic patterns. The project includes a coreference resolution engine that links multiple mentions of the same entity to maintain contextual consistency across documents. It also provides tools for named entity recognition to categorize people, companies, and locations, and a part-of-speech tagger to assign grammatical categories and base for
llmware is a Python framework for AI agent orchestration and model management, designed to coordinate multi-model workflows and autonomous agents. It provides a unified model catalog and standardized interface to execute specialized language models for complex research, analysis, and structured data generation. The project distinguishes itself through its heavy emphasis on local execution and quantized inference, allowing models to run on private infrastructure using CPU, GPU, and NPU acceleration via runtimes like ONNX and OpenVINO. It features a specialized ability to translate natural lang
This repository contains the HTML specification, which defines the core standards for web page structuring, content organization, and document rendering. It establishes the fundamental algorithms for state-machine-based tokenization, tree construction for the document object model, and origin-based security isolation. The specification provides a framework for defining custom elements with independent lifecycles and registries. It also details the requirements for cross-document communication, session history management, and the synchronization of interface properties with content attributes.
This project is an automation suite comprising an AI visual asset generator, a browser-based social publisher, an Electron resource extractor, and a Markdown content transformer. It functions as a content automation pipeline that uses large language models to generate text and images for distribution across social media platforms. The system distinguishes itself through specialized visual generation capabilities, producing professional infographics, slide decks, educational comics, and SVG diagrams via structured prompts. It also features a dedicated workflow for extracting resources from Ele
This repository is a collection of educational Jupyter notebooks designed to demonstrate practical machine learning and natural language processing techniques. It serves as a tutorial library for implementing statistical models and neural architectures to solve common linguistic analysis tasks through interactive, modular code execution. The project provides guided workflows for a wide range of applied tasks, including sentiment evaluation, named entity extraction, and document classification. It distinguishes itself by offering concrete implementations for complex operations such as probabil
KnowledgeGraphData is a collection of structured datasets and corpora designed to provide a foundational layer for cognitive intelligence and artificial intelligence systems. It primarily consists of large-scale Chinese knowledge graph datasets, including entity-relation data and NLP training sets used to drive semantic understanding and automated question answering. The project focuses on the construction and export of massive entity-attribute-value graphs, organizing knowledge into portable formats. It provides specialized domain partitioning to tailor information retrieval for professional
This project serves as a comprehensive educational repository and technical reference collection, documenting a wide range of software engineering practices and modern development technologies. It provides a structured learning path for developers, curating tutorials and practical examples that cover the full lifecycle of application development, from initial project scaffolding to deployment and maintenance. The repository distinguishes itself by offering deep technical insights into complex architectural patterns, including actor-based concurrency models for managing parallel tasks and cont
Newspaper is a Python library designed for scraping, parsing, and analyzing web-based information. It functions as a framework for automated news aggregation and large-scale web content extraction, providing tools to download, clean, and structure text, metadata, and media from diverse online sources. The project distinguishes itself through a pipeline-oriented architecture that combines heuristic-based content extraction with natural language processing. It automatically identifies and isolates article bodies from web page boilerplate while simultaneously performing language detection, keywo
Cheerio is an HTML and XML parsing library and server-side DOM implementation. It functions as a markup manipulation tool and CSS selector engine, allowing users to parse, query, and modify HTML or XML documents in non-browser environments. The project provides a DOM-like tree representation of markup strings, enabling programmatic addition, removal, and modification of elements and attributes. It features a prototype-based plugin system that allows the extension of core functionality by adding custom methods to the document prototype. The library covers a broad range of capabilities includi
PyPDF2 is a pure Python library for reading, writing, and manipulating PDF files. It functions as a document manipulator, text extractor, and encryption tool, allowing users to process PDF files without relying on external C libraries or native binaries. The library provides specialized tools for modifying document structures, such as merging multiple files into one, splitting documents into separate files, and transforming page layouts through cropping. It also includes capabilities for securing documents via passwords and encryption. Additional capabilities include the extraction of writte
FreeTube is a privacy-focused desktop application for watching YouTube videos without ads, tracking cookies, or the requirement of a Google account. It functions as a local-first subscription manager that tracks channels and playlists in local files instead of a centralized cloud account. The application avoids tracking-heavy official APIs by using a content extractor that parses web pages directly. To further protect user identity, it can route network traffic through proxies or Tor to mask the user's IP address. The software provides tools for distraction-free viewing, including the abil
gpt-crawler is a web scraping utility designed to extract website content and convert it into structured text files for use as AI model knowledge bases. It functions as a data generator that crawls specified web addresses to produce the knowledge files required for building custom GPTs, grounding large language models, and providing context to AI agents. The system transforms raw HTML into clean Markdown text to reduce token usage and improve readability for AI models. It utilizes token-aware content chunking and output file size limitations to ensure generated datasets remain compatible with
BrowserOS is an AI agent browser orchestrator and automation framework designed to manage browser state and execute complex web workflows. It functions as a local AI browser assistant and a Model Context Protocol controller, enabling the control of browser tabs, windows, and navigation through programmable AI agents and standardized context protocols. The system distinguishes itself through a graph-based visual workflow builder for creating repeatable automation sequences and the use of markdown-based files to define agent personalities and task recipes. It supports multi-provider orchestrati
Spout is a spreadsheet file processing library and multi-format generator designed for reading and writing CSV, XLSX, and ODS files. It functions as a stream-based parser that processes large spreadsheet files incrementally to avoid loading entire documents into memory. The library provides capabilities for programmatic spreadsheet generation and data extraction. It supports custom content styling, allowing for the application of fonts, backgrounds, borders, and number formats to individual cells or rows. Beyond basic file input and output, the project covers workbook manipulation through se
Crawlee-python is a web crawling framework for building scalable scrapers using Python. It serves as a comprehensive tool for web scraping automation, providing a system to extract structured data from websites using both lightweight HTTP requests and headless browser automation. The framework is distinguished by its anti-bot evasion capabilities, which include browser fingerprint impersonation and tiered proxy rotation to bypass detection systems and pass challenges from services such as Cloudflare. It also incorporates artificial intelligence for autonomous website navigation and schema-based data extra
pypdf is a Python library for parsing, manipulating, and generating PDF documents. It provides high-level operations for document processing, such as merging multiple files into one or splitting a single document into smaller files. The project includes specialized tools for managing interactive elements, including the creation and modification of annotations, hyperlinks, and form fields. It also supports advanced metadata management, allowing for the extraction and modification of standard document properties and XML-based XMP metadata. Beyond basic structural changes, the library covers pa
Megaparse is a document parsing tool and RAG data preprocessor designed to convert PDFs, Word documents, and presentations into clean text formats. It functions as a vision-based document extractor that recovers high-fidelity information from images and complex layouts to optimize data for large language model ingestion. The system employs multimodal AI and vision models to perform schema-preserving parsing, which maintains structural hierarchies such as tables and headers. It utilizes lossless structural transformation to turn layout-heavy binary files into text sequences while preserving th
OfficeCLI is a headless office suite and automation tool designed for programmatically reading, editing, and generating Microsoft Office documents. It functions as an OOXML manipulation library and a document templating engine, providing a standalone binary that allows for the management of Word, Excel, and PowerPoint files without requiring a local installation of office software. The project distinguishes itself by exposing document operations as tools for AI agents via a JSON-RPC server and the Model Context Protocol. It enables advanced customization through raw XML manipulation using XPa
PDF-Extract-Kit is a document extraction toolkit designed to convert PDF documents into structured formats such as Markdown, HTML, and LaTeX. It functions as a multi-stage parsing framework that combines a document layout analyzer, a formula recognition engine, an OCR text extractor, and a table extraction system. The project focuses on recovering complex document elements by translating images of mathematical formulas and tabular structures into editable source code. It utilizes model-driven layout analysis to identify structural elements in reports and textbooks while ignoring noise like wa
Magika is an AI content type classifier and MIME type prediction engine that uses deep learning to identify file formats based on binary data. It analyzes byte sequences through a neural network to predict the content type of a file and provide associated confidence scores. The system features a foreign function interface that allows the core detection logic to be integrated across different programming languages. It includes a mechanism for configuring detection sensitivity and per-type thresholds to balance precision and recall. The project provides capabilities for bulk file analysis via
Paperless is a self-hosted document management system designed to digitize, index, and archive paper documents. It functions as an optical character recognition system that converts scanned images and PDFs into a searchable digital library, providing a web-based interface for querying and retrieving documents from a database. The system features an automated file ingestion pipeline that monitors specific directories and email inboxes to process and import documents without manual uploading. To maintain a private archive, it includes on-disk encryption for sensitive files and the ability to or