Open-source libraries and frameworks for identifying entities and extracting structured information from unstructured text data.
This project is a comprehensive Python toolkit designed for natural language processing, research, and education. It functions as a linguistic data processor that provides a standardized framework for managing, cleaning, and analyzing large collections of annotated text corpora and lexical resources. The library distinguishes itself through its integration of both symbolic and statistical methods, allowing users to perform complex tasks ranging from rule-based grammar parsing to machine learning-driven classification. It offers a modular pipeline for text processing, enabling the transformation of raw, unstructured language data into structured formats through tokenization, stemming, and part-of-speech tagging. Beyond basic text manipulation, the toolkit supports advanced linguistic analysis, including syntactic and semantic parsing, named entity recognition, and information extraction. It provides consistent programmatic interfaces for accessing diverse datasets and visualizing grammatical structures, facilitating the study of linguistic patterns and the development of computational models.
This toolkit provides a comprehensive suite of libraries for natural language processing, including built-in support for named entity recognition and information extraction pipelines alongside extensive tools for custom model development.
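The tokenization, stemming, and part-of-speech tagging stages described above can be illustrated with a toy pipeline. This is a minimal sketch in plain Python, not the toolkit's API: the suffix list, the tiny lexicon, and the function names are all assumptions made up for the example.

```python
import re

# Toy suffix list for the stemmer (assumption); real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")

# Tiny hypothetical lexicon for tagging; unknown words fall back to "NOUN".
LEXICON = {"the": "DET", "a": "DET", "is": "VERB", "jump": "VERB", "quick": "ADJ"}


def tokenize(text):
    """Split text into lowercase word tokens and single punctuation marks."""
    return re.findall(r"[a-z0-9]+|[^\sa-z0-9]", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tag(tokens):
    """Tag each token by looking its stem up in the lexicon."""
    tagged = []
    for token in tokens:
        if not token.isalnum():
            tagged.append((token, "PUNCT"))
        else:
            tagged.append((token, LEXICON.get(stem(token), "NOUN")))
    return tagged


tokens = tokenize("The quick foxes jumped.")
stems = [stem(t) for t in tokens]
tagged = tag(tokens)
print(tagged)
```

Each stage consumes the previous stage's output, which is the sense in which the pipeline is modular: any one function can be swapped for a statistical model without touching the others.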
HanLP is a natural language processing library and deep learning framework specifically optimized for the Chinese language, while also functioning as a multilingual text processor. It serves as a toolkit for performing linguistic analysis, semantic understanding, and script conversion. The project distinguishes itself through a dedicated focus on Chinese linguistic structures, including a specialized script converter for transforming text between Simplified Chinese, Traditional Chinese, and Pinyin. It further supports domain-specific model training to improve the recognition of professional terminology within specialized datasets. Its broader capabilities cover information extraction via named entity recognition and text summarization, as well as comprehensive linguistic analysis including part-of-speech tagging and dependency syntax parsing. The toolkit also provides semantic analysis for sentiment detection and coreference resolution, alongside text transformation utilities for grammar and style conversion.
HanLP is a comprehensive NLP framework that provides robust named entity recognition and information extraction capabilities, including support for custom model training and multilingual processing.
This project is a transformer-based language model and natural language processing toolkit designed to generate deep contextual representations of text. Its encoder processes input sequences through stacked self-attention layers, so that each token's representation reflects the sentence around it. The model distinguishes itself through bidirectional contextual processing, which conditions each token on both its left and right context, and masked language modeling, which trains the system by predicting hidden tokens within a sequence. It also employs next sentence prediction to learn relationships between text segments, and it shares a single set of parameters across languages to keep one unified model structure. Beyond these core capabilities, the toolkit provides utilities for subword-based tokenization to manage vocabulary and punctuation, as well as functionality for generating high-dimensional contextual embeddings. It supports the development of question answering systems by identifying the start and end positions of answer spans within a document.
This repository provides the foundational transformer architecture and pre-trained models necessary for building named entity recognition and information extraction systems, though it functions as a core language model rather than a specialized, out-of-the-box extraction toolkit.
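The question answering behavior described above, picking a start and an end position for an answer, is commonly implemented by scoring every token as a possible start and as a possible end, then choosing the best valid pair. The sketch below assumes those per-token scores already exist; the numbers are invented for illustration, and `best_span` is a hypothetical helper rather than a function from this repository.

```python
def best_span(start_scores, end_scores, max_len=10):
    """Return (start, end) maximizing start + end score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within max_len tokens.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start_score + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


tokens = ["the", "capital", "of", "france", "is", "paris", "."]
# Made-up scores; in a real system they would come from the model's output layer.
start_scores = [0.1, 0.2, 0.0, 0.3, 0.1, 2.5, 0.0]
end_scores = [0.0, 0.1, 0.0, 0.4, 0.2, 2.8, 0.3]

start, end = best_span(start_scores, end_scores)
answer = tokens[start:end + 1]
print(start, end, answer)
```

The `start <= end` constraint and the length cap are what keep the selected span well-formed; taking the two argmax positions independently could yield an end that precedes the start.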
PaddleNLP is a development library and toolkit for training, fine-tuning, and deploying large and small language models using the PaddlePaddle framework. It provides a comprehensive suite for the entire natural language processing lifecycle, from model development to high-performance inference. The project features a standardized model zoo for loading and managing pre-trained models and tokenizers through a unified interface. It distinguishes itself with a specialized model compression framework that reduces memory footprints via weight precision conversion and lossless size optimization, alongside an inference engine that utilizes operator fusion and backend-agnostic execution to increase token generation speed. The library covers a broad range of capabilities including distributed parallel training, parameter-efficient fine-tuning, and model weight merging. It also supports a full natural language processing pipeline for tasks such as text generation and zero-shot structured information extraction.
PaddleNLP is a comprehensive toolkit that provides pre-trained models, fine-tuning capabilities, and specialized pipelines for named entity recognition and structured information extraction across multiple languages.
Donut is an OCR-free document transformer and end-to-end document parser. It functions as a neural network that converts unstructured document images directly into structured data or text without the use of an external optical character recognition engine. The project includes a synthetic document generator to create artificial images and ground-truth labels for training. It employs a transformer model to perform visual question answering and document image classification based on visual layout and text. The system covers several document understanding capabilities, including structured information extraction, document text transcription, and visual document question answering. It provides tools for transformer model fine-tuning and model accuracy evaluation.
Donut is an end-to-end document parser that extracts structured data directly from document images, serving as a specialized tool for information extraction even though it operates on visual inputs rather than raw text.
Compromise is a natural language processing library and rule-based text parser designed to analyze unstructured text. It functions as a toolkit for identifying parts of speech, linguistic patterns, and semantic meaning, while providing specialized engines for named entity recognition and the parsing of temporal and numeric data. The project is distinguished by its linguistic morphological engine, which can conjugate verbs across different tenses and inflect nouns and adjectives. It further allows for linguistic model customization through a plugin system that enables the extension of lexicons and the modification of baseline grammar rules. The library covers a broad range of computational linguistics capabilities, including part-of-speech tagging, phonetic analysis, and sentence structure detection. It provides utilities for text normalization and formatting standardization, as well as tools for pattern matching, text statistics analysis, and the conversion of written numbers and currencies into structured values. Processing performance is managed through parallel text parsing across worker threads and the use of partial parse caching for document segments.
Compromise is a rule-based NLP toolkit that provides built-in named entity recognition and structured data extraction capabilities, making it a capable choice for developers needing a lightweight, JavaScript-native approach to information extraction.
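The conversion of written numbers into structured values mentioned above can be approximated with a small rule-based parser. Compromise itself is a JavaScript library; the sketch below is written in Python purely for illustration, does not use or mirror its API, and handles only simple English cardinals. The name `words_to_number` is hypothetical.

```python
UNITS = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * (i + 2) for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split())}
SCALES = {"thousand": 1_000, "million": 1_000_000}


def words_to_number(text):
    """Convert a simple written English cardinal into an int."""
    total, current = 0, 0
    for word in text.lower().replace("-", " ").split():
        if word == "and":
            continue  # filler word, as in "two hundred and five"
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current *= 100
        elif word in SCALES:
            # Close the current group and start a new one.
            total += current * SCALES[word]
            current = 0
        else:
            raise ValueError(f"unrecognized number word: {word!r}")
    return total + current


print(words_to_number("two hundred and five"))
print(words_to_number("three thousand forty-two"))
```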
LARK is a development toolkit for training, fine-tuning, and deploying large language models and multimodal models based on PaddlePaddle. It functions as a comprehensive framework that includes an LLM training orchestrator, an inference server, and a multimodal model framework for processing text, image, and video inputs. The project features a retrieval-augmented generation system for building conversational applications that integrate web search and private knowledge bases. It provides specific capabilities for multimodal reasoning and complex logic, enabling the extraction of structured data and visual knowledge from documents, charts, and images. The toolkit covers large-scale model training through supervised fine-tuning and preference optimization, as well as model compression via quantization to reduce memory usage. It includes production infrastructure for deploying inference servers with hardware acceleration and load balancing. A web-based graphical user interface is provided to control conversations and manage the training processes of vision-language models.
LARK is a comprehensive framework for training and deploying large multimodal models that includes specific capabilities for document intelligence and structured data extraction from unstructured inputs.
FinGPT is a suite of specialized financial tools and a framework for adapting large language models to the financial domain. It provides a set of pipelines for financial entity extraction, sentiment analysis, and retrieval-augmented generation to improve the accuracy of financial information systems. The project distinguishes itself through efficient training workflows, utilizing low-rank adaptation and quantized low-rank adaptation to fine-tune models on consumer-grade hardware. It employs market-labeled datasets and reinforcement learning that uses actual stock price movements as reward signals to refine model performance. The framework covers broad capability areas including algorithmic trading signal generation, automated investment research, and stock price movement prediction. It also provides tools for collecting global financial data and generating source code for quantitative trading factors. The project is primarily implemented and demonstrated through Jupyter Notebooks.
FinGPT provides specialized pipelines for financial entity extraction and information processing, serving as a domain-specific framework for adapting large language models to extract structured data from unstructured financial text.
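Low-rank adaptation, named above as the basis of the efficient training workflow, freezes a weight matrix W and learns a small update B·A of rank r, so the effective weight is W + (alpha / r)·B·A. The sketch below shows that arithmetic and the resulting parameter savings in plain Python. It is not FinGPT's training code; the matrix sizes and values are arbitrary assumptions chosen to keep the example readable.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where B is d x r and A is r x k."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


d, k, r = 4, 3, 1
w = [[0.0] * k for _ in range(d)]      # frozen base weight (zeros for clarity)
b = [[1.0], [0.0], [2.0], [0.0]]       # trainable, d x r
a = [[1.0, 2.0, 3.0]]                  # trainable, r x k

w_eff = lora_weight(w, a, b, alpha=2.0)

full_params = d * k          # parameters updated by full fine-tuning
lora_params = r * (d + k)    # parameters updated by the low-rank adapter
print(w_eff, full_params, lora_params)
```

The saving grows with matrix size: for d = k = 4096 and r = 8, the adapter trains 65,536 values instead of roughly 16.8 million, which is what makes fine-tuning on consumer-grade hardware plausible.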
This project is a comprehensive framework and toolkit for developing, optimizing, and deploying transformer-based models across multimodal, document intelligence, and natural language processing tasks. It provides a unified neural architecture that processes text, vision, audio, and document layout data through a shared set of weights, enabling researchers and developers to build foundational models that align cross-modal representations. The platform distinguishes itself through advanced training and inference strategies designed for large-scale deep learning. It incorporates specialized mechanisms such as retentive state processing for efficient sequence generation, differential attention for improved focus, and distributed weight partitioning to handle memory-intensive computations. These capabilities are complemented by techniques for sparse decoding and model compression, which maintain performance while reducing the computational footprint of large-scale architectures. The project covers a broad capability surface, including end-to-end pipelines for data curation, synthetic data generation, and tokenization across diverse modalities. It supports extensive workflows for pre-training, instruction tuning, and fine-tuning, with specific focus areas in document understanding, speech synthesis, and cross-lingual transfer. Diagnostic tools for attention analysis and benchmarking further assist in evaluating model performance on complex reasoning and retrieval tasks.
This framework provides the foundational transformer architectures and pre-trained models necessary to build custom named entity recognition and information extraction pipelines, though it requires additional implementation to serve as a ready-to-use toolkit.
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model fine-tuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inference and forecasting using pretrained foundation models, alongside parameter-efficient fine-tuning techniques to adapt large models to specific tasks. Its broader capabilities include automated model selection and ensembling via bagging and stacking, as well as comprehensive computer vision pipelines for object detection and semantic segmentation. The framework also covers probabilistic time series forecasting, named entity recognition for natural language processing, and semantic search based on embedding extraction. The system provides utilities for deploying trained predictors as cloud endpoints or serverless functions and offers hardware acceleration through ONNX and TensorRT.
AutoGluon is an automated machine learning framework that includes specific modules for named entity recognition and information extraction, providing a high-level approach to training and deploying custom models for these tasks.
ERNIE is a development toolkit for training, fine-tuning, and deploying large language models built on the PaddlePaddle deep learning platform. It provides a comprehensive suite of core components, including an inference server for vision and language models, a training and fine-tuning toolkit, and a framework for building retrieval-augmented generation systems using private knowledge bases. The project features multimodal AI models capable of reasoning across text, images, and video to perform complex visual understanding and information extraction. It distinguishes itself through specialized training methodologies for function calling and the use of mixture-of-experts architectures to enhance cross-modal reasoning. The system covers a broad range of capabilities including industrial natural language processing deployment, visual mathematical reasoning, and document information extraction. Performance is addressed through quantization, hybrid-parallelism training, and disaggregated inference serving to optimize memory usage and throughput. A web-based user interface is provided for supervising training processes and conducting interactive conversations.
This toolkit provides a comprehensive framework for training and deploying large language models that support document information extraction and industrial natural language processing tasks.
GraphRAG is a data processing pipeline and retrieval engine designed to transform unstructured text into interconnected knowledge graphs. By utilizing language models to extract entities and relationships, it builds structured representations of information that enable context-aware retrieval for downstream applications. The system distinguishes itself through hierarchical graph clustering and large-scale data synthesis, which organize massive document corpora into multi-level structures. This approach allows for both vector-based semantic searches and graph-based traversals, providing a comprehensive method for navigating complex datasets and identifying hidden connections between concepts. The platform includes a modular orchestration pipeline that manages the entire lifecycle of information, from initial ingestion and indexing to query execution. Users can refine the synthesis and retrieval processes by adjusting prompt templates and configuration arguments to align with specific data characteristics.
This tool functions as an information extraction and knowledge graph construction pipeline that identifies entities and relationships from unstructured text to enable structured data retrieval.
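The idea of turning extracted entities and relationships into a graph that supports traversal can be sketched with an adjacency list and a breadth-first search. This is an illustrative sketch only, not GraphRAG's implementation: the triples are invented stand-ins for language model output, and `build_graph` and `within_hops` are hypothetical names.

```python
from collections import defaultdict, deque

# Hypothetical (subject, relation, object) triples standing in for model output.
triples = [
    ("Alice", "works_at", "Acme"),
    ("Acme", "based_in", "Berlin"),
    ("Bob", "works_at", "Acme"),
]


def build_graph(triples):
    """Build an adjacency list, storing each edge in both directions."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
        graph[obj].append((f"inverse:{rel}", subj))
    return graph


def within_hops(graph, start, max_hops):
    """Breadth-first traversal returning {entity: hop distance} up to max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _, neighbor in graph[node]:
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return seen


graph = build_graph(triples)
one_hop = within_hops(graph, "Alice", 1)
two_hops = within_hops(graph, "Alice", 2)
print(one_hop, two_hops)
```

The two-hop result links Alice to Bob through their shared employer, a connection that appears in neither source triple alone; this is the kind of indirect relationship a graph traversal can surface and a plain vector search may miss.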
Langextract is a framework designed to transform unstructured text into structured, machine-readable data using language model orchestration. It provides a high-performance pipeline that processes large volumes of narrative text by utilizing parallel execution and sequential extraction passes. The library is built to handle complex data extraction tasks, including specialized support for clinical information and medical entity relationship recognition. The project distinguishes itself through a plugin-based architecture that supports both local hardware execution and cloud-hosted model endpoints. By providing a unified abstraction layer, it allows users to switch between different inference providers without modifying core application logic. The framework enforces output consistency through schema-guided generation and prompt-driven templates, ensuring that extracted entities adhere to predefined formats. Beyond its core extraction capabilities, the library includes administrative utilities for managing model authentication, custom provider registration, and system integration testing. It supports scalable workflows through batch processing and chunked document analysis, while offering interactive visualization tools to verify extracted results against original source text. Data can be exported in standard formats to facilitate integration with external analysis environments.
Langextract is a framework for transforming unstructured text into structured data; it provides the pipelines and schema-guided extraction tools needed for named entity recognition and information extraction tasks.
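Chunked document analysis, schema enforcement, and verification against the source text can be combined as in the sketch below. It is a simplified illustration under stated assumptions, not Langextract's API: `fake_extract` is a regex stand-in for a language model call, and the schema, labels, and chunk sizes are invented for the example.

```python
import re

# Hypothetical schema: every extracted record must have exactly these typed fields.
SCHEMA = {"text": str, "label": str, "start": int}
ALLOWED_LABELS = {"MEDICATION", "DOSAGE"}


def chunk(text, size, overlap):
    """Yield (offset, piece) windows of `size` characters overlapping by `overlap`."""
    step = size - overlap
    for offset in range(0, max(len(text) - overlap, 1), step):
        yield offset, text[offset:offset + size]


def fake_extract(piece):
    """Stand-in for a model call; returns candidate records for one chunk."""
    found = []
    for m in re.finditer(r"\b\d+ ?mg\b", piece):
        found.append({"text": m.group(), "label": "DOSAGE", "start": m.start()})
    for m in re.finditer(r"\b(aspirin|ibuprofen)\b", piece, re.IGNORECASE):
        found.append({"text": m.group(), "label": "MEDICATION", "start": m.start()})
    return found


def conforms(record):
    """Check field names, field types, and the label vocabulary."""
    return (
        set(record) == set(SCHEMA)
        and all(isinstance(record[key], typ) for key, typ in SCHEMA.items())
        and record["label"] in ALLOWED_LABELS
    )


def extract(text, size=40, overlap=15):
    results = {}
    for offset, piece in chunk(text, size, overlap):
        for rec in fake_extract(piece):
            rec = dict(rec, start=rec["start"] + offset)  # chunk-local to global offset
            span = text[rec["start"]:rec["start"] + len(rec["text"])]
            # Keep only schema-conforming records whose text matches the source.
            if conforms(rec) and span == rec["text"]:
                results[(rec["start"], rec["label"])] = rec  # dedupe overlap hits
    return sorted(results.values(), key=lambda r: r["start"])


note = "Patient was given aspirin 100 mg daily, then ibuprofen 200 mg as needed."
entities = extract(note)
print(entities)
```

The overlap lets an entity cut off at the end of one chunk appear whole in the next, and keying results by global offset removes the duplicates that overlapping windows produce.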