30 open-source projects similar to hit-scir/ltp, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best LTP alternative.
SnowNLP is a Python library for Chinese natural language processing. It provides tools for text segmentation, sentiment analysis, document classification, and phonetic transliteration. The library includes capabilities for training and saving custom machine learning models for tokenization and sentiment analysis using raw training datasets. It covers a range of linguistic processing areas, including part-of-speech tagging, sentence splitting, and text similarity measurement. The toolkit also provides utilities for extracting key information through text summarization and calculating word im
HanLP is a natural language processing library and deep learning framework specifically optimized for the Chinese language, while also functioning as a multilingual text processor. It serves as a toolkit for performing linguistic analysis, semantic understanding, and script conversion. The project distinguishes itself through a dedicated focus on Chinese linguistic structures, including a specialized script converter for transforming text between Simplified Chinese, Traditional Chinese, and Pinyin. It further supports domain-specific model training to improve the recognition of professional t
Stanza is a Python natural language processing library designed for tokenization, lemmatization, and dependency parsing across many human languages using neural models. It provides a neural processing pipeline that converts raw text into structured linguistic data objects, alongside a specialized analyzer for extracting medical insights from clinical and biomedical language. The project includes a wrapper that connects Python scripts to Java-based natural language processing tools and remote annotation servers. This enables a bridge for extracting linguistic annotations and analysis data from
pkuseg-python is a Chinese word segmentation toolkit and natural language processing library. It provides specialized models for splitting Chinese text into words across various domains, including news, medical, and web content, and includes a tool for assigning part-of-speech tags to segmented words. The library allows for the training of custom segmentation models using annotated datasets and supports the integration of user-defined dictionaries to ensure specialized terminology is recognized correctly. It employs a multi-threaded execution engine to process large volumes of Ch
Compromise is a natural language processing library and rule-based text parser designed to analyze unstructured text. It functions as a toolkit for identifying parts of speech, linguistic patterns, and semantic meaning, while providing specialized engines for named entity recognition and the parsing of temporal and numeric data. The project is distinguished by its linguistic morphological engine, which can conjugate verbs across different tenses and inflect nouns and adjectives. It further allows for linguistic model customization through a plugin system that enables the extension of lexicons
This project is a comprehensive Python toolkit designed for natural language processing, research, and education. It functions as a linguistic data processor that provides a standardized framework for managing, cleaning, and analyzing large collections of annotated text corpora and lexical resources. The library distinguishes itself through its integration of both symbolic and statistical methods, allowing users to perform complex tasks ranging from rule-based grammar parsing to machine learning-driven classification. It offers a modular pipeline for text processing, enabling the transformati
CoreNLP is a Java natural language processing library designed to convert raw human language text into structured data. It utilizes a suite of linguistic annotators to analyze text through a pipeline, extracting grammatical structures, sentiment, and linguistic patterns. The project includes a coreference resolution engine that links multiple mentions of the same entity to maintain contextual consistency across documents. It also provides tools for named entity recognition to categorize people, companies, and locations, and a part-of-speech tagger to assign grammatical categories and base for
ansj_seg is a Java NLP toolkit and segmentation library designed for processing Chinese text. It functions as a word segmenter, part-of-speech tagger, and named entity recognizer to divide continuous Chinese characters into meaningful words and tokens. The library utilizes statistical models for text segmentation and provides capabilities for identifying and extracting person names from unstructured documents. It also assigns grammatical categories to tokens to determine their linguistic roles within a sentence. The toolkit supports domain-specific text processing through the use of custom d
Chinese-BERT-wwm is a pre-trained transformer model and encoder designed for Chinese natural language processing. It converts Chinese text into dense vector representations to be used across various natural language processing applications. The model utilizes a whole word masking strategy during pre-training, masking entire words rather than individual characters. This approach is designed to improve the capture of semantic meaning and language structure within Chinese datasets. The project covers a range of downstream tasks including text classification, sequence labeling, and reading compr
KnowledgeGraphData is a collection of structured datasets and corpora designed to provide a foundational layer for cognitive intelligence and artificial intelligence systems. It primarily consists of large-scale Chinese knowledge graph datasets, including entity-relation data and NLP training sets used to drive semantic understanding and automated question answering. The project focuses on the construction and export of massive entity-attribute-value graphs, organizing knowledge into portable formats. It provides specialized domain partitioning to tailor information retrieval for professional
spaCy is a Python natural language processing framework designed for industrial-scale text processing. It converts raw text into structured data for machine learning pipelines through a combination of statistical language model trainers, transformer-based text processors, and syntactic dependency parsers. The project enables the integration of pretrained transformer architectures to perform complex linguistic analysis and multi-task learning. It also provides a specialized system for neural named entity recognition to identify and categorize key entities within text. The framework covers a b
TextBlob is a natural language processing library that provides a unified interface for common linguistic tasks. It operates as a wrapper-based API, simplifying the use of complex processing libraries by delegating core operations to specialized external frameworks. The project features a pluggable processing pipeline that allows for the integration of custom logic and alternative language engines. It supports the extension of processing models through plugins to add specific language support or custom data processing. The library covers a broad range of linguistic capabilities, including se
Flair is a transformer-based natural language processing framework used to build and train models for text classification and sequence tagging. It provides a specialized library for generating contextual text embeddings and performing linguistic analysis. The framework includes dedicated tools for named entity recognition, including the identification of specialized biomedical entities across multiple languages. It further supports entity linking to map identified text mentions to unique entries within general or biomedical knowledge bases. The project covers a broad range of language analys
Synonyms is a Chinese natural language processing tool focused on semantic analysis. It provides capabilities for Chinese word segmentation, part-of-speech tagging, and the retrieval of synonyms based on semantic proximity. The project converts words and sentences into numerical vector representations to calculate similarity scores. This allows for the determination of semantic proximity between different phrases and the identification of chatbot intent through sentence comparison. The system also includes tools for automated keyword extraction and importance ranking to identify significant
This project is a natural language processing system designed for named entity recognition and text classification. It uses a machine learning approach to identify specific names and key information from raw text to organize unstructured content into a structured format. The system implements a multi-layer architecture that combines a pre-trained transformer for embeddings, bidirectional long short-term memory for sequence modeling, and a conditional random field for label transitions. It supports transfer learning through the fine-tuning of these models on task-specific datasets. The projec
Synonyms is a natural language processing library and semantic similarity engine specifically designed for Chinese text. It functions as a word embedding toolkit and tokenizer that extracts semantic meaning and identifies synonyms by calculating the conceptual closeness between words and sentences. The system provides a toolkit for Chinese word embedding and synonym discovery, allowing for the retrieval of semantically similar words to expand vocabulary. It distinguishes itself through a configuration-driven approach to model loading, which supports the integration of custom word embeddings t
This project is a CJK input method framework and configuration set designed for the Rime input engine. It provides a comprehensive system of schemas and dictionary packs to optimize Chinese character entry through pinyin and double-pinyin workflows. The framework is distinguished by its use of Lua-powered extensions that add dynamic utilities, such as inline mathematical calculators, automated timestamps, and text formatting, directly to the input interface. It also features refined word libraries and language models specifically tuned to improve prediction accuracy and first-choice hit rates
This project is a collection of pre-trained dense and sparse word vectors trained on diverse Chinese corpora. It serves as a library of linguistic representations and an NLP vector dataset designed to improve the accuracy of semantic and morphological analysis in text models. The collection provides corpus-specific representations and utilizes n-gram co-occurrence modeling to capture diverse linguistic patterns. It includes a hybrid of dense-sparse vectors to balance computational efficiency and semantic precision. The project covers semantic vector search and the development of Chinese natu
This project is a comprehensive Lisp AI implementation library that provides reference implementations for various artificial intelligence paradigms and symbolic algorithms. It functions as a multi-purpose toolkit containing a logic programming engine, a natural language processing suite, and a symbolic mathematics toolkit. The library is distinguished by its diverse architectural frameworks, including a Prolog-style execution engine that uses unification and goal-driven backtracking, and a system for simulating human decision-making through expert system shells and certainty factors. It also
Compromise is a natural language processing library and rule-based engine designed for English text manipulation, analysis, and parsing. It provides a toolkit for tokenizing text, identifying parts of speech, and performing linguistic analysis to achieve semantic understanding of unstructured strings. The project distinguishes itself through its ability to programmatically transform grammar, such as modifying verb tenses, noun plurality, and adjective forms. It also functions as a named entity recognizer capable of extracting people, places, organizations, dates, and contact information from
This project is a Chinese text segmentation library and tokenizer that splits Chinese sentences into individual words. Beyond segmentation, it tags parts of speech and extracts keywords using statistical analysis. The library distinguishes itself through support for custom dictionary configuration and vocabulary file management, allowing users to override default segmentation rules for domain-specific accuracy. It also includes a TF-IDF keyword extractor to identify significant words and core topics within documents. Th
nlp.js is a JavaScript natural language processing library and development framework used to build natural language understanding engines. It provides a toolkit for creating local machine learning models for intent classification and acts as a multilingual text processor that detects languages and normalizes text across various dialects. The framework distinguishes itself by supporting local execution on both servers and mobile devices, enabling chatbot functionality without an internet connection. It features a specialized system for conversational slot filling to collect mandatory informati
PyText is an extensible PyTorch-based framework for building, training, and deploying custom natural language processing models, including text classifiers, sequence taggers, and intent-slot predictors. It provides a modular toolkit that allows developers to assemble these models using pluggable registries for model architectures, data formats, and tensorizers, all configurable through YAML files without requiring code changes. The framework distinguishes itself through its comprehensive support for the full NLP model lifecycle, from training to production inference. It includes pre-built neu
nlp-recipes is a collection of implementation guides and reference templates for applying natural language processing techniques to real-world tasks. It provides standardized workflows and code examples for developing NLP pipelines, from dataset preparation and model training to performance evaluation. The project focuses on the practical application of transformer-based models, offering patterns for fine-tuning pretrained architectures for tasks such as text classification, named entity recognition, and question answering. It also includes a toolkit for model interpretability, allowing users
This project is a high-performance library for converting raw text into tokens and IDs for machine learning models. It functions as a fast text encoder and a text preprocessing pipeline designed to transform strings into numerical representations with high throughput for research and production. The library includes a subword tokenizer trainer used to analyze text datasets and create custom vocabularies using algorithms such as byte-pair encoding and wordpiece. It provides capabilities for subword vocabulary training and text alignment, allowing character offsets to be tracked during normaliz
Pattern is a Python web mining library that functions as an HTML web scraper, a natural language processing toolkit, and a network analysis tool. It provides a mathematical framework for categorizing datasets through a vector space model library. The project enables the extraction of structured data from web services and the creation of searchable web content indexes. It processes unstructured text using sentiment analysis, part-of-speech tagging, and n-gram searching. The library covers machine learning classification through the training of models using perceptron algorithms and support ve
Knwl.js is a JavaScript named entity recognition library and rule-based text parser. It serves as an extensible information extraction tool designed to identify and pull structured entities, such as dates, times, and locations, from unstructured text strings. The library allows for the definition of specialized rules and custom plugins to identify and extract unique pieces of information. This extensibility enables the automation of information retrieval by converting human-readable text into structured formats for applications and databases. The system utilizes regular expression matching a
DeepPavlov is a conversational AI framework and deep learning NLP library designed for building end-to-end dialogue systems and chatbots. It functions as an NLP pipeline orchestrator that allows users to compose pre-trained models and text processing components into sequential data flows for complex linguistic tasks. The system is distinguished by its ability to act as a chatbot deployment server, exposing trained conversational models as web services via REST and Socket APIs. It utilizes JSON-based pipeline configurations and dynamic variable interpolation to decouple model logic from infras
AllenNLP is a PyTorch-based research library and deep learning language toolkit designed for developing and training neural network architectures for linguistic tasks. It provides a distributed training system that coordinates data and gradients across multiple GPUs and a framework for integrating pretrained transformer architectures. The system distinguishes itself with a dedicated algorithmic bias mitigation tool used to identify and reduce bias in linguistic model predictions. It also includes model influence analysis to interpret predictions by calculating the influence of specific traini