30 open-source projects similar to explosion/spacy, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best spaCy alternative.
Flair is a transformer-based natural language processing framework used to build and train models for text classification and sequence tagging. It provides a specialized library for generating contextual text embeddings and performing linguistic analysis. The framework includes dedicated tools for named entity recognition, including the identification of specialized biomedical entities across multiple languages. It further supports entity linking to map identified text mentions to unique entries within general or biomedical knowledge bases. The project covers a broad range of language analys
Flair is a natural language processing framework for training and applying models for sequence labeling and text classification. It provides a system for generating word embeddings and identifying semantic entities within text. The framework includes a dedicated system for zero and few-shot learning, enabling text classification and entity extraction using minimal training examples by leveraging pre-trained knowledge. Its capabilities cover named entity recognition, sentiment analysis, and the training of specialized models using custom datasets. It also includes tooling for the visual highl
HanLP is a natural language processing library and deep learning framework specifically optimized for the Chinese language, while also functioning as a multilingual text processor. It serves as a toolkit for performing linguistic analysis, semantic understanding, and script conversion. The project distinguishes itself through a dedicated focus on Chinese linguistic structures, including a specialized script converter for transforming text between Simplified Chinese, Traditional Chinese, and Pinyin. It further supports domain-specific model training to improve the recognition of professional t
This project is a comprehensive Python toolkit designed for natural language processing, research, and education. It functions as a linguistic data processor that provides a standardized framework for managing, cleaning, and analyzing large collections of annotated text corpora and lexical resources. The library distinguishes itself through its integration of both symbolic and statistical methods, allowing users to perform complex tasks ranging from rule-based grammar parsing to machine learning-driven classification. It offers a modular pipeline for text processing, enabling the transformati
Spark NLP is a toolkit for scalable text analysis and machine learning built on the Apache Spark distributed computing framework. It provides a multimodal machine learning framework and a distributed pipeline system for sequencing annotators to process large-scale linguistic data. The library includes a transformer text processor for generating contextual vector embeddings and a dedicated inference engine for managing large language models. The project distinguishes itself through its ability to process heterogeneous data types, including text, audio, and images, within a unified vision-langu
This repository is a deep learning for natural language processing course and curriculum. It provides educational material and guides focused on neural network architectures used for processing natural language, speech signals, and text classification. The content includes instructional tutorials on sequence modeling and neural language modeling, covering the implementation of n-gram and recurrent neural networks. It also provides a framework for studying word embeddings to map linguistic meanings into numerical representations. The curriculum covers a broad range of capabilities, including
Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference. The library features extensive support for model optimization and
This is a Chinese natural language processing toolkit providing a suite of tools for word segmentation, part-of-speech tagging, and named entity recognition. It includes a neural dependency parser for analyzing syntactic and semantic relationships between words and a machine learning training suite for creating custom linguistic models using annotated datasets. The toolkit distinguishes itself through its deployment flexibility, offering a dockerized server and a web service interface that exposes processing capabilities via API. It supports the use of pretrained models and allows for the int
Stanza is a Python natural language processing library designed for tokenization, lemmatization, and dependency parsing across many human languages using neural models. It provides a neural processing pipeline that converts raw text into structured linguistic data objects, alongside a specialized analyzer for extracting medical insights from clinical and biomedical language. The project includes a wrapper that connects Python scripts to Java-based natural language processing tools and remote annotation servers. This enables a bridge for extracting linguistic annotations and analysis data from
DeepPavlov is a conversational AI framework and deep learning NLP library designed for building end-to-end dialogue systems and chatbots. It functions as an NLP pipeline orchestrator that allows users to compose pre-trained models and text processing components into sequential data flows for complex linguistic tasks. The system is distinguished by its ability to act as a chatbot deployment server, exposing trained conversational models as web services via REST and Socket APIs. It utilizes JSON-based pipeline configurations and dynamic variable interpolation to decouple model logic from infras
This project is a deep learning framework designed for constructing, training, and deploying neural networks across diverse hardware environments. It functions as a high-performance tensor computation library that provides both imperative and symbolic programming interfaces, allowing developers to balance flexible, step-by-step model building with the efficiency of compiled computation graphs. The framework distinguishes itself through a hybrid execution engine that integrates declarative graph compilation with imperative runtime logic. It supports scalable, distributed training across multip
AllenNLP is a PyTorch-based research library and deep learning language toolkit designed for developing and training neural network architectures for linguistic tasks. It provides a distributed training system that coordinates data and gradients across multiple GPUs and a framework for integrating pretrained transformer architectures. The system distinguishes itself with a dedicated algorithmic bias mitigation tool used to identify and reduce bias in linguistic model predictions. It also includes model influence analysis to interpret predictions by calculating the influence of specific traini
nlp-recipes is a collection of implementation guides and reference templates for applying natural language processing techniques to real-world tasks. It provides standardized workflows and code examples for developing NLP pipelines, from dataset preparation and model training to performance evaluation. The project focuses on the practical application of transformer-based models, offering patterns for fine-tuning pretrained architectures for tasks such as text classification, named entity recognition, and question answering. It also includes a toolkit for model interpretability, allowing users
This repository serves as an educational resource for learning the foundational architectures of natural language processing through concise code implementations. It provides a structured collection of deep learning models designed to process and understand human language, focusing on the core mechanics of neural network sequence modeling and text analysis. The project distinguishes itself by offering direct, hands-on implementations of complex architectures, including Transformers, attention mechanisms, and word embedding generation. By utilizing tensor-based computational graphs and gradien
This project is a high-performance library for converting raw text into tokens and IDs for machine learning models. It functions as a fast text encoder and a text preprocessing pipeline designed to transform strings into numerical representations with high throughput for research and production. The library includes a subword tokenizer trainer used to analyze text datasets and create custom vocabularies using algorithms such as byte-pair encoding and wordpiece. It provides capabilities for subword vocabulary training and text alignment, allowing character offsets to be tracked during normaliz
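As a rough illustration of the byte-pair encoding idea mentioned in this entry, the sketch below learns merge rules from a tiny word list and applies them to new text. It is a minimal pure-Python version written for this listing, not the library's own API; the names `train_bpe`, `merge_pair`, and `encode` are hypothetical, and the tie-breaking rule is an assumption made only to keep results deterministic.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of words (hypothetical helper)."""
    # Start with each word split into single characters, keeping word frequencies.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Assumption: ties are broken by taking the lexicographically greatest pair.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword units by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols
```

Calling `train_bpe(["low", "low", "lower", "lowest"], 2)` and then `encode("lowly", merges)` shows how frequent character sequences become single vocabulary entries while unseen words still decompose into known pieces.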
TextBlob is a natural language processing library that provides a unified interface for common linguistic tasks. It operates as a wrapper-based API, simplifying the use of complex processing libraries by delegating core operations to specialized external frameworks. The project features a pluggable processing pipeline that allows for the integration of custom logic and alternative language engines. It supports the extension of processing models through plugins to add specific language support or custom data processing. The library covers a broad range of linguistic capabilities, including se
ESPnet is a comprehensive speech processing toolkit and PyTorch-based trainer designed for building end-to-end speech recognition, synthesis, and translation models. It provides a structured framework for developing automatic speech recognition systems using transducer and encoder-decoder architectures, alongside engines for text-to-speech synthesis and speech translation pipelines. The project distinguishes itself through a recipe-based workflow execution system that ensures experimental reproducibility by running standardized sequences of scripts for data preparation and model training. It
Gensim is a natural language processing toolkit designed for large-scale text analysis and the training of semantic vector embeddings. It provides a framework for identifying latent thematic structures within document collections and calculating semantic similarity between text segments using unsupervised statistical algorithms. The project is distinguished by its ability to handle datasets that exceed available system memory through incremental corpus streaming, which processes documents one at a time from disk. It utilizes sparse vector representations and dictionary-based token mapping to
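The streaming approach described in this entry can be sketched in plain Python: a corpus object re-reads a file one line at a time and yields each document as a sparse list of `(token_id, count)` pairs, so only one document is held in memory at once. This is an illustrative sketch, not Gensim's API; the class name `StreamingCorpus`, the one-document-per-line file layout, and the whitespace tokenizer are all assumptions.

```python
from collections import Counter


class StreamingCorpus:
    """Hypothetical corpus that streams documents from disk, one line per document."""

    def __init__(self, path):
        self.path = path
        self.token2id = {}
        # First pass: assign an integer id to every distinct token.
        for tokens in self._read_tokens():
            for token in tokens:
                self.token2id.setdefault(token, len(self.token2id))

    def _read_tokens(self):
        # The file is read lazily, so only the current line is kept in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()

    def __iter__(self):
        # Second pass (repeatable): yield each document as a sparse bag-of-words vector.
        for tokens in self._read_tokens():
            counts = Counter(self.token2id[token] for token in tokens)
            yield sorted(counts.items())
```

Because `__iter__` reopens the file on every call, the corpus can be iterated several times (for example, once per training epoch) without ever loading the full collection. Note that the token dictionary itself still lives in memory in this sketch.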
Haystack is an orchestration framework designed for building complex search and generative AI pipelines. It functions as an agentic workflow engine, enabling the construction of automated sequences that allow AI agents to perform multi-step reasoning and data analysis. The framework utilizes a modular, component-based architecture that connects processing steps into directed acyclic graphs. By employing a provider-agnostic integration layer, it decouples core logic from specific external AI services and vector databases, allowing for the flexible exchange of underlying technologies. This desi
GoCV is a computer vision library and Go language binding for OpenCV. It serves as an image processing toolkit and deep learning inference engine, providing programmatic access to a wide range of algorithms for image manipulation, object detection, and video analysis. The project differentiates itself through high-performance native bindings and hardware acceleration. It utilizes a foreign function interface to map Go calls to C++ functions and includes a hardware-agnostic backend dispatch to route neural network tasks to computation engines such as CUDA and OpenVINO. The library covers a br
This project is a comprehensive collection of educational examples and reference implementations for building vision and language models using PyTorch. It serves as a deep learning tutorial covering the end-to-end process of developing neural networks, from initial architecture definition to final production deployment. The repository provides detailed guides on implementing a wide range of domain-specific models, including convolutional neural networks for object detection and segmentation, as well as transformer and recurrent architectures for natural language processing. It emphasizes gene
CoreNLP is a Java natural language processing library designed to convert raw human language text into structured data. It utilizes a suite of linguistic annotators to analyze text through a pipeline, extracting grammatical structures, sentiment, and linguistic patterns. The project includes a coreference resolution engine that links multiple mentions of the same entity to maintain contextual consistency across documents. It also provides tools for named entity recognition to categorize people, companies, and locations, and a part-of-speech tagger to assign grammatical categories and base for
minGPT is a minimal implementation of the Transformer architecture designed for training and experimenting with language models. It functions as a neural network training framework and a text generation engine, providing the necessary tools to manage data loading, backpropagation, and parameter updates for custom deep learning models. The project is structured as an educational resource for understanding how transformer architectures function by building and training models from scratch. It utilizes a modular block architecture and transformer-based self-attention to process sequences, allowi
This project is a collection of supervised and unsupervised machine learning algorithms implemented from scratch using Python. It serves as an educational resource for studying model training, parameter optimization, and the implementation of core predictive models. The library provides a variety of supervised learning tools, including linear and logistic regression, decision trees, and support vector machines. It also features unsupervised learning capabilities for discovering patterns in unlabeled datasets through clustering algorithms. Broad capability areas include ensemble learning thro
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
This repository serves as an educational framework for building large language models from the ground up. It provides a structured curriculum that guides learners through the end-to-end lifecycle of model development, including data processing, architecture design, and optimization. By focusing on low-level implementation, the project enables users to master the fundamental mechanics of artificial intelligence without relying on high-level abstraction frameworks. The project distinguishes itself by constructing neural network components and gradient-based optimization logic from first princip
SentencePiece is a text segmentation engine and tokenization library designed for machine learning workflows. It provides a comprehensive toolkit for transforming raw text into subword units or numerical identifiers, enabling consistent data representation for neural network training and inference. The library supports the training of segmentation models from raw text, allowing for the creation of custom vocabularies tailored to specific domain requirements. The project distinguishes itself through its byte-level encoding and fallback mechanisms, which ensure that every input can be represent
YSDA course in Natural Language Processing
Compromise is a natural language processing library and rule-based text parser designed to analyze unstructured text. It functions as a toolkit for identifying parts of speech, linguistic patterns, and semantic meaning, while providing specialized engines for named entity recognition and the parsing of temporal and numeric data. The project is distinguished by its linguistic morphological engine, which can conjugate verbs across different tenses and inflect nouns and adjectives. It further allows for linguistic model customization through a plugin system that enables the extension of lexicons
This project is a PyTorch sentiment analysis tutorial and a deep learning implementation for analyzing text. It provides a natural language processing sequence classification pipeline designed to clean text data and train neural networks to categorize sequences of words. The implementation focuses on adapting pretrained language models for specific text classification tasks using custom datasets. It includes a process for fine-tuning large-scale language models and implementing recurrent networks and transformers for emotional tone detection. The project covers the broader surface of text se