35 Repos
Comprehensive frameworks and toolkits for deep learning and linguistic analysis.
35 GitHub repositories from the Python NLP Libraries section of an awesome list.
Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference. The library features extensive support for model optimization and
State-of-the-art library for Transformer-based models.
spaCy is a Python natural language processing framework designed for industrial-scale text processing. It converts raw text into structured data for machine learning pipelines through a combination of statistical language model trainers, transformer-based text processors, and syntactic dependency parsers. The project enables the integration of pretrained transformer architectures to perform complex linguistic analysis and multi-task learning. It also provides a specialized system for neural named entity recognition to identify and categorize key entities within text. The framework covers a b
Industrial-strength library for advanced natural language processing.
Fairseq is a deep learning research toolkit and sequence-to-sequence framework built on PyTorch. It provides a system for training and deploying models that map input sequences to output sequences, with a primary focus on neural machine translation and speech recognition. The toolkit allows for the generation of text sequences through search algorithms such as beam search and nucleus sampling. It includes capabilities for producing synthetic parallel training data by translating monolingual text using reverse sequence models. The framework supports large scale model training through multi-de
Facebook AI Research implementations of sequence-to-sequence models.
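The nucleus sampling mentioned in the Fairseq description can be illustrated with a small, library-free sketch. This is not Fairseq's implementation; the function names `nucleus_filter` and `nucleus_sample` and the toy probability table are hypothetical, chosen only to show the idea: keep the smallest set of highest-probability tokens whose cumulative mass reaches a threshold `top_p`, renormalize, and sample from that set.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the highest-probability tokens until their mass reaches top_p.

    `probs` maps token -> probability. The kept probabilities are
    renormalized so that they sum to 1.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        total += p
        if total >= top_p:
            break
    return {token: p / total for token, p in kept}


def nucleus_sample(probs, top_p, rng=random):
    """Draw one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution (hypothetical values).
next_token_probs = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
nucleus = nucleus_filter(next_token_probs, top_p=0.75)
choice = nucleus_sample(next_token_probs, top_p=0.75, rng=random.Random(0))
```

With `top_p=0.75`, only "the" and "a" survive the filter, so the unlikely tail tokens can never be sampled. Beam search, by contrast, is deterministic and keeps the highest-scoring partial sequences at each step.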
Haystack is an orchestration framework designed for building complex search and generative AI pipelines. It functions as an agentic workflow engine, enabling the construction of automated sequences that allow AI agents to perform multi-step reasoning and data analysis. The framework utilizes a modular, component-based architecture that connects processing steps into directed acyclic graphs. By employing a provider-agnostic integration layer, it decouples core logic from specific external AI services and vector databases, allowing for the flexible exchange of underlying technologies. This desi
End-to-end framework for building natural language search interfaces.
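As a rough illustration of the component-based, directed-acyclic-graph design described for Haystack, the sketch below runs named components in dependency order using only the Python standard library. It does not use Haystack's API; `run_pipeline`, the component names, and the stand-in "generator" are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(components, dependencies, query):
    """Run each component after all of its upstream components.

    components:   name -> function(query, upstream_outputs)
    dependencies: name -> set of upstream component names
    Returns a dict mapping every component name to its output.
    """
    outputs = {}
    # static_order() yields each node only after its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: outputs[dep] for dep in dependencies.get(name, ())}
        outputs[name] = components[name](query, upstream)
    return outputs


# Toy document store (hypothetical data).
DOCS = ["spaCy parses text", "Flair tags entities"]

components = {
    "retriever": lambda q, up: [d for d in DOCS if q.lower() in d.lower()],
    "prompt_builder": lambda q, up: "Question: " + q + " Context: " + " | ".join(up["retriever"]),
    # Stand-in for a call to an external model provider.
    "generator": lambda q, up: up["prompt_builder"].upper(),
}
dependencies = {
    "retriever": set(),
    "prompt_builder": {"retriever"},
    "generator": {"prompt_builder"},
}

outputs = run_pipeline(components, dependencies, "spaCy")
```

Because each component only sees the outputs of its declared upstream steps, swapping the retriever or the generator for a different backend leaves the rest of the graph unchanged, which is the decoupling the description refers to.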
Flair is a natural language processing framework for training and applying models for sequence labeling and text classification. It provides a system for generating word embeddings and identifying semantic entities within text. The framework includes a dedicated system for zero and few-shot learning, enabling text classification and entity extraction using minimal training examples by leveraging pre-trained knowledge. Its capabilities cover named entity recognition, sentiment analysis, and the training of specialized models using custom datasets. It also includes tooling for the visual highl
Simple framework for multilingual NLP built on PyTorch.
AllenNLP is a PyTorch-based research library and deep learning language toolkit designed for developing and training neural network architectures for linguistic tasks. It provides a distributed training system that coordinates data and gradients across multiple GPUs and a framework for integrating pretrained transformer architectures. The system distinguishes itself with a dedicated algorithmic bias mitigation tool used to identify and reduce bias in linguistic model predictions. It also includes model influence analysis to interpret predictions by calculating the influence of specific traini
Research library for building deep learning models on PyTorch.
This project is a high-performance library for converting raw text into tokens and IDs for machine learning models. It functions as a fast text encoder and a text preprocessing pipeline designed to transform strings into numerical representations with high throughput for research and production. The library includes a subword tokenizer trainer used to analyze text datasets and create custom vocabularies using algorithms such as byte-pair encoding and WordPiece. It provides capabilities for subword vocabulary training and text alignment, allowing character offsets to be tracked during normaliz
High-performance tokenization for research and production.
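To make the byte-pair encoding mentioned above concrete, here is a minimal training sketch in plain Python. It is not this library's implementation or API; `train_bpe` and its helpers are hypothetical names, and the tie-breaking rule (prefer the lexicographically larger pair) is an arbitrary assumption made so the result is deterministic.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from whitespace-split text."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Most frequent pair; ties broken by the larger pair (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges


merges = train_bpe("low low low lower lowest", num_merges=2)
```

Each learned merge becomes a vocabulary entry, so frequent character sequences such as "low" end up as single tokens while rare words are still representable as smaller pieces.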
YSDA course in Natural Language Processing
Teaches NLP algorithms using Python with NumPy, PyTorch, and NLTK for all assignments and examples.
Nebullvm is an AI inference accelerator, GPU resource orchestrator, and performance optimization library for large language models. It functions as an optimization layer designed to lower operational costs by aligning model execution with underlying hardware architectures. The system maximizes cluster efficiency through real-time dynamic partitioning and elastic quotas for shared hardware resources. It employs alignment methods and techniques to reduce the hardware and data requirements necessary for tuning large language models. The project covers broad capability areas including AI infrast
Optimizes inference speed for deep learning models.
Stanza is a Python natural language processing library designed for tokenization, lemmatization, and dependency parsing across many human languages using neural models. It provides a neural processing pipeline that converts raw text into structured linguistic data objects, alongside a specialized analyzer for extracting medical insights from clinical and biomedical language. The project includes a wrapper that connects Python scripts to Java-based natural language processing tools and remote annotation servers. This enables a bridge for extracting linguistic annotations and analysis data from
Provides a comprehensive Python library for deep learning-based linguistic analysis, tokenization, and dependency parsing.
PraisonAI is an autonomous AI agent platform that coordinates multiple LLM-powered agents for research, planning, and execution of complex workflows. It functions as a multi-agent orchestration framework, a workflow builder, and a Model Context Protocol server, while also providing retrieval-augmented generation through vector knowledge bases. Agents can interact via CLI, web, or standardized protocols with sandboxed code execution. The platform distinguishes itself with a rich set of agent communication protocols, including A2A, REST, WebSocket, voice and telephony integration, and MCP, allo
Multi-agent framework with LLM support and agentic workflows.
snips-nlu is a Python library and Natural Language Understanding engine designed to convert unstructured text into structured data. It identifies user intents and extracts the associated entities from natural-language sentences to enable machine-readable command processing. The engine acts as a multilingual parser capable of processing text in several languages. It maps identified entities to canonical values or standardized ISO formats, such as timestamps, to ensure data consistency. The project covers intent classification and named entity recognition, using sequence labeling and tokenization to identify user goals and specific data slots.
Production-ready library for intent parsing and slot filling.
TextAttack 🐙 is a Python framework for adversarial attacks, data augmentation, and model training in NLP https://textattack.readthedocs.io/en/master/
Framework for adversarial attacks and data augmentation in NLP.
A Deep Learning NLP/NLU library by Intel® AI Lab
Library for exploring state-of-the-art deep learning topologies.
Deep learning toolkit for research and industrial NLP deployment.
Kashgari is a production-level NLP transfer learning framework built on top of tf.keras for text labeling and text classification. It includes Word2Vec, BERT, and GPT-2 language embeddings.
Keras-powered framework for named entity recognition and classification.
Beautiful visualizations of how language differs among document types.
Visualizes language differences between corpora using D3.
NLP, before and after spaCy
Higher-level NLP utilities built on top of spaCy.
Basic Utilities for PyTorch Natural Language Processing (NLP)
Toolkit for rapid prototyping with data loaders and metrics.
Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
Transfer learning framework focused on industrial question answering.