30 open-source projects similar to qdata/textattack, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best TextAttack alternative.
Stanza is a Python natural language processing library designed for tokenization, lemmatization, and dependency parsing across many human languages using neural models. It provides a neural processing pipeline that converts raw text into structured linguistic data objects, alongside a specialized analyzer for extracting medical insights from clinical and biomedical language. The project includes a wrapper that connects Python scripts to Java-based natural language processing tools and remote annotation servers. This enables a bridge for extracting linguistic annotations and analysis data from
AllenNLP is a PyTorch-based research library and deep learning language toolkit designed for developing and training neural network architectures for linguistic tasks. It provides a distributed training system that coordinates data and gradients across multiple GPUs and a framework for integrating pretrained transformer architectures. The system distinguishes itself with a dedicated algorithmic bias mitigation tool used to identify and reduce bias in linguistic model predictions. It also includes model influence analysis to interpret predictions by calculating the influence of specific traini
This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet it is not always as trivial a task as it appears to be. This binding makes the power of the Ucto tokeniser available to Python. Ucto itself is a regular-expression-based, extensible, and advanced tokeniser written in C++ (http://ilk.uvt.nl/ucto).
Basic Utilities for PyTorch Natural Language Processing (NLP)
snips-nlu is a Python library and natural language understanding engine designed to convert unstructured text into structured data. It identifies user intents and extracts associated entities from natural language sentences to enable machine-readable command processing. The engine functions as a multilingual parser capable of processing text across multiple languages. It maps identified entities to canonical values or standardized ISO formats, such as timestamps, to ensure data consistency. The project covers intent classification and named entity recognition, utilizing sequence labeling and
spaCy is a Python natural language processing framework designed for industrial-scale text processing. It converts raw text into structured data for machine learning pipelines through a combination of statistical language model trainers, transformer-based text processors, and syntactic dependency parsers. The project enables the integration of pretrained transformer architectures to perform complex linguistic analysis and multi-task learning. It also provides a specialized system for neural named entity recognition to identify and categorize key entities within text. The framework covers a b
Haystack is an orchestration framework designed for building complex search and generative AI pipelines. It functions as an agentic workflow engine, enabling the construction of automated sequences that allow AI agents to perform multi-step reasoning and data analysis. The framework utilizes a modular, component-based architecture that connects processing steps into directed acyclic graphs. By employing a provider-agnostic integration layer, it decouples core logic from specific external AI services and vector databases, allowing for the flexible exchange of underlying technologies. This desi
Beautiful visualizations of how language differs among document types.
PyNLPl, pronounced as 'pineapple', is a Python library for Natural Language Processing. It contains various modules useful for common, and less common, NLP tasks. PyNLPl can be used for basic tasks such as the extraction of n-grams and frequency lists, and to build simple language models. There are also more complex data types and algorithms. Moreover, there are parsers for file formats common in NLP (e.g. FoLiA/Giza/Moses/ARPA/Timbl/CQL). There are also clients to interface with various NLP-specific servers. PyNLPl most notably features a very extensive library for working with FoLiA XML (Form
A Python library for Interpretable Machine Learning in Text Classification using the SS3 model, with easy-to-use visualization tools for Explainable AI
Fairseq is a deep learning research toolkit and sequence-to-sequence framework built on PyTorch. It provides a system for training and deploying models that map input sequences to output sequences, with a primary focus on neural machine translation and speech recognition. The toolkit allows for the generation of text sequences through search algorithms such as beam search and nucleus sampling. It includes capabilities for producing synthetic parallel training data by translating monolingual text using reverse sequence models. The framework supports large scale model training through multi-de
Library for translating between 200 languages. Built on 🤗 transformers.
Flair is a natural language processing framework for training and applying models for sequence labeling and text classification. It provides a system for generating word embeddings and identifying semantic entities within text. The framework includes a dedicated system for zero and few-shot learning, enabling text classification and entity extraction using minimal training examples by leveraging pre-trained knowledge. Its capabilities cover named entity recognition, sentiment analysis, and the training of specialized models using custom datasets. It also includes tooling for the visual highl
A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.
Sockeye is an open-source sequence-to-sequence framework for Neural Machine Translation built on PyTorch. It implements distributed training and optimized inference for state-of-the-art models, powering Amazon Translate and other MT applications. Recent developments and changes are tracked in…
Tools, wrappers, etc. for data science, with a concentration on text processing
This project is a high-performance library for converting raw text into tokens and IDs for machine learning models. It functions as a fast text encoder and a text preprocessing pipeline designed to transform strings into numerical representations with high throughput for research and production. The library includes a subword tokenizer trainer used to analyze text datasets and create custom vocabularies using algorithms such as byte-pair encoding and wordpiece. It provides capabilities for subword vocabulary training and text alignment, allowing character offsets to be tracked during normaliz
Transformers is a comprehensive library for machine learning that provides a unified interface for training, fine-tuning, and deploying transformer-based models. It supports a wide range of tasks, including text classification, language modeling, question answering, and sequence-to-sequence translation, while offering specialized architectures for both text and vision processing. The framework includes tools for managing the entire model lifecycle, from data preprocessing and tokenization to distributed training and inference. The library features extensive support for model optimization and
Learn_Prompting is an educational project focused on prompt engineering, providing the principles and techniques required to craft effective inputs and improve the quality of generative AI outputs. The project covers advanced prompting strategies to enhance reasoning, reliability, and output quality. This includes techniques for task decomposition, chain-of-thought reasoning, and the use of few-shot and zero-shot guidance. It also addresses model security through the study of prompt hacking, vulnerability analysis, and privacy auditing to prevent sensitive data leaks. The scope extends to th
YSDA course in Natural Language Processing
Chat with your favourite LLaMA models in a native macOS app
AudioGPT is an LLM-driven audio framework and processing suite that uses large language models to orchestrate neural audio pipelines. It functions as a multimodal audio generator and processing system, integrating a collection of pretrained models to handle speech synthesis, sound generation, and audio manipulation. The system is distinguished by its ability to generate audio from diverse inputs, including text and images, and its capacity to produce synchronized talking head videos. It also operates as a neural speech translator, converting spoken language between different tongues while pre
Gibran is an Elixir natural language processor, and a port of WordsCounted.
DocsGPT is a retrieval-augmented generation platform and private knowledge base used to build AI agents that perform grounded search and analysis. It functions as a multi-model AI orchestrator and enterprise agent builder, allowing for the integration of various local and cloud language models to customize reasoning and text generation. The project provides a visual environment for developing automated assistants using conditional logic and third-party API connectivity. It enables the creation of private AI agents capable of performing enterprise search and detailed document analysis using pr
Argilla is a collaborative AI feedback tool and data curation management system. It serves as a human-in-the-loop dataset platform designed to coordinate workforce annotators and domain experts in labeling, rating, and refining data samples for machine learning projects. The platform focuses on large language model dataset curation and reinforcement learning from human feedback workflows. It provides a shared workspace for integrating human expertise into AI development to validate model outputs and correct data errors. The system manages the end-to-end machine learning data pipeline, includ
Implementation of various topic models
Lecture notes for probabilistic topic models using IPython Notebook
This repository consists of all my NLP Projects