14 repositories
Libraries for parsing, formatting, and manipulating text-based data structures.
14 GitHub repositories matching Data & Databases · Text Preprocessing.
This project serves as a comprehensive language ecosystem index, functioning as a centralized, community-curated directory for the Go programming language. It organizes a vast landscape of software components, libraries, and development tools into a structured, navigable hierarchy, enabling developers to efficiently discover resources tailored to specific functional domains. The repository distinguishes itself through a decentralized contribution model, where community-driven updates ensure the index remains current with the rapidly evolving software landscape. Beyond simple resource listing,
Offers libraries for parsing, formatting, and manipulating text data.
This project is an open-source, interactive educational platform designed to teach deep learning through a comprehensive, code-first curriculum. It provides a structured learning path that covers foundational mathematics, modern neural network architectures, and practical optimization techniques, enabling practitioners to master complex artificial intelligence concepts through hands-on experimentation. The platform distinguishes itself by integrating technical explanations with executable Jupyter notebooks. This design allows readers to modify code and hyperparameters in real-time, facilitati
Demonstrates practical workflows for cleaning, tokenizing, and preparing diverse text data for downstream natural language processing tasks.
This project is an educational resource providing practical code examples and implementations of machine learning algorithms using the Python language. It serves as a guide for constructing predictive pipelines, clustering models, and dimensionality reduction within the Scikit-Learn ecosystem. The repository includes comprehensive demonstrations for supervised and unsupervised learning, as well as detailed examples for implementing neural networks and deep architectures. It also provides practical guidance on exporting model parameters to JSON and wrapping trained models in web APIs for produ
Cleans raw text and performs tokenization to prepare documents for feature extraction.
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
Implements regex-based text splitting by category to prevent cross-category BPE merges during tokenization.
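The idea of splitting text by category before counting BPE merge candidates can be sketched as follows. This is a minimal illustration, not the repository's code: the pattern `SPLIT_PATTERN` and the helper names are assumptions, and the pattern is far simpler than the ones used by production tokenizers.

```python
import re
from collections import Counter

# Simplified category pattern (an assumption for illustration): runs of
# letters, digits, whitespace, or any other characters.
SPLIT_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+|\s+|[^A-Za-z0-9\s]+")


def split_by_category(text):
    """Cut the text into chunks that each contain a single character category."""
    return SPLIT_PATTERN.findall(text)


def count_pairs(chunks):
    """Count adjacent byte pairs inside each chunk, never across chunk borders."""
    counts = Counter()
    for chunk in chunks:
        ids = list(chunk.encode("utf-8"))
        counts.update(zip(ids, ids[1:]))
    return counts


chunks = split_by_category("abc123 abc!")
pairs = count_pairs(chunks)
```

Because pairs are only counted within a chunk, a letter followed by a digit (such as `c` then `1` above) never becomes a merge candidate.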
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
Tokenizes and concatenates multiple text fields into single sequences for model consumption.
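As a rough sketch of what joining several text fields into one sequence can look like, the example below uses a whitespace tokenizer and a `[SEP]` marker between fields. The function name, the separator token, and the field names are all hypothetical and are not taken from AutoGluon.

```python
def concat_fields(record, fields, sep_token="[SEP]"):
    """Tokenize each named field and join them into one token sequence."""
    tokens = []
    for i, name in enumerate(fields):
        if i > 0:
            # Mark the boundary between two fields.
            tokens.append(sep_token)
        # Missing fields contribute no tokens.
        tokens.extend(str(record.get(name, "")).lower().split())
    return tokens


row = {"title": "Red Shoes", "body": "Very comfy"}
sequence = concat_fields(row, ["title", "body"])
```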
Fuzzywuzzy is a Python library and text processing utility designed to calculate similarity scores between strings. It functions as a text similarity scoring engine and an approximate string matching tool used to identify the closest textual matches within a list of candidate strings. The library provides a suite of tools for measuring the degree of similarity between pieces of text, accounting for typos and formatting differences. These capabilities include extracting the best match from a candidate list and performing fuzzy string matching through various scoring methods. The toolset cover
Normalizes strings by removing special characters and forcing ASCII encoding to optimize fuzzy comparisons.
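A normalization step of this kind might look like the sketch below. It uses only the Python standard library; `normalize_for_matching` is a hypothetical helper written for illustration and is not Fuzzywuzzy's own API.

```python
import re
import unicodedata


def normalize_for_matching(text):
    """Reduce a string to lowercase ASCII letters and digits separated by spaces."""
    # Decompose accented characters, then drop everything outside ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace runs of special characters with a single space.
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", ascii_text)
    return cleaned.strip().lower()


left = normalize_for_matching("  Café-Crème!! ")
right = normalize_for_matching("cafe creme")
```

After normalization, both inputs compare equal, so a similarity scorer no longer has to penalize accents, punctuation, or case.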
Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models. The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
Extracts meaning from text through sentence splitting, tokenization, stemming, and tagging.
This project is a machine learning curriculum and learning platform delivered through interactive Jupyter Notebooks. It serves as a complete guide to mastering the Python data science toolkit, providing structured tutorials for numerical computing, tabular data manipulation, and statistical visualization. The curriculum includes specific implementation guides for Scikit-Learn and a practical TensorFlow course for building, training, and deploying neural networks and computer vision models. It covers the end-to-end process of building predictive models, from initial problem formulation and task categorization to deploying models through interactive web interfaces. The project covers a broad range of capabilities, including numerical computing with multidimensional arrays, exploratory data analysis, and data preprocessing routines. It provides detailed workflows for supervised and unsupervised learning, automated machine learning pipelines, hyperparameter optimization, and model evaluation using classification metrics and cross-validation. The educational content is organized as a series of notebooks that interleave Python code with narrative explanations to document data science workflows.
Applies string transformations to standardize text formatting across data columns for preprocessing.
Accepts user-provided functions for stemming, stop-word removal, or other text preprocessing instead of imposing a built-in locale.
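One way to accept user-provided preprocessing functions is to pass them in as optional callables, as in this sketch. The function and parameter names (`preprocess`, `stem`, `is_stop_word`) are assumptions made for illustration and do not describe any listed project's interface.

```python
def preprocess(text, stem=None, is_stop_word=None):
    """Tokenize on whitespace, then apply whatever hooks the caller supplies."""
    tokens = text.lower().split()
    if is_stop_word is not None:
        # The caller decides which words count as stop words.
        tokens = [t for t in tokens if not is_stop_word(t)]
    if stem is not None:
        # The caller supplies the stemming rule for their language.
        tokens = [stem(t) for t in tokens]
    return tokens


stop_words = {"the", "are"}
result = preprocess(
    "The cats are running",
    stem=lambda token: token.rstrip("s"),  # toy stemmer, not linguistically sound
    is_stop_word=stop_words.__contains__,
)
```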
AiNiee is an LLM-based localization tool that automates the translation of games, books, subtitles, and documents across multiple languages. It operates as a batch processing engine, translating entire folders of files in parallel while preserving directory structure, and includes a glossary management system that enforces terminology consistency using AI-powered glossaries, forbidden terms, and user-defined text substitution rules. The tool differentiates itself through key architectural decisions: it distributes translation requests across multiple API keys to bypass rate limits and acceler
Applies user-defined substitution rules and regex patterns to modify or protect text before and after translation.
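Protecting text with a regex before translation and restoring it afterwards can be sketched with placeholders. This is an illustrative approach, not AiNiee's implementation: the placeholder format `__P0__` and the helper names are assumptions, and the translated string below is typed by hand in place of a real translation call.

```python
import re


def protect(text, pattern):
    """Swap every regex match for a numbered placeholder and remember the original."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"__P{len(saved) - 1}__"

    return re.sub(pattern, stash, text), saved


def restore(text, saved):
    """Put the protected fragments back in place of their placeholders."""
    for i, original in enumerate(saved):
        text = text.replace(f"__P{i}__", original)
    return text


masked, saved = protect("Hello {player}, you have {count} coins", r"\{[^}]*\}")
# Stand-in for the translated output; no translation service is called here.
translated = "Bonjour __P0__, vous avez __P1__ jetons"
final = restore(translated, saved)
```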
This project is a PyTorch sentiment analysis tutorial and a deep learning implementation for analyzing text. It provides a natural language processing sequence classification pipeline designed to clean text data and train neural networks to categorize sequences of words. The implementation focuses on adapting pretrained language models for specific text classification tasks using custom datasets. It includes a process for fine-tuning large-scale language models and implementing recurrent networks and transformers for emotional tone detection. The project covers the broader surface of text se
Provides text preprocessing routines to scrub and simplify raw datasets for sequence classification.
This project is a comprehensive instructional resource and course for building neural networks using PyTorch. It covers the fundamental building blocks of deep learning, including tensor manipulation, automatic differentiation, and the construction of modular neural network components. The repository serves as a technical guide for several specialized domains. It provides implementation details for computer vision tasks such as image classification, object detection, and semantic segmentation, as well as natural language processing workflows involving transformers, recurrent networks, and gen
Converts text into indexed sequences and ensures uniform length using padding and truncation.
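Indexing plus padding and truncation can be shown in plain Python without any framework. The sketch below assumes a whitespace tokenizer and reserves index 0 for padding and 1 for unknown words; these choices are illustrative conventions, not the course's code.

```python
PAD, UNK = 0, 1


def build_vocab(texts):
    """Assign an integer index to every distinct token, in order of first appearance."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab, max_len):
    """Map tokens to indices, truncate to max_len, then right-pad with PAD."""
    ids = [vocab.get(token, UNK) for token in text.lower().split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab(["the cat sat", "the dog"])
short = encode("the cat", vocab, max_len=4)
long = encode("the dog barked loudly today", vocab, max_len=4)
```

Every encoded sequence has exactly `max_len` entries, which is what lets a batch be stacked into a single rectangular tensor.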
tts-server-android is a system-level text-to-speech service for Android that routes synthesis requests to external cloud APIs or local engines. It functions as an HTTP speech synthesis gateway, converting system speech requests into customizable HTTP requests for remote cloud services. The project includes a narrative dialogue parser that uses quotation marks to differentiate between narration and dialogue, allowing for different reading styles. It also features a voice manager and synthesis interface to implement text replacement rules and automatic retries to improve voice output accuracy.
Modifies raw input text using replacement rules to ensure correct pronunciation before synthesis.
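The quotation-mark approach to separating narration from dialogue, described in the entry above, can be approximated with a regex split. This sketch is written in Python for illustration only; the pattern and the function name are assumptions and do not reflect the project's actual parser.

```python
import re

# Matches a straight-quoted or curly-quoted span (assumed quote styles).
QUOTED = re.compile(r'("[^"]*"|“[^”]*”)')


def split_narration_dialogue(text):
    """Label each piece of text as narration or dialogue based on quotation marks."""
    segments = []
    # The capturing group keeps the quoted spans in the split result.
    for part in QUOTED.split(text):
        part = part.strip()
        if not part:
            continue
        kind = "dialogue" if QUOTED.fullmatch(part) else "narration"
        segments.append((kind, part))
    return segments


segments = split_narration_dialogue('She said, "Hello there." Then she left.')
```

Each labelled segment could then be routed to a different voice or reading style.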
CrawlerTutorial is a comprehensive Python web scraping tutorial and framework designed to extract data from static and dynamic websites. It functions as a web data extraction pipeline and HTTP request orchestrator, covering the full lifecycle of scraping applications, from initial retrieval to final data storage. The project provides specialized guidance on anti-bot bypass techniques and web API reverse engineering. It includes methods for evading browser detection through identity masking and proxy rotation, as well as techniques for identifying hidden API endpoints by analyzing network traffic and request signatures. The framework encompasses a broad set of capabilities, including browser automation for JavaScript-heavy pages, automated user authentication via QR codes or SMS, and session persistence management. It also has data preprocessing tools for cleaning raw text, removing duplicate records, and persisting collected information to flat files or relational databases.
Includes tools for cleaning raw scraped text, removing duplicate records, and transforming data into analysis-ready formats.
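Cleaning scraped text and dropping duplicates can be sketched with the standard library alone. The helpers below are hypothetical and are not taken from CrawlerTutorial; the tag-stripping regex is a rough shortcut, and a real scraper would more likely use an HTML parser.

```python
import html
import re


def clean_text(raw):
    """Strip tags, decode HTML entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    unescaped = html.unescape(no_tags)
    return re.sub(r"\s+", " ", unescaped).strip()


def dedupe(records):
    """Keep the first occurrence of each record, comparing case-insensitively."""
    seen = set()
    unique = []
    for record in records:
        key = record.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


scraped = ["<p>Tom &amp; Jerry</p>\n", "<div>tom  &amp;  jerry</div>", "<p>Spike</p>"]
cleaned = dedupe([clean_text(item) for item in scraped])
```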