14 repositories
Libraries for parsing, formatting, and manipulating text-based data structures.
Explore 14 GitHub repositories matching data & databases · Text Preprocessing.
This project serves as a comprehensive language ecosystem index, functioning as a centralized, community-curated directory for the Go programming language. It organizes a vast landscape of software components, libraries, and development tools into a structured, navigable hierarchy, enabling developers to efficiently discover resources tailored to specific functional domains. The repository distinguishes itself through a decentralized contribution model, where community-driven updates ensure the index remains current with the rapidly evolving software landscape. Beyond simple resource listing,
Offers libraries for parsing, formatting, and manipulating text data.
This project is an open-source, interactive educational platform designed to teach deep learning through a comprehensive, code-first curriculum. It provides a structured learning path that covers foundational mathematics, modern neural network architectures, and practical optimization techniques, enabling practitioners to master complex artificial intelligence concepts through hands-on experimentation. The platform distinguishes itself by integrating technical explanations with executable Jupyter notebooks. This design allows readers to modify code and hyperparameters in real-time, facilitati
Demonstrates practical workflows for cleaning, tokenizing, and preparing diverse text data for downstream natural language processing tasks.
This project is an educational resource providing practical code examples and implementations of machine learning algorithms using the Python language. It serves as a guide for constructing predictive pipelines, clustering models, and dimensionality reduction within the Scikit-Learn ecosystem. The repository includes comprehensive demonstrations for supervised and unsupervised learning, as well as detailed examples for implementing neural networks and deep architectures. It also provides practical guidance on exporting model parameters to JSON and wrapping trained models in web APIs for produ
Cleans raw text and performs tokenization to prepare documents for feature extraction.
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
Implements regex-based text splitting by category to prevent cross-category BPE merges during tokenization.
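A minimal sketch of the idea, not the repository's own code: the text is first split into chunks by character category, and byte pairs are then counted only inside each chunk, so a merge can never span a letter/digit or word/punctuation boundary. The pattern `CATEGORY_PATTERN` and the helper names are assumptions chosen for illustration; production tokenizers use more elaborate patterns.

```python
import re
from collections import Counter

# Hypothetical, simplified category pattern (an assumption for illustration):
# runs of letters, digits, whitespace, other symbols, or underscores.
CATEGORY_PATTERN = re.compile(r"[^\W\d_]+|\d+|\s+|[^\w\s]+|_+")

def split_by_category(text):
    """Split text into chunks that each contain one character category."""
    return CATEGORY_PATTERN.findall(text)

def count_pairs(chunks):
    """Count adjacent byte pairs within each chunk, never across chunks."""
    counts = Counter()
    for chunk in chunks:
        ids = list(chunk.encode("utf-8"))
        for pair in zip(ids, ids[1:]):
            counts[pair] += 1
    return counts

chunks = split_by_category("Hello world123!!")
pair_counts = count_pairs(chunks)
print(chunks)
```

Because "world" and "123" land in separate chunks, the pair ("d", "1") is never counted and therefore can never be merged.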
AutoGluon is an automated machine learning framework and multimodal library designed to automate the end-to-end pipeline from data preprocessing to high-accuracy model training and validation. It functions as an automated model trainer for tabular, image, text, and time series data, as well as a tool for time series forecasting and foundation model finetuning. The project is distinguished by its ability to jointly process and fuse different data types, allowing for the construction of multimodal neural networks that integrate images, text, and structured tables. It supports zero-shot inferenc
Tokenizes and concatenates multiple text fields into single sequences for model consumption.
Fuzzywuzzy is a Python library and text processing utility designed to calculate similarity scores between strings. It functions as a text similarity scoring engine and an approximate string matching tool used to identify the closest textual matches within a list of candidate strings. The library provides a suite of tools for measuring the degree of similarity between pieces of text, accounting for typos and formatting differences. These capabilities include extracting the best match from a candidate list and performing fuzzy string matching through various scoring methods. The toolset cover
Normalizes strings by removing special characters and forcing ASCII encoding to optimize fuzzy comparisons.
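The normalization step can be sketched with the standard library alone. This is an illustrative stand-in, not the library's implementation: the function name `normalize_for_matching` is hypothetical, and the use of Unicode decomposition to map accented letters to ASCII is an assumption (a stricter variant would simply drop non-ASCII characters).

```python
import re
import unicodedata

def normalize_for_matching(text):
    """Return a lowercase, ASCII-only, punctuation-free version of text."""
    # Decompose accented characters, then drop anything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    ascii_text = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace every run of non-alphanumeric characters with one space.
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", ascii_text)
    return cleaned.strip().lower()

print(normalize_for_matching("  Café--Münster!! "))
```

Running both strings through the same normalization before scoring keeps accents, case, and punctuation from lowering the similarity score.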
Smile is a comprehensive JVM machine learning library and statistical computing toolkit. It provides a suite of algorithms for classification, regression, and clustering, implemented natively for Java, Scala, and Kotlin. The project also functions as a deep learning framework, a natural language processing library, and an inference engine for large language models. The library distinguishes itself through GPU acceleration via LibTorch bindings and support for the ONNX model interchange format. It includes specialized capabilities for large language model inference, featuring Byte-Pair Encodin
Extracts meaning from text through sentence splitting, tokenization, stemming, and tagging.
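This project is a machine learning course and learning platform delivered through interactive Jupyter notebooks. It serves as a comprehensive guide to the Python data science toolkit, providing structured tutorials for numerical computing, tabular data manipulation, and statistical visualization. The course includes implementation guides for Scikit-Learn, as well as hands-on TensorFlow lessons on building, training, and deploying neural networks and computer vision models. It covers the end-to-end process of building predictive models, from initial problem definition and task classification to deploying models through an interactive web interface. The project spans a wide range of functional areas, including numerical computing on multidimensional arrays, exploratory data analysis, and data preprocessing routines. It provides detailed workflows for supervised and unsupervised learning, automated machine learning pipelines, hyperparameter optimization, and model evaluation using classification metrics and cross-validation. The educational content is organized as a series of notebooks that interleave Python code with narrative explanations to document data science workflows.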
Applies string transformations to standardize text formatting across data columns for preprocessing.
Accepts user-provided functions for stemming, stop-word removal, or other text preprocessing instead of imposing a built-in locale.
AiNiee is an LLM-based localization tool that automates the translation of games, books, subtitles, and documents across multiple languages. It operates as a batch processing engine, translating entire folders of files in parallel while preserving directory structure, and includes a glossary management system that enforces terminology consistency using AI-powered glossaries, forbidden terms, and user-defined text substitution rules. The tool differentiates itself through key architectural decisions: it distributes translation requests across multiple API keys to bypass rate limits and acceler
Applies user-defined substitution rules and regex patterns to modify or protect text before and after translation.
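One common way to protect text around a translation step is to swap matching spans for numbered placeholders beforehand and restore them afterwards. The sketch below illustrates that pattern only; the `[[n]]` placeholder format and the function names are assumptions and do not describe the tool's actual rule syntax. The call to `upper()` stands in for the translation step.

```python
import re

def protect(text, pattern):
    """Replace every match of pattern with a numbered placeholder."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"[[{len(saved) - 1}]]"

    return re.sub(pattern, stash, text), saved

def restore(text, saved):
    """Put the original spans back in place of their placeholders."""
    return re.sub(r"\[\[(\d+)\]\]", lambda m: saved[int(m.group(1))], text)

# Protect format tokens such as {key} and %s (an example pattern).
masked, saved = protect("Press {key} to open %s", r"\{\w+\}|%s")
translated = masked.upper()  # stand-in for the real translation call
print(restore(translated, saved))
```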
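This project is a PyTorch sentiment analysis tutorial and a deep learning implementation for text analysis. It provides a natural language processing pipeline for sequence classification, designed to clean text data and train neural networks to classify sequences of words. The implementation focuses on adapting pretrained language models to specific text classification tasks using custom datasets. It includes the process of fine-tuning large language models, as well as implementations of recurrent neural networks and Transformer models for sentiment detection. The project covers the broader areas of text sequence classification and PyTorch text processing, including preparing raw text datasets with the TorchText library and workflows for building deep learning models that classify text.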
Provides text preprocessing routines to scrub and simplify raw datasets for sequence classification.
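This is a comprehensive teaching resource and course on building neural networks with PyTorch. It covers the basic building blocks of deep learning, including tensor operations, automatic differentiation, and the construction of modular neural network components. The repository is a reference guide for several specialized areas. It provides implementation details for computer vision tasks (such as image classification, object detection, and semantic segmentation), as well as natural language processing workflows involving Transformers, recurrent networks, and generative models. It also includes reference material on generative AI, focusing on image synthesis with diffusion models and adversarial networks. The material extends to model optimization and deployment pipelines. It covers techniques for reducing model size and improving inference speed through quantization and export to formats such as ONNX and TensorRT. Other areas include data engineering for parallel loading, model evaluation with custom metrics, and deployment of open-source large language models. The project is provided mainly as a series of Jupyter notebooks.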
Converts text into indexed sequences and ensures uniform length using padding and truncation.
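The indexing and length-normalization steps can be sketched in plain Python as follows. The whitespace tokenizer, the `<pad>` and `<unk>` token names, and both helper functions are assumptions made for illustration; the course itself may rely on library utilities instead.

```python
def build_vocab(texts):
    """Map each distinct token to an integer index; 0 and 1 are reserved."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab, max_len):
    """Convert text to indices, truncate to max_len, then right-pad."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]
    ids = ids[:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

vocab = build_vocab(["the cat sat", "the dog"])
print(encode("the dog", vocab, 4))                 # short input is padded
print(encode("the cat sat on the mat", vocab, 4))  # long input is truncated
```

Every encoded sequence has exactly `max_len` entries, which is what allows a batch of them to be stacked into a single tensor.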
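tts-server-android is an Android system-level text-to-speech service that routes synthesis requests to external cloud APIs or local engines. It acts as an HTTP speech synthesis gateway, converting system speech requests into customizable HTTP requests for remote cloud services. The project includes a narration and dialogue parser that uses quotation marks to distinguish narration from dialogue, allowing different reading styles for each. It also has a voice manager and synthesis interface that implement text replacement rules and automatic retries to improve the accuracy of speech output. The service covers broader functionality, including local engine management for offline speech, cloud API routing, and speech customization through rule-based text preprocessing.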
Modifies raw input text using replacement rules to ensure correct pronunciation before synthesis.
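Rule-based replacement of this kind can be sketched as an ordered list of rules applied one after another. The rule format `(pattern, replacement, is_regex)` and the example rules are hypothetical and are not taken from the app's configuration.

```python
import re

# Hypothetical rules: (pattern, replacement, is_regex). Order matters.
RULES = [
    ("Dr.", "Doctor", False),
    (r"(\d+)km\b", r"\1 kilometers", True),
]

def apply_rules(text, rules):
    """Apply each rule in order, as a literal or a regex substitution."""
    for pattern, replacement, is_regex in rules:
        if is_regex:
            text = re.sub(pattern, replacement, text)
        else:
            text = text.replace(pattern, replacement)
    return text

print(apply_rules("Dr. Lee ran 5km.", RULES))
```

Literal rules avoid accidental regex metacharacters (the `.` in `Dr.`), while regex rules handle patterns such as a number followed by a unit.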
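CrawlerTutorial is a comprehensive Python web scraping tutorial and framework for extracting data from static and dynamic websites. It works as a web data extraction pipeline and HTTP request orchestrator, covering the full lifecycle of a crawler application from initial fetching to final data storage. The project offers specialized guidance on anti-bot bypass techniques and reverse engineering of web APIs. It includes methods for evading browser detection through identity masking and proxy rotation, as well as techniques for identifying hidden API endpoints by analyzing network traffic and request signatures. The framework includes a wide range of features, among them browser automation for JavaScript-heavy pages, automated user authentication via QR code or SMS, and session persistence management. It also has data preprocessing tools for cleaning raw text, removing duplicate records, and persisting the collected information to flat files or relational databases.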
Includes tools for cleaning raw scraped text, removing duplicate records, and transforming data into analysis-ready formats.
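As a rough illustration of the cleaning and deduplication steps, the sketch below strips markup from scraped text and drops repeated records. The function names and the case-insensitive duplicate key are assumptions; the tutorial's own tooling may differ, and a regex tag stripper like this one is only suitable for simple fragments, not arbitrary HTML.

```python
import html
import re

def clean_text(raw):
    """Strip tags, decode HTML entities, and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()

def dedupe(records):
    """Keep the first occurrence of each record, comparing case-insensitively."""
    seen = set()
    unique = []
    for record in records:
        key = record.lower()
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

cleaned = clean_text("<p>Hello &amp;  world</p>\n")
print(cleaned)
print(dedupe([cleaned, "hello & world", "Bye"]))
```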