12 repositories
Techniques and algorithms for extracting patterns and actionable insights from large datasets.
Distinct from Trend Analysis: focuses on broad data mining and pattern extraction rather than specific time-series metric monitoring.
Explore 12 awesome GitHub repositories matching data & databases · Data Mining.
Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning. The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specializ
Implements algorithms to identify semantically similar sentence pairs across different languages for unsupervised learning.
Colly is a web scraping framework and concurrent crawler written in Go. It provides a system for traversing web pages, following links, and extracting structured data from HTML and XML documents. The framework includes a distributed scraping engine designed to spread data collection tasks across multiple instances to increase throughput. It ensures compliance with website owner policies by automatically reading and respecting robots.txt files. The system manages request lifecycles through domain-based rate limiting, concurrency controls, and session management via a stateful cookie jar. It s
Distributes scraping tasks across multiple instances to increase the volume and throughput of collected web data.
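The domain-based rate limiting mentioned in the Colly description can be illustrated with a small scheduling sketch. This is written in Python rather than Go and does not use Colly's API; the function name `plan_requests`, the `delay_per_domain` parameter, and the example URLs are hypothetical. It only computes the earliest start time for each request so that requests to the same domain are spaced apart, and it does not fetch anything or read robots.txt.

```python
from urllib.parse import urlparse


def plan_requests(urls, delay_per_domain=1.0):
    """Assign each URL the earliest start time (in seconds) that keeps
    requests to the same domain at least `delay_per_domain` apart.

    Hypothetical helper for illustration; not part of Colly.
    """
    next_slot = {}  # domain -> next allowed start time
    plan = []
    for url in urls:
        domain = urlparse(url).netloc
        start = next_slot.get(domain, 0.0)
        plan.append((start, url))
        # Reserve the slot so the next request to this domain waits.
        next_slot[domain] = start + delay_per_domain
    return plan


urls = [
    "https://a.example/1",
    "https://a.example/2",
    "https://b.example/1",
    "https://a.example/3",
]
plan = plan_requests(urls, delay_per_domain=2.0)
for start, url in plan:
    print(f"t={start:.1f}s  {url}")
```

Requests to different domains can start at the same time, while requests to one domain queue behind each other, which is the basic idea behind per-domain throttling.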
This project provides a framework for managing multi-agent systems, designed to automate complex software development, infrastructure, and business workflows. It functions as a multi-agent workflow orchestrator that routes tasks to domain-specific workers while maintaining state persistence and infrastructure automation. By leveraging large language models, the system decomposes high-level objectives into actionable plans, ensuring that complex operations are executed with consistency and reliability. The framework distinguishes itself through its hierarchical agent registry and policy-driven
Extracts patterns and actionable insights from large datasets.
This project is a transformer-based framework for generating dense and sparse vector embeddings of text and multimodal data. It serves as a library for fine-tuning models to perform semantic similarity tasks, retrieval, and reranking. The system is distinguished by its support for diverse architectural patterns, including bi-encoders for fast similarity search and cross-encoders for high-precision reranking. It provides dedicated pipelines for multimodal embeddings, mapping text and images into a shared vector space, and implements knowledge distillation to compress large models into smaller,
Identifies translated sentence pairs across different language corpora using multilingual embedding alignment.
Ai-Learn is an educational repository and technical reference designed to facilitate the mastery of artificial intelligence and data science workflows. It provides a structured curriculum that combines theoretical mathematical foundations with practical coding exercises, enabling users to build predictive models, neural networks, and analytical pipelines using Python. The project distinguishes itself by emphasizing a first-principles approach to machine learning. Rather than relying solely on high-level abstractions, it guides users through the reconstruction of core algorithms from scratch,
Applies statistical methods and feature engineering to identify hidden patterns within complex datasets.
This is a large-scale collection of curated Chinese text corpora designed for training natural language processing models. The project provides a variety of datasets, including a deduplicated archive of millions of news articles with titles and keywords, high-quality categorized question-and-answer pairs, and parallel translation corpora. The collection includes millions of aligned Chinese and English sentence pairs used for cross-lingual model training and machine translation development. It also contains filtered question-and-answer data organized by label for the construction of knowledge-
Supplies aligned sentence pairs extracted for use in building machine translation and cross-lingual models.
This project is a machine learning implementation library featuring a collection of code examples that implement supervised, unsupervised, and reinforcement learning algorithms from scratch. It provides a comprehensive set of toolkits for core machine learning components, including a natural language processing toolkit, a reinforcement learning framework, and suites for data dimensionality reduction and pattern mining. The library includes specialized implementations for reinforcement learning, such as Q-Learning, Deep Q-Networks, and Actor-Critic agents. The natural language processing capab
Implements algorithms like Apriori and FP-Tree for extracting patterns and actionable insights from large datasets.
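For readers unfamiliar with Apriori, the following is a minimal sketch of frequent itemset mining, assuming transactions are given as sets of item names and support is an absolute count. It is not taken from the repository above; the function name `apriori`, the `min_support` parameter, and the toy baskets are illustrative assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    Illustrative sketch only; `min_support` is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count support of the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(baskets, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is only counted if all of its smaller subsets were already found to be frequent.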
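Superalgos is a cryptocurrency algorithmic trading platform for designing, backtesting, and deploying automated trading bots. It centers on a visual strategy designer that lets users create indicators and trading logic through a graphical interface rather than by writing code by hand. The platform features a token-gated signal network that implements a decentralized marketplace for broadcasting and monetizing trading intelligence. Access to these signals and predictions is governed by digital tokens and reputation scores, while a distributed trading infrastructure lets users coordinate data mining and high-frequency execution across a network of multiple servers. The system covers a broad range of functionality, including a historical backtesting engine, automated market data mining, and live trade execution. It incorporates machine learning for pattern recognition and provides visual debugging tools for tracing the internal runtime state of active bots. The infrastructure supports self-hosted deployment, allowing users to run the environment locally to keep control of their funds, keys, and strategies.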
Runs large-scale data operations to extract market insights for use in automated trading strategies.
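DotnetSpider is a .NET web crawling framework and programmable tool designed to traverse websites and capture structured data from web pages. It acts as a distributed crawler engine that supports automated crawling to discover and extract data. The framework is built for distributed data extraction, allowing crawl tasks to be spread across multiple servers to handle massive volumes of web content. This architecture supports high-performance web scraping and enterprise-grade data collection workflows for gathering structured information.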
Uses a distributed scraping architecture to collect high volumes of web data for analysis.
MNBVC is a dataset pipeline and toolkit designed for the collection, cleaning, and normalization of massive text and code corpora used to train large language models. It provides specialized tools for harvesting source code, commit histories, and repository metadata from version control platforms, alongside a multilingual text corpus collector for gathering parallel text and academic papers. The project distinguishes itself through comprehensive capabilities for processing diverse document types, including a PDF-to-text converter that transforms complex layouts and formulas into structured JS
Provides access to multilingual datasets where Chinese text is aligned with equivalent translations in multiple other languages.
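This project is a data mining algorithm library and machine learning reference implementation. It provides a set of tools for classification, clustering, and association rule mining, along with a toolkit for nature-inspired optimization. The library includes specialized utilities for graph and sequence mining that can extract frequent subgraphs and sequential patterns. It also has a dimensionality reduction utility that uses rough set theory to remove redundant attributes from datasets. The project covers a broad range of analytical functionality, including network and graph analysis for ranking node importance, as well as probabilistic models and decision trees for data classification. It also implements distance-based and density-based methods for grouping data, and heuristic search patterns for solving complex optimization problems.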
Provides a comprehensive library of classical data mining algorithms for classification, clustering, and association rules.
LASER is a cross-lingual sentence embedding library and multilingual text encoder. It functions as a parallel text mining tool that maps sentences from multiple languages into a shared vector space for similarity and classification tasks. The system converts raw text into fixed-length embeddings, enabling the discovery of translation pairs by calculating the vector distance between sentences. This shared representation allows for cross-lingual document classification, where a model trained on one language can be used to categorize documents in another. The library includes a sentence-piece t
Discovers translation pairs across different languages by calculating vector distance between embeddings.
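The idea of finding translation pairs by vector distance can be sketched as follows. This is not LASER's API and does not reproduce its scoring method; it assumes sentence embeddings have already been computed by some encoder and uses plain cosine similarity with a nearest-neighbor search. The function names, the `threshold` value, and the toy three-dimensional vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """For each source embedding, pick the most similar target embedding
    and keep the pair only if the similarity reaches `threshold`.

    Hypothetical helper; real systems typically use approximate search
    over far larger collections.
    """
    pairs = []
    for i, s in enumerate(src_vecs):
        scores = [cosine(s, t) for t in tgt_vecs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy embeddings standing in for encoder output (assumed, not real data).
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0], [0.5, 0.5, 0.5]]

pairs = mine_pairs(src, tgt)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine {score:.3f})")
```

The threshold is what filters out sentences with no real counterpart: the third source vector has a best match, but its similarity is too low to be reported as a pair.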