Why is facebookresearch/fairseq a recommended Data Mining repository?

Implements algorithms to identify semantically similar sentence pairs across different languages for unsupervised learning.

Why is asciimoo/colly a recommended Data Mining repository?

Distributes scraping tasks across multiple instances to increase the volume and throughput of collected web data.

Why is VoltAgent/awesome-claude-code-subagents a recommended Data Mining repository?

Extracts patterns and actionable insights from large datasets.

Why is huggingface/sentence-transformers a recommended Data Mining repository?

Identifies translated sentence pairs across different language corpora using multilingual embedding alignment.

Why is tangyudi/Ai-Learn a recommended Data Mining repository?

Applies statistical methods and feature engineering to identify hidden patterns within complex datasets.

Why is brightmart/nlp_chinese_corpus a recommended Data Mining repository?

Supplies aligned sentence pairs extracted for use in building machine translation and cross-lingual models.

Why is ljpzzz/machinelearning a recommended Data Mining repository?

Implements algorithms like Apriori and FP-Tree for extracting patterns and actionable insights from large datasets.

Why is Superalgos/Superalgos a recommended Data Mining repository?

Runs large-scale data operations to extract market insights for use in automated trading strategies.

Why is zlzforever/DotnetSpider a recommended Data Mining repository?

Uses a distributed scraping architecture to collect high volumes of web data for analysis.

Why is esbatmop/MNBVC a recommended Data Mining repository?

Provides access to multilingual datasets where Chinese text is aligned with equivalent translations in multiple other languages.

12 repositories

Awesome GitHub Repositories · Data Mining

Techniques and algorithms for extracting patterns and actionable insights from large datasets.

Distinct from Trend Analysis: focuses on broad data mining and pattern extraction rather than specific time-series metric monitoring.

Explore 12 awesome GitHub repositories matching data & databases · Data Mining. Refine with filters or upvote what's useful.


facebookresearch/fairseq
32,228 stars
Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning. The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specializ
Implements algorithms to identify semantically similar sentence pairs across different languages for unsupervised learning.
Python
asciimoo/colly
25,348 stars
Colly is a web scraping framework and concurrent crawler written in Go. It provides a system for traversing web pages, following links, and extracting structured data from HTML and XML documents. The framework includes a distributed scraping engine designed to spread data collection tasks across multiple instances to increase throughput. It ensures compliance with website owner policies by automatically reading and respecting robots.txt files. The system manages request lifecycles through domain-based rate limiting, concurrency controls, and session management via a stateful cookie jar. It s
Distributes scraping tasks across multiple instances to increase the volume and throughput of collected web data.
Go
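The crawl-politeness behavior described for Colly (respecting robots.txt and limiting request rate per domain) can be illustrated with a small sketch. This is not Colly's API, which is written in Go. It is a Python illustration under stated assumptions: the `PoliteScheduler` class, the `demo-bot` user agent, the example URLs, and the two-second delay are all hypothetical, a single robots.txt rule set is applied to every domain for brevity, and time is passed in as a plain number so the example stays deterministic.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


class PoliteScheduler:
    """Hypothetical helper: robots.txt check plus per-domain rate limiting."""

    def __init__(self, robots_lines, user_agent, min_delay):
        # Parse robots.txt rules from already-downloaded lines (no network access).
        # Simplification: the same rules are applied to every domain.
        self.parser = RobotFileParser()
        self.parser.parse(robots_lines)
        self.user_agent = user_agent
        self.min_delay = min_delay
        self.next_slot = {}  # domain -> earliest time the next request may start

    def schedule(self, url, now):
        """Return the time at which `url` may be fetched, or None if disallowed."""
        if not self.parser.can_fetch(self.user_agent, url):
            return None
        domain = urlparse(url).netloc
        # Wait until the domain's next free slot if it has been hit recently.
        start = max(now, self.next_slot.get(domain, now))
        self.next_slot[domain] = start + self.min_delay
        return start


robots = ["User-agent: *", "Disallow: /private/"]
scheduler = PoliteScheduler(robots, user_agent="demo-bot", min_delay=2.0)

first = scheduler.schedule("https://example.com/a", now=0.0)
second = scheduler.schedule("https://example.com/b", now=0.5)
other = scheduler.schedule("https://other.org/x", now=0.5)
blocked = scheduler.schedule("https://example.com/private/data", now=1.0)
```

In this sketch the second request to example.com is pushed back to the end of the delay window, the request to a different domain is not delayed, and the disallowed path is rejected outright.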
VoltAgent/awesome-claude-code-subagents
21,906 stars
This project provides a framework for managing multi-agent systems, designed to automate complex software development, infrastructure, and business workflows. It functions as a multi-agent workflow orchestrator that routes tasks to domain-specific workers while maintaining state persistence and infrastructure automation. By leveraging large language models, the system decomposes high-level objectives into actionable plans, ensuring that complex operations are executed with consistency and reliability. The framework distinguishes itself through its hierarchical agent registry and policy-driven
Extracts patterns and actionable insights from large datasets.
Shell · ai-agent-framework · ai-agent-tools · ai-agents
huggingface/sentence-transformers
18,817 stars
This project is a transformer-based framework for generating dense and sparse vector embeddings of text and multimodal data. It serves as a library for fine-tuning models to perform semantic similarity tasks, retrieval, and reranking. The system is distinguished by its support for diverse architectural patterns, including bi-encoders for fast similarity search and cross-encoders for high-precision reranking. It provides dedicated pipelines for multimodal embeddings, mapping text and images into a shared vector space, and implements knowledge distillation to compress large models into smaller,
Identifies translated sentence pairs across different language corpora using multilingual embedding alignment.
Python
tangyudi/Ai-Learn
13,065 stars
Ai-Learn is an educational repository and technical reference designed to facilitate the mastery of artificial intelligence and data science workflows. It provides a structured curriculum that combines theoretical mathematical foundations with practical coding exercises, enabling users to build predictive models, neural networks, and analytical pipelines using Python. The project distinguishes itself by emphasizing a first-principles approach to machine learning. Rather than relying solely on high-level abstractions, it guides users through the reconstruction of core algorithms from scratch,
Applies statistical methods and feature engineering to identify hidden patterns within complex datasets.
algorithm · artificial-intelligence · caffe
brightmart/nlp_chinese_corpus
9,903 stars
This is a large-scale collection of curated Chinese text corpora designed for training natural language processing models. The project provides a variety of datasets, including a deduplicated archive of millions of news articles with titles and keywords, high-quality categorized question-and-answer pairs, and parallel translation corpora. The collection includes millions of aligned Chinese and English sentence pairs used for cross-lingual model training and machine translation development. It also contains filtered question-and-answer data organized by label for the construction of knowledge-
Supplies aligned sentence pairs extracted for use in building machine translation and cross-lingual models.
bert · chinese · chinese-corpus
ljpzzz/machinelearning
8,706 stars
This project is a machine learning implementation library featuring a collection of code examples that implement supervised, unsupervised, and reinforcement learning algorithms from scratch. It provides a comprehensive set of toolkits for core machine learning components, including a natural language processing toolkit, a reinforcement learning framework, and suites for data dimensionality reduction and pattern mining. The library includes specialized implementations for reinforcement learning, such as Q-Learning, Deep Q-Networks, and Actor-Critic agents. The natural language processing capab
Implements algorithms like Apriori and FP-Tree for extracting patterns and actionable insights from large datasets.
Jupyter Notebook · algorithms · machinelearning · reinforcementlearning
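For readers unfamiliar with Apriori, the frequent-itemset algorithm named in the entry above, here is a minimal sketch of the idea. It is not taken from the ljpzzz/machinelearning repository. The `apriori` function, the `baskets` data, and the support threshold are illustrative assumptions, and the code favors clarity over efficiency.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset (illustrative)."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count the transactions containing each candidate; keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = count({frozenset([i]) for i in items})
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Made-up market-basket data for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
result = apriori(baskets, min_support=3)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is only counted if all of its smaller subsets were already found frequent, so in this data the pair of bread and beer is dropped and no three-item candidate containing it is ever counted.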
Superalgos/Superalgos
5,536 stars
Superalgos is an algorithmic cryptocurrency trading platform used to design, test, and deploy automated trading bots. It centers on a visual strategy designer that lets users create indicators and trading logic through a graphical interface instead of writing code by hand. The platform features a token-gated signals network that enables a decentralized marketplace for broadcasting and monetizing trading intelligence. Access to these signals and predictions is managed through digital tokens and reputation scores, while the distributed trading infrastructure lets users coordinate data mining and high-volume execution across a network of multiple servers. The system covers a wide range of capabilities, including backtesting engines, automated market data mining, and live trade execution. It integrates machine learning for pattern recognition and provides visual debugging tools for tracing the internal runtime state of active bots. The infrastructure supports self-hosted deployments, allowing users to run the environment on their own premises to retain control over funds, keys, and strategies.
Runs large-scale data operations to extract market insights for use in automated trading strategies.
JavaScript · algorithmic-trading · algotrading · bitcoin-trading
zlzforever/DotnetSpider
4,136 stars
DotnetSpider is a .NET web crawling framework and programmable tool designed to traverse websites and capture structured data from web pages. It works as a distributed crawling engine that automates web crawling for data discovery and extraction. The framework is designed for distributed data extraction, allowing crawl tasks to be spread across multiple servers to process large volumes of web content. This architecture supports high-performance web scraping and enterprise data collection workflows for gathering structured information.
Uses a distributed scraping architecture to collect high volumes of web data for analysis.
C#
esbatmop/MNBVC
4,123 stars
MNBVC is a dataset pipeline and toolkit designed for the collection, cleaning, and normalization of massive text and code corpora used to train large language models. It provides specialized tools for harvesting source code, commit histories, and repository metadata from version control platforms, alongside a multilingual text corpus collector for gathering parallel text and academic papers. The project distinguishes itself through comprehensive capabilities for processing diverse document types, including a PDF-to-text converter that transforms complex layouts and formulas into structured JS
Provides access to multilingual datasets where Chinese text is aligned with equivalent translations in multiple other languages.
chinese · chinese-language · chinese-nlp
linyiqun/DataMiningAlgorithm
3,950 stars
This project is a data mining algorithm library and machine learning reference implementation. It provides a collection of tools for performing classification, clustering, and association rule mining, as well as a toolkit for nature-inspired optimization. The library includes specialized utilities for graph and sequence mining, enabling the extraction of frequent subgraphs and sequential patterns. It also features a dimensionality reduction utility that uses rough set theory to remove redundant attributes from datasets. The project covers a broad range of analytical capabilities, including n
Provides a comprehensive library of classical data mining algorithms for classification, clustering, and association rules.
Java
facebookresearch/LASER
3,659 stars
LASER is a cross-lingual sentence embedding library and multilingual text encoder. It functions as a parallel text mining tool that maps sentences from multiple languages into a shared vector space for similarity and classification tasks. The system converts raw text into fixed-length embeddings, enabling the discovery of translation pairs by calculating the vector distance between sentences. This shared representation allows for cross-lingual document classification, where a model trained on one language can be used to categorize documents in another. The library includes a sentence-piece t
Discovers translation pairs across different languages by calculating vector distance between embeddings.
Jupyter Notebook
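The idea of discovering translation pairs from vector distance in a shared embedding space can be sketched as follows. This does not call LASER. The three-dimensional vectors are hand-written stand-ins for encoder output, and `mine_pairs`, the example sentences, and the 0.9 threshold are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(src, tgt, threshold=0.9):
    """Pair each source sentence with its nearest target if similar enough."""
    pairs = []
    for s_text, s_vec in src.items():
        # Nearest neighbor in the target language by cosine similarity.
        best_text, best_sim = max(
            ((t_text, cosine_similarity(s_vec, t_vec)) for t_text, t_vec in tgt.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= threshold:
            pairs.append((s_text, best_text))
    return pairs


# Toy embeddings: in practice these would come from a multilingual encoder.
english = {
    "The cat sleeps.": [0.9, 0.1, 0.0],
    "It is raining.": [0.0, 0.2, 0.9],
    "I like tea.": [0.1, 0.9, 0.1],
}
french = {
    "Il pleut.": [0.1, 0.1, 0.95],
    "Le chat dort.": [0.85, 0.15, 0.05],
}
mined = mine_pairs(english, french)
```

With these toy vectors, two English sentences find a close French neighbor, while "I like tea." has no sufficiently similar counterpart and is left unpaired. A fixed threshold is the simplest possible criterion and is used here only to keep the sketch short.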

Awesome Data Mining GitHub Repositories

facebookresearch/fairseq

asciimoo/colly

VoltAgent/awesome-claude-code-subagents

huggingface/sentence-transformers

tangyudi/Ai-Learn

brightmart/nlp_chinese_corpus

ljpzzz/machinelearning

Superalgos/Superalgos

zlzforever/DotnetSpider

esbatmop/MNBVC

linyiqun/DataMiningAlgorithm

facebookresearch/LASER
