Why is facebookresearch/fairseq a recommended data mining repository?

Implements algorithms to identify semantically similar sentence pairs across different languages for unsupervised learning.

Why is asciimoo/colly a recommended data mining repository?

Distributes scraping tasks across multiple instances to increase the volume and throughput of collected web data.

Why is voltagent/awesome-claude-code-subagents a recommended data mining repository?

Extracts patterns and actionable insights from large datasets.

Why is huggingface/sentence-transformers a recommended data mining repository?

Identifies translated sentence pairs across different language corpora using multilingual embedding alignment.

Why is tangyudi/ai-learn a recommended data mining repository?

Applies statistical methods and feature engineering to identify hidden patterns within complex datasets.

Why is brightmart/nlp_chinese_corpus a recommended data mining repository?

Supplies aligned sentence pairs extracted for use in building machine translation and cross-lingual models.

Why is ljpzzz/machinelearning a recommended data mining repository?

Implements algorithms like Apriori and FP-Tree for extracting patterns and actionable insights from large datasets.

Why is superalgos/superalgos a recommended data mining repository?

Runs large-scale data operations to extract market insights for use in automated trading strategies.

Why is zlzforever/dotnetspider a recommended data mining repository?

Uses a distributed scraping architecture to collect high volumes of web data for analysis.

Why is esbatmop/mnbvc a recommended data mining repository?

Provides access to multilingual datasets where Chinese text is aligned with equivalent translations in multiple other languages.

12 Repos

Awesome GitHub Repositories: Data Mining

Techniques and algorithms for extracting patterns and actionable insights from large datasets.

Distinct from Trend Analysis: focuses on broad data mining and pattern extraction rather than specific time-series metric monitoring.

Explore 12 awesome GitHub repositories matching data & databases · Data Mining.


facebookresearch/fairseq
32,228 stars
Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning. The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specializ
Implements algorithms to identify semantically similar sentence pairs across different languages for unsupervised learning.
Python
asciimoo/colly
25,348 stars
Colly is a web scraping framework and concurrent crawler written in Go. It provides a system for traversing web pages, following links, and extracting structured data from HTML and XML documents. The framework includes a distributed scraping engine designed to spread data collection tasks across multiple instances to increase throughput. It ensures compliance with website owner policies by automatically reading and respecting robots.txt files. The system manages request lifecycles through domain-based rate limiting, concurrency controls, and session management via a stateful cookie jar. It s
Distributes scraping tasks across multiple instances to increase the volume and throughput of collected web data.
Go
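The robots.txt compliance mentioned in the Colly description can be illustrated with a short Python sketch. It uses the standard library's urllib.robotparser rather than Colly's own Go API, and the robots.txt rules, the bot name "example-bot", and the example.com URLs are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download this file
# from the target site before requesting any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a reusable rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list) -> list:
    """Keep only the URLs that the robots.txt rules permit for this agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


parser = build_parser(ROBOTS_TXT)
urls = [
    "https://example.com/public/page",
    "https://example.com/private/data",
]
crawlable = allowed_urls(parser, "example-bot", urls)
print(crawlable)
```

Filtering the queue before any request is sent keeps disallowed paths out of the crawl entirely, which is the behaviour the description attributes to the framework.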
VoltAgent/awesome-claude-code-subagents
21,906 stars
This project provides a framework for managing multi-agent systems, designed to automate complex software development, infrastructure, and business workflows. It functions as a multi-agent workflow orchestrator that routes tasks to domain-specific workers while maintaining state persistence and infrastructure automation. By leveraging large language models, the system decomposes high-level objectives into actionable plans, ensuring that complex operations are executed with consistency and reliability. The framework distinguishes itself through its hierarchical agent registry and policy-driven
Extracts patterns and actionable insights from large datasets.
Shell · ai-agent-framework · ai-agent-tools · ai-agents
huggingface/sentence-transformers
18,817 stars
This project is a transformer-based framework for generating dense and sparse vector embeddings of text and multimodal data. It serves as a library for fine-tuning models to perform semantic similarity tasks, retrieval, and reranking. The system is distinguished by its support for diverse architectural patterns, including bi-encoders for fast similarity search and cross-encoders for high-precision reranking. It provides dedicated pipelines for multimodal embeddings, mapping text and images into a shared vector space, and implements knowledge distillation to compress large models into smaller,
Identifies translated sentence pairs across different language corpora using multilingual embedding alignment.
Python
tangyudi/Ai-Learn
13,065 stars
Ai-Learn is an educational repository and technical reference designed to facilitate the mastery of artificial intelligence and data science workflows. It provides a structured curriculum that combines theoretical mathematical foundations with practical coding exercises, enabling users to build predictive models, neural networks, and analytical pipelines using Python. The project distinguishes itself by emphasizing a first-principles approach to machine learning. Rather than relying solely on high-level abstractions, it guides users through the reconstruction of core algorithms from scratch,
Applies statistical methods and feature engineering to identify hidden patterns within complex datasets.
algorithm · artificial-intelligence · caffe
brightmart/nlp_chinese_corpus
9,903 stars
This is a large-scale collection of curated Chinese text corpora designed for training natural language processing models. The project provides a variety of datasets, including a deduplicated archive of millions of news articles with titles and keywords, high-quality categorized question-and-answer pairs, and parallel translation corpora. The collection includes millions of aligned Chinese and English sentence pairs used for cross-lingual model training and machine translation development. It also contains filtered question-and-answer data organized by label for the construction of knowledge-
Supplies aligned sentence pairs extracted for use in building machine translation and cross-lingual models.
bert · chinese · chinese-corpus
ljpzzz/machinelearning
8,706 stars
This project is a machine learning implementation library featuring a collection of code examples that implement supervised, unsupervised, and reinforcement learning algorithms from scratch. It provides a comprehensive set of toolkits for core machine learning components, including a natural language processing toolkit, a reinforcement learning framework, and suites for data dimensionality reduction and pattern mining. The library includes specialized implementations for reinforcement learning, such as Q-Learning, Deep Q-Networks, and Actor-Critic agents. The natural language processing capab
Implements algorithms like Apriori and FP-Tree for extracting patterns and actionable insights from large datasets.
Jupyter Notebook · algorithms · machinelearning · reinforcementlearning
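To make the Apriori reference concrete, here is a minimal frequent-itemset sketch in plain Python. It is not taken from the ljpzzz/machinelearning repository; the function name `apriori`, the absolute `min_support` count, and the toy shopping baskets are assumptions chosen for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least
    `min_support` transactions (an absolute count, not a ratio)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into size-k candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count support by scanning the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data: five shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
result = apriori(baskets, min_support=3)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what separates Apriori from brute-force counting: a candidate is only counted if all of its smaller subsets were already frequent. FP-Tree methods avoid the repeated transaction scans altogether and are not shown here.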
Superalgos/Superalgos
5,536 stars
Superalgos is a platform for algorithmic cryptocurrency trading, used to design, backtest, and deploy automated trading bots. It centers on a visual strategy designer that lets users build indicators and trading logic through a graphical interface instead of writing code by hand. The platform features a token-driven signal network that provides a decentralized marketplace for transmitting and monetizing trading intelligence. Access to these signals and predictions is managed through digital tokens and reputation scores, while a distributed trading infrastructure lets users coordinate data mining and high-volume execution across a network of multiple servers. The system covers a broad range of functions, including historical backtesting engines, automated market data mining, and live trade execution. It integrates machine learning for pattern recognition and offers visual debugging tools for tracing the internal runtime state of active bots. The infrastructure supports self-hosted deployments, so users can run the environment locally and retain control over funds, keys, and strategies.
Runs large-scale data operations to extract market insights for use in automated trading strategies.
JavaScript · algorithmic-trading · algotrading · bitcoin-trading
zlzforever/DotnetSpider
4,136 stars
DotnetSpider is a .NET web crawler framework for traversing websites and capturing structured data from web pages. Its distributed crawling engine spreads crawling tasks across multiple servers to process large volumes of web content, supporting high-performance web scraping and enterprise data collection workflows.
Uses a distributed scraping architecture to collect high volumes of web data for analysis.
C#
esbatmop/MNBVC
4,123 stars
MNBVC is a dataset pipeline and toolkit designed for the collection, cleaning, and normalization of massive text and code corpora used to train large language models. It provides specialized tools for harvesting source code, commit histories, and repository metadata from version control platforms, alongside a multilingual text corpus collector for gathering parallel text and academic papers. The project distinguishes itself through comprehensive capabilities for processing diverse document types, including a PDF-to-text converter that transforms complex layouts and formulas into structured JS
Provides access to multilingual datasets where Chinese text is aligned with equivalent translations in multiple other languages.
chinese · chinese-language · chinese-nlp
linyiqun/DataMiningAlgorithm
3,950 stars
This project is a data mining algorithm library and a reference implementation for machine learning. It provides a collection of tools for classification, clustering, and association rule mining, along with a toolkit for nature-inspired optimization. The library includes specialized utilities for graph and sequence mining that extract frequent subgraphs and sequential patterns. It also has a dimensionality reduction utility that uses rough set theory to remove redundant attributes from datasets. The project covers a broad range of analytical capabilities, including network and graph analysis for assessing node importance, and probabilistic models and decision trees for data classification. It also implements distance-based and density-based methods for grouping data, as well as heuristic search patterns for solving complex optimization problems.
Provides a comprehensive library of classical data mining algorithms for classification, clustering, and association rules.
Java
facebookresearch/LASER
3,659 stars
LASER is a cross-lingual sentence embedding library and multilingual text encoder. It functions as a parallel text mining tool that maps sentences from multiple languages into a shared vector space for similarity and classification tasks. The system converts raw text into fixed-length embeddings, enabling the discovery of translation pairs by calculating the vector distance between sentences. This shared representation allows for cross-lingual document classification, where a model trained on one language can be used to categorize documents in another. The library includes a sentence-piece t
Discovers translation pairs across different languages by calculating vector distance between embeddings.
Jupyter Notebook
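The idea of discovering translation pairs from vector distance can be sketched in a few lines of Python. This does not call LASER itself: the three-dimensional embeddings are hand-made stand-ins for real encoder output, the names `cosine` and `mine_pairs` are hypothetical, and plain cosine similarity with a mutual-nearest-neighbour check is a simplification of what a production mining pipeline would use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mine_pairs(src, tgt, threshold=0.9):
    """Pair each source sentence with its closest target sentence, keeping
    the pair only if the match is mutual and the similarity is high enough.

    `src` and `tgt` map sentence text to an embedding vector."""
    pairs = []
    for s, s_vec in src.items():
        best_t = max(tgt, key=lambda t: cosine(s_vec, tgt[t]))
        # Mutual check: the target's closest source must be this sentence.
        best_s = max(src, key=lambda x: cosine(src[x], tgt[best_t]))
        score = cosine(s_vec, tgt[best_t])
        if best_s == s and score >= threshold:
            pairs.append((s, best_t, score))
    return pairs


# Hypothetical embeddings in a shared space (made up for illustration).
english = {
    "The cat sleeps.": [0.90, 0.10, 0.00],
    "I like tea.": [0.10, 0.90, 0.10],
}
german = {
    "Die Katze schläft.": [0.88, 0.12, 0.02],
    "Ich mag Tee.": [0.12, 0.85, 0.10],
    "Es regnet.": [0.00, 0.10, 0.95],
}

pairs = mine_pairs(english, german)
for src_sentence, tgt_sentence, score in pairs:
    print(f"{src_sentence} <-> {tgt_sentence} ({score:.3f})")
```

Because both languages share one vector space, no dictionary or alignment table is needed; the unmatched sentence "Es regnet." is simply left out because no source sentence lies close to it.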