12 Repos
Techniques and algorithms for extracting patterns and actionable insights from large datasets.
Distinct from Trend Analysis: focuses on broad data mining and pattern extraction rather than specific time-series metric monitoring.
Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning. The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specializ
Implements algorithms to identify semantically similar sentence pairs across different languages for unsupervised learning.
Colly is a web scraping framework and concurrent crawler written in Go. It provides a system for traversing web pages, following links, and extracting structured data from HTML and XML documents. The framework includes a distributed scraping engine designed to spread data collection tasks across multiple instances to increase throughput. It ensures compliance with website owner policies by automatically reading and respecting robots.txt files. The system manages request lifecycles through domain-based rate limiting, concurrency controls, and session management via a stateful cookie jar. It s
Distributes scraping tasks across multiple instances to increase the volume and throughput of collected web data.
This project provides a framework for managing multi-agent systems, designed to automate complex software development, infrastructure, and business workflows. It functions as a multi-agent workflow orchestrator that routes tasks to domain-specific workers while maintaining state persistence and infrastructure automation. By leveraging large language models, the system decomposes high-level objectives into actionable plans, ensuring that complex operations are executed with consistency and reliability. The framework distinguishes itself through its hierarchical agent registry and policy-driven
Extracts patterns and actionable insights from large datasets.
This project is a transformer-based framework for generating dense and sparse vector embeddings of text and multimodal data. It serves as a library for fine-tuning models to perform semantic similarity tasks, retrieval, and reranking. The system is distinguished by its support for diverse architectural patterns, including bi-encoders for fast similarity search and cross-encoders for high-precision reranking. It provides dedicated pipelines for multimodal embeddings, mapping text and images into a shared vector space, and implements knowledge distillation to compress large models into smaller,
Identifies translated sentence pairs across different language corpora using multilingual embedding alignment.
Ai-Learn is an educational repository and technical reference designed to facilitate the mastery of artificial intelligence and data science workflows. It provides a structured curriculum that combines theoretical mathematical foundations with practical coding exercises, enabling users to build predictive models, neural networks, and analytical pipelines using Python. The project distinguishes itself by emphasizing a first-principles approach to machine learning. Rather than relying solely on high-level abstractions, it guides users through the reconstruction of core algorithms from scratch,
Applies statistical methods and feature engineering to identify hidden patterns within complex datasets.
This is a large-scale collection of curated Chinese text corpora designed for training natural language processing models. The project provides a variety of datasets, including a deduplicated archive of millions of news articles with titles and keywords, high-quality categorized question-and-answer pairs, and parallel translation corpora. The collection includes millions of aligned Chinese and English sentence pairs used for cross-lingual model training and machine translation development. It also contains filtered question-and-answer data organized by label for the construction of knowledge-
Supplies aligned sentence pairs extracted for use in building machine translation and cross-lingual models.
This project is a machine learning implementation library featuring a collection of code examples that implement supervised, unsupervised, and reinforcement learning algorithms from scratch. It provides a comprehensive set of toolkits for core machine learning components, including a natural language processing toolkit, a reinforcement learning framework, and suites for data dimensionality reduction and pattern mining. The library includes specialized implementations for reinforcement learning, such as Q-Learning, Deep Q-Networks, and Actor-Critic agents. The natural language processing capab
Implements algorithms like Apriori and FP-Tree for extracting patterns and actionable insights from large datasets.
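For readers unfamiliar with Apriori, the idea behind frequent-itemset mining can be sketched in a few lines of Python. This is a minimal illustration and not code from the repository above; the function name `apriori`, the `min_support` parameter (an absolute transaction count), and the sample baskets are all assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset.

    `min_support` is an absolute count (hypothetical parameter name).
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count how many transactions contain each single item.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune step: every (k-1)-subset must itself be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Count support for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


if __name__ == "__main__":
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]
    for itemset, support in sorted(apriori(baskets, 2).items(),
                                   key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), support)
```

The prune step is what distinguishes Apriori from brute-force counting: a candidate is only counted if all of its smaller subsets were already found to be frequent.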
Superalgos is a platform for algorithmic cryptocurrency trading, used to design, backtest, and deploy automated trading bots. It centers on a visual strategy designer that lets users build indicators and trading logic through a graphical interface instead of writing code by hand. The platform has a token-governed signal network that provides a decentralized marketplace for transmitting and monetizing trading intelligence. Access to these signals and predictions is managed through digital tokens and reputation scores, while a distributed trading infrastructure lets users coordinate data mining and high-volume execution across a network of multiple servers. The system covers a broad range of functions, including historical backtesting engines, automated market data mining, and live trade execution. It integrates machine learning for pattern recognition and offers visual debugging tools for tracing the internal runtime state of active bots. The infrastructure supports self-hosted deployments, allowing users to run the environment locally and retain control over funds, keys, and strategies.
Runs large-scale data operations to extract market insights for use in automated trading strategies.
DotnetSpider is a .NET web crawler framework and programmable tool designed for traversing websites and capturing structured data from web pages. It functions as a distributed crawling engine that enables the automation of web crawling to discover and extract data. The framework is designed for distributed data extraction, allowing crawling tasks to be spread across multiple servers to process large volumes of web content. This architecture supports high-performance web scraping and enterprise data collection workflows for gathering structured information.
Uses a distributed scraping architecture to collect high volumes of web data for analysis.
MNBVC is a dataset pipeline and toolkit designed for the collection, cleaning, and normalization of massive text and code corpora used to train large language models. It provides specialized tools for harvesting source code, commit histories, and repository metadata from version control platforms, alongside a multilingual text corpus collector for gathering parallel text and academic papers. The project distinguishes itself through comprehensive capabilities for processing diverse document types, including a PDF-to-text converter that transforms complex layouts and formulas into structured JS
Provides access to multilingual datasets where Chinese text is aligned with equivalent translations in multiple other languages.
This project is a data mining algorithm library and a reference implementation for machine learning. It offers a collection of tools for classification, clustering, and association rule mining, along with a toolkit for nature-inspired optimization. The library contains specialized utilities for graph and sequence mining that enable the extraction of frequent subgraphs and sequential patterns. It also has a dimensionality reduction utility that uses rough set theory to remove redundant attributes from datasets. The project covers a broad range of analytical capabilities, including network and graph analysis for assessing node importance, and the use of probabilistic models and decision trees for data classification. It also implements distance-based and density-based methods for grouping data, as well as heuristic search patterns for solving complex optimization problems.
Provides a comprehensive library of classical data mining algorithms for classification, clustering, and association rules.
LASER is a cross-lingual sentence embedding library and multilingual text encoder. It functions as a parallel text mining tool that maps sentences from multiple languages into a shared vector space for similarity and classification tasks. The system converts raw text into fixed-length embeddings, enabling the discovery of translation pairs by calculating the vector distance between sentences. This shared representation allows for cross-lingual document classification, where a model trained on one language can be used to categorize documents in another. The library includes a sentence-piece t
Discovers translation pairs across different languages by calculating vector distance between embeddings.
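The general idea of finding translation pairs by vector distance can be sketched as a nearest-neighbor search over sentence embeddings. The example below is a simplified illustration in plain Python, not LASER's API or its actual scoring method; `cosine`, `mine_pairs`, the `threshold` value, and the toy three-dimensional vectors are all assumptions chosen for the example. Real embeddings would come from a multilingual encoder and have far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """For each source vector, keep its closest target if similar enough.

    Returns a list of (source_index, target_index, similarity) tuples.
    `threshold` is a hypothetical cutoff picked for this toy example.
    """
    if not tgt_vecs:
        return []
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        j = max(range(len(tgt_vecs)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


if __name__ == "__main__":
    # Toy stand-ins for sentence embeddings in a shared vector space.
    source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]
    target = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0]]
    for i, j, score in mine_pairs(source, target):
        print(f"source {i} <-> target {j} (similarity {score:.3f})")
```

A plain similarity threshold is the simplest possible filter; it tends to favor sentences that are close to many candidates, which is why production mining pipelines usually apply more careful scoring.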