12 repositories
Techniques and algorithms for extracting patterns and actionable insights from large datasets.
Distinct from Trend Analysis: focuses on broad data mining and pattern extraction rather than specific time-series metric monitoring.
Explore 12 awesome GitHub repositories matching Data & Databases · Data Mining.
Fairseq is a PyTorch toolkit for sequence-to-sequence modeling, specializing in neural machine translation, automatic speech recognition, and large-scale language model training. It provides a framework for processing and aligning diverse data sources, including text, audio, and video, to support tasks such as speech-to-text conversion and multimodal sequence learning. The project is distinguished by its distributed training capabilities, which utilize parameter sharding, mixed-precision training, and CPU offloading to handle models that exceed single-device memory. It also includes specializ
Implements algorithms to identify semantically similar sentence pairs across different languages for unsupervised learning.
Colly is a web scraping framework and concurrent crawler written in Go. It provides a system for traversing web pages, following links, and extracting structured data from HTML and XML documents. The framework includes a distributed scraping engine designed to spread data collection tasks across multiple instances to increase throughput. It ensures compliance with website owner policies by automatically reading and respecting robots.txt files. The system manages request lifecycles through domain-based rate limiting, concurrency controls, and session management via a stateful cookie jar. It s
Distributes scraping tasks across multiple instances to increase the volume and throughput of collected web data.
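The robots.txt compliance and domain-based rate limiting described above can be sketched in a few lines. This is an illustrative Python sketch using only the standard library, not Colly's Go API; the `DomainRateLimiter` class, the sample robots.txt content, and the `example.com` URLs are all hypothetical.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it per site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class DomainRateLimiter:
    """Hypothetical helper: enforces a minimum delay between requests per domain."""

    def __init__(self, min_delay: float) -> None:
        self.min_delay = min_delay
        self.next_allowed = {}  # domain -> earliest time the next request may start

    def wait_time(self, url: str, now: float) -> float:
        """Return how long to wait before requesting `url`, and reserve that slot."""
        domain = urlparse(url).netloc
        earliest = self.next_allowed.get(domain, now)
        wait = max(0.0, earliest - now)
        self.next_allowed[domain] = now + wait + self.min_delay
        return wait


parser = build_parser(ROBOTS_TXT)
limiter = DomainRateLimiter(min_delay=2.0)

allowed_public = parser.can_fetch("*", "https://example.com/public/page")
allowed_private = parser.can_fetch("*", "https://example.com/private/page")

first_wait = limiter.wait_time("https://example.com/a", now=0.0)
second_wait = limiter.wait_time("https://example.com/b", now=0.5)
other_domain_wait = limiter.wait_time("https://example.org/a", now=0.5)
```

Passing the clock value in as `now` keeps the limiter deterministic; a real crawler would pass `time.monotonic()` and sleep for the returned duration before issuing the request.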
This project provides a framework for managing multi-agent systems, designed to automate complex software development, infrastructure, and business workflows. It functions as a multi-agent workflow orchestrator that routes tasks to domain-specific workers while maintaining state persistence and infrastructure automation. By leveraging large language models, the system decomposes high-level objectives into actionable plans, ensuring that complex operations are executed with consistency and reliability. The framework distinguishes itself through its hierarchical agent registry and policy-driven
Extracts patterns and actionable insights from large datasets.
This project is a transformer-based framework for generating dense and sparse vector embeddings of text and multimodal data. It serves as a library for fine-tuning models to perform semantic similarity tasks, retrieval, and reranking. The system is distinguished by its support for diverse architectural patterns, including bi-encoders for fast similarity search and cross-encoders for high-precision reranking. It provides dedicated pipelines for multimodal embeddings, mapping text and images into a shared vector space, and implements knowledge distillation to compress large models into smaller,
Identifies translated sentence pairs across different language corpora using multilingual embedding alignment.
Ai-Learn is an educational repository and technical reference designed to facilitate the mastery of artificial intelligence and data science workflows. It provides a structured curriculum that combines theoretical mathematical foundations with practical coding exercises, enabling users to build predictive models, neural networks, and analytical pipelines using Python. The project distinguishes itself by emphasizing a first-principles approach to machine learning. Rather than relying solely on high-level abstractions, it guides users through the reconstruction of core algorithms from scratch,
Applies statistical methods and feature engineering to identify hidden patterns within complex datasets.
This is a large-scale collection of curated Chinese text corpora designed for training natural language processing models. The project provides a variety of datasets, including a deduplicated archive of millions of news articles with titles and keywords, high-quality categorized question-and-answer pairs, and parallel translation corpora. The collection includes millions of aligned Chinese and English sentence pairs used for cross-lingual model training and machine translation development. It also contains filtered question-and-answer data organized by label for the construction of knowledge-
Supplies aligned sentence pairs extracted for use in building machine translation and cross-lingual models.
This project is a machine learning implementation library featuring a collection of code examples that implement supervised, unsupervised, and reinforcement learning algorithms from scratch. It provides a comprehensive set of toolkits for core machine learning components, including a natural language processing toolkit, a reinforcement learning framework, and suites for data dimensionality reduction and pattern mining. The library includes specialized implementations for reinforcement learning, such as Q-Learning, Deep Q-Networks, and Actor-Critic agents. The natural language processing capab
Implements algorithms like Apriori and FP-Tree for extracting patterns and actionable insights from large datasets.
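As a rough illustration of the Apriori idea mentioned above, the following minimal Python sketch finds frequent itemsets level by level. It is not taken from the listed repository; the `apriori` function, the sample baskets, and the support threshold are assumptions chosen for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support.

    Support is the fraction of transactions that contain the itemset.
    """
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / total

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


# Hypothetical shopping baskets used only for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(baskets, min_support=0.5)
```

With these baskets, the pair `{milk, butter}` appears in only one of four transactions, so it is dropped, and the prune step then rejects the three-item candidate without counting it.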
Superalgos is an algorithmic cryptocurrency trading platform used to design, test, and deploy automated trading bots. It centers on a visual strategy designer that lets users create indicators and trading logic through a graphical interface instead of writing code by hand. The platform features a token-gated signal network that enables a decentralized marketplace for broadcasting and monetizing trading intelligence. Access to these signals and predictions is managed through digital tokens and reputation scores, while the distributed trading infrastructure lets users coordinate data mining and high-volume execution across a network of multiple servers. The system covers a wide range of capabilities, including backtesting engines, automated market data mining, and live trade execution. It integrates machine learning for pattern recognition and provides visual debugging tools for tracing the internal runtime state of active bots. The infrastructure supports self-hosted deployments, allowing users to run the environment on their own premises to retain control over funds, keys, and strategies.
Runs large-scale data operations to extract market insights for use in automated trading strategies.
DotnetSpider is a web crawling framework for .NET and a programmable tool designed to traverse websites and capture structured data from web pages. It works as a distributed crawling engine that automates web crawling for data discovery and extraction. The framework is designed for distributed data extraction, allowing crawl tasks to be spread across multiple servers to process large volumes of web content. This architecture supports high-performance web scraping and enterprise data collection workflows for gathering structured information.
Uses a distributed scraping architecture to collect high volumes of web data for analysis.
MNBVC is a dataset pipeline and toolkit designed for the collection, cleaning, and normalization of massive text and code corpora used to train large language models. It provides specialized tools for harvesting source code, commit histories, and repository metadata from version control platforms, alongside a multilingual text corpus collector for gathering parallel text and academic papers. The project distinguishes itself through comprehensive capabilities for processing diverse document types, including a PDF-to-text converter that transforms complex layouts and formulas into structured JS
Provides access to multilingual datasets where Chinese text is aligned with equivalent translations in multiple other languages.
This project is a data mining algorithm library and machine learning reference implementation. It provides a collection of tools for performing classification, clustering, and association rule mining, as well as a toolkit for nature-inspired optimization. The library includes specialized utilities for graph and sequence mining, enabling the extraction of frequent subgraphs and sequential patterns. It also features a dimensionality reduction utility that uses rough set theory to remove redundant attributes from datasets. The project covers a broad range of analytical capabilities, including n
Provides a comprehensive library of classical data mining algorithms for classification, clustering, and association rules.
LASER is a cross-lingual sentence embedding library and multilingual text encoder. It functions as a parallel text mining tool that maps sentences from multiple languages into a shared vector space for similarity and classification tasks. The system converts raw text into fixed-length embeddings, enabling the discovery of translation pairs by calculating the vector distance between sentences. This shared representation allows for cross-lingual document classification, where a model trained on one language can be used to categorize documents in another. The library includes a sentence-piece t
Discovers translation pairs across different languages by calculating vector distance between embeddings.
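The idea of finding translation pairs by vector distance can be illustrated with a small Python sketch. It does not use the LASER library: the three-dimensional embeddings below are made-up toy values, and `mine_pairs` with its similarity threshold is a hypothetical simplification based on plain cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_pairs(source, target, threshold):
    """For each source sentence, keep its nearest target if similar enough."""
    pairs = []
    for src_text, src_vec in source.items():
        best_text, best_score = max(
            ((tgt_text, cosine(src_vec, tgt_vec)) for tgt_text, tgt_vec in target.items()),
            key=lambda candidate: candidate[1],
        )
        if best_score >= threshold:
            pairs.append((src_text, best_text, best_score))
    return pairs


# Toy embeddings (assumed values); a real encoder would produce these vectors.
english = {
    "the cat sleeps": [0.9, 0.1, 0.0],
    "it is raining": [0.0, 0.2, 0.9],
    "good night": [0.5, 0.5, 0.5],
}
french = {
    "le chat dort": [0.8, 0.2, 0.1],
    "il pleut": [0.1, 0.1, 0.95],
    "bonjour": [0.1, 0.9, 0.1],
}

pairs = mine_pairs(english, french, threshold=0.9)
matched = [(src, tgt) for src, tgt, _ in pairs]
```

Here "good night" has no close neighbor among the French sentences, so the threshold filters it out rather than forcing a poor match.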