12 repositories
Distributes large-scale data processing tasks across multiple workers using grouping keys for high throughput.
Distinct from Parallel Task Batching: Candidates were either too specific to reasoning tasks or too general as academic computing labels.
Explore 12 awesome GitHub repositories matching data & databases · Parallel Batch Processing.
Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi
Processes thousands of URLs concurrently using asynchronous queue-based controls to ensure scalable data retrieval.
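As a rough illustration of queue-based asynchronous concurrency control, the sketch below feeds URLs into an `asyncio.Queue` and lets a fixed number of workers drain it. This is not Firecrawl's implementation or API: `fetch`, `worker`, and `crawl` are hypothetical names, and `fetch` is a stand-in that performs no network I/O.

```python
import asyncio


async def fetch(url: str) -> str:
    # Hypothetical stand-in for a real HTTP request; yields control once.
    await asyncio.sleep(0)
    return f"content of {url}"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    # Each worker pulls URLs from the shared queue until it is cancelled.
    while True:
        url = await queue.get()
        try:
            results[url] = await fetch(url)
        except Exception as exc:
            # Record the failure so one bad URL does not stop the worker.
            results[url] = exc
        finally:
            queue.task_done()


async def crawl(urls: list[str], concurrency: int = 4) -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: dict = {}
    # The number of workers caps how many fetches are in flight at once.
    workers = [
        asyncio.create_task(worker(queue, results)) for _ in range(concurrency)
    ]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Calling `asyncio.run(crawl(urls, concurrency=3))` returns a dict keyed by URL; raising `concurrency` trades politeness toward the target servers for throughput.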
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Distributes data processing across multiple workers using grouping keys to increase overall system throughput.
Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming model. The framework utilizes a cross-runner pipeline abstraction that decouples the data processing logic from the underlying execution backend, allowing the same pipeline to run on different distributed compute engines. It supports multi-language pipeline development by translating high-level code fro
Scales large-scale data transformations across compute nodes to process massive historical datasets using grouping keys.
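The idea of distributing work by grouping key can be sketched with the Python standard library alone. The example below is not Apache Beam code and does not use its API: `group_by_key`, `process_group`, and `parallel_aggregate` are hypothetical names, and summing the values is only an assumed per-group operation.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_key(records):
    # Collect every value that shares a key into one list.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def process_group(item):
    # Assumed per-group work: sum the values for a single key.
    key, values = item
    return key, sum(values)


def parallel_aggregate(records, max_workers=4):
    # Groups are independent, so each one can go to a different worker.
    groups = group_by_key(records)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(process_group, groups.items()))
```

Because no group depends on another, the same structure should carry over to `ProcessPoolExecutor` for CPU-bound work, or to separate machines in a distributed runner.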
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
Processes multiple files faster than sequential extraction using a work-stealing scheduler.
Unstract is an unstructured data extraction system and ETL pipeline orchestrator that uses large language models to convert documents, images, and scans into structured JSON. It provides a document extraction API for integrating these capabilities into external automation tools and includes a Model Context Protocol server to connect AI agents to structured information retrieval. The system ensures data accuracy through a verification tool featuring dual-model verification and human-in-the-loop review with coordinate-based document highlighting. It utilizes natural language extraction schemas
Optimizes throughput by processing documents in parallel while tracking and skipping duplicate files.
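One way to combine parallel processing with duplicate skipping is to hash each file's content and submit only unseen hashes to a worker pool. The sketch below assumes that approach; it is not Unstract's code, and `content_hash`, `extract`, and `process_documents` are hypothetical names with a placeholder extraction step.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def content_hash(data: bytes) -> str:
    # Identical bytes yield identical digests, whatever the file name.
    return hashlib.sha256(data).hexdigest()


def extract(data: bytes) -> dict:
    # Hypothetical placeholder for the real extraction step.
    return {"length": len(data)}


def process_documents(documents: dict, seen: set, max_workers: int = 4) -> dict:
    # Keep only the first file for each content hash not seen before.
    pending = {}
    for name, data in documents.items():
        digest = content_hash(data)
        if digest in seen:
            continue
        seen.add(digest)
        pending[name] = data
    # Extract the remaining documents in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(extract, pending.values())
        return dict(zip(pending.keys(), results))
```

Passing the same `seen` set across calls lets later batches skip files that an earlier batch already handled.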
Steel is a cloud browser automation platform that provides a REST API for launching and controlling remote Chrome browser sessions. It enables programmatic browsing and web scraping using standard automation tools like Puppeteer, Playwright, and Selenium, connecting to cloud-hosted browser instances via WebSocket and the Chrome DevTools Protocol. The platform supports both headless and headful browser sessions, with language-specific SDKs for TypeScript and Python. The service distinguishes itself through comprehensive anti-detection capabilities, including residential proxy rotation, CAPTCHA
Processes multiple URLs concurrently using async concurrency controls to speed up batch browser automation tasks.
ACE Step 1.5 is a local text-to-music generation and audio editing system that runs on consumer hardware. It transforms plain-language descriptions into full-length songs with lyrics, and can edit existing audio through cover generation, vocal removal, track separation, and selective repainting. The system supports multilingual prompts and lyrics in over 50 languages, and provides precise control over musical structure including duration, BPM, key, and time signature. The project distinguishes itself through a dual-stream diffusion architecture that processes separate latent streams for vocal
Generates multiple songs simultaneously by running independent diffusion processes in parallel on the GPU, maximising throughput for batch workflows.
Firecrawl MCP Server is a Model Context Protocol tool server that exposes the full suite of Firecrawl’s web scraping, crawling, and automation capabilities as tools that large language models can invoke directly. It acts as a proxy to the Firecrawl cloud platform, which manages headless browser orchestration, async job queues, and rate limiting behind the scenes. The server distinguishes itself by packaging autonomous web agents — both a research agent that browses and collects structured data from multiple pages, and a general web agent that performs multi-step browsing and extraction tasks
Scrapes multiple URLs in parallel with rate limiting and returns operation status for later retrieval.
This is a PyTorch deep learning implementation for training transformer-based language models. It functions as a distributed GPU trainer and a framework designed to optimize text prediction models for speed and sample efficiency. The project is distinguished by its use of a Newton-Schulz weight optimizer. This method applies an iterative process that keeps parameter updates and weight matrices approximately orthogonal, which improves sample efficiency and reduces memory overhead during training. The framework covers a broad range of distributed GPU computing capabilities, including data parallelism to scale workloads across multiple GPUs. It also integrates neural network optimization techniques such as iterative momentum optimization and high-throughput batch processing.
Employs parallel batch processing to load large data chunks into memory and maximize GPU utilization.
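For readers unfamiliar with the Newton-Schulz method named in the entry above, the classical cubic form of the iteration is shown below. The listing does not state which variant or coefficients the project uses, so this is the textbook version under that assumption. Here $G$ is the matrix to be orthogonalized (for example, a momentum-averaged gradient) with singular value decomposition $G = U \Sigma V^\top$.

```latex
X_0 = \frac{G}{\lVert G \rVert_F},
\qquad
X_{k+1} = \tfrac{3}{2}\, X_k - \tfrac{1}{2}\, X_k X_k^\top X_k .
% Each singular value \sigma of X_k is mapped to
%   \sigma \mapsto \tfrac{3}{2}\sigma - \tfrac{1}{2}\sigma^3 ,
% which has a fixed point at \sigma = 1 and converges to it for 0 < \sigma < \sqrt{3}.
% Hence, when G has full rank, X_k \to U V^\top, the nearest semi-orthogonal matrix to G.
```

Normalizing by the Frobenius norm guarantees that every singular value of $X_0$ lies in $(0, 1]$, inside the convergence region, and the iteration needs only matrix multiplications, which suits GPU execution.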
Trafilatura is a Python library and command-line tool for extracting clean, structured text and metadata from web pages. It downloads HTML content, identifies the main body of text, and strips away navigation, ads, and other boilerplate, returning the core article content along with fields like title, author, date, and URL. The tool can also extract user comments and test whether a page contains extractable text, making it a general-purpose web text extraction library. What distinguishes Trafilatura from simpler extractors is its configurable extraction pipeline, which offers high-speed, high
Fetches multiple URLs concurrently with deduplication and archive fallback.
GAM is a command-line tool for administering Google Workspace and Cloud Identity. It translates command-line arguments into structured API calls, enabling administrators to manage users, groups, organizational units, and domain settings across a Google Workspace environment. The tool handles authentication through OAuth2 flows, service accounts, and workload identity federation, and supports multi-tenant configurations for managing multiple domains or cloud projects from a single installation. GAM distinguishes itself through its batch processing and automation capabilities. It can process la
Distributes independent API requests across parallel worker threads to process large datasets from CSV or flat files.
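The pattern of fanning rows from a CSV file out to parallel worker threads can be sketched as follows. This is not GAM's command syntax or internals: `call_api` and `run_batch` are hypothetical names, the `email` column is an assumed header, and `call_api` makes no real request.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def call_api(row: dict) -> str:
    # Hypothetical stand-in for one independent API request per CSV row.
    return f"updated {row['email']}"


def run_batch(csv_text: str, max_workers: int = 5) -> list:
    # Parse the CSV once, then hand each row to a worker thread.
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input rows.
        return list(pool.map(call_api, rows))
```

Threads fit this workload because each request spends most of its time waiting on the network, and the rows do not depend on one another.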
Yattee is a privacy-focused video player and multi-backend video aggregator designed for streaming online content without tracking, ads, or account requirements. It functions as a cross-platform application that collects video content from self-hosted servers, third-party APIs, and decentralized platforms into a single interface. The project features SponsorBlock integration to automatically skip sponsored or promotional segments using a community-sourced timestamp database. It also includes an Invidious-compatible API server that can replace standard endpoints to facilitate private playback.
Processes multiple URLs in parallel to extract video information efficiently in batches.