Why is mendableai/firecrawl a recommended Parallel Batch Processing repository?

Processes thousands of URLs concurrently using asynchronous queue-based controls to ensure scalable data retrieval.
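
As a rough illustration of what "asynchronous queue-based controls" can look like in general, the sketch below drains a queue of URLs with a fixed number of workers. It is not Firecrawl's implementation: `process_url`, `worker`, and `run_batch` are hypothetical names, and the fetch is simulated so the example runs without network access.

```python
import asyncio


async def process_url(url: str) -> str:
    # Hypothetical stand-in for a real fetch; it only yields to the event loop.
    await asyncio.sleep(0)
    return f"processed:{url}"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    # Each worker keeps pulling URLs from the shared queue until cancelled.
    while True:
        url = await queue.get()
        try:
            results[url] = await process_url(url)
        finally:
            queue.task_done()


async def run_batch(urls, concurrency: int = 4) -> dict:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: dict = {}
    # The number of workers caps how many URLs are in flight at once.
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(concurrency)]
    await queue.join()  # wait until every queued URL has been handled
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Calling `asyncio.run(run_batch(urls, concurrency=8))` returns a mapping from each URL to its result. The worker count is the main tuning knob in this sketch.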

Why is lancedb/lancedb a recommended Parallel Batch Processing repository?

Distributes data processing across multiple workers using grouping keys to increase overall system throughput.

Why is apache/beam a recommended Parallel Batch Processing repository?

Scales data transformations across compute nodes to process massive historical datasets using grouping keys.

Why is kreuzberg-dev/kreuzberg a recommended Parallel Batch Processing repository?

Processes multiple files faster than sequential extraction using a work-stealing scheduler.

Why is Zipstack/unstract a recommended Parallel Batch Processing repository?

Optimizes throughput by processing documents in parallel while tracking and skipping duplicate files.
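
One generic way to combine parallel processing with duplicate skipping is to fingerprint each file's bytes and schedule only the first file seen for each fingerprint. The sketch below assumes content hashing with SHA-256 and a thread pool. It does not reflect how Unstract tracks duplicates, and `fingerprint`, `extract`, and `process_unique` are hypothetical names.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def fingerprint(data: bytes) -> str:
    # Identical bytes always produce the same digest.
    return hashlib.sha256(data).hexdigest()


def extract(data: bytes) -> int:
    # Hypothetical stand-in for document extraction: count whitespace-separated words.
    return len(data.split())


def process_unique(documents: dict[str, bytes], max_workers: int = 4) -> dict[str, int]:
    seen: set[str] = set()
    unique: list[tuple[str, bytes]] = []
    for name, data in documents.items():
        digest = fingerprint(data)
        if digest in seen:
            continue  # same content already scheduled, so skip the duplicate
        seen.add(digest)
        unique.append((name, data))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `unique`.
        counts = pool.map(extract, (data for _, data in unique))
        return {name: count for (name, _), count in zip(unique, counts)}
```

Deduplicating before scheduling keeps the workers busy only with new content. For CPU-heavy extraction, a process pool would likely be a better fit than threads.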

Why is steel-dev/steel-browser a recommended Parallel Batch Processing repository?

Processes multiple URLs concurrently using async concurrency controls to speed up batch browser automation tasks.

Why is ace-step/ACE-Step-1.5 a recommended Parallel Batch Processing repository?

Generates multiple songs simultaneously by running independent diffusion processes in parallel on the GPU, maximising throughput for batch workflows.

Why is firecrawl/firecrawl-mcp-server a recommended Parallel Batch Processing repository?

Scrapes multiple URLs in parallel with rate limiting and returns operation status for later retrieval.

Why is KellerJordan/modded-nanogpt a recommended Parallel Batch Processing repository?

Employs parallel batch processing to load large data chunks into memory and maximize GPU utilization.

Why is adbar/trafilatura a recommended Parallel Batch Processing repository?

Fetches multiple URLs concurrently with deduplication and archive fallback.

12 repositories

Awesome GitHub Repositories · Parallel Batch Processing

Distributes large-scale data processing tasks across multiple workers using grouping keys for high throughput.
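
The idea of distributing work by grouping key can be sketched as follows: records are bucketed by key, and each bucket is then handed to a worker independently. This is a minimal illustration using the standard library, not the approach of any listed repository. The record shape `(key, amount)` and the names `group_by_key`, `sum_group`, and `parallel_totals` are assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_key(records, key_fn):
    # Bucket records so that everything sharing a key lands in one group.
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return groups


def sum_group(item):
    # Runs on a worker: reduce one group to a single total.
    key, rows = item
    return key, sum(amount for _, amount in rows)


def parallel_totals(records, max_workers=4):
    groups = group_by_key(records, key_fn=lambda r: r[0])
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Groups are independent, so they can be processed concurrently.
        return dict(pool.map(sum_group, groups.items()))
```

Because no group depends on another, adding workers raises throughput until the largest group becomes the bottleneck. Skewed keys are the usual limit of this pattern.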

Distinct from Parallel Task Batching: Candidates were either too specific to reasoning tasks or too general as academic computing labels.

Explore 12 awesome GitHub repositories matching data & databases · Parallel Batch Processing.


mendableai/firecrawl
139,399
Firecrawl is a headless browser automation tool and web crawling engine designed to extract structured data from the web. It functions as an API that transforms raw website content and documents into clean markdown and JSON formats to serve as context for large language models. The project distinguishes itself by using natural language prompts to translate human instructions into targeted data extraction tasks and browser actions. It can execute interactive page navigation, such as clicking and scrolling, and perform automated web research to retrieve structured data without manual interventi
Processes thousands of URLs concurrently using asynchronous queue-based controls to ensure scalable data retrieval.
TypeScript
lancedb/lancedb
9,031
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Distributes data processing across multiple workers using grouping keys to increase overall system throughput.
HTML · approximate-nearest-neighbor-search · image-search · nearest-neighbor-search
apache/beam
8,612
Apache Beam is a distributed data pipeline framework and unified data processing model designed to handle both bounded batch data and unbounded real-time streams. It provides a system for building scalable, data-parallel workflows that operate across compute clusters using a single programming model. The framework utilizes a cross-runner pipeline abstraction that decouples the data processing logic from the underlying execution backend, allowing the same pipeline to run on different distributed compute engines. It supports multi-language pipeline development by translating high-level code fro
Scales data transformations across compute nodes to process massive historical datasets using grouping keys.
Java
kreuzberg-dev/kreuzberg
8,527
Kreuzberg is a document extraction engine that converts PDFs, Office files, images, and over 90 other formats into clean, structured text and metadata. It is built around a compiled Rust core that can be used as a native library, a command-line tool, a REST API server, or a WebAssembly module for browser-based processing. The system is designed to run entirely on self-hosted infrastructure, with no data leaving the user's environment. What distinguishes Kreuzberg is its breadth of integration surfaces and its pipeline architecture. It exposes extraction capabilities through native bindings fo
Processes multiple files faster than sequential extraction using a work-stealing scheduler.
Rust · document-intelligence · elixir · ffi
Zipstack/unstract
6,669
Unstract is an unstructured data extraction system and ETL pipeline orchestrator that uses large language models to convert documents, images, and scans into structured JSON. It provides a document extraction API for integrating these capabilities into external automation tools and includes a Model Context Protocol server to connect AI agents to structured information retrieval. The system ensures data accuracy through a verification tool featuring dual-model verification and human-in-the-loop review with coordinate-based document highlighting. It utilizes natural language extraction schemas
Optimizes throughput by processing documents in parallel while tracking and skipping duplicate files.
Python · ai-agents · data-engineering · document-ai
steel-dev/steel-browser
6,450
Steel is a cloud browser automation platform that provides a REST API for launching and controlling remote Chrome browser sessions. It enables programmatic browsing and web scraping using standard automation tools like Puppeteer, Playwright, and Selenium, connecting to cloud-hosted browser instances via WebSocket and the Chrome DevTools Protocol. The platform supports both headless and headful browser sessions, with language-specific SDKs for TypeScript and Python. The service distinguishes itself through comprehensive anti-detection capabilities, including residential proxy rotation, CAPTCHA
Processes multiple URLs concurrently using async concurrency controls to speed up batch browser automation tasks.
TypeScript · ai · ai-agents · ai-tools
ace-step/ACE-Step-1.5
6,002
ACE Step 1.5 is a local text-to-music generation and audio editing system that runs on consumer hardware. It transforms plain-language descriptions into full-length songs with lyrics, and can edit existing audio through cover generation, vocal removal, track separation, and selective repainting. The system supports multilingual prompts and lyrics in over 50 languages, and provides precise control over musical structure including duration, BPM, key, and time signature. The project distinguishes itself through a dual-stream diffusion architecture that processes separate latent streams for vocal
Generates multiple songs simultaneously by running independent diffusion processes in parallel on the GPU, maximising throughput for batch workflows.
Python
firecrawl/firecrawl-mcp-server
5,542
Firecrawl MCP Server is a Model Context Protocol tool server that exposes the full suite of Firecrawl’s web scraping, crawling, and automation capabilities as tools that large language models can invoke directly. It acts as a proxy to the Firecrawl cloud platform, which manages headless browser orchestration, async job queues, and rate limiting behind the scenes. The server distinguishes itself by packaging autonomous web agents — both a research agent that browses and collects structured data from multiple pages, and a general web agent that performs multi-step browsing and extraction tasks
Scrapes multiple URLs in parallel with rate limiting and returns operation status for later retrieval.
JavaScript · batch-processing · claude · content-extraction
KellerJordan/modded-nanogpt
5,436
This is a PyTorch deep learning implementation for training transformer-based language models. It functions as a distributed GPU trainer and a framework designed to optimize text prediction models for speed and sample efficiency. The project is distinguished by its use of the Newton-Schulz weight optimizer. This method applies an iterative process to keep parameter updates and weight matrices nearly orthogonal, which improves sample efficiency and reduces memory overhead during training. The framework covers broad distributed GPU computing capabilities, including data parallelism for scaling workloads across multiple GPUs. It also integrates neural network optimization techniques such as iterative momentum optimization and high-throughput batch processing.
Employs parallel batch processing to load large data chunks into memory and maximize GPU utilization.
Python
adbar/trafilatura
5,319
Trafilatura is a Python library and command-line tool for extracting clean, structured text and metadata from web pages. It downloads HTML content, identifies the main body of text, and strips away navigation, ads, and other boilerplate, returning the core article content along with fields like title, author, date, and URL. The tool can also extract user comments and test whether a page contains extractable text, making it a general-purpose web text extraction library. What distinguishes Trafilatura from simpler extractors is its configurable extraction pipeline, which offers high-speed, high
Fetches multiple URLs concurrently with deduplication and archive fallback.
Python · article-extractor · corpus-builder · corpus-tools
GAM-team/GAM
4,206
GAM is a command-line tool for administering Google Workspace and Cloud Identity. It translates command-line arguments into structured API calls, enabling administrators to manage users, groups, organizational units, and domain settings across a Google Workspace environment. The tool handles authentication through OAuth2 flows, service accounts, and workload identity federation, and supports multi-tenant configurations for managing multiple domains or cloud projects from a single installation. GAM distinguishes itself through its batch processing and automation capabilities. It can process la
Distributes independent API requests across parallel worker threads to process large datasets from CSV or flat files.
Python · gam · google · google-admin-sdk
yattee/yattee
3,322
Yattee is a privacy-focused video player and multi-backend video aggregator designed for streaming online content without tracking, ads, or account requirements. It functions as a cross-platform application that collects video content from self-hosted servers, third-party APIs, and decentralized platforms into a single interface. The project features SponsorBlock integration to automatically skip sponsored or promotional segments using a community-sourced timestamp database. It also includes an Invidious-compatible API server that can replace standard endpoints to facilitate private playback.
Processes multiple URLs in parallel to extract video information efficiently in batches.
Swift · invidious · ios · macos

