7 repositories
Frameworks for managing high-volume, asynchronous web crawling across multiple nodes.
Explore 7 GitHub repositories matching data & databases · Distributed Crawling Systems.
This project is a comprehensive educational resource and study guide focused on distributed systems architecture and backend infrastructure design. It provides a structured curriculum for mastering the scalability, reliability, and performance principles required to design complex software systems. The repository distinguishes itself by offering a systematic approach to technical interview preparation, integrating design patterns, architectural trade-offs, and spaced-repetition tools to help users retain complex concepts. It emphasizes constraint-based analysis, teaching users how to evaluate competing requirements such as latency, consistency, and availability when formulating architectural designs. The content covers a broad spectrum of system design capabilities, including database scaling strategies, traffic management, and infrastructure optimization. It details techniques for horizontal scaling, multi-tier caching, asynchronous communication, and service discovery, while providing frameworks for resource estimation and capacity planning. The documentation is organized as a study guide, providing a systematic path through backend engineering fundamentals and large-scale system design.
Implements strategies for ranking and prioritizing URLs to optimize web crawling efficiency.
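As a rough illustration of what ranking and prioritizing URLs can look like in a crawler, the sketch below keeps pending URLs in a min-heap so the lowest score is fetched first. The `UrlFrontier` class, its method names, and the scoring convention are hypothetical and are not taken from the repository described above.

```python
import heapq
import itertools


class UrlFrontier:
    """Hypothetical priority frontier: the lowest score is crawled first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        # Monotonic counter breaks ties so equal scores keep insertion order.
        self._counter = itertools.count()

    def add(self, url, score):
        """Queue a URL once; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (score, next(self._counter), url))
        return True

    def pop(self):
        """Return the pending URL with the lowest score."""
        _score, _order, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

How the score is computed (link depth, page freshness, estimated importance) is a design decision left open here; the heap only guarantees that whatever is ranked best comes out next.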
Crawl4AI is an AI-powered web crawling and data extraction engine designed to transform complex web content into structured formats. It functions as a headless browser orchestrator, enabling the navigation of dynamic websites, the execution of custom scripts, and the capture of visual assets like screenshots and PDFs. By integrating language models directly into the extraction workflow, the system converts raw HTML into clean, structured data or Markdown files optimized for downstream ingestion. The platform distinguishes itself through a distributed, self-hosted infrastructure that manages l
Coordinates high-volume data gathering through asynchronous job queues and self-hosted infrastructure to ensure scalable and reliable crawling operations.
Scrapy is a comprehensive framework designed for automated web data extraction and large-scale crawling. It operates on an asynchronous, event-driven engine that manages non-blocking network requests and data processing tasks, allowing for the efficient retrieval of structured information from web documents using path-based selectors. The system distinguishes itself through a highly modular architecture that supports complex data collection workflows. Users can implement custom middleware and signal handlers to intercept and modify request flows, while a priority-based scheduler manages concu
Coordinates high-volume, asynchronous crawling operations to ensure reliability during long-running data collection tasks.
Crawlee is a web scraping framework designed for building scalable, reliable, and distributed data extraction pipelines. It provides a unified interface for managing headless browser automation and lightweight HTTP requests, allowing developers to handle complex web navigation, dynamic content rendering, and large-scale data collection within a single, modular architecture. The project distinguishes itself through its resource-aware concurrency controller, which dynamically scales task execution based on real-time CPU and memory usage to prevent host machine exhaustion. It also features a rob
Persists crawl progress to allow resuming interrupted jobs from the last processed state.
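Resuming an interrupted crawl generally comes down to writing the pending queue and the set of finished URLs to durable storage and reloading them on startup. The following is a minimal sketch of that idea using a JSON file; the function names and file layout are assumptions made for illustration and do not reflect how Crawlee stores its state.

```python
import json
import os
import tempfile


def save_checkpoint(path, pending, done):
    """Write crawl state atomically so a crash never leaves a half-written file."""
    state = {"pending": list(pending), "done": sorted(done)}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    # os.replace swaps the new file into place in a single step.
    os.replace(tmp_path, path)


def load_checkpoint(path, seeds):
    """Return (pending, done); fall back to the seed URLs on a first run."""
    if not os.path.exists(path):
        return list(seeds), set()
    with open(path, encoding="utf-8") as fh:
        state = json.load(fh)
    return state["pending"], set(state["done"])
```

A crawler loop would call `load_checkpoint` once at startup and `save_checkpoint` after each batch of processed URLs, trading a little write overhead for the ability to restart from the last saved state.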
This project is a comprehensive educational guide and framework for building web scrapers using Python. It provides a course-based approach to data extraction, combining a Python crawler framework with tutorials on web reverse engineering and network traffic analysis. The project distinguishes itself by covering advanced extraction challenges, including the decryption of obfuscated JavaScript and the bypass of anti-scraping measures. It specifically addresses mobile application scraping through the simulation of user interactions and the interception of network traffic. The capability surfac
Implements scalable architectures for managing high-volume, asynchronous web crawling across multiple nodes.
PySpider is a Python web crawling framework designed for automated data extraction. It provides a pipeline for periodically fetching web content, processing HTML, and persisting scraped information into database backends. The system features a web-based management interface for editing scraping scripts, monitoring task progress, and reviewing collected data. It includes a headless browser JavaScript renderer to capture rendered HTML from dynamic web pages and a distributed architecture that uses message queues to scale crawling workloads across multiple nodes. The framework also covers task
Provides a framework for high-volume, asynchronous web crawling across multiple nodes using message queues.
This project is a comprehensive knowledge base and study resource designed for mastering technical interviews. It provides structured guides, roadmaps, and curricula focused on data structures, algorithms, system design, and frontend engineering to help candidates prepare for software engineering screenings. The repository distinguishes itself by offering a holistic approach to professional advancement. Beyond technical drills, it includes a career development handbook covering resume optimization, salary benchmarking, and strategic negotiation coaching. It also provides detailed methodologie
Covers the design of distributed crawling systems using consistent hashing to partition URL space across servers.
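To make the consistent-hashing idea concrete, here is a small sketch of a hash ring that assigns each URL's host to a crawler server. The `HashRing` class, the server names, and the choice of 64 virtual nodes per server are illustrative assumptions, not details from the repository.

```python
import bisect
import hashlib
from urllib.parse import urlsplit


def _hash(key):
    """Map a string to a 64-bit position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, servers, vnodes=64):
        # Each server appears at many ring positions to even out the load.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._keys = [position for position, _ in self._ring]

    def server_for(self, url):
        """Pick the first ring position clockwise from the host's hash."""
        host = urlsplit(url).hostname or url
        idx = bisect.bisect_right(self._keys, _hash(host)) % len(self._keys)
        return self._ring[idx][1]
```

Hashing the host rather than the full URL keeps every page of a site on one server, which makes per-site rate limiting simpler. When a server is removed, only the hosts it owned move to other servers; the rest keep their assignment.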