Why is alibaba/DataX a recommended Cloud Storage Definition Loading repository?

Loads data from cloud object storage into a transportable format for analytical processing.

Why is elsa-workflows/elsa-core a recommended Cloud Storage Definition Loading repository?

The workflow engine retrieves workflow definitions from cloud storage providers like Azure Blob Storage or AWS S3.

Why is apache/pinot a recommended Cloud Storage Definition Loading repository?

Retrieves and imports data files from remote object storage buckets for analytical processing.

Why is timescale/pgai a recommended Cloud Storage Definition Loading repository?

Imports content for embedding from external sources including cloud storage and web addresses.

Why is alibaba/AliSQL a recommended Cloud Storage Definition Loading repository?

Loads data from cloud object storage into the analytical engine for processing.

Why is kserve/kserve a recommended Cloud Storage Definition Loading repository?

Loads model artifacts from S3, GCS, or Azure Blob storage during deployment.

Why is treeverse/lakeFS a recommended Cloud Storage Definition Loading repository?

Loads datasets from versioned object storage using a specialized URI scheme for ML libraries.

Why is GAM-team/GAM a recommended Cloud Storage Definition Loading repository?

Retrieves files from cloud storage buckets using various URI schemes to provide input for administrative commands.

Why is aws/aws-sdk-pandas a recommended Cloud Storage Definition Loading repository?

Provides capabilities to load various file formats from S3 object storage directly into pandas dataframes for analysis.

Why is awslabs/aws-data-wrangler a recommended Cloud Storage Definition Loading repository?

Facilitates loading data from cloud object storage into analytical engines for extraction and transformation workflows.

10 repositories

Cloud Storage Definition Loading

Capabilities for loading configuration or definition files from cloud object storage providers.

Distinct from Azure Blob Manifest Synchronization, which covers data export and manifest sync rather than loading executable workflow definitions from blob storage.

Explore 10 awesome GitHub repositories matching data & databases · Cloud Storage Definition Loading. Refine with filters or upvote what's useful.


alibaba/DataX
17,241 stars
DataX is a distributed data integration framework and plugin-based ETL tool designed for synchronizing large datasets between heterogeneous sources and destinations. It functions as a JDBC data migration engine and offline synchronization tool, enabling the movement of data between relational databases, NoSQL stores, and object storage. The system utilizes a plugin-based connector architecture that decouples reader and writer logic, allowing it to map and transform data types across different storage engines using a standardized internal representation. This design supports heterogeneous data
Loads data from cloud object storage into a transportable format for analytical processing.
Java
elsa-workflows/elsa-core
7,629 stars
Elsa Core is a workflow engine framework designed for defining, executing, and managing long-running business processes. It functions as a distributed workflow orchestrator and event-driven trigger system, capable of operating as a multi-tenant platform with secure data isolation. The project distinguishes itself through a flexible approach to workflow definitions, supporting a visual drag-and-drop designer, programmatic C# definitions, and portable JSON specifications. It provides a highly extensible architecture allowing for the development of custom activities and the use of a dynamic expr
The workflow engine retrieves workflow definitions from cloud storage providers like Azure Blob Storage or AWS S3.
C# · csharp, dotnet, elsa
apache/pinot
6,098 stars
Pinot is a distributed, columnar analytical database designed for high-concurrency, low-latency query processing. It functions as a real-time OLAP datastore, enabling interactive, user-facing analytics by ingesting and querying massive datasets from both streaming and batch sources. The system architecture relies on a centralized controller for cluster coordination and a distributed segment-based storage model to ensure horizontal scalability. The platform distinguishes itself through a hybrid ingestion pipeline that unifies real-time event streams and historical batch data into a single quer
Retrieves and imports data files from remote object storage buckets for analytical processing.
Java
timescale/pgai
5,802 stars
pgai is a toolkit and framework for PostgreSQL designed to integrate large language models and vector embeddings directly within the database. It acts as a bridge for executing machine learning model requests and performing text-to-SQL translation within standard database queries. The project provides an automated vector embedding pipeline that handles loading, parsing, and chunking text from tables and unstructured documents. This system uses a background worker to synchronize embeddings automatically as the source data changes, and includes specialized tools for building retrieval-augmented generation (RAG) applications and semantic search engines. The toolkit covers a broad range of areas, including processing unstructured data with OCR, creating semantic indexes that link database schemas to natural language, and running high-performance similarity searches through vector indexing and result reranking. It also enables data enrichment, classification, and content moderation by calling external models via SQL.
Imports content for embedding from external sources including cloud storage and web addresses.
PLpgSQL
alibaba/AliSQL
5,706 stars
AliSQL is a fork of MySQL by Alibaba that extends the relational database management system with enhancements for high performance, scalability, and enterprise-grade availability. It retains the core MySQL identity as a SQL-based database for storing, organizing, and retrieving structured data, while adding optimizations for large-scale transactional and analytical workloads. The project differentiates itself through a set of Alibaba-specific improvements, including a columnar engine for accelerating analytical queries directly on MySQL tables, and a distributed, shared-nothing NDB Cluster en
Loads data from cloud object storage into the analytical engine for processing.
C++ · alisql, database, duckdb
kserve/kserve
5,576 stars
KServe is a Kubernetes-native platform for deploying and serving machine learning models as scalable inference services. It supports both generative AI models, including large language models, and traditional predictive models from frameworks such as TensorFlow, PyTorch, Scikit-Learn, XGBoost, and ONNX. The platform manages the full lifecycle of model deployments, including revision tracking, canary rollouts, A/B testing, and automatic rollbacks, and provides serverless scale-to-zero capabilities for cost-efficient resource management. KServe distinguishes itself through a standardized infere
Loads model artifacts from S3, GCS, or Azure Blob storage during deployment.
Go
treeverse/lakeFS
5,406 stars
lakeFS is a version control system for data lakes that provides Git-like branching and commits for large datasets held in object storage. It acts as a version control layer, enabling immutable snapshots, atomic commits, and zero-copy branching to create isolated environments for data experiments without duplicating physical files. The system operates as an S3-compatible storage gateway and an Iceberg REST catalog, allowing standard cloud storage protocols and compatible clients to manage versioned tables. It serves as a data quality gatekeeper, using an event-driven hooks system to validate datasets against governance policies before changes are merged into production. The platform covers broad data governance capabilities, including collaboration through pull requests, role-based access control, and data lineage tracking. It provides integrations for workflow orchestration, machine learning pipelines, and various big data compute engines, and supports multi-cloud storage connectivity and identity synchronization via SSO and SCIM. The software can be installed using binaries, containers, or Helm charts for deployment on Kubernetes.
Loads datasets from versioned object storage using a specialized URI scheme for ML libraries.
Go
GAM-team/GAM
4,206 stars
GAM is a command-line tool for administering Google Workspace and Cloud Identity. It translates command-line arguments into structured API calls, enabling administrators to manage users, groups, organizational units, and domain settings across a Google Workspace environment. The tool handles authentication through OAuth2 flows, service accounts, and workload identity federation, and supports multi-tenant configurations for managing multiple domains or cloud projects from a single installation. GAM distinguishes itself through its batch processing and automation capabilities. It can process la
Retrieves files from cloud storage buckets using various URI schemes to provide input for administrative commands.
Python · gam, google, google-admin-sdk
aws/aws-sdk-pandas
4,107 stars
aws-sdk-pandas is a Python library that integrates pandas DataFrames with AWS services, serving as a cloud data ETL tool and data warehouse connector. It provides a unified interface for moving and transforming data between in-memory DataFrames, cloud storage, databases, and data warehouses. The project stands out as a distributed compute orchestrator capable of dispatching pandas-based workloads to EMR clusters and serverless processing environments. It also specializes in orchestrating distributed data processing by initializing a Ray cluster to handle datasets that exceed the memory of a single machine. The library covers a wide range of capabilities, including S3 object storage management, SQL query execution for Athena and Redshift, and integration with NoSQL, graph, and time-series databases. It also includes tools for metadata management through the Glue catalog, OpenSearch data indexing, and business intelligence asset management in QuickSight. Additional functionality includes secrets retrieval, CloudWatch log analysis, and data quality rule management.
Provides capabilities to load various file formats from S3 object storage directly into pandas dataframes for analysis.
Python · amazon-athena, amazon-sagemaker-notebook, apache-arrow
awslabs/aws-data-wrangler
4,107 stars
This project is an AWS pandas integration library and data pipeline framework designed to simplify moving and transforming data between local memory and AWS storage and analytics services. It serves as a cloud data lake tool and storage file manager, allowing users to read, write, and transform structured data across different cloud environments. The library stands out as a distributed compute orchestrator capable of managing clusters in environments such as EMR to process datasets that exceed the memory limits of a single machine. It also provides specialized capabilities for managing vector indexes and performing similarity searches within cloud storage buckets. Its broader capability surface covers cloud database ETL for services such as DynamoDB, RDS, and Timestream, as well as cloud data catalog management via AWS Glue. It supports serverless data analytics through Athena and Redshift, and provides tools for S3 object management, document indexing in OpenSearch, and CloudWatch log analysis.
Facilitates loading data from cloud object storage into analytical engines for extraction and transformation workflows.
Python

