9 repositories
Capabilities for reading and writing data directly from cloud-based storage providers.
Distinguishing note: Focuses on the data access layer rather than the authentication layer.
Explore 9 awesome GitHub repositories matching data & databases · Cloud Data Access.
Polars is a high-performance columnar data processing library designed for efficient analytical workflows. It functions as a structured data library that organizes information into typed columns, utilizing the Apache Arrow memory format to enable zero-copy data sharing and cache-friendly, vectorized operations. The engine is built to handle large-scale tabular datasets, providing both local and distributed analytical runtimes that scale from single-machine environments to multi-node clusters. The project distinguishes itself through a sophisticated lazy query engine that constructs abstract e
Reads data files directly from cloud storage buckets using URI paths.
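As a rough illustration of what addressing a file by URI path involves, the sketch below splits a storage URI into its scheme, bucket, and object key using only the Python standard library. It is not Polars' own code: the function name `split_storage_uri` and the example bucket are hypothetical, and the assumption is simply that the URI follows the common `scheme://bucket/key` layout.

```python
from urllib.parse import urlparse


def split_storage_uri(uri):
    """Split a URI such as s3://bucket/path/file into (scheme, bucket, key).

    Hypothetical helper for illustration; assumes a scheme://bucket/key layout.
    """
    parsed = urlparse(uri)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"not a bucket-style URI: {uri!r}")
    # The path keeps its leading slash; object keys conventionally do not.
    return parsed.scheme, parsed.netloc, parsed.path.lstrip("/")


scheme, bucket, key = split_storage_uri("s3://example-bucket/data/part-0.parquet")
print(scheme, bucket, key)
```

A reader library would typically use the scheme to pick a storage backend and then request the key from the named bucket.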
Label Studio is a multi-modal data annotation platform designed to create and manage high-quality training datasets for machine learning. It functions as a self-hosted, containerized environment that supports secure, private deployments, including air-gapped configurations. The platform provides a centralized workspace for labeling diverse media types, such as images, text, audio, and time-series data, to support supervised and reinforcement learning workflows. The platform distinguishes itself through deep integration with machine learning backends, enabling active learning loops, automated
Label Studio downloads original media files such as images, audio, or text from the annotation environment for use in external machine learning backend processing.
This project is a deep learning framework designed for constructing, training, and deploying neural networks across diverse hardware environments. It functions as a high-performance tensor computation library that provides both imperative and symbolic programming interfaces, allowing developers to balance flexible, step-by-step model building with the efficiency of compiled computation graphs. The framework distinguishes itself through a hybrid execution engine that integrates declarative graph compilation with imperative runtime logic. It supports scalable, distributed training across multip
Enables loading training datasets directly from remote cloud object storage using secure credentials.
Metaflow is a Python machine learning framework and MLOps workflow orchestrator designed to manage the lifecycle of data pipelines from local prototyping to production. It serves as a distributed compute manager and an experiment tracking system, enabling the creation of reproducible pipelines that transition between development and high-availability production environments. The framework distinguishes itself through an integrated checkpointing system that automatically persists intermediate data artifacts to remote storage, allowing failed runs to be resumed from the last successful step. It
Connects to cloud object storage to retrieve and store large datasets efficiently.
Enso is a visual dataflow programming environment and multi-language data processing engine that compiles Enso, Python, Java, and JavaScript into a unified representation with a shared memory model for zero-overhead inter-language calls. It functions as a self-service data preparation and analysis platform where users can build data pipelines by connecting nodes in a graph, switching between a no-code visual interface and a code view while keeping all changes reviewable. The platform also serves as a cloud data workflow scheduler and API exposer, allowing workflows to run on a timetable or be
Enables interactive data processing and visualization through a cloud-based platform.
OpenDroneMap (ODM) is an open-source aerial drone photogrammetry pipeline that converts 2D images into georeferenced 3D models, orthophotos, point clouds, and digital elevation maps. At its core, the OpenDroneMap Processing Engine orchestrates a complete Structure-from-Motion workflow, from feature extraction through dense reconstruction and tiled output generation, purpose-built for transforming drone-captured imagery into geospatial data products. The toolkit distinguishes itself through GPU-accelerated SIFT feature extraction using CUDA-capable NVIDIA graphics cards, roughly doubling proce
Creates Cloud-Optimized GeoTIFF files for faster remote access and streaming of orthophoto data.
Goofys is a POSIX-compatible cloud object storage gateway that presents remote storage buckets as local system directories. It implements a user-space file system that maps S3 and Azure Blob storage services to local mount points, allowing remote objects to be accessed through standard system operations. The project offers specific mounting capabilities for Amazon S3, Azure Blob Storage, and Azure Data Lake accounts. It uses a FUSE-based implementation to interface cloud object storage with the operating system kernel. The system includes performance optimizations such as local read caching to reduce latency and concurrent range-request retrieval to speed up downloads of large objects. It simulates hierarchical folder structures by parsing object key prefixes to emulate directories.
Provides a gateway to access remote cloud object storage as local folders for simplified file management.
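The idea of emulating directories from object key prefixes can be sketched in a few lines. This is a minimal illustration in Python, not Goofys' actual implementation: the function `list_directory` and the sample keys are hypothetical, and it assumes a flat key space with `/` as the delimiter.

```python
def list_directory(keys, prefix=""):
    """Return (subdirectories, files) found directly under `prefix`.

    Hypothetical helper: `keys` is a flat iterable of object keys and
    `prefix` is either empty or ends with "/".
    """
    subdirs, files = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if not rest:
            continue  # the key equals the prefix itself (a directory marker)
        head, sep, _ = rest.partition("/")
        if sep:
            subdirs.add(head + "/")  # deeper keys collapse into one folder
        else:
            files.append(head)  # no further delimiter: a file at this level
    return sorted(subdirs), sorted(files)


keys = ["data/2023/a.csv", "data/2023/b.csv", "data/readme.txt", "logs/app.log"]
print(list_directory(keys, "data/"))
```

Object stores can perform this grouping on the server when a listing request specifies a prefix and a delimiter, so a gateway does not have to scan every key on the client.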
This project is a web application security standard and vulnerability framework. It provides a comprehensive list of the most critical security risks facing web applications, paired with technical guidance and a structured methodology for identifying and mitigating these flaws. The framework functions as a secure coding guide and a risk assessment methodology, offering a standardized approach to prioritizing vulnerabilities based on their potential impact and likelihood of exploitation. It defines architectural patterns and technical recommendations to help developers implement defense in dep
Offers guidance on restricting access to cloud storage and services to prevent sensitive data exposure.
mimic-code is a clinical data analysis framework and toolset for processing deidentified electronic health records and intensive care unit data. It provides a healthcare SQL query library and a processing tool to transform raw health records into formats suitable for longitudinal analysis and machine learning. The project features a medical research notebook environment that integrates with cloud-hosted datasets, allowing for remote querying and analysis. It includes a DICOM imaging pipeline to retrieve chest radiographs and link medical imaging with structured clinical metadata. The framewo
Connects credentialed accounts to cloud-hosted data stores for clinical querying and analysis.