Why is ray-project/ray a recommended repository for In-Memory Data Loading?

Creates datasets from local Python objects or arrays to integrate existing workflows with distributed computing tasks.

Why is dask/dask a recommended repository for In-Memory Data Loading?

Reads datasets directly into the cluster to avoid network overhead and memory issues caused by embedding large local objects.

Why is apache/datafusion a recommended repository for In-Memory Data Loading?

Creates a DataFrame from programmatically defined rows or Arrow record batches without external storage.

Why is deepchem/deepchem a recommended repository for In-Memory Data Loading?

Featurizes data already held in memory, such as lists or pandas DataFrames, and checkpoints results to disk.

Why is petyosi/react-virtuoso a recommended repository for In-Memory Data Loading?

Provides scroll-triggered data loading for endless scrolling and bidirectional fetching in virtualized lists.

5 Repos

Awesome GitHub Repositories: In-Memory Data Loading

Methods for creating datasets from local objects or arrays.

Distinguishing note: Focuses on integrating local Python objects into distributed workflows.

Explore 5 GitHub repositories matching data & databases · In-Memory Data Loading.


ray-project/ray · 42,895
Ray is a distributed computing framework designed to scale Python and Java applications across clusters by abstracting task scheduling and resource management. It functions as a resource-aware execution engine that manages task dependencies, placement, and fault tolerance across networked compute nodes. At its core, the system provides a stateful actor model, allowing developers to define classes that run in dedicated processes to maintain and mutate internal state across remote method calls. The framework distinguishes itself through a robust cross-language interoperability layer, enabling f
Creates datasets from local Python objects or arrays to integrate existing workflows with distributed computing tasks.
Python · data-science · deep-learning · deployment
dask/dask · 13,746
Dask is a parallel computing framework and distributed task scheduler designed to scale Python data science workflows from single machines to large clusters. It acts as a cluster resource manager that orchestrates computation by representing tasks and their dependencies as directed acyclic graphs. This architecture lets the system automate the distribution of workloads across available hardware while managing complex execution requirements. The project distinguishes itself through a lazy evaluation engine that defers data operations until they are explicitly requested, which enables global graph optimization and efficient resource allocation. It integrates memory-aware data spilling to prevent crashes when processing datasets that exceed available memory, and it uses task graph fusion to combine sequences of operations into single execution steps, minimizing scheduling overhead and inter-node communication. The platform offers a broad surface for large-scale data analysis, including support for distributed machine learning, integration with high-performance computing, and parallel data processing. It provides extensive tooling for cluster lifecycle management, performance profiling, and real-time monitoring of task execution. Users can deploy these environments across a range of infrastructure, including local hardware, cloud providers, containerized systems, and high-performance computing clusters.
Reads datasets directly into the cluster to avoid network overhead and memory issues caused by embedding large local objects.
Python · dask · numpy · pandas
apache/datafusion · 8,908
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Creates a DataFrame from programmatically defined rows or Arrow record batches without external storage.
Rust · arrow · big-data · dataframe
deepchem/deepchem · 6,545
DeepChem is an open-source Python framework for applying deep learning to molecular, chemical, and biological data, serving as a comprehensive toolkit for drug discovery and materials science. At its core, it provides a featurizer-pipeline abstraction that converts raw molecular data into numerical representations, including graph-based molecular structures, SMILES tokenization vocabularies, and disk-sharded dataset persistence for handling large-scale data that exceeds RAM capacity. The framework distinguishes itself through integrated molecular docking workflows that automate pocket detecti
Featurizes data already held in memory, such as lists or pandas DataFrames, and checkpoints results to disk.
Python · biology · deep-learning · drug-discovery
petyosi/react-virtuoso · 6,348
React Virtuoso is a React component library for rendering large datasets efficiently through virtualized lists, grids, tables, and chat interfaces. It automatically measures variable-height items at runtime, computes accurate scroll offsets without requiring fixed sizes, and renders only the items within the visible viewport plus a configurable buffer zone. The library manages scroll position through a state machine that tracks direction, position, and anchor items to handle auto-scroll, sticky headers, and bidirectional loading. The library distinguishes itself with specialized components fo
Provides scroll-triggered data loading for endless scrolling and bidirectional fetching in virtualized lists.
TypeScript · chat · component-library · feed
