13 repositories
Reads data from Parquet, CSV, JSON, images, HuggingFace, MongoDB, SQL, and other formats using distributed readers.
Distinct from Multi-Source CSV Loading: supports many file formats and data sources beyond CSV, using Ray for distributed loading.
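As a rough illustration of what multi-format loading means in practice, the sketch below dispatches on file extension to a format-specific reader and returns every source as a list of dictionaries. It is a single-process, standard-library stand-in, not Ray's distributed readers and not the API of any repository listed here; the names `READERS` and `load_records` are hypothetical and exist only for this example.

```python
import csv
import io
import json
from pathlib import Path


def read_csv_text(text):
    # Each CSV row becomes a dict keyed by the header row; values stay strings.
    return list(csv.DictReader(io.StringIO(text)))


def read_json_text(text):
    # Accept either a top-level array of records or a single object.
    data = json.loads(text)
    return data if isinstance(data, list) else [data]


def read_jsonl_text(text):
    # One JSON document per non-empty line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical registry: file extension -> reader function.
READERS = {
    ".csv": read_csv_text,
    ".json": read_json_text,
    ".jsonl": read_jsonl_text,
}


def load_records(path):
    """Load a file into a list of dicts, choosing the reader by extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported format: {suffix!r}") from None
    return reader(path.read_text(encoding="utf-8"))
```

Supporting another format in this sketch only requires registering one more function in `READERS`. A distributed loader would additionally split the input into partitions and read them in parallel, which this example does not attempt.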
Explore 13 GitHub repositories matching Data & Databases · Multi-Format Data Loading.
TensorFlow.js is a JavaScript machine learning library used for training and deploying models in web browsers and server-side environments. It functions as a browser-based model trainer, a WebAssembly inference engine, and a WebGPU accelerated tensor library for low-level linear algebra. The project also includes a model converter to transform Python-based models into optimized formats for JavaScript execution. The library distinguishes itself through a pluggable backend architecture that allows mathematical operations to be executed via CPU, WebGL, or WebGPU. It supports the conversion of Py
Imports datasets from disk or web sources in various formats for machine learning use.
Perspective is a columnar data analytics engine and high-performance visualization component powered by WebAssembly. It provides a system for analyzing and visualizing large or streaming datasets through interactive data grids and charts, utilizing a compiled binary to achieve near-native performance within the browser. The project distinguishes itself through a WebSocket-based data streaming interface and deep Apache Arrow integration, which minimize memory overhead when synchronizing tables between servers and clients. It acts as a remote query proxy capable of translating visualization con
Loads data from multiple formats including CSV, JSON, and Apache Arrow into high-performance internal tables.
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Reads and writes data in Parquet, CSV, JSON, and Avro formats without additional configuration.
AlaSQL is a JavaScript SQL database engine that allows for the filtering, grouping, and joining of in-memory object arrays and JSON data. It functions as an in-memory SQL database and client-side data processor, enabling the execution of SQL statements against JavaScript arrays and external data sources in both browser and server environments. The project serves as a universal data query tool capable of performing relational joins across diverse sources, such as merging Google Spreadsheets, SQLite files, and remote APIs into a single result set. It also acts as an IndexedDB SQL wrapper, allow
Provides the ability to read and process data from multiple formats including CSV, JSON, and Excel.
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Reads feature data from Parquet, CSV, JSON, HuggingFace, MongoDB, SQL, and more using Ray's native readers.
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Reads datasets from local files, remote repositories, and common formats using distributed readers.
This repository is the official documentation for TensorFlow, a machine learning framework. It provides comprehensive guides, tutorials, and API references for building, training, and deploying machine learning models. The documentation covers the full lifecycle of machine learning projects, from constructing data pipelines and building neural networks with high-level APIs to customizing training loops and deploying trained models in production, on edge devices, or in browsers. The documentation includes step-by-step tutorials for a range of tasks, including reinforcement learning, ranking mo
Reads CSV, image, and text data sources into processing pipelines for efficient input handling.
PlotJuggler is an interactive time series visualization tool that loads, streams, and renders large datasets using hardware-accelerated OpenGL graphics. It functions as a multi-format data loader, supporting file formats such as CSV, ULog, and ROS bags, and also serves as a live data stream viewer that subscribes to real-time sources via MQTT, WebSockets, ZeroMQ, and UDP. The tool distinguishes itself through a plugin-based extensibility platform that allows users to add custom data sources, file formats, and processing capabilities. It includes a Lua scripting engine for creating custom data
Reads time series data from CSV, ULog, and ROS bag files for analysis and visualization.
Anomalib is a PyTorch-based library for visual anomaly detection, offering a modular framework, a comprehensive model zoo, and a benchmarking suite designed for industrial defect detection. It provides a wide range of algorithms—including generative, discriminative, teacher-student, and vision-language approaches—that support unsupervised, few-shot, and zero-shot settings. The library enables deployment through model export to ONNX and OpenVINO for edge devices, and includes a no-code web application for training and inference. It also features a command-line interface for orchestrating multi
Reads images and video clips from disk, validates paths, and formats data for anomaly detection models.
NVIDIA DALI is a GPU-accelerated data loading and preprocessing library designed for deep learning workflows. It constructs high-performance data pipelines that offload decoding, augmentation, and normalization to the GPU, eliminating CPU bottlenecks in training and inference. The library reads data from multiple storage formats and streams it directly into GPU memory, with support for multi-GPU execution to scale throughput across large-scale workloads. DALI distinguishes itself by enabling data pipelines to be built once and executed across multiple deep learning frameworks without code cha
Reads data from LMDB, RecordIO, TFRecord, WebDataset, COCO, and NumPy formats to feed into processing pipelines.
This is a grammar-of-graphics visualization library used to build charts by mapping tabular data to visual marks. It functions as an SVG data visualization tool and an exploratory data analysis API, allowing users to render complex visualizations and geographic maps. The library includes a GeoJSON map renderer that projects spherical coordinates into a two-dimensional pixel space and an Apache Arrow visualization interface for high-efficiency data processing. Its capabilities cover data transformation through binning and grouping, visual encoding through automatic scale inference and the application of color schemes, and the generation of small multiples. It supports rendering geometric shapes in layered visualizations and exporting static images in server-side environments.
Handles diverse data structures, including arrays of objects and Apache Arrow tables, to improve processing efficiency.
docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas. The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
Imports data from standard files or custom parsing tools for non-standard formats like audio and PDFs.
mcp-context-forge is a Model Context Protocol federation gateway that unifies diverse AI tool servers and APIs into a single consistent interface for discovery and execution. It acts as a centralized proxy that aggregates multiple servers and APIs, allowing AI agents to access and invoke a unified set of tools, prompts, and resources. The project distinguishes itself through a multi-protocol translation bridge that converts communication between standard I/O, SSE, gRPC, and REST to enable interoperability between disparate tool servers. It includes a comprehensive LLM evaluation framework for
Imports data from multiple formats including CSV, JSON, Parquet, Excel, and SQL into a managed cache.