18 repositories
Loads data from a file path, standard input, inline data, or files matching a regex pattern in a specified directory.
Distinct from CSV Data Loaders: focuses on loading CSV from multiple source types, not just file-based CSV loading.
Explore 18 awesome GitHub repositories matching data & databases · Multi-Source CSV Loading.
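The four source types named in the category description (a file path, standard input, inline data, and files matching a regex pattern in a directory) can be illustrated with a minimal standard-library sketch. The function name `load_csv` and its parameters are hypothetical and are not taken from any repository listed here; a real tool would likely add error handling and streaming options.

```python
import csv
import io
import re
import sys
from pathlib import Path
from typing import Iterator, Optional, TextIO


def _rows(handle: TextIO) -> Iterator[dict]:
    """Yield each CSV record as a dict keyed by the header row."""
    yield from csv.DictReader(handle)


def load_csv(
    path: Optional[str] = None,
    inline: Optional[str] = None,
    directory: Optional[str] = None,
    pattern: Optional[str] = None,
    stdin: Optional[TextIO] = None,
) -> Iterator[dict]:
    """Hypothetical multi-source loader: picks one source based on the arguments given."""
    if inline is not None:
        # Inline data: the CSV text is passed directly as a string.
        yield from _rows(io.StringIO(inline))
    elif path is not None:
        # Single file path.
        with open(path, newline="") as handle:
            yield from _rows(handle)
    elif directory is not None and pattern is not None:
        # Every file in the directory whose name matches the regex, in name order.
        regex = re.compile(pattern)
        for entry in sorted(Path(directory).iterdir()):
            if entry.is_file() and regex.search(entry.name):
                with entry.open(newline="") as handle:
                    yield from _rows(handle)
    else:
        # Fall back to standard input (or an injected stream, which eases testing).
        yield from _rows(stdin if stdin is not None else sys.stdin)
```

For example, `list(load_csv(inline="a,b\n1,2\n"))` yields one record with the string values `"1"` and `"2"`; no type conversion is attempted in this sketch.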
TensorFlow.js is a JavaScript machine learning library used for training and deploying models in web browsers and server-side environments. It functions as a browser-based model trainer, a WebAssembly inference engine, and a WebGPU accelerated tensor library for low-level linear algebra. The project also includes a model converter to transform Python-based models into optimized formats for JavaScript execution. The library distinguishes itself through a pluggable backend architecture that allows mathematical operations to be executed via CPU, WebGL, or WebGPU. It supports the conversion of Py
Imports datasets from disk or web sources in various formats for machine learning use.
Perspective is a columnar data analytics engine and high-performance visualization component powered by WebAssembly. It provides a system for analyzing and visualizing large or streaming datasets through interactive data grids and charts, utilizing a compiled binary to achieve near-native performance within the browser. The project distinguishes itself through a WebSocket-based data streaming interface and deep Apache Arrow integration, which minimize memory overhead when synchronizing tables between servers and clients. It acts as a remote query proxy capable of translating visualization con
Loads data from multiple formats including CSV, JSON, and Apache Arrow into high-performance internal tables.
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Reads and writes data in Parquet, CSV, JSON, and Avro formats without additional configuration.
AlaSQL is a JavaScript SQL database engine that allows for the filtering, grouping, and joining of in-memory object arrays and JSON data. It functions as an in-memory SQL database and client-side data processor, enabling the execution of SQL statements against JavaScript arrays and external data sources in both browser and server environments. The project serves as a universal data query tool capable of performing relational joins across diverse sources, such as merging Google Spreadsheets, SQLite files, and remote APIs into a single result set. It also acts as an IndexedDB SQL wrapper, allow
Provides the ability to read and process data from multiple formats including CSV, JSON, and Excel.
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Reads feature data from Parquet, CSV, JSON, HuggingFace, MongoDB, SQL, and more using Ray's native readers.
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Reads datasets from local files, remote repositories, and common formats using distributed readers.
This repository is the official documentation for TensorFlow, a machine learning framework. It provides comprehensive guides, tutorials, and API references for building, training, and deploying machine learning models. The documentation covers the full lifecycle of machine learning projects, from constructing data pipelines and building neural networks with high-level APIs to customizing training loops and deploying trained models in production, on edge devices, or in browsers. The documentation includes step-by-step tutorials for a range of tasks, including reinforcement learning, ranking mo
Reads CSV, image, and text data sources into processing pipelines for efficient input handling.
pgloader is a command-line tool that automates the migration of data and schema from various source databases and file formats into PostgreSQL. It combines schema discovery, parallel data pipelines, and type casting into a single, declarative workflow, using PostgreSQL's COPY protocol for high-throughput bulk loading. The tool distinguishes itself by compiling a dedicated command language into concurrent reader-writer pipelines that handle schema introspection, data transformation, and error-resilient batch processing. It supports migrating entire databases from MySQL, MS SQL, SQLite, and Pos
Loads data from a file path, standard input, inline data, or files matching a regex pattern.
PlotJuggler is an interactive time series visualization tool that loads, streams, and renders large datasets using hardware-accelerated OpenGL graphics. It functions as a multi-format data loader, supporting file formats such as CSV, ULog, and ROS bags, and also serves as a live data stream viewer that subscribes to real-time sources via MQTT, WebSockets, ZeroMQ, and UDP. The tool distinguishes itself through a plugin-based extensibility platform that allows users to add custom data sources, file formats, and processing capabilities. It includes a Lua scripting engine for creating custom data
Reads time series data from CSV, ULog, and ROS bag files for analysis and visualization.
River is a Python framework for online machine learning, designed to train and evaluate models on streaming data. It enables incremental learning by updating model parameters one observation at a time, eliminating the need to hold complete training datasets in memory. The library distinguishes itself through a dedicated concept drift detection system that monitors changes in data distributions to trigger model adaptation. It also provides a progressive validation framework that simulates real-time deployment by testing models on samples before using them for training. The system covers a broad range of streaming capabilities, including real-time feature engineering, time series forecasting, and online anomaly detection. It supports unsupervised learning via incremental clustering and decision trees, as well as ensemble aggregation and bandit policies for model selection. The project includes utilities for ingesting streaming data from sources such as CSV files and APIs, along with tools for computing running statistics and memory-efficient data sketches.
Reads CSV files as a sequence of dictionaries, converting columns to numeric types for online learning.
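The idea of reading a CSV as a sequence of dictionaries with numeric conversion can be sketched with the Python standard library alone. This is an illustration of the technique, not River's own API; the helper name `iter_csv_dicts` and its `converters` argument are assumptions made for the example.

```python
import csv
import io
from typing import Callable, Dict, Iterator, TextIO


def iter_csv_dicts(
    handle: TextIO,
    converters: Dict[str, Callable[[str], object]],
) -> Iterator[dict]:
    """Yield one dict per CSV row, applying a converter to each listed column.

    Rows are produced lazily, so the whole file never has to sit in memory,
    which is the property an online learner needs.
    """
    for row in csv.DictReader(handle):
        for column, convert in converters.items():
            row[column] = convert(row[column])
        yield row


# Usage: columns without a converter stay as strings.
sample = io.StringIO("x,y,label\n1.5,2,a\n0.25,7,b\n")
for observation in iter_csv_dicts(sample, {"x": float, "y": int}):
    print(observation)
```

Passing the converters as a mapping keeps the type decisions explicit per column, at the cost of having to know the schema up front.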
Anomalib is a PyTorch-based library for visual anomaly detection, offering a modular framework, a comprehensive model zoo, and a benchmarking suite designed for industrial defect detection. It provides a wide range of algorithms—including generative, discriminative, teacher-student, and vision-language approaches—that support unsupervised, few-shot, and zero-shot settings. The library enables deployment through model export to ONNX and OpenVINO for edge devices, and includes a no-code web application for training and inference. It also features a command-line interface for orchestrating multi
Reads images and video clips from disk, validates paths, and formats data for anomaly detection models.
NVIDIA DALI is a GPU-accelerated data loading and preprocessing library designed for deep learning workflows. It constructs high-performance data pipelines that offload decoding, augmentation, and normalization to the GPU, eliminating CPU bottlenecks in training and inference. The library reads data from multiple storage formats and streams it directly into GPU memory, with support for multi-GPU execution to scale throughput across large-scale workloads. DALI distinguishes itself by enabling data pipelines to be built once and executed across multiple deep learning frameworks without code cha
Reads data from LMDB, RecordIO, TFRecord, WebDataset, COCO, and NumPy formats to feed into processing pipelines.
Data on COVID-19 (coronavirus) cases, deaths, hospitalizations, tests • All countries • Updated daily by Our World in Data
Provides a regularly updated CSV distribution consolidating key COVID-19 metrics into a single downloadable file.
This is a grammar-of-graphics visualization library used to build charts by mapping tabular data to visual marks. It functions as an SVG data visualization tool and an exploratory data analysis API, allowing users to render complex visualizations and geographic maps. The library has a GeoJSON map rendering engine that projects spherical coordinates into two-dimensional pixel space, and an Apache Arrow visualization interface for high-efficiency data processing. Its capabilities cover data transformation via binning and grouping, visual encoding via automatic scale inference and color scheme application, and the generation of small multiples. It supports rendering geometric shapes in layered views and exporting static images in server-side environments.
Handles diverse data structures, including arrays of objects and Apache Arrow tables, to improve processing efficiency.
This project is an open-source search data index and a collection of historical search trend data provided as a public trends archive. It serves as an open dataset for analyzing global patterns and events through downloadable files. The repository provides an aggregated index of anonymized, normalized search and media datasets. These resources are designed for academic and professional analysis, enabling the study of longitudinal trends across different regions and time periods. The data supports global search trend analysis, market pattern analysis, and research into public interest. It enables open data acquisition for studying consumer interest, societal change, and search behavior.
Provides regularly updated CSV files that merge search metrics into a single downloadable distribution for analysis.
ExcelDataReader is a C# library used to extract data and metadata from Microsoft Excel spreadsheets and CSV files. It functions as a workbook parser that converts spreadsheet content into structured data sets for programmatic access and iteration. The project includes a specialized metadata extractor for retrieving cell-level details, such as number formats, styles, row heights, column widths, and merged cell ranges. It also provides a stream processor for parsing plain text CSV files with customizable encoding and separator detection. The library supports the OpenXML standard for modern spr
Parses plain-text streams of comma-separated values, with customizable encoding and separator detection.
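Separator detection with a configurable encoding can be approximated as follows. This is a simplified heuristic written in Python for illustration, not ExcelDataReader's actual algorithm (which is a C# library); the function names and the candidate list are assumptions, and the heuristic ignores quoted fields that contain a separator.

```python
import csv
import io
from typing import List, Sequence


def detect_separator(
    sample: str,
    candidates: Sequence[str] = (",", ";", "\t", "|"),
) -> str:
    """Pick the candidate that occurs the same non-zero number of times on every line.

    Ties go to the candidate with the higher per-line count; if nothing
    qualifies, the first candidate is returned as a default.
    """
    lines = [line for line in sample.splitlines() if line]
    best, best_count = candidates[0], 0
    for separator in candidates:
        counts = {line.count(separator) for line in lines}
        if len(counts) == 1:
            count = counts.pop()
            if count > best_count:
                best, best_count = separator, count
    return best


def read_csv_bytes(data: bytes, encoding: str = "utf-8") -> List[List[str]]:
    """Decode a byte stream with the given encoding, then parse it with the detected separator."""
    text = data.decode(encoding)
    separator = detect_separator(text)
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=separator))
```

Decoding before detection keeps the two concerns separate: the caller chooses the encoding, and the separator is inferred from the decoded text.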
docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas. The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
Imports data from standard files or custom parsing tools for non-standard formats like audio and PDFs.
mcp-context-forge is a Model Context Protocol federation gateway that unifies diverse AI tool servers and APIs into a single consistent interface for discovery and execution. It acts as a centralized proxy that aggregates multiple servers and APIs, allowing AI agents to access and invoke a unified set of tools, prompts, and resources. The project distinguishes itself through a multi-protocol translation bridge that converts communication between standard I/O, SSE, gRPC, and REST to enable interoperability between disparate tool servers. It includes a comprehensive LLM evaluation framework for
Imports data from multiple formats including CSV, JSON, Parquet, Excel, and SQL into a managed cache.