31 repositories
Constructs data loaders specifically for reading structured data from comma-separated files.
Distinct from Tabular Data Frameworks: focuses on CSV-specific ingestion logic rather than general tabular data handling.
Explore 31 GitHub repositories matching Data & Databases · CSV Data Loaders.
Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines. The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex q
Isolates table structures from raw CSV content for document integration.
TensorFlow.js is a JavaScript machine learning library used for training and deploying models in web browsers and server-side environments. It functions as a browser-based model trainer, a WebAssembly inference engine, and a WebGPU accelerated tensor library for low-level linear algebra. The project also includes a model converter to transform Python-based models into optimized formats for JavaScript execution. The library distinguishes itself through a pluggable backend architecture that allows mathematical operations to be executed via CPU, WebGL, or WebGPU. It supports the conversion of Py
Imports datasets from disk or web sources in various formats for machine learning use.
This project is an international phone number library used for parsing, formatting, and validating phone numbers based on the E.164 standard. It provides a validation engine and parser to convert raw strings into structured objects and verify if numbers conform to regional numbering rules. The library includes a metadata provider that maps phone numbers to geographic locations, time zones, and network carriers. It can distinguish between line types, such as fixed-line or mobile, to verify SMS compatibility and identify original network operators. Additional capabilities include extracting ph
Implements a metadata engine that loads regional phone number rules from CSV files.
This repository serves as a public archive for the raw datasets and analytical code used to support journalistic reporting. It functions as a platform for reproducible research, providing the necessary materials for users to verify published findings and conduct independent statistical analysis. The collection utilizes a versioned storage model to track historical changes to both data and processing scripts. By organizing information into a structured directory hierarchy, the repository maps specific journalistic projects to their corresponding inputs and outputs, ensuring that the methodolog
Delivers structured information in lightweight, human-readable CSV formats for broad analytical compatibility.
dbt-core is a command-line framework for transforming data within a warehouse using modular SQL and version control. It functions as a data transformation engine that enables users to define data structures and business logic through declarative configuration files, which the system then compiles into executable code. By managing complex data dependencies through a directed acyclic graph, it ensures that transformation tasks execute in the correct order while maintaining a manifest-driven state to track lineage and execution history. The project distinguishes itself through an adapter-based d
Imports local comma-separated files into the data warehouse as queryable tables to support data transformation workflows.
Perspective is a columnar data analytics engine and high-performance visualization component powered by WebAssembly. It provides a system for analyzing and visualizing large or streaming datasets through interactive data grids and charts, utilizing a compiled binary to achieve near-native performance within the browser. The project distinguishes itself through a WebSocket-based data streaming interface and deep Apache Arrow integration, which minimize memory overhead when synchronizing tables between servers and clients. It acts as a remote query proxy capable of translating visualization con
Loads data from multiple formats including CSV, JSON, and Apache Arrow into high-performance internal tables.
GoLearn is a machine learning library for the Go programming language. It provides a supervised learning framework and a toolkit for building, training, and evaluating predictive models through a standardized interface. The project implements a data frame system that loads CSV files into structured grids for matrix operations. It includes a preprocessing library for discretizing continuous variables and a model evaluation toolkit that utilizes confusion matrices and cross-validation to measure precision and recall. The library covers data engineering and management, including the ability to
Ships a dedicated CSV data loader for reading structured data from comma-separated files into ML grids.
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Reads and writes data in Parquet, CSV, JSON, and Avro formats without additional configuration.
Gyroflow is a gyroscope video stabilization software and IMU telemetry processor designed to remove camera shake from video files. It functions as a hardware-accelerated video renderer and lens calibration tool, utilizing embedded or external gyroscope and accelerometer data to perform pixel-level stabilization. The system is distinguished by its ability to integrate with professional non-linear video editing software via plugins, allowing stabilization to be applied directly to timelines without transcoding original footage. It supports diverse telemetry ingestion from camera brands, flight
Reads sensor data from standardized text-based CSV sidecar files to provide logs for stabilization.
AlaSQL is a JavaScript SQL database engine that allows for the filtering, grouping, and joining of in-memory object arrays and JSON data. It functions as an in-memory SQL database and client-side data processor, enabling the execution of SQL statements against JavaScript arrays and external data sources in both browser and server environments. The project serves as a universal data query tool capable of performing relational joins across diverse sources, such as merging Google Spreadsheets, SQLite files, and remote APIs into a single result set. It also acts as an IndexedDB SQL wrapper, allow
Provides the ability to read and process data from multiple formats including CSV, JSON, and Excel.
This project is a public health monitoring platform and data aggregator that tracks COVID-19 statistics, recovery rates, and vaccination data across India. It functions as a public health data repository, archiving epidemiological metrics for regional impact tracking and research. The platform transforms raw health statistics into an interactive data visualization site. It utilizes a series of dashboards to convert these statistics into visual trends, allowing for the monitoring of regional impacts. The system provides capabilities for epidemiological data analysis, including the collection
Employs CSV files as the primary data source to simplify version control and manual updates via git.
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Reads feature data from Parquet, CSV, JSON, HuggingFace, MongoDB, SQL, and more using Ray's native readers.
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Reads datasets from local files, remote repositories, and common formats using distributed readers.
DeepChem is an open-source Python framework for applying deep learning to molecular, chemical, and biological data, serving as a comprehensive toolkit for drug discovery and materials science. At its core, it provides a featurizer-pipeline abstraction that converts raw molecular data into numerical representations, including graph-based molecular structures, SMILES tokenization vocabularies, and disk-sharded dataset persistence for handling large-scale data that exceeds RAM capacity. The framework distinguishes itself through integrated molecular docking workflows that automate pocket detecti
Reads tabular data from CSV files, applies a featurizer, and stores the result as a Dataset.
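The read, featurize, store flow described above can be sketched with the Python standard library alone. This is a minimal illustration, not DeepChem's API: `SimpleDataset`, `load_featurized_csv`, `count_atoms`, the column names, and the sample values are all hypothetical, and the toy featurizer only counts characters rather than doing any real chemistry.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, List, TextIO


@dataclass
class SimpleDataset:
    """Hypothetical container: raw identifiers, feature vectors, and targets."""
    ids: List[str]
    X: List[List[float]]
    y: List[float]


def load_featurized_csv(
    handle: TextIO,
    feature_field: str,
    target_field: str,
    featurize: Callable[[str], List[float]],
) -> SimpleDataset:
    """Read CSV rows, featurize one column, and collect the result."""
    ids: List[str] = []
    X: List[List[float]] = []
    y: List[float] = []
    for row in csv.DictReader(handle):
        raw = row[feature_field]
        ids.append(raw)
        X.append(featurize(raw))           # raw string -> numeric vector
        y.append(float(row[target_field]))  # target column parsed as float
    return SimpleDataset(ids=ids, X=X, y=y)


def count_atoms(smiles: str) -> List[float]:
    """Toy featurizer (assumption): counts of the characters C, N, and O."""
    return [float(smiles.count(symbol)) for symbol in "CNO"]


# Hypothetical in-memory sample standing in for a CSV file on disk.
sample = io.StringIO("smiles,solubility\nCCO,0.5\nCN,1.2\n")
dataset = load_featurized_csv(sample, "smiles", "solubility", count_atoms)
```

Passing the featurizer in as a plain callable keeps the loader independent of any particular molecular representation, which mirrors the separation the description implies.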
This repository is the official documentation for TensorFlow, a machine learning framework. It provides comprehensive guides, tutorials, and API references for building, training, and deploying machine learning models. The documentation covers the full lifecycle of machine learning projects, from constructing data pipelines and building neural networks with high-level APIs to customizing training loops and deploying trained models in production, on edge devices, or in browsers. The documentation includes step-by-step tutorials for a range of tasks, including reinforcement learning, ranking mo
Reads CSV, image, and text data sources into processing pipelines for efficient input handling.
pgloader is a command-line tool that automates the migration of data and schema from various source databases and file formats into PostgreSQL. It combines schema discovery, parallel data pipelines, and type casting into a single, declarative workflow, using PostgreSQL's COPY protocol for high-throughput bulk loading. The tool distinguishes itself by compiling a dedicated command language into concurrent reader-writer pipelines that handle schema introspection, data transformation, and error-resilient batch processing. It supports migrating entire databases from MySQL, MS SQL, SQLite, and Pos
Reads structured data from CSV files and inserts it into PostgreSQL tables using the COPY command.
PlotJuggler is an interactive time series visualization tool that loads, streams, and renders large datasets using hardware-accelerated OpenGL graphics. It functions as a multi-format data loader, supporting file formats such as CSV, ULog, and ROS bags, and also serves as a live data stream viewer that subscribes to real-time sources via MQTT, WebSockets, ZeroMQ, and UDP. The tool distinguishes itself through a plugin-based extensibility platform that allows users to add custom data sources, file formats, and processing capabilities. It includes a Lua scripting engine for creating custom data
Reads time series data from CSV, ULog, and ROS bag files for analysis and visualization.
River is a Python framework for online machine learning, designed to train and evaluate models on streaming data. It enables incremental learning by updating model parameters one observation at a time, eliminating the need to store complete training datasets in memory. The library distinguishes itself through a dedicated concept drift detection system that monitors changes in data distributions to trigger model adaptation. It also provides a progressive validation framework that simulates real-time deployment by testing models on samples before using them for training. The system covers a wide range of streaming capabilities, including real-time feature engineering, time series forecasting, and online anomaly detection. It supports unsupervised learning through incremental clustering and decision trees, as well as ensemble aggregation and bandit policies for model selection. The project includes utilities for ingesting streaming data from sources such as CSV files and APIs, as well as tools for computing running statistics and memory-efficient data sketches.
Reads CSV files as a sequence of dictionaries, converting columns to numeric types for online learning.
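Reading a CSV file as a lazy sequence of dictionaries with numeric conversion can be sketched as follows. This uses only the standard library `csv` module and is not River's own API; `iter_csv_rows`, the `converters` parameter, and the sample columns are hypothetical names chosen for illustration.

```python
import csv
import io
from typing import Callable, Dict, Iterator, Optional, TextIO


def iter_csv_rows(
    handle: TextIO,
    converters: Optional[Dict[str, Callable[[str], object]]] = None,
) -> Iterator[Dict[str, object]]:
    """Yield one dict per CSV row, converting the listed columns on the fly."""
    converters = converters or {}
    for row in csv.DictReader(handle):
        parsed: Dict[str, object] = dict(row)  # every value starts as a string
        for column, convert in converters.items():
            parsed[column] = convert(row[column])
        yield parsed  # one observation at a time; the file is never fully loaded


# Hypothetical in-memory sample standing in for a CSV file on disk.
sample = io.StringIO("temperature,humidity,label\n21.5,40,ok\n30.1,85,alert\n")

rows = list(iter_csv_rows(sample, converters={"temperature": float, "humidity": int}))
```

Because the function is a generator, a consumer can update a model after each yielded row instead of materializing the whole dataset; the `list(...)` call here is only for demonstration.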
Anomalib is a PyTorch-based library for visual anomaly detection, offering a modular framework, a comprehensive model zoo, and a benchmarking suite designed for industrial defect detection. It provides a wide range of algorithms—including generative, discriminative, teacher-student, and vision-language approaches—that support unsupervised, few-shot, and zero-shot settings. The library enables deployment through model export to ONNX and OpenVINO for edge devices, and includes a no-code web application for training and inference. It also features a command-line interface for orchestrating multi
Reads images and video clips from disk, validates paths, and formats data for anomaly detection models.
NVIDIA DALI is a GPU-accelerated data loading and preprocessing library designed for deep learning workflows. It constructs high-performance data pipelines that offload decoding, augmentation, and normalization to the GPU, eliminating CPU bottlenecks in training and inference. The library reads data from multiple storage formats and streams it directly into GPU memory, with support for multi-GPU execution to scale throughput across large-scale workloads. DALI distinguishes itself by enabling data pipelines to be built once and executed across multiple deep learning frameworks without code cha
Reads data from LMDB, RecordIO, TFRecord, WebDataset, COCO, and NumPy formats to feed into processing pipelines.