Why is Cinnamon/kotaemon a recommended CSV data loader repository?

Isolates table structures from raw CSV content for document integration.

Why is tensorflow/tfjs a recommended CSV data loader repository?

Imports datasets from disk or web sources in various formats for machine learning use.

Why is google/libphonenumber a recommended CSV data loader repository?

Implements a metadata engine that loads regional phone number rules from CSV files.

Why is fivethirtyeight/data a recommended CSV data loader repository?

Delivers structured information in lightweight, human-readable CSV formats for broad analytical compatibility.

Why is dbt-labs/dbt-core a recommended CSV data loader repository?

Imports local comma-separated files into the data warehouse as queryable tables to support data transformation workflows.

Why is perspective-dev/perspective a recommended CSV data loader repository?

Loads data from multiple formats including CSV, JSON, and Apache Arrow into high-performance internal tables.

Why is sjwhitworth/golearn a recommended CSV data loader repository?

Ships a dedicated CSV data loader for reading structured data from comma-separated files into ML grids.

Why is apache/datafusion a recommended CSV data loader repository?

Reads and writes data in Parquet, CSV, JSON, and Avro formats without additional configuration.

Why is gyroflow/gyroflow a recommended CSV data loader repository?

Reads sensor data from standardized text-based CSV sidecar files to provide logs for stabilization.

Why is AlaSQL/alasql a recommended CSV data loader repository?

Reads and processes data from multiple formats including CSV, JSON, and Excel.

31 repositories

Awesome GitHub Repositories: CSV Data Loaders

Constructs data loaders specifically for reading structured data from comma-separated files.

Distinct from Tabular Data Frameworks: focuses on CSV-specific ingestion logic rather than general tabular processing.

Explore 31 awesome GitHub repositories matching data & databases · CSV Data Loaders. Refine with filters or upvote what's useful.


Cinnamon/kotaemon
25,139 stars
Kotaemon is an orchestration framework designed for building modular, agentic workflows that integrate document processing, retrieval-augmented generation, and multi-step reasoning. It provides a comprehensive platform for developing document-based question answering systems, allowing users to chain language models, prompt templates, and external tools into complex, automated pipelines. The system distinguishes itself through a highly modular architecture that emphasizes component-based composition and schema-driven data exchange. It supports autonomous agents capable of decomposing complex q
Isolates table structures from raw CSV content for document integration.
Python · chatbot, llms, open-source

tensorflow/tfjs
19,134 stars
TensorFlow.js is a JavaScript machine learning library used for training and deploying models in web browsers and server-side environments. It functions as a browser-based model trainer, a WebAssembly inference engine, and a WebGPU accelerated tensor library for low-level linear algebra. The project also includes a model converter to transform Python-based models into optimized formats for JavaScript execution. The library distinguishes itself through a pluggable backend architecture that allows mathematical operations to be executed via CPU, WebGL, or WebGPU. It supports the conversion of Py
Imports datasets from disk or web sources in various formats for machine learning use.
TypeScript

google/libphonenumber
18,077 stars
This project is an international phone number library used for parsing, formatting, and validating phone numbers based on the E.164 standard. It provides a validation engine and parser to convert raw strings into structured objects and verify if numbers conform to regional numbering rules. The library includes a metadata provider that maps phone numbers to geographic locations, time zones, and network carriers. It can distinguish between line types, such as fixed-line or mobile, to verify SMS compatibility and identify original network operators. Additional capabilities include extracting ph
Implements a metadata engine that loads regional phone number rules from CSV files.
C++

fivethirtyeight/data
17,394 stars
This repository serves as a public archive for the raw datasets and analytical code used to support journalistic reporting. It functions as a platform for reproducible research, providing the necessary materials for users to verify published findings and conduct independent statistical analysis. The collection utilizes a versioned storage model to track historical changes to both data and processing scripts. By organizing information into a structured directory hierarchy, the repository maps specific journalistic projects to their corresponding inputs and outputs, ensuring that the methodolog
Delivers structured information in lightweight, human-readable CSV formats for broad analytical compatibility.
Jupyter Notebook · data

dbt-labs/dbt-core
13,051 stars
dbt-core is a command-line framework for transforming data within a warehouse using modular SQL and version control. It functions as a data transformation engine that enables users to define data structures and business logic through declarative configuration files, which the system then compiles into executable code. By managing complex data dependencies through a directed acyclic graph, it ensures that transformation tasks execute in the correct order while maintaining a manifest-driven state to track lineage and execution history. The project distinguishes itself through an adapter-based d
Imports local comma-separated files into the data warehouse as queryable tables to support data transformation workflows.
Rust · analytics, business-intelligence, data-modeling
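The idea of turning a local CSV file into a queryable table can be sketched with the Python standard library alone. This is not dbt-core's implementation: `sqlite3` stands in for the data warehouse, and the helper `load_csv_as_table`, the table name `country_codes`, and the sample rows are all hypothetical.

```python
import csv
import io
import sqlite3


def load_csv_as_table(conn, table_name, handle):
    """Create a table named after the file and insert every CSV row into it.

    Assumption: the first CSV row is a header, and table/column names are trusted
    (they are interpolated into SQL, so never pass untrusted names here).
    """
    reader = csv.reader(handle)
    header = next(reader)
    rows = list(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
    conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


# Hypothetical seed file contents, held in memory so the sketch is self-contained.
SEED_CSV = "code,name\nDE,Germany\nFR,France\n"

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection
inserted = load_csv_as_table(conn, "country_codes", io.StringIO(SEED_CSV))
result = conn.execute(
    "SELECT name FROM country_codes WHERE code = ?", ("DE",)
).fetchall()
print(inserted, result)
```

Once loaded, the table can be joined against other tables like any other relation, which is what makes small static lookup files useful in transformation workflows.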

perspective-dev/perspective
10,981 stars
Perspective is a columnar data analytics engine and high-performance visualization component powered by WebAssembly. It provides a system for analyzing and visualizing large or streaming datasets through interactive data grids and charts, utilizing a compiled binary to achieve near-native performance within the browser. The project distinguishes itself through a WebSocket-based data streaming interface and deep Apache Arrow integration, which minimize memory overhead when synchronizing tables between servers and clients. It acts as a remote query proxy capable of translating visualization con
Loads data from multiple formats including CSV, JSON, and Apache Arrow into high-performance internal tables.
C++ · analytics, bi, data-visualization

sjwhitworth/golearn
9,438 stars
GoLearn is a machine learning library for the Go programming language. It provides a supervised learning framework and a toolkit for building, training, and evaluating predictive models through a standardized interface. The project implements a data frame system that loads CSV files into structured grids for matrix operations. It includes a preprocessing library for discretizing continuous variables and a model evaluation toolkit that utilizes confusion matrices and cross-validation to measure precision and recall. The library covers data engineering and management, including the ability to
Ships a dedicated CSV data loader for reading structured data from comma-separated files into ML grids.
Go

apache/datafusion
8,908 stars
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Reads and writes data in Parquet, CSV, JSON, and Avro formats without additional configuration.
Rust · arrow, big-data, dataframe

gyroflow/gyroflow
8,256 stars
Gyroflow is a gyroscope video stabilization software and IMU telemetry processor designed to remove camera shake from video files. It functions as a hardware-accelerated video renderer and lens calibration tool, utilizing embedded or external gyroscope and accelerometer data to perform pixel-level stabilization. The system is distinguished by its ability to integrate with professional non-linear video editing software via plugins, allowing stabilization to be applied directly to timelines without transcoding original footage. It supports diverse telemetry ingestion from camera brands, flight
Reads sensor data from standardized text-based CSV sidecar files to provide logs for stabilization.
Rust · fpv, gopro, gpu

AlaSQL/alasql
7,278 stars
AlaSQL is a JavaScript SQL database engine that allows for the filtering, grouping, and joining of in-memory object arrays and JSON data. It functions as an in-memory SQL database and client-side data processor, enabling the execution of SQL statements against JavaScript arrays and external data sources in both browser and server environments. The project serves as a universal data query tool capable of performing relational joins across diverse sources, such as merging Google Spreadsheets, SQLite files, and remote APIs into a single result set. It also acts as an IndexedDB SQL wrapper, allow
Provides the ability to read and process data from multiple formats including CSV, JSON, and Excel.
JavaScript

covid19india/covid19india.github.io
6,808 stars
This project is a public health monitoring platform and data aggregator that tracks COVID-19 statistics, recovery rates, and vaccination data across India. It functions as a public health data repository, archiving epidemiological metrics for regional impact tracking and research. The platform transforms raw health statistics into an interactive data visualization site. It utilizes a series of dashboards to convert these statistics into visual trends, allowing for the monitoring of regional impacts. The system provides capabilities for epidemiological data analysis, including the collection
Employs CSV files as the primary data source to simplify version control and manual updates via git.
JavaScript · analytics, coronavirus, covid-19

feast-dev/feast
6,727 stars
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Reads feature data from Parquet, CSV, JSON, HuggingFace, MongoDB, SQL, and more using Ray's native readers.
Python · big-data, data-engineering, data-quality

datajuicer/data-juicer
6,574 stars
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Reads datasets from local files, remote repositories, and common formats using distributed readers.
Python · data, data-analysis, data-pipeline

deepchem/deepchem
6,545 stars
DeepChem is an open-source Python framework for applying deep learning to molecular, chemical, and biological data, serving as a comprehensive toolkit for drug discovery and materials science. At its core, it provides a featurizer-pipeline abstraction that converts raw molecular data into numerical representations, including graph-based molecular structures, SMILES tokenization vocabularies, and disk-sharded dataset persistence for handling large-scale data that exceeds RAM capacity. The framework distinguishes itself through integrated molecular docking workflows that automate pocket detecti
Reads tabular data from CSV files, applies a featurizer, and stores the result as a Dataset.
Python · biology, deep-learning, drug-discovery
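The read, featurize, store pattern described above can be illustrated with a standard-library sketch. It does not use DeepChem's API: `ToyDataset`, `toy_featurizer`, `load_featurized`, the column names, and the sample values are hypothetical, and the "featurizer" is a deliberately trivial character count rather than a real molecular descriptor.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyDataset:
    """Hypothetical in-memory container: identifiers, feature vectors, targets."""
    ids: List[str] = field(default_factory=list)
    X: List[List[float]] = field(default_factory=list)
    y: List[float] = field(default_factory=list)


def toy_featurizer(smiles: str) -> List[float]:
    # Placeholder features: string length and counts of 'C' and 'O' characters.
    return [float(len(smiles)), float(smiles.count("C")), float(smiles.count("O"))]


def load_featurized(
    handle,
    feature_field: str,
    target_field: str,
    featurizer: Callable[[str], List[float]],
) -> ToyDataset:
    """Read each CSV row, featurize one column, and collect the results."""
    dataset = ToyDataset()
    for row in csv.DictReader(handle):
        dataset.ids.append(row[feature_field])
        dataset.X.append(featurizer(row[feature_field]))
        dataset.y.append(float(row[target_field]))
    return dataset


# Made-up illustrative rows, not real measurements.
SAMPLE_CSV = "smiles,target\nCCO,1.0\nCC(=O)O,0.0\n"

dataset = load_featurized(io.StringIO(SAMPLE_CSV), "smiles", "target", toy_featurizer)
print(dataset.X, dataset.y)
```

Keeping the featurizer as a plain callable means the loading loop stays the same when the feature computation is swapped for something more realistic.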

tensorflow/docs
6,320 stars
This repository is the official documentation for TensorFlow, a machine learning framework. It provides comprehensive guides, tutorials, and API references for building, training, and deploying machine learning models. The documentation covers the full lifecycle of machine learning projects, from constructing data pipelines and building neural networks with high-level APIs to customizing training loops and deploying trained models in production, on edge devices, or in browsers. The documentation includes step-by-step tutorials for a range of tasks, including reinforcement learning, ranking mo
Reads CSV, image, and text data sources into processing pipelines for efficient input handling.
Jupyter Notebook · deep-learning, deep-neural-networks, documentation

dimitri/pgloader
6,295 stars
pgloader is a command-line tool that automates the migration of data and schema from various source databases and file formats into PostgreSQL. It combines schema discovery, parallel data pipelines, and type casting into a single, declarative workflow, using PostgreSQL's COPY protocol for high-throughput bulk loading. The tool distinguishes itself by compiling a dedicated command language into concurrent reader-writer pipelines that handle schema introspection, data transformation, and error-resilient batch processing. It supports migrating entire databases from MySQL, MS SQL, SQLite, and Pos
Reads structured data from CSV files and inserts it into PostgreSQL tables using the COPY command.
Common Lisp · clozure-cl, common-lisp, csv

facontidavide/PlotJuggler
5,957 stars
PlotJuggler is an interactive time series visualization tool that loads, streams, and renders large datasets using hardware-accelerated OpenGL graphics. It functions as a multi-format data loader, supporting file formats such as CSV, ULog, and ROS bags, and also serves as a live data stream viewer that subscribes to real-time sources via MQTT, WebSockets, ZeroMQ, and UDP. The tool distinguishes itself through a plugin-based extensibility platform that allows users to add custom data sources, file formats, and processing capabilities. It includes a Lua scripting engine for creating custom data
Reads time series data from CSV, ULog, and ROS bag files for analysis and visualization.
C++

online-ml/river
5,853 stars
River is a Python framework for online machine learning, designed to train and evaluate models on streaming data. It enables incremental learning by updating model parameters one observation at a time, eliminating the need to hold complete training datasets in memory. The library is distinguished by a dedicated concept drift detection system that monitors changes in data distributions to trigger model adaptation. It also provides a progressive validation framework that simulates real-time deployment by testing models on samples before using them for training. The system covers a wide range of streaming capabilities, including real-time feature engineering, time series forecasting, and online anomaly detection. It supports unsupervised learning through incremental clustering and decision trees, as well as ensemble aggregation and bandit policies for model selection. The project includes utilities for ingesting streaming data from sources such as CSV files and APIs, along with tools for computing running statistics and memory-efficient data sketches.
Reads CSV files as a sequence of dictionaries, converting columns to numeric types for online learning.
Python
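Reading a CSV file as a sequence of dictionaries with numeric conversion can be sketched using only the standard library. This is an illustration of the pattern, not River's own code: the generator `iter_csv_rows`, its `converters` argument, and the sample columns are assumptions made for the example.

```python
import csv
import io
from typing import Callable, Dict, Iterator, Optional


def iter_csv_rows(
    handle, converters: Optional[Dict[str, Callable]] = None
) -> Iterator[dict]:
    """Yield one dict per CSV row, converting the named columns as it goes.

    Rows are produced lazily, so the whole file never has to sit in memory.
    Columns without a converter stay as strings.
    """
    converters = converters or {}
    for row in csv.DictReader(handle):
        for name, convert in converters.items():
            row[name] = convert(row[name])
        yield row


# Hypothetical sample data standing in for a file on disk.
SAMPLE_CSV = "temperature,humidity,label\n21.5,40,ok\n30.1,85,alert\n"

# list() is used here only to show the result; a learner would consume rows one by one.
rows = list(
    iter_csv_rows(io.StringIO(SAMPLE_CSV), {"temperature": float, "humidity": int})
)
print(rows[0])
```

Because the function is a generator, each row can be handed to an incremental model as soon as it is parsed, which matches the one-observation-at-a-time style of online learning.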

open-edge-platform/anomalib
5,871 stars
Anomalib is a PyTorch-based library for visual anomaly detection, offering a modular framework, a comprehensive model zoo, and a benchmarking suite designed for industrial defect detection. It provides a wide range of algorithms—including generative, discriminative, teacher-student, and vision-language approaches—that support unsupervised, few-shot, and zero-shot settings. The library enables deployment through model export to ONNX and OpenVINO for edge devices, and includes a no-code web application for training and inference. It also features a command-line interface for orchestrating multi
Reads images and video clips from disk, validates paths, and formats data for anomaly detection models.
Python · anomaly-detection, anomaly-localization, anomaly-segmentation

NVIDIA/DALI
5,713 stars
NVIDIA DALI is a GPU-accelerated data loading and preprocessing library designed for deep learning workflows. It constructs high-performance data pipelines that offload decoding, augmentation, and normalization to the GPU, eliminating CPU bottlenecks in training and inference. The library reads data from multiple storage formats and streams it directly into GPU memory, with support for multi-GPU execution to scale throughput across large-scale workloads. DALI distinguishes itself by enabling data pipelines to be built once and executed across multiple deep learning frameworks without code cha
Reads data from LMDB, RecordIO, TFRecord, WebDataset, COCO, and NumPy formats to feed into processing pipelines.
C++ · audio-processing, data-augmentation, data-processing
