18 repositories
Loads data from a file path, standard input, inline data, or files matching a regex pattern in a specified directory.
Distinct from CSV Data Loaders: focuses on loading CSV from multiple source types, not just file-based CSV loading.
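As a rough illustration of the four source types named above (file path, standard input, inline data, and regex-matched files in a directory), here is a minimal Python sketch using only the standard library. The function name `load_csv`, its parameters, and the `"-"` convention for standard input are assumptions made for this example; they are not taken from any repository in this list.

```python
import csv
import io
import re
import sys
from pathlib import Path
from typing import Iterator, Optional


def read_rows(text_stream) -> Iterator[dict]:
    """Yield each CSV record from an open text stream as a dict keyed by header."""
    yield from csv.DictReader(text_stream)


def load_csv(source: str, *, inline: bool = False,
             pattern: Optional[str] = None) -> Iterator[dict]:
    """Hypothetical dispatcher over the four source types.

    source  -- a file path, a directory (when `pattern` is given), "-" for
               standard input, or the CSV text itself (when `inline` is True)
    pattern -- regular expression matched against file names in the directory
    """
    if inline:
        # Inline data: treat the string itself as the CSV content.
        yield from read_rows(io.StringIO(source))
    elif source == "-":
        # Standard input: assumed convention, "-" means read from stdin.
        yield from read_rows(sys.stdin)
    elif pattern is not None:
        # Directory plus regex: load every matching file in name order.
        regex = re.compile(pattern)
        for path in sorted(Path(source).iterdir()):
            if path.is_file() and regex.search(path.name):
                with path.open(newline="", encoding="utf-8") as handle:
                    yield from read_rows(handle)
    else:
        # Plain file path.
        with Path(source).open(newline="", encoding="utf-8") as handle:
            yield from read_rows(handle)
```

For example, `list(load_csv("id,name\n1,a\n", inline=True))` yields one row, while `load_csv("data", pattern=r"^part_\d+\.csv$")` would read every matching file in a hypothetical `data` directory.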
Explore 18 awesome GitHub repositories matching data & databases · Multi-Source CSV Loading.
TensorFlow.js is a JavaScript machine learning library used for training and deploying models in web browsers and server-side environments. It functions as a browser-based model trainer, a WebAssembly inference engine, and a WebGPU accelerated tensor library for low-level linear algebra. The project also includes a model converter to transform Python-based models into optimized formats for JavaScript execution. The library distinguishes itself through a pluggable backend architecture that allows mathematical operations to be executed via CPU, WebGL, or WebGPU. It supports the conversion of Py
Imports datasets from disk or web sources in various formats for machine learning use.
Perspective is a columnar data analytics engine and high-performance visualization component powered by WebAssembly. It provides a system for analyzing and visualizing large or streaming datasets through interactive data grids and charts, utilizing a compiled binary to achieve near-native performance within the browser. The project distinguishes itself through a WebSocket-based data streaming interface and deep Apache Arrow integration, which minimize memory overhead when synchronizing tables between servers and clients. It acts as a remote query proxy capable of translating visualization con
Loads data from multiple formats including CSV, JSON, and Apache Arrow into high-performance internal tables.
Apache DataFusion is an extensible, columnar SQL query engine that runs embedded within a host application without requiring a separate server process. It processes data in columnar batches using Apache Arrow for memory-efficient analytics, and can scale analytic workloads across multiple nodes for parallel execution. The engine supports both SQL and DataFrame queries through a modular, streaming architecture that allows custom operators, data sources, functions, and optimizer rules. The engine distinguishes itself through its modular extension framework, which enables building custom query e
Reads and writes data in Parquet, CSV, JSON, and Avro formats without additional configuration.
AlaSQL is a JavaScript SQL database engine that allows for the filtering, grouping, and joining of in-memory object arrays and JSON data. It functions as an in-memory SQL database and client-side data processor, enabling the execution of SQL statements against JavaScript arrays and external data sources in both browser and server environments. The project serves as a universal data query tool capable of performing relational joins across diverse sources, such as merging Google Spreadsheets, SQLite files, and remote APIs into a single result set. It also acts as an IndexedDB SQL wrapper, allow
Provides the ability to read and process data from multiple formats including CSV, JSON, and Excel.
Feast is an open-source feature store for machine learning that provides a central platform for defining, storing, and serving features across both training and inference workflows. It operates as a declarative system where feature definitions are written as code in Python files, synchronized to a central registry, and made available for low-latency online retrieval or point-in-time correct historical joins for training datasets. The project abstracts storage behind a pluggable architecture, allowing offline and online backends to be swapped without changing retrieval logic, and coordinates ma
Reads feature data from Parquet, CSV, JSON, HuggingFace, MongoDB, SQL, and more using Ray's native readers.
Data-Juicer is an open-source framework for cleaning, filtering, deduplicating, and transforming multimodal datasets to prepare them for training large language and vision models. It functions as a distributed data pipeline engine that runs processing jobs across Ray clusters, handling billions of samples with automatic operator fusion and adaptive parallelism. The framework provides a library of operators that leverage large language models for semantic extraction, filtering, and data synthesis within processing pipelines. The project distinguishes itself through a YAML-based data recipe sys
Reads datasets from local files, remote repositories, and common formats using distributed readers.
This repository is the official documentation for TensorFlow, a machine learning framework. It provides comprehensive guides, tutorials, and API references for building, training, and deploying machine learning models. The documentation covers the full lifecycle of machine learning projects, from constructing data pipelines and building neural networks with high-level APIs to customizing training loops and deploying trained models in production, on edge devices, or in browsers. The documentation includes step-by-step tutorials for a range of tasks, including reinforcement learning, ranking mo
Reads CSV, image, and text data sources into processing pipelines for efficient input handling.
pgloader is a command-line tool that automates the migration of data and schema from various source databases and file formats into PostgreSQL. It combines schema discovery, parallel data pipelines, and type casting into a single, declarative workflow, using PostgreSQL's COPY protocol for high-throughput bulk loading. The tool distinguishes itself by compiling a dedicated command language into concurrent reader-writer pipelines that handle schema introspection, data transformation, and error-resilient batch processing. It supports migrating entire databases from MySQL, MS SQL, SQLite, and Pos
Loads data from a file path, standard input, inline data, or files matching a regex pattern.
PlotJuggler is an interactive time series visualization tool that loads, streams, and renders large datasets using hardware-accelerated OpenGL graphics. It functions as a multi-format data loader, supporting file formats such as CSV, ULog, and ROS bags, and also serves as a live data stream viewer that subscribes to real-time sources via MQTT, WebSockets, ZeroMQ, and UDP. The tool distinguishes itself through a plugin-based extensibility platform that allows users to add custom data sources, file formats, and processing capabilities. It includes a Lua scripting engine for creating custom data
Reads time series data from CSV, ULog, and ROS bag files for analysis and visualization.
River is a Python framework for online machine learning, designed for training and evaluating models on streaming data. It enables incremental learning by updating model parameters one observation at a time, eliminating the need to store full training datasets in memory. The library features a dedicated concept drift detection system that monitors changes in data distributions to trigger model adaptation. It also provides a progressive validation framework that simulates real-time deployment by testing models on samples before they are used for training. The system covers a wide range of streaming capabilities, including real-time feature engineering, time series forecasting, and online anomaly detection. It supports unsupervised learning through incremental clustering and decision trees, as well as model ensembling and bandit policies for model selection. The project includes tools for ingesting streaming data from sources such as CSV files and APIs, along with utilities for computing running statistics and memory-efficient data sketches.
Reads CSV files as a sequence of dictionaries, converting columns to numeric types for online learning.
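The idea of reading a CSV as a sequence of dictionaries with numeric conversion can be sketched with Python's standard library alone. This is not River's own API; `iter_csv`, the `converters` mapping, and the sample columns are hypothetical names chosen for illustration.

```python
import csv
import io
from typing import Callable, Dict, Iterator, TextIO


def iter_csv(stream: TextIO,
             converters: Dict[str, Callable[[str], object]]) -> Iterator[dict]:
    """Yield one dict per CSV row, applying a converter to each listed column.

    Rows are produced lazily, so a model can be updated one observation
    at a time without holding the whole file in memory.
    """
    for row in csv.DictReader(stream):
        for column, convert in converters.items():
            row[column] = convert(row[column])  # e.g. "21.5" -> 21.5
        yield row


# Made-up sample data standing in for a file on disk.
sample = io.StringIO("temperature,humidity,label\n21.5,40,ok\n19.0,55,alert\n")
rows = list(iter_csv(sample, {"temperature": float, "humidity": int}))
print(rows[0])
```

Columns without a converter (here `label`) stay as strings, which keeps the conversion step explicit rather than guessing types.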
Anomalib is a PyTorch-based library for visual anomaly detection, offering a modular framework, a comprehensive model zoo, and a benchmarking suite designed for industrial defect detection. It provides a wide range of algorithms—including generative, discriminative, teacher-student, and vision-language approaches—that support unsupervised, few-shot, and zero-shot settings. The library enables deployment through model export to ONNX and OpenVINO for edge devices, and includes a no-code web application for training and inference. It also features a command-line interface for orchestrating multi
Reads images and video clips from disk, validates paths, and formats data for anomaly detection models.
NVIDIA DALI is a GPU-accelerated data loading and preprocessing library designed for deep learning workflows. It constructs high-performance data pipelines that offload decoding, augmentation, and normalization to the GPU, eliminating CPU bottlenecks in training and inference. The library reads data from multiple storage formats and streams it directly into GPU memory, with support for multi-GPU execution to scale throughput across large-scale workloads. DALI distinguishes itself by enabling data pipelines to be built once and executed across multiple deep learning frameworks without code cha
Reads data from LMDB, RecordIO, TFRecord, WebDataset, COCO, and NumPy formats to feed into processing pipelines.
Data on COVID-19 (coronavirus) cases, deaths, hospitalizations, tests • All countries • Updated daily by Our World in Data
Provides a regularly updated CSV distribution consolidating key COVID-19 metrics into a single downloadable file.
This is a grammar-of-graphics visualization library used to build charts by mapping tabular data to visual marks. It functions as an SVG data visualization tool and an API for exploratory data analysis, allowing users to render complex visualizations and geographic maps. The library features a GeoJSON map renderer that projects spherical coordinates into two-dimensional pixel space, and an Apache Arrow visualization interface for efficient data processing. Its capabilities cover data transformation through binning and categorization, visual encoding via automatic scale inference and color scheme application, and the generation of small multiples. It supports rendering geometric shapes in layered views and exporting static images in server-side environments.
Handles diverse data structures, including arrays of objects and Apache Arrow tables, to improve processing efficiency.
This project is an open-source search data index and a collection of historical search trend data provided as a public trends archive. It serves as an open dataset for analyzing global patterns and events through downloadable files. The repository provides a compiled index of anonymized and normalized search and media datasets. These resources are designed for academic and professional analysis, enabling the study of longitudinal trends across different regions and time frames. The data supports global search trend analysis, market pattern analysis, and public interest research. It offers open data access for studying consumer interests, societal shifts, and search behavior.
Provides regularly updated CSV files that merge search metrics into a single downloadable distribution for analysis.
ExcelDataReader is a C# library used to extract data and metadata from Microsoft Excel spreadsheets and CSV files. It functions as a workbook parser that converts spreadsheet content into structured datasets for programmatic access and iteration. The project includes a specialized metadata extractor for retrieving cell-level details such as number formats, styles, row heights, column widths, and merged cell ranges. It also provides a stream processor for parsing plain-text CSV files with customizable encoding and separator detection. The library supports the OpenXML standard for modern spreadsheet files and uses stream-based parsing and cursor-based row iteration to traverse workbooks. These capabilities enable converting multi-sheet workbooks into relational data tables.
Parses plain-text streams of comma-separated values with customizable encoding and separator detection.
docetl is an AI-powered document ETL tool and map-reduce orchestrator designed to transform large collections of unstructured documents into structured, queryable tables using language models. It provides a declarative pipeline framework for extracting, cleaning, and transforming data from sources such as PDFs and text files into predefined schemas. The project distinguishes itself through a semantic data integration suite that enables joining datasets and resolving duplicate entities based on embedding-based similarity. It includes an interactive prompt playground for developing and optimizi
Imports data from standard files or custom parsing tools for non-standard formats like audio and PDFs.
mcp-context-forge is a Model Context Protocol federation gateway that unifies diverse AI tool servers and APIs into a single consistent interface for discovery and execution. It acts as a centralized proxy that aggregates multiple servers and APIs, allowing AI agents to access and invoke a unified set of tools, prompts, and resources. The project distinguishes itself through a multi-protocol translation bridge that converts communication between standard I/O, SSE, gRPC, and REST to enable interoperability between disparate tool servers. It includes a comprehensive LLM evaluation framework for
Imports data from multiple formats including CSV, JSON, Parquet, Excel, and SQL into a managed cache.