12 repositories
Capabilities for saving extracted data across various storage types including flat files, relational, and document databases.
Distinct from Relational Data Storage: Covers a hybrid approach to persistence across multiple storage paradigms rather than a single database type.
Explore 12 GitHub repositories matching data & databases · Multi-Format Data Persistence.
This project is a comprehensive educational guide and framework for building web scrapers using Python. It provides a course-based approach to data extraction, combining a Python crawler framework with tutorials on web reverse engineering and network traffic analysis. The project distinguishes itself by covering advanced extraction challenges, including the decryption of obfuscated JavaScript and the bypass of anti-scraping measures. It specifically addresses mobile application scraping through the simulation of user interactions and the interception of network traffic. The capability surfac
Saves extracted information into flat files, relational databases, or document databases for long-term storage.
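As a rough illustration of the multi-format persistence described above, the sketch below writes the same records to a CSV file, a JSON file, and a SQLite table using only the Python standard library. The record fields, file names, and the `save_all` helper are hypothetical and not taken from the repository. A document database such as MongoDB would need a third-party driver and is left out here.

```python
import csv
import json
import sqlite3
from pathlib import Path

# Hypothetical scraped records; the field names are illustrative only.
RECORDS = [
    {"id": 1, "title": "First post", "likes": 10},
    {"id": 2, "title": "Second post", "likes": 25},
]


def save_csv(records, path):
    """Write records to a flat CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["id", "title", "likes"])
        writer.writeheader()
        writer.writerows(records)


def save_json(records, path):
    """Write records to a JSON file as a list of objects."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)


def save_sqlite(records, path):
    """Upsert records into a relational table in a SQLite database file."""
    conn = sqlite3.connect(path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS posts "
                "(id INTEGER PRIMARY KEY, title TEXT, likes INTEGER)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO posts (id, title, likes) "
                "VALUES (:id, :title, :likes)",
                records,
            )
    finally:
        conn.close()


def save_all(records, out_dir):
    """Persist the same records to all three storage types."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    save_csv(records, out / "posts.csv")
    save_json(records, out / "posts.json")
    save_sqlite(records, out / "posts.db")
    return out
```

Calling `save_all(RECORDS, "output")` would leave three files in the `output` directory, one per storage type, so downstream tools can pick whichever format suits them.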
weiboSpider is a Python web scraper and social media crawler designed to extract user profiles, posts, and engagement metrics from Sina Weibo. It functions as an automated data pipeline for academic research and trend analysis, collecting long-form text and multimedia content. The tool distinguishes itself through the use of browser session cookies to authenticate requests and access protected profiles. It implements randomized request pacing and global pauses to manage traffic and avoid platform rate limits, while supporting incremental crawling to capture only new content based on timestamp
Persists extracted information across various storage types, including flat files and relational or document databases.
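The description above mentions randomized request pacing and timestamp-based incremental crawling. A minimal sketch of those two ideas might look like the following. It is not weiboSpider's actual code: the `published_at` field, the function names, and the delay bounds are all assumptions made for illustration.

```python
import random
import time


def select_new_posts(posts, last_seen):
    """Return posts published strictly after last_seen, oldest first."""
    fresh = [p for p in posts if p["published_at"] > last_seen]
    return sorted(fresh, key=lambda p: p["published_at"])


def random_pause(min_s, max_s, sleep=time.sleep):
    """Sleep for a random duration between min_s and max_s seconds."""
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay


def crawl_incrementally(posts, last_seen, min_s=1.0, max_s=3.0, sleep=time.sleep):
    """Collect only unseen posts, pausing a random interval after each one.

    Returns the collected posts and the timestamp to store for the next run.
    """
    collected = []
    for post in select_new_posts(posts, last_seen):
        collected.append(post)
        random_pause(min_s, max_s, sleep=sleep)
    new_last_seen = collected[-1]["published_at"] if collected else last_seen
    return collected, new_last_seen
```

The `sleep` parameter is injectable so the pacing can be replaced with a stub in tests. In a real crawler the returned timestamp would be saved between runs so that the next crawl starts where this one ended.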
This project is an educational resource and a collection of instructional materials for performing data manipulation and statistical analysis using Python. It provides a comprehensive set of guides and code examples for using the Pandas, NumPy, and Matplotlib libraries to analyze structured data. The resource includes a dedicated guide for reshaping, cleaning, and aggregating tabular data and time series via Pandas, alongside a reference for high-performance vectorized operations and linear algebra using NumPy. It also features tutorials for creating publication-quality charts, distribution p
Enables saving and loading multidimensional numerical arrays to disk in raw binary formats with compression support.
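For readers unfamiliar with NumPy's on-disk formats, the following sketch shows one way to round-trip a multidimensional array through an uncompressed `.npy` file and a compressed `.npz` archive. The `roundtrip` helper and the file names are hypothetical. `np.save`, `np.savez_compressed`, and `np.load` are standard NumPy functions.

```python
import os

import numpy as np


def roundtrip(array, directory):
    """Save an array in two binary formats and load both copies back."""
    npy_path = os.path.join(directory, "array.npy")
    npz_path = os.path.join(directory, "arrays.npz")

    # Uncompressed binary format: one array per .npy file.
    np.save(npy_path, array)

    # Compressed archive: arrays are stored under keyword names.
    np.savez_compressed(npz_path, data=array)

    from_npy = np.load(npy_path)
    with np.load(npz_path) as archive:
        from_npz = archive["data"]  # read into memory before the file closes
    return from_npy, from_npz
```

Both copies keep the original shape and dtype. The compressed archive trades some read and write time for a smaller file, which tends to matter most for large, repetitive data.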
TiddlyWiki5 is a modular wiki engine and non-linear knowledge base that organizes information into small, linked chunks. It can function as a single-file personal wiki where all content and application logic are stored within one HTML file for local-first use, or as a self-hosted wiki server that serves content over HTTP. The project is distinguished by a data-driven architecture where plugins and extensions are treated as stored data entries. It features a filter-based query engine for manipulating structured data and a transclusion system that allows the live content of one entry to be embe
Supports persisting content across multiple formats, including JSON, HTML, and plain text files.
big-AGI is a self-hosted AI frontend and multi-model client that provides a unified workspace for interacting with various large language models. It functions as an orchestration dashboard, allowing users to connect to cloud-based AI providers, aggregator services, and locally hosted model servers. The project is distinguished by its ability to execute prompts across multiple models simultaneously for side-by-side comparison and response synthesis. It enables the merging of outputs from different models to reduce hallucinations and improve accuracy, while using persona-based configuration map
Supports persisting application data across multiple backends, including serverless Postgres and MongoDB Atlas.
libigl is a C++ geometry processing library used to analyze and manipulate triangular and tetrahedral 3D meshes. It functions as a numerical linear algebra suite and a mesh manipulation framework, integrating a geometric deformation engine that implements rigid and polyharmonic transformations. The project is distinguished by its header-only library design and its implementation of specialized deformation techniques, including rigid and polyharmonic deformation. It also provides a visualization tool for rendering surfaces and scalar fields with interactive scene controls and mesh selection. The library covers a wide range of capabilities, including geometry analysis for curvature and geodesic distances, mesh generation through iso-surface extraction and triangulation, and remeshing through anisotropic deformation. It also supports mesh Boolean operations, surface parameterization, and numerical optimization for solving Laplace equations and quadratic programs. The toolkit includes utilities for importing and exporting various 3D geometry formats and supports interoperability with Matlab for running scripts and sharing matrices.
Persists large numerical arrays to disk using binary or ASCII formats for high precision.
ArrayFire is a hardware-agnostic computing framework and JIT-compiled tensor engine designed for high-performance numerical computing. It serves as a GPU numerical computing library and a parallel signal processing toolkit that abstracts hardware backends, allowing the same codebase to run on diverse GPU architectures and CPUs. The project is distinguished by a JIT engine that uses expression compilation to fuse operations and minimize memory overhead. It employs a deferred execution graph to optimize computation chains and provides interoperability primitives for sharing data and execution contexts with external computing platforms such as CUDA and OpenCL. The library covers a wide range of capabilities, including parallel linear algebra, digital signal processing, and accelerated computer vision. It provides tools for implementing machine learning, simulating financial models, and solving partial differential equations for physical system simulations. Its tensor management system handles multidimensional array allocation, slicing, and data transfers between host and device.
Saves and loads multidimensional numerical tensors to and from files using keys or indices.
This project is a Sina Weibo web scraper and social media data pipeline designed to extract user profiles, posts, comments, and multimedia assets. It functions as a containerized data crawler that automates the collection and local storage of social media content and engagement metrics. The system includes a processing layer that uses large language models (LLMs) to analyze the extracted text, generating summaries and sentiment analysis. It is differentiated by a deployment-ready container model with an HTTP interface for managing extraction tasks and monitoring job progress. The crawler covers a wide range of capabilities, including social media monitoring through scheduled incremental updates, archiving of multimedia assets to local disks, and multi-format data export to flat files or databases. It also captures detailed social interactions, such as top-level comments and reposts.
Supports persisting extracted content across flat files, relational databases, and document databases.
Joblib is a suite of utilities for parallelizing computational workloads and optimizing the storage of large numerical datasets and function results. It functions as a parallel computing library and multiprocessing wrapper that distributes function execution across multiple CPU cores to accelerate independent tasks and computational loops. The project provides a disk caching framework that persists expensive function outputs to the filesystem, re-evaluating them only when input arguments change. It further specializes in the serialization of large numerical arrays, utilizing efficient compres
Provides memory-mapping for large numerical arrays to allow efficient disk-based random access without consuming full RAM.
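The memory-mapping behavior mentioned above can be sketched with `joblib.dump` and `joblib.load(..., mmap_mode="r")`. The `dump_and_memmap` helper and the file name are hypothetical. Note the assumption that the array is dumped without compression, since memory mapping needs the raw bytes on disk.

```python
import os

import joblib
import numpy as np


def dump_and_memmap(array, directory):
    """Persist an array with joblib and reopen it as a read-only memory map."""
    path = os.path.join(directory, "array.joblib")

    # Dump without compression so the array bytes can be mapped directly.
    joblib.dump(array, path)

    # mmap_mode="r" maps the file read-only instead of loading it into RAM;
    # pages are read from disk only when the corresponding elements are accessed.
    return joblib.load(path, mmap_mode="r")
```

Indexing the returned object, for example `mapped[123456]`, reads only the part of the file that holds that element, which is what makes random access on arrays larger than available memory practical.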
CrawlerTutorial is a comprehensive Python web scraping tutorial and framework designed for extracting data from static and dynamic websites. It functions as a web data extraction pipeline and an HTTP request orchestrator, covering the full lifecycle of scraping applications from initial fetching to final data storage. The project provides specialized guidance on anti-bot bypass techniques and web API reverse engineering. It includes methods for evading browser detection through identity masking and proxy rotation, as well as techniques for identifying hidden API endpoints by analyzing network
Saves extracted information across multiple storage types including JSON and CSV flat files.
xtensor is a C++ multidimensional array library for numerical computing that provides N-dimensional containers with an interface mirroring the NumPy API. It utilizes a lazy evaluation expression engine to defer numerical computations until assignment, which minimizes memory allocations and intermediate copies. The library features a foreign memory array adaptor that allows it to wrap external buffers, such as NumPy arrays, to perform numerical operations in-place without duplicating data. It further optimizes performance through lazy broadcasting and a system that manages the lifetime of temp
xtensor reads and writes multidimensional arrays using CSV, NPY, and JSON formats for persistence.
This project is a NestJS testing boilerplate and reference implementation. It provides a structured monorepo workspace designed to demonstrate various architectural and testing patterns for NestJS applications. The project features a dockerized test environment and an integration testing framework. It includes a dedicated GraphQL API test suite to validate graph-based endpoints and schemas for queries and mutations. The suite covers a layered testing hierarchy consisting of unit, integration, and end-to-end tests. These capabilities extend across the application and data layers, including da
Simulates interactions across multiple database technologies to verify data retrieval and storage logic.