24 open-source projects similar to frictionlessdata/tabulator-py, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Tabulator Py alternative.
python-magic is a C-binding wrapper that provides a Python interface for the libmagic system library. It functions as a file signature analyzer and MIME type detector, identifying file formats by comparing header bytes against a database of known binary signatures. The library enables the identification of file types from both file paths and raw data buffers. It supports custom file signature matching through the injection of user-provided magic databases, allowing for the detection of specialized or proprietary formats. The project covers binary data analysis and MIME type mapping to transl
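The header-byte comparison described in the entry above can be illustrated with a minimal standard-library sketch. This is not python-magic or libmagic code: the `SIGNATURES` table and both function names are hypothetical, and the table holds only a handful of well-known file signatures, where libmagic's database is far larger and supports much richer rules.

```python
from typing import Optional

# Hypothetical, deliberately tiny signature table (assumption for illustration).
# Each entry pairs the leading bytes of a format with a MIME type.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def guess_mime_from_buffer(data: bytes) -> Optional[str]:
    """Return the MIME type whose signature prefixes `data`, or None."""
    for magic_bytes, mime in SIGNATURES:
        if data.startswith(magic_bytes):
            return mime
    return None


def guess_mime_from_path(path: str) -> Optional[str]:
    """Read only the first few bytes of a file and match them."""
    with open(path, "rb") as fh:
        return guess_mime_from_buffer(fh.read(16))
```

With python-magic itself, the equivalent lookups are `magic.from_buffer(data, mime=True)` and `magic.from_file(path, mime=True)`; the sketch only shows the underlying idea of prefix matching against known signatures.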
A Python library to extract tabular data from PDFs
Textract is a multi-format text extraction tool and parser. It provides a unified interface to extract plain text from a variety of sources, including documents, images, and audio files. The system functions as a document content parser for PDFs and spreadsheets, an image text extractor using optical character recognition, and a speech-to-text transcriber for audio recordings.
Datasets is a library designed for the management, processing, and sharing of large-scale data collections for machine learning workflows. It functions as both a data processing framework and a versioning platform, providing tools to organize, filter, and transform massive datasets while ensuring reproducibility across research and development teams. The library distinguishes itself by enabling the handling of datasets that exceed available system memory. It utilizes memory-mapped file access, disk-based caching, and lazy iterative streaming to maintain performance when working with large-sca
Intake is a lightweight package for finding, investigating, loading and disseminating data.
Tablib is a Python library designed for importing, exporting, and manipulating tabular datasets. It functions as a multi-format data converter and manager, allowing users to move information between different file standards. The library supports data transformation across CSV, JSON, YAML, and Excel formats. It provides a programmatic interface to manage these datasets by adding rows, filtering columns, and segregating records. The system uses a common internal representation and adapter-based mapping to normalize diverse input sources. This allows for consistent reading and writing routines
Faker is a Python library designed to generate realistic synthetic data for software testing, database prototyping, and privacy-preserving anonymization. It provides a comprehensive suite of tools to create diverse information types, including personal identities, financial records, geographic locations, and technical system metadata, allowing developers to populate environments with mock data that mimics real-world structures. The library is built on a modular provider architecture that supports dynamic method dispatch, enabling users to extend functionality by registering custom data genera
xmltodict is a Python library that provides bidirectional serialization between XML documents and dictionaries. It functions as a parser that converts marked-up input into key-value pairs and a serialization utility that transforms dictionaries back into structured XML documents. The project includes an incremental stream processor that uses depth-based callbacks to handle large XML files while maintaining constant memory usage. It features a namespace manager for mapping prefixes and declarations, as well as a security sanitizer that blocks external entity expansion and validates element nam
Extract data from a wide range of Internet sources into a pandas DataFrame.
img2dataset is a high-performance image dataset pipeline and preprocessing tool designed to download and process millions of images from URLs for machine learning training. It functions as a distributed image downloader and cloud storage data exporter, moving large visual datasets from web sources directly into structured formats. The system prioritizes high-throughput data acquisition by distributing workloads across multiple CPU cores and machines. It integrates directly with remote cloud storage buckets and employs a manifest-based tracking system to resume interrupted downloads without re
Convert CSV files into a SQLite database. Browse and publish that SQLite database with Datasette.
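As a rough illustration of the CSV-to-SQLite conversion this entry describes, the sketch below uses only Python's standard `csv` and `sqlite3` modules. It is not the listed project's implementation: the function name `load_csv_into_sqlite` is hypothetical, every column is stored as `TEXT` (no type inference), and header names are assumed not to contain double quotes.

```python
import csv
import io
import sqlite3


def load_csv_into_sqlite(conn: sqlite3.Connection, table: str, csv_text: str) -> int:
    """Create `table` from the CSV header row and insert the remaining rows.

    Assumption: all columns are stored as TEXT and identifiers contain no
    double quotes. Returns the number of inserted rows.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)

    # Build the table from the header row.
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')

    # Insert data rows with parameter placeholders so values are escaped safely.
    placeholders = ", ".join("?" for _ in header)
    rows = list(reader)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_csv_into_sqlite(conn, "people", "name,age\nAda,36\nLin,41\n")
    print(conn.execute('SELECT * FROM "people"').fetchall())
```

Connecting to a file path instead of `":memory:"` would produce a database file that a browser such as Datasette can open.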
Singer is an open source standard for moving data between databases, web APIs, files, queues, and just about anything else you can think of. The Singer spec describes how data extraction scripts — called “Taps” — and data loading scripts — called “Targets” — should communicate using a standard…
Snorkel is a weak supervision system that enables users to programmatically generate training labels for machine learning models without manual annotation. At its core, it provides a framework for writing labeling functions as Python callables that each vote on data points, and then trains a probabilistic graphical model over these multiple weak supervision sources to estimate latent true labels without any ground truth data. The system automatically learns accuracy and correlation parameters between labeling functions by analyzing observed agreement patterns on unlabeled data, converting lab
Data search & enrichment library for Machine Learning → Easily find and add relevant features to your ML & AI pipeline from hundreds of public and premium external data sources, including open & commercial LLMs
csvkit is a composable Unix-style command-line toolkit for converting, filtering, and analyzing CSV files directly from the terminal. It provides a suite of focused single-purpose commands that can be combined via pipes to build complex data processing workflows, with a modular architecture that includes a column-type inference engine for automatically detecting data types and a streaming-pipeline design for efficient handling of tabular data. The toolkit distinguishes itself through its SQL-engine abstraction layer, which allows users to run SQL queries directly against CSV files without req
gdown is a command-line tool that downloads public files and folders from Google Drive without requiring authentication. It bypasses the mandatory virus-scan warning page to retrieve large files that conventional download tools block, and can resume interrupted transfers using HTTP range requests. Beyond simple file downloads, gdown can recursively download entire folder hierarchies while preserving the local directory structure. It lists the contents of a public folder as structured JSON without downloading the files themselves, and resolves a file's real name and extension without retrievin
xlwings - Make Excel fly with Python!