30 open-source projects similar to treeverse/lakefs, ranked by how many features they have in common. Compare stars, activity, and what each one does to find the best lakeFS alternative.
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
Prefect is a workflow orchestration platform designed to define, schedule, and monitor complex data pipelines as Python code. It functions as a container-native engine that wraps individual tasks in isolated environments, ensuring consistent dependencies and resource allocation across diverse infrastructure. By utilizing a state-machine-based orchestration model, the system tracks execution progress through discrete transitions and persistent event logs to maintain reliable and observable task processing. The platform distinguishes itself through a decoupled worker-API architecture, which sep
The AWS Cloud Development Kit is an infrastructure-as-code framework that enables developers to define and provision cloud resources using familiar programming languages. By utilizing construct-based synthesis, it translates high-level, object-oriented code into declarative templates, allowing for the automated management of complex cloud environments through a centralized, code-driven control plane. The framework distinguishes itself through its ability to model infrastructure as a dependency-aware resource graph, ensuring that components are provisioned and updated in the correct order. It
Delta is a lakehouse table format that brings ACID transactions and data warehouse consistency to large-scale data lakes on cloud object storage. It serves as an ACID transaction manager, coordinating atomic commits and serializable isolation for concurrent reads and writes across distributed compute engines. The project provides a multi-engine interoperability layer that uses format translation to allow diverse SQL engines and processing frameworks to read and write the same tables. It functions as a data versioning system, utilizing a transaction log to enable time travel, historical snapsh
Magit is a complete Git interface that runs inside Emacs, providing a full-featured porcelain for version control operations without leaving the editor. It renders repository state as structured, collapsible sections within Emacs buffers, and manages Git command execution through a transactional process model with automatic buffer refresh and error handling. The interface exposes all configuration through Emacs' standard customization system and uses a transient command framework for context-sensitive menu-driven Git operations. What distinguishes Magit is its granular control over every stag
Noms is a distributed version control database and content-addressable data store. It identifies data by cryptographic hashes to ensure integrity and deduplication, while tracking dataset state changes through a sequence of immutable commits to enable branching, forking, and historical recovery. The system functions as a peer-to-peer data synchronizer, reconciling state between disconnected database instances to ensure all nodes converge on the same data. It distinguishes itself as a schema-flexible document store that supports self-describing types, allowing schemas to evolve and widen as ne
my-git is a comprehensive framework and reference guide for Git version control administration, repository governance, and software release management. It provides a structured approach to managing the software development lifecycle, from initial feature branching to final production deployment. The project distinguishes itself through a specialized AI-assisted development framework. This includes workflows for managing AI-generated code via automated diff reviews, intent-based commit splitting, and governance models for multi-agent coordination and session isolation using worktrees. The cod
ClearML is a comprehensive MLOps platform designed to manage the end-to-end machine learning lifecycle, from initial experimentation to production deployment. It provides a suite of integrated tools including a pipeline orchestrator for automating workflows, an experiment tracking tool for logging hyperparameters and metrics, and a metadata-driven data versioning system for managing large-scale datasets and model artifacts. The platform is distinguished by its advanced compute management and serving capabilities. It features a GPU compute manager that supports fractional resource slicing and
DVC is a data versioning tool and pipeline orchestrator designed to track large datasets and machine learning models using external storage and metadata pointers. It integrates with Git by utilizing placeholders to keep heavy artifacts out of the repository while maintaining a versioned link between code and data. The system manages remote data caches through a synchronization layer that connects local environments to cloud storage or network filesystems. It also functions as an experiment tracker, recording hyperparameters and metrics to compare the performance of different model iterations.
InsForge is a backend-as-a-service platform that provides an integrated suite of tools for managing relational databases, identity provision, object storage, and serverless compute. It functions as an open-source identity provider and a PostgreSQL database manager featuring integrated vector storage and row-level security. The platform serves as an LLM orchestration gateway, offering a unified endpoint to route requests across various AI providers through an OpenAI-compatible interface. It enables AI-driven application generation and connects AI agents to backend resources using a standardize
Wandb is a centralized platform for machine learning experiment tracking, model registry management, and workflow orchestration. It provides a comprehensive suite of tools for logging, visualizing, and versioning training metrics, model artifacts, and hyperparameter sweeps to ensure reproducibility across development cycles. The platform also functions as an observability tool for large language model applications, enabling the tracing of execution steps, token usage, and reasoning processes. The project distinguishes itself through its event-driven automation capabilities, which allow users
Wekan is an open-source, self-hosted Kanban project management tool used for organizing workflows through boards, lists, and cards. It is a real-time web application that allows teams to manage tasks on private infrastructure. The platform distinguishes itself with extensive data migration tools, specifically for importing boards and cards from Trello. It supports enterprise-grade identity integration via LDAP, OpenID Connect, and OAuth2, and offers flexible storage options including PostgreSQL as a primary relational backend and pluggable cloud storage for attachments. The system covers a w
Dagster is a data orchestration platform designed to manage the entire lifecycle of data assets through declarative modeling and version-controlled code. It functions as a workflow engine that treats data assets as first-class primitives, allowing teams to define, schedule, and monitor complex pipelines while maintaining clear visibility into lineage, dependencies, and data quality. The platform distinguishes itself by using a code-as-configuration framework that enables standard software engineering practices, such as unit testing and local mocking, to be applied directly to data workflows.
DataHub is a metadata management platform designed to unify technical, operational, and business context across diverse data ecosystems. By utilizing a graph-based metadata model and an event-driven ingestion architecture, it creates a centralized source of truth that maps complex data relationships, lineage, and ownership. This foundational framework enables organizations to maintain a synchronized view of their data landscape, supporting both human-led discovery and automated data operations. The platform distinguishes itself through its focus on grounding artificial intelligence and autono
This project is a collection of reference implementations, sample code, and starter kits for integrating Firebase backend services into web applications using the JavaScript SDK. It serves as a practical guide for bootstrapping projects with cloud-hosted authentication, databases, and serverless logic. The repository provides specific examples for implementing real-time data synchronization, user identity management, and event-driven cloud functions. It also includes reference code for using local service emulators to test cloud functionality on a local machine before production deployment.
This project is a version control style guide providing standardized rules for commit messages, branch naming, and history management. It serves as a comprehensive framework for maintaining a consistent and readable project history through a set of defined guidelines and workflow documentation. The guide emphasizes a linear history branching model, utilizing rebasing and squashing techniques to maintain a straight timeline of commits. It specifies a structured commit layout using imperative language and a kebab-case naming convention for branches to ensure organizational clarity across teams.
bup is a deduplicating backup manager and incremental backup system. It uses a Git packfile-based storage format to eliminate redundant data across files and versions, treating every incremental save as a full backup. The system provides secure remote transport interfaces for transferring and managing backup data on remote servers via SSH. It also includes a backup repository browser available as both a web interface and a filesystem mount for exploring and retrieving files from snapshots. The project covers broad capability areas including disaster recovery, repository administration, and s
zimfw is a Zsh configuration framework and plugin manager designed to customize and optimize the Zsh shell environment. It functions as a system for installing, updating, and pinning shell extensions and themes from remote or local repositories. The framework focuses on shell performance by using byte-code compilation of scripts to reduce startup time and improve execution speed. It employs a declarative configuration model for module management, allowing for version-pinned dependency resolution and the ability to fetch modules without full git clones to accelerate installation. The project
SmolLM is a project dedicated to the development of small language models. It focuses on training and fine-tuning compact models that maintain high performance while utilizing fewer parameters. The project emphasizes efficient AI inference and on-device text generation, aiming to enable the deployment of lightweight models on edge devices with limited memory and processing power. It utilizes synthetic data generation to produce artificial datasets that improve the reasoning and training of these AI systems. The system supports a variety of optimization and training capabilities, including we
minio-go is a client library and software development kit for interacting with S3-compatible object storage. It provides a programmatic interface for Go applications to manage buckets and objects using the S3 protocol. The library enables the execution of complex storage operations, including multi-part uploads for large datasets, data synchronization between filesystems, and the management of bucket lifecycle and replication policies. It also supports advanced data retrieval through object searching and SQL-based querying of stored data. The toolkit covers a broad range of administrative an
Rustfs is a distributed object storage system designed for high availability and horizontal scalability. It functions as a cluster-based platform that manages data across multiple nodes, providing a self-hosted infrastructure for large-scale storage requirements. The system is built to be container-native, utilizing an operator to automate deployment and management within orchestrated environments. It provides compatibility with standard object storage protocols, allowing existing applications and tools to interact with the storage layer through a translation interface. To ensure long-term re
Online cloud drive, network disk, OneDrive, cloud storage, private cloud, object storage, h5ai, upload, download
This project is a Unix backup orchestrator used for modeling and executing full-stack data protection. It functions as a management system for database dumps, encrypted archiving, version rotation, and remote storage transport. The system distinguishes itself by orchestrating native system tools for various databases, including PostgreSQL, MySQL, MongoDB, Redis, and Riak. It employs a secure archive workflow that combines compression and encryption using GPG, OpenSSL, or AES before transporting packages to S3-compatible services, Dropbox, or remote servers via SFTP and RSync. Broad capabilit
CloudPaste is a secure file sharing platform and multi-backend storage aggregator. It unifies local and S3-compatible cloud storage providers into a single managed file system, serving as a gateway for centralized file access and distribution. The platform distinguishes itself through a built-in browser-based Markdown editor for composing documents with formulas and diagrams. It provides secure content sharing using password protection, expiration dates, and path-restricted API keys to control programmatic access and visibility. The system covers broad capabilities in file management, includ
JuiceFS is a distributed file system designed to mount object storage as a local, POSIX-compliant drive. It functions as a cloud-native persistent storage layer that decouples file metadata from raw data, storing metadata in a transactional database while keeping data blocks in object storage. This architecture enables multiple hosts across different regions to access the same storage simultaneously while maintaining strong consistency. The system distinguishes itself by performing data processing, including compression and encryption, directly on the client side before transmission. By split
Great Expectations is a data quality testing framework and observability platform designed to monitor the reliability of data pipelines. It provides a structured environment for defining, documenting, and automating data quality assertions, allowing teams to validate datasets against expected structure and content before they move through downstream processes. The project distinguishes itself through a declarative domain-specific language that stores quality rules as version-controlled configuration files. It utilizes an execution engine abstraction to translate these high-level assertions in
Ydata-profiling is an automated exploratory data analysis framework designed to generate comprehensive statistical reports and visual summaries from dataframes. It functions as a diagnostic tool for assessing data quality, identifying missing values, duplicates, and outliers, while providing a scalable engine for profiling massive datasets across distributed enterprise environments. The project distinguishes itself through its ability to handle large-scale data through distributed task orchestration and lazy stream processing, which minimizes memory overhead during complex computations. It in
Datasets is a library designed for the management, processing, and sharing of large-scale data collections for machine learning workflows. It functions as both a data processing framework and a versioning platform, providing tools to organize, filter, and transform massive datasets while ensuring reproducibility across research and development teams. The library distinguishes itself by enabling the handling of datasets that exceed available system memory. It utilizes memory-mapped file access, disk-based caching, and lazy iterative streaming to maintain performance when working with large-sca
Cap is a self-hosted screen recording and video collaboration platform designed for teams to replace synchronous meetings with asynchronous video updates. It provides a comprehensive suite for capturing high-resolution desktop activity, including system audio, microphone input, and camera overlays, which are then processed through an integrated post-production workflow. The platform distinguishes itself by offering full data sovereignty through containerized deployment and object storage abstractions, allowing users to host their media assets on private infrastructure or S3-compatible buckets
This project is a community-maintained, open-source job aggregator that provides a curated database of internship opportunities. It centralizes scattered professional listings into a structured, searchable collection categorized by industry, role, and location to assist students in their career search. The repository distinguishes itself by utilizing a version-controlled data store, where all job listings are maintained as plain text files. This approach enables transparent history tracking and granular change analysis through standard diffing tools. The project relies on an automated data ex