What are the best open-source GitHub repositories for vector databases for RAG workflows?

semi-technologies/weaviate is the closest match: Weaviate is a cloud-native vector database explicitly optimized for retrieval-augmented generation, offering hybrid search (vector + keyword + metadata), a scalable distributed architecture, and direct LLM integration, which are the core capabilities this search targets. Other strong matches: activeloopai/deeplake, chroma-core/chroma, pgvector/pgvector, qdrant/qdrant.

Why does semi-technologies/weaviate match “vector database for RAG workflows”?

Weaviate is a cloud-native vector database explicitly optimized for retrieval-augmented generation, offering hybrid search (vector + keyword + metadata), scalable distributed architecture, and direct LLM integration — exactly the core capabilities this search targets.
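
To make "hybrid search (vector + keyword + metadata)" concrete, here is a minimal sketch in plain Python. It assumes reciprocal rank fusion (RRF) as the way to merge a vector ranking with a keyword ranking; the listing does not say which fusion method Weaviate uses, so treat RRF, the function names, and the sample document ids as illustrative assumptions rather than Weaviate's API.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Assumption: RRF is used only as an example fusion method; k=60 is a
    commonly used constant, not a value taken from the listing.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def filter_by_metadata(doc_ids, metadata, required):
    """Keep only documents whose metadata matches every required key/value."""
    return [
        d for d in doc_ids
        if all(metadata.get(d, {}).get(key) == val for key, val in required.items())
    ]


# Hypothetical results from a vector index and a keyword index.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
metadata = {
    "doc_a": {"lang": "en"},
    "doc_b": {"lang": "es"},
    "doc_c": {"lang": "en"},
    "doc_d": {"lang": "en"},
}

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
english_only = filter_by_metadata(fused, metadata, {"lang": "en"})
print(fused)
print(english_only)
```

Documents that rank well in both lists rise to the top of the fused ranking, and the metadata filter then narrows the result to documents matching the structured condition.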

Why does activeloopai/deeplake match “vector database for RAG workflows”?

DeepLake is a serverless vector database with hybrid search and native RAG support, designed for multimodal AI data and LLM pipelines, which exactly matches the need for a vector database to build retrieval-augmented generation applications.

Why does chroma-core/chroma match “vector database for RAG workflows”?

Chroma is a purpose-built vector database that provides embeddings storage, hybrid dense‑sparse search, and metadata filtering, making it a natural and widely used foundation for building Retrieval‑Augmented Generation pipelines with large language models.

Why does pgvector/pgvector match “vector database for RAG workflows”?

pgvector is a PostgreSQL extension that turns the database into a full vector similarity search engine with ANN indexing, hybrid search, and metadata filtering, making it a natural fit for RAG pipelines and directly compatible with LLM frameworks like LangChain.
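
As a rough illustration of what "vector similarity search with metadata filtering" means for a RAG retriever, the sketch below ranks rows by cosine similarity after applying a filter. It is an exact, brute-force stand-in written in plain Python, not ANN indexing and not pgvector's SQL interface; the row layout, the `top_k` helper, and the `source` field are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, rows, k=2, where=None):
    """Return ids of the k rows most similar to the query.

    `where` is an optional predicate applied before ranking, playing the
    role of a metadata filter. This scans every row (exact search); an ANN
    index would trade some accuracy for speed instead.
    """
    candidates = [r for r in rows if where is None or where(r)]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query, r["embedding"]),
        reverse=True,
    )
    return [r["id"] for r in ranked[:k]]


# Hypothetical stored rows: an id, a metadata field, and a 2-d embedding.
rows = [
    {"id": 1, "source": "docs", "embedding": [1.0, 0.0]},
    {"id": 2, "source": "blog", "embedding": [0.9, 0.1]},
    {"id": 3, "source": "docs", "embedding": [0.0, 1.0]},
    {"id": 4, "source": "docs", "embedding": [0.7, 0.7]},
]

result = top_k([1.0, 0.0], rows, k=2, where=lambda r: r["source"] == "docs")
print(result)
```

In a RAG pipeline, the returned ids would be used to fetch the matching text chunks that are then passed to the language model as context.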

Why does qdrant/qdrant match “vector database for RAG workflows”?

Qdrant is a purpose-built vector database for high-dimensional similarity search with distributed architecture, metadata filtering, and explicit support for AI/ML pipelines, directly matching your need for a scalable RAG backend — it stores and indexes embeddings and supports the ANN search, filter…

Vector database

Explore the best vector databases for RAG workflows. Compare features, star counts, and activity levels to find the ideal option for your project.

semi-technologies/weaviate
16,337 stars
Weaviate is a cloud-native vector database and distributed vector store designed to save high-dimensional vectors alongside structured data. It functions as a hybrid search engine that combines vector similarity, keyword matching, and structured metadata filtering within a single query. The system is optimized for retrieval-augmented generation, integrating vector search with generative AI and reranking to power question-and-answer workflows. It distinguishes itself through the ability to merge semantic search with traditional keyword queries and structured metadata filters to improve result
Weaviate is a cloud-native vector database explicitly optimized for retrieval-augmented generation, offering hybrid search (vector + keyword + metadata), scalable distributed architecture, and direct LLM integration, exactly the core capabilities this search targets.
Go · Hybrid Search Engines · Metadata Filtering · Vector Storage

activeloopai/deeplake
9,175 stars
DeepLake is AI data infrastructure consisting of a multimodal data lake, a hybrid search engine, and a serverless vector database. It provides a PostgreSQL-based AI data runtime that combines multimodal storage with streaming pipelines to load and shuffle datasets from cloud storage directly into deep learning training pipelines. The system utilizes lazy indexing to store and slice images, audio, and video without loading entire files into memory. It enables retrieval-augmented generation by persisting high-dimensional embeddings in a serverless vector store and implementing hybrid search tha
DeepLake is a serverless vector database with hybrid search and native RAG support, designed for multimodal AI data and LLM pipelines, which exactly matches the need for a vector database to build retrieval-augmented generation applications.
C++ · Hybrid Search Engines · Vector Storage · Full Text Search

chroma-core/chroma
26,198 stars
Chroma is a specialized vector database designed to index and retrieve high-dimensional data representations for semantic similarity search. It functions as a comprehensive platform for information retrieval, enabling the storage and management of unstructured documents alongside structured metadata. By mapping data into numerical representations, the system facilitates rapid similarity lookups across large datasets. The platform distinguishes itself through a hybrid search infrastructure that combines dense vector embeddings with sparse keyword and regular expression matching to balance sema
Chroma is a purpose-built vector database that provides embeddings storage, hybrid dense‑sparse search, and metadata filtering, making it a natural and widely used foundation for building Retrieval‑Augmented Generation pipelines with large language models.
Rust · Hybrid Search Engines · Hybrid Search Infrastructure · Metadata Filtering

pgvector/pgvector
21,787 stars
Vector similarity search extension for PostgreSQL.
pgvector is a PostgreSQL extension that turns the database into a full vector similarity search engine with ANN indexing, hybrid search, and metadata filtering, making it a natural fit for RAG pipelines and directly compatible with LLM frameworks like LangChain.
C · Approximate Nearest Neighbor Search · Vector Similarity Search

qdrant/qdrant
32,372 stars
Qdrant is a high-performance vector similarity database designed to store, index, and search high-dimensional vectors alongside structured metadata. It functions as a distributed search engine that manages large-scale data clusters, providing low-latency retrieval and complex filtering capabilities. The system is built to serve as a specialized middleware layer, connecting machine learning pipelines and AI agents to persistent storage for intelligent information retrieval and recommendation tasks. The platform distinguishes itself through advanced retrieval techniques, including support for h
Qdrant is a purpose-built vector database for high-dimensional similarity search with distributed architecture, metadata filtering, and explicit support for AI/ML pipelines, directly matching your need for a scalable RAG backend — it stores and indexes embeddings and supports the ANN search, filtering, and LLM integration that a Retrieval-Augmented Generation pipeline requires.
Rust · Hybrid Search Engines · Metadata Filtering · Vector Storage

weaviate/weaviate
15,620 stars
Weaviate is an AI-native vector database designed to store and index high-dimensional vector embeddings alongside traditional data objects. It serves as a backend infrastructure for retrieval-augmented generation, enabling applications to ground language model responses in private, context-aware data. The platform distinguishes itself by combining vector similarity search with traditional keyword filtering through a hybrid storage architecture. It integrates directly with external machine learning models to automate the generation of embeddings and perform complex inference tasks during inges
Weaviate is an AI-native vector database explicitly built for retrieval-augmented generation, combining vector similarity search with hybrid keyword filtering and direct model integrations, which makes it a comprehensive fit for building RAG pipelines.
Go · Hybrid Search Engines · Retrieval Augmented Generation

alibaba/zvec
5,198 stars
zvec is an embedded vector database engine and indexing library designed for high-dimensional similarity search. It functions as a hybrid search engine and a retrieval-augmented generation knowledge base, allowing for the storage and retrieval of dense and sparse vectors. The system is distinguished by its hybrid retrieval pipeline, which fuses vector similarity, full-text keyword matching, and scalar metadata filtering into single query operations. It supports a plugin-based model integration system for registering custom embedding models and rerankers, as well as language bindings for nativ
zvec is an embedded vector database engine that stores embeddings, performs hybrid search (vector + keyword + metadata), and supports LLM integration via plugins for custom embedding models and rerankers, making it a solid fit for building RAG pipelines, though its single-node architecture limits distributed scalability.
C++ · Approximate Nearest Neighbor Search · Hybrid Search Engines · Metadata Filtering

milvus-io/milvus
44,804 stars
Milvus is a specialized vector database engine designed for the indexing, management, and high-speed similarity retrieval of high-dimensional vector embeddings. It functions as a similarity search engine capable of identifying nearest neighbors within large-scale vector spaces, supporting the storage and retrieval of billions of data points while maintaining consistent performance. The system utilizes a distributed architecture that decouples storage, query, and coordination into independent services, allowing for horizontal scaling across clusters. It employs a global indexing mechanism that
Milvus is a purpose-built vector database with distributed architecture, ANN search, and hybrid search capabilities, and its active RAG and LLM ecosystem integrations make it a leading choice for building Retrieval-Augmented Generation pipelines.
Go · Hybrid Search Systems · Retrieval-Augmented Generation

lance-format/lance
6,699 stars
Lance is a columnar data format and storage layer designed for high-performance random access and the persistence of multimodal data. It functions as a vector database storage system, a multimodal data store, and a versioned dataset manager. The project distinguishes itself as a hybrid search engine that combines vector similarity search and full-text indexing on a single dataset. It provides unified storage for diverse data types including images, audio, and video, utilizing a system that lazy-loads large binary objects only when requested. The system manages dataset evolution through schem
Lance is a columnar storage format and vector database that stores embeddings, supports ANN search, hybrid vector+full-text search, and metadata filtering, making it a solid building block for RAG pipelines, though it lacks explicit LLM integration and distributed architecture.
Rust · Hybrid Search Engines · Vector Storage · Vector Search Indexes

vespa-engine/vespa
6,961 stars
Vespa is a distributed search engine, vector database, and machine learning ranking engine. It serves as an AI search platform designed to handle large-scale document indexing and complex query processing across a cluster of nodes, combining keyword retrieval with high-dimensional embedding storage for semantic similarity search. The platform distinguishes itself by integrating machine learning models directly into the search pipeline to perform real-time inference and ranking. It converts these models into ranking expressions to score and order results based on relevance, while providing a s
Vespa is a distributed vector database and AI search platform that natively supports hybrid (vector + keyword) search, ANN, metadata filtering, and ML model ranking, making it a comprehensive fit for building scalable RAG pipelines with LLMs.
Java · Vector Search Indexes

meilisearch/meilisearch
58,118 stars
Meilisearch is a Rust-based search engine providing typo-tolerant full-text and vector-based semantic search with real-time conversational capabilities.
Meilisearch is a high-performance search engine that now natively supports vector-based semantic search, hybrid (vector+keyword) retrieval, and metadata filtering — all delivered as a RESTful service — making it a solid, ready-to-use vector database for building RAG pipelines with LLMs.
Rust · Developer-Focused Search Tools · Document Indexing Engines · Finite State Transducers

arangodb/arangodb
14,091 stars
This project is a multi-model database system designed to store and manage information as documents, graphs, and key-value pairs within a single engine. It functions as a graph database and knowledge graph platform, providing the infrastructure to build, query, and visualize structured data models. By integrating vector search capabilities, the system serves as a vector database that supports retrieval-augmented generation for artificial intelligence applications. The platform distinguishes itself through a unified query language that allows users to perform document lookups, graph traversals
ArangoDB is a multi-model database that integrates built-in vector search with ANN, hybrid search, metadata filtering, and distributed scaling, and is explicitly cited for supporting retrieval-augmented generation, making it a strong fit as the vector database backend for RAG pipelines.
C++ · Graph Databases · Multi-Model Databases · AI Grounding Services

surrealdb/surrealdb
32,397 stars
SurrealDB is a multi-model database engine designed to store and query document, graph, relational, and vector data within a single ACID-compliant platform. It functions as an AI-native data store, integrating vector search, graph traversal, and machine learning model execution directly into its query layer. By providing a unified declarative query language, the platform eliminates the need for external middleware to synchronize data across different storage models. The platform distinguishes itself through its ability to manage agent memory and complex workflows natively. It allows developer
SurrealDB is an AI-native multi-model database that natively integrates vector search, embeddings storage, and ML model execution into its declarative query language, making it a comprehensive and scalable foundation for building Retrieval-Augmented Generation pipelines with LLMs.
Rust · Multi-Model Databases · Access Control Systems · ACID Transactional Cores

eto-ai/lance
6,671 stars
Lance is a versioned columnar data format and storage engine designed as a multimodal AI lakehouse. It serves as a vector database storage engine and a cloud object store dataset manager, organizing images, video, audio, and embeddings into a unified format optimized for machine learning workflows. The project distinguishes itself by combining a columnar layout for structured data with a specialized blob store for large multimodal tensors. It implements a hybrid search engine that integrates vector similarity search, full-text search, and SQL analytics on a single dataset, supported by a stor
Lance is a columnar data format and storage engine that functions as a vector database with built-in hybrid search (vector + full-text) and embedding management, making it a viable retrieval store for RAG pipelines, though its lakehouse focus on multimodal data storage means it lacks explicit RAG orchestration tools or a distributed-nodes architecture out of the box.
Rust · Hybrid Search Engines · Hybrid Search Methods · Vector Storage

lancedb/lancedb
9,031 stars
LanceDB is a vector database and columnar data store designed to function as a versioned dataset manager and vector search engine. It serves as a high-performance backend for indexing and retrieving high-dimensional embeddings, providing the foundation for machine learning data pipelines. The system distinguishes itself through a combination of cloud-native object storage and immutable version tracking, allowing for data time-travel and reproducible AI experiments. It integrates hybrid search capabilities, merging dense vector similarity with BM25 full-text search and SQL-like scalar filters
LanceDB is a vector database designed for high-dimensional embedding storage and ANN search, with hybrid search and metadata filtering that support RAG pipelines, though it does not advertise direct LLM integrations.
HTML · Approximate Nearest Neighbor Search · Hybrid Search Engines · Metadata Filtering

manticoresoftware/manticoresearch
11,819 stars
Manticoresearch is a high-performance search engine and database designed for indexing and retrieving large datasets. It functions as a full-text search engine, a vector search database, and a SQL-based search database, providing a distributed search cluster architecture. The system provides an alternative to the Elasticsearch stack, offering a compatible API for indexing and searching structured and unstructured data. It distinguishes itself by supporting multiple retrieval methods, including vector matching for similarity search, geospatial queries, and traditional full-text ranking. The p
Manticore Search is a high-performance search engine and vector database with ANN search, embeddings storage, hybrid search, and distributed architecture, making it suitable for RAG pipelines—though it does not include explicit LLM or RAG-specific integrations.
C++ · Full Text Search · Vector Similarity Search

typesense/typesense
25,254 stars
Typesense is a distributed search engine designed to provide sub-millisecond query latency across massive datasets. It functions as both a high-performance indexing and retrieval engine and a comprehensive search experience platform, offering built-in typo tolerance and tools for managing relevance through synonym configuration, result curation, and complex filtering. The platform distinguishes itself by utilizing in-memory indexing to maintain high-throughput data retrieval and integrating vector database capabilities to support semantic similarity searches. It ensures data consistency and h
Typesense is a distributed search engine with vector database capabilities that supports semantic similarity search, hybrid full-text/vector queries, metadata filtering, and horizontal scaling — all of which are useful for RAG pipelines, though it lacks explicit LLM integration features.
C++ · Distributed Search Engines · Search Engines · Search Experience Platforms

pathwaycom/llm-app
59,341 stars
This project is a data processing engine and AI application platform designed for building production-grade machine learning workflows. It provides a unified programming model that handles both historical batch data and live stream ingestion, enabling the development of real-time ETL pipelines and scalable data transformation workflows. The framework distinguishes itself through differential dataflow execution, which propagates only changes through a pipeline rather than recomputing entire datasets. It supports distributed state management across worker nodes and utilizes incremental stream p
Pathway's llm-app is an application platform that includes vector indexing and retrieval capabilities built specifically for RAG pipelines, covering embeddings storage, ANN search, and LLM integration. Its focus on real-time data processing makes it a different take from a standalone vector database, but it still fits the tool category you are looking for.
Jupyter Notebook · Data Processing Frameworks · Differential Dataflow Engines · Distributed State Management

marqo-ai/marqo
5,022 stars
Marqo is an ecommerce product discovery platform, multimodal vector database, and AI search merchandising tool. It provides infrastructure for implementing semantic search and recommendations, allowing shoppers to find products using natural language and images. The platform distinguishes itself through a hybrid ranking pipeline that combines neural semantic scores with business-defined boosting and pinning rules. It features a conversational commerce engine that uses large language models to process user intent and provides a search performance analytics suite for measuring conversion uplift
Marqo is a vector database with hybrid search and LLM-powered conversational queries, so it can serve RAG pipelines, though its primary focus is ecommerce product discovery rather than general-purpose use.
Python · Multimodal Indexers · Product Discovery Engines · Semantic Search Engines

memvid/memvid
15,679 stars
Memvid is an embedded memory framework designed to provide persistent, versioned context for intelligent agents. It functions as a local vector database library that stores all data within a single binary file, removing the need for external database infrastructure or network dependencies. The system distinguishes itself by integrating in-process vector indexing with append-only versioning, allowing for high-speed semantic similarity searches alongside the ability to track and roll back state changes over time. It includes built-in transparent data encryption and masking to secure sensitive i
Memvid is an embedded vector database library with in-process ANN search and versioned embeddings storage, purpose-built for agent memory and RAG pipelines; it covers core vector retrieval and LLM integration but is local-only and lacks hybrid search and distributed scalability.
Rust · Agent Memory Engines · Agent Memory Stores · Vector Databases
