ChatGPT Retrieval Plugin

This project is a retrieval-augmented generation pipeline designed for building custom ChatGPT plugins that allow language models to query private or professional documents. It implements a full retrieval workflow, from processing and indexing document chunks to retrieving relevant context for natural language queries.

The system distinguishes itself through a hybrid retrieval approach that combines dense vector embeddings with sparse keyword matching, further refined by a two-stage semantic re-ranking process. It includes specialized data privacy tools for screening personally identifiable information and secures private data stores using OAuth-based user authentication.

The capability surface covers multi-format file indexing for PDF, DOCX, and PPTX files, alongside document ingestion from JSON and ZIP archives. It supports multiple vector storage backends, including PostgreSQL with pgvector, Redis, and cloud-native services. The architecture is designed for containerized deployment via Docker and includes tools for metadata extraction and real-time data synchronization through webhooks.

The project provides a local development server with pre-configured routing and security to verify plugin functionality before deployment.

Features

ChatGPT Plugin Development - Provides a complete pipeline for building retrieval-based ChatGPT plugins that query private or professional documents.

Semantic Search - Implements a search engine that uses vector embeddings to retrieve content based on conceptual meaning.

Document Indexing - Processes documents into smaller chunks and stores them in a database to support retrieval-augmented generation.

Document Chunking Strategies - Segments large files into smaller text blocks to optimize retrieval accuracy and fit LLM context windows.
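
The chunking idea can be illustrated with a minimal sketch. It assumes whitespace-separated words as the unit of length and a fixed overlap between neighbouring chunks; the project's own chunker may count model tokens instead, and the function name and parameters here are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks that overlap by `overlap` words.

    Assumption: words (not model tokens) are the unit of length.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

A small overlap keeps sentences that straddle a chunk boundary retrievable from either side, at the cost of storing some words twice.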

RAG Pipelines - Implements a full retrieval-augmented generation workflow, from document chunking and indexing to context retrieval for queries.

Personal Data Retrieval - Searches through user files and emails to find answers and retrieve relevant data using natural language.

Retrieval Re-ranking - Applies a secondary scoring pass to initial search results to increase the quality of retrieved snippets.
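
A secondary scoring pass can be sketched as follows. The scoring function is pluggable; the term-overlap scorer below is only a stand-in chosen to keep the example self-contained, not the model the project uses, and both function names are hypothetical.

```python
from typing import Callable


def term_overlap(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query terms that appear in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float] = term_overlap,
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Re-score first-stage candidates and keep the best `top_n`."""
    rescored = [(text, score_fn(query, text)) for text in candidates]
    # sorted() is stable, so ties keep their first-stage order.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)[:top_n]
```

In practice the first stage would return a generous candidate set cheaply, and `score_fn` would be a slower but more accurate relevance model applied only to those candidates.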

Semantic Search - Retrieves relevant document snippets from a vector database using natural language queries to provide model context.

Vector Upsert Operations - Uploads text and files as embedded chunks into a vector database to maintain a current knowledge base.

Vector Database Integrations - Connects applications to external vector stores and document databases to enable similarity search and contextual data retrieval.

Vector Document Indexing - Implements automated workflows for updating and inserting document embeddings into vector databases to maintain an up-to-date knowledge base.

Hybrid Search - Combines dense vector embeddings with sparse keyword matching to increase the precision of document search results.

Hybrid Vector-Keyword Indexing - Combines dense vector embeddings with sparse keyword matching to increase the precision of search results.
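
One common way to combine dense and sparse results is a weighted sum of normalized scores. The sketch below assumes each retriever returns a mapping from document ID to raw score; the min-max normalization, the `alpha` weight, and the function name are illustrative assumptions rather than the project's documented method.

```python
def hybrid_scores(
    dense: dict[str, float],
    sparse: dict[str, float],
    alpha: float = 0.5,
) -> list[tuple[str, float]]:
    """Blend dense and sparse scores: alpha * dense + (1 - alpha) * sparse."""

    def normalize(scores: dict[str, float]) -> dict[str, float]:
        # Min-max scale to [0, 1] so the two score ranges are comparable.
        if not scores:
            return {}
        low, high = min(scores.values()), max(scores.values())
        if high == low:
            return {doc_id: 1.0 for doc_id in scores}
        return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

    dense_n, sparse_n = normalize(dense), normalize(sparse)
    combined = {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    # Highest score first; ties are broken by document ID for determinism.
    return sorted(combined.items(), key=lambda pair: (-pair[1], pair[0]))
```

Setting `alpha` to 1.0 reduces this to pure dense ranking and 0.0 to pure keyword ranking, which makes the weight easy to tune against a labelled query set.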

Multi-Format Document Ingestion - Populates vector databases by ingesting and normalizing large collections of JSON, JSONL, or ZIP files.
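
As an illustration of the normalization step for JSONL input, the sketch below turns each line into a uniform record. The field names `id`, `text`, and `metadata` and the function name are assumptions made for the example; the project's actual schema may differ.

```python
import json
from typing import Iterable


def load_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL lines into normalized document records.

    Assumption: each record carries its content under a "text" key.
    Blank lines and records without text are skipped.
    """
    documents: list[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        text = record.get("text")
        if not text:
            continue
        documents.append(
            {
                "id": record.get("id"),
                "text": text,
                "metadata": record.get("metadata", {}),
            }
        )
    return documents
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so large dumps are read line by line rather than loaded whole.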

Vector Similarity Search - Queries stored document embeddings to find relevant information based on the semantic proximity of a request.
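
Semantic proximity is typically measured with cosine similarity between the query embedding and each stored embedding. The sketch below is a brute-force version over an in-memory dictionary, with hypothetical function names; a real vector database would replace the linear scan with an approximate nearest-neighbor index.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float], index: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Return the k stored vectors most similar to the query."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```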

Multi-Format Document Parsing - Extracts searchable text from PDF, DOCX, and PPTX files to make content accessible to language models.

Data Privacy Tools - Protects user data by screening documents for PII and secures access to private stores using OAuth authentication.

Data Indexing - Creates searchable in-memory structures including trees and keyword tables to facilitate efficient semantic retrieval for language models.

Archive-Based Indexing - Indexes documents stored within compressed ZIP archives into a vector database with PII screening.
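
The first step of archive-based indexing, reading documents out of a ZIP file, can be sketched with the standard library alone. The function name and the restriction to plain-text extensions are assumptions for the example; PII screening and embedding would follow as separate steps and are not shown.

```python
import io
import zipfile


def read_text_files(
    archive_bytes: bytes, extensions: tuple[str, ...] = (".txt", ".md")
) -> dict[str, str]:
    """Return {member path: decoded text} for matching files in a ZIP archive."""
    documents: dict[str, str] = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            if not info.filename.lower().endswith(extensions):
                continue
            # Replace undecodable bytes rather than failing the whole import.
            documents[info.filename] = archive.read(info).decode("utf-8", errors="replace")
    return documents
```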

Metadata Extraction from Unstructured Text - Uses a language model to parse key information such as authors and dates from unstructured text and return it as structured JSON.

Vector Databases - Deploys containerized search engines designed to index and retrieve high-dimensional embeddings for natural language queries.

PostgreSQL Vector Stores - Utilizes PostgreSQL with the pgvector extension to persist and manage document embeddings for retrieval.

Document Ingestion Pipelines - Processes JSON document dumps to store content and associated metadata in a vector database.

Metadata Filtering - Limits returned document chunks by refining search queries using metadata criteria such as source, date, or author.
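
A metadata filter of this kind can be sketched as a post-processing step over retrieved chunks. The chunk layout, the `created_at` ISO date field, and the parameter names are assumptions for the example; many vector stores apply such filters inside the query itself rather than afterwards.

```python
from datetime import date
from typing import Optional


def filter_chunks(
    chunks: list[dict],
    source: Optional[str] = None,
    author: Optional[str] = None,
    start_date: Optional[date] = None,
    end_date: Optional[date] = None,
) -> list[dict]:
    """Keep only chunks whose metadata satisfies every supplied criterion."""
    kept: list[dict] = []
    for chunk in chunks:
        meta = chunk.get("metadata", {})
        if source is not None and meta.get("source") != source:
            continue
        if author is not None and meta.get("author") != author:
            continue
        if start_date is not None or end_date is not None:
            created = meta.get("created_at")
            if created is None:
                continue  # undated chunks cannot satisfy a date range
            created_day = date.fromisoformat(created)
            if start_date is not None and created_day < start_date:
                continue
            if end_date is not None and created_day > end_date:
                continue
        kept.append(chunk)
    return kept
```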

OpenAPI-to-Tool Converters - Defines a machine-readable schema allowing external language models to discover and call retrieval endpoints.

Vector Search Indexes - Implements indexes on embedding columns to accelerate nearest neighbor lookups and improve query performance.

Search Result Filtering - Refines vector search results using structured attributes like date or source to limit the returned data set.

Vector Storage - Indexes high-dimensional vectors in cloud-native databases to support fast similarity searches across large datasets.

Cloud Deployment - Provides Docker-based packaging and environment variable configuration for hosting the application on cloud infrastructure.

Containerized Application Deployments - Packages the retrieval service and its dependencies into portable container images for consistent cloud deployment.

Containerized Deployments - Packages the retrieval service and vector database into Docker images for consistent cloud hosting.

Plugin Manifests - Hosts plugins on remote platforms and connects them to chat interfaces using OpenAPI schemas.

OAuth Authentication - Secures private document stores by verifying user identities and managing access scopes via OAuth 2.0.

PII Detection and Screening - Scans text and documents for personally identifiable information using a language model to prevent sensitive data storage.

Document Q&A - Plugin for semantic search and retrieval over personal documents.

Infrastructure and Utilities - Enables semantic search over personal or organizational data.

Natural Language Processing - Listed in the “Natural Language Processing” section of the FunNLP awesome list.

openai/chatgpt-retrieval-plugin
