The visitor wants a middleware or proxy tool designed to cache Large Language Model API responses to reduce latency and operational costs.

wei-shaw/claude-relay-service is the closest match: it functions as an LLM gateway with multi-provider routing, API cost tracking, and rate limiting, though its description does not explicitly mention semantic caching. Other strong matches: zilliztech/gptcache, alibaba/higress, berriai/litellm, blockrunai/clawrouter.

Why does wei-shaw/claude-relay-service match “a caching layer for LLM API calls”?

This tool functions as an LLM gateway that provides essential features like multi-provider routing, API cost tracking, and rate limiting, though it lacks explicit mention of semantic caching capabilities.

Why does zilliztech/gptcache match “a caching layer for LLM API calls”?

GPTCache is a dedicated semantic caching layer that supports self-hosting, multi-provider integration, and vector-based similarity matching to effectively reduce latency and costs for LLM applications.
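
To make "vector-based similarity matching" concrete, here is a minimal sketch of a semantic cache lookup. It is not GPTCache's API: the names `embed`, `cosine`, and `SemanticCache` are hypothetical, the bag-of-words "embedding" is a toy stand-in for a real embedding model, and the 0.8 threshold is an arbitrary assumption.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Returns a stored response when a new prompt is similar enough to an old one."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # hypothetical cut-off for a cache hit
        self.entries = []           # list of (embedding, response) pairs

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def get(self, prompt: str):
        query = embed(prompt)
        best_score, best_response = 0.0, None
        # Linear scan; a real system would query a vector index instead.
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

A prompt that differs slightly from a stored one (for example, one extra word) can still score above the threshold and be served from the cache, while an unrelated prompt returns `None` and would be forwarded to the model.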

Why does alibaba/higress match “a caching layer for LLM API calls”?

Higress is a cloud-native AI gateway that natively supports semantic caching, multi-provider proxying, cost tracking, and request logging, making it a comprehensive solution for managing and optimizing LLM API traffic.

Why does berriai/litellm match “a caching layer for LLM API calls”?

LiteLLM is a comprehensive, self-hostable LLM gateway that provides multi-provider support, API cost tracking, request logging, and built-in response caching, making it a complete solution for this use case.
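
As an illustration of what exact-match response caching in a gateway can look like, the sketch below keys each request by a hash of its model, messages, and parameters and reuses the stored reply until a time-to-live expires. This is not LiteLLM's implementation or API: `cache_key`, `ResponseCache`, and the `call` argument are hypothetical names, and the 300-second TTL is an assumed default.

```python
import hashlib
import json
import time


def cache_key(model: str, messages: list, **params) -> str:
    """Deterministic key: identical requests always hash to the same value."""
    payload = json.dumps(
        {"model": model, "messages": messages, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """Exact-match cache with a time-to-live, wrapped around an upstream call."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested
        self.store = {}     # key -> (expires_at, response)

    def get_or_call(self, call, model: str, messages: list, **params):
        key = cache_key(model, messages, **params)
        now = self.clock()
        hit = self.store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # cache hit: the upstream provider is not contacted
        # Cache miss or expired entry: forward the request and store the reply.
        response = call(model=model, messages=messages, **params)
        self.store[key] = (now + self.ttl, response)
        return response
```

Unlike a semantic cache, this only hits when the request is byte-for-byte equivalent after serialization, so it never returns a reply for a different question but also saves nothing on paraphrases.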

Why does blockrunai/clawrouter match “a caching layer for LLM API calls”?

ClawRouter is a self-hostable AI proxy that provides response caching, multi-provider support, and API cost management, making it a functional tool for optimizing LLM traffic despite its additional focus on blockchain-based micropayments.

LLM Response Caching Proxies

Open-source middleware and proxy tools that store language model outputs to reduce API latency and costs.

Wei-Shaw/claude-relay-service
12,114 stars
This project is a secure intermediary proxy gateway for large language model APIs. It functions as a relay service that forwards requests to AI providers while managing service accounts and routing traffic. The service provides a compatibility layer that supports multiple endpoint formats, allowing different third-party AI clients to communicate with a single provider. It distinguishes itself through a service account management system that assigns individual proxy settings to multiple accounts to prevent IP bans and distributes traffic via load balancing to avoid rate limits. The system inc
This tool functions as an LLM gateway that provides essential features like multi-provider routing, API cost tracking, and rate limiting, though it lacks explicit mention of semantic caching capabilities.
JavaScript, LLM Gateways, Rate Limiting, Token Usage Analytics
zilliztech/GPTCache
8,068 stars
GPTCache is a semantic caching layer and response optimizer for large language models. It functions as pluggable middleware for orchestration frameworks, utilizing vector database caching to store and retrieve model responses based on the semantic similarity of prompts rather than exact text matches. The system uses embeddings to determine cache hits by comparing the distance between new queries and stored vectors. It employs a hybrid storage model that persists original prompts in relational databases while maintaining high-dimensional embeddings in vector stores. The project covers a broad
GPTCache is a dedicated semantic caching layer that supports self-hosting, multi-provider integration, and vector-based similarity matching to effectively reduce latency and costs for LLM applications.
Python, Semantic Caching Systems, Cache Hit Evaluation
alibaba/higress
7,558 stars
Higress is an AI API gateway and cloud-native traffic manager that functions as a Kubernetes ingress controller. It provides a centralized system for routing, securing, and optimizing traffic directed toward large language models, AI agents, and microservice architectures. The project distinguishes itself through deep AI orchestration, including the ability to host and manage Model Context Protocol servers that transform REST APIs into tools for AI agents. It features specialized AI infrastructure for model request proxying, protocol translation across multiple providers, and semantic-based c
Higress is a cloud-native AI gateway that natively supports semantic caching, multi-provider proxying, cost tracking, and request logging, making it a comprehensive solution for managing and optimizing LLM API traffic.
Go, Rate Limiting, Semantic Caching, Token Usage Analytics
BerriAI/litellm
50,579 stars
LiteLLM is a unified gateway and proxy server designed to centralize access to over one hundred language model providers. It provides a standardized API interface that abstracts vendor-specific schemas, allowing developers to interact with diverse models through a single, consistent format. By acting as a central traffic management layer, it enables organizations to route, secure, and govern model interactions across multiple deployments. The platform distinguishes itself through its policy-driven architecture, which uses configuration-based routing to manage traffic distribution, load balanc
LiteLLM is a comprehensive, self-hostable LLM gateway that provides multi-provider support, API cost tracking, request logging, and built-in response caching, making it a complete solution for this use case.
Python, Request Logs
BlockRunAI/ClawRouter
3,020 stars
ClawRouter is an AI model router and API gateway designed to classify query complexity and assign prompts to the most efficient model tier. It operates as a multi-model AI proxy that orchestrates traffic between various large language models and AI media generators through a unified interface. The project distinguishes itself by integrating a non-custodial micropayment processor using the x402 protocol. This allows for per-request API access and USDC settlement on Base and Solana chains, replacing static API keys with wallet-based authentication and real-time budget enforcement. The system c
ClawRouter is a self-hostable AI proxy that provides response caching, multi-provider support, and API cost management, making it a functional tool for optimizing LLM traffic despite its additional focus on blockchain-based micropayments.
TypeScript, LLM Gateways, Rate Limiters
QuantumNous/new-api
39,722 stars
This project is an AI model API gateway and proxy server designed to provide a unified interface for interacting with diverse artificial intelligence service providers. It functions as a centralized middleware platform that routes, load balances, and translates API requests across multiple models, enabling developers to access text, image, audio, and video generation capabilities through a single, standardized integration. The gateway distinguishes itself through comprehensive administrative and financial controls, including event-driven usage accounting, real-time token consumption tracking,
This project functions as a comprehensive API gateway and proxy for LLM providers, offering robust cost tracking, multi-provider routing, and request management, though it lacks explicit support for semantic caching.
Go, Model Proxy Gateways
katanemo/plano
5,120 stars
Plano is an AI agent orchestrator and LLM gateway proxy that unifies access to multiple AI providers through a single interoperable interface. It functions as a model routing engine that decouples applications from specific vendors using semantic aliases, allowing traffic to be shifted between providers without modifying application code. The system distinguishes itself with intent-based agent routing, which directs prompts to specialized agents based on semantic analysis. It features an interceptor-based filter chain system that acts as guardrail middleware to enforce safety policies, rewrit
Plano is a self-hostable LLM gateway proxy that provides multi-provider support, request logging, and observability, though it focuses more on agent orchestration and model routing than on semantic caching.
Rust, LLM Gateways
apache/apisix
16,767 stars
This project is a high-performance, distributed API gateway designed to manage, secure, and observe traffic for microservices, serverless functions, and artificial intelligence model providers. It functions as a dynamic service proxy and cloud-native ingress controller, centralizing policy enforcement and traffic routing through a unified configuration interface that synchronizes state across multiple nodes in real time. The platform distinguishes itself through a highly extensible architecture that utilizes a high-performance scripting engine to execute modular logic directly within the requ
This is a high-performance API gateway that includes specialized plugins for AI model proxying, token-based budget enforcement, and request logging, making it a robust, self-hostable infrastructure choice for managing and caching LLM traffic.
Lua, Request Logs
truefoundry/cognita
4,317 stars
Cognita is a retrieval augmented generation orchestration framework used to build pipelines that connect document stores and language models to provide grounded answers. It functions as a document ingestion pipeline and a vector database integrator, managing the process of loading, parsing, and indexing files into a searchable knowledge base. The system includes a language model gateway proxy that provides a unified API to interact with multiple different model providers. This routing layer decouples the application from specific vendors, allowing requests to be proxied through a provider-agn
This is a RAG orchestration framework that includes a model gateway for provider routing, but it lacks the specific semantic caching and cost-tracking features required for an LLM caching proxy.
Python, LLM Gateways, Model Proxy Gateways
ryoppippi/ccusage
10,826 stars
This project is a command-line utility designed to monitor and analyze token consumption and financial expenditure for AI coding assistants. By parsing local session logs directly on the user's machine, it provides a privacy-focused way to track development activity without transmitting sensitive data to external servers. The tool distinguishes itself through its ability to aggregate disparate log formats from multiple coding assistants into a unified, schema-agnostic representation. It features a decoupled pricing engine that allows users to apply custom model-specific cost multipliers, over
This tool is a local log analyzer for tracking token usage and costs rather than a proxy or middleware capable of caching and intercepting live LLM API requests.
TypeScript, Conversation Cost Aggregators, Token Usage Analytics, Usage Monitoring
