30 open-source projects similar to traceloop/openllmetry, ranked by how many features they have in common. Compare stars, activity and what each one does to find the best Openllmetry alternative.
Arize Phoenix is an LLM observability platform and evaluation framework designed to capture execution traces and monitor large language model applications. It serves as a prompt management system for versioning and testing templates, and as a self-hosted AI operations infrastructure for managing telemetry and experiments. The platform differentiates itself through a specialized embedding visualization tool used to detect data drift and optimize vector search. It provides a comprehensive evaluation suite that utilizes judge-based evaluators and ground-truth datasets to score model outputs, and
Helicone is an AI gateway and observability platform designed to intercept, manage, and monitor interactions with large language models. By acting as a reverse-proxy, it provides a centralized layer for routing requests across multiple AI providers, allowing developers to maintain consistent application logic while gaining deep visibility into model performance, usage, and costs. The platform distinguishes itself through a robust suite of traffic management and prompt engineering tools. It enables policy-driven control, including automatic failover between providers, rate limiting, and edge-b
Agenta is a Prompt Ops lifecycle manager and prompt management platform that decouples prompt engineering from application code. It serves as a centralized system for developing, versioning, and deploying prompt templates and model configurations across different environments. The platform functions as an AI agent orchestrator with a visual interface for building agent workflows and connecting models to external tools. It further acts as an evaluation framework and observability tool, utilizing OpenTelemetry to capture execution traces, monitor latency, and track token costs. The system cove
Uptrace is an OpenTelemetry-based observability platform designed to collect, store, and analyze distributed traces, metrics, and logs. It functions as a centralized logging backend, a distributed tracing system, and a metrics engine to monitor application performance and system health. The platform is distinguished by AI-powered operational capabilities, allowing users to query telemetry data and manage monitoring dashboards using natural language. It specifically includes specialized monitoring for generative AI pipelines, tracking token usage and response quality for LLM interactions and r
Opik is an observability and evaluation platform designed for generative AI applications and agentic workflows. It provides a centralized environment for tracing execution flows, managing prompt templates, and monitoring production performance, allowing teams to gain visibility into complex model interactions and tool usage without requiring manual application code changes. The platform distinguishes itself through its integrated approach to the AI development lifecycle, combining distributed trace instrumentation with automated evaluation frameworks. It supports model-as-a-judge scoring, syn
Langfuse is an open-source observability and evaluation platform designed for language model applications. It provides a centralized system for tracking execution traces, monitoring performance metrics, and managing prompt templates. By capturing hierarchical units of work and telemetry data, the platform enables developers to debug complex application lifecycles and analyze token usage, latency, and model interactions in production environments. The platform distinguishes itself through an integrated evaluation framework that allows for systematic benchmarking and automated scoring of model
Lmnr is an LLM observability platform and evaluation framework designed for tracing, logging, and monitoring language model executions. It provides the tools necessary to debug agent behavior, analyze performance, and identify failure patterns in AI agents. The platform differentiates itself through a trace-to-dataset pipeline that converts production logs into labeled test sets for regression testing. It includes a prompt-variant replay engine to compare different prompts or models side-by-side and a state-cached debugging system to replay agent loops without restarting the process. The sys
Comet LLM is an observability platform and evaluation framework designed for large language model applications and agentic workflows. It functions as a system for tracing, monitoring, and debugging execution flows while providing tools for prompt optimization and the enforcement of AI safety guardrails. The platform distinguishes itself through a combination of model-based scoring and heuristic metrics to quantify output quality and detect hallucinations. It includes a dedicated prompt and agent optimizer with an interactive playground for refining templates and tool configurations. For retri
Mastra is an orchestration framework designed for building, deploying, and managing autonomous AI agents and multi-agent systems. It provides a comprehensive suite of primitives for creating resilient AI applications, including durable workflow orchestration, event-driven agent loops, and semantic memory management. By integrating these core components, the platform enables developers to build complex, multi-step processes that can reason about goals and execute tasks without manual intervention. The framework distinguishes itself through its focus on observability and secure, isolated execut
This project is an AI-powered IDE extension and LLM coding assistant that provides a conversational interface for generating, refactoring, and debugging code. It functions as an AI agent framework and a Model Context Protocol client, connecting AI models to external data sources and tools to automate complex development tasks. The system is distinguished by its use of autonomous AI agents capable of multi-step task execution, including the ability to read files, modify code, and run terminal commands iteratively. It supports recursive agent orchestration through subagent delegation and employ
PydanticAI is a Python framework designed for building production-grade autonomous agents. It provides a unified interface for interacting with diverse language models, enabling developers to construct agents that perform complex tasks through structured data validation, tool execution, and multi-turn conversation management. The library centers on type-safe schema enforcement, ensuring that model inputs and outputs remain consistent and reliable throughout the agent's lifecycle. The framework distinguishes itself through a robust architecture that emphasizes modularity and testability. It ut
Plano is an AI agent orchestrator and LLM gateway proxy that unifies access to multiple AI providers through a single interoperable interface. It functions as a model routing engine that decouples applications from specific vendors using semantic aliases, allowing traffic to be shifted between providers without modifying application code. The system distinguishes itself with intent-based agent routing, which directs prompts to specialized agents based on semantic analysis. It features an interceptor-based filter chain system that acts as guardrail middleware to enforce safety policies, rewrit
Giskard is an evaluation framework, testing library, and quality monitoring system for large language models and AI agents. It serves as a toolkit for quantifying model performance and reliability, providing specialized capabilities for validating retrieval-augmented generation pipelines. The project distinguishes itself through an automated red teaming tool and security scanner designed to identify vulnerabilities, prompt injections, and safety risks. It utilizes adversarial probing and synthetic edge case generation to quantify model robustness and detect information disclosure. The platfo
HyperDX is an OpenTelemetry observability platform that provides centralized log management, distributed tracing, and a self-hosted monitoring stack. It functions as a unified system for collecting, indexing, and visualizing logs, metrics, and traces from cloud and container environments. The platform distinguishes itself with specialized tooling for large language model monitoring and session replay, allowing user interactions in the browser to be linked to backend telemetry. It employs schema-less JSON parsing to index structured logs dynamically and uses source maps to resolve minified sta
Logfire is an OpenTelemetry observability platform and Python application monitoring tool. It provides a suite of tools for collecting, storing, and querying spans, logs, and metrics to monitor application performance and execution. The platform features a specialized monitor for Pydantic data validation, tracking data flow and validation outcomes in real time. It also includes a telemetry analysis tool that uses standard SQL to query observability data and connect to business intelligence tools. The system provides automatic instrumentation for Python libraries and frameworks, allowing for
Guardrails is a Python SDK that wraps calls to large language models with configurable validation pipelines, corrective actions, and structured output generation. It provides a unified API layer that connects to over 100 language models, applying consistent validation, streaming, and error-handling across providers. The framework validates and corrects model responses against safety and quality rules, detecting and mitigating risks in both inputs and outputs using pre-built and custom validators. The project distinguishes itself through a validator-pipeline architecture that sequentially appl
This project is a self-hosted AI monitoring stack that functions as an LLM observability platform, AI evaluation framework, and OpenTelemetry trace analyzer. It is designed to capture and analyze LLM traces, sessions, and telemetry to monitor AI agent performance. The platform distinguishes itself as a Model Context Protocol server, exposing workspace functions as tools for AI coding agents. It enables the conversion of failing production traces into test datasets for regression testing and utilizes semantic-based session clustering to discover emerging user behavior patterns. The system cov
lmms-eval is a benchmarking system and performance analysis suite designed to measure the capabilities of large multimodal models. It provides a framework for evaluating models across text, image, audio, and video datasets, serving as a multimodal dataset orchestrator and benchmarking tool to quantify accuracy and efficiency. The project distinguishes itself through a unified multimodal message protocol that structures diverse media inputs for consistent model consumption. It features specialized benchmarking for audio, video, visual, document, and spatial reasoning, alongside tools for model
Evidently is an AI observability platform and evaluation framework designed to quantify the performance of machine learning models and large language models. It functions as a monitoring tool for detecting data drift and quality degradation in tabular datasets, while providing a specialized analyzer for the faithfulness and correctness of retrieval augmented generation systems. The project distinguishes itself through an evaluation framework that utilizes judge models and custom rubrics to score language model outputs. It includes tools for iterative prompt optimization and the generation of
This project is an artificial intelligence gateway that functions as a centralized middleware layer for managing, securing, and observing interactions with language, vision, and audio models. It provides a unified interface that standardizes requests across multiple providers, enabling teams to integrate AI capabilities into their applications through a consistent set of tools and protocols. The gateway distinguishes itself through its comprehensive infrastructure governance and traffic management capabilities. It allows for policy-driven routing, automated failover, and load balancing across
Archestra is a platform for enterprise AI agent deployment and Model Context Protocol orchestration. It provides a centralized system for configuring specialized agents with specific system prompts and toolsets, and managing the deployment of Model Context Protocol servers that provide large language models with external tools and data sources. The system features an AI agent gateway that exposes configured agents as networked services for external clients and integrated development environments. It incorporates a security suite that provides deterministic guardrails to prevent prompt injecti
Claude Code is a command-line interface and multi-agent orchestration framework designed for autonomous software engineering. It enables AI agents to perform codebase modifications, debugging, and Git workflow management while coordinating multiple specialized agents to decompose and execute complex engineering tasks in parallel. The system distinguishes itself through a high degree of isolation and safety, utilizing Git worktrees to create independent working directories for concurrent agents and implementing a tiered permission system that combines user rules, project policies, and OS-level
Ragas is an evaluation framework designed to measure the performance of retrieval-augmented generation pipelines and autonomous agent workflows. It provides a comprehensive suite of tools for benchmarking system outputs, utilizing language models as automated judges to score performance against defined rubrics and reference data. By standardizing inputs, retrieved contexts, and generated responses into a unified schema, the project enables consistent analysis across complex AI applications. The framework distinguishes itself through its ability to generate synthetic test datasets from existin
The AWS Cloud Development Kit is an infrastructure-as-code framework that enables developers to define and provision cloud resources using familiar programming languages. By utilizing construct-based synthesis, it translates high-level, object-oriented code into declarative templates, allowing for the automated management of complex cloud environments through a centralized, code-driven control plane. The framework distinguishes itself through its ability to model infrastructure as a dependency-aware resource graph, ensuring that components are provisioned and updated in the correct order. It
TensorZero is an inference gateway and experimentation framework designed to manage the lifecycle of large language models in production environments. It functions as a central proxy that routes requests across multiple artificial intelligence providers while providing the infrastructure necessary to monitor performance, track costs, and ensure service reliability. The platform distinguishes itself by integrating a comprehensive evaluation engine and an observability pipeline directly into the request flow. It enables developers to conduct controlled experiments and A/B tests to compare diffe
Genkit is an open-source framework for building AI-powered applications. It provides a unified interface for connecting to hundreds of generative AI models from multiple providers, enabling text, image, audio, and video generation through a single API. The framework structures multi-step AI interactions—including chat, retrieval-augmented generation, tool use, and agentic workflows—as composable, traceable flows with built-in streaming and state management. The framework distinguishes itself through a comprehensive developer toolkit that includes a command-line interface and a local developer
mcp-context-forge is a Model Context Protocol federation gateway that unifies diverse AI tool servers and APIs into a single consistent interface for discovery and execution. It acts as a centralized proxy that aggregates multiple servers and APIs, allowing AI agents to access and invoke a unified set of tools, prompts, and resources. The project distinguishes itself through a multi-protocol translation bridge that converts communication between standard I/O, SSE, gRPC, and REST to enable interoperability between disparate tool servers. It includes a comprehensive LLM evaluation framework for
BAML is a prompt engineering framework and LLM client generator that defines AI prompts as type-safe functions. It serves as a structured data extraction tool and workflow orchestrator, transforming unstructured model responses into strongly typed objects using a custom schema language and alignment algorithms. The project distinguishes itself by using a compiler to generate language-specific boilerplate code for API communication and output parsing. It features a dedicated environment for designing complex prompt templates with conditional logic and reusable snippets, and employs genetic alg
OpenTelemetry Go is a framework for generating and collecting distributed traces, metrics, and logs from Go applications. It provides a standardized telemetry instrumentation API for adding observability markers to code and a corresponding SDK for processing and emitting these signals. The project utilizes a configurable observability pipeline to sample and export telemetry data to external backends using the OTLP wire protocol. It features a pluggable export system and a separation between the public API and the SDK implementation, allowing telemetry to be routed to third-party platforms wit