# Model evaluation and LLM observability

> Search results for `Model evaluation and LLM observability` on awesome-repositories.com. 117 total matches; showing the first 50.

Explore on the web: https://awesome-repositories.com/q/model-evaluation-and-llm-observability

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [this search on awesome-repositories.com](https://awesome-repositories.com/q/model-evaluation-and-llm-observability).**

## Results

- [lavague-ai/lavague](https://awesome-repositories.com/repository/lavague-ai-lavague.md) (6,374 ⭐) — LaVague is an LLM web agent framework and large action model designed to translate natural language instructions into executable browser automation scripts. It functions as a multi-modal orchestrator that reasons over web page states and HTML content to automate multi-step tasks via a Selenium-based automation engine.

The framework features a modular model provider layer, allowing users to swap between different language and vision models from providers such as Anthropic, Gemini, and Azure OpenAI. It employs a multi-modal world model to process screenshots and HTML structures, utilizing retri
- [lmnr-ai/lmnr](https://awesome-repositories.com/repository/lmnr-ai-lmnr.md) (2,608 ⭐) — Lmnr is an LLM observability platform and evaluation framework designed for tracing, logging, and monitoring language model executions. It provides the tools necessary to debug agent behavior, analyze performance, and identify failure patterns in AI agents.

The platform differentiates itself through a trace-to-dataset pipeline that converts production logs into labeled test sets for regression testing. It includes a prompt-variant replay engine to compare different prompts or models side-by-side and a state-cached debugging system to replay agent loops without restarting the process.

The sys
- [comet-ml/opik](https://awesome-repositories.com/repository/comet-ml-opik.md) (17,787 ⭐) — Opik is an observability and evaluation platform designed for generative AI applications and agentic workflows. It provides a centralized environment for tracing execution flows, managing prompt templates, and monitoring production performance, allowing teams to gain visibility into complex model interactions and tool usage without requiring manual application code changes.

The platform distinguishes itself through its integrated approach to the AI development lifecycle, combining distributed trace instrumentation with automated evaluation frameworks. It supports model-as-a-judge scoring, syn
- [milanm/devops-roadmap](https://awesome-repositories.com/repository/milanm-devops-roadmap.md) (18,752 ⭐) — DevOps-Roadmap is a comprehensive educational repository and knowledge base designed to guide technical professionals through the complexities of modern software engineering. It functions as a structured curriculum and reference library, covering the full spectrum of skills required to master system architecture, infrastructure management, and cloud operations.

The project distinguishes itself by bridging the gap between high-level architectural design and the practical realities of engineering leadership. It provides curated insights into distributed systems, data consistency, and scalable d
- [berriai/litellm](https://awesome-repositories.com/repository/berriai-litellm.md) (50,579 ⭐) — LiteLLM is a unified gateway and proxy server designed to centralize access to over one hundred language model providers. It provides a standardized API interface that abstracts vendor-specific schemas, allowing developers to interact with diverse models through a single, consistent format. By acting as a central traffic management layer, it enables organizations to route, secure, and govern model interactions across multiple deployments.

The platform distinguishes itself through its policy-driven architecture, which uses configuration-based routing to manage traffic distribution, load balanc
- [huggingface/evaluate](https://awesome-repositories.com/repository/huggingface-evaluate.md) (2,455 ⭐) — 🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
- [katanemo/plano](https://awesome-repositories.com/repository/katanemo-plano.md) (5,120 ⭐) — Plano is an AI agent orchestrator and LLM gateway proxy that unifies access to multiple AI providers through a single interoperable interface. It functions as a model routing engine that decouples applications from specific vendors using semantic aliases, allowing traffic to be shifted between providers without modifying application code.

The system distinguishes itself with intent-based agent routing, which directs prompts to specialized agents based on semantic analysis. It features an interceptor-based filter chain system that acts as guardrail middleware to enforce safety policies, rewrit
- [pineappleexpress808/auto-evaluator](https://awesome-repositories.com/repository/pineappleexpress808-auto-evaluator.md) (1,093 ⭐) — Evaluation tool for LLM QA chains
- [aishwaryanr/awesome-generative-ai-guide](https://awesome-repositories.com/repository/aishwaryanr-awesome-generative-ai-guide.md) (24,755 ⭐) — This project is a community-driven knowledge repository and technical learning resource focused on the field of generative artificial intelligence. It serves as a centralized hub for developers and practitioners to access curated research, tutorials, and foundational concepts necessary for building and deploying modern artificial intelligence applications.

The platform distinguishes itself through a collaborative, distributed contribution model that aggregates diverse learning materials into a structured, searchable knowledge base. It covers a wide range of specialized topics, including retri
- [langfuse/langfuse](https://awesome-repositories.com/repository/langfuse-langfuse.md) (29,190 ⭐) — Langfuse is an open-source observability and evaluation platform designed for language model applications. It provides a centralized system for tracking execution traces, monitoring performance metrics, and managing prompt templates. By capturing hierarchical units of work and telemetry data, the platform enables developers to debug complex application lifecycles and analyze token usage, latency, and model interactions in production environments.

The platform distinguishes itself through an integrated evaluation framework that allows for systematic benchmarking and automated scoring of model
- [huggingface/evaluation-guidebook](https://awesome-repositories.com/repository/huggingface-evaluation-guidebook.md) (2,125 ⭐) — Sharing both practical insights and theoretical knowledge about LLM evaluation that we gathered while managing the Open LLM Leaderboard and designing lighteval!
- [onyx-dot-app/onyx](https://awesome-repositories.com/repository/onyx-dot-app-onyx.md) (17,491 ⭐) — Onyx is an enterprise-grade AI platform designed for knowledge management, search, and autonomous agent orchestration. It functions as a centralized system that aggregates unstructured organizational data, enabling secure, context-aware retrieval and interaction across internal documents and communication history. By integrating retrieval-augmented generation with multi-model orchestration, the platform provides a unified interface for teams to query internal knowledge bases and execute complex, multi-step business processes.

The platform distinguishes itself through a focus on private infras
- [briland/llm-security-and-privacy](https://awesome-repositories.com/repository/briland-llm-security-and-privacy.md) (54 ⭐) — LLM security and privacy
- [livekit/livekit](https://awesome-repositories.com/repository/livekit-livekit.md) (19,358 ⭐) — LiveKit is a comprehensive framework for building and orchestrating real-time, multimodal AI agents that interact with users through voice, video, and text. It provides a centralized, event-driven architecture to manage the entire lifecycle of automated participants, from initialization and session state management to graceful shutdown. By utilizing a selective forwarding unit, the platform efficiently routes media streams between participants and agents, ensuring low-latency communication and secure, token-based authentication for all connections.

The platform distinguishes itself through it
- [tensorzero/tensorzero](https://awesome-repositories.com/repository/tensorzero-tensorzero.md) (10,985 ⭐) — TensorZero is an inference gateway and experimentation framework designed to manage the lifecycle of large language models in production environments. It functions as a central proxy that routes requests across multiple artificial intelligence providers while providing the infrastructure necessary to monitor performance, track costs, and ensure service reliability.

The platform distinguishes itself by integrating a comprehensive evaluation engine and an observability pipeline directly into the request flow. It enables developers to conduct controlled experiments and A/B tests to compare diffe
- [controllability/jailbreak-evaluation](https://awesome-repositories.com/repository/controllability-jailbreak-evaluation.md) (27 ⭐) — jailbreak-evaluation is an easy-to-use Python package for language model jailbreak evaluation. It is designed for comprehensive and accurate evaluation of language model jailbreak attempts. Currently, jailbreak-evaluation supports evaluating a language model jailbreak…
- [huggingface/smol-course](https://awesome-repositories.com/repository/huggingface-smol-course.md) (6,661 ⭐) — This project is an educational program focused on the alignment of small language models. It provides a technical curriculum and a series of courses designed to teach how to align models with human preferences and behaviors.

The material covers the implementation of preference optimization algorithms and the adaptation of vision-language models to process both text and image data simultaneously. It also includes instructional guides on synthetic data generation to improve model performance in specialized domains.

The curriculum encompasses supervised fine-tuning workflows, the use of chat te
- [evoagentx/evoagentx](https://awesome-repositories.com/repository/evoagentx-evoagentx.md) (2,555 ⭐) — EvoAgentX is an agent platform that combines human-in-the-loop checkpoints, MCP tool integration, multi-agent workflow orchestration, and self-improvement capabilities. It functions as a self-improving agent framework that connects to MCP-compatible servers and orchestrates multi-agent workflows using natural-language goals, while also serving as a platform that discovers, configures, and manages tools from MCP servers for use in automated agent workflows.

The platform distinguishes itself through a dual-memory agent architecture that maintains short-term and persistent memory stores, enablin
- [verazuo/jailbreak_llms](https://awesome-repositories.com/repository/verazuo-jailbreak-llms.md) (3,563 ⭐) — This project is a comprehensive ecosystem of frameworks, toolkits, and datasets designed to evaluate model vulnerabilities and analyze jailbreak patterns. It serves as an adversarial testing framework and research toolkit for measuring the effectiveness of safety guardrails in large language models.

The system includes a library of real-world prompt injection datasets harvested from social media to study bypass strategies. It provides specialized tools for semantic attack analysis and prompt visualization, allowing for the mapping of relationships between adversarial prompts to discover commo
- [huggingface/lighteval](https://awesome-repositories.com/repository/huggingface-lighteval.md) (2,453 ⭐) — Lighteval is an open-source framework for running standardized benchmarks and custom evaluation tasks against language models. It provides a system for defining new evaluation tasks with custom prompts, metrics, and scoring in YAML configuration files, and integrates with the Hugging Face Hub for storing and comparing results.

The framework supports evaluating models across multiple inference backends, including transformers, vllm, and custom APIs, through a unified generation and log-probability interface. It includes a pluggable metric registry for built-in and custom scoring, a prediction
- [umass-foundation-model/3d-llm](https://awesome-repositories.com/repository/umass-foundation-model-3d-llm.md) (1,205 ⭐) — 3D-LLM: Injecting the 3D World into Large Language Models (NeurIPS 2023 Spotlight). Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan
- [cfahlgren1/observers](https://awesome-repositories.com/repository/cfahlgren1-observers.md) (255 ⭐) — A Lightweight Library for AI Observability
- [langchain-ai/langchain](https://awesome-repositories.com/repository/langchain-ai-langchain.md) (139,458 ⭐) — LangChain is an orchestration framework designed for building, managing, and deploying applications powered by large language models. It provides a unified integration layer that normalizes disparate model provider APIs into a consistent set of primitives, enabling developers to build complex, multi-step AI workflows that manage state, memory, and tool execution.

The project distinguishes itself through a durable execution runtime that maintains persistent state across long-running processes by checkpointing progress to external storage. It models agent workflows as directed graphs, allowing
- [internlm/opencompass](https://awesome-repositories.com/repository/internlm-opencompass.md) (7,096 ⭐) — OpenCompass is a comprehensive evaluation platform, benchmarking suite, and distributed model evaluator designed to measure the performance and accuracy of large language models. It provides a framework for benchmarking both open-source and API-based models against diverse datasets using standardized metrics and reproducible pipelines.

The project features an automated judging framework that uses language models as judges to score and verify the quality of generated text. It includes a performance leaderboard system for comparing the relative capabilities of various models across industry-sta
- [dair-ai/prompt-engineering-guide](https://awesome-repositories.com/repository/dair-ai-prompt-engineering-guide.md) (75,678 ⭐) — This project is a comprehensive educational resource and technical guide focused on the development, optimization, and application of large language models. It provides a structured curriculum for mastering prompt engineering, ranging from foundational principles of instruction design to advanced techniques for improving model reasoning, accuracy, and reliability.

The guide distinguishes itself by offering deep technical insights into agentic workflows and autonomous system design. It covers the implementation of multi-step reasoning chains, tool integration through function calling, and stat
- [tpei/observable](https://awesome-repositories.com/repository/tpei-observable.md) (9 ⭐) — Implementation of the Observer pattern in Crystal
- [roberthein/observable](https://awesome-repositories.com/repository/roberthein-observable.md) (378 ⭐) — The easiest way to observe values in Swift.
- [fincept-corporation/finceptterminal](https://awesome-repositories.com/repository/fincept-corporation-finceptterminal.md) (26,900 ⭐) — FinceptTerminal is a quantitative finance platform and financial engineering library designed for asset valuation, risk management, and fixed-income analytics. It provides a comprehensive suite for algorithmic trading and investment strategy automation, integrating specialized language model agents and node-based workflows to automate market research and alpha generation.

The project distinguishes itself with a dedicated game theory analysis engine for calculating Nash equilibria and simulating strategic interactions in competitive markets. It also features a specialized credit risk modeling
- [edublancas/sklearn-evaluation](https://awesome-repositories.com/repository/edublancas-sklearn-evaluation.md) (3 ⭐) — Machine learning model evaluation made easy: plots, tables, HTML reports, experiment tracking and Jupyter notebook analysis.
- [microsoft/ai-agents-for-beginners](https://awesome-repositories.com/repository/microsoft-ai-agents-for-beginners.md) (67,369 ⭐) — This project is a structured educational resource and technical guide for designing and implementing autonomous systems using large language models. It provides a comprehensive curriculum and code samples focused on agentic design patterns, autonomous development, and the creation of systems capable of planning and executing multi-step tasks.

The resource details the implementation of agentic retrieval-augmented generation, where models autonomously plan and refine data searches. It covers a wide array of orchestrators and design patterns, including metacognitive reflection for self-correctin
- [gsig/actor-observer](https://awesome-repositories.com/repository/gsig-actor-observer.md) (84 ⭐) — ActorObserverNet code in PyTorch from "Actor and Observer: Joint Modeling of First and Third-Person Videos", CVPR 2018
- [lm-sys/fastchat](https://awesome-repositories.com/repository/lm-sys-fastchat.md) (39,472 ⭐) — FastChat is a training and serving platform for large language models that provides an integrated toolkit for fine-tuning, hosting, and benchmarking chatbots. It functions as an inference server capable of hosting multiple models and exposing them via a standardized API for chat applications.

The platform distinguishes itself through a distributed model controller that manages worker nodes and routes requests across a hardware-agnostic inference layer supporting various accelerators. It includes a dedicated evaluation framework for assessing model quality using automated judges, multi-turn di
- [flowiseai/flowise](https://awesome-repositories.com/repository/flowiseai-flowise.md) (53,641 ⭐) — Flowise is a low-code platform designed for building and deploying complex language model workflows through a visual, node-based interface. It functions as an orchestrator for autonomous multi-agent systems, allowing users to construct conversational pipelines by connecting language models, memory stores, and external tools on a drag-and-drop canvas.

The platform distinguishes itself through its support for sophisticated agentic patterns, including supervisor-worker delegation and iterative reasoning strategies. Users can design directed acyclic graphs to manage conditional branching, state p
- [hannibal046/awesome-llm](https://awesome-repositories.com/repository/hannibal046-awesome-llm.md) (26,933 ⭐) — This project serves as a comprehensive, static directory of external resources dedicated to the study and application of large language models. It functions as a centralized discovery point for developers and researchers, aggregating foundational academic papers, technical documentation, and specialized tools within a structured, version-controlled knowledge base.

The repository distinguishes itself through a multi-level classification system that organizes diverse technical domains, ranging from model training frameworks and inference optimization to AI safety and hallucination detection. By
- [raga-ai-hub/ragaai-catalyst](https://awesome-repositories.com/repository/raga-ai-hub-ragaai-catalyst.md) (16,150 ⭐) — RagaAI-Catalyst is a suite of software implementation tools providing an SDK, dashboard, and platform for monitoring, debugging, red-teaming, and evaluating agentic AI workflows. It serves as an observability framework for tracing the execution paths of large language models and multi-agent systems.

The project distinguishes itself through a security suite for automated red-teaming and vulnerability scanning to detect biases, alongside a centralized prompt registry that decouples templates from application code. It further provides an evaluation platform that combines synthetic data generatio
- [redux-observable/redux-observable](https://awesome-repositories.com/repository/redux-observable-redux-observable.md) (7,815 ⭐) — RxJS-based middleware for Redux. Compose and cancel async actions to create side effects and more.
- [tc39/proposal-observable](https://awesome-repositories.com/repository/tc39-proposal-observable.md) (3,107 ⭐) — Observables for ECMAScript
- [llm-d/llm-d](https://awesome-repositories.com/repository/llm-d-llm-d.md) (2,514 ⭐) — llm-d is a distributed serving framework designed for large language model inference. It functions as an inference orchestrator and gateway, providing a control plane for deploying model replicas and managing hardware accelerators. The system includes a batch inference scheduler and a cache manager to coordinate request flow and memory utilization.

The project is distinguished by a disaggregated serving architecture that separates prefill and decode execution phases across specialized workers to maximize throughput. It employs a hardware-agnostic control plane and tiered cache offloading, mov
- [googlecloudplatform/generative-ai](https://awesome-repositories.com/repository/googlecloudplatform-generative-ai.md) (12,700 ⭐) — This project is a development platform for managing the lifecycle of generative artificial intelligence models. It provides a unified environment for accessing, fine-tuning, and deploying large language models, serving as an orchestrator that handles the integration of diverse models into custom applications.

The platform distinguishes itself by offering a managed infrastructure for hosting and scaling models, which removes the requirement for manual server maintenance or configuration. It includes integrated tools for supervised fine-tuning and vector embedding optimization, allowing for the
- [sindresorhus/awesome-observables](https://awesome-repositories.com/repository/sindresorhus-awesome-observables.md) (350 ⭐) — Awesome Observable related stuff - An Observable is a collection that arrives over time.
- [oumi-ai/oumi](https://awesome-repositories.com/repository/oumi-ai-oumi.md) (8,858 ⭐) — Oumi is a comprehensive large language model development platform designed for synthesizing data, fine-tuning models, and running performance evaluations. It serves as a unified environment for the entire model lifecycle, encompassing a training and fine-tuning suite, an evaluation framework, and tools for synthetic data generation and model distillation.

The platform is distinguished by its iterative, failure-driven synthesis approach, which analyzes model weaknesses during evaluation to generate targeted training data. It utilizes an LLM-based judge framework to programmatically score respo
- [mementum/backtrader](https://awesome-repositories.com/repository/mementum-backtrader.md) (20,462 ⭐) — Backtrader is a Python framework designed for the development, backtesting, and live execution of algorithmic trading strategies. It provides a comprehensive environment for quantitative finance, allowing users to simulate trading logic against historical market data or connect directly to brokerage platforms for automated real-time trading.

The project distinguishes itself through a unified event-driven architecture that treats backtesting and live trading with the same API. This consistency is supported by a flexible data-feed abstraction layer that normalizes diverse financial sources, ena
- [mastra-ai/mastra](https://awesome-repositories.com/repository/mastra-ai-mastra.md) (21,221 ⭐) — Mastra is an orchestration framework designed for building, deploying, and managing autonomous AI agents and multi-agent systems. It provides a comprehensive suite of primitives for creating resilient AI applications, including durable workflow orchestration, event-driven agent loops, and semantic memory management. By integrating these core components, the platform enables developers to build complex, multi-step processes that can reason about goals and execute tasks without manual intervention.

The framework distinguishes itself through its focus on observability and secure, isolated execut
- [valentyn1boreiko/llm-threat-model](https://awesome-repositories.com/repository/valentyn1boreiko-llm-threat-model.md) (11 ⭐) — Valentyn Boreiko, Alexander Panfilov, Vaclav Voracek, Matthias Hein, Jonas Geiping
- [lianjiatech/belle](https://awesome-repositories.com/repository/lianjiatech-belle.md) (8,273 ⭐) — BELLE is a specialized implementation of Chinese conversational large language models, encompassing a full instruction tuning framework. It provides a pipeline for training, evaluating, and deploying models optimized for natural language understanding and dialogue tasks in the Chinese language.

The project is distinguished by its integrated approach to model refinement, combining the curation of multi-million entry instruction datasets with a distributed training pipeline. This pipeline supports both full fine-tuning and low-rank adaptation to optimize conversational performance.

The system
- [browser-use/browser-use](https://awesome-repositories.com/repository/browser-use-browser-use.md) (100,229 ⭐) — Browser-use is a framework for building autonomous agents that navigate, interact with, and extract data from web interfaces using natural language instructions. By acting as an orchestration layer between large language models and browser automation protocols, it enables the execution of complex, multi-step workflows without relying on brittle selectors. The system functions as a headless browser controller, providing a programmatic interface to manage browser instances and execute granular interactions.

The project distinguishes itself through its ability to translate high-level intent into
- [blueswen/fastapi-observability](https://awesome-repositories.com/repository/blueswen-fastapi-observability.md) (1,106 ⭐) — Observe FastAPI app with three pillars of observability: Traces (Tempo), Metrics (Prometheus), Logs (Loki) on Grafana through OpenTelemetry.
- [promptfoo/promptfoo](https://awesome-repositories.com/repository/promptfoo-promptfoo.md) (10,529 ⭐) — Promptfoo is an evaluation framework designed for testing, benchmarking, and red-teaming language models and agentic workflows. It provides a unified environment to run prompts against multiple providers, allowing developers to systematically validate model outputs against objective assertions, semantic similarity metrics, and custom grading rubrics.

The platform distinguishes itself through a provider-agnostic execution layer and a stateful orchestrator capable of simulating multi-turn conversations and complex tool-use trajectories. It includes a dedicated adversarial mutation pipeline that
- [cymchad/baserecyclerviewadapterhelper](https://awesome-repositories.com/repository/cymchad-baserecyclerviewadapterhelper.md) (24,607 ⭐) — This project is an Android RecyclerView adapter wrapper designed to reduce boilerplate code when building complex lists. It serves as a framework for simplifying data binding and managing the interaction between data models and their corresponding view holders.

The library distinguishes itself through specialized support for multi-type layout rendering, where diverse data models are mapped to specific layouts within a single list. It provides a structural implementation for expandable list frameworks that allow users to collapse or expand hierarchical items to reveal nested content.

Addition
- [midudev/jscamp](https://awesome-repositories.com/repository/midudev-jscamp.md) (3,811 ⭐) — jscamp is a full-stack web development and education project focused on mastering JavaScript, TypeScript, and AI integration. It provides a structured curriculum and interactive exercises covering language fundamentals, frontend engineering, and backend API development.

The project distinguishes itself through the implementation of autonomous AI agents capable of complex task automation, such as modifying files, managing servers, and executing API calls. It includes advanced AI development tools for conversational querying, real-time code suggestions, and automated repository analysis to gene