Mini Swe Agent

mini-swe-agent is an autonomous software engineering system designed to develop features and fix bugs by combining large language models with a bash interface. It operates as an agentic framework that executes coding tasks and documentation updates through a continuous cycle of model reasoning and tool execution.

The project differentiates itself with a strong focus on safety and evaluation, utilizing container-based sandbox execution via Docker or Singularity to isolate command execution. It includes a batch-parallel evaluation harness to measure code-fixing accuracy against standardized software engineering datasets and a constraint-based control system to enforce limits on step counts, time, and API expenditure.

The system provides comprehensive LLM API orchestration, supporting a unified interface for multiple model providers, native tool calling, and detailed expenditure tracking. Additional capabilities cover interactive human-in-the-loop oversight via a REPL-style interface, trajectory serialization for post-run analysis, and a flexible configuration system using Jinja2 templates for prompt and observation formatting.

Features

Autonomous Software Engineering - Builds autonomous agents capable of navigating codebases to fix bugs and implement features via bash.

Sandboxed Execution Environments - Runs bash commands in isolated Linux environments to ensure security and reproducibility during agent execution.

Agentic Task Automation - Automates software engineering tasks by combining large language models with a bash interface to fix bugs and develop features.

Automated Engineering Platforms - Automates software engineering tasks like feature development and bug fixing by delegating work to an AI agent.

Agentic Reasoning Loops - Drives a continuous cycle of model reasoning and tool execution until a task is completed.

Human-in-the-Loop Oversight - Implements interactive REPL interfaces for monitoring and intervening in autonomous AI processes.

Agentic Resource Limits - Constrains autonomous loops using budget ceilings and resource limits to prevent excessive API and time expenditure.

Model Provider Integrations - Connects agents to various language models through unified provider interfaces to drive autonomous decision making.

Language Model Integrations - Integrates with various hosted or local language model providers using native tool calling or text-based action formats.

Code Execution Environments - Implements sandboxed environments specifically designed for agents to execute generated code securely.

LLM API Integrations - Sends prompts to external language models and processes completions into usable messages and action sets.

LLM Orchestrators - Manages connections, tool-calling configurations, and cost tracking across multiple LLM providers.

LLM Provider Interfaces - Provides a unified interface to send messages to multiple LLM providers and process their responses into actionable tool calls.

Agent Prediction Evaluations - Evaluates agent-generated code fixes against ground-truth datasets to measure correctness and accuracy.

Autonomous Execution Guards - Enforces hard limits on step counts, time duration, and API expenditure to prevent runaway autonomous processes.

Containerized Sandbox Runtimes - Sets up isolated runtimes using Docker or Singularity with custom startup commands to ensure safe code execution.

Shell Command Execution - Executes shell commands directly on the host machine for engineering tasks and file manipulations.

Code Execution Sandboxes - Runs LLM-generated commands in isolated Docker or Singularity containers for system security.

Benchmark Evaluation Runners - Executes evaluation harnesses across dataset splits in parallel to score generated code patches.

Agent Execution Environments - Provides isolated runtimes via local shells or Docker containers specifically tailored for autonomous agent tasks.

Container-Based Sandboxes - Isolates bash command execution within Docker or Singularity containers to ensure security and reproducibility.

Action Parsing - Implements regular expression-based parsing to extract executable commands from LLM responses in Markdown or XML formats.

Software Engineering Benchmarks - Evaluates agent code-fixing capabilities using standardized software engineering benchmark datasets.

Unified Model Interfaces - Abstracts multiple language model APIs into a single interface to support interchangeable model backends.

Model Orchestration - Routes and manages requests across multiple AI models to optimize task execution and compare performance.

Autonomy Controls - Limits AI agent autonomy using step counts, budget ceilings, and human-in-the-loop approval queues.

Tool Observation Formatters - Converts execution outputs into structured messages using templates and multimodal extraction to provide clear context for the model.

Human-in-the-Loop Workflows - Integrates manual approvals and human interventions directly into the autonomous engineering workflow.

API Operational Cost Limits - Enforces global call and expenditure limits via environment variables to prevent excessive API spending.

Parallel Evaluators - Provides a parallel evaluation system to measure code-fixing accuracy across multiple benchmark instances concurrently.

Native Model Tooling - Leverages provider-specific native tool calling APIs to execute commands instead of parsing markdown text.

Model Integration Configurations - Provides a schema-driven interface for mapping and managing connections to various LLM providers via flags and environment variables.

Model Provider Configurations - Manages credentials and default model selection for both local and remote AI model providers.

Observation Formatters - Transforms raw shell output and multimodal data into structured text using Jinja2 templates for model consumption.

Tool Output Formatters - Transforms raw bash command output into formatted messages that a language model can interpret as observations.

Execution Mode Toggles - Allows switching between manual confirmation, autonomous execution, and human override for all agent actions.

Agent Configurations - Provides local file-based settings to customize model selection, environment classes, and cost limits.

AI Agent Benchmarks - Evaluates the performance of coding agents using standardized software engineering datasets.

Container Command Executors - Executes arbitrary bash commands inside running Docker containers and captures their output for analysis.

Sandboxed Shell Executions - Executes shell commands within isolated cloud sandboxes to perform tasks without affecting the host.

Interactive Engineering Environments - Provides a REPL-style command line interface for running software engineering tasks within a local environment.

Container Lifecycle Management - Manages the lifecycle of containers, including startup and removal, to maintain clean workspaces.

Agent Command Line Interfaces - Provides a command-line interface to interact with the AI agent, manage sessions, and execute bash commands.

AI Token Spend Limits - Implements spending limits in currency and caps on the total number of API calls to prevent unmanaged LLM expenses.

Agent Configuration Files - Uses YAML configuration files to define agent behavior, including step limits and cost constraints.

Exception-Driven Control Flows - Uses a unified exception hierarchy as a control signal to manage agent state, completions, and interruptions.

AI Cost Monitoring - Includes utilities to track token usage and financial costs across different LLM providers.

Agent Trajectory Logs - Records the full sequence of agent thoughts, tool calls, and costs into structured trajectory logs.

Episode Trajectory Recorders - Records the full sequence of messages, tool calls, and cost metrics into structured files for post-run analysis.

API Expenditure Trackers - Calculates and aggregates the financial cost of API requests to monitor total spending across all model interactions.

Execution History Tracking - Tracks the step-by-step history of actions and decisions for post-run analysis.

SWE-agentmini-swe-agent

Features

Star history