# swe-bench/swe-bench

**Attribution required: if you use, quote, or summarise this content, you must credit and link back to [awesome-repositories.com](https://awesome-repositories.com/repository/swe-bench-swe-bench).**

4,321 stars · 757 forks · Python · MIT

## Links

- GitHub: https://github.com/SWE-bench/SWE-bench
- Homepage: https://www.swebench.com
- awesome-repositories: https://awesome-repositories.com/repository/swe-bench-swe-bench.md

## Topics

`benchmark` `language-model` `software-engineering`

## Description

SWE-bench is an automated evaluation framework that tests large language models on real-world software engineering tasks. It measures how effectively models can generate and apply code patches that resolve actual GitHub issues, using a standardized dataset and scoring system built around Docker-based patch verification against original project test suites.
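
The scoring idea described above can be sketched as follows. This is a minimal illustration, not the harness's own code: it assumes each task lists tests that must go from failing to passing and tests that must keep passing, and that a patch counts as resolving the issue only when all of them pass. The function names and the `"PASSED"` status string are hypothetical.

```python
from typing import Dict, Iterable

PASSED = "PASSED"  # assumed status label, chosen for this sketch


def is_resolved(
    test_status: Dict[str, str],
    fail_to_pass: Iterable[str],
    pass_to_pass: Iterable[str],
) -> bool:
    """Return True when every required test passed after the patch was applied.

    A test that is absent from `test_status` is treated as not passed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_status.get(name) == PASSED for name in required)


def resolution_rate(outcomes: Iterable[bool]) -> float:
    """Fraction of task instances whose patch resolved the issue."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

Treating a missing test as a failure is a deliberately conservative choice here: a patch that prevents the test suite from running should not be counted as a fix.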

The framework provides curated benchmark datasets in full, Lite (a smaller, faster subset), human-verified, multilingual, and multimodal variants, allowing targeted assessment of model capabilities across different programming languages and issue types. It includes a containerized evaluation harness that runs locally or on cloud infrastructure, along with BM25 retrieval indexing to identify relevant code context for bug-fixing tasks. The system parses test logs from multiple frameworks, including Pytest, Jest, Maven, and Gradle, to determine patch correctness, and generates unified diff patches for automated application to repository codebases.
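
To make the log-parsing step concrete, here is a minimal sketch for one framework. It assumes Pytest was run with a short test summary enabled (for example `-rA`), so the log contains lines such as `PASSED path::test_name` and `FAILED path::test_name - reason`. The function name `parse_pytest_summary` is hypothetical, and the project's own parsers are likely to handle many more cases.

```python
import re
from typing import Dict

# Matches short-summary lines such as:
#   PASSED tests/test_m.py::test_add
#   FAILED tests/test_m.py::test_div - ZeroDivisionError
_STATUS_LINE = re.compile(r"^(PASSED|FAILED|ERROR)\s+(\S+)")


def parse_pytest_summary(log: str) -> Dict[str, str]:
    """Map each test ID found in the log to its reported status."""
    statuses: Dict[str, str] = {}
    for line in log.splitlines():
        match = _STATUS_LINE.match(line.strip())
        if match:
            status, test_id = match.groups()
            statuses[test_id] = status
    return statuses
```

The resulting mapping from test ID to status is the kind of structure a correctness check can consume, regardless of which framework produced the log.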

Beyond evaluation, SWE-bench supports creating new benchmark tasks and training data from user-provided repositories, running live inference on individual GitHub issues through repository cloning and retrieval index construction, and comparing agent and model performance across variants using resolution rates, costs, and trajectories. The framework also provides tools for dataset tokenization, retrieval dataset loading, and text dataset generation for research contexts.
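
The retrieval step mentioned above can be illustrated with a small, self-contained BM25 ranker. This is a sketch of the general Okapi BM25 technique under simple assumptions (lowercase word tokenization, in-memory documents, common default values for `k1` and `b`); it is not the project's retrieval code, and the names `tokenize` and `bm25_rank` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into identifier-like tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(
    query: str, docs: Dict[str, str], k1: float = 1.5, b: float = 0.75
) -> List[Tuple[str, float]]:
    """Rank documents (name -> body) against the query, best match first."""
    tokenized = {name: tokenize(body) for name, body in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized.values()) / max(n_docs, 1)

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for name, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append((name, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

In a live-inference setting, the issue text would play the role of the query and the cloned repository's files would be the documents; the top-ranked files then become the code context given to the model.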

## Tags

### Part of an Awesome List

- [GitHub Issue Resolution Benchmarks](https://awesome-repositories.com/f/awesome-lists/ai/benchmarks-and-datasets/github-issue-resolution-benchmarks.md) — Measures how effectively models generate and apply code fixes to real software bugs using a standardized dataset and scoring system.
- [Prebuilt Evaluation Containers](https://awesome-repositories.com/f/awesome-lists/ai/benchmark-and-evaluation/prebuilt-evaluation-containers.md) — Uses a container-based harness to ensure consistent and repeatable patch verification across environments. ([source](https://swebench.com/SWE-bench/))
- [Multi-Subset Evaluators](https://awesome-repositories.com/f/awesome-lists/ai/multimodal-evaluation-benchmarks/multi-subset-evaluators.md) — Runs models against curated benchmarks including human-verified, multilingual, lite, and multimodal issue sets. ([source](http://swe-bench.github.io/))
- [Coding Benchmarks](https://awesome-repositories.com/f/awesome-lists/devtools/coding-benchmarks.md) — Benchmark for resolving real-world GitHub issues.

### Artificial Intelligence & ML

- [Language Model Benchmark Suites](https://awesome-repositories.com/f/artificial-intelligence-ml/benchmarking-suites/query-benchmark-suites/language-model-benchmark-suites.md) — Tests language models on real-world software engineering tasks from GitHub issues. ([source](https://swebench.com/SWE-bench/))
- [Code Patch Evaluations](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-model-evaluation/code-patch-evaluations.md) — Tests submitted code patches against real GitHub issue test suites. ([source](https://github.com/swe-bench/SWE-bench/blob/main/docs/20240627_docker/README.md))
- [Docker-Verified Code Patch Benchmarks](https://awesome-repositories.com/f/artificial-intelligence-ml/generative-model-evaluation/code-patch-evaluations/docker-verified-code-patch-benchmarks.md) — Tests large language models on real-world GitHub issues using Docker-based patch verification against original project test suites.
- [Ground-Truth Scoring](https://awesome-repositories.com/f/artificial-intelligence-ml/ground-truth-scoring.md) — Compares generated patches against gold-standard patches and test cases to measure accuracy.
- [Inference APIs](https://awesome-repositories.com/f/artificial-intelligence-ml/inference-apis.md) — Runs model inference on benchmark instances via external APIs with progress tracking. ([source](https://swebench.com/SWE-bench/api/inference/))
- [Model Performance Benchmarking](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/model-evaluation-analysis/model-analysis/model-performance-benchmarking.md) — Evaluates language models by measuring the percentage of real-world issues they resolve with patches. ([source](http://swe-bench.github.io/))
- [Cross-Language Code Benchmarks](https://awesome-repositories.com/f/artificial-intelligence-ml/multilingual-code-models/cross-language-code-benchmarks.md) — Tests language models on software issues across multiple programming languages.
- [Cross-Language Code Evaluations](https://awesome-repositories.com/f/artificial-intelligence-ml/multilingual-code-models/cross-language-code-evaluations.md) — Evaluates language models on software issues across multiple programming languages. ([source](http://swe-bench.github.io/))
- [Code Context Retrieval](https://awesome-repositories.com/f/artificial-intelligence-ml/documentation-retrieval-engines/rag-document-retrieval/code-context-retrieval.md) — Uses BM25 retrieval to index repository documents for accurate bug fixes.
- [Local Harness Runners](https://awesome-repositories.com/f/artificial-intelligence-ml/machine-learning/infrastructure/training-monitoring-and-profiling/ai-observability/ai-observability-and-evaluation/evaluation-execution-tracers/harness-verification/local-harness-runners.md) — Executes a command-line tool that builds Docker images, runs patch verification, and stores logs locally. ([source](https://cdn.jsdelivr.net/gh/swe-bench/swe-bench@main/README.md))

### DevOps & Infrastructure

- [Evaluation Harness Containers](https://awesome-repositories.com/f/devops-infrastructure/container-orchestration/container-runtimes/runtime-configuration-interfaces/docker-socket-orchestrators/docker-target-configurators/docker-container-deployments/docker-container-execution/evaluation-harness-containers.md) — Builds isolated Docker containers for each software issue to verify generated patches.
- [Benchmark Evaluation Runners](https://awesome-repositories.com/f/devops-infrastructure/workflow-run-management/evaluation-run-historians/benchmark-evaluation-runners.md) — Runs the evaluation harness on supported dataset splits to score model-generated patches against real issues. ([source](https://github.com/swe-bench/SWE-bench/blob/main/docs/assets/evaluation.md))
- [Dataset Loaders](https://awesome-repositories.com/f/devops-infrastructure/model-conversion/hugging-face/dataset-loaders.md) — Loads pre-built benchmark datasets from Hugging Face for comprehensive and multimodal evaluation.
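
As a rough illustration of what an evaluation runner consumes, the sketch below writes model predictions to a JSON Lines file. The three keys (`instance_id`, `model_name_or_path`, `model_patch`) are an assumption based on the harness's documented prediction format at the time of writing and are worth confirming against the project README; the helper name `write_predictions` and the example values are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterable, Mapping

# Assumed prediction keys; verify against the project's documentation.
REQUIRED_KEYS = ("instance_id", "model_name_or_path", "model_patch")


def write_predictions(path: Path, predictions: Iterable[Mapping[str, str]]) -> int:
    """Write one JSON object per line and return how many were written."""
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for prediction in predictions:
            missing = [key for key in REQUIRED_KEYS if key not in prediction]
            if missing:
                raise ValueError(f"prediction is missing keys: {missing}")
            record = {key: prediction[key] for key in REQUIRED_KEYS}
            handle.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Validating the keys before writing catches malformed predictions early, before any Docker images are built for the run.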

### Software Engineering & Architecture

- [Docker-Based Patch Verification Harnesses](https://awesome-repositories.com/f/software-engineering-architecture/executable-activity-definitions/test-harnesses/docker-based-patch-verification-harnesses.md) — Provides a containerized system that builds task-specific environments to verify generated patches against original test suites.
- [Evaluation Pipelines](https://awesome-repositories.com/f/software-engineering-architecture/training-pipelines/two-stage/evaluation-pipelines.md) — Executes a standardized pipeline that builds Docker images, runs patch predictions, and logs results. ([source](https://cdn.jsdelivr.net/gh/swe-bench/swe-bench@main/README.md))
- [Unified Diff Formats](https://awesome-repositories.com/f/software-engineering-architecture/unified-diff-formats.md) — Generates and refines code patches as unified diffs for automated application to codebases.
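
For readers unfamiliar with the unified diff format, the sketch below produces one with Python's standard `difflib` module. It only illustrates the format; the helper name `make_unified_diff` and the `a/` and `b/` path prefixes (a common Git convention) are choices made for this example, not something taken from the project.

```python
import difflib


def make_unified_diff(path: str, before: str, after: str) -> str:
    """Return a unified diff that turns `before` into `after` for one file."""
    diff_lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff_lines)


if __name__ == "__main__":
    old = "def add(a, b):\n    return a - b\n"
    new = "def add(a, b):\n    return a + b\n"
    print(make_unified_diff("calc.py", old, new))
```

The output has the usual `---` and `+++` file headers followed by `@@` hunks, which is the shape that tools such as `git apply` and `patch` expect.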

### Testing & Quality Assurance

- [LLM-As-A-Judge Scoring](https://awesome-repositories.com/f/testing-quality-assurance/llm-as-a-judge-scoring.md) — Scores model-generated patches against real project test suites. ([source](https://github.com/swe-bench/SWE-bench/blob/main/docs/20240627_docker/README.md))
- [Automated Bug Fixing Evaluation](https://awesome-repositories.com/f/testing-quality-assurance/software-testing/testing-frameworks/test-frameworks/browser-and-ui-testing/browser-automation-frameworks/web-testing-frameworks/automated-bug-fixing-evaluation.md) — Automatically generates and tests code patches for real-world software bugs.
- [Test Log Parsers](https://awesome-repositories.com/f/testing-quality-assurance/test-log-parsers.md) — Parses test logs from Pytest, Jest, Maven, and Gradle to determine patch correctness.
- [Benchmark Result Comparison](https://awesome-repositories.com/f/testing-quality-assurance/agent-performance-benchmarks/benchmark-result-analysis/benchmark-result-comparison.md) — Compares resolution rates, costs, and trajectories across model variants. ([source](http://swe-bench.github.io/))
- [Multi-Framework](https://awesome-repositories.com/f/testing-quality-assurance/test-log-parsers/multi-framework.md) — Parses test logs from frameworks including Pytest, Jest, Maven, and Gradle for cross-language evaluation.

### Data & Databases

- [BM25 Search Indices](https://awesome-repositories.com/f/data-databases/bm25-search-indices.md) — Performs BM25 retrieval on datasets to find relevant documents for a given query. ([source](https://swebench.com/SWE-bench/guides/create_rag_datasets/))
- [Benchmark Dataset Loaders](https://awesome-repositories.com/f/data-databases/static-benchmark-datasets/benchmark-dataset-loaders.md) — Loads pre-built datasets of real-world software issues with variants for targeted evaluation. ([source](https://cdn.jsdelivr.net/gh/swe-bench/swe-bench@main/README.md))
- [Multi-Split Dataset Loaders](https://awesome-repositories.com/f/data-databases/static-benchmark-datasets/benchmark-dataset-loaders/multi-split-dataset-loaders.md) — Provides curated problem sets including comprehensive, fast, verified, and multimodal evaluation splits. ([source](https://swebench.com/SWE-bench/))

### Development Tools & Productivity

- [Automated Issue Resolvers](https://awesome-repositories.com/f/development-tools-productivity/issue-trackers/automated-issue-resolvers.md) — Assesses model performance on software issues with visual elements. ([source](http://swe-bench.github.io/))
- [Multimodal Issue Resolvers](https://awesome-repositories.com/f/development-tools-productivity/issue-trackers/automated-issue-resolvers/multimodal-issue-resolvers.md) — Evaluates models on software issues with visual elements like screenshots.

### System Administration & Monitoring

- [Agent Run Comparators](https://awesome-repositories.com/f/system-administration-monitoring/agent-execution-tracing/agent-run-comparators.md) — Compares agent configurations and models side-by-side on standardized tasks. ([source](http://swe-bench.github.io/))
