SWE-bench is an automated evaluation framework that tests large language models on real-world software engineering tasks. It measures how effectively models generate code patches that resolve actual GitHub issues, using a standardized dataset and a scoring system in which each patch is applied in a Docker container and verified against the project's original test suite.
The framework provides curated benchmark datasets in comprehensive, fast, verified, multilingual, and multimodal evaluation splits, allowing targeted assessment of model capabilities across programming languages and issue types. It includes a containerized evaluation harness that runs locally or on cloud infrastructure, and it supports BM25 retrieval indexing to identify code context relevant to a bug-fixing task. To determine patch correctness, the system parses test logs from multiple test frameworks and build tools, including Pytest, Jest, Maven, and Gradle. It also generates unified diff patches that can be applied automatically to repository codebases.
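As a rough illustration of the log-parsing step, the sketch below extracts per-test statuses from Pytest-style summary lines and decides whether a patch counts as correct. It is not the harness's actual parser: the function names `parse_pytest_summary` and `patch_is_correct`, the status pattern, and the sample log are all assumptions made for this example, and real log formats differ between frameworks.

```python
import re

# Assumed line shape: "<STATUS> <test id> [- message]", as printed in a
# Pytest short test summary. Other frameworks need their own patterns.
STATUS_LINE = re.compile(r"^(PASSED|FAILED|ERROR)\s+(\S+)")


def parse_pytest_summary(log: str) -> dict[str, str]:
    """Map each test ID to the last status reported for it in the log."""
    statuses: dict[str, str] = {}
    for line in log.splitlines():
        match = STATUS_LINE.match(line.strip())
        if match:
            status, test_id = match.groups()
            statuses[test_id] = status
    return statuses


def patch_is_correct(statuses: dict[str, str], must_pass: list[str]) -> bool:
    """Treat a patch as correct only if every required test passed.

    A required test that is missing from the log counts as a failure.
    """
    return all(statuses.get(test_id) == "PASSED" for test_id in must_pass)


# Hypothetical log excerpt, written by hand for this example.
sample_log = """\
PASSED tests/test_core.py::test_parse
FAILED tests/test_core.py::test_render - AssertionError: boom
PASSED tests/test_io.py::test_roundtrip
"""

statuses = parse_pytest_summary(sample_log)
ok_subset = patch_is_correct(statuses, ["tests/test_core.py::test_parse"])
ok_all = patch_is_correct(statuses, list(statuses))
```

Here `ok_subset` is true because the one required test passed, while `ok_all` is false because `test_render` failed. Counting a missing test as a failure is a conservative choice for this sketch, so a truncated log cannot make a patch look correct.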
Beyond evaluation, SWE-bench supports three further workflows: creating new benchmark tasks and training data from user-provided repositories; running live inference on individual GitHub issues by cloning the repository and building a retrieval index; and comparing agent and model performance across variants by resolution rate, cost, and trajectory. The framework also provides tools for dataset tokenization, retrieval dataset loading, and text dataset generation for research contexts.
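To make the comparison of variants concrete, the sketch below aggregates per-instance outcomes into a resolution rate and a mean cost for each variant. The `RunResult` record, its field names, and the `summarize` helper are hypothetical and do not reflect the framework's own report format. The sample values are toy numbers chosen for easy arithmetic, not measured results.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    variant: str      # hypothetical label for a model or agent configuration
    instance_id: str  # hypothetical task identifier
    resolved: bool    # whether the patch passed verification
    cost_usd: float   # assumed per-instance cost


def summarize(results: list[RunResult]) -> dict[str, dict[str, float]]:
    """Compute the resolution rate and mean cost for each variant."""
    totals: dict[str, dict[str, float]] = {}
    for r in results:
        stats = totals.setdefault(r.variant, {"n": 0, "resolved": 0, "cost": 0.0})
        stats["n"] += 1
        stats["resolved"] += int(r.resolved)
        stats["cost"] += r.cost_usd
    return {
        variant: {
            "resolution_rate": stats["resolved"] / stats["n"],
            "mean_cost_usd": stats["cost"] / stats["n"],
        }
        for variant, stats in totals.items()
    }


# Toy data for illustration only.
results = [
    RunResult("agent-a", "task-1", True, 0.25),
    RunResult("agent-a", "task-2", False, 0.75),
    RunResult("agent-b", "task-1", True, 1.0),
]

summary = summarize(results)
```

With this toy input, `agent-a` resolves one of two tasks at a mean cost of 0.5, and `agent-b` resolves its single task at a cost of 1.0. Rates computed over different task sets are not directly comparable, so a real comparison would restrict each variant to the same instances.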