SWE-gen

Convert merged GitHub PRs into Harbor tasks automatically.

Overview

Automates task creation from real bug fixes in open-source GitHub repos. Works with any programming language: Claude Code analyzes the repo to detect language, build system, and test framework.

Each task reverses a merged PR to recreate the buggy state, then verifies that the tests fail on that baseline and pass once the fix is applied. Tasks are fully containerized, with all dependencies installed at build time.

News

[03/2026] ☁️ Run Harbor tasks on the cloud with oddish!
[02/2026] 🦀 SWE-gen-Rust, SWE-gen-Go, and SWE-gen-Cpp released!
[02/2026] ☕ SWE-gen-Java: 1,000 JVM tasks!
[01/2026] 🔥 SWE-gen-JS released: 1,000 JS/TS task dataset generated with SWE-gen
[01/2026] 📖 blog post is out!

Quick Start

# Install
uv pip install swegen

# Generate a task from a merged PR
swegen create --repo axios/axios --pr 7150

# Or farm all PRs from a repo
swegen farm fastapi/fastapi

Installation

uv pip install swegen

Ensure these environment variables are set:

export GITHUB_TOKEN=<gh-token>
export OPENAI_API_KEY=<api-key>
export ANTHROPIC_API_KEY=<api-key>  # or CLAUDE_CODE_OAUTH_TOKEN

Note: Cloud sandbox environments (Daytona, E2B, Modal, etc.) require additional API keys.
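
A quick pre-flight check can catch a missing credential before a long run starts. The following is a minimal Python sketch and is not part of swegen: the function name `missing_env_vars` is hypothetical, and it only knows about the variables listed above (cloud sandbox keys are not covered).

```python
import os

# Variables named in this README. The Anthropic credential may be supplied
# through either of two variables, so that entry lists both alternatives.
REQUIRED = [
    ("GITHUB_TOKEN",),
    ("OPENAI_API_KEY",),
    ("ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN"),
]


def missing_env_vars(env=None):
    """Return readable names for every credential group that is unset."""
    env = os.environ if env is None else env
    missing = []
    for alternatives in REQUIRED:
        # A group is satisfied when any one alternative has a non-empty value.
        if not any(env.get(name) for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing
```

Calling `missing_env_vars()` with no arguments inspects the current process environment; an empty list means all three credentials are present.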

Usage

Commands:

swegen create — Generate a task from a merged PR
swegen farm — Continuously process PRs from a repository
swegen validate — Validate an existing task (NOP + Oracle)
swegen analyze — Deep analysis with agent trials to verify task quality

Generate a Task

swegen create --repo <owner/repo> --pr <num>

Options

--output PATH — Output directory for generated tasks (default: tasks)
--state-dir PATH — State directory for cache/logs (default: .swegen)
--cc-timeout N — Claude Code session timeout in seconds (default: 3200)
--env, -e TYPE — Environment type: docker, daytona, e2b, modal, runloop, gke (default: docker)
--no-validate — Skip Harbor validations
--force — Bypass local dedupe and regenerate
--no-cache — Disable cached artifacts from previous tasks
--no-require-minimum-difficulty — Skip 3+ file and LLM substantiality checks
--min-source-files N — Minimum number of source files required (default: 3, tests excluded)
--max-source-files N — Maximum number of source files to avoid large refactors (default: 10, tests excluded)
--no-require-issue — Allow PRs without linked issues (uses PR body/title for instructions)
-v, --verbose / -q, --quiet
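
When generating tasks from a script, it can help to assemble the command line in one place. The sketch below is an illustration only: `build_create_command` and `run_create` are hypothetical helper names, and the helper uses just a few of the flags listed above. It assumes `swegen` is installed and on `PATH`.

```python
import subprocess


def build_create_command(repo, pr, output=None, env=None, require_issue=True):
    """Assemble the argv list for `swegen create` from flags in this README."""
    cmd = ["swegen", "create", "--repo", repo, "--pr", str(pr)]
    if output is not None:
        cmd += ["--output", output]
    if env is not None:
        cmd += ["--env", env]
    if not require_issue:
        # Allow PRs without a linked issue.
        cmd.append("--no-require-issue")
    return cmd


def run_create(repo, pr, **kwargs):
    """Run the command and return its exit code (needs swegen on PATH)."""
    return subprocess.run(build_create_command(repo, pr, **kwargs)).returncode
```

For example, `build_create_command("axios/axios", 7150)` yields the same command shown in the Quick Start, as a list suitable for `subprocess.run`.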

Continuous PR Farming

Stream through a repository's entire PR history, processing each PR sequentially and persisting state between runs.

swegen farm fastapi/fastapi

Options

--output PATH — Output directory for generated tasks (default: tasks)
--state-dir PATH — State directory for cache/logs (default: .swegen)
--timeout N — Timeout per PR in seconds (default: 300)
--cc-timeout N — Claude Code session timeout in seconds (default: 3200)
--task-delay N — Delay between tasks in seconds (default: 60)
--api-delay N — Delay between GitHub API calls in seconds (default: 0.5)
--env, -e TYPE — Environment type: docker, daytona, e2b, modal, runloop, gke (default: docker)
--resume-from DATE — Resume from date or timestamp
--reset — Reset state and start from beginning
--dry-run — Preview without generation
--force — Regenerate even if task already exists (default: true)
--no-validate — Skip Harbor validation step
--require-issue / --no-require-issue — Require PRs to have linked issues (default: True)
--no-require-minimum-difficulty — Skip 3+ file and LLM checks
--min-source-files N — Minimum number of source files required (default: 3, tests excluded)
--max-source-files N — Maximum number of source files to avoid large refactors (default: 10, tests excluded)
--no-cache — Disable cached artifacts
--docker-prune-batch N — Run docker cleanup after every N PRs (default: 5, 0 to disable)
--skip-list PATH — Path to file with task IDs to skip (one per line)
-v, --verbose
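
The `--skip-list` file is described above as one task ID per line. A small helper for producing such a file might look like the sketch below. This is an assumption-laden illustration: the function names are hypothetical, and whether swegen tolerates blank lines, duplicates, or surrounding whitespace is not stated here, so the helper strips all three to be safe.

```python
from pathlib import Path


def format_skip_list(task_ids):
    """Return file contents: unique, non-empty task IDs, sorted, one per line."""
    cleaned = sorted({tid.strip() for tid in task_ids if tid.strip()})
    return "".join(f"{tid}\n" for tid in cleaned)


def write_skip_list(task_ids, path):
    """Write the formatted skip list to `path` as UTF-8 text."""
    Path(path).write_text(format_skip_list(task_ids), encoding="utf-8")
```

The resulting file can then be passed along as `swegen farm fastapi/fastapi --skip-list <path>`.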

Validate Existing Tasks

Verify that a task passes both validation agents, NOP (tests fail on the baseline) and Oracle (tests pass with the solution applied):

swegen validate <task_id>

Analyze Task Quality

Run agent trials to verify a task is well-specified and solvable:

swegen analyze <task_id>

What analyze does

Static quality check (harbor tasks check)
Baseline validation (NOP fails, Oracle passes)
Run N agent trials
Trial classification (identifies TASK vs AGENT problems)
Task verdict synthesis with actionable recommendations

Classification categories:

GOOD_SUCCESS — Agent solved it correctly
BAD_SUCCESS — Agent cheated or tests too permissive
GOOD_FAILURE — Agent failed due to its own limitations
BAD_FAILURE — Agent failed due to task issues (underspecified, brittle tests, etc.)
HARNESS_ERROR — Infrastructure problem
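
To make the TASK versus AGENT distinction concrete, the categories above can be modeled as an enum and tallied. This is a minimal sketch, not swegen's verdict logic (which is not described here): the names `TrialClass`, `TASK_PROBLEMS`, and `summarize` are hypothetical, and the grouping of `BAD_SUCCESS` and `BAD_FAILURE` as task problems follows the one-line descriptions above.

```python
from collections import Counter
from enum import Enum


class TrialClass(Enum):
    GOOD_SUCCESS = "GOOD_SUCCESS"    # agent solved it correctly
    BAD_SUCCESS = "BAD_SUCCESS"      # agent cheated or tests too permissive
    GOOD_FAILURE = "GOOD_FAILURE"    # agent failed due to its own limitations
    BAD_FAILURE = "BAD_FAILURE"      # agent failed due to task issues
    HARNESS_ERROR = "HARNESS_ERROR"  # infrastructure problem


# Categories that point at the task rather than at the agent (assumption).
TASK_PROBLEMS = {TrialClass.BAD_SUCCESS, TrialClass.BAD_FAILURE}


def summarize(labels):
    """Count trial labels and report how many indicate a task problem."""
    counts = Counter(TrialClass(label) for label in labels)
    task_issues = sum(counts[c] for c in TASK_PROBLEMS)
    return counts, task_issues
```

A nonzero `task_issues` count suggests the task itself (instructions or tests) deserves a closer look before the agent is blamed.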

Task Requirements

Valid PR criteria

Languages: Any (Python, JavaScript, TypeScript, Go, Rust, Ruby, Java, etc.)

Valid PRs must:

Be merged to the primary branch, with an accessible fork
Include test changes and corresponding fix
Have a linked issue for high-quality instructions (bypass with --no-require-issue)
Modify 3-10 source files (configurable with --min-source-files and --max-source-files, bypass with --no-require-minimum-difficulty)
Pass LLM substantiality evaluation (bypass with --no-require-minimum-difficulty)
Fail tests on reversed baseline, pass after applying fix
Exclude documentation-only, formatting-only, or version-bump-only changes
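
The cheap, static part of these criteria can be expressed as a simple filter. The sketch below is illustrative only: `PRSummary` is a hypothetical data shape (not a swegen or GitHub API type), and the LLM substantiality evaluation and the fail-then-pass test run are deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass
class PRSummary:
    """Hypothetical summary of a PR, for illustration only."""
    merged: bool
    has_linked_issue: bool
    source_files_changed: int  # tests excluded
    test_files_changed: int


def rejection_reasons(pr, min_source_files=3, max_source_files=10,
                      require_issue=True, require_minimum_difficulty=True):
    """Return the static criteria a PR fails; an empty list means none."""
    reasons = []
    if not pr.merged:
        reasons.append("not merged")
    if pr.test_files_changed == 0:
        reasons.append("no test changes")
    if require_issue and not pr.has_linked_issue:
        reasons.append("no linked issue")
    in_range = min_source_files <= pr.source_files_changed <= max_source_files
    if require_minimum_difficulty and not in_range:
        reasons.append("source file count out of range")
    return reasons
```

The keyword defaults mirror the documented defaults (3 to 10 source files, linked issue required); passing `require_issue=False` or `require_minimum_difficulty=False` corresponds to the bypass flags mentioned above.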

How It Works

Pipeline details

The pipeline uses a language-agnostic approach:

Fetch & Analyze — Get PR metadata via GitHub API, clone repo, identify test files
Evaluate — LLM evaluates PR substantiality and generates task instructions
Generate Skeleton — Create Dockerfile and test.sh with TODOs for Claude Code
Claude Code Completion — CC analyzes repo, detects language/runtime/build system, fills in skeleton
Validation — Run NOP (reward=0) and Oracle (reward=1) agents
Iteration — CC iterates until both agents pass
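
The validation step above reduces to a two-reward check: NOP must score 0 (tests fail on the buggy baseline) and Oracle must score 1 (tests pass with the fix). A minimal sketch of that acceptance rule follows; `validation_problems` is a hypothetical name, and how Harbor actually reports rewards is not shown in this README.

```python
def validation_problems(nop_reward, oracle_reward):
    """Return reasons a task fails baseline validation; empty list means valid."""
    problems = []
    if nop_reward != 0:
        # The do-nothing agent earned reward, so the baseline is not broken.
        problems.append("NOP: tests pass on the buggy baseline")
    if oracle_reward != 1:
        # The reference solution did not make the tests pass.
        problems.append("Oracle: tests fail after applying the fix")
    return problems
```

A task is accepted only when the returned list is empty; otherwise the pipeline's iteration step would have something to fix.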

Key Details:

The Dockerfile clones the repo at HEAD, then applies bug.patch to revert it to the buggy BASE state
Test files stored in task/tests/ and copied at runtime (prevents agent tampering)
fix.patch (solution) excludes tests/CI, contains all other PR changes
Dependencies installed at build time; runtime doesn't require internet access
Successful tasks are cached as references to speed up future tasks from the same repo
PR evaluation uses LLM to check substantiality and generate instructions
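
The split between fix.patch and the excluded test/CI files can be pictured as a partition of the PR's changed paths. The sketch below uses made-up path heuristics purely for illustration; swegen's actual rules for recognizing test and CI files are not described in this README, and the names `is_test_or_ci` and `split_pr_files` are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical heuristics, for illustration only.
TEST_DIR_NAMES = {"test", "tests", "__tests__"}
CI_PREFIXES = (".github/", ".circleci/")


def is_test_or_ci(path):
    """Guess whether a repo-relative path is a test or CI file."""
    if path.startswith(CI_PREFIXES):
        return True
    # Check every directory component, but not the file name itself.
    return any(part in TEST_DIR_NAMES for part in PurePosixPath(path).parts[:-1])


def split_pr_files(paths):
    """Return (fix_files, excluded_files), preserving input order."""
    fix, excluded = [], []
    for path in paths:
        (excluded if is_test_or_ci(path) else fix).append(path)
    return fix, excluded
```

Under these assumptions, everything in the first list would go into fix.patch, while the second list holds the files kept out of the solution.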

Datasets

License

Apache License 2.0
