Convert merged GitHub PRs into Harbor tasks automatically.
Automates task creation from real bug fixes in open-source GitHub repos. Works with any programming language: Claude Code analyzes the repo to detect language, build system, and test framework.
Each task reverses a merged PR to recreate the buggy state, then verifies that tests fail on that baseline and pass after the fix is applied. Tasks are fully containerized, with all dependencies installed at build time.
- [03/2026] ☁️ Run Harbor tasks on the cloud with oddish!
- [02/2026] 🦀 SWE-gen-Rust, SWE-gen-Go, and SWE-gen-Cpp released!
- [02/2026] ☕ SWE-gen-Java: 1,000 JVM tasks!
- [01/2026] 🔥 SWE-gen-JS released: 1,000 JS/TS task dataset generated with SWE-gen
- [01/2026] 📖 blog post is out!
# Install
uv pip install swegen
# Generate a task from a merged PR
swegen create --repo axios/axios --pr 7150
# Or farm all PRs from a repo
swegen farm fastapi/fastapi
uv pip install swegen
Ensure these environment variables are set:
export GITHUB_TOKEN=<gh-token>
export OPENAI_API_KEY=<api-key>
export ANTHROPIC_API_KEY=<api-key> # or CLAUDE_CODE_OAUTH_TOKEN
Note: Cloud sandbox environments (Daytona, E2B, Modal, etc.) require additional API keys.
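A quick pre-flight check can catch a missing credential before a long run starts. The sketch below is only an illustration: the variable names come from the list above, while the `missing_credentials` helper is hypothetical and not part of swegen.

```python
import os

# Variable names are taken from the environment list above.
# The helper is a hypothetical illustration, not part of swegen.
REQUIRED = ("GITHUB_TOKEN", "OPENAI_API_KEY")
CLAUDE_ALTERNATIVES = ("ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN")


def missing_credentials(env=None):
    """Return the names of credentials that are unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    # Either one of the Claude credentials is enough.
    if not any(env.get(name) for name in CLAUDE_ALTERNATIVES):
        missing.append(" or ".join(CLAUDE_ALTERNATIVES))
    return missing


if __name__ == "__main__":
    for name in missing_credentials():
        print(f"missing: {name}")
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise without touching the real environment.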
Commands:
- swegen create: Generate a task from a merged PR
- swegen farm: Continuously process PRs from a repository
- swegen validate: Validate an existing task (NOP + Oracle)
- swegen analyze: Deep analysis with agent trials to verify task quality
swegen create --repo <owner/repo> --pr <num>
Options
- --output PATH: Output directory for generated tasks (default: tasks)
- --state-dir PATH: State directory for cache/logs (default: .swegen)
- --cc-timeout N: Claude Code session timeout in seconds (default: 3200)
- --env, -e TYPE: Environment type: docker, daytona, e2b, modal, runloop, gke (default: docker)
- --no-validate: Skip Harbor validations
- --force: Bypass local dedupe and regenerate
- --no-cache: Disable cached artifacts from previous tasks
- --no-require-minimum-difficulty: Skip 3+ file and LLM substantiality checks
- --min-source-files N: Minimum number of source files required (default: 3, tests excluded)
- --max-source-files N: Maximum number of source files to avoid large refactors (default: 10, tests excluded)
- --no-require-issue: Allow PRs without linked issues (uses PR body/title for instructions)
- -v, --verbose / -q, --quiet
Streams through a repository's entire PR history, processing each PR sequentially with state persistence.
swegen farm fastapi/fastapi
Options
- --output PATH: Output directory for generated tasks (default: tasks)
- --state-dir PATH: State directory for cache/logs (default: .swegen)
- --timeout N: Timeout per PR in seconds (default: 300)
- --cc-timeout N: Claude Code session timeout (default: 3200)
- --task-delay N: Delay between tasks in seconds (default: 60)
- --api-delay N: Delay between GitHub API calls in seconds (default: 0.5)
- --env, -e TYPE: Environment type: docker, daytona, e2b, modal, runloop, gke (default: docker)
- --resume-from DATE: Resume from date or timestamp
- --reset: Reset state and start from beginning
- --dry-run: Preview without generation
- --force: Regenerate even if task already exists (default: true)
- --no-validate: Skip Harbor validation step
- --require-issue/--no-require-issue: Require PRs to have linked issues (default: True)
- --no-require-minimum-difficulty: Skip 3+ file and LLM checks
- --min-source-files N: Minimum number of source files required (default: 3, tests excluded)
- --max-source-files N: Maximum number of source files to avoid large refactors (default: 10, tests excluded)
- --no-cache: Disable cached artifacts
- --docker-prune-batch N: Run docker cleanup after every N PRs (default: 5, 0 to disable)
- --skip-list PATH: Path to file with task IDs to skip (one per line)
- -v, --verbose
Verify that a task passes NOP (baseline fails) and Oracle (solution succeeds) agents:
swegen validate <task_id>
Run agent trials to verify a task is well-specified and solvable:
swegen analyze <task_id>
What analyze does
- Static quality check (harbor tasks check)
- Baseline validation (nop fails, oracle passes)
- Run N agent trials
- Trial classification (identifies TASK vs AGENT problems)
- Task verdict synthesis with actionable recommendations
Classification categories:
- GOOD_SUCCESS: Agent solved it correctly
- BAD_SUCCESS: Agent cheated or tests too permissive
- GOOD_FAILURE: Agent failed due to its own limitations
- BAD_FAILURE: Agent failed due to task issues (underspecified, brittle tests, etc.)
- HARNESS_ERROR: Infrastructure problem
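To make the TASK versus AGENT distinction concrete, here is a minimal sketch of tallying trial labels. The category names come from the list above. The grouping into "task", "agent", and "harness" is an assumption inferred from the category descriptions, and `tally_trials` is a hypothetical helper, not the verdict logic swegen itself uses.

```python
from collections import Counter

# Category names are from the list above. The grouping by problem source
# is an assumption inferred from the descriptions, not swegen's own mapping.
CATEGORY_SOURCE = {
    "GOOD_SUCCESS": "none",
    "BAD_SUCCESS": "task",
    "GOOD_FAILURE": "agent",
    "BAD_FAILURE": "task",
    "HARNESS_ERROR": "harness",
}


def tally_trials(labels):
    """Count trial labels per category and per assumed problem source."""
    labels = list(labels)
    unknown = sorted(set(labels) - set(CATEGORY_SOURCE))
    if unknown:
        raise ValueError(f"unknown classification: {unknown}")
    by_category = Counter(labels)
    by_source = Counter(CATEGORY_SOURCE[label] for label in labels)
    return dict(by_category), dict(by_source)
```

Under this reading, a high "task" count suggests revising the task, while a high "agent" count suggests the task is sound but hard.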
Valid PR criteria
Languages: Any (Python, JavaScript, TypeScript, Go, Rust, Ruby, Java, etc.)
Valid PRs must:
- Be merged to primary branch with accessible fork
- Include test changes and corresponding fix
- Have a linked issue for high-quality instructions (bypass with --no-require-issue)
- Modify 3-10 source files (configurable with --min-source-files and --max-source-files, bypass with --no-require-minimum-difficulty)
- Pass LLM substantiality evaluation (bypass with --no-require-minimum-difficulty)
- Fail tests on reversed baseline, pass after applying fix
- Exclude documentation-only, formatting-only, or version-bump-only changes
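The source-file limits above can be illustrated with a small filter. This is a sketch under stated assumptions: the defaults 3 and 10 come from the option descriptions, but the test-path heuristic in `is_test_path` is hypothetical and simplified, and swegen's actual detection may differ.

```python
from pathlib import PurePosixPath

TEST_DIR_NAMES = {"test", "tests", "__tests__"}


def is_test_path(path):
    """Hypothetical heuristic: guess whether a changed file is a test file."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    name = parts[-1].lower()
    in_test_dir = any(part.lower() in TEST_DIR_NAMES for part in parts[:-1])
    return in_test_dir or name.startswith("test_") or ".test." in name or "_test." in name


def within_source_file_limits(changed_paths, min_files=3, max_files=10):
    """Check the count of non-test files against the documented defaults."""
    sources = [path for path in changed_paths if not is_test_path(path)]
    return min_files <= len(sources) <= max_files
```

For instance, a PR touching two source files and one test file falls below the default minimum, because test files are excluded from the count.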
Pipeline details
The pipeline uses a language-agnostic approach:
- Fetch & Analyze — Get PR metadata via GitHub API, clone repo, identify test files
- Evaluate — LLM evaluates PR substantiality and generates task instructions
- Generate Skeleton — Create Dockerfile and test.sh with TODOs for Claude Code
- Claude Code Completion — CC analyzes repo, detects language/runtime/build system, fills in skeleton
- Validation — Run NOP (reward=0) and Oracle (reward=1) agents
- Iteration — CC iterates until both agents pass
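The validation and iteration steps above can be sketched as a simple loop. The reward values (NOP must score 0, Oracle must score 1) come from the pipeline description. The `run_agents` and `revise` callables and the `max_rounds` limit are hypothetical stand-ins for running Harbor and for a Claude Code revision pass.

```python
def baseline_ok(nop_reward, oracle_reward):
    """A task is valid when doing nothing fails and the reference fix passes."""
    return nop_reward == 0 and oracle_reward == 1


def iterate_until_valid(run_agents, revise, max_rounds=3):
    """Re-run validation after each revision.

    run_agents() returns (nop_reward, oracle_reward); revise(nop, oracle)
    updates the task files. Both are hypothetical callables.
    Returns the round that succeeded, or None if no round did.
    """
    for round_number in range(1, max_rounds + 1):
        nop_reward, oracle_reward = run_agents()
        if baseline_ok(nop_reward, oracle_reward):
            return round_number
        revise(nop_reward, oracle_reward)
    return None
```

Bounding the loop is a design choice of this sketch: it keeps a task that can never validate from consuming an unbounded session.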
Key Details:
- Dockerfile clones at HEAD, then applies bug.patch to revert to buggy BASE state
- Test files stored in task/tests/ and copied at runtime (prevents agent tampering)
- fix.patch (solution) excludes tests/CI, contains all other PR changes
- Dependencies installed at build time; runtime doesn't require internet access
- Successful tasks are cached as references to speed up future tasks from the same repo
- PR evaluation uses LLM to check substantiality and generate instructions