
fix: --resume now reuses existing run directory #192

Open
jecanore wants to merge 2 commits into aiming-lab:main from jecanore:fix/resume-reuses-existing-run

Conversation

@jecanore
Contributor

Summary

  • Fixed --resume creating a new run directory instead of reusing the existing one. Previously, cmd_run() always generated a fresh run_id/run_dir before checking checkpoints, so --resume without --output would always start from scratch.
  • Added _find_latest_run() helper that auto-detects the most recent run in artifacts/ with a checkpoint.json, reads the original run_id from it, and resumes from the correct stage.
  • Increased ACP timeout from 600s to 1200s to prevent CODE_GENERATION stage timeouts on complex experiment prompts.
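The detection step described in the second bullet could look roughly like the sketch below. This is not the PR's actual `_find_latest_run()`: the function name, the return type, the use of the checkpoint file's modification time to decide which run is "most recent", and the `"run_id"` key inside `checkpoint.json` are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def find_latest_run(artifacts_dir: Path) -> Optional[Tuple[Path, str]]:
    """Return (run_dir, run_id) for the newest run that has a usable checkpoint.json.

    Hypothetical stand-in for the PR's _find_latest_run(); returns None when
    no resumable run exists so the caller can print a clear error.
    """
    if not artifacts_dir.is_dir():
        return None

    # Only directories that contain a checkpoint.json are resumable.
    candidates = [
        d for d in artifacts_dir.iterdir()
        if d.is_dir() and (d / "checkpoint.json").is_file()
    ]

    # Assumption: "most recent" means the newest checkpoint modification time.
    candidates.sort(
        key=lambda d: (d / "checkpoint.json").stat().st_mtime,
        reverse=True,
    )

    for run_dir in candidates:
        try:
            text = (run_dir / "checkpoint.json").read_text(encoding="utf-8")
            data = json.loads(text)
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or corrupt checkpoints

        # Assumption: the original run id is stored under the "run_id" key.
        run_id = data.get("run_id") if isinstance(data, dict) else None
        if run_id:
            return run_dir, run_id

    return None
```

Returning `None` rather than raising keeps the "no existing runs" case in the caller, which is where the CLI would decide how to word the error message.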

Test plan

  • 5 new unit tests for _find_latest_run() and resume error handling
  • All 1264 existing tests pass, apart from 1 pre-existing failure in test_search_arxiv_mock that is unrelated to this change
  • Manual: researchclaw run --resume finds latest run and resumes from checkpoint
  • Manual: researchclaw run --resume with no existing runs prints clear error

🤖 Generated with Claude Code

jecanore and others added 2 commits March 16, 2026 17:36
- Add resolve_config_path() to search for config.arc.yaml then config.yaml
- Change --config default to None (auto-detect) on run/validate/doctor
- Add _resolve_config_or_exit() helper with init hint on missing config
- Add `researchclaw init` subcommand with interactive provider selection
- String-based template replacement preserves YAML comments
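The config auto-detection in the first bullet can be illustrated with a short sketch. Only the search order (`config.arc.yaml`, then `config.yaml`) comes from the commit message; the signature, the `explicit` and `cwd` parameters, and the choice to return `None` on a miss are assumptions, not the repository's actual `resolve_config_path()`.

```python
from pathlib import Path
from typing import Optional

# Search order taken from the commit message: config.arc.yaml, then config.yaml.
CONFIG_CANDIDATES = ("config.arc.yaml", "config.yaml")


def resolve_config_path(
    explicit: Optional[str] = None,
    cwd: Optional[Path] = None,
) -> Optional[Path]:
    """Return the config file to use, or None if nothing was found.

    Hypothetical signature: `explicit` models a --config value (None means
    auto-detect) and `cwd` is the directory to search.
    """
    if explicit is not None:
        # An explicit --config always wins over auto-detection.
        return Path(explicit)

    base = cwd if cwd is not None else Path.cwd()
    for name in CONFIG_CANDIDATES:
        candidate = base / name
        if candidate.is_file():
            return candidate

    return None
```

A caller in the spirit of `_resolve_config_or_exit()` would treat a `None` result as the cue to exit with a hint to run `researchclaw init`.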
… new one

Previously, `researchclaw run --resume` always generated a fresh run_id and
run_dir before checking for checkpoints, so it would look for a checkpoint
in the new empty directory, find nothing, and start from scratch.

Now --resume auto-detects the most recent run in artifacts/ that has a
checkpoint.json, reads the original run_id from it, and resumes from the
next stage. Also increases ACP timeout from 600s to 1200s to prevent
CODE_GENERATION timeouts on complex experiments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>