📊 Background
ToolOrchestra currently ranks #1 on the GAIA benchmark, but the evaluation code is not available in the repository. Adding GAIA evaluation support would greatly benefit the community.
🎯 Desired Features
- GAIA Dataset Loading: Support for loading GAIA test/validation splits
- Evaluation Script: Script to evaluate models on GAIA benchmark
- Result Formatting: Output format compatible with GAIA leaderboard submission
- Documentation: Usage instructions and examples
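As a rough illustration of the result-formatting feature, the sketch below serializes predictions as JSON Lines. The `task_id` and `model_answer` field names are an assumption about what the leaderboard expects and should be checked against the current submission instructions. `format_submission` is a hypothetical helper, not existing ToolOrchestra code.

```python
import json
from typing import Iterable, Mapping


def format_submission(predictions: Iterable[Mapping[str, str]]) -> str:
    """Serialize predictions as JSON Lines, one object per task.

    Assumption: each prediction carries a "task_id" and a "model_answer";
    these key names are not confirmed by the ToolOrchestra repository.
    """
    lines = []
    for pred in predictions:
        record = {
            "task_id": pred["task_id"],
            # Strip stray whitespace so it cannot affect answer matching.
            "model_answer": str(pred["model_answer"]).strip(),
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    # End with a newline only when there is at least one record.
    return "\n".join(lines) + ("\n" if lines else "")
```

The returned string could then be written to a file for submission.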
💡 Suggested Implementation
- Add `scripts/evaluate_gaia.py` for running evaluations
- Include dataset loaders in `data/` directory
- Provide configuration files for different GAIA difficulty levels
- Add result parser to format outputs for leaderboard submission
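A minimal sketch of what the core of such an evaluation script might look like, assuming each example is a mapping with `Question`, `Level`, and `Final answer` keys (an assumption about the dataset schema that should be verified). The `evaluate` and `normalize` names are hypothetical, and the normalized exact match used here is a simplification, not the official GAIA scorer.

```python
from typing import Callable, Iterable, Mapping, Optional


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (simplified, not the official scorer)."""
    return " ".join(answer.lower().split())


def evaluate(
    examples: Iterable[Mapping[str, object]],
    answer_fn: Callable[[str], str],
    level: Optional[int] = None,
) -> dict:
    """Score answer_fn on examples, optionally restricted to one difficulty level.

    Assumption: the "Question", "Level", and "Final answer" keys are
    illustrative field names, not confirmed by the ToolOrchestra repository.
    """
    total = 0
    correct = 0
    for ex in examples:
        # Skip examples outside the requested difficulty level.
        if level is not None and int(ex["Level"]) != level:
            continue
        prediction = answer_fn(str(ex["Question"]))
        total += 1
        if normalize(prediction) == normalize(str(ex["Final answer"])):
            correct += 1
    accuracy = correct / total if total else 0.0
    return {"total": total, "correct": correct, "accuracy": accuracy}
```

Here `answer_fn` stands in for whatever callable wraps the model, and the `level` argument plays the role that per-level configuration files would play in a full implementation.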
📝 Additional Context
This would make it easier for researchers to: