CREOLab: CREative tool use in Object-rich Laboratory
License: The dataset and code are provided under the same license for non-commercial research use only. See LICENSE.md for details.
This repository provides a dataset of experimental procedure videos with ground truth annotations, designed for research on automatic procedure generation from laboratory videos.
Video dataset and ground truth procedures are available via DOI:
https://doi.org/10.5281/zenodo.17811722
The dataset includes:
- 65 experimental procedure videos (13 scenarios × 5 decoy variations each)
- Scenarios 01-10: test split (referred to as T1-T10 in the paper)
- Scenarios 11-13: dev split (referred to as D1-D3 in the paper)
- Caption files (one per video) containing:
- Ground truth procedural annotations with step-by-step instructions
- Object coordinates and labels
- Dataset splits for reproducible evaluation
Example ground truth procedure (excerpt):
1. Write a number on the weighing paper using a ballpoint pen.
2. Fold the weighing paper into quarters and cut off the corner with scissors to create a hole.
3. Insert the NMR tube into the hole of the weighing paper.
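For orientation, a caption file can be inspected with the standard library. The following is a minimal sketch that relies only on the repository's data/ layout (shown below) and makes no assumption about the JSON schema:

```python
import json

# Load one of the 65 caption files (path follows the data/ layout below).
with open("data/captions/scenario05_decoy0.json") as f:
    caption = json.load(f)

# The schema is not reproduced here; list the top-level keys to locate the
# step-by-step procedure and the object coordinates/labels.
print(list(caption) if isinstance(caption, dict) else type(caption).__name__)
```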
data/
├── videos/ # scenario##_decoy#.mp4
├── captions/ # scenario##_decoy#.json
└── dataset_splits.json # Train/dev/test split definitions
This repository also includes reference implementation code to reproduce the evaluation experiments from the paper. The code demonstrates automatic procedure generation using the GPT-5 API with two approaches:
- Manual Object Detection: Uses the predefined object coordinates and labels from the caption files
- Auto Object Detection: Uses GPT-5 for automatic object detection
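To make the two approaches concrete, the manual-object-detection idea can be sketched as a single GPT-5 call over one extracted frame plus the caption annotations. This is a minimal illustration assuming the standard OpenAI Python SDK and a hypothetical "objects" field in the caption JSON; it is not the repository's pipeline code:

```python
import base64
import json

from openai import OpenAI  # assumes the standard OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative inputs: one extracted frame (hypothetical file name) and the
# object labels/coordinates that the manual setting takes from a caption file.
with open("frame_0001.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode()
with open("data/captions/scenario05_decoy0.json") as f:
    caption = json.load(f)

response = client.chat.completions.create(
    model="gpt-5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Using the annotated objects below, describe the procedural "
                     "step shown in this frame.\n"
                     # "objects" is a hypothetical key; use the actual field name
                     # from the caption schema.
                     + json.dumps(caption.get("objects", []))},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```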
git clone https://github.com/ToyotaCRDL/CREOLab.git
cd CREOLab
pip install -r requirements.txt
Create a `.env` file (a sketch of loading it follows the setup steps below):
OPENAI_API_KEY=your_openai_api_key
Download and place the dataset:
- Download the video dataset from DOI: https://doi.org/10.5281/zenodo.17811722
- Extract and place the files in the `data/` directory:
  data/
  ├── videos/               # Place scenario##_decoy#.mp4 files here
  ├── captions/             # Place scenario##_decoy#.json files here
  └── dataset_splits.json   # Dataset split definitions (already included)
The expected structure after data placement:
  data/
  ├── videos/
  │   ├── scenario01_decoy0.mp4
  │   ├── scenario01_decoy1.mp4
  │   └── ... (65 video files total)
  ├── captions/
  │   ├── scenario01_decoy0.json
  │   ├── scenario01_decoy1.json
  │   └── ... (65 caption files total)
  └── dataset_splits.json
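Once the setup steps above are complete, the API key can be confirmed from Python. The following is a minimal sketch, assuming python-dotenv (the repository's example scripts may load the key differently):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

# Read OPENAI_API_KEY=... from the .env file in the current working directory.
load_dotenv()
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is not set; check your .env file"
```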
# Single file evaluation (reproduces single-video experiment)
python example/procedure_evaluation_pipeline.py --caption-file data/captions/scenario05_decoy0.json
# Batch processing (reproduces paper evaluation)
python example/procedure_evaluation_pipeline.py --batch debug # 2 takes, quick test
python example/procedure_evaluation_pipeline.py --batch dev # 15 takes, prompt development
python example/procedure_evaluation_pipeline.py --batch test # 50 takes, full evaluation

Deduction-based scoring: the score starts at 100 points, with a deduction applied for each violation:
Note: In the prompt, this is positioned as a count-based quantitative rubric assessment and referred to as "rubric evaluation."
- Critical Step Omissions (-15 each): Missing important procedural steps
- Incorrect Step Sequence (-12 each): Steps in wrong order affecting outcome
- Unnecessary Additional Steps (-8 each): Irrelevant steps adding confusion
- Incomplete Step Descriptions (-5 each): Missing necessary details
- Incorrect Terminology (-10 each): Wrong object names (functionally incompatible)
- Ambiguous Terminology (-5 each): Terms preventing successful execution
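For clarity, the arithmetic of this deduction scheme can be written as a short Python sketch. The category keys and the clamping of the score at 0 are illustrative assumptions, not taken from the repository's evaluation code:

```python
# Per-violation penalties from the rubric above (points deducted per occurrence).
DEDUCTIONS = {
    "critical_step_omission": 15,
    "incorrect_step_sequence": 12,
    "unnecessary_additional_step": 8,
    "incomplete_step_description": 5,
    "incorrect_terminology": 10,
    "ambiguous_terminology": 5,
}

def rubric_score(violation_counts: dict) -> int:
    """Start at 100 and subtract each penalty times its violation count."""
    total = sum(DEDUCTIONS[kind] * count for kind, count in violation_counts.items())
    return max(0, 100 - total)  # clamping at 0 is an assumption

# Example: one critical omission and two ambiguous terms -> 100 - 15 - 2*5 = 75
print(rubric_score({"critical_step_omission": 1, "ambiguous_terminology": 2}))
```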
| Split | Takes | Time/Iteration | Default (5 iter) | Notes |
|---|---|---|---|---|
| debug | 2 | ~20 min | ~100 min | Quick testing |
| dev | 15 | ~150 min | ~750 min | Prompt development |
| test | 50 | ~500 min | ~2500 min | Full evaluation |
Processing time: ~10 min/take/iteration. Time scales linearly with iteration count.
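The table values follow directly from that rate; a small helper makes the arithmetic explicit (illustrative only):

```python
def estimated_minutes(takes: int, iterations: int, minutes_per_take: float = 10.0) -> float:
    """Runtime estimate assuming the ~10 min/take/iteration rate scales linearly."""
    return takes * iterations * minutes_per_take

print(estimated_minutes(50, 5))   # ~2500 min, the default test-split run
print(estimated_minutes(50, 10))  # ~5000 min, the logged 10-iteration run
```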
Complete execution logs and generated results for the test split (50 takes, 10 iterations) are available at:
https://doi.org/10.5281/zenodo.17811722
The logs include:
- Complete console output from the --batch test --iterations 10 execution
- Generated procedures for all 50 takes (manual and auto object detection)
- Evaluation results with detailed scoring breakdowns
- Visualization charts and cross-analysis statistics
- Batch summary and aggregate results
Total processing: ~5000 minutes (~80 hours) with extensive GPT-5 API calls.
This allows researchers to review complete experimental results without re-running the entire pipeline, thereby saving computational resources and time; however, the log data must not be used for secondary purposes or for any use beyond result inspection.
output/
└── batch_YYYYMMDD_HHMMSS/
├── test/ (or dev/ or debug/)
│ └── scenario05_decoy0/
│ ├── iter_01/
│ │ ├── integrations/
│ │ │ ├── manual_object_detection_integrated_procedure.txt
│ │ │ └── auto_object_detection_integrated_procedure.txt
│ │ ├── evaluation/
│ │ │ ├── rubric_evaluation_results.json
│ │ │ ├── evaluation_bar_chart.json
│ │ │ ├── evaluation_bar_chart.png
│ │ │ └── evaluation_summary.txt
│ │ ├── prompts/
│ │ │ ├── auto_detection/
│ │ │ ├── manual_detection/
│ │ │ ├── integration_auto_object_detection_prompt.txt
│ │ │ ├── integration_auto_object_detection_response.txt
│ │ │ ├── integration_manual_object_detection_prompt.txt
│ │ │ └── integration_manual_object_detection_response.txt
│ │ ├── frames/
│ │ ├── reference_images/
│ │ └── segments/
│ └── aggregate/
│ ├── aggregate_results.json
│ ├── aggregate_summary.txt
│ ├── take_aggregate_rubric_deduction_breakdown.json
│ └── take_aggregate_rubric_deduction_breakdown.png
└── cross_analysis/
├── by_scenario_number/
├── by_decoy_number/
└── overall_comparison/
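Per-take results can then be inspected programmatically. The following minimal sketch relies only on the directory layout above and does not assume any particular JSON schema; the timestamped batch directory name is a placeholder:

```python
import json
from pathlib import Path

# Substitute your own timestamped run directory.
batch_dir = Path("output/batch_YYYYMMDD_HHMMSS")

# Collect every per-iteration rubric result under the batch directory
# (split dir / scenario##_decoy# / iter_## / evaluation / ...).
for result_path in sorted(batch_dir.glob("*/*/iter_*/evaluation/rubric_evaluation_results.json")):
    with result_path.open() as f:
        results = json.load(f)
    # The keys inside the JSON are not documented here; print the top-level
    # structure so the actual schema can be inspected.
    print(result_path, list(results) if isinstance(results, dict) else type(results).__name__)
```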
If you use this dataset or code in your research, please cite:
@misc{CREOLab2025,
title = {CREOLab: CREative tool use in Object-rich Laboratory},
author = {Goto, Shigeaki and Hasebe, Tatsuki},
year = {2025},
doi = {10.5281/zenodo.17811722},
url = {https://doi.org/10.5281/zenodo.17811722}
}

Note on AI Assistance: A commercial coding assistant tool incorporating a code generation language model was used as an auxiliary aid during development (e.g., for code completion). The tool was used under a paid commercial license. The final implementation reflects human review and modifications.
