eval/main.py is the unified entry point. The current code supports:
- Models: qwen2.5vl, qwen3vl
- Datasets: hcstvg, vidstg, doro-stvg
The default script is eval/run_eval.sh. You can edit it directly to change model paths, annotation paths, video paths, and output paths.
Typical outputs:
- results.json: per-sample predictions, parsed outputs, ground truth, and metrics
- status.json: overall summary and averaged metrics
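As a quick sanity check, the per-sample metrics in results.json can be averaged with a few lines of Python. This is a hedged sketch: the list-of-dicts layout and the metric field name `viou` are assumptions, so check the actual results.json schema before relying on it.

```python
import json

def average_metric(results_path: str, field: str = "viou") -> float:
    """Average one per-sample metric from a results.json-style file.

    The list-of-dicts layout and the field name are assumptions;
    adjust them to match the real eval output schema.
    """
    with open(results_path) as f:
        samples = json.load(f)
    values = [s[field] for s in samples if field in s]
    return sum(values) / len(values) if values else 0.0
```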
graph_generator/ generates structured data from raw videos. Based on the current code, the main pipeline includes:
- Scene splitting
- Object detection and tracking
- Attribute generation
- Action detection
- Relation generation
- Cross-shot reference edge generation (optional)
- STVG query generation from scene graphs
- Formatting query outputs into training-friendly JSONL
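The stages above run sequentially, each enriching a shared set of intermediate artifacts. A toy sketch of that composition follows; it is illustrative only, not the actual graph_generator/main.py code, and the stage functions are placeholders:

```python
def run_pipeline(video_path, stages):
    """Run stand-in pipeline stages in order, accumulating their outputs."""
    state = {"video": video_path}
    for stage in stages:
        # each stage reads what it needs from state and returns new artifacts
        state.update(stage(state))
    return state

# placeholder stages standing in for scene splitting, detection/tracking, etc.
def split_scenes(state):
    return {"scenes": [(0, 100)]}

def detect_and_track(state):
    return {"tracks": {"person_0": [(0, (10, 10, 50, 120))]}}
```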
Relevant entry points:
- graph_generator/main.py: main scene graph generation entry
- graph_generator/modules/query_generator_cpsat.py: generate queries from scene graphs
- graph_generator/utils/format_train.py: convert query outputs into training format
- graph_generator/scripts/run_generator.sh: current command collection used in practice
This repository does not currently use a single root-level setup script. The actual setup should follow the module-specific pyproject.toml files under envs/.
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc

cd /home/wangxingjian/DORO-STVG/envs/eval
uv sync

If uv sync times out on files.pythonhosted.org in this environment, refresh the lock and sync against the configured mirror:
cd /home/wangxingjian/DORO-STVG/envs/eval
uv lock --refresh
uv sync --refresh

cd /home/wangxingjian/DORO-STVG/envs/graph_generator/main
uv sync

This environment is used for:
- graph_generator/main.py
- the main pipeline modules for attributes, relations, reference edges, and query generation
cd /home/wangxingjian/DORO-STVG/envs/graph_generator/action_detector
uv sync

This separate environment is used mainly by the action detection module, to avoid dependency conflicts with the main environment.
The evaluation script currently defaults to decord:
export FORCE_QWENVL_VIDEO_READER=decord

You can switch to torchvision or torchcodec if needed.
graph_generator depends on both model checkpoints and API-related environment variables. The repository already contains graph_generator/.env, and the scripts load it automatically.
The most important variables are:
API_KEYS=your_key_1,your_key_2
MM_API_BASE_URL=https://your-compatible-endpoint

You also need to prepare:
- YOLO weights
- SAM2 / Grounded-SAM2 checkpoints
- VideoMAE action detection checkpoints
- DAM or other attribute-description models
For those details, refer to graph_generator/README.md.
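The comma-separated API_KEYS value lends itself to client-side round-robin rotation. A minimal sketch, assuming only the two variable names shown above; the rotation logic is illustrative, not the repository's actual implementation:

```python
import itertools

def parse_api_env(env: dict):
    """Split API_KEYS into a list and return it with the base URL."""
    keys = [k.strip() for k in env.get("API_KEYS", "").split(",") if k.strip()]
    return keys, env.get("MM_API_BASE_URL", "")

def key_rotator(keys):
    """Yield keys round-robin so request load spreads across them."""
    if not keys:
        raise ValueError("API_KEYS is empty")
    return itertools.cycle(keys)
```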
cd /home/wangxingjian/DORO-STVG/eval
bash run_eval.sh

If you prefer not to use the shell script, you can call the entry point directly:
cd /home/wangxingjian/DORO-STVG/eval
python main.py run \
--model_name=qwen3vl \
--model_path=/path/to/model \
--data_name=hcstvg2 \
--annotation_path=/path/to/test.json \
--video_dir=/path/to/videos \
--output_dir=./res

The current run_generator.sh contains the full pipeline command examples, and the bottom part of the script keeps the active query-generation example.
A typical workflow is:
- Generate scene_graphs.jsonl
- Generate query.jsonl
- Convert it into query_train.jsonl
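The last step of this workflow is a line-by-line JSONL conversion. Here is a hedged sketch of what utils/format_train.py does; the field mapping in to_train_record is an assumption, so consult the script for the real schema:

```python
import json

def to_train_record(q: dict) -> dict:
    # assumed mapping -- the real field names live in utils/format_train.py
    return {"video": q.get("video"), "query": q.get("query")}

def convert_jsonl(in_path: str, out_path: str) -> int:
    """Rewrite each query.jsonl line as a training-format line."""
    count = 0
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            fout.write(json.dumps(to_train_record(json.loads(line))) + "\n")
            count += 1
    return count
```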
This is the training-friendly formatted output generated from query.jsonl by utils/format_train.py. The main fields include:
video, path, queryid, query, Difficulty, Width/Height, box
box is a trajectory string in the following format:
target description: <frame_idx, time_sec, x1, y1, x2, y2; ... />
Here the coordinates are already normalized to [0, 1] using the video width and height, which makes this format easier to use for training and annotation consumption.
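For downstream consumers, the trajectory string can be parsed back into structured boxes. The following is a sketch that assumes exactly the delimiters documented above (`:`, `<`, `;`, `/>`); verify against real query_train.jsonl lines before use:

```python
import re

def parse_trajectory(s: str):
    """Parse 'desc: <frame, time, x1, y1, x2, y2; ... />' into structured boxes."""
    desc, _, body = s.partition(":")
    m = re.search(r"<(.*?)/?>", body, re.S)
    entries = []
    if m:
        for seg in m.group(1).split(";"):
            seg = seg.strip().rstrip("/").strip()
            if not seg:
                continue
            f, t, x1, y1, x2, y2 = (float(v) for v in seg.split(","))
            # coordinates are already normalized to [0, 1]
            entries.append({"frame": int(f), "time": t, "box": (x1, y1, x2, y2)})
    return desc.strip(), entries
```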