Graphia is a reinforcement learning-based social network graph generation framework.
We have published processed versions of the weibo-tech and weibo-daily datasets. The propagate-en (8days_dytag_small_text_en) dataset will be made public after the paper is accepted.
Dataset link: https://www.modelscope.cn/datasets/cather111/Graphia_data
Please download and place the dataset in the following directory:
Graphia/data
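One way to do this is with the ModelScope CLI (`pip install modelscope`); the `--dataset` flag is available in recent ModelScope versions, and the command below is a sketch rather than a verified part of this repo:

```bash
# Fetch the processed datasets into the expected location.
modelscope download --dataset cather111/Graphia_data --local_dir Graphia/data
```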
Graphia models trained on weibo-tech are available at the following links:
- https://www.modelscope.cn/models/cather111/Graphia-Q
- https://www.modelscope.cn/models/cather111/Graphia-E
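The checkpoints can be fetched the same way; the local directories below are our choice, not a requirement of the code:

```bash
# Download the released Graphia checkpoints (ModelScope CLI assumed).
modelscope download --model cather111/Graphia-Q --local_dir models/Graphia-Q
modelscope download --model cather111/Graphia-E --local_dir models/Graphia-E
```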
To facilitate reproduction of the experimental results, we have uploaded all baseline model code to: Graphia_baselines
```
Graphia/
├── scripts/
│   ├── prepare_dataset.sh                    # Graph dataset formatting script
│   ├── train_dp.sh                           # Activity predictor training script
│   └── prepare_prompt.sh                     # LLM training data formatting script
├── prompt_data/
│   └── weibo_daily/
│       └── train/
│           ├── cold_start/
│           │   └── combined_examples.jsonl   # SFT training data
│           ├── seq/
│           │   ├── seq_edge.jsonl            # Graphia-seq edge RL data
│           │   └── seq_dst.jsonl             # Graphia-seq dst RL data
│           └── teacher_forcing/
│               ├── edge_text_examples.jsonl  # Graphia edge RL data
│               └── query_examples.jsonl      # Graphia dst RL data
└── README.md
```
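Once the preparation scripts below have run (or if the processed dataset already ships these files), the layout can be verified with:

```bash
# Should list the five .jsonl files shown in the tree above.
find Graphia/prompt_data/weibo_daily/train -name '*.jsonl'
```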
Graphia relies on ROLL for LLM reinforcement learning training. We have made some modifications to the original code to meet specific requirements.
Please place the rlvr component in the following path:
ROLL/roll/pipeline/rlvr
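A placement sketch, assuming the modified rlvr code sits at the root of this repository and that ROLL is cloned from upstream (check both the source path and the URL against your setup):

```bash
# Clone upstream ROLL, then replace its rlvr pipeline with the modified one.
git clone https://github.com/alibaba/ROLL.git
rm -rf ROLL/roll/pipeline/rlvr
cp -r rlvr ROLL/roll/pipeline/rlvr
```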
- Python 3.7+
- PyTorch 1.10+
- Other dependencies: see requirements.txt
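A minimal environment setup under these constraints (the env name `graphia` and Python 3.10 are our choices):

```bash
conda create -n graphia python=3.10 -y
conda activate graphia
pip install torch                 # pick the build matching your CUDA version
pip install -r requirements.txt   # remaining project dependencies
```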
Complete the data preprocessing by running the following scripts:

```bash
# Format graph dataset
bash scripts/prepare_dataset.sh

# Train activity predictor
bash scripts/train_dp.sh

# Train reward model GNN
bash scripts/train_gnn_tgn.sh

# Format LLM training data
bash scripts/prepare_prompt.sh
```

Script function descriptions:
- prepare_dataset.sh: Prepare and format the social network graph datasets
- train_dp.sh: Train the activity predictor for graph node representation learning
- train_gnn_tgn.sh: Train the reward model GNN
- prepare_prompt.sh: Generate prompts for large language model training

A minimal runner chaining these steps is sketched below.
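The runner assumes it is executed from the Graphia/ root:

```bash
#!/usr/bin/env bash
# Run the full data-preparation pipeline in order; stop on the first failure.
set -euo pipefail
bash scripts/prepare_dataset.sh   # format graph dataset
bash scripts/train_dp.sh          # train activity predictor
bash scripts/train_gnn_tgn.sh     # train reward model GNN
bash scripts/prepare_prompt.sh    # format LLM training data
```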
The following uses the weibo-tech dataset as an example, assuming the data preparation steps above have been completed. (The SFT data path below is shown for weibo_daily; substitute your dataset's directory accordingly.)
SFT training data location:
Graphia/prompt_data/weibo_daily/train/cold_start/combined_examples.jsonl
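A quick sanity check on the SFT file (assuming standard JSONL with one example per line):

```bash
wc -l Graphia/prompt_data/weibo_daily/train/cold_start/combined_examples.jsonl
head -n 1 Graphia/prompt_data/weibo_daily/train/cold_start/combined_examples.jsonl \
  | python -m json.tool   # pretty-print the first example to inspect its fields
```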
| Training Type | Configuration File Path |
|---|---|
| DST RL | ROLL/examples/rlvr_megatron_dst/rlvr_config_remote_all_dst_weibo_tech.yaml |
| Edge RL | ROLL/examples/rlvr_megatron_dst/rlvr_config_remote_all_easy_seq_weibo_tech.yaml |
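To train on a different dataset, one option is to copy a weibo-tech config and substitute the dataset name. The target file names and the assumption that the dataset appears as a plain `weibo_tech` string inside the YAML are ours; verify the edited keys by hand:

```bash
cd ROLL/examples/rlvr_megatron_dst
for cfg in rlvr_config_remote_all_dst_weibo_tech.yaml \
           rlvr_config_remote_all_easy_seq_weibo_tech.yaml; do
  new="${cfg/weibo_tech/weibo_daily}"         # hypothetical target config name
  cp "$cfg" "$new"
  sed -i 's/weibo_tech/weibo_daily/g' "$new"  # review the result before training
done
```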
Refer to the following for training commands:
ROLL/examples/rlvr_megatron_dst/local_run.sh

Post-processing of the generated graphs:
- TDGG processing: Graphia/scripts/postprocess_tdgg.sh
- IDGG processing: Graphia/scripts/postprocess_idgg.sh
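Both post-processing scripts can be run back to back; their outputs feed concat_reports.sh in the next step (check each script for its input/output locations):

```bash
bash Graphia/scripts/postprocess_tdgg.sh   # TDGG post-processing
bash Graphia/scripts/postprocess_idgg.sh   # IDGG post-processing
```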
After processing, first concatenate the reports from the different models, then run the evaluation:

```bash
# Concatenate reports
bash Graphia/scripts/concat_reports.sh
```

Evaluation scripts:
- TDGG evaluation: Graphia/eval_utils/eval_tdgg.py
- IDGG evaluation: Graphia/eval_utils/eval_idgg.py

Issues and pull requests to help improve the project are welcome.
Thanks to the following open-source projects and research teams for their support: