A tool for automating the translation and analysis of game texts using modern machine learning models. It supports multilingual translation, named entity recognition (NER), and training and adapting models for game-specific tasks and custom datasets.
This is an unofficial Star Citizen fansite, not affiliated with the Cloud Imperium group of companies. All content on this site not authored by its host or users is the property of their respective owners.
Note
Tested on: WSL Ubuntu 22.04, NVIDIA CUDA 12.8, 12GB 4070 GPU
Documentation: Russian
Translation module
Automates the translation of game texts between languages using fine-tuned machine learning models. It is used for localizing game resources and preserves the structure and special constructs of the source files, enabling fast, high-quality machine translation that is aware of the game context.
NER module (Named Entity Recognition)
Extracts named entities (e.g., names, organizations, game objects) from texts. This is important for automatic annotation and analysis, and for improving translation quality, for example by preventing proper names from being translated or by using them for additional model adaptation.
How modules are connected
NER can be used as an auxiliary step before translation: first, entities are extracted from the text, which can then be protected from translation or processed separately. This helps avoid errors when translating names, terms, and other important elements, and improves the final localization quality.
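As a rough illustration of this step, the sketch below masks known entities with placeholder tokens before translation and restores them afterwards. The helper names and the `translate()` call are hypothetical and not part of the project's API:

```python
def protect_entities(text: str, entities: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each known entity with a placeholder so the translator leaves it untouched."""
    mapping = {}
    for i, entity in enumerate(entities):
        placeholder = f"<ENT{i}>"
        mapping[placeholder] = entity
        text = text.replace(entity, placeholder)
    return text, mapping


def restore_entities(text: str, mapping: dict[str, str]) -> str:
    """Put the original entities back after translation."""
    for placeholder, entity in mapping.items():
        text = text.replace(placeholder, entity)
    return text


# Entities would come from the NER module; translate() stands in for the translation step.
masked, mapping = protect_entities("Visit Lorville with Crusader Industries", ["Lorville", "Crusader Industries"])
# translated = translate(masked)
# result = restore_entities(translated, mapping)
```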
Warning
Requires Python 3.10 and an NVIDIA GPU with CUDA 12.8 support. CUDA 12.8 is required for PyTorch compatibility.
To install CUDA on WSL Ubuntu 22.04, follow the instructions or run:

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt -y install cuda-toolkit-12-8
```
Install UV:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

or in Windows:

```powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
Clone the repository:
```bash
git clone https://github.com/sc-localization/VerseBridge.git
cd VerseBridge
```
Install the required dependencies:
```bash
uv sync
```
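Once the dependencies are installed, you can quickly confirm that PyTorch sees the GPU and the expected CUDA version. A minimal check, assuming PyTorch is installed into the project environment by `uv sync`:

```python
import torch

# Confirms that the CUDA build of PyTorch is installed and can see the GPU.
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```

Run it with `uv run python` (interactively) or save it to a file and execute it with `uv run python`.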
Important
It is highly recommended to use a fine-tuned model for translation, as it significantly improves translation quality. Fine-tuning lets the model pick up the game's context, nuances, style, and tone, resulting in more accurate and natural-sounding translations. To fine-tune the model, prepare a dataset of annotated examples and train the model on it; the more data you have, the better the model will perform.
To prepare data for training or translation, run the preprocessing pipeline from the project root:
```bash
uv run -m scripts.run_preprocess
```

Steps:

- Configure Paths:
  - Update `src/config/paths.py` with the paths to the original and translated `.ini` files (e.g., `global_original.ini`, `global_pre_translated.ini`).
  - Set `target_lang_code` in `src/config/language.py`.
- Place Files:
  - Copy the original file (`global_original.ini`) to `data/raw/en` and the translated file (`global_pre_translated.ini`) to `data/raw/{target_lang_code}`.
- Run Preprocessing:
  - Convert `.ini` to JSON. Output: `data/raw/training/{source_lang_code-target_lang_code}/data.json`.
  - Clean Data. Output: `data/raw/training/{source_lang_code-target_lang_code}/cleaned_data.json`. Removes duplicates, empty rows, and oversized tokenized text.
  - Split Data. Output: `data/raw/training/{source_lang_code-target_lang_code}/train.json` (80%) and `data/raw/training/{source_lang_code-target_lang_code}/valid.json` (20%).
Dataset Format:

```json
{
    "original": "This is a test sentence",
    "translated": "Это тестовое предложение"
}
```
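Before training, it can be useful to sanity-check the generated pairs. A minimal sketch, assuming `data.json` is a list of such records (the path shown is for a hypothetical en-ru run):

```python
import json
from pathlib import Path

# Hypothetical path for an en-ru run; adjust the language codes to your setup.
data_path = Path("data/raw/training/en-ru/data.json")

records = json.loads(data_path.read_text(encoding="utf-8"))
for record in records:
    assert record["original"].strip(), "empty source text"
    assert record["translated"].strip(), "empty target text"
print(f"{len(records)} translation pairs look well-formed")
```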
To fine-tune the model, run:

```bash
uv run -m scripts.run_training
```

Notes:
- Preprocessing runs automatically if training/test JSON files are missing.
- Metrics (BLEU, ChrF, METEOR, BERTScore) are computed during evaluation.
- Checkpoints and logs are saved in the configured directories.
- Early stopping is enabled with a patience of 5 and threshold of 0.001.
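The patience and threshold above correspond to the standard Hugging Face `EarlyStoppingCallback`; a sketch of the equivalent configuration, assuming the training script is built on `transformers.Trainer`:

```python
from transformers import EarlyStoppingCallback

# Stops training when the monitored metric fails to improve by more than
# 0.001 for 5 consecutive evaluations.
early_stopping = EarlyStoppingCallback(
    early_stopping_patience=5,
    early_stopping_threshold=0.001,
)
# trainer = Trainer(..., callbacks=[early_stopping])
```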
Monitor Training:
```bash
uv run tensorboard --logdir logs/
```

Configuration:
- Update `src/config/training.py` for training parameters (e.g., epochs, batch size).
- Ensure `data/train.json` and `data/valid.json` are ready.
- Model checkpoints and results are saved in `models/`.
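For illustration only, the kind of training parameters typically adjusted in such a config might look like the hypothetical sketch below; the actual field names live in `src/config/training.py` and may differ:

```python
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    # Hypothetical example values; check src/config/training.py for the real fields.
    num_train_epochs: int = 3
    per_device_train_batch_size: int = 8
    learning_rate: float = 2e-5
    eval_steps: int = 100
```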
To translate .ini files, run:
```bash
uv run -m scripts.run_translation
```

Notes:
- INI files are processed with protected patterns preserved (e.g., placeholders, newlines).
- Long texts are split to respect model token limits.
- Logs are saved in the configured logging directory.
- If no --input_path is provided, the script processes all INI files in the source directory.
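The token-limit splitting mentioned in the notes can be pictured roughly as follows; this is a simplified sketch with a hypothetical helper, not the project's actual implementation:

```python
def split_by_token_limit(text: str, tokenizer, max_tokens: int = 512) -> list[str]:
    """Split text into chunks whose token counts stay under the model limit.

    `tokenizer` is assumed to be a Hugging Face tokenizer; splitting happens on
    sentence-like boundaries so each chunk remains translatable on its own.
    """
    chunks, current = [], ""
    for sentence in text.split(". "):
        candidate = f"{current}. {sentence}" if current else sentence
        if current and len(tokenizer.encode(candidate)) > max_tokens:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```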
Output:
- Translated files are saved in `data/translated/<lang_code>/` (e.g., `data/translated/ru/global_original.ini`).
Configuration:
- Set `target_lang_code` and `translate_dest_dir` in `src/config/translation.py`.
Run preprocessing and entity extraction from the source texts:
```bash
uv run -m scripts.run_ner --stage preprocess
```

This will create files for annotation and training in the `data/ner/` folder:

- `ner_unannotated.json` — unannotated data for manual labeling
- `dataset_bio.json` — data in BIO format for training
For manual review and correction of annotations, use the web interface (Streamlit):
```bash
uv run -m scripts.run_ner --stage review
```

After review, the file `dataset_corrected.json` will be created.
To train the model on annotated data, run:
```bash
uv run -m scripts.run_ner --stage train
```

The model will be trained on the `train.json` and `test.json` files (created automatically from the annotated dataset).
To apply the trained model to new data:
```bash
uv run -m scripts.run_ner --stage extract
```

Data format

Input and output files for NER are located in `data/ner/`:

- `dataset_bio.json` — data in BIO format for training
- `test.json`, `train.json` — split datasets
- `ner_unannotated.json` — unannotated data for annotation

Example data structure:
```json
{
    "tokens": ["This", "is", "VerseBridge"],
    "labels": ["O", "O", "B-ORG"]
}
```
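A quick way to check that annotated records are consistent with the BIO scheme; a minimal sketch, assuming `dataset_bio.json` is a list of records shaped like the example above:

```python
import json
from pathlib import Path

records = json.loads(Path("data/ner/dataset_bio.json").read_text(encoding="utf-8"))

for record in records:
    assert len(record["tokens"]) == len(record["labels"]), "tokens/labels length mismatch"
    for i, label in enumerate(record["labels"]):
        # An I- tag must continue an entity of the same type opened by B- or I-.
        if label.startswith("I-"):
            prev = record["labels"][i - 1] if i > 0 else "O"
            assert prev.endswith(label[2:]), f"orphan {label} at position {i}"
print(f"{len(records)} records pass the BIO consistency check")
```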
Configuration

- Main parameters and paths are set in `src/config/ner.py` and `src/config/paths.py`.
- The web interface for annotation review uses Streamlit.
Notes
- Preprocessing is required for correct operation (`--stage preprocess`).
- All logs are saved in the `logs/` folder.
Note: use the `--help` flag to see the available arguments.
- Train with LoRA: `uv run -m scripts.run_training --with-lora`
- Resume training from a checkpoint (if a checkpoint exists): `uv run -m scripts.run_training --with-lora --model-path models/lora/checkpoints/checkpoints-100`
- Train without LoRA using a base model: `uv run -m scripts.run_training`
- Translate all INI files in the source directory: `uv run -m scripts.run_translation --src-lang en --tgt-lang ru --translated-file-name translated.ini`
- Translate an INI file from a custom directory: `uv run -m scripts.run_translation --input-file data/raw/global_original_test.ini`
- Use a fine-tuned model for translation: `uv run -m scripts.run_translation --model-path models/base_model/result`
- Translation with default settings: `uv run -m scripts.run_translation`
- Preprocess and extract entities: `uv run -m scripts.run_ner --stage preprocess`
- Manual annotation and review (Streamlit): `uv run -m scripts.run_ner --stage review`
- Train NER model: `uv run -m scripts.run_ner --stage train`
- Extract entities from new texts: `uv run -m scripts.run_ner --stage extract`

If you would like to contribute to the project, please read CONTRIBUTING.
- Code: MIT License (LICENSE_CODE)
Special permission is granted to Cloud Imperium Games for unrestricted use. See SPECIAL_PERMISSION.
