VerseBridge

A tool for automating translation and analysis of game texts using modern machine learning models. Supports multilingual translation, named entity recognition (NER), as well as training and adaptation of models for game tasks and custom datasets.

This is an unofficial Star Citizen fansite, not affiliated with the Cloud Imperium group of companies. All content on this site not authored by its host or users are property of their respective owners.

Note

Tested on: WSL Ubuntu 22.04, NVIDIA CUDA 12.8, 12GB 4070 GPU

Documentation: Russian (Русский)

Translation module

Automates the translation of game texts between languages using fine-tuned machine learning models. It is used to localize game resources and preserves the structure and special constructs of the source files. The result is fast, high-quality machine translation that is aware of game context.

NER module (Named Entity Recognition)

Extracts named entities (e.g., names, organizations, game objects) from texts. This supports automatic annotation and analysis, and it improves translation quality, for example by keeping proper names untranslated or by using the extracted entities for additional model adaptation.

How modules are connected

NER can be used as an auxiliary step before translation: first, entities are extracted from the text, which can then be protected from translation or processed separately. This helps avoid errors when translating names, terms, and other important elements, and improves the final localization quality.
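The idea of protecting entities before translation can be illustrated with a minimal sketch. It is not the project's implementation: the function names `protect_entities` and `restore_entities` and the `__ENTn__` placeholder format are assumptions made for this example, and the entity list stands in for the output of the NER step.

```python
import re

def protect_entities(text, entities):
    """Replace each known entity with a numbered placeholder (hypothetical format)."""
    mapping = {}
    # Longest entities first, so "Crusader Industries" wins over "Crusader".
    ordered = sorted(set(entities), key=lambda e: (-len(e), e))
    for i, entity in enumerate(ordered):
        placeholder = f"__ENT{i}__"
        pattern = re.compile(re.escape(entity))
        if pattern.search(text):
            text = pattern.sub(placeholder, text)
            mapping[placeholder] = entity
    return text, mapping

def restore_entities(text, mapping):
    """Put the original entities back after translation."""
    for placeholder, entity in mapping.items():
        text = text.replace(placeholder, entity)
    return text

text = "Deliver the cargo to Crusader Industries near Port Olisar."
masked, mapping = protect_entities(
    text, ["Crusader", "Crusader Industries", "Port Olisar"]
)
# masked == "Deliver the cargo to __ENT0__ near __ENT1__."
# A translation model would process `masked` here; the placeholders pass through.
restored = restore_entities(masked, mapping)
```

Placeholders only help if the translation model leaves them intact, so in practice the placeholder format would need to be checked against the model being used.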

Installation

Warning

Requires Python 3.10 and an NVIDIA GPU with CUDA 12.8 support. CUDA 12.8 is required for PyTorch compatibility.

To install CUDA on WSL Ubuntu 22.04, follow NVIDIA's installation instructions or run:

  wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
  sudo dpkg -i cuda-keyring_1.1-1_all.deb
  sudo apt update
  sudo apt -y install cuda-toolkit-12-8

Install uv:

curl -LsSf https://astral.sh/uv/install.sh | sh

or on Windows:

powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Clone the repository:

git clone https://github.com/sc-localization/VerseBridge.git
cd VerseBridge

Install the required dependencies:
```
uv sync
```

Usage

Important

It is highly recommended to use a fine-tuned model for translation. Fine-tuning significantly improves translation quality because the model learns the context, terminology, style, and tone of the game, which results in more accurate and natural-sounding translations. To fine-tune the model, prepare a dataset of annotated examples and train the model on it. In general, the more data you have, the better the model will perform.

Translation Pipeline

1. Preprocess Data

To prepare data for training or translation, run the preprocessing pipeline from the project root:

uv run -m scripts.run_preprocess

Steps:

Configure Paths:

Update src/config/paths.py with paths to original and translated .ini files (e.g., global_original.ini, global_pre_translated.ini).
Set target_lang_code in src/config/language.py.

Place Files:

Copy the original .ini file (global_original) to data/raw/en and the translated .ini file (global_pre_translated) to data/raw/{target_lang_code}.

Run Preprocessing:

Convert .ini to JSON. Output: data/raw/training/{source_lang_code-target_lang_code}/data.json.
Clean Data. Removes duplicates, empty rows, and texts that are too long after tokenization. Output: data/raw/training/{source_lang_code-target_lang_code}/cleaned_data.json.
Split Data. Output: data/raw/training/{source_lang_code-target_lang_code}/train.json (80%) and data/raw/training/{source_lang_code-target_lang_code}/valid.json (20%).
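The conversion step can be pictured with the following sketch. It assumes the .ini files consist of simple key=value lines and that original and translated entries are matched by key; both are assumptions, and the helpers `parse_ini_lines` and `build_pairs` are hypothetical names, not the project's API.

```python
import json

def parse_ini_lines(lines):
    """Parse 'key=value' lines into a dict, skipping blank and comment lines."""
    entries = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.lstrip().startswith((";", "#")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            entries[key.strip()] = value
    return entries

def build_pairs(original_lines, translated_lines):
    """Pair original and translated values that share the same key."""
    original = parse_ini_lines(original_lines)
    translated = parse_ini_lines(translated_lines)
    return [
        {"original": text, "translated": translated[key]}
        for key, text in original.items()
        if key in translated
    ]

# Made-up sample lines standing in for the two .ini files.
original_lines = ["item_Name=Laser Cannon\n", "; a comment\n", "only_en=Hello\n"]
translated_lines = ["item_Name=Лазерная пушка\n"]
pairs = build_pairs(original_lines, translated_lines)
print(json.dumps(pairs, ensure_ascii=False, indent=2))
```

Keys that exist in only one of the two files are dropped here, since they cannot form a training pair.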

Dataset Format:

{
  "original": "This is a test sentence",
  "translated": "Это тестовое предложение"
}
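Given records in this format, the clean and split steps might look roughly like the sketch below. The whitespace token count is only a stand-in for the model tokenizer, and `max_tokens`, `seed`, and both function names are assumptions chosen for illustration.

```python
import random

def clean_records(records, max_tokens=128):
    """Drop empty rows, exact duplicates, and oversized texts."""
    seen = set()
    cleaned = []
    for rec in records:
        original = rec.get("original", "").strip()
        translated = rec.get("translated", "").strip()
        if not original or not translated:
            continue
        # Whitespace split approximates tokenization; a real pipeline
        # would count tokens with the model's own tokenizer.
        if max(len(original.split()), len(translated.split())) > max_tokens:
            continue
        key = (original, translated)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"original": original, "translated": translated})
    return cleaned

def split_records(records, train_ratio=0.8, seed=42):
    """Shuffle deterministically and split into train and validation parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

records = [{"original": f"Sentence {i}", "translated": f"Предложение {i}"} for i in range(10)]
records.append(records[0])                             # duplicate
records.append({"original": "", "translated": "x"})    # empty row
train, valid = split_records(clean_records(records))
print(len(train), len(valid))
```

A fixed seed keeps the split reproducible between runs, which makes metric comparisons across training runs more meaningful.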

2. Train Model

To fine-tune the model, run:

uv run -m scripts.run_training

Notes:

Preprocessing runs automatically if training/test JSON files are missing.
Metrics (BLEU, ChrF, METEOR, BERTScore) are computed during evaluation.
Checkpoints and logs are saved in the configured directories.
Early stopping is enabled with a patience of 5 and threshold of 0.001.
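To make the early stopping note concrete, here is a small sketch of how a patience of 5 and a threshold of 0.001 typically interact. It assumes the monitored metric is a loss where lower is better; the `EarlyStopping` class is written for this example, and the project may well rely on a library callback instead.

```python
class EarlyStopping:
    """Stop when the metric has not improved by `threshold` for `patience` evaluations."""

    def __init__(self, patience=5, threshold=0.001):
        self.patience = patience
        self.threshold = threshold
        self.best = None
        self.bad_evals = 0

    def step(self, loss):
        """Record one evaluation result; return True when training should stop."""
        if self.best is None or loss < self.best - self.threshold:
            # Improvement larger than the threshold: reset the counter.
            self.best = loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopping(patience=5, threshold=0.001)
losses = [1.0, 0.8, 0.7995, 0.7999, 0.7998, 0.7996, 0.7997]
for evaluation, loss in enumerate(losses, start=1):
    if stopper.step(loss):
        print(f"stopping after evaluation {evaluation}")
        break
```

In this run the loss stalls just below 0.8, so improvements smaller than 0.001 do not count and training stops after five such evaluations in a row.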

Monitor Training:

uv run tensorboard --logdir logs/

Configuration:

Update src/config/training.py for training parameters (e.g., epochs, batch size).
Ensure data/train.json and data/valid.json are ready.
Model checkpoints and results are saved in models/.

3. Translate Files

To translate .ini files, run:

uv run -m scripts.run_translation

Notes:

INI files are processed with protected patterns preserved (e.g., placeholders, newlines).
Long texts are split to respect model token limits.
Logs are saved in the configured logging directory.
If no --input-file is provided, the script processes all INI files in the source directory.
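Splitting long texts to respect a token limit can be sketched as greedy sentence packing. This is an illustration under assumptions: the sentence boundary regex, the whitespace token counter, and the name `split_long_text` are all hypothetical, and a real pipeline would count tokens with the model's tokenizer.

```python
import re

def split_long_text(text, max_tokens=50, count_tokens=lambda s: len(s.split())):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    A single sentence longer than max_tokens is kept as its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the limit: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = "One two three. Four five six. Seven eight nine. Ten."
chunks = split_long_text(text, max_tokens=6)
# chunks == ["One two three. Four five six.", "Seven eight nine. Ten."]
```

Each chunk can then be translated separately and the results joined in order, which keeps sentence boundaries intact.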

Output:

Translated files are saved in data/translated/<lang_code>/ (e.g., data/translated/ru/global_original.ini).

Configuration:

Set target_lang_code and translate_dest_dir in src/config/translation.py.

Named Entity Recognition (NER) Pipeline

1. Data extraction and preparation

Run preprocessing and entity extraction from the source texts:

uv run -m scripts.run_ner --stage preprocess

This will create files for annotation and training in the data/ner/ folder:

ner_unannotated.json — unannotated data for manual labeling
dataset_bio.json — data in BIO format for training
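For readers unfamiliar with BIO tagging, the sketch below shows how token sequences could be labeled from a dictionary of known entity phrases. The function `to_bio`, the phrase dictionary, and the label names are assumptions for illustration; how the project actually produces dataset_bio.json is not described here.

```python
def to_bio(tokens, entities):
    """Label tokens in BIO format.

    `entities` maps a tuple of tokens (a phrase) to a label such as "ORG".
    """
    labels = ["O"] * len(tokens)
    # Longer phrases first, so they are not shadowed by shorter ones.
    for phrase, label in sorted(entities.items(), key=lambda kv: -len(kv[0])):
        n = len(phrase)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in labels[i:i + n])
            if tuple(tokens[i:i + n]) == phrase and window_free:
                labels[i] = f"B-{label}"      # first token of the entity
                for j in range(i + 1, i + n):
                    labels[j] = f"I-{label}"  # continuation tokens
    return labels

tokens = ["Meet", "at", "Port", "Olisar", "near", "Crusader"]
entities = {("Port", "Olisar"): "LOC", ("Crusader",): "ORG"}
labels = to_bio(tokens, entities)
# labels == ["O", "O", "B-LOC", "I-LOC", "O", "B-ORG"]
```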

2. Manual annotation and review

For manual review and correction of annotations, use the web interface (Streamlit):

uv run -m scripts.run_ner --stage review

After review, the file dataset_corrected.json will be created.

3. Training the NER model

To train the model on annotated data, run:

uv run -m scripts.run_ner --stage train

The model will be trained on train.json and test.json files (created automatically from the annotated dataset).

4. Entity extraction from new texts

To apply the trained model to new data:

uv run -m scripts.run_ner --stage extract

Data format

Input and output files for NER are located in data/ner/:

dataset_bio.json - data in BIO format for training
test.json, train.json - split datasets
ner_unannotated.json - unannotated data for annotation

Example data structure:

{
  "tokens": ["This", "is", "VerseBridge"],
  "labels": ["O", "O", "B-ORG"]
}
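A record in this structure can be turned back into entity strings by walking the labels. The helper below is a minimal sketch, not part of the project: `bio_to_entities` is a hypothetical name, and stray I- tags without a preceding B- tag are simply treated as outside any entity.

```python
def bio_to_entities(tokens, labels):
    """Collect (entity text, entity type) pairs from BIO-labeled tokens."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity currently being built, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            flush()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_tokens and label[2:] == current_type:
            current_tokens.append(token)
        else:
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities

record = {
    "tokens": ["This", "is", "VerseBridge"],
    "labels": ["O", "O", "B-ORG"],
}
print(bio_to_entities(record["tokens"], record["labels"]))
# [('VerseBridge', 'ORG')]
```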

Configuration

Main parameters and paths are set in src/config/ner.py and src/config/paths.py.
The web interface for annotation review uses Streamlit.

Notes

Preprocessing is required for correct operation (--stage preprocess).
All logs are saved in the logs/ folder.

CLI Usage Examples

Note: use the --help flag to list the available arguments.

Translation CLI

Train with LoRA:

uv run -m scripts.run_training --with-lora

Resume training from a checkpoint (if a checkpoint exists):

uv run -m scripts.run_training --with-lora --model-path models/lora/checkpoints/checkpoints-100

Train without LoRA using a base model:

uv run -m scripts.run_training

Translate all INI files in the source directory:

uv run -m scripts.run_translation --src-lang en --tgt-lang ru --translated-file-name translated.ini

Translate an INI file from a custom directory:

uv run -m scripts.run_translation --input-file data/raw/global_original_test.ini

Use a fine-tuned model for translation:

uv run -m scripts.run_translation --model-path models/base_model/result

Translate with default settings:

uv run -m scripts.run_translation
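As a rough picture of how the flags in these examples could be declared, here is an argparse sketch. It only mirrors the flag names shown above; the help texts and the absence of defaults are assumptions, and the real script may define its arguments differently (run it with --help to see the actual list).

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the flags used in the examples above."""
    parser = argparse.ArgumentParser(prog="run_translation")
    parser.add_argument("--src-lang", help="source language code, e.g. en")
    parser.add_argument("--tgt-lang", help="target language code, e.g. ru")
    parser.add_argument("--translated-file-name", help="name of the output file")
    parser.add_argument("--input-file", help="single INI file to translate")
    parser.add_argument("--model-path", help="path to a fine-tuned model")
    return parser

# argparse maps "--src-lang" to the attribute "src_lang" (hyphens become underscores).
args = build_parser().parse_args(["--src-lang", "en", "--tgt-lang", "ru"])
print(args.src_lang, args.tgt_lang, args.input_file)
```

Flags that are not passed stay None in this sketch, which is one simple way a script can detect that it should fall back to its configured defaults.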

NER CLI

Preprocess and extract entities:

uv run -m scripts.run_ner --stage preprocess

Manual annotation and review (Streamlit):

uv run -m scripts.run_ner --stage review

Train NER model:

uv run -m scripts.run_ner --stage train

Extract entities from new texts:

uv run -m scripts.run_ner --stage extract

Contribution

If you would like to contribute to the project, please read CONTRIBUTING.

License

Code: MIT License (LICENSE_CODE)

Special permission is granted to Cloud Imperium Games for unrestricted use. See SPECIAL_PERMISSION.
