Hungarian chat data

This project downloads the Open Assistant chat dataset and translates it to Hungarian.

Basic info

Only the English part of the dataset is translated. The project uses Gemini for translation and instructs it to carefully preserve the original meaning and context. The code expects GOOGLE_AI_API_KEY to be present in a .env file in the root of the project folder.
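As an illustration of the .env requirement, here is a minimal standard-library sketch of reading the key. The helper names `read_env_file` and `get_api_key` are hypothetical and not taken from this repo; the actual scripts may well use a package such as python-dotenv instead.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(env_path=".env"):
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("GOOGLE_AI_API_KEY")
    if not key and Path(env_path).exists():
        key = read_env_file(env_path).get("GOOGLE_AI_API_KEY")
    if not key:
        raise RuntimeError("GOOGLE_AI_API_KEY not found in environment or .env")
    return key
```

The .env file itself would then contain a single line of the form GOOGLE_AI_API_KEY=your-key-here.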

The program allows 5 concurrent requests, to stay below the rate limit while being as fast as possible. The code is nothing special, ChatGPT could probably have written it in a few shots; customize it if you need anything else.
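One common way to cap concurrency at 5 is an asyncio semaphore. The sketch below is an assumption about how such a limit can be built, not a copy of translate_data.py; `fake_translate` is a hypothetical stand-in for the real API call.

```python
import asyncio

MAX_CONCURRENT = 5  # the limit described above


async def fake_translate(text):
    """Hypothetical stand-in for a real translation request."""
    await asyncio.sleep(0.01)
    return text.upper()


async def translate_all(texts, limit=MAX_CONCURRENT):
    """Run all requests, with at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0  # highest number of simultaneous requests observed

    async def worker(text):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_translate(text)
            finally:
                in_flight -= 1

    # gather preserves input order in its result list.
    results = await asyncio.gather(*(worker(t) for t in texts))
    return results, peak
```

Calling `asyncio.run(translate_all(texts))` returns the translations in input order together with the peak concurrency, which never exceeds the limit.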

The program turns off all available safeguards so that Gemini won't censor the responses. Still, 6 messages were blocked for me (for "other" reasons). I translated those manually with Google Translate.
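For reference, disabling the safeguards usually means passing a safety-settings list with every category set to the most permissive threshold. The sketch below only builds that list as plain data. The category and threshold strings are the names I believe the Gemini API uses, but treat them as an assumption and check them against the SDK version you have installed; `build_safety_settings` is a hypothetical helper, not a function from this repo.

```python
# Assumed Gemini harm category names; verify against your SDK version.
HARM_CATEGORIES = [
    "HARM_CATEGORY_HARASSMENT",
    "HARM_CATEGORY_HATE_SPEECH",
    "HARM_CATEGORY_SEXUALLY_EXPLICIT",
    "HARM_CATEGORY_DANGEROUS_CONTENT",
]


def build_safety_settings(threshold="BLOCK_NONE"):
    """Return one category/threshold entry per harm category."""
    return [{"category": c, "threshold": threshold} for c in HARM_CATEGORIES]


safety_settings = build_safety_settings()
```

Even with every category set this way, a response can still come back blocked, as happened with the 6 messages mentioned above, so the calling code should treat a blocked response as a failed request rather than assume text is always returned.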

Download the dataset from Hugging Face

View the Hugging Face repo or download it in Python:

import datasets

ds = datasets.load_dataset('jazzysnake01/oasst1-en-hun-gemini')

Usage

The English part (~41k chat messages, 46% of the total dataset) takes around 20 hours to translate.
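The figures above imply a rough throughput, which helps when estimating how long a different subset would take. This is back-of-the-envelope arithmetic only, using the rounded numbers from this README; the function names are hypothetical.

```python
def messages_per_minute(total_messages, hours):
    """Average throughput over the whole run."""
    return total_messages / (hours * 60)


def estimated_hours(message_count, rate_per_minute):
    """Time needed for `message_count` messages at a given rate."""
    return message_count / rate_per_minute / 60


rate = messages_per_minute(41_000, 20)
print(f"{rate:.1f} messages/minute")  # prints "34.2 messages/minute"
```

Actual speed depends on message length, the timeout you choose, and how many requests fail, so treat the result as an order-of-magnitude estimate.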

Installation:

pip install -r requirements.txt

Usage:

python acquire_data.py && python translate_data.py --timeout 15

Once started, the code may stop if 5 requests fail in a row. In that case you can continue the translation from where you left off with:

python translate_data.py --continue
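The stop-and-continue behaviour can be pictured as a loop that counts consecutive failures and saves its position to disk. The following is a simplified, sequential sketch under that assumption; the checkpoint layout and the name `run_with_checkpoint` are made up for illustration and do not describe the real script's file format.

```python
import json
from pathlib import Path

MAX_CONSECUTIVE_FAILURES = 5


def run_with_checkpoint(messages, translate, checkpoint_path):
    """Translate messages in order, saving progress after each one.

    Returns True if the end of the list was reached, False if the run
    stopped early after too many failures in a row.
    """
    path = Path(checkpoint_path)
    if path.exists():
        state = json.loads(path.read_text())  # resume from a previous run
    else:
        state = {"next": 0, "done": {}, "failed": []}

    consecutive = 0
    for index in range(state["next"], len(messages)):
        try:
            state["done"][str(index)] = translate(messages[index])
            consecutive = 0  # any success resets the streak
        except Exception:
            state["failed"].append(index)
            consecutive += 1
        state["next"] = index + 1
        path.write_text(json.dumps(state))
        if consecutive >= MAX_CONSECUTIVE_FAILURES:
            return False
    return True
```

Calling the function again with the same checkpoint path picks up at the saved position, which is the idea behind a continue flag. Writing the checkpoint after every message keeps the sketch simple; a real run over tens of thousands of messages would probably save less often.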

If (when) the translation finishes with failed requests, the following command can be used to patch them up. For me, 1.2k failed requests remained, but that is because I set a 15-second timeout so that long messages don't hold up the 4 others (higher throughput). When patching, whatever timeout you pass with the --timeout flag is doubled.

python translate_data.py --patch-failed
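Retrying with a doubled timeout can be sketched with `asyncio.wait_for`. This is an illustrative assumption about the mechanism, not the repo's code: `patch_failed` and its arguments are hypothetical, and the retries here run one at a time for clarity.

```python
import asyncio


async def patch_failed(failed, translate, timeout):
    """Retry failed items with twice the original timeout.

    `failed` maps an item id to its source text and `translate` is an
    async callable. Returns (patched, still_failed).
    """
    patched, still_failed = {}, {}
    for item_id, text in failed.items():
        try:
            # Long messages get twice as long to finish as in the main run.
            patched[item_id] = await asyncio.wait_for(translate(text), timeout * 2)
        except Exception:  # includes the timeout error raised by wait_for
            still_failed[item_id] = text
    return patched, still_failed
```

Items that still fail after the longer timeout are returned separately, so they can be retried again or translated by hand.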
