I want to have an LLM-powered "CV reviewer" as part of my CVAI project.
I have access to a Telegram chat where engineers review the CVs of other engineers.
The idea is to use the reviews from this chat for fine-tuning an LLM.
This repository contains the Python scripts needed to create the fine-tuning dataset and submit all the jobs.
- create-message-chains.py transforms the unstructured data into the ProcessedMessages type. It builds message chains and stores them; the chains are later translated into English and then converted into the fine-tuning dataset format.
- transform_chat_data.py takes the output of create-message-chains.py and uses gpt-4o-mini to translate the text into English and add a YAML representation of the PDF documents, so we can fine-tune both on the raw documents and on their transcriptions.
- build-finetune-dataset creates a fine-tuning dataset from the data in the database
- submit-finetune-job uploads a file with a fine-tuning dataset and submits a job
- save_out_dir_to_db fetches the output of batch jobs and saves it to the database
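As a rough sketch of the chain-building step in create-message-chains.py (the ProcessedMessages shape and the chain-walking logic here are assumptions; reply_to_message_id follows Telegram's JSON export naming):

```python
from dataclasses import dataclass, field

@dataclass
class ProcessedMessages:
    # Hypothetical shape of the type create-message-chains.py produces;
    # the real fields may differ.
    doc_id: str
    messages: list[str] = field(default_factory=list)

def build_chains(raw_messages: list[dict]) -> list[list[dict]]:
    """Group raw Telegram-export messages into reply chains.

    A chain starts at a message with no resolvable reply_to_message_id
    and follows the replies down the thread.
    """
    by_id = {m["id"]: m for m in raw_messages}
    children: dict[int, list[dict]] = {}
    roots: list[dict] = []
    for m in raw_messages:
        parent = m.get("reply_to_message_id")
        if parent in by_id:
            children.setdefault(parent, []).append(m)
        else:
            roots.append(m)

    def walk(m: dict) -> list[dict]:
        chain = [m]
        for child in children.get(m["id"], []):
            chain.extend(walk(child))
        return chain

    return [walk(r) for r in roots]

# Example: one root message with two replies, plus an unrelated message.
raw = [
    {"id": 1, "text": "CV attached"},
    {"id": 2, "text": "Too long", "reply_to_message_id": 1},
    {"id": 3, "text": "Agree", "reply_to_message_id": 2},
    {"id": 4, "text": "off-topic"},
]
chains = build_chains(raw)
```

This groups the flat export into two chains: the three-message review thread and the standalone message.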
- ./data - JSON data and chat documents
- ./data/raw_chat_data/ - raw JSON exports of messages from the Telegram chat
- ./data/json_files/ - datasets in the format "doc_id" -> messages + filepath. The messages are in Russian and need to be translated. Each file is a complete dataset; the files differ in the number of entries and the pre-processing applied, but share the same format.
- ./data/files/ - PDFs. Ideally, this directory should be deleted and a database dump should suffice.
- ./data/batches/ - batches ready to be submitted for Russian-to-English translation
- ./data/fine_tune_data/ - translated text suitable for fine-tuning
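To illustrate what one record in ./data/fine_tune_data/ might look like, here is a minimal sketch that turns a translated review thread plus a YAML CV transcription into one JSONL training example. It assumes the target is OpenAI's chat fine-tuning format (a "messages" list per line); the field names and system prompt are illustrative, not the project's actual ones:

```python
import json

def to_finetune_record(doc_yaml: str, review_messages: list[str]) -> dict:
    """Build one chat-format fine-tuning example.

    The user turn carries the YAML transcription of the CV; the joined
    reply chain becomes the assistant's target review.
    """
    return {
        "messages": [
            {"role": "system", "content": "You are an experienced engineer reviewing CVs."},
            {"role": "user", "content": f"Review this CV:\n{doc_yaml}"},
            {"role": "assistant", "content": "\n".join(review_messages)},
        ]
    }

record = to_finetune_record(
    "name: Jane Doe\nexperience: 5 years",
    ["Too generic.", "Add metrics to each role."],
)
line = json.dumps(record)  # one line of the JSONL training file
```

Each reviewed CV becomes one such line; build-finetune-dataset would emit the whole file before submit-finetune-job uploads it.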