I want to have an LLM-powered "CV reviewer" as part of my CVAI project.
I have access to a Telegram chat where engineers review the CVs of other engineers.
The idea is to use the reviews from this chat for fine-tuning an LLM.
This repository contains the Python scripts needed to create the fine-tuning dataset and submit all the jobs.
- create-message-chains.py transforms the unstructured data into the ProcessedMessages type. It builds message chains and stores them; the chains are later translated into English and then converted into the fine-tuning dataset format.
- transform_chat_data.py takes the output of create-message-chains.py and uses gpt-4o-mini to translate the text into English and add a YAML representation of the PDF documents, so we can fine-tune both on the raw documents and on their transcriptions.
- build-finetune-dataset creates a fine-tuning dataset from the data in the database
- submit-finetune-job uploads a file with a fine-tuning dataset and submits a job
- save_out_dir_to_db fetches the output of batch jobs and saves it to the database
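As a rough sketch of the chain-building step in create-message-chains.py (the ProcessedMessages shape and the chain-walking logic here are assumptions; reply_to_message_id follows Telegram's JSON export naming):

```python
from dataclasses import dataclass, field

@dataclass
class ProcessedMessages:
    # Hypothetical shape of the type create-message-chains.py produces;
    # the real fields may differ.
    doc_id: str
    messages: list[str] = field(default_factory=list)

def build_chains(raw_messages: list[dict]) -> list[list[dict]]:
    """Group raw Telegram-export messages into reply chains.

    A chain starts at a message with no resolvable reply_to_message_id
    and follows the replies down the thread.
    """
    by_id = {m["id"]: m for m in raw_messages}
    children: dict[int, list[dict]] = {}
    roots: list[dict] = []
    for m in raw_messages:
        parent = m.get("reply_to_message_id")
        if parent in by_id:
            children.setdefault(parent, []).append(m)
        else:
            roots.append(m)

    def walk(m: dict) -> list[dict]:
        chain = [m]
        for child in children.get(m["id"], []):
            chain.extend(walk(child))
        return chain

    return [walk(r) for r in roots]

# Example: one root message with two replies, plus an unrelated message.
raw = [
    {"id": 1, "text": "CV attached"},
    {"id": 2, "text": "Too long", "reply_to_message_id": 1},
    {"id": 3, "text": "Agree", "reply_to_message_id": 2},
    {"id": 4, "text": "off-topic"},
]
chains = build_chains(raw)
```

This groups the flat export into two chains: the three-message review thread and the standalone message.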
- ./data - JSON data and chat documents
- ./data/raw_chat_data/ - raw JSON exports of messages from the Telegram chat
- ./data/json_files/ - datasets in the format "doc_id" -> messages + filepath. The messages are in Russian and need to be translated. Each file is a complete dataset; the files differ in the number of entries and the pre-processing applied, but share the same format.
- ./data/files/ - PDFs. Ideally, this directory should be deleted and a database dump should suffice.
- ./data/batches/ - batches ready to be submitted for Russian-to-English translation
- ./data/fine_tune_data/ - translated text suitable for fine-tuning
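To illustrate what one record in ./data/fine_tune_data/ might look like, here is a minimal sketch that turns a translated review thread plus a YAML CV transcription into one JSONL training example. It assumes the target is OpenAI's chat fine-tuning format (a "messages" list per line); the field names and system prompt are illustrative, not the project's actual ones:

```python
import json

def to_finetune_record(doc_yaml: str, review_messages: list[str]) -> dict:
    """Build one chat-format fine-tuning example.

    The user turn carries the YAML transcription of the CV; the joined
    reply chain becomes the assistant's target review.
    """
    return {
        "messages": [
            {"role": "system", "content": "You are an experienced engineer reviewing CVs."},
            {"role": "user", "content": f"Review this CV:\n{doc_yaml}"},
            {"role": "assistant", "content": "\n".join(review_messages)},
        ]
    }

record = to_finetune_record(
    "name: Jane Doe\nexperience: 5 years",
    ["Too generic.", "Add metrics to each role."],
)
line = json.dumps(record)  # one line of the JSONL training file
```

Each reviewed CV becomes one such line; build-finetune-dataset would emit the whole file before submit-finetune-job uploads it.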