FineWebSFT

This is our project for the NexaAI Hackathon.
By combining the GenQA technique described in this paper with an LLM agent powered by Groq and LangChain, we generate up-to-date instruction data for fine-tuning, based on the user's input query.
The base dataset for generation is fineweb-edu. Since this is only a demo, we build the vector database using JinaEmbedding with only the top 10k rows from the fineweb-edu 2024-10 snapshot.
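To illustrate what retrieving texts from the vector database means, here is a minimal sketch of top-k retrieval by cosine similarity. It is not the code in this repo: the function names `cosine_similarity` and `top_k` are hypothetical, and the short hand-written vectors stand in for real Jina embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, rows, k):
    """Return the k texts whose vectors are most similar to query_vec.

    rows is a list of (text, vector) pairs; in a real setup the vectors
    would come from an embedding model rather than being written by hand.
    """
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vec, row[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy data: three "documents" with 2-dimensional stand-in embeddings.
rows = [("doc_a", [1.0, 0.0]), ("doc_b", [0.0, 1.0]), ("doc_c", [1.0, 1.0])]
print(top_k([1.0, 0.1], rows, k=2))
```

A dedicated vector store replaces the linear scan with an index, but the ranking idea is the same.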

Create your keys.json file following this format:

{
    "GROQ_API_KEY": "",
    "JINA_API_KEY": "",
    "GOOGLE_CSE_ID": "",
    "GOOGLE_API_KEY": "",
    "HUGGINGFACE_TOKEN": ""
}
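Since every entry in the template starts out empty, it can help to check the file before running anything. The following is a small sketch, assuming keys.json sits in the working directory; `missing_keys` and `load_keys` are hypothetical helpers, not functions from this repo.

```python
import json
from pathlib import Path

# The five entries listed in the keys.json template.
REQUIRED_KEYS = (
    "GROQ_API_KEY",
    "JINA_API_KEY",
    "GOOGLE_CSE_ID",
    "GOOGLE_API_KEY",
    "HUGGINGFACE_TOKEN",
)


def missing_keys(keys):
    """Return the required names that are absent or empty in the given dict."""
    return [name for name in REQUIRED_KEYS if not keys.get(name)]


def load_keys(path="keys.json"):
    """Read the key file and fail early if any required entry is missing or empty."""
    keys = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = missing_keys(keys)
    if missing:
        raise ValueError(f"{path} is missing values for: {', '.join(missing)}")
    return keys
```

Calling `load_keys()` once at startup surfaces an unfilled key immediately instead of partway through a run.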

Create an environment with Python 3.11.3.
pip install -r requirements.txt
If this is your first run, execute python create_database.py. This should take about 5 minutes. Make yourself a cup of tea.
python demo_pipeline.py --query <your_query> --num_texts <number_of_texts_you_want_to_retrieve_from_database> --num_instructs <number_of_instructions_you_want_to_generate_for_each_num_texts> --output_dir <directory_to_store_final_result>
Note that due to API rate limits, generation may sometimes take longer than expected or fail.
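One common way to cope with rate-limit failures is to retry with exponential backoff. The sketch below is a generic illustration, not part of this repo: `retry_with_backoff` is a hypothetical helper, and it catches every exception for simplicity, whereas real code would catch only the rate-limit error raised by the API client in use.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func() and retry on failure, doubling the wait after each attempt.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between
    attempts and re-raises the last error once max_attempts is reached.
    The sleep argument is injectable so the helper can be tested without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # assumption: narrow this to the client's rate-limit error
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Any zero-argument callable can be wrapped this way, for example a `lambda` around a single generation request.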
