malhajar17/GoogledWeb

FineWebSFT

What is it

  • This is our project for the NexaAI Hackathon.
  • Using the GenQA technique described in this paper, together with an LLM agent powered by Groq and LangChain, we generate up-to-date instruction data for fine-tuning, based on a user's input query.
  • The base dataset for generation is fineweb-edu. Since this is only a demo, we build the vector database with JinaEmbedding from only the top 10k rows of the fineweb-edu 2024-10 snapshot.
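
To illustrate the retrieval step, here is a minimal sketch of how a query embedding can be matched against stored text embeddings by cosine similarity. It is not the project's code: the real pipeline uses JinaEmbedding and a vector database, whereas the vectors below are hand-written toy values. The names `cosine_similarity` and `top_k` are hypothetical, and `k` only plays a role similar to `--num_texts`.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    # Highest similarity first; sorting is stable, so ties keep their index order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy embeddings standing in for the stored fineweb-edu rows (assumed values).
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
nearest = top_k([1.0, 0.0], doc_vecs, k=2)
```

A real vector database would typically use an approximate nearest-neighbour index instead of scoring every row, but the ranking idea is the same.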

To use the command line demo

  • Create your keys.json file following this format:
    {
        "GROQ_API_KEY": "",
        "JINA_API_KEY": "",
        "GOOGLE_CSE_ID": "",
        "GOOGLE_API_KEY": "",
        "HUGGINGFACE_TOKEN": ""
    }
  • Create an environment with Python 3.11.3.
  • pip install -r requirements.txt
  • If this is your first run, execute python create_database.py. This should take about 5 minutes. Make yourself a cup of tea.
  • python demo_pipeline.py --query <your_query> --num_texts <number_of_texts_you_want_to_retrieve_from_database> --num_instructs <number_of_instructions_you_want_to_generate_for_each_num_texts> --output_dir <directory_to_store_final_result>
  • Note that, due to API rate limits, generation may sometimes take longer than expected or fail.
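
Before running the steps above, it can help to check that keys.json is filled in. The following is a minimal sketch, not part of the repository: `find_missing_keys` and `load_keys` are hypothetical helpers, and it assumes that all five entries shown in the format above must be non-empty, which the demo scripts may not strictly require.

```python
import json
from pathlib import Path

# Key names taken from the keys.json format above.
# Assumption: every one of them must be set to a non-empty value.
REQUIRED_KEYS = (
    "GROQ_API_KEY",
    "JINA_API_KEY",
    "GOOGLE_CSE_ID",
    "GOOGLE_API_KEY",
    "HUGGINGFACE_TOKEN",
)


def find_missing_keys(keys):
    """Return the required key names that are absent or empty in `keys`."""
    return [name for name in REQUIRED_KEYS if not keys.get(name)]


def load_keys(path="keys.json"):
    """Read keys.json and raise ValueError if any required entry is unset."""
    keys = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = find_missing_keys(keys)
    if missing:
        raise ValueError(f"Missing or empty entries in {path}: {', '.join(missing)}")
    return keys
```

Calling `load_keys()` from the directory that contains keys.json either returns the parsed dictionary or reports which entries still need a value.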
