This repository contains scripts for generating and validating G-code produced automatically by various LLM pipelines.
Clone the repository:

```bash
git clone https://github.com/Chitransh31/GLLM.git
```

This project uses Python 3.11. If it is not installed, you can install it via:
```bash
sudo apt update
sudo apt install python3.11
```

Alternatively, you can use pyenv to set up Python 3.11 in your repo folder:
```bash
brew install pyenv
```

Then add pyenv to your shell startup file:

```bash
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.zshrc
echo '[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.zshrc
echo 'eval "$(pyenv init - zsh)"' >> ~/.zshrc
```

Restart your shell (or `source ~/.zshrc`), then install Python 3.11 and verify it is available:

```bash
pyenv install 3.11
pyenv versions
```
Then, install Poetry and point it to Python 3.11:
```bash
pipx install poetry
poetry env use /usr/bin/python3.11
```

Then, install the requirements:
```bash
poetry install
```

To use Hugging Face models, you need to provide your API access token, either via Streamlit secrets or as an environment variable:
- Register or log in at Hugging Face and create an API token in your profile settings.
- Add a file called `secrets.toml` in a folder called `.streamlit` at the root of the repo, and provide your Hugging Face API token by adding the line `huggingface_token = "..."`.
- For OpenAI models, add the access token `openai_token = "YourOpenAITokenHere"` to `.streamlit/secrets.toml` (see the sketch below).
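Streamlit exposes these values through `st.secrets`. A minimal sketch of reading them, assuming the key names above (how the app actually consumes them may differ):

```python
import streamlit as st

# Read the tokens stored in .streamlit/secrets.toml.
hf_token = st.secrets["huggingface_token"]
openai_token = st.secrets.get("openai_token")  # only needed for OpenAI models
```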
Alternatively, you can open your shell's configuration file in a text editor:

```bash
vim ~/.bashrc
```

Add the following line to the end of the file:

```bash
export HUGGINGFACEHUB_API_TOKEN="YourHFTokenHere"
```

Save and close the file. To apply the changes, source the file (`source ~/.bashrc`) or restart your terminal.
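You can quickly check from Python that the token is visible to your processes; this is just a sanity-check sketch, not part of the repo:

```python
import os

# Fail early if the token was not exported or the shell config was not sourced.
token = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
if token is None:
    raise RuntimeError("HUGGINGFACEHUB_API_TOKEN is not set; source ~/.bashrc first")
print(f"Token found: {token[:4]}...")  # print only a prefix, never the full secret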
Most import errors for missing libraries can be resolved using these installation commands:
```bash
pip3 install streamlit
pip3 install openai
pip3 install hub
pip3 install -e /gllm
pip3 install deeplake
pip3 install langchain
pip3 install PyPDF2
pip3 install langchain_community
pip3 install langchain_chains
pip3 install peft
pip3 install pygcode
pip3 install matplotlib
pip3 install plotly
pip3 install langgraph
pip3 install langgraph-checkpoint-sqlite
```
To run the GLLM application:
```bash
poetry run streamlit run gllm/code_generator_streamlit_reasoning_langchain_langgraph.py
```

This file contains code that takes in text and generates question-answer pairs, which can be used for LLM evaluation or instruction tuning.
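For reference, question-answer generation from text typically looks like the sketch below; the model name, prompt, and function are illustrative placeholders, not the repo's actual pipeline (assumes `OPENAI_API_KEY` is set):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_qa_pairs(text: str, n_pairs: int = 3) -> str:
    """Ask the model for n question-answer pairs grounded in the given text."""
    prompt = (
        f"Generate {n_pairs} question-answer pairs from the following text, "
        f"one per line in the form 'Q: ... A: ...':\n\n{text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(generate_qa_pairs("G0 moves the tool rapidly; G1 moves at a set feed rate."))
```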
The code was taken from GitHub; check that repository for details on how to set it up and run it.
`train_pipeline.py` contains code to finetune open-source LLMs from Hugging Face.

Run `python train_pipeline.py` to start the finetuning process. By default, the dataset used for finetuning is the set of PDF files stored in the `pdfs` directory. To use "The Stack" instead, specify `--dataset 'thestack'` (see the sketch below).
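A hypothetical sketch of how such a dataset switch is usually wired up; the actual `train_pipeline.py` may differ:

```python
import argparse

from datasets import load_dataset

parser = argparse.ArgumentParser()
parser.add_argument("--dataset", default="pdfs",
                    help="'pdfs' for local PDF files, 'thestack' for bigcode/the-stack")
args = parser.parse_args()

if args.dataset == "thestack":
    # The Stack's G-code subset (requires Hugging Face login, see below).
    ds = load_dataset("bigcode/the-stack", data_dir="data/g-code", split="train")
else:
    ...  # load and chunk the PDF files from the pdfs/ directory
```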
The Stack contains code files collected from GitHub, including G-code. Around 400 MB of G-code is available, for a total of 16,020 examples.
To use this dataset, you need to log in to Hugging Face in your terminal by:

- Running `huggingface-cli login`
- Providing your Hugging Face access token
To load this dataset, use:

```python
ds = load_dataset("bigcode/the-stack", data_dir="data/g-code", split="train")
```
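If you do not want to download the full ~400 MB up front, the subset can also be streamed; this is a standard `datasets` option rather than something the repo requires:

```python
from datasets import load_dataset

# Stream the G-code subset instead of materializing it on disk.
ds = load_dataset("bigcode/the-stack", data_dir="data/g-code",
                  split="train", streaming=True)
for example in ds.take(2):
    print(example["content"][:80])  # "content" holds the raw file text
```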
So far, training is limited to models with <3B parameters due to memory limitations. Training code works for these models:
- WizardLM/WizardCoder-3B-V1.0
- bigcode/starcoderbase-3b
When training larger models, I tested several workarounds: smaller batch sizes, gradient accumulation and checkpointing, mixed-precision training, and setting `device_map='auto'` when loading the model, but nothing has worked so far (a sketch of these options follows).
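For reference, here is how those options are typically set with `transformers`; the model name and hyperparameter values are illustrative, not the repo's settings:

```python
from transformers import AutoModelForCausalLM, TrainingArguments

# device_map="auto" spreads layers across the available devices.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoderbase-3b", device_map="auto"
)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # smaller batch size
    gradient_accumulation_steps=8,   # simulate a larger effective batch
    gradient_checkpointing=True,     # trade compute for memory
    fp16=True,                       # mixed-precision training
)
```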
To push the model to the Hub after finetuning, make sure you are logged in via the CLI, just as when using "The Stack" dataset (provide a token that has write permission).
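With the standard `transformers` API this looks like the following; the checkpoint path and repo id are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("out")  # path to your finetuned checkpoint
tokenizer = AutoTokenizer.from_pretrained("out")

# Requires a token with write permission (via huggingface-cli login).
model.push_to_hub("your-username/gcode-finetuned")
tokenizer.push_to_hub("your-username/gcode-finetuned")
```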
To use the StarCoder model, you need to be granted access to it. To do this:
- Log in to Hugging Face in a terminal, as described above
- Log in to the Hugging Face website and go to `bigcode/starcoder`
- Accept the conditions to access model files and content.
It is recommended to use the StarCoder Tech Assistant prompt, since the model is only trained on code completion (see the sketch below).
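A minimal sketch of wrapping a request in that prompt style; the preamble below is an abbreviated stand-in, the full Tech Assistant prompt text is published by the BigCode project:

```python
# Abbreviated stand-in for the StarCoder Tech Assistant prompt.
TECH_ASSISTANT_PREAMBLE = (
    "Below is a dialogue between a human and an AI technical assistant. "
    "The assistant gives helpful, correct answers to programming questions.\n\n"
)

def build_prompt(user_request: str) -> str:
    return f"{TECH_ASSISTANT_PREAMBLE}Question: {user_request}\n\nAnswer:"

print(build_prompt("Write G-code to mill a 10 mm square pocket."))
```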