The Julia Chatbot AI is a machine-learning-based chatbot trained on Julia GitHub repositories to generate and understand Julia code.
It allows users to scrape and preprocess data, and to train and evaluate language models of various sizes (135m, 360m, 1.7b).
The system includes post-processing, benchmarking, and a chatbot UI for interactive conversations.

Create a virtual environment (Python 3.12 or higher is required; we used Python 3.12):

```bash
python3 -m venv .venv
```
Source the environment:

```bash
source .venv/bin/activate
```

Source the environment on the server:
eval "$(/home/SA24-G1/miniconda3/bin/conda shell.bash hook)"Install the requirements:
```bash
pip install -r requirements.txt
export PYTHONPATH="$PYTHONPATH:$PWD"
```

To scrape the GitHub repositories, run the following command:
```bash
python3 src/dataset/fetch_dataset.py
```

Afterwards, it is necessary to parse the scraped data to produce a JSON file containing the raw dataset:
```bash
python3 src/dataset/parser.py
```

The raw data is expected to be in the `data` directory, in a file named `fd.json`.
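As a quick sanity check after scraping and parsing, you can load `fd.json` and look at one entry. The schema of `fd.json` is not documented here, so the sketch below only reports the size and prints a single entry; it assumes nothing beyond the file being valid JSON.

```python
import json
from pathlib import Path

# Quick look at the raw dataset produced by the scraping/parsing step.
raw_path = Path("data/fd.json")
with raw_path.open() as f:
    raw = json.load(f)

# The exact schema is not documented here, so just report the size and
# print one entry to inspect the fields yourself.
print(f"type: {type(raw).__name__}, entries: {len(raw)}")
sample = raw[0] if isinstance(raw, list) else next(iter(raw.items()))
print(sample)
```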
The data is pre-processed using the following command; you need to run it before training the model:

```bash
python src/data/preprocess.py
```

The model can be `135m`, `360m`, or `1.7b`:
```bash
python src/LLM/LLM_train.py --model "135m"
```

`--sample` is used to train the model on a subset of the data; specify the number of samples you want to train on, for example:

```bash
python src/LLM/LLM_train.py --model "1.7b" --sample=1000
```

The model will be saved in the `models` directory along with the tokenizer.
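If you want to inspect a saved checkpoint outside of `LLM_predict.py`, a minimal loading sketch is shown below. It assumes the model and tokenizer are saved in Hugging Face format under `models/135m`; the actual layout produced by `LLM_train.py` may differ, so treat the path and format as assumptions.

```python
# Minimal sketch: load a saved checkpoint directly with Hugging Face transformers.
# Assumes the model and tokenizer were saved under models/135m in HF format;
# the real path/format used by LLM_train.py may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint_dir = "models/135m"  # assumed path
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForCausalLM.from_pretrained(checkpoint_dir)

prompt = '"""A simple for loop"""'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```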
To train all models (135m, 360m, 1.7b):

```bash
python src/LLM/LLM_train.py --model "all"
```

- `--sample` is used to train the model on a small subset of the data (specify the number of samples you want to train on).
- `--signature` is used to train the model on the signature data.
- `--baseline` is used to train the model on the baseline data without any preprocessing.
```bash
python src/LLM/LLM_train.py --model "all" --sample=1000
```

Before evaluating the models, they should be post-processed. The completions created by the model often have syntax errors caused by a lack, or over-inclusion, of `end` keywords, as in this example:

```julia
function fibonacci(n)
    if n <= 2
        return 1
    end
    return fibonacci(n - 1) + fibonacci(n - 2)
end
end
end
```

The post-processing script ensures that the number of `end` keywords is correct. It is sufficient to run the `src/utils/post_processing.py` script, passing as input directory the directory containing the results produced by the `./generate.sh` script:
```bash
python3 src/utils/post_processing.py --input-dir ./results
```
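To illustrate the idea behind this step (this is not the actual implementation of `src/utils/post_processing.py`, just a minimal sketch of the technique it describes): count the Julia keywords that open a block, compare that to the number of `end` lines, then trim the surplus trailing `end`s or append the missing ones.

```python
import re

# Julia keywords that open a block closed by `end`.
# Simplification: ignores `end` used in indexing, keywords inside
# strings or comments, and one-line block forms.
OPENER = re.compile(
    r"^\s*(function|if|for|while|begin|let|struct|module|try|quote|do)\b"
)
END = re.compile(r"^\s*end\b")


def balance_ends(code: str) -> str:
    """Drop surplus trailing `end` lines or append missing ones."""
    lines = code.splitlines()
    opened = sum(1 for line in lines if OPENER.match(line))
    closed = sum(1 for line in lines if END.match(line))

    if closed > opened:
        surplus = closed - opened
        # Remove extra `end` lines starting from the bottom of the completion.
        for i in range(len(lines) - 1, -1, -1):
            if surplus == 0:
                break
            if END.match(lines[i]):
                lines.pop(i)
                surplus -= 1
    elif opened > closed:
        # Append the missing `end` keywords.
        lines.extend(["end"] * (opened - closed))

    return "\n".join(lines)


broken = """function fibonacci(n)
    if n <= 2
        return 1
    end
    return fibonacci(n - 1) + fibonacci(n - 2)
end
end
end"""

# Prints the function with the two surplus `end` lines removed.
print(balance_ends(broken))
```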
To use the evaluation script, you need to have the trained model in the `models` directory.

```bash
python src/LLM/LLM_predict.py --prompt '"""A simple for loop"""' --model "135m"
```

- `--max_length` is used to specify the maximum number of tokens in the output.
- `--signature` is used to evaluate the model on the signature data.
- `--baseline` is used to evaluate the model on the baseline data without any preprocessing.
- `--original_output` does not apply any post-processing to the output.
```bash
python src/LLM/LLM_predict.py --prompt '"""A simple for loop"""' --model "all"
```

To run the chatbot UI:
```bash
python src/chatbot/app.py
```

Go inside the benchmark directory:
```bash
cd benchmark
```

Replace the name of the model with the one you want to run the benchmark with, for example:
```bash
generate.sh ../models/135m
```

`evaluate.sh` was customized to load our tokenizer.
The predictions generated by our model are stored in the `results` folder.

To use the evaluation script, you need to have the trained model in the `models` directory. Replace `135m` with the model you want to evaluate and adjust the checkpoint directory accordingly.

To evaluate, you need to be in the top-level folder of the project; if you are in the `benchmark` folder, go back to the top-level folder first.
```bash
evaluate.sh models/135m
```

The results are stored in a JSON file named `$MODELNAME_results_jl.json`.
Run `test.py` in the `benchmark` directory to get the efficiency of the models on the MultiPL-E benchmark and some statistics.

To run the statistical analysis:

```bash
python src/statistical_test/statistical.py
```