Materials Science Understanding Large Language Model

M3RG-IITD/llamat

LLaMat

This repo contains all the data and code related to our paper, Foundational Large Language Models for Materials Research.

Overview

We performed domain adaptation of the LLaMA-3 and LLaMA-2 models for use in materials science, via continued pretraining followed by instruction finetuning on materials science and chemistry datasets.

[Figure: LLaMat overview]

[Figure: Results on MatNLP dataset]

[Figure: Results on structured information extraction tasks]

For detailed results, please see our paper, Foundational Large Language Models for Materials Research. The models can be downloaded from https://huggingface.co/m3rg-iitd.

The codebase makes use of the Megatron-LLM library for efficient training of LLMs. Go through its documentation to understand the basics. The environment for using our codebase is the same as the one for Megatron-LLM.


File Structure

  • src Contains code to pretrain and fine-tune LLMs that have the LLaMA-2 or LLaMA-3 architecture.
  • preprocess Contains code that was used to extract text from Elsevier and Springer research papers for the corpus.
  • plots Code used for creating the plots used in the paper
  • evaluation_codes Contains code for running benchmark evaluations
  • agent Code for launching agentic applications

Pretraining

Pretraining was performed on a text corpus totaling 30B tokens, interleaved in the following way:

  1. 10M research paper tokens taken from Elsevier and Springer publications, followed by 0.1M RedPajama tokens.
  2. 30M Matsci community discourse tokens, included in the last 3B tokens (10%) of the dataset in a 100:1 ratio.

The list of journals and DOIs of the research papers used can be accessed from Zenodo.
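The first interleaving rule above can be illustrated with a small block schedule. This is only a sketch, assuming the corpus is laid out as repeating blocks of 10M research paper tokens followed by 0.1M RedPajama tokens. The function name and the block representation are hypothetical and are not taken from the repository's preprocessing code, and the community discourse tokens from the second rule are not modeled.

```python
from itertools import cycle

PAPER_BLOCK = 10_000_000   # research paper tokens per block (from the list above)
REDPAJAMA_BLOCK = 100_000  # RedPajama tokens per block (from the list above)


def interleave_schedule(total_tokens):
    """Yield (source, n_tokens) blocks until total_tokens is covered.

    Hypothetical helper: it only alternates the two block types and
    truncates the final block if the budget runs out mid-block.
    """
    remaining = total_tokens
    blocks = cycle([("papers", PAPER_BLOCK), ("redpajama", REDPAJAMA_BLOCK)])
    for source, size in blocks:
        if remaining <= 0:
            break
        take = min(size, remaining)
        yield source, take
        remaining -= take


# Three full cycles, small enough to inspect by eye.
schedule = list(interleave_schedule(30_300_000))
for source, n_tokens in schedule:
    print(source, n_tokens)
```

Calling the same function with 30_000_000_000 gives the block sequence for a corpus of the full size under the same assumption.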

To set up the environment for installing Megatron, please follow the instructions provided here.


Inference and Evaluation

To run the benchmark evaluations on our datasets, first open the evaluation_codes directory and follow the instructions given there. The inference environment for the MatNLP tasks requires the vLLM library.

Environment setup for downstream evaluation

We also provide a complete environment that can be used for inference on our downstream tasks. Note, however, that it differs from the one used for training.

conda env create -f infer_env_downstream.yaml
conda activate infer_env_downstream

Instructions to run MatNLP evaluations

Make sure you are in the evaluation_codes directory, then run the following:

    bash ft_eval_downstream.sh <Checkpoint_path> <GPU_number> <output_name1> <output_name2>

<output_name1>_<output_name2> will be the suffix for the output and error files. These files are stored in the same directory, and their exact names can be found in the evaluation_codes/ft_eval_downstream.sh file.

Instructions to run structured information extraction evaluations:

The inference and evaluation code for our MatSIE tasks determines how to split the output based on the checkpoint name. Hence, when running LLaMat-2 model variants (LLaMat-2 and LLaMat-2-chat), keep "llamat2" as a substring of the checkpoint name; similarly, keep "llamat3" as a substring for LLaMat-3 variants (LLaMat-3 and LLaMat-3-chat).
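Before launching a run, it can help to verify that a checkpoint name satisfies this convention. The following is a minimal sketch of such a check. The helper `model_family` is hypothetical, and it assumes the substring is looked up in the final component of the checkpoint path; the actual scripts may perform the check differently.

```python
from pathlib import Path


def model_family(checkpoint_path):
    """Infer the model family from a substring of the checkpoint name.

    Hypothetical helper, not part of the repository. It inspects only the
    last path component and does not change its case.
    """
    name = Path(checkpoint_path).name
    has_v2 = "llamat2" in name
    has_v3 = "llamat3" in name
    if has_v2 == has_v3:
        # Either no marker or both markers: the family is ambiguous.
        raise ValueError(
            f"checkpoint name {name!r} must contain exactly one of "
            "'llamat2' or 'llamat3'"
        )
    return "llamat2" if has_v2 else "llamat3"


print(model_family("../models/llamat3chat_hf"))  # llamat3
```

Raising an error for names that contain neither marker (or both) is a design choice of this sketch, meant to surface a misnamed checkpoint before any inference is run.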

Generating the output pickle file:

    python3 {doping, mof1, mof2, discomat}_run.py <CUDA_GPU_NUMBER> <MODEL_PATH> <SAVE_NAME_PREFIX>                               

The output will be stored as <SAVE_NAME_PREFIX>_{doping, mof1, mof2, discomat}_test.pkl in the same folder. Here is an example command:

    python3 mof1_run.py 0 ../models/llamat3chat_hf llamat3chat

Running the above command will run the provided model on the mof1 tasks and produce an output pickle file named llamat3chat_mof1_test.pkl, which can be passed to the evaluation script. For running inference on these tasks with other models, we provide reference examples for GPT, Claude, and Gemini in src/SIE_external.

Running evaluation on the output file:

    python3 {doping, mof1, mof2, discomat}_eval.py <SAVE_NAME_PREFIX>                           

This will print the output to the screen along with the metrics discussed in the paper. Here is an example command to evaluate the mof1 tasks:

    python3 mof1_eval.py llamat3chat

This will search for the llamat3chat_mof1_test.pkl file in the same directory and give the results for the model on the mof1 (General materials science) tasks.
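The file naming convention used by the run and evaluation scripts can be sketched as follows. Both helper functions are hypothetical and only mirror the `<SAVE_NAME_PREFIX>_<task>_test.pkl` pattern described above. The structure of the pickled object is not documented here, so the loader simply returns whatever was stored.

```python
import pickle
from pathlib import Path

# Task names as they appear in the script names above.
TASKS = ("doping", "mof1", "mof2", "discomat")


def expected_pickle_name(save_name_prefix, task):
    """Build the output file name that the run script is described as writing."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    return f"{save_name_prefix}_{task}_test.pkl"


def load_predictions(save_name_prefix, task, directory="."):
    """Load the pickled model outputs for manual inspection (hypothetical helper)."""
    path = Path(directory) / expected_pickle_name(save_name_prefix, task)
    with path.open("rb") as handle:
        return pickle.load(handle)


print(expected_pickle_name("llamat3chat", "mof1"))  # llamat3chat_mof1_test.pkl
```

As with any pickle file, only load outputs that you produced yourself or that come from a trusted source.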


Instruction finetuning

The weights of the input model must be stored in the Megatron format. To convert model weights from the HuggingFace format to the Megatron format, use wt_fromhf.sh; for the reverse conversion, use wt_tohf.sh. After downloading a model from HuggingFace, the conversion to the Megatron format is necessary before training. The model weights resulting from IFT are stored in the HF format to facilitate inference.

Step 1. OpenOrca finetuning.

We follow a two-step finetuning procedure. First, the model is trained on OpenOrca, a generic IFT dataset. To run this step, make the required path changes in src/run_orca_ift.sh, providing the paths for the input base model, the output location, and the location from which the OpenOrca data is loaded. More precise instructions are provided in src/run_orca_ift.sh itself. To run it, call bash on the script from the src directory.

    bash run_orca_ift.sh

In the output path, the trained model will be present in HuggingFace format in the "release/hf" directory, as well as in Megatron format in the "release/iter_0009000" directory.

Step 2. Materials science specific finetuning.

To run the next finetuning step, make the necessary path changes in src/train_repo.sh in the same way, and set the base model to the output path from the previous step. To start training, go to the src directory and run the following command:

    bash train_repo.sh

This will also create the final model in both HuggingFace and Megatron formats. The HuggingFace format is used for all the evaluations.

General Command:

The following is the general command we use for finetuning within src/run_orca_ift.sh and src/train_repo.sh.

sh ft_pipeline.sh <load_model_path> <save_model_path> <model_iteration_to_finetune> <train_path>\
<val_path> <epochs> <number of docs in train set> <log_file_name> <llama2/llama3> <port number>

The files that are responsible for IFT:

  • ft_pipeline.sh
  • finetune.sh
  • ft_sft.py
  • ft_sft.sh

Arguments flow from top to bottom in the above list.
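The general command takes ten positional arguments, so it is easy to misplace one. The sketch below assembles the argument list in the documented order so that it can be inspected before launching. The function name and all example values (paths, iteration, document count, port) are hypothetical placeholders, and nothing here is executed against the actual scripts.

```python
import shlex


def ft_pipeline_command(load_model_path, save_model_path, iteration,
                        train_path, val_path, epochs, num_train_docs,
                        log_file_name, model_type, port):
    """Return the ft_pipeline.sh invocation as a list of strings.

    The argument order follows the general command shown above.
    """
    if model_type not in ("llama2", "llama3"):
        raise ValueError("model_type must be 'llama2' or 'llama3'")
    args = [load_model_path, save_model_path, iteration, train_path,
            val_path, epochs, num_train_docs, log_file_name,
            model_type, port]
    return ["sh", "ft_pipeline.sh"] + [str(arg) for arg in args]


# All values below are placeholders chosen for illustration only.
cmd = ft_pipeline_command(
    load_model_path="/path/to/megatron_model",
    save_model_path="/path/to/output",
    iteration="release",
    train_path="/path/to/train.jsonl",
    val_path="/path/to/val.jsonl",
    epochs=1,
    num_train_docs=1000,
    log_file_name="ift.log",
    model_type="llama3",
    port=6000,
)
print(shlex.join(cmd))
```

The printed line can be compared with the calls inside src/run_orca_ift.sh and src/train_repo.sh, or passed to `subprocess.run(cmd, check=True)` from the src directory once the paths are real.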

The instruction finetuning process was performed on 8 NVIDIA A100 80GB GPUs via IIT Delhi's High Performance Computing facility.


Acknowledgements

We used the codebase of Meditron-LLM for training our models on NVIDIA A100 GPUs. We thank the High-Performance Computing (HPC) facility at IIT Delhi for computational and storage resources. This work was partially supported by the Edinburgh International Data Facility (EIDF) and the Data-Driven Innovation Programme at the University of Edinburgh. The EIDF provided access to Cerebras CS2 clusters, which were used for pretraining our models.
