This repo contains all the data and code related to our paper Foundational Large Language Models for Materials Research.
We performed domain adaptation of the LLaMA-3 and LLaMA-2 models for use in materials science, via continued pretraining followed by instruction finetuning on materials science and chemistry datasets.
LLaMat overview
Results on the MatNLP dataset
Results on structured information extraction tasks
For detailed results, please refer to our paper Foundational Large Language Models for Materials Research. The models can be downloaded from https://huggingface.co/m3rg-iitd. The codebase makes use of the Megatron-LLM library for efficient training of LLMs; go through their documentation to understand the basics. The environment for using our codebase is the same as the one for Megatron-LLM.
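As a convenience, here is a minimal sketch of downloading a model locally with the huggingface_hub CLI; <model_name> is a placeholder, and the actual repository names should be taken from the organisation page linked above.
# Requires the huggingface_hub CLI: pip install -U "huggingface_hub[cli]"
# <model_name> is a placeholder; pick a repository listed at https://huggingface.co/m3rg-iitd
huggingface-cli download m3rg-iitd/<model_name> --local-dir models/<model_name>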
- src Contains code to pretrain and fine-tune LLMs that have the LLaMA-2 or LLaMA-3 architecture.
- preprocess Contains code that was used to extract text from Elsevier and Springer research papers for the corpus.
- plots Code for creating the plots used in the paper
- evaluation_codes Contains code for running benchmark evaluations
- agent Code for launching agentic applications
- agent/README.md - Interactive chat and NER agents for materials science applications
- evaluation_codes/README.md - Instructions for running MatNLP and MatSIE evaluations
- visualizations/README.md - Streamlit dashboard for downstream evaluation analysis
- src/cifs/crystal-text-llm/README.md - CIF generation dashboard and batch processing tools
- src/cifs/crystal-text-llm/conditional_gen_eval/README.md - Conditional generation evaluation results and analysis
Pretraining was performed on a text corpus of 30B tokens in total, interleaved in the following way:
- 10M research paper tokens taken from Elsevier and Springer publications followed by 0.1M Red Pajama tokens
- 30M MatSci community discourse tokens, included in the last 3B (10%) of the dataset in a 100:1 ratio

The list of journals and DOIs of the research papers used can be accessed from Zenodo.
For setting up the environment to install Megatron, please follow the instructions provided here.
For running the benchmark evaluations on our datasets, first open the evaluation_codes directory and follow the given instructions. The inference environment for the MatNLP tasks requires the vLLM library.
We also provide a complete environment that can be used for inference on our downstream tasks; however, note that it is different from the one used for training.
conda env create -f infer_env_downstream.yaml
conda activate infer_env_downstream
Make sure you are in the evaluation_codes directory, then run the following:
bash ft_eval_downstream.sh <Checkpoint_path> <GPU_number> <output_name1> <output_name2>
<output_name1>_<output_name2> will be the suffix for the output and error files.
The output and error files will be stored in the same directory, and their exact names can be found in the evaluation_codes/ft_eval_downstream.sh file.
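As an illustration only (the checkpoint path and output names below are hypothetical), a run on GPU 0 could look like:
bash ft_eval_downstream.sh ../models/llamat3chat_hf 0 llamat3chat run1
This would produce output and error files suffixed with llamat3chat_run1.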
The inference and evaluation code for our MatSIE tasks determines how to split the output based on the checkpoint name. When running LLaMat-2 model variants (LLaMat-2 and LLaMat-2-chat), keep "llamat2" as a substring in the checkpoint name, and similarly keep "llamat3" as a substring for LLaMat-3 variants (LLaMat-3 and LLaMat-3-chat).
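For example (hypothetical paths), a LLaMat-2-chat checkpoint directory whose name lacks the substring can simply be renamed:
mv ../models/my_checkpoint ../models/llamat2chat_hf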
python3 {doping, mof1, mof2, discomat}_run.py <CUDA_GPU_NUMBER> <MODEL_PATH> <SAVE_NAME_PREFIX>
Output will be stored as <SAVE_NAME_PREFIX>_{doping, mof1, mof2, discomat}_test.pkl in the same folder. Here is an example command:
python3 mof1_run.py 0 ../models/llamat3chat_hf llamat3chat
Running the above command will run the provided model on the MOF (mof1) tasks and produce an output pickle file named llamat3chat_mof1_test.pkl, which can be passed to the evaluation script.
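If you want to inspect the raw predictions before evaluation, the pickle can be loaded directly; the exact structure of the object is defined by the corresponding *_run.py script, so the one-liner below only prints its type as a sanity check.
python3 -c "import pickle; out = pickle.load(open('llamat3chat_mof1_test.pkl', 'rb')); print(type(out))"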
For running inference on these tasks using other models, we have provided examples in src/SIE_external for GPT, Claude, and Gemini for reference.
python3 {doping, mof1, mof2, discomat}_eval.py <SAVE_NAME_PREFIX>
This will print the output to the screen along with the metrics discussed in the paper. Here is an example command to evaluate the mof1 tasks:
python3 mof1_eval.py llamat3chat
This will search for the llamat3chat_mof1_test.pkl file in the same directory and report the results for the model on the mof1 (General materials science) tasks.
The weights of the input model must be stored in the Megatron format. To convert model weights from the HuggingFace format to the Megatron format, wt_fromhf.sh is used; for the reverse conversion, wt_tohf.sh is used. The model weights resulting from IFT are stored in the HF format to facilitate inference. After downloading a model from HuggingFace, this conversion is necessary before training.
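The line below is only a sketch of that step, and the argument order is an assumption (HuggingFace checkpoint in, Megatron checkpoint out); check src/wt_fromhf.sh itself for the exact arguments it expects.
# Assumed argument order; verify against src/wt_fromhf.sh before running
bash wt_fromhf.sh <hf_model_path> <megatron_output_path>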
We follow a 2-step finetuning process. First, the model is trained on OpenOrca, which is a generic IFT dataset. To run this finetuning, simply make the required path changes in src/run_orca_ift.sh, providing paths for the input base model, the output location, and the place where the OpenOrca data is loaded from. Further precise instructions are provided in src/run_orca_ift.sh itself. To run it, call bash on it while in the src directory:
bash run_orca_ift.sh
In the output path, the trained model will be present in HuggingFace format in the "release/hf" directory, as well as in Meditron format in the "release/iter_0009000" directory.
To run the next finetuning step, make the necessary path changes in the src/train_repo.sh file similarly, and set the base model to the output path from the previous step.
To start training, go to the src directory and run the following command:
bash train_repo.sh
This will also create the final model in both HuggingFace and Meditron formats; the HuggingFace format is used for all the evaluations.
The following is the general command we use for finetuning within src/run_orca_ift.sh and src/train_repo.sh:
sh ft_pipeline.sh <load_model_path> <save_model_path> <model_iteration_to_finetune> <train_path>\
<val_path> <epochs> <number of docs in train set> <log_file_name> <llama2/llama3> <port number>
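For illustration only, with entirely hypothetical paths and values, an invocation for a LLaMA-3-based model could look like:
sh ft_pipeline.sh ../models/llamat3_base ../models/llamat3_ift 9000 data/train.jsonl data/val.jsonl 3 50000 ift_run.log llama3 29500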
The files that are responsible for IFT:
- ft_pipeline.sh
- finetune.sh
- ft_sft.py
- ft_sft.sh
Arguments flow from top to bottom in the above list.
The instruction finetuning process was performed on 8 NVIDIA A100 80GB GPUs via IIT Delhi's High Performance Computing facility.
We used the codebase of Meditron-LLM for training our models on NVIDIA A100 GPUs. We thank the High-Performance Computing (HPC) facility at IIT Delhi for computational and storage resources. This work was partially supported by the Edinburgh International Data Facility (EIDF) and the Data-Driven Innovation Programme at the University of Edinburgh. The EIDF provided access to Cerebras CS-2 clusters, which were used for pretraining our models.


