LeMaterial-Synthesis

An open-source multi-modal toolbox for extracting structured synthesis procedures and performance data from materials science literature at scale. This repository contains the implementation of LeMat-Synth v1.0 (published on arXiv and presented at NeurIPS AI4Mat 2025) plus an extensible codebase for other use cases in materials science.

Quick Start

Installation Instructions

Prerequisites

This project uses uv as a package & project manager. See uv's README for installation instructions.

Setup

# 1. Clone & enter the repo
git clone https://github.com/LeMaterial/lematerial-llm-synthesis.git
cd lematerial-llm-synthesis

# 2. (First time only) create & seed venv
uv venv -p 3.11 --seed

# 3. Install dependencies & package
uv sync && uv pip install -e .

API Key Configuration

macOS/Linux

```bash
cp .env.example .env
# Edit `.env` to add:
# MISTRAL_API_KEY=your_api_key    # if using Mistral models and Mistral OCR
# OPENAI_API_KEY=your_api_key     # if using OpenAI models
# GEMINI_API_KEY=your_api_key     # if using Gemini models
# ANTHROPIC_API_KEY=your_api_key  # if using Anthropic models (Claude, image extraction)
```

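Before running the scripts, load your API keys into the current shell session by sourcing the .env file:

source .env

If the file contains plain KEY=value lines without export, run set -a before sourcing and set +a afterwards so that the variables are passed on to the scripts.

Windows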
  • Search bar → Edit the system environment variables → Advanced → click "Environment Variables..."
  • Under "User variables for <username>" click "New" and add each:
    • Variable name: MISTRAL_API_KEY; Value: your_api_key
    • Variable name: OPENAI_API_KEY; Value: your_api_key
    • Variable name: GEMINI_API_KEY; Value: your_api_key
    • Variable name: GOOGLE_APPLICATION_CREDENTIALS; Value: C:\path\to\service-account.json

Note: On any platform, once the keys are set as environment variables, you can read them in code via os.environ.get(...).
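As a minimal sketch of reading the keys from Python, the snippet below copies KEY=value pairs from a .env-style file into the process environment and reports which keys are still missing. The helpers load_env_file and missing_keys are hypothetical and not part of llm_synthesis; the key names are the ones listed above, and the parser only handles simple KEY=value lines.

```python
import os
from pathlib import Path

# Key names taken from the API key configuration above.
API_KEYS = ("MISTRAL_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY", "ANTHROPIC_API_KEY")


def load_env_file(path=".env"):
    """Copy KEY=value pairs from a .env-style file into os.environ.

    Hypothetical helper, not part of llm_synthesis. Variables that are
    already set in the environment are left untouched.
    """
    env_path = Path(path)
    if not env_path.is_file():
        return
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and malformed lines
        key, _, value = line.partition("=")
        key = key.strip().removeprefix("export ").strip()
        # Drop a trailing " # comment", then surrounding whitespace and quotes.
        value = value.split(" #", 1)[0].strip().strip("'\"")
        os.environ.setdefault(key, value)


def missing_keys(names=API_KEYS):
    """Return the names that are unset or empty in the environment."""
    return [name for name in names if not os.environ.get(name)]


if __name__ == "__main__":
    load_env_file()
    print("Missing API keys:", missing_keys() or "none")
```

Only the keys for the providers you actually use need to be present, so a non-empty "missing" list is not necessarily an error.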

Verify Installation

uv run python -c "import llm_synthesis"

No errors? You're all set!
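If you would rather run the same check from a setup script, a small sketch using only the standard library could look like the following. The helper is_installed is hypothetical; llm_synthesis is the import name used in the command above.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once the package has been installed into the active environment.
print("llm_synthesis importable:", is_installed("llm_synthesis"))
```

If this prints False, re-running the install step (uv sync && uv pip install -e .) inside the project directory is the first thing to try.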


Dataset Access

Fetching HuggingFace Dataset LeMat-Synth

The data is hosted as a LeMaterial Dataset on HuggingFace: LeMat-Synth

Access Steps

  1. Apply for access (request will be instantly approved)
  2. Install HuggingFace CLI (guide)
    • Recommended: pip install -U "huggingface_hub[cli]"
    • Or (macOS): brew install huggingface-cli
  3. Login with access token: huggingface-cli login

Available Datasets

  • LeMat-Synth: Synthesis procedures and images in structured (per-synthesis) format
  • LeMat-Synth-Papers: Intermediate dataset storing papers in per-paper format

Usage

Extract from HuggingFace Dataset

uv run examples/scripts/extract_synthesis_procedure_from_text.py \
  data_loader=default \
  synthesis_extraction=default \
  material_extraction=default \
  judge=default \
  result_save=default

Extract Synthesis Locally

uv run examples/scripts/extract_synthesis_procedure_from_text.py \
  data_loader=local \
  data_loader.architecture.data_dir="/path/to/markdown" \
  synthesis_extraction=default \
  material_extraction=default \
  judge=default \
  result_save=default
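To run the local extraction over several markdown directories, the command above can be assembled programmatically. The sketch below is an assumption-laden convenience wrapper, not part of the repository: build_command and run_extraction are hypothetical names, and the key=value overrides are copied verbatim from the command above.

```python
import subprocess

# Script path as used in the usage examples above.
SCRIPT = "examples/scripts/extract_synthesis_procedure_from_text.py"


def build_command(data_dir):
    """Assemble the local-extraction command for one markdown directory."""
    return [
        "uv", "run", SCRIPT,
        "data_loader=local",
        f"data_loader.architecture.data_dir={data_dir}",
        "synthesis_extraction=default",
        "material_extraction=default",
        "judge=default",
        "result_save=default",
    ]


def run_extraction(data_dirs):
    """Run the extraction once per directory, stopping at the first failure."""
    for data_dir in data_dirs:
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(build_command(data_dir), check=True)
```

Call run_extraction(["/path/to/markdown"]) from the repository root, replacing the placeholder path with your own directories. Because the arguments are passed as a list, the shell quotes around the path in the command above are not needed here.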

Extract Images Locally

Work in Progress

Customize LeMat-Synth

Work in Progress

Thermocatalysis Case Study

Work in Progress

Filter the papers down with a keyword search, then downsample with an LLM (default or long prompt):

uv run examples/scripts/case_study_thermocatalysis/keyword_search.py
uv run examples/scripts/case_study_thermocatalysis/downsample_with_llm.py --prompt default
uv run examples/scripts/case_study_thermocatalysis/downsample_with_llm.py --prompt long

📝 Citation

Cite us:

@article{lederbauer2025lemat,
  title={LeMat-Synth: a multi-modal toolbox to curate broad synthesis procedure databases from scientific literature},
  author={Lederbauer, Magdalena and Betala, Siddharth and Li, Xiyao and Jain, Ayush and Sehaba, Amine and
          Channing, Georgia and Germain, Gr{\'e}goire and Leonescu, Anamaria and Flaifil, Faris and
          Amayuelas, Alfonso and Nozadze, Alexandre and Schmid, Stefan P. and Zaki, Mohd
          and Ethirajan, Sudheesh Kumar and Pan, Elton and Franckel, Mathilde
          and Duval, Alexandre and Krishnan, N. M. Anoop and Gleason, Samuel P.},
  journal={arXiv preprint arXiv:2510.26824},
  year={2025}
}
