MatSKRAFT

This repository contains the data and code for our paper, "MatSKRAFT: A framework for large-scale materials knowledge extraction from scientific tables".

Overview

We introduce MatSKRAFT, a unified framework for large-scale extraction of material properties and compositions from scientific tables, followed by construction of a knowledge base from the extracted data.

Key innovations

- Hierarchical data preparation: integrates distant supervision, property-specific annotation algorithms, and strategic data augmentation to construct balanced, high-quality training datasets.
- Specialized GNN-based extraction models: separate models for property and composition extraction, trained with constrained learning and followed by post-processing.
- Knowledge-base integration: links extracted compositions and properties through orientation-aware (intra-table) and cross-table entity linking, enabling construction of a structured knowledge base for downstream applications.
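
As an illustration of what orientation-aware linking within a single table might look like, here is a minimal sketch. It is not the repository's implementation: the function name `link_table`, its arguments, and the toy values are all hypothetical, and it assumes the table orientation and the composition/property positions are already known.

```python
def link_table(headers, cells, comp_idx, prop_idx, orientation):
    """Pair each material's composition entries with its property entries.

    headers:     field names, one per composition/property line (hypothetical input).
    cells:       2D list with the table body.
    comp_idx:    indices of the fields holding composition values.
    prop_idx:    indices of the fields holding property values.
    orientation: "row" if each material occupies a row, "column" if a column.
    """
    if orientation == "column":
        # Transpose so that, from here on, every material is a row.
        cells = [list(col) for col in zip(*cells)]
    elif orientation != "row":
        raise ValueError(f"unknown orientation: {orientation!r}")

    records = []
    for row in cells:
        records.append({
            "composition": {headers[i]: row[i] for i in comp_idx},
            "properties": {headers[j]: row[j] for j in prop_idx},
        })
    return records


# Toy values, made up purely for illustration.
headers = ["SiO2", "Na2O", "Density"]
row_oriented = [[70, 30, 2.45], [60, 40, 2.52]]
col_oriented = [[70, 60], [30, 40], [2.45, 2.52]]

by_row = link_table(headers, row_oriented, [0, 1], [2], "row")
by_col = link_table(headers, col_oriented, [0, 1], [2], "column")
```

Both calls produce the same list of material records, which is the point of handling orientation explicitly: the same material-property pairs are recovered whether materials are laid out as rows or as columns. Cross-table linking via material IDs is not covered by this sketch.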

Results

MatSKRAFT achieves state-of-the-art performance on both composition and property extraction:

- Property extraction: precision 90.35, recall 87.07, F1 score 88.68.
- Composition extraction: precision 82.31, recall 62.97, F1 score 71.35.
- At database scale, MatSKRAFT extracted 535,000+ entries from 68,933 tables across 47,242 papers with 78.08% precision.
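
As a quick sanity check, the reported F1 scores are consistent with the harmonic mean of the reported precision and recall, F1 = 2PR / (P + R). The helper name `f1_score` below is just for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, as listed above.
reported = {
    "property": (90.35, 87.07, 88.68),
    "composition": (82.31, 62.97, 71.35),
}

for task, (p, r, f1) in reported.items():
    print(f"{task}: computed F1 = {f1_score(p, r):.2f}, reported F1 = {f1:.2f}")
```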

Beyond its accuracy, MatSKRAFT is also computationally efficient, processing a table 19-496× faster than LLM baselines. This efficiency, combined with reliable extraction accuracy, made large-scale extraction feasible in a reasonable time on a single V100 GPU.

Ablation Study Insights

Our ablations confirm that MatSKRAFT’s high performance comes from the synergy of multiple architectural and data-preparation components, not from any single one:

Architectural components:
- Constrained learning improves material–property linking.
- Caption information provides critical semantic context for disambiguation.
- Post-processing is the most vital precision safeguard.

Data preparation strategies:
- Distant supervision (on INTERGLAD) provides broad initial coverage but is inherently limited to properties present in the reference database.
- Annotation algorithms systematically extend label coverage beyond distant supervision, generating high-precision training examples through multi-criteria verification.
- Data augmentation rebalances the long-tail by amplifying rare properties through neighborhood co-occurrence patterns (e.g., Abbe value with refractive index) combined with power-law scaling and Gaussian sampling.
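
To make the rebalancing idea concrete, here is a rough sketch of one way power-law scaling and Gaussian sampling could be combined to decide how many examples to synthesize per property. This is an assumed reading, not the method used in the repository: the function `augmentation_targets`, the exponent `alpha`, the jitter `rel_sigma`, and the example counts are all hypothetical, and the neighborhood co-occurrence step is left out entirely.

```python
import random


def augmentation_targets(counts, alpha=0.5, rel_sigma=0.1, seed=0):
    """Return how many extra examples to synthesize for each property.

    counts:    mapping of property name -> number of labelled examples.
    alpha:     hypothetical power-law exponent in (0, 1]; smaller values
               flatten the class distribution more aggressively.
    rel_sigma: hypothetical relative standard deviation of the Gaussian jitter.
    """
    rng = random.Random(seed)
    max_count = max(counts.values())
    extra = {}
    for name, n in counts.items():
        # Power-law scaling: rare properties are pulled towards the head class.
        target = max_count * (n / max_count) ** alpha
        # Gaussian sampling around the target, so class sizes are not all identical.
        noisy_target = rng.gauss(target, rel_sigma * target)
        # Only ever add examples; never drop existing ones.
        extra[name] = max(0, round(noisy_target) - n)
    return extra


# Made-up counts for illustration only.
counts = {"refractive index": 10000, "Abbe value": 100}
print(augmentation_targets(counts, alpha=0.5, rel_sigma=0.0))
```

With `alpha=0.5` and no jitter, the rare class is raised from 100 to a target of 1000 examples while the head class is left untouched, which shows how a sub-linear exponent compresses the long tail.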

Together, these components raise overall performance, demonstrating that MatSKRAFT’s strength lies in its multi-component design.

Full detailed breakdowns are presented in the article and the Ablation Studies section.

File Structure

Downloading and Preprocessing (downloading_and_preprocessing)
Scripts for acquiring full-text XMLs (via the Elsevier API) and converting them into machine-readable tables with associated text.
Generating the Training Data (train_data_generation)
Automated hierarchical pipeline that generates high-quality training data for materials-table extraction via distant supervision, annotation algorithms, and data augmentation.
Property Extraction (Matskraft_property)
Code for extracting property names, values, and units from tables; includes unit normalization, validation, and disambiguation routines.
Composition Extraction (Matskraft_composition)
Code for extracting constituent elements or compounds, their values, and the corresponding units from heterogeneous table formats in materials science.
Knowledge-Base Construction and Comprehensive Evaluation (Linking_the_extracted_info_and_final_evaluation)
Code to evaluate both extraction tasks against the expert-annotated test dataset. Extracted compositions are then linked with properties, using orientation-aware linking within a table and material-ID linking across tables, to form the structured knowledge base, on which the final post-linking scores are evaluated.
Baseline Comparison (baselines)
Code for the baseline comparison on property and composition extraction.
Ablation Studies (ablations)
Code and configs for running ablation experiments on data preparation strategies and architecture components.
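
For readers unfamiliar with the unit normalization mentioned under Property Extraction, the following is a minimal sketch of the general idea, not the repository's routine. The table `TO_CANONICAL`, the function `normalize`, and the choice of canonical units are assumptions made for illustration; only the conversion factors themselves (1 GPa = 1000 MPa, 1 g/cm3 = 1000 kg/m3) are standard.

```python
# Hypothetical conversion table: unit string -> (canonical unit, multiplicative factor).
TO_CANONICAL = {
    "GPa": ("GPa", 1.0),
    "MPa": ("GPa", 1e-3),
    "g/cm3": ("g/cm3", 1.0),
    "kg/m3": ("g/cm3", 1e-3),
}


def normalize(value, unit):
    """Convert (value, unit) to the canonical unit for that quantity."""
    try:
        canonical, factor = TO_CANONICAL[unit.strip()]
    except KeyError:
        # Unknown units are rejected rather than silently passed through.
        raise ValueError(f"unknown unit: {unit!r}") from None
    return value * factor, canonical


value, unit = normalize(2500.0, "kg/m3")
print(value, unit)
```

Mapping every extracted value to one canonical unit per quantity makes entries from different tables directly comparable once they are merged into a knowledge base.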

Cite as

@article{hira2025matskraft,
  title={MatSKRAFT: A framework for large-scale materials knowledge extraction from scientific tables},
  author={Hira, Kausik and Zaki, Mohd and Krishnan, NM and others},
  journal={arXiv preprint arXiv:2509.10448},
  year={2025}
}
