This repository contains the dataset and code for the ACL 2025 Findings paper "A MISMATCHED Benchmark for Scientific Natural Language Inference". The dataset can be downloaded from here.
If you face any difficulties while downloading the dataset, raise an issue in this repository or contact us at fshaik8@uic.edu.
Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains–PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH, and contains 2,700 human annotated sentence pairs. We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best performing baseline shows a Macro F1 of only 78.17% illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI.
We introduce MISMATCHED, a novel evaluation benchmark for scientific Natural Language Inference (NLI) designed to test out-of-domain generalization. It is derived from research articles across three non-Computer Science (CS) domains: PUBLIC HEALTH, PSYCHOLOGY, and ENGINEERING. MISMATCHED consists exclusively of development and test sets, containing 300 and 2,400 human-annotated sentence pairs respectively. Importantly, MISMATCHED does not include its own training set and serves as an out-of-domain evaluation test-bed.
The construction of the MISMATCHED development and test sets involved a two-phase process. In the first phase, sentence pair candidates for the ENTAILMENT, CONTRASTING, and REASONING classes were automatically extracted from the source articles. This extraction utilized a distant supervision method that relies on explicit linking phrases (e.g., "However," "Therefore") which are indicative of the semantic relation between adjacent sentences. These linking phrases were subsequently removed from the second sentence after the initial automatic labeling. For the NEUTRAL class, non-adjacent sentences from the same paper were paired using specific strategies.
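To illustrate the general idea of this first phase, the sketch below pairs adjacent sentences using linking phrases as distant supervision. It is a minimal sketch for demonstration only: the phrase inventory and its mapping to classes are assumptions, not the exact extraction pipeline used for the paper.

```python
import re

# Hypothetical mapping from linking phrases to distantly supervised labels;
# the actual phrase inventory used for MISMATCHED may differ.
LINKING_PHRASES = {
    "however": "CONTRASTING",
    "in contrast": "CONTRASTING",
    "therefore": "REASONING",
    "thus": "REASONING",
    "in other words": "ENTAILMENT",
}

def extract_candidates(sentences):
    """Pair each sentence with the one that follows it when the second sentence
    starts with a linking phrase; strip the phrase and keep the distant label."""
    candidates = []
    for s1, s2 in zip(sentences, sentences[1:]):
        for phrase, label in LINKING_PHRASES.items():
            if s2.lower().startswith(phrase):
                cleaned = re.sub(rf"(?i)^{re.escape(phrase)},?\s*", "", s2)
                candidates.append({"sentence1": s1, "sentence2": cleaned, "label": label})
                break
    return candidates
```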
In the second phase, all candidate pairs underwent a rigorous manual annotation process conducted by domain experts hired via a crowd-sourcing platform. This step was crucial to ensure high-quality data and create a realistic evaluation benchmark. Only those sentence pairs for which the human-assigned gold label matched the label automatically assigned during the distant supervision phase were included in the final MISMATCHED development and test sets.
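The agreement filter applied in this second phase can be pictured with a small pandas sketch. The file and column names (annotated_candidates.csv, distant_label, gold_label) are assumptions for illustration; the actual annotation pipeline is described in the paper.

```python
import pandas as pd

# Hypothetical candidate file holding both the distant label and the expert label.
candidates = pd.read_csv("annotated_candidates.csv")

# Keep only pairs where the human gold label agrees with the distant label.
agreed = candidates[candidates["gold_label"] == candidates["distant_label"]]
final = agreed.rename(columns={"gold_label": "label"})[["sentence1", "sentence2", "label"]]
```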
We refer the reader to Section 3 of our paper for an in-depth description of the dataset construction process, data sources, and detailed statistics.
The MISMATCHED dataset provides development and test data in Tab-Separated Value (.tsv) format:
- test.tsv and dev.tsv contain the testing and development data, respectively. Each of these .tsv files includes the following columns essential for the Natural Language Inference task:
* sentence1: The premise sentence.
* sentence2: The hypothesis sentence.
* label: The human-annotated label representing the semantic relation (e.g., ENTAILMENT, CONTRASTING, REASONING, NEUTRAL).
Additionally, the files contain a metadata column, Domain, which allows for domain-specific analyses.
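For example, the splits can be loaded with pandas as follows. This is a minimal sketch; it assumes the column names listed above and that the files sit in a local MisMatched/ directory, as in the example commands further below.

```python
import pandas as pd

# Load the human-annotated evaluation splits (tab-separated).
test = pd.read_csv("MisMatched/test.tsv", sep="\t")
dev = pd.read_csv("MisMatched/dev.tsv", sep="\t")

# Inspect the NLI columns and the per-domain label distribution.
print(test[["sentence1", "sentence2", "label"]].head())
print(test.groupby(["Domain", "label"]).size())
```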
The MISMATCHED benchmark is specifically designed as an out-of-domain evaluation set and does not include a training set. The sizes for the provided, human-annotated sets are as follows:
- Test: 2,400 human-annotated sentence pairs.
- Dev: 300 human-annotated sentence pairs.
We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). The table below reports Macro F1 scores (%) with standard deviations for each domain, along with overall performance.
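Per-domain and overall Macro F1 can be computed with scikit-learn as sketched below. The sketch assumes a predictions file such as the one written via --output_file by the evaluation scripts; the label, prediction, and Domain column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import f1_score

# Assumed output of one of the evaluation scripts (e.g., --output_file zero_shot_results.csv);
# the column names below are illustrative assumptions.
results = pd.read_csv("zero_shot_results.csv")

# Overall Macro F1 over the four classes.
print("Overall:", f1_score(results["label"], results["prediction"], average="macro"))

# Macro F1 broken down by domain.
for domain, group in results.groupby("Domain"):
    print(domain, f1_score(group["label"], group["prediction"], average="macro"))
```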
numpy==1.26.4
pandas==2.2.3
scikit-learn==1.6.1
torch==2.5.1
transformers==4.51.1
Note: The following 4 open-source LLMs were evaluated in our paper using zero-shot and few-shot settings on the MisMatched test set. Use these exact models to reproduce our results:
- microsoft/Phi-3-medium-128k-instruct
- meta-llama/Llama-2-13b-chat-hf
- meta-llama/Meta-Llama-3.1-8B-Instruct
- mistralai/Mistral-7B-Instruct-v0.3
python MisMatched_Zero_Shot.py --model_name <model_name> --test_data_path <path_to_test.tsv>
python MisMatched_Zero_Shot.py --model_name "microsoft/Phi-3-medium-128k-instruct" --test_data_path "MisMatched/test.tsv" --env_file "var.env" --batch_size 128 --max_new_tokens 40 --embedding_model "all-MiniLM-L6-v2" --embedding_batch_size 256 --similarity_threshold 0.25 --sample_size 100 --random_seed 42 --output_file "zero_shot_results.csv" --num_sample_prompts 2 --do_sample --temperature 0.7 --top_p 0.9 --show_sample_responses --save_predictions --log_level "INFO" --force_cpu
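For reference, zero-shot prompting with one of the listed models could look roughly like the following. This is a minimal sketch of the general setup (it assumes transformers with accelerate for device_map="auto"); the prompt wording and decoding configuration are illustrative assumptions, not the exact logic of MisMatched_Zero_Shot.py.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

premise = "..."     # sentence1 from test.tsv
hypothesis = "..."  # sentence2 from test.tsv
messages = [{
    "role": "user",
    "content": (
        "Classify the relation between the two sentences as ENTAILMENT, "
        "CONTRASTING, REASONING, or NEUTRAL.\n"
        f"Sentence 1: {premise}\nSentence 2: {hypothesis}\nAnswer:"
    ),
}]

# Build the chat prompt and generate a short answer containing the predicted label.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```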
python MisMatched_Few_Shot.py --model_name <model_name> --train_data_path <path_to_train.csv> --test_data_path <path_to_test.tsv>
python MisMatched_Few_Shot.py --model_name "microsoft/Phi-3-medium-128k-instruct" --train_data_path "sampled_SciNLI_pair1.csv" --test_data_path "MisMatched/test.tsv" --env_file "var.env" --load_in_8bit --batch_size 128 --max_new_tokens 40 --use_sampled_examples --few_shot_samples_per_label 4 --few_shot_seed 0 --embedding_model "all-MiniLM-L6-v2" --embedding_batch_size 256 --similarity_threshold 0.25 --num_sample_prompts 2 --do_sample --temperature 0.7 --top_p 0.9 --show_sample_responses --save_predictions --output_file "few_shot_results.csv" --log_level "INFO"
python nli_train_evaluate.py --data_dir <location of a directory containing the train.tsv, test.tsv and dev.tsv files> --output_dir <directory to save model and results> --model_type <'BERT', 'Sci_BERT', 'Sci_BERT_uncased', 'RoBERTa', 'RoBERTa_large', 'xlnet'> --batch_size <batch size> --num_epochs <number of epochs to train the model for> --epoch_patience <patience for early stopping> --device <device to run your experiment on> --seed <some random seed>
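For example, a concrete invocation might look like the following; the values below are illustrative placeholders, not the hyperparameters used in the paper.

python nli_train_evaluate.py --data_dir Datasets/ --output_dir output_roberta/ --model_type RoBERTa --batch_size 16 --num_epochs 5 --epoch_patience 2 --device cuda --seed 42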
python llama2chat_finetune.py --train_path <path_to_train.csv> --dev_path <path_to_dev.csv> --output_dir <output_directory> --save_model_path <final_model_path>
python llama2chat_finetune.py --train_path "Datasets/train.csv" --dev_path "Datasets/dev.csv" --model_name "meta-llama/Llama-2-13b-chat-hf" --max_memory_per_gpu "80GB" --load_in_4bit --bnb_4bit_use_double_quant --bnb_4bit_quant_type "nf4" --bnb_4bit_compute_dtype "bfloat16" --lora_r 16 --lora_alpha 32 --lora_dropout 0.05 --lora_bias "none" --output_dir "output/" --per_device_train_batch_size 32 --gradient_accumulation_steps 4 --learning_rate 2e-3 --num_train_epochs 3 --save_strategy "epoch" --eval_strategy "epoch" --load_best_model_at_end --fp16 --optim "adamw_bnb_8bit" --max_seq_length 1024 --save_model_path "llama2_chat_classification_trainsampled_3_epochs/" --config_file "config.json" --seed 42
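The quantization and LoRA flags in the command above roughly correspond to the transformers/peft configuration sketched below (assuming peft is installed in addition to the pinned requirements). This is a sketch of how such flags are typically wired up, not necessarily the script's exact code.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization and bfloat16 compute, mirroring
# --load_in_4bit --bnb_4bit_use_double_quant --bnb_4bit_quant_type nf4
# --bnb_4bit_compute_dtype bfloat16 in the command above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings mirroring --lora_r 16 --lora_alpha 32 --lora_dropout 0.05 --lora_bias none.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```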
python llama2chat_evaluation.py --model_path <path_to_finetuned_model> --test_path <path_to_test.csv>
python llama2chat_evaluation.py --model_path "llama2_chat_classification_trainsampled_3_epochs/" --device "auto" --test_path "Datasets/test.csv" --max_new_tokens 50 --num_return_sequences 1 --batch_size 128 --embedding_model "all-MiniLM-L6-v2" --use_sampling --sample_size 100 --sample_seed 42 --output_dir "evaluation_results/" --save_predictions --save_responses --show_examples 2 --verbose
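The --embedding_model and --similarity_threshold flags suggest that free-form generations are mapped back to the four labels via sentence-embedding similarity. The sketch below illustrates that general idea with sentence-transformers; it is an assumption about the scripts' behavior, not a description of their exact logic, and the NEUTRAL fallback is an illustrative choice.

```python
from sentence_transformers import SentenceTransformer, util

LABELS = ["ENTAILMENT", "CONTRASTING", "REASONING", "NEUTRAL"]
encoder = SentenceTransformer("all-MiniLM-L6-v2")
label_embeddings = encoder.encode(LABELS, convert_to_tensor=True)

def map_response_to_label(response: str, threshold: float = 0.25) -> str:
    """Pick the label whose embedding is most similar to the generated text.
    Falling back to NEUTRAL below the threshold is an illustrative choice."""
    emb = encoder.encode(response, convert_to_tensor=True)
    scores = util.cos_sim(emb, label_embeddings)[0]
    best = int(scores.argmax())
    return LABELS[best] if float(scores[best]) >= threshold else "NEUTRAL"
```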
If you use this dataset, please cite our paper:
@inproceedings{shaik-etal-2025-mismatched,
title = "A {MISMATCHED} Benchmark for Scientific Natural Language Inference",
author = "Shaik, Firoz and
Sadat, Mobashir and
Gautam, Nikita and
Caragea, Doina and
Caragea, Cornelia",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.1109/",
doi = "10.18653/v1/2025.findings-acl.1109",
pages = "21524--21538",
ISBN = "979-8-89176-256-5",
abstract = "Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MisMatched. The new MisMatched benchmark covers three non-CS domains{--}Psychology, Engineering, and Public Health, and contains 2,700 human annotated sentence pairs. We establish strong baselines on MisMatched using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best performing baseline shows a Macro F1 of only 78.17{\%} illustrating the substantial headroom for future improvements. In addition to introducing the MisMatched benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI. We make our dataset and code publicly available on GitHub."
}
MisMatched is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).
Please contact us at fshaik8@uic.edu with any questions.

