This repository contains the dataset and code for the ACL 2025 Findings paper "A MISMATCHED Benchmark for Scientific Natural Language Inference". The dataset can be downloaded from here.
If you face any difficulties while downloading the dataset, raise an issue in this repository or contact us at fshaik8@uic.edu.
Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MISMATCHED. The new MISMATCHED benchmark covers three non-CS domains–PSYCHOLOGY, ENGINEERING, and PUBLIC HEALTH, and contains 2,700 human annotated sentence pairs. We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best performing baseline shows a Macro F1 of only 78.17% illustrating the substantial headroom for future improvements. In addition to introducing the MISMATCHED benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI.
We introduce MISMATCHED, a novel evaluation benchmark for scientific Natural Language Inference (NLI) designed to test out-of-domain generalization. It is derived from research articles across three non-Computer Science (CS) domains: PUBLIC HEALTH, PSYCHOLOGY, and ENGINEERING. MISMATCHED consists exclusively of development and test sets, containing 300 and 2,400 human-annotated sentence pairs respectively. Importantly, MISMATCHED does not include its own training set and serves as an out-of-domain evaluation test-bed.
The construction of the MISMATCHED development and test sets involved a two-phase process. In the first phase, sentence pair candidates for the ENTAILMENT, CONTRASTING, and REASONING classes were automatically extracted from the source articles. This extraction utilized a distant supervision method that relies on explicit linking phrases (e.g., "However," "Therefore") which are indicative of the semantic relation between adjacent sentences. These linking phrases were subsequently removed from the second sentence after the initial automatic labeling. For the NEUTRAL class, non-adjacent sentences from the same paper were paired using specific strategies.
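To illustrate the general idea of this first phase, the sketch below pairs adjacent sentences using linking phrases as distant supervision. It is a minimal sketch for demonstration only: the phrase inventory and its mapping to classes are assumptions, not the exact extraction pipeline used for the paper.

```python
import re

# Hypothetical mapping from linking phrases to distantly supervised labels;
# the actual phrase inventory used for MISMATCHED may differ.
LINKING_PHRASES = {
    "however": "CONTRASTING",
    "in contrast": "CONTRASTING",
    "therefore": "REASONING",
    "thus": "REASONING",
    "in other words": "ENTAILMENT",
}

def extract_candidates(sentences):
    """Pair each sentence with the one that follows it when the second sentence
    starts with a linking phrase; strip the phrase and keep the distant label."""
    candidates = []
    for s1, s2 in zip(sentences, sentences[1:]):
        for phrase, label in LINKING_PHRASES.items():
            if s2.lower().startswith(phrase):
                cleaned = re.sub(rf"(?i)^{re.escape(phrase)},?\s*", "", s2)
                candidates.append({"sentence1": s1, "sentence2": cleaned, "label": label})
                break
    return candidates
```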
In the second phase, all candidate pairs underwent a rigorous manual annotation process conducted by domain experts hired via a crowd-sourcing platform. This step was crucial to ensure high-quality data and create a realistic evaluation benchmark. Only those sentence pairs for which the human-assigned gold label matched the label automatically assigned during the distant supervision phase were included in the final MISMATCHED development and test sets.
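The agreement filter applied in this second phase can be pictured with a small pandas sketch. The file and column names (annotated_candidates.csv, distant_label, gold_label) are assumptions for illustration; the actual annotation pipeline is described in the paper.

```python
import pandas as pd

# Hypothetical candidate file holding both the distant label and the expert label.
candidates = pd.read_csv("annotated_candidates.csv")

# Keep only pairs where the human gold label agrees with the distant label.
agreed = candidates[candidates["gold_label"] == candidates["distant_label"]]
final = agreed.rename(columns={"gold_label": "label"})[["sentence1", "sentence2", "label"]]
```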
We refer the reader to Section 3 of our paper for an in-depth description of the dataset construction process, data sources, and detailed statistics.
The MISMATCHED dataset provides development and test data in Tab-Separated Value (.tsv) format:
- test.tsv and dev.tsv contain the testing and development data, respectively. Each of these .tsv files includes the following columns essential for the Natural Language Inference task:
* sentence1: The premise sentence.
* sentence2: The hypothesis sentence.
* label: The human-annotated label representing the semantic relation (e.g., ENTAILMENT, CONTRASTING, REASONING, NEUTRAL).
Additionally, the files contain a metadata column, Domain, which allows for domain-specific analyses.
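For example, the splits can be loaded with pandas as follows. This is a minimal sketch; it assumes the column names listed above and that the files sit in a local MisMatched/ directory, as in the example commands further below.

```python
import pandas as pd

# Load the human-annotated evaluation splits (tab-separated).
test = pd.read_csv("MisMatched/test.tsv", sep="\t")
dev = pd.read_csv("MisMatched/dev.tsv", sep="\t")

# Inspect the NLI columns and the per-domain label distribution.
print(test[["sentence1", "sentence2", "label"]].head())
print(test.groupby(["Domain", "label"]).size())
```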
The MISMATCHED benchmark is specifically designed as an out-of-domain evaluation set and does not include a training set. The sizes for the provided, human-annotated sets are as follows:
- Test: 2,400 human-annotated sentence pairs.
- Dev: 300 human-annotated sentence pairs.
We establish strong baselines on MISMATCHED using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). The table below reports Macro F1 scores (%) with standard deviations for each domain, along with overall performance.
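Per-domain and overall Macro F1 can be computed with scikit-learn as sketched below. The sketch assumes a predictions file such as the one written via --output_file by the evaluation scripts; the label, prediction, and Domain column names are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import f1_score

# Assumed output of one of the evaluation scripts (e.g., --output_file zero_shot_results.csv);
# the column names below are illustrative assumptions.
results = pd.read_csv("zero_shot_results.csv")

# Overall Macro F1 over the four classes.
print("Overall:", f1_score(results["label"], results["prediction"], average="macro"))

# Macro F1 broken down by domain.
for domain, group in results.groupby("Domain"):
    print(domain, f1_score(group["label"], group["prediction"], average="macro"))
```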
numpy==1.26.4
pandas==2.2.3
scikit-learn==1.6.1
torch==2.5.1
transformers==4.51.1
Note: The following 4 open-source LLMs were evaluated in our paper using zero-shot and few-shot settings on the MisMatched test set. Use these exact models to reproduce our results:
- microsoft/Phi-3-medium-128k-instruct
- meta-llama/Llama-2-13b-chat-hf
- meta-llama/Meta-Llama-3.1-8B-Instruct
- mistralai/Mistral-7B-Instruct-v0.3
python MisMatched_Zero_Shot.py --model_name <model_name> --test_data_path <path_to_test.tsv>
python MisMatched_Zero_Shot.py --model_name "microsoft/Phi-3-medium-128k-instruct" --test_data_path "MisMatched/test.tsv" --env_file "var.env" --batch_size 128 --max_new_tokens 40 --embedding_model "all-MiniLM-L6-v2" --embedding_batch_size 256 --similarity_threshold 0.25 --sample_size 100 --random_seed 42 --output_file "zero_shot_results.csv" --num_sample_prompts 2 --do_sample --temperature 0.7 --top_p 0.9 --show_sample_responses --save_predictions --log_level "INFO" --force_cpu
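For reference, zero-shot prompting with one of the listed models could look roughly like the following. This is a minimal sketch of the general setup (it assumes transformers with accelerate for device_map="auto"); the prompt wording and decoding configuration are illustrative assumptions, not the exact logic of MisMatched_Zero_Shot.py.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

premise = "..."     # sentence1 from test.tsv
hypothesis = "..."  # sentence2 from test.tsv
messages = [{
    "role": "user",
    "content": (
        "Classify the relation between the two sentences as ENTAILMENT, "
        "CONTRASTING, REASONING, or NEUTRAL.\n"
        f"Sentence 1: {premise}\nSentence 2: {hypothesis}\nAnswer:"
    ),
}]

# Build the chat prompt and generate a short answer containing the predicted label.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```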
python MisMatched_Few_Shot.py --model_name <model_name> --train_data_path <path_to_train.csv> --test_data_path <path_to_test.tsv>
python MisMatched_Few_Shot.py --model_name "microsoft/Phi-3-medium-128k-instruct" --train_data_path "sampled_SciNLI_pair1.csv" --test_data_path "MisMatched/test.tsv" --env_file "var.env" --load_in_8bit --batch_size 128 --max_new_tokens 40 --use_sampled_examples --few_shot_samples_per_label 4 --few_shot_seed 0 --embedding_model "all-MiniLM-L6-v2" --embedding_batch_size 256 --similarity_threshold 0.25 --num_sample_prompts 2 --do_sample --temperature 0.7 --top_p 0.9 --show_sample_responses --save_predictions --output_file "few_shot_results.csv" --log_level "INFO"
python nli_train_evaluate.py --data_dir <location of a directory containing the train.tsv, test.tsv and dev.tsv files> --output_dir <directory to save model and results> --model_type <'BERT', 'Sci_BERT', 'Sci_BERT_uncased', 'RoBERTa', 'RoBERTa_large', 'xlnet'> --batch_size <batch size> --num_epochs <number of epochs to train the model for> --epoch_patience <patience for early stopping> --device <device to run your experiment on> --seed <some random seed>
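For example, a concrete invocation might look like the following; the values below are illustrative placeholders, not the hyperparameters used in the paper.

python nli_train_evaluate.py --data_dir Datasets/ --output_dir output_roberta/ --model_type RoBERTa --batch_size 16 --num_epochs 5 --epoch_patience 2 --device cuda --seed 42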
python llama2chat_finetune.py --train_path <path_to_train.csv> --dev_path <path_to_dev.csv> --output_dir <output_directory> --save_model_path <final_model_path>
python llama2chat_finetune.py --train_path "Datasets/train.csv" --dev_path "Datasets/dev.csv" --model_name "meta-llama/Llama-2-13b-chat-hf" --max_memory_per_gpu "80GB" --load_in_4bit --bnb_4bit_use_double_quant --bnb_4bit_quant_type "nf4" --bnb_4bit_compute_dtype "bfloat16" --lora_r 16 --lora_alpha 32 --lora_dropout 0.05 --lora_bias "none" --output_dir "output/" --per_device_train_batch_size 32 --gradient_accumulation_steps 4 --learning_rate 2e-3 --num_train_epochs 3 --save_strategy "epoch" --eval_strategy "epoch" --load_best_model_at_end --fp16 --optim "adamw_bnb_8bit" --max_seq_length 1024 --save_model_path "llama2_chat_classification_trainsampled_3_epochs/" --config_file "config.json" --seed 42
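The quantization and LoRA flags in the command above roughly correspond to the transformers/peft configuration sketched below (assuming peft is installed in addition to the pinned requirements). This is a sketch of how such flags are typically wired up, not necessarily the script's exact code.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization with double quantization and bfloat16 compute, mirroring
# --load_in_4bit --bnb_4bit_use_double_quant --bnb_4bit_quant_type nf4
# --bnb_4bit_compute_dtype bfloat16 in the command above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings mirroring --lora_r 16 --lora_alpha 32 --lora_dropout 0.05 --lora_bias none.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
```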
python llama2chat_evaluation.py --model_path <path_to_finetuned_model> --test_path <path_to_test.csv>
python llama2chat_evaluation.py --model_path "llama2_chat_classification_trainsampled_3_epochs/" --device "auto" --test_path "Datasets/test.csv" --max_new_tokens 50 --num_return_sequences 1 --batch_size 128 --embedding_model "all-MiniLM-L6-v2" --use_sampling --sample_size 100 --sample_seed 42 --output_dir "evaluation_results/" --save_predictions --save_responses --show_examples 2 --verbose
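The --embedding_model and --similarity_threshold flags suggest that free-form generations are mapped back to the four labels via sentence-embedding similarity. The sketch below illustrates that general idea with sentence-transformers; it is an assumption about the scripts' behavior, not a description of their exact logic, and the NEUTRAL fallback is an illustrative choice.

```python
from sentence_transformers import SentenceTransformer, util

LABELS = ["ENTAILMENT", "CONTRASTING", "REASONING", "NEUTRAL"]
encoder = SentenceTransformer("all-MiniLM-L6-v2")
label_embeddings = encoder.encode(LABELS, convert_to_tensor=True)

def map_response_to_label(response: str, threshold: float = 0.25) -> str:
    """Pick the label whose embedding is most similar to the generated text.
    Falling back to NEUTRAL below the threshold is an illustrative choice."""
    emb = encoder.encode(response, convert_to_tensor=True)
    scores = util.cos_sim(emb, label_embeddings)[0]
    best = int(scores.argmax())
    return LABELS[best] if float(scores[best]) >= threshold else "NEUTRAL"
```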
If you use this dataset, please cite our paper:
@inproceedings{shaik-etal-2025-mismatched,
title = "A {MISMATCHED} Benchmark for Scientific Natural Language Inference",
author = "Shaik, Firoz and
Sadat, Mobashir and
Gautam, Nikita and
Caragea, Doina and
Caragea, Cornelia",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.1109/",
doi = "10.18653/v1/2025.findings-acl.1109",
pages = "21524--21538",
ISBN = "979-8-89176-256-5",
abstract = "Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. Existing datasets for this task are derived from various computer science (CS) domains, whereas non-CS domains are completely ignored. In this paper, we introduce a novel evaluation benchmark for scientific NLI, called MisMatched. The new MisMatched benchmark covers three non-CS domains{--}Psychology, Engineering, and Public Health, and contains 2,700 human annotated sentence pairs. We establish strong baselines on MisMatched using both Pre-trained Small Language Models (SLMs) and Large Language Models (LLMs). Our best performing baseline shows a Macro F1 of only 78.17{\%} illustrating the substantial headroom for future improvements. In addition to introducing the MisMatched benchmark, we show that incorporating sentence pairs having an implicit scientific NLI relation between them in model training improves their performance on scientific NLI. We make our dataset and code publicly available on GitHub."
}
MisMatched is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).
Please contact us at fshaik8@uic.edu with any questions.

