Summary
When using the ZeroShotClassificationPipeline from gliclass, any examples in rac_examples that contain empty true_labels are silently discarded. This causes serious calibration issues and inflated confidence in incorrect predictions, especially in NLI-style classification tasks.
The problem is just as severe when only a single positive example is supplied: with the counter-examples (those with no true labels) discarded, there is no negative or neutral signal left, and the pipeline treats the lone positive example as sufficient evidence, biasing the output toward the listed labels.
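Until this is addressed, a user-side guard can at least make the silent drop visible before the pipeline is called. This is a minimal sketch; the check_rac_examples helper is hypothetical and not part of gliclass, it only inspects the example dicts passed as rac_examples:

import warnings

def check_rac_examples(rac_examples):
    # Hypothetical user-side guard: the pipeline currently keeps only examples
    # with non-empty true_labels, so flag the ones that will be dropped.
    used = [ex for ex in rac_examples if ex.get("true_labels")]
    dropped = [ex for ex in rac_examples if not ex.get("true_labels")]
    if dropped:
        warnings.warn(
            f"{len(dropped)} rac_example(s) have empty true_labels and will be "
            "silently ignored; scores may be calibrated on positive examples only."
        )
    return used, dropped

Running this guard over the rac_examples used in the reproduction below would emit a warning for the counter-example before the biased call is made.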
🔬 Minimal Working Example
❌ With rac_examples — incorrect, high-confidence result
from transformers import AutoTokenizer
from gliclass import GLiClassModel, ZeroShotClassificationPipeline
import torch
model_str = "knowledgator/gliclass-base-v2.0-rac-init"
model = GLiClassModel.from_pretrained(model_str)
tokenizer = AutoTokenizer.from_pretrained(model_str)
device = 'mps' if torch.backends.mps.is_available() else 'cuda:0' if torch.cuda.is_available() else 'cpu'
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device=device)
example_1 = {
"text": "I submitted my application last week but haven’t heard back yet.",
"all_labels": ["this is about post-application"],
"true_labels": ["this is about post-application"]
}
# ❌ This negative example is silently discarded
example_2 = {
"text": "I was filling out the job application form when the site crashed.",
"all_labels": ["this is about post-application"],
"true_labels": []
}
premise = "The job portal crashed while I was still filling out the application."
hypotheses = ["this is about post-application"]
results = pipeline(premise, hypotheses, threshold=0.0, rac_examples=[example_1, example_2])[0]
print(results)

Output:
[{'label': 'this is about post-application', 'score': 0.9948280453681946}]

🔍 Even though the premise is about the pre-application stage, the model outputs a high-confidence score for post-application, due to the lack of counterbalancing from example_2.
🟢 Without rac_examples — correct behavior
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device=device)
premise = "The job portal crashed while I was still filling out the application."
hypotheses = ["this is about post-application"]
results = pipeline(premise, hypotheses, threshold=0.0)[0]
print(results)

Output:
[{'label': 'this is about post-application', 'score': 0.10260037332773209}]

✅ Without the misleading calibration, the model gives a low score, as expected.
⚠️ With a single positive RAC example only: still bad behavior
Even with just one positive RAC example and no counter-examples, the same high-confidence issue appears:
example_1 = {
"text": "I submitted my application last week but haven’t heard back yet.",
"all_labels": ["this is about post-application"],
"true_labels": ["this is about post-application"]
}
results = pipeline(premise, hypotheses, threshold=0.0, rac_examples=[example_1])[0]
print(results)

Output:
[{'label': 'this is about post-application', 'score': 0.9948280453681946}]

✅ Expected Behavior
- Examples with true_labels=[] should either:
  - ❗ Act as negative signals, indicating "this text is not about the listed labels"; OR
  - ⚠️ Trigger a clear warning that the example will be ignored, so users can avoid false calibration.
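For the warning option, a minimal sketch of what the preprocessing inside the pipeline could do instead of a silent drop (the prepare_rac_examples name is hypothetical; the actual internals of ZeroShotClassificationPipeline may be organized differently):

import warnings

def prepare_rac_examples(rac_examples):
    # Hypothetical pipeline-side step: warn explicitly instead of silently
    # skipping examples whose true_labels list is empty.
    usable = []
    for ex in rac_examples:
        if not ex.get("true_labels"):
            warnings.warn(
                "rac_example with empty true_labels will be ignored and will "
                f"not contribute to calibration: {ex.get('text', '')[:60]!r}"
            )
            continue
        usable.append(ex)
    return usable

Treating such examples as genuine negative signals (the first option above) would require changes to how the model consumes RAC examples, so only the warning path is sketched here.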
💡 Why This Matters
- In zero-shot or few-shot setups, users expect every example to contribute to the output decision.
- Silently discarding negative or neutral examples skews predictions, especially when examples are few.
- This reduces trust and interpretability, and can yield confidently wrong classifications, which is a critical issue in real-world deployments.
🧪 Environment
- Dependency snapshot: uv.lock.txt