
RAC examples with empty true_labels are silently ignored, leading to incorrect classification #21

@alexwilson1

Description


Summary

When using the ZeroShotClassificationPipeline from gliclass, any examples in rac_examples that contain empty true_labels are silently discarded. This causes serious calibration issues and inflated confidence in incorrect predictions, especially in NLI-style classification tasks.

The problem persists even when a single positive true_label example is supplied alongside the empty one: the output becomes biased because all negative or neutral signals are dropped. The pipeline treats the lone positive example as sufficient evidence and ignores every counter-example whose true_labels list is empty.


🔬 Minimal Working Example

❌ With rac_examples — incorrect, high-confidence result

from transformers import AutoTokenizer
from gliclass import GLiClassModel, ZeroShotClassificationPipeline 
import torch

model_str = "knowledgator/gliclass-base-v2.0-rac-init"

model = GLiClassModel.from_pretrained(model_str)
tokenizer = AutoTokenizer.from_pretrained(model_str)

device = 'mps' if torch.backends.mps.is_available() else 'cuda:0' if torch.cuda.is_available() else 'cpu'

pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device=device)

example_1 = {
    "text": "I submitted my application last week but haven’t heard back yet.",
    "all_labels": ["this is about post-application"],
    "true_labels": ["this is about post-application"]
}

# ❌ This negative example is silently discarded
example_2 = {
    "text": "I was filling out the job application form when the site crashed.",
    "all_labels": ["this is about post-application"],
    "true_labels": []
}

premise = "The job portal crashed while I was still filling out the application."
hypotheses = ["this is about post-application"]

results = pipeline(premise, hypotheses, threshold=0.0, rac_examples=[example_1, example_2])[0]
print(results)

Output:

[{'label': 'this is about post-application', 'score': 0.9948280453681946}]

🔍 Even though the premise is about the pre-application stage, the model outputs a high-confidence score for post-application, due to the lack of counterbalancing from example_2.
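Until this is fixed upstream, the silent drop can at least be made visible on the caller's side. Below is a minimal sketch of a pre-flight check (the `validate_rac_examples` helper is hypothetical, not part of the gliclass API) that warns when an example would be discarded:

```python
import warnings

def validate_rac_examples(rac_examples):
    """Hypothetical user-side guard: warn about RAC examples with empty
    true_labels, which the pipeline currently discards silently."""
    kept, dropped = [], []
    for i, ex in enumerate(rac_examples):
        if ex.get("true_labels"):
            kept.append(ex)
        else:
            dropped.append(i)
    if dropped:
        warnings.warn(
            f"rac_examples at indices {dropped} have empty true_labels and "
            "will be ignored by the pipeline; scores may be miscalibrated.",
            UserWarning,
        )
    return kept

examples = [
    {"text": "positive case", "all_labels": ["l"], "true_labels": ["l"]},
    {"text": "negative case", "all_labels": ["l"], "true_labels": []},
]
kept = validate_rac_examples(examples)
print(len(kept))  # only the example with non-empty true_labels survives
```

Running `example_1` and `example_2` from above through such a guard would surface the fact that `example_2` never reaches the model.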


🟢 Without rac_examples — correct behavior

pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device=device)

premise = "The job portal crashed while I was still filling out the application."
hypotheses = ["this is about post-application"]

results = pipeline(premise, hypotheses, threshold=0.0)[0]
print(results)

Output:

[{'label': 'this is about post-application', 'score': 0.10260037332773209}]

✅ Without the misleading calibration, the model gives a low score — as expected.


⚠️ With only a single positive RAC example — still bad behavior

Even using just a single positive RAC example (no counter-examples), we see the same high-confidence issue:

example_1 = {
    "text": "I submitted my application last week but haven’t heard back yet.",
    "all_labels": ["this is about post-application"],
    "true_labels": ["this is about post-application"]
}

results = pipeline(premise, hypotheses, threshold=0.0, rac_examples=[example_1])[0]
print(results)

Output:

[{'label': 'this is about post-application', 'score': 0.9948280453681946}]

✅ Expected Behavior

  • Examples with true_labels=[] should:
    • ❗ Act as negative signals, indicating “this text is not about the listed labels”; OR
    • ⚠️ Trigger a clear warning that the example will be ignored, so users can avoid false calibration.

💡 Why This Matters

  • In zero-shot or few-shot setups, users expect every example to contribute to the output decision.
  • Silently discarding negative or neutral examples skews predictions, especially when examples are few.
  • This reduces trust, interpretability, and can yield confidently wrong classifications — a critical issue in real-world deployments.

🧪 Environment
