
Conversation

@yiwzhao (Collaborator) commented Jan 7, 2026

feat: Add data collators to support embedding classification.
Split out the data collator part from #322.

@github-actions github-actions bot added the ci label Jan 7, 2026
@yiwzhao yiwzhao changed the title add data collator for sequence classification and unit tests feat: Add data collator to support embedding classification. Jan 7, 2026
@yiwzhao yiwzhao changed the title feat: Add data collator to support embedding classification. feat: Add data collators to support embedding classification. Jan 7, 2026
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces two new data collators, ClassificationDataCollatorWithPositionIDs and ClassificationTextSequenceShardCollator, specifically for sequence classification tasks. It also adds a comprehensive set of unit tests to validate their functionality. The new collators correctly handle labels for classification by omitting the label masking and shifting present in the base collators. The unit tests are well-written, covering various scenarios including sequence parallelism.

The primary concern with this PR is the significant code duplication in the implementation of the new collator classes. Both classes are nearly identical copies of existing ones, which poses a maintainability risk. I have provided detailed feedback and suggestions to refactor the code using inheritance to mitigate this issue, and also recommended a more robust long-term solution of making the base collators more flexible.
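The inheritance refactor the review suggests could look roughly like the sketch below. The class names mirror the PR, but the packing logic and the `mask_labels` hook are simplified stand-ins invented for illustration, not the actual implementation:

```python
import torch
from dataclasses import dataclass

IGNORE_INDEX = -100  # label value ignored by torch.nn.CrossEntropyLoss


@dataclass
class DataCollatorWithPositionIDs:
    """Simplified stand-in: packs features into one row and masks the
    label at the end of each packed subsequence."""

    def __call__(self, features):
        input_ids = torch.cat([f["input_ids"] for f in features])
        labels = torch.cat([f["labels"] for f in features])
        position_ids = torch.cat(
            [torch.arange(len(f["input_ids"])) for f in features]
        )
        return {
            "input_ids": input_ids.unsqueeze(0),
            "labels": self.mask_labels(labels, position_ids).unsqueeze(0),
            "position_ids": position_ids.unsqueeze(0),
        }

    def mask_labels(self, labels, position_ids):
        labels = labels.clone()
        # the last token of each subsequence sits right before a position reset
        labels[position_ids.roll(-1) == 0] = IGNORE_INDEX
        return labels


@dataclass
class ClassificationDataCollatorWithPositionIDs(DataCollatorWithPositionIDs):
    """Overrides only the masking hook; the packing logic is inherited."""

    def mask_labels(self, labels, position_ids):
        return labels  # keep every token label for classification
```

With this split, the classification variant carries no duplicated packing code; only the one behavior that differs is overridden.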

@yiwzhao yiwzhao changed the title feat: Add data collators to support embedding classification. [data, collator] feat: Add data collators to support embedding classification. Jan 7, 2026
@yiwzhao yiwzhao changed the title [data, collator] feat: Add data collators to support embedding classification. [data] feat: add data collators for embedding classification. Jan 7, 2026
@yiwzhao yiwzhao self-assigned this Jan 7, 2026


@dataclass
class ClassificationDataCollatorWithPositionIDs(DataCollator):
Collaborator

maybe reuse DataCollatorWithPositionIDs if no diff here

Collaborator Author

There is a difference: DataCollatorWithPositionIDs masks the labels corresponding to the "boundary tokens" of each packed subsequence, while ClassificationDataCollatorWithPositionIDs does not perform this masking step. However, I think we can make "whether to mask boundary labels" a configurable option, so we don't need to create a new class.

Collaborator Author

Updated.

input_ids = batch.pop("input_ids")

# CHANGED: do NOT shift labels for seq-cls token-level labels
labels = batch.pop("labels").contiguous()
Collaborator

I think we could add a parameter named `shift_labels=True` to TextSequenceShardCollator to control whether to keep the labels as-is or shift and mask them.

Collaborator Author

I agree. I added two switches to the original class, which by default maintain the old behavior while also supporting sequence classification.
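The two switches could behave along these lines. This is a hypothetical sketch: the flag names `shift_labels` and `mask_boundary_labels` come from this thread and the diff, but the function body is inferred, not the real collator code:

```python
import torch

IGNORE_INDEX = -100  # label value ignored by torch.nn.CrossEntropyLoss


def collate_labels(labels, position_ids, shift_labels=True, mask_boundary_labels=True):
    """Defaults reproduce the causal-LM behavior (mask subsequence ends,
    then shift left by one); setting both flags to False keeps the
    token-level classification labels aligned and intact."""
    labels = labels.clone()
    if mask_boundary_labels:
        # the last token of each packed subsequence precedes a position reset
        labels[position_ids.roll(-1) == 0] = IGNORE_INDEX
    if shift_labels:
        # predict token t+1 from token t; the final position has no target
        labels = torch.cat([labels[1:], labels.new_full((1,), IGNORE_INDEX)])
    return labels.contiguous()
```

For packed `position_ids = [0, 1, 2, 0, 1]` and `labels = [1, 2, 3, 4, 5]`, the defaults produce shifted, boundary-masked LM targets, while `shift_labels=False, mask_boundary_labels=False` returns the labels untouched.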


if self.mask_boundary_labels:
if self.rmpad_with_pos_ids: # mask the last token of each sequence
cu_seqlens = pos2culen(batch["position_ids"])
Collaborator

Don't we need cu_seqlens in the non-mask_boundary_labels case as well?
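For context, a minimal version of what a `pos2culen`-style helper computes — a sketch assuming flattened position_ids that reset to 0 at the start of each packed subsequence, not the PR's actual implementation:

```python
import torch


def pos2culen(position_ids: torch.Tensor) -> torch.Tensor:
    """Convert flattened position_ids into cumulative sequence lengths
    (cu_seqlens), the boundary format flash-attention-style varlen
    kernels use for packed batches: [0, end_of_seq_1, end_of_seq_2, ...]."""
    flat = position_ids.flatten()
    starts = (flat == 0).nonzero(as_tuple=True)[0]  # each reset opens a subsequence
    total = torch.tensor([flat.numel()], device=flat.device)
    return torch.cat([starts, total]).to(torch.int32)
```

For example, `position_ids = [[0, 1, 2, 0, 1]]` yields `[0, 3, 5]`: two subsequences of lengths 3 and 2.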

batch = collator(features_two_samples)

assert batch["input_ids"].shape == (1, 5)
assert "position_ids" in batch
Collaborator

Can we assert the actual values here, and in the other tests as well?
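Value-level assertions along these lines would catch alignment bugs that shape checks miss. A sketch only: the token ids and labels below are made up, not taken from the actual test fixtures:

```python
import torch


def check_packed_batch(batch):
    """Assert exact tensor contents for two packed samples of lengths
    3 and 2, not just shapes and key presence."""
    assert batch["input_ids"].tolist() == [[10, 11, 12, 20, 21]]
    # position ids restart at 0 for the second packed sample
    assert batch["position_ids"].tolist() == [[0, 1, 2, 0, 1]]
    # the classification collator keeps labels unshifted and unmasked
    assert batch["labels"].tolist() == [[1, 1, 1, 0, 0]]
```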

import torch


IGNORE_INDEX = -100
Collaborator

Can we import this constant from veomni instead of redefining it here?
