Hi - thank you for making this great library! I am trying to use it to find which training data is implicated in the differences between minimal-pair sentences of the form:
the keys to the cabinet are on the table
the keys to the cabinet is on the table
where I just want to look at what factors affect "is" vs. "are". This clearly requires changing the wikitext example, where the eval/dev set was simply grouped into fixed-length chunks rather than one sentence per row. Could I simply pad all my queries to some fixed max length and then proceed as normal, or is there something else I should do?
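For context, the grouping step in the wikitext example does roughly this, if I'm reading it right (a paraphrase of the Hugging Face `group_texts` helper; `block_size` here is a placeholder for whatever the example actually sets):

```python
block_size = 128  # placeholder; the example uses its own setting

def group_texts(examples):
    # Concatenate every tokenized field (input_ids, attention_mask, ...)
    # across the batch, then slice into fixed-length blocks, so each row
    # ends up being a chunk rather than a sentence.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder so everything divides evenly into blocks.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
```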
I tried the padding idea but got some weird matmul dimension errors (I was only trying with 4 examples in my dev set):
the toys on the table are
the toys on the table is
i think the toy on the table is
i think the toy on the table are
and then:
```python
# (tokenizer and test_dataset are created earlier, e.g. via AutoTokenizer
# and load_dataset)

def tokenize_function(examples):
    # Pad every sentence to a fixed length of 128 tokens.
    return tokenizer(examples["text"], padding="max_length", max_length=128)

def add_labels(examples):
    # Use the (padded) input ids as the causal LM labels.
    examples["labels"] = examples["input_ids"].copy()
    return examples

tokenized_test_dataset = test_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=None,
    remove_columns=test_dataset["test"].column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on dataset",
    batch_size=4,
)
tokenized_test_dataset = tokenized_test_dataset.map(
    add_labels,
    batched=True,
    num_proc=None,
    load_from_cache_file=True,
    batch_size=4,
)
```

When I then run the pairwise score computation, this is the error I get:
```
RuntimeError: The size of tensor a (4) must match the size of tensor b (512) at non-singleton dimension 1
```
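One thing I wasn't sure about (not sure if it's related to the error, but flagging it): whether the labels also need the pad positions masked out with -100, which I understand is the usual Hugging Face convention so the cross-entropy loss skips padding - i.e. something like this instead of the plain copy above:

```python
def add_labels(examples):
    # Same as before, but replace pad positions with -100 so the
    # loss is only computed over real tokens (standard HF convention).
    examples["labels"] = [
        [tok if m == 1 else -100 for tok, m in zip(ids, mask)]
        for ids, mask in zip(examples["input_ids"], examples["attention_mask"])
    ]
    return examples
```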
Any assistance would be much appreciated - please let me know if I should share more details!