Hi - thank you for making this great library! I am trying to use it to find which training data is implicated in the differences between minimal-pair sentences of the form:
the keys to the cabinet are on the table
the keys to the cabinet is on the table
where I just want to look at what factors affect "is" vs. "are". This clearly requires changing the wikitext example, where the eval/dev set was simply grouped into fixed-length chunks rather than one sentence per row. Could I simply pad all my queries to some fixed max length and then proceed as normal, or is there something else I should do?
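For context, the grouping step in the wikitext example does roughly this, if I'm reading it right (a paraphrase of the Hugging Face `group_texts` helper; `block_size` here is a placeholder for whatever the example actually sets):

```python
block_size = 128  # placeholder; the example uses its own setting

def group_texts(examples):
    # Concatenate every tokenized field (input_ids, attention_mask, ...)
    # across the batch, then slice into fixed-length blocks, so each row
    # ends up being a chunk rather than a sentence.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder so everything divides evenly into blocks.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
```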
I tried the padding idea but got some weird matmul dimension errors (I was only trying with 4 examples in my dev set):
the toys on the table are
the toys on the table is
i think the toy on the table is
i think the toy on the table are
and then:
```python
# (tokenizer and test_dataset are created earlier, e.g. via AutoTokenizer
# and load_dataset)

def tokenize_function(examples):
    # Pad every sentence to a fixed length of 128 tokens.
    return tokenizer(examples["text"], padding="max_length", max_length=128)

def add_labels(examples):
    # Use the (padded) input ids as the causal LM labels.
    examples["labels"] = examples["input_ids"].copy()
    return examples

tokenized_test_dataset = test_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=None,
    remove_columns=test_dataset["test"].column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on dataset",
    batch_size=4,
)
tokenized_test_dataset = tokenized_test_dataset.map(
    add_labels,
    batched=True,
    num_proc=None,
    load_from_cache_file=True,
    batch_size=4,
)
```

When I then run the pairwise score computation, this is the error I get:
```
RuntimeError: The size of tensor a (4) must match the size of tensor b (512) at non-singleton dimension 1
```
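One thing I wasn't sure about (not sure if it's related to the error, but flagging it): whether the labels also need the pad positions masked out with -100, which I understand is the usual Hugging Face convention so the cross-entropy loss skips padding - i.e. something like this instead of the plain copy above:

```python
def add_labels(examples):
    # Same as before, but replace pad positions with -100 so the
    # loss is only computed over real tokens (standard HF convention).
    examples["labels"] = [
        [tok if m == 1 else -100 for tok, m in zip(ids, mask)]
        for ids, mask in zip(examples["input_ids"], examples["attention_mask"])
    ]
    return examples
```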
Any assistance would be much appreciated - please let me know if I should share more details!