LM training data attribution question for individual sentences #39

@kanishkamisra

Hi, thank you for making this great library! I am trying to use it to identify the training data responsible for differences between minimal-pair sentences of the form:

the keys to the cabinet are on the table
the keys to the cabinet is on the table

where I just want to look at what factors affect "is" vs. "are". This would clearly require changes to the wikitext example, where the eval/dev set is simply grouped into chunks of fixed-length sequences rather than one sentence per row. Could I simply pad all my queries to some fixed max length and then proceed as normal, or is there something else I should do?

I tried the padding idea but got some odd matmul dimension errors (I was testing with just 4 examples in my dev set):

the toys on the table are
the toys on the table is
i think the toy on the table is
i think the toy on the table are

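and then tokenized them like this:

def tokenize_function(examples):
    # Pad every query to a fixed length of 128 tokens.
    return tokenizer(examples["text"], padding="max_length", max_length=128)

def add_labels(examples):
    # Causal LM objective: labels are a copy of input_ids (padding positions included).
    examples["labels"] = examples["input_ids"].copy()
    return examples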

tokenized_test_dataset = test_dataset.map(
    tokenize_function,
    batched=True,
    num_proc=None,
    remove_columns=test_dataset["test"].column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on dataset",
    batch_size=4
)

tokenized_test_dataset = tokenized_test_dataset.map(
    add_labels,
    batched=True,
    num_proc=None,
    load_from_cache_file=True,
    batch_size=4
)

When I then run the pairwise score computation, this is the error I get:

RuntimeError: The size of tensor a (4) must match the size of tensor b (512) at non-singleton dimension 1

Any assistance would be much appreciated - please let me know if I should share more details!
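
For reference, one variant of `add_labels` that is commonly paired with fixed-length padding is sketched below. It is only a sketch under assumptions: it assumes the tokenizer output contains an `attention_mask` column and that the loss used for scoring skips label `-100` (the default `ignore_index` of PyTorch's cross-entropy), which may or may not hold for this library's task definition. The name `add_masked_labels` is hypothetical, and this is not claimed to resolve the dimension error above.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss (assumed to be honored by the loss)

def add_masked_labels(examples):
    """Copy input_ids into labels, replacing padded positions with IGNORE_INDEX.

    Hypothetical helper: expects a batched dict with "input_ids" and
    "attention_mask", each a list of equal-length lists of ints.
    """
    examples["labels"] = [
        [tok if keep == 1 else IGNORE_INDEX for tok, keep in zip(ids, mask)]
        for ids, mask in zip(examples["input_ids"], examples["attention_mask"])
    ]
    return examples

# Toy batch: one sequence of three real tokens followed by two pad tokens.
batch = {"input_ids": [[5, 6, 7, 0, 0]], "attention_mask": [[1, 1, 1, 0, 0]]}
print(add_masked_labels(batch)["labels"])  # [[5, 6, 7, -100, -100]]
```

Since it takes and returns a batched dict, it could be passed to `map(..., batched=True)` in place of `add_labels`, so that padding tokens would not contribute to the query loss.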
