Issue with dataset concatenation #43

@jplu

Hi,

First of all, there is a bug at https://github.com/asahi417/tner/blob/master/tner/tner_cl/train.py#L118. The GridSearcher call should be:

trainer = GridSearcher(
    checkpoint_dir=opt.checkpoint_dir,
    dataset=opt.dataset,
    local_dataset=opt.local_dataset,
    dataset_name=opt.dataset_name,
    n_max_config=opt.n_max_config,
    epoch_partial=opt.epoch_partial,
    max_length_eval=opt.max_length_eval,
    dataset_split_train=opt.dataset_split_train,
    dataset_split_valid=opt.dataset_split_valid,
    model=opt.model,
    crf=opt.crf,
    max_length=opt.max_length,
    epoch=opt.epoch,
    batch_size=opt.batch_size,
    lr=opt.lr,
    random_seed=opt.random_seed,
    gradient_accumulation_steps=opt.gradient_accumulation_steps,
    weight_decay=[i if i != 0 else None for i in opt.weight_decay],
    lr_warmup_step_ratio=[i if i != 0 else None for i in opt.lr_warmup_step_ratio],
    max_grad_norm=[i if i != 0 else None for i in opt.max_grad_norm],
    use_auth_token=opt.use_auth_token
)

The dataset_name argument was missing.
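
To illustrate why the omission matters, here is a minimal sketch of how a parsed command-line option is silently dropped when it is not forwarded to a constructor. It assumes the option has a default of None on the receiving side; build_searcher is a hypothetical stand-in, not tner's GridSearcher, and the argparse setup only mimics the two flags used in this report.

```python
import argparse


def build_searcher(dataset=None, dataset_name=None):
    # Hypothetical stand-in for a constructor with optional keyword arguments.
    return {"dataset": dataset, "dataset_name": dataset_name}


parser = argparse.ArgumentParser()
parser.add_argument("-d", "--dataset", nargs="+")
parser.add_argument("--dataset-name", nargs="+")
opt = parser.parse_args(
    ["-d", "tner/wikiann", "tner/tweetner7", "--dataset-name", "ace", "tweetner7"]
)

# The option is parsed but never passed on, so the default (None) is used.
without_fix = build_searcher(dataset=opt.dataset)

# Forwarding the option makes the user's choice visible to the callee.
with_fix = build_searcher(dataset=opt.dataset, dataset_name=opt.dataset_name)

print(without_fix["dataset_name"])
print(with_fix["dataset_name"])
```

In this sketch no error is raised in the unfixed case, which is why this kind of bug is easy to miss.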

Second, when I train a model over two different datasets, they are not properly concatenated. Here is a simple example to reproduce:

tner-train-search -m "xlm-roberta-base" -c "output/" -d "tner/wikiann" "tner/tweetner7" --dataset-name "ace" "tweetner7" -e 15 --epoch-partial 5 --n-max-config 3 -b 32 -g 2 4 --lr 1e-6 1e-5 --crf 0 1 --max-grad-norm 0 10 --weight-decay 0 1e-7

According to the logs we get:

encode all the data: 7111

7111 is the size of the train_all split of the tner/tweetner7 dataset. The expected size is 100 + 7111 = 7211, the former being the size of the train split of the ace subset of tner/wikiann.
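
As a rough sketch of the expected behavior, the snippet below concatenates two training splits and checks the resulting size. It uses plain Python lists as stand-in data with the sizes from this report; concat_train_splits is a hypothetical helper, not a function from tner, and the cause of the bug in the library itself is not confirmed here.

```python
from itertools import chain


def concat_train_splits(splits):
    """Concatenate per-dataset lists of training examples into one list."""
    return list(chain.from_iterable(splits))


# Stand-in data with the sizes mentioned above (not the real datasets).
wikiann_ace_train = [{"tokens": ["a"], "tags": [0]}] * 100
tweetner7_train_all = [{"tokens": ["b"], "tags": [0]}] * 7111

merged = concat_train_splits([wikiann_ace_train, tweetner7_train_all])

# The merged size should be the sum of the parts, not the size of the last one.
assert len(merged) == len(wikiann_ace_train) + len(tweetner7_train_all)
print(len(merged))  # 7211
```

A size check like the assertion above, run just before encoding, would catch the case where one dataset overwrites the other instead of being appended.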

I don't know if this is an easy fix or not. I will be happy to help if needed.
