The model supports standard tab seperated CoNLL format - one word per line, with tokens and labels being separated with a tab. There is a new line between sentences.
For example (CoNLL2010 uncertainty detection):
We C
believe U
that C
this C
server C
is C
useful C
To run seq class training, please run: python seq_class_script.py {config_path} {gpu_id}
Similarly, to generate token level predictions run: python model_to_token_preds.py {config_path} {gpu_id} with two optional
arguments: {layer_id} and {head_id} if running for RoBERTa's
internal attention.
To evaluate on the token level, run: python token_preds_evaluate.py {config_path} {layer_id (optional)} {head_id (optional)}
Sample configs are provided in the configs/ folder for all 3 stages.
Possible entries of the configs are:
experiment_name - name of the experiment being run now
dataset - name of the dataset being used
model_name - name of the HuggingFace BERT model to use
max_seq_length - maximum seq length used by the HuggingFace library
per_device_train_batch_size - train batch size
per_device_eval_batch_size - eval batch size
num_train_epochs - number of epochs
warmup_ratio - warmup ratio of the optimiser
learning_rate - training learning rate
weight_decay - weight decay inside BERT during fine tuning
seed - random seed
adam_epsilon - epsilon value for adam optimiser
test_label_dummy - dummy label to use for test dataset
make_all_labels_equal_max - make all tokens labels equal to max
is_seq_class - perform seq class or token class
lowercase - make all tokens lowercase
gradient_accumulation_steps - over how many steps to accumulate gradient
save_steps - how often to save checkpoints
logging_steps - how often to log
output_dir - where to save the model
do_mask_words - mask out some words with a token
mask_prob - probability of a word being masked
hid_to_attn_dropout - dropout between soft attention layer and BERT
attention_evidence_size - size of the attention weights layer
final_hidden_layer_size - size of the final output hidden layer
initializer_name - how to initialise weights and bias
attention_activation - function to use for soft attention activation
soft_attention - apply soft attention architecture on top
soft_attention_gamma - how much imporatance to give to optimise max token label to be equal to max sentence label
soft_attention_alpha- importance of optimising min token label to 0
square_attention - square attentions during normalisation
freeze_bert_layers_up_to - freeze all BERT layers up to x
zero_n - make at least n token labels be close to 0.0
zero_delta - importance of the zero_n loss
dataset_split - which dataset to use for predictions/eval
model_path - path to the pretrained model
datetime - datetime to use
results_input_dir - input dir of the the token-level preds
results_input_filename - file in whcih token-level preds are stored in the tsv format
importance_threshold - threshold to use for eval
top_count - only mark x top scores as positive
preds_output_filename - save predictions to file
eval_results_filanamd - save eval results to file
Please note that some of this code is based on the token prediction tutorial for BERT using HuggingFace: https://github.com/huggingface/transformers/tree/master/examples/token-classification
and Marek Rei's mltagger: https://github.com/marekrei/mltagger