assert [tok.text for tok in tokens] == [ AssertionError: Spacy and Stanza word mismatch #7

@AnnaWegmann

Description

Hi! Maybe you can help me with the following:

After creating a conda environment with

conda create --name aw_value python=3.10.13
conda activate aw_value
pip install value-nlp
pip install datasets==2.20.0

I am calling

export TASK_NAME=sst2
export PYTHONHASHSEED=1234
python run_glue.py --model_name_or_path roberta-base --task_name $TASK_NAME --do_eval --max_seq_length 128 --per_device_train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3 --output_dir output/$TASK_NAME/roberta_base --dialect "aave" --morphosyntax --do_train

and get the error

  File "/home/uu_cs_nlpsoc/awegmann/StyleTokenizer/src/run_value.py", line 743, in <module>
    main()
  File "/home/uu_cs_nlpsoc/awegmann/StyleTokenizer/src/run_value.py", line 553, in main
    raw_datasets = raw_datasets.map(
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/dataset_dict.py", line 869, in map
    {
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/dataset_dict.py", line 870, in <dictcomp>
    k: dataset.map(
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3161, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3552, in _map_single
    batch = apply_function_on_filtered_inputs(
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3421, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "/home/uu_cs_nlpsoc/awegmann/StyleTokenizer/src/run_value.py", line 517, in preprocess_function
    conversions1 = [
  File "/home/uu_cs_nlpsoc/awegmann/StyleTokenizer/src/run_value.py", line 518, in <listcomp>
    dialect.convert_sae_to_dialect(example)
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/multivalue/BaseDialect.py", line 193, in convert_sae_to_dialect
    self.update(string)
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/multivalue/BaseDialect.py", line 218, in update
    self.coref_clusters = self.create_coref_cluster(string)
  File "/hpc/local/Rocky8/uu_cs_nlpsoc/miniconda3/envs/aw_value/lib/python3.10/site-packages/multivalue/BaseDialect.py", line 237, in create_coref_cluster
    assert [tok.text for tok in tokens] == [
AssertionError: Spacy and Stanza word mismatch
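For context, the assertion that fires is an equality check between the token strings produced by the spaCy and Stanza pipelines. Below is a minimal sketch of what is being compared (hypothetical function and example values, not the library's actual code); in practice, version differences between the two tokenizers, e.g. around contractions or punctuation, can make the lists diverge and trip the assert:

```python
# Hedged sketch, NOT multivalue's implementation: BaseDialect.create_coref_cluster
# asserts that spaCy and Stanza yield identical surface token strings for the
# same input sentence.

def tokens_match(spacy_tokens: list[str], stanza_tokens: list[str]) -> bool:
    """Return True when both pipelines produce the same token strings."""
    return spacy_tokens == stanza_tokens

# Illustrative values only -- real output depends on the installed model versions.
print(tokens_match(["I", "ca", "n't", "go"], ["I", "ca", "n't", "go"]))   # True
print(tokens_match(["I", "ca", "n't", "go"], ["I", "can", "n't", "go"]))  # False
```

If the tokenizations disagree for even one token, the assertion fails with the "Spacy and Stanza word mismatch" message seen above, which is why pinning the spaCy/Stanza versions (and their models) to those the package was developed against is a common first step when debugging this kind of error.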

Do you have any experience with this error? Does run_glue.py still work for you in your environment? I also had to delete the mapping argument in AfricanAmericanVernacular(mapping, ...).

FYI: I renamed run_glue.py to run_value.py, which is why the traceback refers to run_value.py.
