Weird behavior for multiword tokens #6

I'm getting some weird output when segmenting multiword tokens in some languages.

For example, in the Arabic-PADT dev set, the first sentence is tokenized as `#sent_tok: ميراث  ب  300  الف  دولار  يقلب  حياة  متشرد  اميركي  لونغ  بيتش  (  الولايات  المتحدة  )  15  -  7  (  اف  ب  )  -  كل  شيء  تغير  في  حياة  المتشرد  ستيفن  كنت  عندما  عثرت  علي  ه  \\\كككككككك%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%  بعد  عناء  طويل  ل  تبلغ  ه  ب  أن  ه  ورث  300  الف  دولار  و  ب  أن  ه  بات  قادرا  على  وضع  حد  ل  عشرين  سنة  من  حياة  التشرد  في  شوارع  مدينة  لونغ  بيتش  في  ولاية  كاليفورنيا  .`

It includes a multiword token that spans only a single word (?):

    36-36   شقيقته  _       _       _       _       _       _       _       _
    36      \\\كككككككك%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%     _       _       _       _       _       _       _       _

Similarly, in the Hebrew dev set, I have `#sent_tok: נמיר  הודיעה  כי  תפנה  ל  שרי  ה  פנים  ו  ה  עבודה  ו  ה  רווחה  ו  ל  מזכיר  תנועת  ה  מושבים  ,  ב  תביעה  לבטל  את  ))))))ווווווווווווווווווווווו  של  500  עובדים  זרים  מתאילנד  כ  מתנדבים  כ  ביכול  .` 

This also has a multiword token that spans only a single word:

    26-26   הזמנתם  _       _       _       _       _       _       _       _
    26      ))))))ווווווווווווווווווווווו   _       _       _       _       _       _       _       _

I trained the model with the default options given in the README. The tokenization F1 I got is similar to that reported in the shared task, so I suppose the system is mostly working correctly.
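
For anyone who wants to check their own output for the same pattern, here is a minimal sketch that scans CoNLL-U text for multiword-token range lines whose start index is not smaller than the end index (such as `36-36` above). It assumes tab-separated columns as in standard CoNLL-U; the function name `find_degenerate_mwt` is my own and not part of the system discussed here.

```python
import re

# Matches a CoNLL-U multiword-token ID such as "36-37" (or the odd "36-36").
RANGE_ID = re.compile(r"^(\d+)-(\d+)$")


def find_degenerate_mwt(conllu_text):
    """Return (line_no, token_id, form) for each range line with start >= end.

    Assumes tab-separated CoNLL-U columns; comment and blank lines are skipped.
    """
    hits = []
    for line_no, line in enumerate(conllu_text.splitlines(), start=1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        cols = line.strip().split("\t")
        match = RANGE_ID.match(cols[0].strip())
        if match and int(match.group(1)) >= int(match.group(2)):
            form = cols[1].strip() if len(cols) > 1 else ""
            hits.append((line_no, cols[0].strip(), form))
    return hits
```

Calling `find_degenerate_mwt(open(path, encoding="utf-8").read())` on a predicted file (where `path` is whatever output file you have) lists the affected lines, which makes it easy to see how often this happens across a whole dev set.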
