I am trying to recover the original text from the tokens, but this is not possible because token.original_spelling for a token such as ": (" does not preserve the original number of spaces.
Here is a motivating example:
import somajo
tokenizer = somajo.SoMaJo("de_CMC", split_camel_case=True, split_sentences=True)
paragraph = ["Angebotener Hersteller/Typ: (vom Bieter einzutragen) Im \
Einheitspreis sind alle erforderlichen \
Schutzmaßnahmen bei Errichtung des Brandschutzes einzukalkulieren."]
for sent in tokenizer.tokenize_text(paragraph):
    for token in sent:
        print(token, " --> ", token.original_spelling)
This prints:
Angebotener --> None
Hersteller --> None
/ --> None
Typ --> None
:( --> : (
vom --> None
Bieter --> None
einzutragen --> None
) --> None
Im --> None
Einheitspreis --> None
sind --> None
alle --> None
erforderlichen --> None
Schutzmaßnahmen --> None
bei --> None
Errichtung --> None
des --> None
Brandschutzes --> None
einzukalkulieren --> None
. --> None
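To make the problem concrete, here is a minimal sketch of the kind of reconstruction I have in mind. It does not use SoMaJo's own token objects: Tok is a hypothetical stand-in, and its space_after field is an assumption about what a token would expose. The input with three spaces between ":" and "(" is also made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tok:
    # Hypothetical stand-in for a tokenizer token, not SoMaJo's own class.
    text: str
    original_spelling: Optional[str] = None
    # Assumed flag: was the token followed by whitespace in the input?
    space_after: bool = True


def detokenize(tokens):
    """Rebuild text, preferring original_spelling over the normalized text."""
    parts = []
    for tok in tokens:
        spelling = tok.original_spelling
        parts.append(spelling if spelling is not None else tok.text)
        if tok.space_after:
            parts.append(" ")
    return "".join(parts).rstrip()


# Illustrative input: three spaces between ':' and '('.
original = "Typ:   (vom Bieter"
tokens = [
    Tok("Typ", space_after=False),
    # original_spelling holds a single space, as in the output above.
    Tok(":(", original_spelling=": (", space_after=False),
    Tok("vom"),
    Tok("Bieter"),
]

rebuilt = detokenize(tokens)
print(repr(rebuilt))
print(rebuilt == original)
```

Because the spelling stored on the token contains only one space, the rebuilt string cannot match an input that had more than one.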
It would be great if this could somehow be resolved. Thanks!