inaccurate action word recognition #6

@vizzerdrix55

SoMeWeTa uses the STTS_IBK tagset for tagging. One of the differences between STTS and STTS_IBK is the tag for action words (AKW), e.g. German lach (Beißwenger, Bartz, Storrer and Westpfahl, 2015).
I tested the accuracy of AKW-tagging with a small sample of tokens. As you can see from the attached results, the accuracy is about 33 %.

You can reproduce the wrong tagging with the following minimal working example, which contains nine sample sentences:

#!/usr/bin/env python
# coding: utf-8
from somajo import Tokenizer, SentenceSplitter
from someweta import ASPTagger

# ## Settings for SoMeWeTa (PoS-Tagger)
#ToDo: update path to language model
model = "german_web_social_media_2018-12-21.model"
asptagger = ASPTagger()
asptagger.load(model)

# ## Settings for SoMaJo (Tokenizer)
tokenizer = Tokenizer(split_camel_case=False,
                      token_classes=False, extra_info=False)
sentence_splitter = SentenceSplitter(is_tuple=False)
eos_tags = set(["post"])

# generate PoS-Tags
def getPos_tag(content):
    tokens = tokenizer.tokenize_paragraph(content)
    sentences = sentence_splitter.split_xml(tokens, eos_tags)
    tagged_sentences = []
    for sentence in sentences:
        tagged_sentences.append(asptagger.tag_xml_sentence(sentence))
    return tagged_sentences

#test sentences from authentic German CMC-data
sentences = ["Also das schlägt ja wohl dem Fass den Boden aus! :haeh:",
             "das mehr oder weniger gute Dlc gabs noch gratis dazu.",
             "Aus der Liste: definitiv Brink, obwohls für kurze Zeit Spaß gemacht "
             "hat, aber im Nachhinein hab ichs doch sehr bereut.",
             "*schluchz, heul*",
             "endlich, und dann noch als standalone-addon *freu*",
             "Und immer schön mit den Holländer zocken, da gabs die besten Preise.",
             "Ich freu mich riesig und weiß was ich im Wintersemester "
             "jeden Tag machen werde!!",
             "alles oben in der liste gabs unter bf2 auch schon in einer form.",
             "Mit dem Account werden weitere Features im Online-Modus des FM11 "
             "freigeschaltet, bswp mehr Statistiken, mehr Aktionskarten, mögliche "
             "Fantasy-Liga, yadda, yadda."]

akws = []
for sentence in sentences:
    tagged_sentences = getPos_tag(sentence)
    # only the first sentence returned by the splitter is inspected
    tagged_sentence = tagged_sentences[0]
    for word in tagged_sentence:
        # append to list akws if the entry carries the PoS tag 'AKW'
        if 'AKW' in word:
            akws.append(word)
print("tagged as AKW:", akws)

The output list akws contains two correct action words ('heul' and 'freu'). The other entries are false positives: 'Haeh' is an emoticon, 'gabs' and 'obwohls' are contractions, and 'bswp' is an abbreviation of German 'beispielsweise'.
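
For reference, the roughly 33 % can be recomputed from the token types named above. This is a minimal sketch that assumes each type counts once (the actual output may repeat tokens such as 'gabs', and the attached results may be based on a different sample). It measures precision, i.e. the share of AKW predictions that are correct, and says nothing about action words the tagger missed. The dictionary and the function name `akw_precision` are illustrative, not part of SoMeWeTa.

```python
# Token types named in this issue as having been tagged AKW, with a manual
# judgement of whether AKW is the correct tag (True) or not (False).
# Assumption: each type is counted once; the real output may repeat tokens.
predicted_akw = {
    "heul": True,      # genuine action word
    "freu": True,      # genuine action word
    "Haeh": False,     # emoticon
    "gabs": False,     # contraction
    "obwohls": False,  # contraction
    "bswp": False,     # abbreviation of "beispielsweise"
}


def akw_precision(judgements):
    """Return the share of AKW predictions that were judged correct."""
    if not judgements:
        raise ValueError("no AKW predictions to evaluate")
    return sum(judgements.values()) / len(judgements)


precision = akw_precision(predicted_akw)
print(f"AKW precision: {precision:.1%}")  # prints "AKW precision: 33.3%"
```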

Is this serious enough to be considered an issue, or have I implemented something wrong? As far as I can see, this error is not covered by the error table (Table 4) in Proisl (2018, p. 668).
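
One implementation detail that may be worth ruling out: the example only inspects tagged_sentences[0], so if the sentence splitter returns more than one sentence for an input string, the later ones are never checked. The following is a minimal sketch of a helper that looks at every sentence. It assumes each tagged entry is a tuple with the token first and the tag second; `collect_akw` is a hypothetical name, and the sample data is made up for illustration, not real tagger output.

```python
def collect_akw(tagged_sentences, tag="AKW"):
    """Collect entries carrying `tag` from all tagged sentences."""
    hits = []
    for tagged_sentence in tagged_sentences:
        for entry in tagged_sentence:
            # Assumption: entry[0] is the token and entry[1] is its tag.
            if entry[1] == tag:
                hits.append(entry)
    return hits


# Hypothetical tagger output for one input string that was split into two
# sentences; the tags are invented for illustration only.
example = [
    [("endlich", "ADV"), ("*", "$("), ("freu", "AKW"), ("*", "$(")],
    [("*", "$("), ("heul", "AKW"), ("*", "$(")],
]
print(collect_akw(example))
```

In the minimal example this would replace the inner loop with something like akws.extend(collect_akw(getPos_tag(sentence))). It would not change the false positives listed above, only make sure no sentence is skipped.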

Cited sources:

  • Beißwenger, Michael / Bartz, Thomas / Storrer, Angelika and Westpfahl, Swantje (2015). Tagset and guidelines for the PoS tagging of language data from genres of computer-mediated communication / social media, 19.
  • Proisl, Thomas (2018). SoMeWeTa: A Part-of-Speech Tagger for German Social Media and Web Texts. In: European Language Resources Association (ELRA) (Ed.), Proceedings of the 11th Language Resources and Evaluation Conference (pp. 665–670). Miyazaki, Japan: European Language Resources Association. Retrieved from https://www.aclweb.org/anthology/L18-1106
