syntactic processing #2

@SennR-1952135

In table_annotator.py on line 632, the original column names are cleaned so that they can be matched against the DBpedia and Schema ontologies. The cleaning is done by the code below:

cleaned_table_columns = [
    re.sub(r"[_-]", " ", " ".join(
        re.findall("[0-9,a-z,.,\"#!$%\^&\*;:{}=\-_`~()\n\t\d]+|[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)", col)
    )).lower() for col in table_columns.copy()
]

I wonder whether the replacement argument of the re.sub() call, currently a space (" "), should instead be an empty string (""). The regex inside findall already matches _ and -, so each of them is returned as part of a token, and " ".join() separates the tokens with spaces. The join keeps the matched _ or - in the string, so re.sub(r"[_-]", " ", ...) then replaces it with yet another space.
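
The chain described above can be reproduced in isolation. The following is a minimal sketch, not code from the repository: `clean_column` and its `replacement` parameter are hypothetical names that mirror the list comprehension for a single column name, the pattern is the one from the snippet rewritten as a raw string (assumed to match the same text), and `clean_column_collapsed` is only one possible alternative.

```python
import re

# Pattern copied from the snippet above, written as a raw string
# (assumption: it matches the same text as the original non-raw literal).
TOKEN_PATTERN = r'[0-9,a-z,.,"#!$%\^&\*;:{}=\-_`~()\n\t\d]+|[A-Z](?:[A-Z]*(?![a-z])|[a-z]*)'


def clean_column(col: str, replacement: str = " ") -> str:
    """Hypothetical helper: the list comprehension applied to one column name."""
    tokens = re.findall(TOKEN_PATTERN, col)
    return re.sub(r"[_-]", replacement, " ".join(tokens)).lower()


def clean_column_collapsed(col: str) -> str:
    """Possible alternative: keep the current substitution, then collapse whitespace runs."""
    return " ".join(clean_column(col).split())


if __name__ == "__main__":
    for name in ["Team-Name", "Team_name", "team-name"]:
        print(
            repr(name),
            "current:", repr(clean_column(name)),
            "empty replacement:", repr(clean_column(name, "")),
            "collapsed:", repr(clean_column_collapsed(name)),
        )
```

If this sketch matches the real behaviour, an empty replacement removes only one of the extra spaces for "Team-Name", and it turns an all-lowercase name such as "team-name" into "teamname", because there the hyphen is the only separator. Collapsing whitespace after the substitution might therefore be the safer change.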

For example:
"Team-Name" would be converted into "team" and "name" separated by multiple spaces instead of a single one. Is this the desired behaviour, or am I missing something? Or is this a bug?
