Skip to content

[Suprising behavior] lazy quantifiers are treated the same way as greedy quantifiers #15

@Gompyn

Description

@Gompyn

I want to use interegular for tokenization in a C-like programming language. In particular, I use the regex \/\*.*?\*\/ to match the block comments. By using ?, I expect the lexer can immediately return a block comment when it see some string constitutes one.

from interegular import parse_pattern

fsm = parse_pattern(r'(?s:\/\*.*?\*\/)').to_fsm()
assert '/**/*/' not in fsm

However, this pattern is treated the same way as if we removed ?. This causes the string /**/*/ to wrongly match the regex, and by the 'most vexing' principle, this pattern eats every character even after a block comment finishes.

This can be seen from the source code that ? are simply ignored.

if self.static_b("*"):
if self.static_b("?"):
pass
return _Repeated(base, 0, None)
elif self.static_b("+"):
if self.static_b("?"):
pass
return _Repeated(base, 1, None)
elif self.static_b("?"):
if self.static_b("?"):
pass
return _Repeated(base, 0, 1)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions