Special characters * and $ not matched in URI #57

@sebastian-nagel

Description

Section 2.2.3 (Special Characters) of the RFC contains two examples of path matching for paths that contain the special characters * and $. In both examples the character is percent-encoded in the allow/disallow rule but left unencoded in the URL/URI to be matched. The robots.txt parser and matcher do not follow these examples: they fail to match the percent-encoded characters in the rule against the unencoded ones in the URI. See the unit test below.

* and $ are among the reserved characters in URIs (RFC 3986, section 2.2) and therefore cannot be percent-encoded without potentially changing the semantics of the URI.

diff --git a/robots_test.cc b/robots_test.cc
index 35853de..3a37813 100644
--- a/robots_test.cc
+++ b/robots_test.cc
@@ -492,6 +492,19 @@ TEST(RobotsUnittest, ID_SpecialCharacters) {
     EXPECT_FALSE(
         IsUserAgentAllowed(robotstxt, "FooBot", "http://foo.bar/foo/quz"));
   }
+  {
+    const absl::string_view robotstxt =
+        "User-agent: FooBot\n"
+        "Disallow: /path/file-with-a-%2A.html\n"
+        "Disallow: /path/foo-%24\n"
+        "Allow: /\n";
+    EXPECT_FALSE(
+        IsUserAgentAllowed(robotstxt, "FooBot",
+                           "https://www.example.com/path/file-with-a-*.html"));
+    EXPECT_FALSE(
+        IsUserAgentAllowed(robotstxt, "FooBot",
+                           "https://www.example.com/path/foo-$"));
+  }
 }
 
 // Google-specific: "index.html" (and only that) at the end of a pattern is
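
For illustration, below is a minimal, self-contained sketch of one way a rule pattern could be tokenized so that %2A and %24 in the rule match the literal characters * and $ in the URI, as the RFC examples require, while a raw * stays a wildcard and a trailing raw $ stays an end anchor. This is not the library's actual matcher; the names Tokenize, Match, and PatternMatches are hypothetical.

// Minimal sketch, not the library's actual code: decode %2A/%24 on the
// rule side only and compare the URI path as-is, since decoding reserved
// characters in the URI could change its meaning.
#include <cassert>
#include <string>
#include <vector>

namespace {

enum class TokenType { kLiteral, kWildcard, kEndAnchor };

struct Token {
  TokenType type;
  char ch;  // Only meaningful for kLiteral.
};

// Tokenize an Allow/Disallow pattern. "%2A" and "%24" become the literal
// characters '*' and '$'; a raw '*' is a wildcard; a raw '$' at the very
// end anchors the match.
std::vector<Token> Tokenize(const std::string& pattern) {
  std::vector<Token> tokens;
  for (size_t i = 0; i < pattern.size(); ++i) {
    if (pattern.compare(i, 3, "%2A") == 0 ||
        pattern.compare(i, 3, "%2a") == 0) {
      tokens.push_back({TokenType::kLiteral, '*'});
      i += 2;
    } else if (pattern.compare(i, 3, "%24") == 0) {
      tokens.push_back({TokenType::kLiteral, '$'});
      i += 2;
    } else if (pattern[i] == '*') {
      tokens.push_back({TokenType::kWildcard, 0});
    } else if (pattern[i] == '$' && i + 1 == pattern.size()) {
      tokens.push_back({TokenType::kEndAnchor, 0});
    } else {
      tokens.push_back({TokenType::kLiteral, pattern[i]});
    }
  }
  return tokens;
}

// Simple backtracking matcher: does the token list match a prefix of
// `path` (or, with an end anchor, the whole of it)?
bool Match(const std::vector<Token>& tokens, size_t ti,
           const std::string& path, size_t pi) {
  if (ti == tokens.size()) return true;  // Pattern consumed: prefix match.
  const Token& t = tokens[ti];
  switch (t.type) {
    case TokenType::kEndAnchor:
      return pi == path.size();
    case TokenType::kWildcard:
      for (size_t k = pi; k <= path.size(); ++k)
        if (Match(tokens, ti + 1, path, k)) return true;
      return false;
    case TokenType::kLiteral:
      return pi < path.size() && path[pi] == t.ch &&
             Match(tokens, ti + 1, path, pi + 1);
  }
  return false;
}

bool PatternMatches(const std::string& pattern, const std::string& path) {
  return Match(Tokenize(pattern), 0, path, 0);
}

}  // namespace

int main() {
  // The two RFC examples from this report: the encoded rule should match
  // the unencoded URI path.
  assert(PatternMatches("/path/file-with-a-%2A.html",
                        "/path/file-with-a-*.html"));
  assert(PatternMatches("/path/foo-%24", "/path/foo-$"));
  // A raw '*' in the rule is still a wildcard, and a trailing raw '$'
  // still anchors the match.
  assert(PatternMatches("/path/*.html", "/path/file-with-a-*.html"));
  assert(!PatternMatches("/path/foo$", "/path/foo-$"));
  return 0;
}

The key design point in the sketch is that decoding happens on the rule side only, and only for %2A and %24, so the wildcard and anchor meanings of raw * and $ are preserved.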
