
Add initial implementation for Wikipedia infobox synchronization #449

Open
loka1 wants to merge 2 commits into main from InfoboxSync

Conversation


@loka1 loka1 commented Aug 28, 2025

Introduce the foundational structure for a Wikipedia infobox synchronization tool, including stages for mapping, saving, parsing, and publishing data. Implement data models and fetchers to handle synchronization between Arabic and English Wikipedia pages. Include infobox parsers and a factory for creating appropriate parsers based on template types. Add logging for better traceability during data operations.
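The parser factory described above can be sketched roughly as follows. All class and template names here are illustrative assumptions, not the PR's actual identifiers:

```python
# Hypothetical sketch of the "factory for creating appropriate parsers based
# on template types" idea; the real classes in the PR may differ.
from abc import ABC, abstractmethod


class InfoboxParser(ABC):
    @abstractmethod
    def parse(self, wikitext: str) -> dict:
        """Turn raw infobox wikitext into a field dictionary."""


class PersonInfoboxParser(InfoboxParser):
    def parse(self, wikitext: str) -> dict:
        return {"template": "person", "raw": wikitext}


class GenericInfoboxParser(InfoboxParser):
    def parse(self, wikitext: str) -> dict:
        return {"template": "generic", "raw": wikitext}


# Registry mapping template types to parser classes.
_PARSERS = {"person": PersonInfoboxParser}


def make_parser(template_type: str) -> InfoboxParser:
    """Factory: pick a parser by template type, falling back to a generic one."""
    return _PARSERS.get(template_type, GenericInfoboxParser)()
```

Unknown template types fall back to the generic parser, so new templates degrade gracefully instead of raising.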

"""
logger.warning("fetch_data(url) is deprecated. Use fetch_wikipedia_data(page_title) instead.")
# Extract page title from URL (simple implementation)
if 'wikipedia.org' in url:

Check failure
Code scanning / CodeQL: Incomplete URL substring sanitization (High)
The string "wikipedia.org" may be at an arbitrary position in the sanitized URL.

Copilot Autofix (AI, 7 months ago)

The best fix is to parse the incoming URL using Python's urllib.parse.urlparse, and ensure the hostname is exactly "wikipedia.org" or an allowed subdomain (e.g., "en.wikipedia.org"). This avoids matching URLs that merely contain the string in the wrong position. For this code, replace the substring check with a host check after parsing the URL. If the hostname matches "wikipedia.org" or a defined allowlist of trusted Wikipedia hostnames, proceed; otherwise, raise the ValueError. Update only the block in the function fetch_data, leaving other code untouched.

Implementation steps:

  • Add the import from urllib.parse import urlparse if not already present.
  • Replace line(s) where 'wikipedia.org' in url is checked.
  • Instead, use urlparse(url).hostname and check if it matches the set of allowed hosts.
  • Define a list or set of allowed Wikipedia hostnames (e.g., en.wikipedia.org, ar.wikipedia.org).
  • Only proceed if the parsed hostname is in this allowlist.
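The steps above can be sketched as a standalone check. This is a minimal illustration of why the hostname allowlist is stronger than the substring test, assuming a hypothetical helper name:

```python
from urllib.parse import urlparse

# Allowlist of trusted Wikipedia hostnames (extend as needed).
ALLOWED_WIKIPEDIA_HOSTS = {"wikipedia.org", "en.wikipedia.org", "ar.wikipedia.org"}


def is_wikipedia_url(url: str) -> bool:
    """Return True only when the URL's parsed hostname is on the allowlist."""
    return urlparse(url).hostname in ALLOWED_WIKIPEDIA_HOSTS


# The naive substring check matches a malicious URL...
assert "wikipedia.org" in "https://evil.example/en.wikipedia.org/wiki/Python"
# ...while the hostname check rejects it and still accepts real pages.
assert not is_wikipedia_url("https://evil.example/en.wikipedia.org/wiki/Python")
assert is_wikipedia_url("https://en.wikipedia.org/wiki/Python")
```

Because `urlparse` isolates the hostname, "wikipedia.org" appearing in the path or as a subdomain of another host no longer passes the check.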

Suggested changeset 1
tasks/InfoboxSync/fetch/__init__.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tasks/InfoboxSync/fetch/__init__.py b/tasks/InfoboxSync/fetch/__init__.py
--- a/tasks/InfoboxSync/fetch/__init__.py
+++ b/tasks/InfoboxSync/fetch/__init__.py
@@ -2,7 +2,7 @@
 
 import logging
 from typing import Dict, Any
-
+from urllib.parse import urlparse
 from .sync_fetcher import WikipediaSyncFetcher
 from .models import PageInfo, SyncResult
 
@@ -44,9 +44,15 @@
     Now expects a page title instead of URL.
     """
     logger.warning("fetch_data(url) is deprecated. Use fetch_wikipedia_data(page_title) instead.")
-    # Extract page title from URL (simple implementation)
-    if 'wikipedia.org' in url:
-        page_title = url.split('/')[-1].replace('_', ' ')
+    # Extract page title from URL (safe implementation)
+    allowed_wikipedia_hosts = {
+        "ar.wikipedia.org",
+        "en.wikipedia.org",
+        "wikipedia.org"
+    }
+    parsed = urlparse(url)
+    if parsed.hostname in allowed_wikipedia_hosts:
+        page_title = parsed.path.split('/')[-1].replace('_', ' ')
         return fetch_wikipedia_data(page_title)
     else:
         raise ValueError("URL must be a Wikipedia page URL")
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
"""
logger.warning("fetch_data(url) is deprecated. Use fetch_wikipedia_data(page_title) instead.")
# Extract page title from URL (simple implementation)
if 'wikipedia.org' in url:

Check failure
Code scanning / CodeQL: Incomplete URL substring sanitization (High)
The string "wikipedia.org" may be at an arbitrary position in the sanitized URL.

Copilot Autofix (AI, 7 months ago)

To properly ensure the incoming URL is a Wikipedia page, we should parse the URL using urllib.parse and check that the hostname (not just any substring) matches the expected Wikipedia domains (such as wikipedia.org or with language prefix like en.wikipedia.org, ar.wikipedia.org, etc.). Edit the legacy function in tasks/InfoboxSync/fetch/fetch.py, specifically line 237 and associated logic, to (a) import urlparse from urllib.parse, (b) parse the URL, and (c) check that its netloc ends with .wikipedia.org. This fix should be implemented in the fetch_data function. Add the required import at the top if not present.
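The suffix check described here can be sketched in isolation (the function name is illustrative, not the PR's). Note that `.endswith(".wikipedia.org")` with the leading dot, as in the patch below, accepts any language subdomain but rejects a bare `wikipedia.org` host and look-alike domains:

```python
from urllib.parse import urlparse


def extract_wikipedia_title(url: str) -> str:
    """Accept any *.wikipedia.org host, then take the last path segment as the title."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ""
    if not hostname.endswith(".wikipedia.org"):
        raise ValueError("URL must be a Wikipedia page URL")
    return parsed.path.split("/")[-1].replace("_", " ")


title = extract_wikipedia_title("https://ar.wikipedia.org/wiki/Machine_learning")
```

The `or ""` guard matters: `urlparse` returns `None` for the hostname of a schemeless string, and calling `.endswith` on `None` would raise `AttributeError` instead of the intended `ValueError`.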

Suggested changeset 1
tasks/InfoboxSync/fetch/fetch.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tasks/InfoboxSync/fetch/fetch.py b/tasks/InfoboxSync/fetch/fetch.py
--- a/tasks/InfoboxSync/fetch/fetch.py
+++ b/tasks/InfoboxSync/fetch/fetch.py
@@ -1,5 +1,6 @@
 import logging
 from abc import ABC, abstractmethod
+from urllib.parse import urlparse
 from typing import Dict, Optional, Any
 from dataclasses import dataclass
 
@@ -233,9 +234,11 @@
     Now expects a page title instead of URL.
     """
     logger.warning("fetch_data(url) is deprecated. Use fetch_wikipedia_data(page_title) instead.")
-    # Extract page title from URL (simple implementation)
-    if 'wikipedia.org' in url:
-        page_title = url.split('/')[-1].replace('_', ' ')
+    # Extract page title from URL by parsing and checking domain
+    parsed_url = urlparse(url)
+    hostname = parsed_url.hostname or ""
+    if hostname.endswith(".wikipedia.org"):
+        page_title = parsed_url.path.split('/')[-1].replace('_', ' ')
         return fetch_wikipedia_data(page_title)
     else:
         raise ValueError("URL must be a Wikipedia page URL")
\ No newline at end of file
EOF
"run_wikipedia_pipeline(page_title) instead.")
logger.warning(msg)

if 'wikipedia.org' in url and '/wiki/' in url:

Check failure
Code scanning / CodeQL: Incomplete URL substring sanitization (High, test)
The string "wikipedia.org" may be at an arbitrary position in the sanitized URL.

Copilot Autofix (AI, 7 months ago)

The best way to fix the problem is to properly parse the url string using standard library URL parsing (for example, Python's urllib.parse.urlparse) and then validate that its netloc (domain) is a valid Wikipedia host, and its path starts with /wiki/. This approach mitigates inappropriate matches and ensures only correct Wikipedia URLs are accepted. The code in the run_pipeline function, specifically lines 157-159, should be changed to use urlparse, check that the domain matches the Wikipedia pattern (ideally, ends with wikipedia.org or matches a regex/allowlist), and that the path starts with /wiki/ before extracting the title. urllib.parse should be imported if not already present.
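A standalone sketch of the combined netloc-and-path check, assuming a hypothetical helper name. One deliberate tightening over the patch below: checking `.endswith(".wikipedia.org")` with a leading dot, since the patch's bare `endswith("wikipedia.org")` would also accept a host like `evilwikipedia.org`:

```python
from typing import Optional
from urllib.parse import urlparse


def wiki_page_title(url: str) -> Optional[str]:
    """Return the title for a *.wikipedia.org /wiki/ page URL, or None otherwise."""
    parsed = urlparse(url)
    netloc = parsed.netloc.lower()
    # Require both a trusted host suffix and the canonical /wiki/ article path.
    if netloc.endswith(".wikipedia.org") and parsed.path.startswith("/wiki/"):
        return parsed.path[len("/wiki/"):].replace("_", " ")
    return None
```

Returning `None` rather than raising keeps the caller's `if/else` dispatch simple; the pipeline code can still raise its own `ValueError` on a `None` result.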

Suggested changeset 1
tasks/InfoboxSync/test.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tasks/InfoboxSync/test.py b/tasks/InfoboxSync/test.py
--- a/tasks/InfoboxSync/test.py
+++ b/tasks/InfoboxSync/test.py
@@ -1,5 +1,6 @@
 import logging
 from fetch import fetch_wikipedia_data
+from urllib.parse import urlparse
 from parse.parse import parse_data
 from map.map import map_data
 from translate.translate import translate_data
@@ -154,8 +155,12 @@
            "run_wikipedia_pipeline(page_title) instead.")
     logger.warning(msg)
 
-    if 'wikipedia.org' in url and '/wiki/' in url:
-        page_title = url.split('/wiki/')[-1].replace('_', ' ')
+    parsed_url = urlparse(url)
+    netloc = parsed_url.netloc.lower()
+    path = parsed_url.path
+    # Accept subdomains like en.wikipedia.org, ar.wikipedia.org, etc.
+    if netloc.endswith("wikipedia.org") and path.startswith("/wiki/"):
+        page_title = path[len("/wiki/"):].replace('_', ' ')
         return run_wikipedia_pipeline(page_title, target_lang, output_dir)
     else:
         msg = ("URL must be a Wikipedia page URL "
EOF
api_key = os.getenv(env_var)
if api_key:
self.config[service]['api_key'] = api_key
logger.info(f"Loaded API key for {service} from {env_var}")

Check failure
Code scanning / CodeQL: Clear-text logging of sensitive information (High)
This expression logs sensitive data (password) as clear text.

Copilot Autofix (AI, 7 months ago)

To fix the problem, remove or modify the logger statement so that it does not log any potentially sensitive information, including the name of the environment variable or even which service API key was loaded. Instead, if it is necessary to log progress, a generic "Loaded translation service API key" message can be used that does not include details. The best approach is to simply omit this log message altogether, since it's not providing essential runtime information and could encourage risky patterns if copied. The only required change is to remove or replace line 53 in the _load_from_env method of the TranslationConfig class in tasks/InfoboxSync/translate/config.py.

No new methods, imports, or definitions are required.
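If a log line is genuinely needed, an alternative to removing it outright is to log a masked value. A minimal sketch with a hypothetical helper (not part of the PR):

```python
import logging

logger = logging.getLogger(__name__)


def mask_secret(secret: str, visible: int = 4) -> str:
    """Show only the last few characters so the log stays useful without leaking the key."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Record that a key was loaded without recording the key itself.
api_key = "sk-1234567890abcdef"  # example value, not a real credential
logger.info("Loaded translation API key (%s)", mask_secret(api_key))
```

Even masked logging is a judgment call; for short or low-entropy secrets, omitting the log line entirely (as the autofix does) is the safer default.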

Suggested changeset 1
tasks/InfoboxSync/translate/config.py

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/tasks/InfoboxSync/translate/config.py b/tasks/InfoboxSync/translate/config.py
--- a/tasks/InfoboxSync/translate/config.py
+++ b/tasks/InfoboxSync/translate/config.py
@@ -50,7 +50,7 @@
                     api_key = os.getenv(env_var)
                     if api_key:
                         self.config[service]['api_key'] = api_key
-                        logger.info(f"Loaded API key for {service} from {env_var}")
+                        # logger.info("Loaded API key for translation service.")  # (Commented out to avoid leaking info)
                         break
 
         # Other environment variables
EOF
@loka1 loka1 self-assigned this Aug 28, 2025