Add initial implementation for Wikipedia infobox synchronization #449
Conversation
| """ | ||
| logger.warning("fetch_data(url) is deprecated. Use fetch_wikipedia_data(page_title) instead.") | ||
| # Extract page title from URL (simple implementation) | ||
| if 'wikipedia.org' in url: |
Check failure — Code scanning / CodeQL: Incomplete URL substring sanitization (High)
Copilot Autofix (AI · 7 months ago)
The best fix is to parse the incoming URL using Python's urllib.parse.urlparse, and ensure the hostname is exactly "wikipedia.org" or an allowed subdomain (e.g., "en.wikipedia.org"). This avoids matching URLs that merely contain the string in the wrong position. For this code, replace the substring check with a host check after parsing the URL. If the hostname matches "wikipedia.org" or a defined allowlist of trusted Wikipedia hostnames, proceed; otherwise, raise the ValueError. Update only the block in the function fetch_data, leaving other code untouched.
Implementation steps:
- Add the `from urllib.parse import urlparse` import if not already present.
- Replace the line(s) where `'wikipedia.org' in url` is checked.
- Instead, use `urlparse(url).hostname` and check whether it matches the set of allowed hosts.
- Define a list or set of allowed Wikipedia hostnames (e.g., `en.wikipedia.org`, `ar.wikipedia.org`).
- Only proceed if the parsed hostname is in this allowlist.
```diff
@@ -2,7 +2,7 @@
 import logging
 from typing import Dict, Any

+from urllib.parse import urlparse
 from .sync_fetcher import WikipediaSyncFetcher
 from .models import PageInfo, SyncResult

@@ -44,9 +44,15 @@
         Now expects a page title instead of URL.
         """
         logger.warning("fetch_data(url) is deprecated. Use fetch_wikipedia_data(page_title) instead.")
-        # Extract page title from URL (simple implementation)
-        if 'wikipedia.org' in url:
-            page_title = url.split('/')[-1].replace('_', ' ')
+        # Extract page title from URL (safe implementation)
+        allowed_wikipedia_hosts = {
+            "ar.wikipedia.org",
+            "en.wikipedia.org",
+            "wikipedia.org"
+        }
+        parsed = urlparse(url)
+        if parsed.hostname in allowed_wikipedia_hosts:
+            page_title = parsed.path.split('/')[-1].replace('_', ' ')
             return fetch_wikipedia_data(page_title)
         else:
             raise ValueError("URL must be a Wikipedia page URL")
```
| """ | ||
| logger.warning("fetch_data(url) is deprecated. Use fetch_wikipedia_data(page_title) instead.") | ||
| # Extract page title from URL (simple implementation) | ||
| if 'wikipedia.org' in url: |
Check failure — Code scanning / CodeQL: Incomplete URL substring sanitization (High)
Copilot Autofix (AI · 7 months ago)
To properly ensure the incoming URL is a Wikipedia page, we should parse the URL using urllib.parse and check that the hostname (not just any substring) matches the expected Wikipedia domains (such as wikipedia.org or with language prefix like en.wikipedia.org, ar.wikipedia.org, etc.). Edit the legacy function in tasks/InfoboxSync/fetch/fetch.py, specifically line 237 and associated logic, to (a) import urlparse from urllib.parse, (b) parse the URL, and (c) check that its netloc ends with .wikipedia.org. This fix should be implemented in the fetch_data function. Add the required import at the top if not present.
```diff
@@ -1,5 +1,6 @@
 import logging
 from abc import ABC, abstractmethod
+from urllib.parse import urlparse
 from typing import Dict, Optional, Any
 from dataclasses import dataclass

@@ -233,9 +234,11 @@
         Now expects a page title instead of URL.
         """
         logger.warning("fetch_data(url) is deprecated. Use fetch_wikipedia_data(page_title) instead.")
-        # Extract page title from URL (simple implementation)
-        if 'wikipedia.org' in url:
-            page_title = url.split('/')[-1].replace('_', ' ')
+        # Extract page title from URL by parsing and checking domain
+        parsed_url = urlparse(url)
+        hostname = parsed_url.hostname or ""
+        if hostname.endswith(".wikipedia.org"):
+            page_title = parsed_url.path.split('/')[-1].replace('_', ' ')
             return fetch_wikipedia_data(page_title)
         else:
             raise ValueError("URL must be a Wikipedia page URL")
```
| "run_wikipedia_pipeline(page_title) instead.") | ||
| logger.warning(msg) | ||
|
|
||
| if 'wikipedia.org' in url and '/wiki/' in url: |
Check failure — Code scanning / CodeQL: Incomplete URL substring sanitization (High, test)
Copilot Autofix (AI · 7 months ago)
The best way to fix the problem is to properly parse the url string using standard library URL parsing (for example, Python's urllib.parse.urlparse) and then validate that its netloc (domain) is a valid Wikipedia host, and its path starts with /wiki/. This approach mitigates inappropriate matches and ensures only correct Wikipedia URLs are accepted. The code in the run_pipeline function, specifically lines 157-159, should be changed to use urlparse, check that the domain matches the Wikipedia pattern (ideally, ends with wikipedia.org or matches a regex/allowlist), and that the path starts with /wiki/ before extracting the title. urllib.parse should be imported if not already present.
```diff
@@ -1,5 +1,6 @@
 import logging
 from fetch import fetch_wikipedia_data
+from urllib.parse import urlparse
 from parse.parse import parse_data
 from map.map import map_data
 from translate.translate import translate_data
@@ -154,8 +155,12 @@
            "run_wikipedia_pipeline(page_title) instead.")
     logger.warning(msg)

-    if 'wikipedia.org' in url and '/wiki/' in url:
-        page_title = url.split('/wiki/')[-1].replace('_', ' ')
+    parsed_url = urlparse(url)
+    netloc = parsed_url.netloc.lower()
+    path = parsed_url.path
+    # Accept subdomains like en.wikipedia.org, ar.wikipedia.org, etc.
+    if netloc.endswith("wikipedia.org") and path.startswith("/wiki/"):
+        page_title = path[len("/wiki/"):].replace('_', ' ')
         return run_wikipedia_pipeline(page_title, target_lang, output_dir)
     else:
         msg = ("URL must be a Wikipedia page URL "
```python
            api_key = os.getenv(env_var)
            if api_key:
                self.config[service]['api_key'] = api_key
                logger.info(f"Loaded API key for {service} from {env_var}")
```
Check failure — Code scanning / CodeQL: Clear-text logging of sensitive information (High)
Copilot Autofix (AI · 7 months ago)
To fix the problem, remove or modify the logger statement so that it does not log any potentially sensitive information, including the name of the environment variable or even which service API key was loaded. Instead, if it is necessary to log progress, a generic "Loaded translation service API key" message can be used that does not include details. The best approach is to simply omit this log message altogether, since it's not providing essential runtime information and could encourage risky patterns if copied. The only required change is to remove or replace line 53 in the _load_from_env method of the TranslationConfig class in tasks/InfoboxSync/translate/config.py.
No new methods, imports, or definitions are required.
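If some audit trail is still wanted, a common middle ground is to log only a masked form of the secret rather than dropping the message entirely. A minimal sketch, with a hypothetical helper and an obviously fake key:

```python
import logging

logger = logging.getLogger("translate.config")

def mask_secret(value: str, visible: int = 4) -> str:
    """Show only the last few characters of a secret, e.g. '****abcd'."""
    if len(value) <= visible:
        return "****"
    return "****" + value[-visible:]

api_key = "sk-example-1234abcd"  # fake value for illustration only
# Log that *a* key was loaded without recording which environment variable
# held it or any reusable portion of its value.
logger.info("Loaded a translation API key (%s)", mask_secret(api_key))
print(mask_secret(api_key))  # ****abcd
```

Whether a masked suffix is acceptable depends on the threat model; omitting the log line, as the autofix does, is the more conservative choice.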
```diff
@@ -50,7 +50,7 @@
             api_key = os.getenv(env_var)
             if api_key:
                 self.config[service]['api_key'] = api_key
-                logger.info(f"Loaded API key for {service} from {env_var}")
+                # logger.info("Loaded API key for translation service.")  # (Commented out to avoid leaking info)
                 break

         # Other environment variables
```
Introduce the foundational structure for a Wikipedia infobox synchronization tool, including stages for mapping, saving, parsing, and publishing data. Implement data models and fetchers to handle synchronization between Arabic and English Wikipedia pages. Include infobox parsers and a factory for creating appropriate parsers based on template types. Add logging for better traceability during data operations.
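The "factory for creating appropriate parsers based on template types" described above might be structured roughly like this. All class names, template names, and return values in this sketch are illustrative guesses, not the PR's actual identifiers:

```python
from typing import Dict, Type

class InfoboxParser:
    """Base interface every template-specific parser implements."""
    def parse(self, wikitext: str) -> Dict[str, str]:
        raise NotImplementedError

class PersonInfoboxParser(InfoboxParser):
    def parse(self, wikitext: str) -> Dict[str, str]:
        return {"template": "person"}  # placeholder for real field extraction

class SettlementInfoboxParser(InfoboxParser):
    def parse(self, wikitext: str) -> Dict[str, str]:
        return {"template": "settlement"}  # placeholder for real field extraction

# Registry mapping template names to parser classes.
_PARSERS: Dict[str, Type[InfoboxParser]] = {
    "Infobox person": PersonInfoboxParser,
    "Infobox settlement": SettlementInfoboxParser,
}

def create_parser(template_name: str) -> InfoboxParser:
    """Return a parser for the given infobox template, or raise if unknown."""
    try:
        return _PARSERS[template_name]()
    except KeyError:
        raise ValueError(f"No parser registered for template: {template_name}")

print(type(create_parser("Infobox person")).__name__)  # PersonInfoboxParser
```

A registry-based factory like this keeps the pipeline stages decoupled from individual template formats: adding support for a new infobox type means registering one new class rather than editing dispatch logic.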