Add initial implementation for Wikipedia infobox synchronization #449
Conversation
| """ | ||
| logger.warning("fetch_data(url) is deprecated. Use fetch_wikipedia_data(page_title) instead.") | ||
| # Extract page title from URL (simple implementation) | ||
| if 'wikipedia.org' in url: |
Check failure — Code scanning / CodeQL: Incomplete URL substring sanitization (High)
Copilot Autofix (AI · 7 months ago)
The best fix is to parse the incoming URL using Python's urllib.parse.urlparse, and ensure the hostname is exactly "wikipedia.org" or an allowed subdomain (e.g., "en.wikipedia.org"). This avoids matching URLs that merely contain the string in the wrong position. For this code, replace the substring check with a host check after parsing the URL. If the hostname matches "wikipedia.org" or a defined allowlist of trusted Wikipedia hostnames, proceed; otherwise, raise the ValueError. Update only the block in the function fetch_data, leaving other code untouched.
Implementation steps:
- Add the `from urllib.parse import urlparse` import if not already present.
- Replace the line(s) where `'wikipedia.org' in url` is checked.
- Instead, use `urlparse(url).hostname` and check whether it matches the set of allowed hosts.
- Define a list or set of allowed Wikipedia hostnames (e.g., `en.wikipedia.org`, `ar.wikipedia.org`).
- Only proceed if the parsed hostname is in this allowlist.
```diff
@@ -2,7 +2,7 @@
 import logging
 from typing import Dict, Any

+from urllib.parse import urlparse
 from .sync_fetcher import WikipediaSyncFetcher
 from .models import PageInfo, SyncResult

@@ -44,9 +44,15 @@
         Now expects a page title instead of URL.
         """
         logger.warning("fetch_data(url) is deprecated. Use fetch_wikipedia_data(page_title) instead.")
-        # Extract page title from URL (simple implementation)
-        if 'wikipedia.org' in url:
-            page_title = url.split('/')[-1].replace('_', ' ')
+        # Extract page title from URL (safe implementation)
+        allowed_wikipedia_hosts = {
+            "ar.wikipedia.org",
+            "en.wikipedia.org",
+            "wikipedia.org"
+        }
+        parsed = urlparse(url)
+        if parsed.hostname in allowed_wikipedia_hosts:
+            page_title = parsed.path.split('/')[-1].replace('_', ' ')
             return fetch_wikipedia_data(page_title)
         else:
             raise ValueError("URL must be a Wikipedia page URL")
```
| """ | ||
| logger.warning("fetch_data(url) is deprecated. Use fetch_wikipedia_data(page_title) instead.") | ||
| # Extract page title from URL (simple implementation) | ||
| if 'wikipedia.org' in url: |
Check failure — Code scanning / CodeQL: Incomplete URL substring sanitization (High)
Copilot Autofix (AI · 7 months ago)
To properly ensure the incoming URL is a Wikipedia page, we should parse the URL using urllib.parse and check that the hostname (not just any substring) matches the expected Wikipedia domains (such as wikipedia.org or with language prefix like en.wikipedia.org, ar.wikipedia.org, etc.). Edit the legacy function in tasks/InfoboxSync/fetch/fetch.py, specifically line 237 and associated logic, to (a) import urlparse from urllib.parse, (b) parse the URL, and (c) check that its netloc ends with .wikipedia.org. This fix should be implemented in the fetch_data function. Add the required import at the top if not present.
```diff
@@ -1,5 +1,6 @@
 import logging
 from abc import ABC, abstractmethod
+from urllib.parse import urlparse
 from typing import Dict, Optional, Any
 from dataclasses import dataclass

@@ -233,9 +234,11 @@
         Now expects a page title instead of URL.
         """
         logger.warning("fetch_data(url) is deprecated. Use fetch_wikipedia_data(page_title) instead.")
-        # Extract page title from URL (simple implementation)
-        if 'wikipedia.org' in url:
-            page_title = url.split('/')[-1].replace('_', ' ')
+        # Extract page title from URL by parsing and checking domain
+        parsed_url = urlparse(url)
+        hostname = parsed_url.hostname or ""
+        if hostname.endswith(".wikipedia.org"):
+            page_title = parsed_url.path.split('/')[-1].replace('_', ' ')
             return fetch_wikipedia_data(page_title)
         else:
             raise ValueError("URL must be a Wikipedia page URL")
```
| "run_wikipedia_pipeline(page_title) instead.") | ||
| logger.warning(msg) | ||
|
|
||
| if 'wikipedia.org' in url and '/wiki/' in url: |
Check failure — Code scanning / CodeQL: Incomplete URL substring sanitization (High, test)
Copilot Autofix (AI · 7 months ago)
The best way to fix the problem is to properly parse the url string using standard library URL parsing (for example, Python's urllib.parse.urlparse) and then validate that its netloc (domain) is a valid Wikipedia host, and its path starts with /wiki/. This approach mitigates inappropriate matches and ensures only correct Wikipedia URLs are accepted. The code in the run_pipeline function, specifically lines 157-159, should be changed to use urlparse, check that the domain matches the Wikipedia pattern (ideally, ends with wikipedia.org or matches a regex/allowlist), and that the path starts with /wiki/ before extracting the title. urllib.parse should be imported if not already present.
```diff
@@ -1,5 +1,6 @@
 import logging
 from fetch import fetch_wikipedia_data
+from urllib.parse import urlparse
 from parse.parse import parse_data
 from map.map import map_data
 from translate.translate import translate_data
@@ -154,8 +155,12 @@
            "run_wikipedia_pipeline(page_title) instead.")
     logger.warning(msg)

-    if 'wikipedia.org' in url and '/wiki/' in url:
-        page_title = url.split('/wiki/')[-1].replace('_', ' ')
+    parsed_url = urlparse(url)
+    netloc = parsed_url.netloc.lower()
+    path = parsed_url.path
+    # Accept subdomains like en.wikipedia.org, ar.wikipedia.org, etc.
+    if netloc.endswith("wikipedia.org") and path.startswith("/wiki/"):
+        page_title = path[len("/wiki/"):].replace('_', ' ')
         return run_wikipedia_pipeline(page_title, target_lang, output_dir)
     else:
         msg = ("URL must be a Wikipedia page URL "
```python
            api_key = os.getenv(env_var)
            if api_key:
                self.config[service]['api_key'] = api_key
                logger.info(f"Loaded API key for {service} from {env_var}")
```
Check failure — Code scanning / CodeQL: Clear-text logging of sensitive information (High)
Copilot Autofix (AI · 7 months ago)
To fix the problem, remove or modify the logger statement so that it does not log any potentially sensitive information, including the name of the environment variable or even which service API key was loaded. Instead, if it is necessary to log progress, a generic "Loaded translation service API key" message can be used that does not include details. The best approach is to simply omit this log message altogether, since it's not providing essential runtime information and could encourage risky patterns if copied. The only required change is to remove or replace line 53 in the _load_from_env method of the TranslationConfig class in tasks/InfoboxSync/translate/config.py.
No new methods, imports, or definitions are required.
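If some audit trail is still wanted, a common middle ground is to log only a masked form of the secret rather than dropping the message entirely. A minimal sketch, with a hypothetical helper and an obviously fake key:

```python
import logging

logger = logging.getLogger("translate.config")

def mask_secret(value: str, visible: int = 4) -> str:
    """Show only the last few characters of a secret, e.g. '****abcd'."""
    if len(value) <= visible:
        return "****"
    return "****" + value[-visible:]

api_key = "sk-example-1234abcd"  # fake value for illustration only
# Log that *a* key was loaded without recording which environment variable
# held it or any reusable portion of its value.
logger.info("Loaded a translation API key (%s)", mask_secret(api_key))
print(mask_secret(api_key))  # ****abcd
```

Whether a masked suffix is acceptable depends on the threat model; omitting the log line, as the autofix does, is the more conservative choice.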
```diff
@@ -50,7 +50,7 @@
             api_key = os.getenv(env_var)
             if api_key:
                 self.config[service]['api_key'] = api_key
-                logger.info(f"Loaded API key for {service} from {env_var}")
+                # logger.info("Loaded API key for translation service.")  # (Commented out to avoid leaking info)
                 break

         # Other environment variables
```
Introduce the foundational structure for a Wikipedia infobox synchronization tool, including stages for mapping, saving, parsing, and publishing data. Implement data models and fetchers to handle synchronization between Arabic and English Wikipedia pages. Include infobox parsers and a factory for creating appropriate parsers based on template types. Add logging for better traceability during data operations.
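The "factory for creating appropriate parsers based on template types" described above might be structured roughly like this. All class names, template names, and return values in this sketch are illustrative guesses, not the PR's actual identifiers:

```python
from typing import Dict, Type

class InfoboxParser:
    """Base interface every template-specific parser implements."""
    def parse(self, wikitext: str) -> Dict[str, str]:
        raise NotImplementedError

class PersonInfoboxParser(InfoboxParser):
    def parse(self, wikitext: str) -> Dict[str, str]:
        return {"template": "person"}  # placeholder for real field extraction

class SettlementInfoboxParser(InfoboxParser):
    def parse(self, wikitext: str) -> Dict[str, str]:
        return {"template": "settlement"}  # placeholder for real field extraction

# Registry mapping template names to parser classes.
_PARSERS: Dict[str, Type[InfoboxParser]] = {
    "Infobox person": PersonInfoboxParser,
    "Infobox settlement": SettlementInfoboxParser,
}

def create_parser(template_name: str) -> InfoboxParser:
    """Return a parser for the given infobox template, or raise if unknown."""
    try:
        return _PARSERS[template_name]()
    except KeyError:
        raise ValueError(f"No parser registered for template: {template_name}")

print(type(create_parser("Infobox person")).__name__)  # PersonInfoboxParser
```

A registry-based factory like this keeps the pipeline stages decoupled from individual template formats: adding support for a new infobox type means registering one new class rather than editing dispatch logic.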