-
Notifications
You must be signed in to change notification settings - Fork 10
feat(sdk): add input_type parameter for file type detection #50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…nitial SDK configuration
…ackage configuration
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
Related Documentation 1 document(s) may need updating based on files changed in this PR: PowerRAG SDKThe PR introduces a new 'input_type' parameter for file type detection in document parsing methods, allowing explicit control over file type detection and bypassing auto-detection. The current documentation does not mention this parameter or its usage in the document parsing section or examples. To remain accurate and complete, the documentation should be updated to describe the 'input_type' parameter, its options, and provide usage examples reflecting the new capability. View Suggested Changes@@ -9,6 +9,7 @@
- Create, update, list, and delete datasets.
- Upload documents to datasets, supporting batch operations.
- Parse documents to Markdown format, including direct binary parsing for PDF, Office documents, images, and HTML ([source](https://github.com/oceanbase/powerrag/pull/40)).
+- Explicit file type detection control via `input_type` parameter for parsing methods (see below).
- Asynchronous and synchronous document parsing, with status polling and cancellation.
- Manage document metadata, download content, and handle document chunks.
@@ -75,6 +76,40 @@
- `enable_formula`: Recognize formulas (bool, default False)
- `enable_table`: Recognize tables (bool, default True)
- `from_page`, `to_page`: Page range for PDFs (int, default 0 and 100000)
+- `input_type`: File type detection mode (default: 'auto'). Controls how the SDK determines the file type for parsing. Options:
+ - `'auto'`: Try filename extension first, then auto-detect from binary if no extension or unsupported extension (default)
+ - `'pdf'`, `'office'`, `'html'`, `'image'`: Explicitly specify the file type to bypass detection
+
+#### Example Usage of `input_type`
+
+```python
+# Default: auto-detect using extension or binary content
+result = client.document_manager.parse_to_md_binary(
+ file_binary=open("document.pdf", "rb").read(),
+ filename="document.pdf"
+)
+
+# Auto-detect from binary when file has no extension
+result = client.document_manager.parse_to_md_binary(
+ file_binary=open("document_no_ext", "rb").read(),
+ filename="document_no_ext"
+)
+
+# Explicitly specify file type (bypass detection)
+result = client.document_manager.parse_to_md_binary(
+ file_binary=open("document", "rb").read(),
+ filename="document",
+ input_type="pdf"
+)
+
+# Using config and input_type together
+result = client.document_manager.parse_to_md_binary(
+ file_binary=open("document.html", "rb").read(),
+ filename="document.html",
+ config={"layout_recognize": "mineru"},
+ input_type="html"
+)
+```
### Chat Creation
@@ -115,13 +150,21 @@
### Parse Documents to Markdown (Binary)
```python
+# Default: auto-detect file type
result = client.document_manager.parse_to_md_binary(
file_binary=open("document.pdf", "rb").read(),
- filename="document.pdf",
- config={"layout_recognize": "mineru", "enable_ocr": True}
+ filename="document.pdf"
)
print(result['markdown'])
print(f"Parsed {result['total_images']} images")
+
+# Explicitly specify file type
+result = client.document_manager.parse_to_md_binary(
+ file_binary=open("document", "rb").read(),
+ filename="document",
+ input_type="pdf"
+)
+print(result['markdown'])(source) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request adds support for automatic file type detection and URL-based file parsing to the PowerRAG SDK. The changes enable the system to handle files without extensions and download files directly from URLs for parsing.
Changes:
- Added
input_typeparameter ('auto', 'pdf', 'office', 'html', 'image', 'markdown') to control file type detection mode across parsing methods - Implemented
detect_file_typeutility function to identify file formats from binary content using magic numbers - Enhanced
parse_to_md_uploadendpoint to acceptfile_urlparameter for downloading and parsing remote files - Updated SDK methods to propagate and handle
input_typeconsistently - Added comprehensive test coverage for auto-detection and file_url functionality
- Updated test configuration to require API key via environment variables with
.envfile support
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 20 comments.
Show a summary per file
| File | Description |
|---|---|
| powerrag/utils/file_utils.py | Adds detect_file_type function to identify file formats from binary data using magic numbers |
| powerrag/server/services/parse_service.py | Adds input_type parameter to parse_file_binary and implements auto-detection logic |
| powerrag/server/routes/powerrag_routes.py | Enhances /parse_to_md/upload endpoint to support file_url and input_type parameters |
| powerrag/sdk/modules/document_manager.py | Updates SDK methods to accept and propagate input_type parameter through API calls |
| powerrag/sdk/tests/conftest.py | Requires POWERRAG_API_KEY from environment, changes default port to 9390 |
| powerrag/sdk/tests/.env.example | Provides template for test environment configuration |
| powerrag/sdk/tests/pytest.ini | Adds test timeout and configuration settings |
| powerrag/sdk/tests/test_document.py | Adds comprehensive tests for input_type auto-detection and file_url functionality |
| powerrag/sdk/README.md | Documents new input_type and file_url features with usage examples |
Comments suppressed due to low confidence (1)
powerrag/server/routes/powerrag_routes.py:1160
- The route is calling the private method
_parse_to_markdowndirectly instead of using the publicparse_file_binarymethod. This bypasses the logic and validation inparse_file_binaryand creates duplicate code for input_type handling. It would be better to callservice.parse_file_binary(binary, filename, config, input_type)which already handles all the format detection logic, and then extract markdown and images from its return value. This would reduce code duplication and ensure consistent behavior across the codebase.
# Create service and parse
gotenberg_url = config.get("gotenberg_url", GOTENBERG_URL)
service = PowerRAGParseService(gotenberg_url=gotenberg_url)
# Parse to markdown
md_content, images = service._parse_to_markdown(
filename=filename,
binary=binary,
format_type=format_type,
config=config
)
| except requests.exceptions.RequestException as e: | ||
| logger.error(f"Failed to download file from URL: {file_url}. Error: {e}") | ||
| return jsonify({ | ||
| "code": 400, | ||
| "message": f"Failed to download file from URL: {str(e)}" |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error handling catches all RequestException types equally, but different error types might benefit from more specific handling. For example, timeout errors could suggest increasing the timeout, connection errors could indicate network issues, and HTTP errors could provide more context about the server response. Consider adding more granular error handling for different RequestException subtypes.
| except requests.exceptions.RequestException as e: | |
| logger.error(f"Failed to download file from URL: {file_url}. Error: {e}") | |
| return jsonify({ | |
| "code": 400, | |
| "message": f"Failed to download file from URL: {str(e)}" | |
| except requests.exceptions.Timeout as e: | |
| logger.error(f"Timeout while downloading file from URL: {file_url}. Error: {e}") | |
| return jsonify({ | |
| "code": 400, | |
| "message": "Download request to the file URL timed out. Please check the network or try again with a longer timeout." | |
| }), 400 | |
| except requests.exceptions.ConnectionError as e: | |
| logger.error(f"Connection error while downloading file from URL: {file_url}. Error: {e}") | |
| return jsonify({ | |
| "code": 400, | |
| "message": "Failed to connect to the file URL. Please verify the URL and network connectivity." | |
| }), 400 | |
| except requests.exceptions.HTTPError as e: | |
| status_code = e.response.status_code if getattr(e, "response", None) is not None else "unknown" | |
| logger.error(f"HTTP error while downloading file from URL: {file_url}. Status: {status_code}, Error: {e}") | |
| return jsonify({ | |
| "code": 400, | |
| "message": f"Server returned an HTTP error while downloading file (status: {status_code})." | |
| }), 400 | |
| except requests.exceptions.RequestException as e: | |
| logger.error(f"Unexpected request error while downloading file from URL: {file_url}. Error: {e}") | |
| return jsonify({ | |
| "code": 400, | |
| "message": f"Failed to download file from URL due to an unexpected request error: {str(e)}" |
| if not API_KEY: | ||
| raise ValueError( | ||
| "POWERRAG_API_KEY environment variable is not set. " | ||
| "Please set it in your .env file or system environment. " | ||
| "Copy .env.example to .env and update with your API key." | ||
| ) |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The conftest.py now raises an exception if POWERRAG_API_KEY is not set, which will cause all tests to fail if the environment variable is not configured. While this is good for catching configuration issues, it also means tests cannot run in CI/CD environments without proper secrets configuration. Consider adding a pytest skip marker or fixture that allows tests to be skipped gracefully with a clear message when the API key is not available, rather than raising an exception that prevents pytest from running at all.
| # 从环境变量获取配置 | ||
| HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9222") | ||
| API_KEY = os.getenv("POWERRAG_API_KEY", "ragflow-MAln1FNDn9PhIcqv1axaaUT3mM-efUZ83O5LVcroe9E") | ||
| HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9390") |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default HOST_ADDRESS has been changed from "http://127.0.0.1:9222" to "http://127.0.0.1:9390". This is a breaking change for anyone who has tests configured with the old port. While this change might be intentional to reflect a new default, it should be documented in the PR description or migration notes to alert users to update their configurations.
| HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9390") | |
| HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9222") |
| HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9222") | ||
| API_KEY = os.getenv("POWERRAG_API_KEY", "ragflow-MAln1FNDn9PhIcqv1axaaUT3mM-efUZ83O5LVcroe9E") | ||
| HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9390") | ||
| API_KEY = os.getenv("POWERRAG_API_KEY") |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old default API_KEY value "ragflow-MAln1FNDn9PhIcqv1axaaUT3mM-efUZ83O5LVcroe9E" has been removed, which is good for security. However, this is a breaking change that requires all test environments to be updated. This should be clearly noted in the PR description as a breaking change requiring action from developers.
| return 'office' # pptx | ||
| if any(n.startswith("xl/") for n in names): | ||
| return 'office' # xlsx | ||
| except Exception: |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function returns 'unknown' for ZIP files that don't match any Office format (lines 344-355), but then this return is never reached because line 355 passes silently and continues to check other formats. If a ZIP file doesn't match Office formats, the function will fall through to check OLE, images, and HTML, which is correct. However, the logic would be clearer if there was an explicit comment noting that non-Office ZIP files will fall through to be classified as 'unknown' at the end if they don't match other formats.
| except Exception: | |
| except Exception: | |
| # If the ZIP is not a recognized Office document (or cannot be read), | |
| # fall through to other format checks; it may be classified as 'unknown' | |
| # at the end if no other type matches. |
| def test_explicit_input_type_pdf(self, client: PowerRAGClient, tmp_path): | ||
| """测试显式指定 input_type='pdf'""" | ||
| # 创建一个简单的 PDF 内容 | ||
| pdf_header = b"%PDF-1.4\n%\xE2\xE3\xCF\xD3\n" | ||
| pdf_content = pdf_header + b"1 0 obj\n<<\n/Type /Catalog\n>>\nendobj\n" | ||
|
|
||
| # 显式指定为 PDF 类型,即使文件名没有扩展名 | ||
| result = client.document.parse_to_md_binary( | ||
| file_binary=pdf_content, | ||
| filename="document", | ||
| input_type="pdf" # 显式指定类型 | ||
| ) | ||
|
|
||
| assert "filename" in result |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test at line 653-666 similarly uses a minimal, potentially invalid PDF and only asserts that "filename" is in the result. This is a very weak assertion that doesn't verify whether the parsing actually succeeded or if the PDF content was correctly processed. Consider strengthening the assertion to check for "markdown" in result and verify markdown_length > 0, or use a valid PDF fixture.
| elif input_type in ['pdf', 'office', 'html', 'image']: | ||
| # Use explicitly specified input_type | ||
| format_type = input_type | ||
| logger.info(f"Using explicit input_type: {format_type} for file: {filename}") | ||
| else: | ||
| return jsonify({ | ||
| "code": 400, | ||
| "message": f"Unsupported file format: {file_ext}. Supported formats: pdf, doc, docx, ppt, pptx, jpg, png, html" | ||
| "message": f"Invalid input_type: {input_type}. Must be 'auto' (default), 'pdf', 'office', 'html', or 'image'." | ||
| }), 400 |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The valid input_type values shown in the error message at line 1147 don't include 'markdown', but the parse_file_binary method at line 190 in parse_service.py accepts 'markdown' as a valid input_type. This inconsistency means users will get an error if they try to use 'markdown' as input_type in this endpoint, even though the underlying service method supports it. Either add 'markdown' to the validation here, or remove it from parse_file_binary's validation.
| # Download from URL | ||
| import requests | ||
| from urllib.parse import urlparse | ||
| from pathlib import Path | ||
|
|
||
| logger.info(f"Downloading file from URL: {file_url}") | ||
|
|
||
| try: | ||
| response = requests.get(file_url, timeout=60) | ||
| response.raise_for_status() | ||
| binary = response.content |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file download functionality doesn't validate the file_url scheme or implement any URL allowlist. This could potentially allow Server-Side Request Forgery (SSRF) attacks where an attacker could make the server request internal network resources. Consider adding URL validation to only allow specific schemes (http/https) and optionally implement an allowlist or blocklist for domains/IPs, especially to prevent requests to internal/private network addresses.
| return 'office' # pptx | ||
| if any(n.startswith("xl/") for n in names): | ||
| return 'office' # xlsx | ||
| except Exception: |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'except' clause does nothing but pass and there is no explanatory comment.
| except Exception: | |
| except Exception: | |
| # Any error while reading as a ZIP (corrupt/non-Office archive, etc.) | |
| # means we cannot classify it as a ZIP-based Office file; fall through. |
| if '<html' in text_sample or '<!doctype html' in text_sample or '<head' in text_sample: | ||
| return 'html' | ||
| except Exception: | ||
| pass |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'except' clause does nothing but pass and there is no explanatory comment.
| pass | |
| # If any unexpected error occurs during HTML detection, fall back to unknown. | |
| return 'unknown' |
Summary
Solution Description