Skip to content

Conversation

@Zhangg7723
Copy link
Collaborator

@Zhangg7723 Zhangg7723 commented Jan 23, 2026

Summary

  • Add input_type argument to parsing methods to control file type detection mode
  • Support modes: 'auto' (default), 'pdf', 'office', 'html', 'image' to bypass detection
  • Enhance parse_to_md_with_binary to accept and use input_type
  • Update server routes and parse service to handle input_type consistently
  • Implement auto-detection fallback from filename extension to binary content
  • Add detect_file_type util function to identify file format from binary data
  • Extend and improve tests to cover input_type usage and auto-detection behaviors
  • Add .env.example and environment variable loading in tests for API key configuration
  • Modify form_data construction to include input_type in config JSON when not 'auto'
  • Improve logging for input_type and format detection decisions in server and service layers
  • Enhance parse_to_md_upload endpoint to accept 'file_url' for downloading and parsing files
  • Allow either file upload or file_url parameter, reject providing both simultaneously
  • Implement automatic filename extraction from URL or accept filename override in config
  • Update SDK README with examples for file_url usage and binary parsing without extensions
  • Add comprehensive tests covering valid, invalid, empty, and conflict scenarios for file_url parsing

Solution Description

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jan 23, 2026
@dosubot
Copy link

dosubot bot commented Jan 23, 2026

Related Documentation

1 document(s) may need updating based on files changed in this PR:

PowerRAG SDK

The PR introduces a new 'input_type' parameter for file type detection in document parsing methods, allowing explicit control over file type detection and bypassing auto-detection. The current documentation does not mention this parameter or its usage in the document parsing section or examples. To remain accurate and complete, the documentation should be updated to describe the 'input_type' parameter, its options, and provide usage examples reflecting the new capability.

View Suggested Changes
@@ -9,6 +9,7 @@
 - Create, update, list, and delete datasets.
 - Upload documents to datasets, supporting batch operations.
 - Parse documents to Markdown format, including direct binary parsing for PDF, Office documents, images, and HTML ([source](https://github.com/oceanbase/powerrag/pull/40)).
+- Explicit file type detection control via `input_type` parameter for parsing methods (see below).
 - Asynchronous and synchronous document parsing, with status polling and cancellation.
 - Manage document metadata, download content, and handle document chunks.
 
@@ -75,6 +76,40 @@
 - `enable_formula`: Recognize formulas (bool, default False)
 - `enable_table`: Recognize tables (bool, default True)
 - `from_page`, `to_page`: Page range for PDFs (int, default 0 and 100000)
+- `input_type`: File type detection mode (default: 'auto'). Controls how the SDK determines the file type for parsing. Options:
+    - `'auto'`: Try filename extension first, then auto-detect from binary if no extension or unsupported extension (default)
+    - `'pdf'`, `'office'`, `'html'`, `'image'`: Explicitly specify the file type to bypass detection
+
+#### Example Usage of `input_type`
+
+```python
+# Default: auto-detect using extension or binary content
+result = client.document_manager.parse_to_md_binary(
+    file_binary=open("document.pdf", "rb").read(),
+    filename="document.pdf"
+)
+
+# Auto-detect from binary when file has no extension
+result = client.document_manager.parse_to_md_binary(
+    file_binary=open("document_no_ext", "rb").read(),
+    filename="document_no_ext"
+)
+
+# Explicitly specify file type (bypass detection)
+result = client.document_manager.parse_to_md_binary(
+    file_binary=open("document", "rb").read(),
+    filename="document",
+    input_type="pdf"
+)
+
+# Using config and input_type together
+result = client.document_manager.parse_to_md_binary(
+    file_binary=open("document.html", "rb").read(),
+    filename="document.html",
+    config={"layout_recognize": "mineru"},
+    input_type="html"
+)
+```
 
 ### Chat Creation
 
@@ -115,13 +150,21 @@
 ### Parse Documents to Markdown (Binary)
 
 ```python
+# Default: auto-detect file type
 result = client.document_manager.parse_to_md_binary(
     file_binary=open("document.pdf", "rb").read(),
-    filename="document.pdf",
-    config={"layout_recognize": "mineru", "enable_ocr": True}
+    filename="document.pdf"
 )
 print(result['markdown'])
 print(f"Parsed {result['total_images']} images")
+
+# Explicitly specify file type
+result = client.document_manager.parse_to_md_binary(
+    file_binary=open("document", "rb").read(),
+    filename="document",
+    input_type="pdf"
+)
+print(result['markdown'])

(source)

</details>



<!-- Dosu Comment Footer -->
<sup>How did I do? Any feedback?</sup>&nbsp;&nbsp;[![Join Discord](https://img.shields.io/badge/join-5865F2?logo=discord&logoColor=white&label=)](https://go.dosu.dev/discord-bot)

@dosubot dosubot bot added the enhancement New feature or request label Jan 23, 2026
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jan 23, 2026
@whhe whhe requested a review from Copilot January 23, 2026 07:37
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request adds support for automatic file type detection and URL-based file parsing to the PowerRAG SDK. The changes enable the system to handle files without extensions and download files directly from URLs for parsing.

Changes:

  • Added input_type parameter ('auto', 'pdf', 'office', 'html', 'image', 'markdown') to control file type detection mode across parsing methods
  • Implemented detect_file_type utility function to identify file formats from binary content using magic numbers
  • Enhanced parse_to_md_upload endpoint to accept file_url parameter for downloading and parsing remote files
  • Updated SDK methods to propagate and handle input_type consistently
  • Added comprehensive test coverage for auto-detection and file_url functionality
  • Updated test configuration to require API key via environment variables with .env file support

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 20 comments.

Show a summary per file
File Description
powerrag/utils/file_utils.py Adds detect_file_type function to identify file formats from binary data using magic numbers
powerrag/server/services/parse_service.py Adds input_type parameter to parse_file_binary and implements auto-detection logic
powerrag/server/routes/powerrag_routes.py Enhances /parse_to_md/upload endpoint to support file_url and input_type parameters
powerrag/sdk/modules/document_manager.py Updates SDK methods to accept and propagate input_type parameter through API calls
powerrag/sdk/tests/conftest.py Requires POWERRAG_API_KEY from environment, changes default port to 9390
powerrag/sdk/tests/.env.example Provides template for test environment configuration
powerrag/sdk/tests/pytest.ini Adds test timeout and configuration settings
powerrag/sdk/tests/test_document.py Adds comprehensive tests for input_type auto-detection and file_url functionality
powerrag/sdk/README.md Documents new input_type and file_url features with usage examples
Comments suppressed due to low confidence (1)

powerrag/server/routes/powerrag_routes.py:1160

  • The route is calling the private method _parse_to_markdown directly instead of using the public parse_file_binary method. This bypasses the logic and validation in parse_file_binary and creates duplicate code for input_type handling. It would be better to call service.parse_file_binary(binary, filename, config, input_type) which already handles all the format detection logic, and then extract markdown and images from its return value. This would reduce code duplication and ensure consistent behavior across the codebase.
        # Create service and parse
        gotenberg_url = config.get("gotenberg_url", GOTENBERG_URL)
        service = PowerRAGParseService(gotenberg_url=gotenberg_url)
        
        # Parse to markdown
        md_content, images = service._parse_to_markdown(
            filename=filename,
            binary=binary,
            format_type=format_type,
            config=config
        )

Comment on lines +1067 to +1071
except requests.exceptions.RequestException as e:
logger.error(f"Failed to download file from URL: {file_url}. Error: {e}")
return jsonify({
"code": 400,
"message": f"Failed to download file from URL: {str(e)}"
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error handling catches all RequestException types equally, but different error types might benefit from more specific handling. For example, timeout errors could suggest increasing the timeout, connection errors could indicate network issues, and HTTP errors could provide more context about the server response. Consider adding more granular error handling for different RequestException subtypes.

Suggested change
except requests.exceptions.RequestException as e:
logger.error(f"Failed to download file from URL: {file_url}. Error: {e}")
return jsonify({
"code": 400,
"message": f"Failed to download file from URL: {str(e)}"
except requests.exceptions.Timeout as e:
logger.error(f"Timeout while downloading file from URL: {file_url}. Error: {e}")
return jsonify({
"code": 400,
"message": "Download request to the file URL timed out. Please check the network or try again with a longer timeout."
}), 400
except requests.exceptions.ConnectionError as e:
logger.error(f"Connection error while downloading file from URL: {file_url}. Error: {e}")
return jsonify({
"code": 400,
"message": "Failed to connect to the file URL. Please verify the URL and network connectivity."
}), 400
except requests.exceptions.HTTPError as e:
status_code = e.response.status_code if getattr(e, "response", None) is not None else "unknown"
logger.error(f"HTTP error while downloading file from URL: {file_url}. Status: {status_code}, Error: {e}")
return jsonify({
"code": 400,
"message": f"Server returned an HTTP error while downloading file (status: {status_code})."
}), 400
except requests.exceptions.RequestException as e:
logger.error(f"Unexpected request error while downloading file from URL: {file_url}. Error: {e}")
return jsonify({
"code": 400,
"message": f"Failed to download file from URL due to an unexpected request error: {str(e)}"

Copilot uses AI. Check for mistakes.
Comment on lines +29 to +34
if not API_KEY:
raise ValueError(
"POWERRAG_API_KEY environment variable is not set. "
"Please set it in your .env file or system environment. "
"Copy .env.example to .env and update with your API key."
)
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conftest.py now raises an exception if POWERRAG_API_KEY is not set, which will cause all tests to fail if the environment variable is not configured. While this is good for catching configuration issues, it also means tests cannot run in CI/CD environments without proper secrets configuration. Consider adding a pytest skip marker or fixture that allows tests to be skipped gracefully with a clear message when the API key is not available, rather than raising an exception that prevents pytest from running at all.

Copilot uses AI. Check for mistakes.
# 从环境变量获取配置
HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9222")
API_KEY = os.getenv("POWERRAG_API_KEY", "ragflow-MAln1FNDn9PhIcqv1axaaUT3mM-efUZ83O5LVcroe9E")
HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9390")
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default HOST_ADDRESS has been changed from "http://127.0.0.1:9222" to "http://127.0.0.1:9390". This is a breaking change for anyone who has tests configured with the old port. While this change might be intentional to reflect a new default, it should be documented in the PR description or migration notes to alert users to update their configurations.

Suggested change
HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9390")
HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9222")

Copilot uses AI. Check for mistakes.
HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9222")
API_KEY = os.getenv("POWERRAG_API_KEY", "ragflow-MAln1FNDn9PhIcqv1axaaUT3mM-efUZ83O5LVcroe9E")
HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9390")
API_KEY = os.getenv("POWERRAG_API_KEY")
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old default API_KEY value "ragflow-MAln1FNDn9PhIcqv1axaaUT3mM-efUZ83O5LVcroe9E" has been removed, which is good for security. However, this is a breaking change that requires all test environments to be updated. This should be clearly noted in the PR description as a breaking change requiring action from developers.

Copilot uses AI. Check for mistakes.
return 'office' # pptx
if any(n.startswith("xl/") for n in names):
return 'office' # xlsx
except Exception:
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function returns 'unknown' for ZIP files that don't match any Office format (lines 344-355), but then this return is never reached because line 355 passes silently and continues to check other formats. If a ZIP file doesn't match Office formats, the function will fall through to check OLE, images, and HTML, which is correct. However, the logic would be clearer if there was an explicit comment noting that non-Office ZIP files will fall through to be classified as 'unknown' at the end if they don't match other formats.

Suggested change
except Exception:
except Exception:
# If the ZIP is not a recognized Office document (or cannot be read),
# fall through to other format checks; it may be classified as 'unknown'
# at the end if no other type matches.

Copilot uses AI. Check for mistakes.
Comment on lines +653 to +666
def test_explicit_input_type_pdf(self, client: PowerRAGClient, tmp_path):
"""测试显式指定 input_type='pdf'"""
# 创建一个简单的 PDF 内容
pdf_header = b"%PDF-1.4\n%\xE2\xE3\xCF\xD3\n"
pdf_content = pdf_header + b"1 0 obj\n<<\n/Type /Catalog\n>>\nendobj\n"

# 显式指定为 PDF 类型,即使文件名没有扩展名
result = client.document.parse_to_md_binary(
file_binary=pdf_content,
filename="document",
input_type="pdf" # 显式指定类型
)

assert "filename" in result
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test at line 653-666 similarly uses a minimal, potentially invalid PDF and only asserts that "filename" is in the result. This is a very weak assertion that doesn't verify whether the parsing actually succeeded or if the PDF content was correctly processed. Consider strengthening the assertion to check for "markdown" in result and verify markdown_length > 0, or use a valid PDF fixture.

Copilot uses AI. Check for mistakes.
Comment on lines +1140 to 1148
elif input_type in ['pdf', 'office', 'html', 'image']:
# Use explicitly specified input_type
format_type = input_type
logger.info(f"Using explicit input_type: {format_type} for file: {filename}")
else:
return jsonify({
"code": 400,
"message": f"Unsupported file format: {file_ext}. Supported formats: pdf, doc, docx, ppt, pptx, jpg, png, html"
"message": f"Invalid input_type: {input_type}. Must be 'auto' (default), 'pdf', 'office', 'html', or 'image'."
}), 400
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The valid input_type values shown in the error message at line 1147 don't include 'markdown', but the parse_file_binary method at line 190 in parse_service.py accepts 'markdown' as a valid input_type. This inconsistency means users will get an error if they try to use 'markdown' as input_type in this endpoint, even though the underlying service method supports it. Either add 'markdown' to the validation here, or remove it from parse_file_binary's validation.

Copilot uses AI. Check for mistakes.
Comment on lines +1056 to +1066
# Download from URL
import requests
from urllib.parse import urlparse
from pathlib import Path

logger.info(f"Downloading file from URL: {file_url}")

try:
response = requests.get(file_url, timeout=60)
response.raise_for_status()
binary = response.content
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file download functionality doesn't validate the file_url scheme or implement any URL allowlist. This could potentially allow Server-Side Request Forgery (SSRF) attacks where an attacker could make the server request internal network resources. Consider adding URL validation to only allow specific schemes (http/https) and optionally implement an allowlist or blocklist for domains/IPs, especially to prevent requests to internal/private network addresses.

Copilot uses AI. Check for mistakes.
return 'office' # pptx
if any(n.startswith("xl/") for n in names):
return 'office' # xlsx
except Exception:
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
except Exception:
except Exception:
# Any error while reading as a ZIP (corrupt/non-Office archive, etc.)
# means we cannot classify it as a ZIP-based Office file; fall through.

Copilot uses AI. Check for mistakes.
if '<html' in text_sample or '<!doctype html' in text_sample or '<head' in text_sample:
return 'html'
except Exception:
pass
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'except' clause does nothing but pass and there is no explanatory comment.

Suggested change
pass
# If any unexpected error occurs during HTML detection, fall back to unknown.
return 'unknown'

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants