feat(sdk): add input_type parameter for file type detection #50

Zhangg7723 · 2026-01-23T06:26:44Z

Summary

Add input_type argument to parsing methods to control file type detection mode
Support modes: 'auto' (default), 'pdf', 'office', 'html', 'image' to bypass detection
Enhance parse_to_md_with_binary to accept and use input_type
Update server routes and parse service to handle input_type consistently
Implement auto-detection fallback from filename extension to binary content
Add detect_file_type util function to identify file format from binary data
Extend and improve tests to cover input_type usage and auto-detection behaviors
Add .env.example and environment variable loading in tests for API key configuration
Modify form_data construction to include input_type in config JSON when not 'auto'
Improve logging for input_type and format detection decisions in server and service layers
Enhance parse_to_md_upload endpoint to accept 'file_url' for downloading and parsing files
Allow either file upload or file_url parameter, reject providing both simultaneously
Implement automatic filename extraction from URL or accept filename override in config
Update SDK README with examples for file_url usage and binary parsing without extensions
Add comprehensive tests covering valid, invalid, empty, and conflict scenarios for file_url parsing

Solution Description

…nitial SDK configuration

…ackage configuration

… 3.10

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

dosubot · 2026-01-23T06:27:45Z

Related Documentation

1 document(s) may need updating based on files changed in this PR:

PowerRAG SDK

The PR introduces a new 'input_type' parameter for file type detection in document parsing methods, allowing explicit control over file type detection and bypassing auto-detection. The current documentation does not mention this parameter or its usage in the document parsing section or examples. To remain accurate and complete, the documentation should be updated to describe the 'input_type' parameter, its options, and provide usage examples reflecting the new capability.

View Suggested Changes

@@ -9,6 +9,7 @@
 - Create, update, list, and delete datasets.
 - Upload documents to datasets, supporting batch operations.
 - Parse documents to Markdown format, including direct binary parsing for PDF, Office documents, images, and HTML ([source](https://github.com/oceanbase/powerrag/pull/40)).
+- Explicit file type detection control via `input_type` parameter for parsing methods (see below).
 - Asynchronous and synchronous document parsing, with status polling and cancellation.
 - Manage document metadata, download content, and handle document chunks.
 
@@ -75,6 +76,40 @@
 - `enable_formula`: Recognize formulas (bool, default False)
 - `enable_table`: Recognize tables (bool, default True)
 - `from_page`, `to_page`: Page range for PDFs (int, default 0 and 100000)
+- `input_type`: File type detection mode (default: 'auto'). Controls how the SDK determines the file type for parsing. Options:
+    - `'auto'`: Try filename extension first, then auto-detect from binary if no extension or unsupported extension (default)
+    - `'pdf'`, `'office'`, `'html'`, `'image'`: Explicitly specify the file type to bypass detection
+
+#### Example Usage of `input_type`
+
+```python
+# Default: auto-detect using extension or binary content
+result = client.document_manager.parse_to_md_binary(
+    file_binary=open("document.pdf", "rb").read(),
+    filename="document.pdf"
+)
+
+# Auto-detect from binary when file has no extension
+result = client.document_manager.parse_to_md_binary(
+    file_binary=open("document_no_ext", "rb").read(),
+    filename="document_no_ext"
+)
+
+# Explicitly specify file type (bypass detection)
+result = client.document_manager.parse_to_md_binary(
+    file_binary=open("document", "rb").read(),
+    filename="document",
+    input_type="pdf"
+)
+
+# Using config and input_type together
+result = client.document_manager.parse_to_md_binary(
+    file_binary=open("document.html", "rb").read(),
+    filename="document.html",
+    config={"layout_recognize": "mineru"},
+    input_type="html"
+)
+```
 
 ### Chat Creation
 
@@ -115,13 +150,21 @@
 ### Parse Documents to Markdown (Binary)
 
 ```python
+# Default: auto-detect file type
 result = client.document_manager.parse_to_md_binary(
     file_binary=open("document.pdf", "rb").read(),
-    filename="document.pdf",
-    config={"layout_recognize": "mineru", "enable_ocr": True}
+    filename="document.pdf"
 )
 print(result['markdown'])
 print(f"Parsed {result['total_images']} images")
+
+# Explicitly specify file type
+result = client.document_manager.parse_to_md_binary(
+    file_binary=open("document", "rb").read(),
+    filename="document",
+    input_type="pdf"
+)
+print(result['markdown'])

(source)

</details>



<!-- Dosu Comment Footer -->
<sup>How did I do? Any feedback?</sup>&nbsp;&nbsp;[![Join Discord](https://img.shields.io/badge/join-5865F2?logo=discord&logoColor=white&label=)](https://go.dosu.dev/discord-bot)

Copilot

Pull request overview

This pull request adds support for automatic file type detection and URL-based file parsing to the PowerRAG SDK. The changes enable the system to handle files without extensions and download files directly from URLs for parsing.

Changes:

Added input_type parameter ('auto', 'pdf', 'office', 'html', 'image', 'markdown') to control file type detection mode across parsing methods
Implemented detect_file_type utility function to identify file formats from binary content using magic numbers
Enhanced parse_to_md_upload endpoint to accept file_url parameter for downloading and parsing remote files
Updated SDK methods to propagate and handle input_type consistently
Added comprehensive test coverage for auto-detection and file_url functionality
Updated test configuration to require API key via environment variables with .env file support

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 20 comments.

Show a summary per file

File	Description
powerrag/utils/file_utils.py	Adds `detect_file_type` function to identify file formats from binary data using magic numbers
powerrag/server/services/parse_service.py	Adds `input_type` parameter to `parse_file_binary` and implements auto-detection logic
powerrag/server/routes/powerrag_routes.py	Enhances `/parse_to_md/upload` endpoint to support file_url and input_type parameters
powerrag/sdk/modules/document_manager.py	Updates SDK methods to accept and propagate `input_type` parameter through API calls
powerrag/sdk/tests/conftest.py	Requires POWERRAG_API_KEY from environment, changes default port to 9390
powerrag/sdk/tests/.env.example	Provides template for test environment configuration
powerrag/sdk/tests/pytest.ini	Adds test timeout and configuration settings
powerrag/sdk/tests/test_document.py	Adds comprehensive tests for input_type auto-detection and file_url functionality
powerrag/sdk/README.md	Documents new input_type and file_url features with usage examples

Comments suppressed due to low confidence (1)

powerrag/server/routes/powerrag_routes.py:1160

The route is calling the private method _parse_to_markdown directly instead of using the public parse_file_binary method. This bypasses the logic and validation in parse_file_binary and creates duplicate code for input_type handling. It would be better to call service.parse_file_binary(binary, filename, config, input_type) which already handles all the format detection logic, and then extract markdown and images from its return value. This would reduce code duplication and ensure consistent behavior across the codebase.

        # Create service and parse
        gotenberg_url = config.get("gotenberg_url", GOTENBERG_URL)
        service = PowerRAGParseService(gotenberg_url=gotenberg_url)
        
        # Parse to markdown
        md_content, images = service._parse_to_markdown(
            filename=filename,
            binary=binary,
            format_type=format_type,
            config=config
        )

Copilot · 2026-01-23T07:47:58Z

powerrag/server/routes/powerrag_routes.py

+            except requests.exceptions.RequestException as e:
+                logger.error(f"Failed to download file from URL: {file_url}. Error: {e}")
+                return jsonify({
+                    "code": 400,
+                    "message": f"Failed to download file from URL: {str(e)}"


The error handling catches all RequestException types equally, but different error types might benefit from more specific handling. For example, timeout errors could suggest increasing the timeout, connection errors could indicate network issues, and HTTP errors could provide more context about the server response. Consider adding more granular error handling for different RequestException subtypes.

Suggested change

except requests.exceptions.RequestException as e:

logger.error(f"Failed to download file from URL: {file_url}. Error: {e}")

return jsonify({

"code": 400,

"message": f"Failed to download file from URL: {str(e)}"

except requests.exceptions.Timeout as e:

logger.error(f"Timeout while downloading file from URL: {file_url}. Error: {e}")

return jsonify({

"code": 400,

"message": "Download request to the file URL timed out. Please check the network or try again with a longer timeout."

}), 400

except requests.exceptions.ConnectionError as e:

logger.error(f"Connection error while downloading file from URL: {file_url}. Error: {e}")

return jsonify({

"code": 400,

"message": "Failed to connect to the file URL. Please verify the URL and network connectivity."

}), 400

except requests.exceptions.HTTPError as e:

status_code = e.response.status_code if getattr(e, "response", None) is not None else "unknown"

logger.error(f"HTTP error while downloading file from URL: {file_url}. Status: {status_code}, Error: {e}")

return jsonify({

"code": 400,

"message": f"Server returned an HTTP error while downloading file (status: {status_code})."

}), 400

except requests.exceptions.RequestException as e:

logger.error(f"Unexpected request error while downloading file from URL: {file_url}. Error: {e}")

return jsonify({

"code": 400,

"message": f"Failed to download file from URL due to an unexpected request error: {str(e)}"

Copilot · 2026-01-23T07:47:59Z

powerrag/sdk/tests/conftest.py

+if not API_KEY:
+    raise ValueError(
+        "POWERRAG_API_KEY environment variable is not set. "
+        "Please set it in your .env file or system environment. "
+        "Copy .env.example to .env and update with your API key."
+    )


The conftest.py now raises an exception if POWERRAG_API_KEY is not set, which will cause all tests to fail if the environment variable is not configured. While this is good for catching configuration issues, it also means tests cannot run in CI/CD environments without proper secrets configuration. Consider adding a pytest skip marker or fixture that allows tests to be skipped gracefully with a clear message when the API key is not available, rather than raising an exception that prevents pytest from running at all.

Copilot · 2026-01-23T07:47:59Z

powerrag/sdk/tests/conftest.py

 # 从环境变量获取配置
-HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9222")
-API_KEY = os.getenv("POWERRAG_API_KEY", "ragflow-MAln1FNDn9PhIcqv1axaaUT3mM-efUZ83O5LVcroe9E")
+HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9390")


The default HOST_ADDRESS has been changed from "http://127.0.0.1:9222" to "http://127.0.0.1:9390". This is a breaking change for anyone who has tests configured with the old port. While this change might be intentional to reflect a new default, it should be documented in the PR description or migration notes to alert users to update their configurations.

Suggested change

HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9390")

HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9222")

Copilot · 2026-01-23T07:47:59Z

powerrag/sdk/tests/conftest.py

-HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9222")
-API_KEY = os.getenv("POWERRAG_API_KEY", "ragflow-MAln1FNDn9PhIcqv1axaaUT3mM-efUZ83O5LVcroe9E")
+HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9390")
+API_KEY = os.getenv("POWERRAG_API_KEY")


The old default API_KEY value "ragflow-MAln1FNDn9PhIcqv1axaaUT3mM-efUZ83O5LVcroe9E" has been removed, which is good for security. However, this is a breaking change that requires all test environments to be updated. This should be clearly noted in the PR description as a breaking change requiring action from developers.

Copilot · 2026-01-23T07:48:00Z

powerrag/utils/file_utils.py

+                    return 'office'  # pptx
+                if any(n.startswith("xl/") for n in names):
+                    return 'office'  # xlsx
+        except Exception:


The function returns 'unknown' for ZIP files that don't match any Office format (lines 344-355), but then this return is never reached because line 355 passes silently and continues to check other formats. If a ZIP file doesn't match Office formats, the function will fall through to check OLE, images, and HTML, which is correct. However, the logic would be clearer if there was an explicit comment noting that non-Office ZIP files will fall through to be classified as 'unknown' at the end if they don't match other formats.

Suggested change

except Exception:

except Exception:

# If the ZIP is not a recognized Office document (or cannot be read),

# fall through to other format checks; it may be classified as 'unknown'

# at the end if no other type matches.

Copilot · 2026-01-23T07:48:03Z

powerrag/sdk/tests/test_document.py

+    def test_explicit_input_type_pdf(self, client: PowerRAGClient, tmp_path):
+        """测试显式指定 input_type='pdf'"""
+        # 创建一个简单的 PDF 内容
+        pdf_header = b"%PDF-1.4\n%\xE2\xE3\xCF\xD3\n"
+        pdf_content = pdf_header + b"1 0 obj\n<<\n/Type /Catalog\n>>\nendobj\n"
+
+        # 显式指定为 PDF 类型，即使文件名没有扩展名
+        result = client.document.parse_to_md_binary(
+            file_binary=pdf_content,
+            filename="document",
+            input_type="pdf"  # 显式指定类型
+        )
+
+        assert "filename" in result


The test at line 653-666 similarly uses a minimal, potentially invalid PDF and only asserts that "filename" is in the result. This is a very weak assertion that doesn't verify whether the parsing actually succeeded or if the PDF content was correctly processed. Consider strengthening the assertion to check for "markdown" in result and verify markdown_length > 0, or use a valid PDF fixture.

Copilot · 2026-01-23T07:48:03Z

powerrag/server/routes/powerrag_routes.py

+        elif input_type in ['pdf', 'office', 'html', 'image']:
+            # Use explicitly specified input_type
+            format_type = input_type
+            logger.info(f"Using explicit input_type: {format_type} for file: {filename}")
+        else:
            return jsonify({
                "code": 400,
-                "message": f"Unsupported file format: {file_ext}. Supported formats: pdf, doc, docx, ppt, pptx, jpg, png, html"
+                "message": f"Invalid input_type: {input_type}. Must be 'auto' (default), 'pdf', 'office', 'html', or 'image'."
            }), 400


The valid input_type values shown in the error message at line 1147 don't include 'markdown', but the parse_file_binary method at line 190 in parse_service.py accepts 'markdown' as a valid input_type. This inconsistency means users will get an error if they try to use 'markdown' as input_type in this endpoint, even though the underlying service method supports it. Either add 'markdown' to the validation here, or remove it from parse_file_binary's validation.

Copilot · 2026-01-23T07:48:04Z

powerrag/server/routes/powerrag_routes.py

+            # Download from URL
+            import requests
+            from urllib.parse import urlparse
+            from pathlib import Path
+
+            logger.info(f"Downloading file from URL: {file_url}")
+
+            try:
+                response = requests.get(file_url, timeout=60)
+                response.raise_for_status()
+                binary = response.content


The file download functionality doesn't validate the file_url scheme or implement any URL allowlist. This could potentially allow Server-Side Request Forgery (SSRF) attacks where an attacker could make the server request internal network resources. Consider adding URL validation to only allow specific schemes (http/https) and optionally implement an allowlist or blocklist for domains/IPs, especially to prevent requests to internal/private network addresses.

Copilot · 2026-01-23T07:48:05Z

powerrag/utils/file_utils.py

+                    return 'office'  # pptx
+                if any(n.startswith("xl/") for n in names):
+                    return 'office'  # xlsx
+        except Exception:


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

except Exception:

except Exception:

# Any error while reading as a ZIP (corrupt/non-Office archive, etc.)

# means we cannot classify it as a ZIP-based Office file; fall through.

Copilot · 2026-01-23T07:48:06Z

powerrag/utils/file_utils.py

+        if '<html' in text_sample or '<!doctype html' in text_sample or '<head' in text_sample:
+            return 'html'
+    except Exception:
+        pass


'except' clause does nothing but pass and there is no explanatory comment.

Suggested change

pass

# If any unexpected error occurs during HTML detection, fall back to unknown.

return 'unknown'

Zhangg7723 and others added 18 commits December 31, 2025 16:55

feat:add PowerRAG SDK and API Proxy

352fa8f

Merge branch 'oceanbase:main' into powerrag_sdk_api

1f60049

Merge branch 'oceanbase:main' into powerrag_sdk_api

8b735e4

feat: add GitHub Actions workflow for Python package publishing and i…

f8d2bd5

…nitial SDK configuration

chore: update GitHub Actions workflow for SDK publishing and refine p…

35ba7e2

…ackage configuration

chore: update Python version requirement in pyproject.toml to support…

6325328

… 3.10

chore: add environment configuration for PyPI in GitHub Actions workflow

79d1294

Merge branch 'oceanbase:main' into powerrag_sdk_api

3dbcd5d

docs: update SDK README.md

16a0245

Merge branch 'oceanbase:main' into powerrag_sdk_api

e160ce0

feat(document): add binary file parsing to Markdown method

f15d0ef

refactor(document_manager): centralize parse to markdown upload logic

7732022

refactor(init): remove module docstring and __all__ exports

33a0520

Merge branch 'oceanbase:main' into powerrag_sdk_api

23dde0b

chore(docker): add GOTENBERG server environment variables

399d17a

Update docker/.env.example

f28c272

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'oceanbase:main' into powerrag_sdk_api

d92c91d

feat(document): add input_type parameter for file type detection

e9780ed

dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jan 23, 2026

dosubot bot added the enhancement New feature or request label Jan 23, 2026

feat(document): add support for file_url to parse documents from URL

045ad0a

dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jan 23, 2026

whhe requested a review from Copilot January 23, 2026 07:37

Copilot started reviewing on behalf of whhe January 23, 2026 07:38 View session

Copilot AI reviewed Jan 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(sdk): add input_type parameter for file type detection #50

feat(sdk): add input_type parameter for file type detection #50

Uh oh!

Zhangg7723 commented Jan 23, 2026 •

edited

Loading

Uh oh!

dosubot bot commented Jan 23, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Copilot AI Jan 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9390")
	HOST_ADDRESS = os.getenv("HOST_ADDRESS", "http://127.0.0.1:9222")

-        except Exception:
+        except Exception:
+            # If the ZIP is not a recognized Office document (or cannot be read),
+            # fall through to other format checks; it may be classified as 'unknown'
+            # at the end if no other type matches.

	pass
	# If any unexpected error occurs during HTML detection, fall back to unknown.
	return 'unknown'

feat(sdk): add input_type parameter for file type detection #50

Are you sure you want to change the base?

feat(sdk): add input_type parameter for file type detection #50

Uh oh!

Conversation

Zhangg7723 commented Jan 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Solution Description

Uh oh!

dosubot bot commented Jan 23, 2026

PowerRAG SDK

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Zhangg7723 commented Jan 23, 2026 •

edited

Loading