feat(parse): add Feishu/Lark cloud document parser#831
feat(parse): add Feishu/Lark cloud document parser#831ryzn0518 wants to merge 1 commit intovolcengine:mainfrom
Conversation
|
Failed to generate code suggestions for PR |
d1e51e0 to
f48db10
Compare
qin-ctx
left a comment
There was a problem hiding this comment.
Good feature addition overall. The convert-then-parse pattern and data-driven dispatch design are solid choices.
One blocking issue: synchronous lark-oapi SDK calls inside async parse() will block the event loop — needs asyncio.to_thread() wrapping, consistent with how other parsers in this project handle blocking I/O (e.g., code/code.py).
See inline comments for details.
| doc_title = title | ||
|
|
||
| # Delegate to MarkdownParser | ||
| from openviking.parse.parsers.markdown import MarkdownParser |
There was a problem hiding this comment.
[Bug] (blocking) Synchronous API calls block the async event loop.
_resolve_wiki_node(), _parse_docx(), _parse_sheets(), and _parse_bitable() all make synchronous HTTP calls via the lark-oapi SDK, but they are called directly from async def parse() without asyncio.to_thread() wrapping.
For large documents with pagination, these calls can block the event loop for several seconds. This project already uses asyncio.to_thread() for blocking I/O in other parsers (e.g., code/code.py:379). The same pattern should be applied here:
markdown, doc_title = await asyncio.to_thread(getattr(self, handler_name), token)Similarly for _resolve_wiki_node on line 172.
| raw_req = ( | ||
| lark.BaseRequest.builder() | ||
| .http_method(lark.HttpMethod.GET) | ||
| .uri(f"/open-apis/docx/v1/documents/{doc_id}/blocks/{block_id}") |
There was a problem hiding this comment.
[Design] (non-blocking) doc_id = block.parent_id assumes the embedded sheet is a direct child of the page block.
If the embedded sheet is nested inside a container block (e.g., quote_container or table), parent_id would point to that container rather than the document. The API call to /open-apis/docx/v1/documents/{doc_id}/blocks/{block_id} would then fail or return unexpected data.
Consider passing document_id explicitly from _parse_docx() (where it's known) instead of inferring it from parent_id:
def _embedded_sheet_to_markdown(self, block, block_map=None, *, document_id=None, **_):
doc_id = document_id or block.parent_id| val = getattr(block, attr, None) | ||
| if val is not None: | ||
| return attr | ||
| return None |
There was a problem hiding this comment.
[Design] (non-blocking) Using dir(block) for block type detection is fragile.
dir() returns all attributes including methods, properties, and class-level attributes. The skip set only covers currently known non-content names. If a future lark-oapi SDK version adds a non-underscore helper method or property that returns a truthy value (e.g., to_json, validate, serialize), it would be falsely detected as a content attribute.
Consider using block_type (the integer already available on every block) as the primary dispatch mechanism, with attribute inspection as a fallback for unknown types:
# Primary: known block_type dispatch
# Fallback: attribute inspection for unknown typesThis preserves the auto-compat benefit for unknown types while being robust for known ones.
openviking/utils/media_processor.py
Outdated
| if self._is_feishu_url(url): | ||
| from openviking.parse.parsers.feishu import FeishuParser | ||
|
|
||
| parser = FeishuParser() |
There was a problem hiding this comment.
[Design] (non-blocking) Creates a new FeishuParser instance (and thus a new lark-oapi client with separate auth token lifecycle) on every URL call.
The ParserRegistry already registers a FeishuParser instance. Consider either:
- Retrieving the parser from the registry, or
- Caching the
lark-oapiclient at class level (as_get_client()already does per-instance)
The current approach works but creates unnecessary overhead for repeated imports.
| domain: str = "https://open.feishu.cn" | ||
| max_rows_per_sheet: int = 1000 | ||
| max_records_per_table: int = 1000 | ||
| download_images: bool = True |
There was a problem hiding this comment.
[Suggestion] (non-blocking) download_images and request_timeout (next line) are declared but never read by FeishuParser.
Images currently generate feishu://image/{token} placeholder links without actual downloading, and request_timeout is not passed to the lark-oapi client builder.
If these are planned for future work, consider adding a comment noting that. Otherwise, removing them avoids misleading users into thinking they have an effect.
openviking/utils/media_processor.py
Outdated
| host = parsed.hostname or "" | ||
| path = parsed.path | ||
| is_feishu_domain = host.endswith(".feishu.cn") or host.endswith(".larksuite.com") | ||
| has_doc_path = any(path.startswith(f"/{t}") for t in ("docx", "wiki", "sheets", "base")) |
There was a problem hiding this comment.
[Suggestion] (non-blocking) Prefix matching is slightly too broad.
path.startswith(f"/{t}") would match unintended paths like /docx-editor or /base-pricing (hypothetical non-document Feishu pages). Using segment-level matching would be more precise:
has_doc_path = any(
path == f"/{t}" or path.startswith(f"/{t}/")
for t in ("docx", "wiki", "sheets", "base")
)|
|
||
| # Code block (needs language from style) | ||
| if attr == "code": | ||
| lang = "" |
There was a problem hiding this comment.
[Suggestion] (non-blocking) Ordered list counter doesn't reset between separate ordered lists under the same parent.
The counter is keyed by parent_id and monotonically incremented. If a document has two independent ordered lists under the same parent block, the second list continues numbering from where the first ended (e.g., 1-3 then 4-6 instead of 1-3 then 1-3).
A possible fix is to reset the counter when a non-ordered block is encountered for a given parent, or to use (parent_id, list_index) as the key.
|
|
||
| def test_all_empty(self): | ||
| rows = [["", ""], ["", ""]] | ||
| assert FeishuParser._trim_empty_columns(rows) == [] |
There was a problem hiding this comment.
[Suggestion] (non-blocking) Good unit test coverage for utility functions, but no tests for the core parse methods (_parse_docx, _parse_sheets, _parse_bitable).
These methods contain pagination logic, error handling, and API response processing that would benefit from tests with mocked lark-oapi responses. For example:
@patch('lark_oapi.Client')
def test_parse_docx_with_pagination(mock_client):
# Mock multi-page block list response
...This would catch regressions in the API interaction layer without requiring live credentials.
5450430 to
35f45ba
Compare
Add FeishuParser that supports importing Feishu cloud documents (docx, wiki, sheets, bitable) into the knowledge base via URL. Supports: - Docx documents via Blocks API with attribute-driven block detection - Wiki pages (auto-resolve to underlying document type) - Spreadsheets via Sheets API - Bitable (multi-dimensional tables) via Bitable API - Embedded sheet views inside docx documents - Generic text extraction fallback for unknown block types Design: - Uses lark-oapi SDK for all API calls (auth, pagination, etc.) - Attribute-driven block detection: inspects which SDK attribute is populated rather than hard-coding 50+ block type integer constants - Follows convert-then-parse pattern: Feishu -> Markdown -> MarkdownParser - Lazy imports to avoid breaking when lark-oapi is not installed - FeishuConfig for credentials (env vars or ov.conf) Integration: - URL routing in media_processor.py for feishu.cn/larksuite.com - Parser registration with ImportError guard - parse-feishu optional dependency group in pyproject.toml Tested end-to-end: Feishu URL -> parse -> VikingFS -> L0/L1 generation -> vectorization -> semantic search, all working.
35f45ba to
594b424
Compare
Summary
FeishuParserthat imports Feishu cloud documents (docx, wiki, sheets, bitable) into the knowledge base via URLlark-oapiSDK for all API calls with attribute-driven block detectionMotivation
Feishu (飞书) is widely used for documentation in Chinese tech companies. This parser enables teams to directly import Feishu cloud documents into OpenViking's knowledge base by simply providing a URL, supporting automatic L0/L1 generation and semantic search.
What's Supported
*.feishu.cn/docx/{id}*.feishu.cn/wiki/{token}*.feishu.cn/sheets/{token}*.feishu.cn/base/{token}Design Highlights
_DOC_TYPE_HANDLERS), special block handling (_SPECIAL_BLOCK_HANDLERS), and text formatting (_TEXT_FORMAT) are all table-driven.lark-oapiis imported inside methods to avoid breaking when the optional dependency is not installed.bot-feishudependency group — no new dependency groups needed.Usage
Test Plan