
feat(parse): add Feishu/Lark cloud document parser#831

Open
ryzn0518 wants to merge 1 commit into volcengine:main from ryzn0518:feat/feishu-parser

Conversation

@ryzn0518

Summary

  • Add FeishuParser that imports Feishu cloud documents (docx, wiki, sheets, bitable) into the knowledge base via URL
  • Uses lark-oapi SDK for all API calls with attribute-driven block detection
  • Follows the existing convert-then-parse pattern: Feishu → Markdown → MarkdownParser → VikingFS

Motivation

Feishu (飞书) is widely used for documentation in Chinese tech companies. This parser enables teams to directly import Feishu cloud documents into OpenViking's knowledge base by simply providing a URL, supporting automatic L0/L1 generation and semantic search.

What's Supported

| Document Type | URL Pattern | API Used |
| --- | --- | --- |
| Documents | *.feishu.cn/docx/{id} | Blocks API (attribute-driven) |
| Wiki pages | *.feishu.cn/wiki/{token} | Wiki API → auto-resolve → delegate |
| Spreadsheets | *.feishu.cn/sheets/{token} | Sheets v2/v3 API |
| Bitable | *.feishu.cn/base/{token} | Bitable API |
| Embedded sheets in docx | (auto-detected) | Sheets v2 API |

Design Highlights

  • Attribute-driven block detection: instead of hard-coding 50+ block-type integer constants, the parser inspects which SDK attribute is populated on each block. Unknown/future block types that carry text elements are auto-extracted.
  • Data-driven dispatch: Document type routing (_DOC_TYPE_HANDLERS), special block handling (_SPECIAL_BLOCK_HANDLERS), and text formatting (_TEXT_FORMAT) are all table-driven.
  • Lazy imports: lark-oapi is imported inside methods to avoid breaking when the optional dependency is not installed.
  • Reuses existing bot-feishu dependency group — no new dependency groups needed.
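The attribute-driven detection described above can be sketched roughly as follows — a minimal illustration with made-up attribute names and a fake block class, not the actual FeishuParser code:

```python
from typing import Optional

# Attributes every block carries that are never content (illustrative set).
_SKIP_ATTRS = {"block_id", "parent_id", "children", "block_type"}

def detect_content_attr(block) -> Optional[str]:
    """Return the name of the first populated content attribute, if any."""
    for attr in vars(block):
        if attr.startswith("_") or attr in _SKIP_ATTRS:
            continue
        if getattr(block, attr, None) is not None:
            return attr
    return None

class FakeBlock:
    """Stand-in for a lark-oapi block object (hypothetical)."""
    def __init__(self, **content):
        self.block_id = "b1"
        self.parent_id = "p0"
        self.heading1 = None
        self.code = None
        for name, value in content.items():
            setattr(self, name, value)

print(detect_content_attr(FakeBlock(heading1={"elements": []})))  # heading1
print(detect_content_attr(FakeBlock()))                           # None
```

The benefit is that a block type the SDK adds later still resolves to whichever content attribute it populates, without touching a constants table.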

Usage

# Configure credentials (env vars or ov.conf)
export FEISHU_APP_ID="cli_xxx"
export FEISHU_APP_SECRET="xxx"

# Import documents
ov add-resource "https://example.feishu.cn/docx/doxcnABC123"
ov add-resource "https://example.feishu.cn/wiki/wikiXYZ"
ov add-resource "https://example.feishu.cn/sheets/shtcn456"
ov add-resource "https://example.feishu.cn/base/bascn789"

# Incremental update
ov add-resource "https://example.feishu.cn/docx/xxx" -to viking://resources/my_doc

Test Plan

  • URL parsing for all document types (docx, wiki, sheets, base, larksuite.com)
  • Attribute-driven block → markdown conversion (headings, lists, code, tables, images, etc.)
  • Text style formatting (bold, italic, inline code, links, strikethrough)
  • Embedded sheet extraction and empty column trimming
  • Bitable field formatting (list, dict, None, string)
  • End-to-end: Feishu URL → parse → VikingFS → L0/L1 → vectorization → semantic search


CLAassistant commented Mar 21, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions

Failed to generate code suggestions for PR

@ryzn0518 force-pushed the feat/feishu-parser branch from d1e51e0 to f48db10 on March 21, 2026 01:26
@qin-ctx self-assigned this Mar 21, 2026
Collaborator

@qin-ctx left a comment


Good feature addition overall. The convert-then-parse pattern and data-driven dispatch design are solid choices.

One blocking issue: synchronous lark-oapi SDK calls inside async parse() will block the event loop — needs asyncio.to_thread() wrapping, consistent with how other parsers in this project handle blocking I/O (e.g., code/code.py).

See inline comments for details.

doc_title = title

# Delegate to MarkdownParser
from openviking.parse.parsers.markdown import MarkdownParser

[Bug] (blocking) Synchronous API calls block the async event loop.

_resolve_wiki_node(), _parse_docx(), _parse_sheets(), and _parse_bitable() all make synchronous HTTP calls via the lark-oapi SDK, but they are called directly from async def parse() without asyncio.to_thread() wrapping.

For large documents with pagination, these calls can block the event loop for several seconds. This project already uses asyncio.to_thread() for blocking I/O in other parsers (e.g., code/code.py:379). The same pattern should be applied here:

markdown, doc_title = await asyncio.to_thread(getattr(self, handler_name), token)

Similarly for _resolve_wiki_node on line 172.
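A runnable sketch of that suggested fix — the handler name and return shape are hypothetical stand-ins for the PR's methods:

```python
import asyncio
import time

def _parse_docx_blocking(token):
    """Stands in for the synchronous lark-oapi call chain (hypothetical)."""
    time.sleep(0.05)  # simulate blocking HTTP I/O
    return f"# Doc {token}", f"Doc {token}"

async def parse(token):
    # Run the blocking SDK call in a worker thread so the event loop stays free.
    markdown, doc_title = await asyncio.to_thread(_parse_docx_blocking, token)
    return markdown, doc_title

markdown, title = asyncio.run(parse("doxcnABC"))
print(title)  # Doc doxcnABC
```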

raw_req = (
    lark.BaseRequest.builder()
    .http_method(lark.HttpMethod.GET)
    .uri(f"/open-apis/docx/v1/documents/{doc_id}/blocks/{block_id}")

[Design] (non-blocking) doc_id = block.parent_id assumes the embedded sheet is a direct child of the page block.

If the embedded sheet is nested inside a container block (e.g., quote_container or table), parent_id would point to that container rather than the document. The API call to /open-apis/docx/v1/documents/{doc_id}/blocks/{block_id} would then fail or return unexpected data.

Consider passing document_id explicitly from _parse_docx() (where it's known) instead of inferring it from parent_id:

def _embedded_sheet_to_markdown(self, block, block_map=None, *, document_id=None, **_):
    doc_id = document_id or block.parent_id

val = getattr(block, attr, None)
if val is not None:
    return attr
return None

[Design] (non-blocking) Using dir(block) for block type detection is fragile.

dir() returns all attributes including methods, properties, and class-level attributes. The skip set only covers currently known non-content names. If a future lark-oapi SDK version adds a non-underscore helper method or property that returns a truthy value (e.g., to_json, validate, serialize), it would be falsely detected as a content attribute.

Consider using block_type (the integer already available on every block) as the primary dispatch mechanism, with attribute inspection as a fallback for unknown types:

# Primary: known block_type dispatch
# Fallback: attribute inspection for unknown types

This preserves the auto-compat benefit for unknown types while being robust for known ones.
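One possible shape for that hybrid dispatch — the block-type numbers, handlers, and fake block class here are illustrative, not the real Feishu constants:

```python
# Primary dispatch on known integer block_type values; attribute inspection
# only as a fallback for types the table doesn't know about.
KNOWN_HANDLERS = {
    2: lambda block: f"# {block.content}",   # e.g. a heading type
    1: lambda block: block.content,          # e.g. a plain-text type
}

def render_block(block) -> str:
    handler = KNOWN_HANDLERS.get(block.block_type)
    if handler is not None:
        return handler(block)
    # Fallback: extract any populated text-bearing attribute
    text = getattr(block, "content", None)
    return text if text is not None else ""

class FakeBlock:
    """Stand-in for a lark-oapi block object (hypothetical)."""
    def __init__(self, block_type, content=None):
        self.block_type = block_type
        self.content = content

print(render_block(FakeBlock(2, "Title")))   # # Title
print(render_block(FakeBlock(99, "misc")))   # misc (unknown type, fallback)
```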

if self._is_feishu_url(url):
    from openviking.parse.parsers.feishu import FeishuParser

    parser = FeishuParser()

[Design] (non-blocking) Creates a new FeishuParser instance (and thus a new lark-oapi client with separate auth token lifecycle) on every URL call.

The ParserRegistry already registers a FeishuParser instance. Consider either:

  • Retrieving the parser from the registry, or
  • Caching the lark-oapi client at class level (as _get_client() already does per-instance)

The current approach works but creates unnecessary overhead for repeated imports.

domain: str = "https://open.feishu.cn"
max_rows_per_sheet: int = 1000
max_records_per_table: int = 1000
download_images: bool = True

[Suggestion] (non-blocking) download_images and request_timeout (next line) are declared but never read by FeishuParser.

Images currently generate feishu://image/{token} placeholder links without actual downloading, and request_timeout is not passed to the lark-oapi client builder.

If these are planned for future work, consider adding a comment noting that. Otherwise, removing them avoids misleading users into thinking they have an effect.

host = parsed.hostname or ""
path = parsed.path
is_feishu_domain = host.endswith(".feishu.cn") or host.endswith(".larksuite.com")
has_doc_path = any(path.startswith(f"/{t}") for t in ("docx", "wiki", "sheets", "base"))

[Suggestion] (non-blocking) Prefix matching is slightly too broad.

path.startswith(f"/{t}") would match unintended paths like /docx-editor or /base-pricing (hypothetical non-document Feishu pages). Using segment-level matching would be more precise:

has_doc_path = any(
    path == f"/{t}" or path.startswith(f"/{t}/")
    for t in ("docx", "wiki", "sheets", "base")
)


# Code block (needs language from style)
if attr == "code":
    lang = ""

[Suggestion] (non-blocking) Ordered list counter doesn't reset between separate ordered lists under the same parent.

The counter is keyed by parent_id and monotonically incremented. If a document has two independent ordered lists under the same parent block, the second list continues numbering from where the first ended (e.g., 1-3 then 4-6 instead of 1-3 then 1-3).

A possible fix is to reset the counter when a non-ordered block is encountered for a given parent, or to use (parent_id, list_index) as the key.
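The reset-on-interruption variant could look roughly like this — illustrative, not the PR's actual code:

```python
# Counter per parent_id resets whenever a non-ordered block interrupts the
# run, so two separate ordered lists under one parent both start at 1.
def number_blocks(blocks):
    """blocks: list of (parent_id, kind) where kind is 'ordered' or other."""
    counters = {}
    out = []
    for parent_id, kind in blocks:
        if kind == "ordered":
            counters[parent_id] = counters.get(parent_id, 0) + 1
            out.append(counters[parent_id])
        else:
            counters.pop(parent_id, None)  # reset on interruption
            out.append(None)
    return out

# Two ordered lists under the same parent, separated by a text block:
print(number_blocks([
    ("p", "ordered"), ("p", "ordered"),
    ("p", "text"),
    ("p", "ordered"), ("p", "ordered"),
]))  # [1, 2, None, 1, 2]
```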


def test_all_empty(self):
    rows = [["", ""], ["", ""]]
    assert FeishuParser._trim_empty_columns(rows) == []

[Suggestion] (non-blocking) Good unit test coverage for utility functions, but no tests for the core parse methods (_parse_docx, _parse_sheets, _parse_bitable).

These methods contain pagination logic, error handling, and API response processing that would benefit from tests with mocked lark-oapi responses. For example:

@patch('lark_oapi.Client')
def test_parse_docx_with_pagination(mock_client):
    # Mock multi-page block list response
    ...

This would catch regressions in the API interaction layer without requiring live credentials.

@ryzn0518 force-pushed the feat/feishu-parser branch 2 times, most recently from 5450430 to 35f45ba on March 21, 2026 15:39
Add FeishuParser that supports importing Feishu cloud documents
(docx, wiki, sheets, bitable) into the knowledge base via URL.

Supports:
- Docx documents via Blocks API with attribute-driven block detection
- Wiki pages (auto-resolve to underlying document type)
- Spreadsheets via Sheets API
- Bitable (multi-dimensional tables) via Bitable API
- Embedded sheet views inside docx documents
- Generic text extraction fallback for unknown block types

Design:
- Uses lark-oapi SDK for all API calls (auth, pagination, etc.)
- Attribute-driven block detection: inspects which SDK attribute is
  populated rather than hard-coding 50+ block type integer constants
- Follows convert-then-parse pattern: Feishu -> Markdown -> MarkdownParser
- Lazy imports to avoid breaking when lark-oapi is not installed
- FeishuConfig for credentials (env vars or ov.conf)

Integration:
- URL routing in media_processor.py for feishu.cn/larksuite.com
- Parser registration with ImportError guard
- parse-feishu optional dependency group in pyproject.toml

Tested end-to-end: Feishu URL -> parse -> VikingFS -> L0/L1 generation
-> vectorization -> semantic search, all working.
@ryzn0518 force-pushed the feat/feishu-parser branch from 35f45ba to 594b424 on March 21, 2026 15:53