feat(parse): add Feishu/Lark cloud document parser by ryzn0518 · Pull Request #831 · volcengine/OpenViking

ryzn0518 · 2026-03-21T01:21:55Z

Summary

Add FeishuParser that imports Feishu cloud documents (docx, wiki, sheets, bitable) into the knowledge base via URL
Uses lark-oapi SDK for all API calls with attribute-driven block detection
Follows the existing convert-then-parse pattern: Feishu → Markdown → MarkdownParser → VikingFS

Motivation

Feishu (飞书) is widely used for documentation in Chinese tech companies. This parser enables teams to directly import Feishu cloud documents into OpenViking's knowledge base by simply providing a URL, supporting automatic L0/L1 generation and semantic search.

What's Supported

Document Type	URL Pattern	API Used
Documents	`*.feishu.cn/docx/{id}`	Blocks API (attribute-driven)
Wiki pages	`*.feishu.cn/wiki/{token}`	Wiki API → auto-resolve → delegate
Spreadsheets	`*.feishu.cn/sheets/{token}`	Sheets v2/v3 API
Bitable	`*.feishu.cn/base/{token}`	Bitable API
Embedded sheets in docx	(auto-detected)	Sheets v2 API

Design Highlights

Attribute-driven block detection: Instead of hard-coding 50+ block type integer constants, inspects which SDK attribute is populated per block. Unknown/future block types with text elements are auto-extracted.
Data-driven dispatch: Document type routing (_DOC_TYPE_HANDLERS), special block handling (_SPECIAL_BLOCK_HANDLERS), and text formatting (_TEXT_FORMAT) are all table-driven.
Lazy imports: lark-oapi is imported inside methods to avoid breaking when the optional dependency is not installed.
Reuses existing bot-feishu dependency group — no new dependency groups needed.
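
As a rough illustration of the attribute-driven detection and table-driven dispatch described above, here is a minimal sketch. The attribute names, the `_PREFIX` table, and the `SimpleNamespace` stand-ins for SDK blocks are all assumptions made for this example; the PR's actual tables (`_SPECIAL_BLOCK_HANDLERS`, `_TEXT_FORMAT`) are not reproduced here.

```python
from types import SimpleNamespace

# Hypothetical content attributes; the real SDK block model has many more.
_CONTENT_ATTRS = ("heading1", "heading2", "bullet", "text")

# Table-driven rendering: attribute name -> Markdown prefix (illustrative only).
_PREFIX = {"heading1": "# ", "heading2": "## ", "bullet": "- ", "text": ""}


def detect_attr(block):
    """Return the name of the first populated content attribute, or None."""
    for attr in _CONTENT_ATTRS:
        if getattr(block, attr, None) is not None:
            return attr
    return None


def render(block):
    """Render one block to a Markdown line via table lookup."""
    attr = detect_attr(block)
    if attr is None:
        return ""
    return _PREFIX[attr] + getattr(block, attr)


# Stand-in blocks; real blocks would come from the Blocks API.
blocks = [
    SimpleNamespace(heading1="Title", text=None),
    SimpleNamespace(bullet="first item"),
    SimpleNamespace(text="plain paragraph"),
]
markdown = "\n".join(render(b) for b in blocks)
```

Adding support for a new block kind in this sketch means adding one table entry rather than another branch.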

Usage

# Configure credentials (env vars or ov.conf)
export FEISHU_APP_ID="cli_xxx"
export FEISHU_APP_SECRET="xxx"

# Import documents
ov add-resource "https://example.feishu.cn/docx/doxcnABC123"
ov add-resource "https://example.feishu.cn/wiki/wikiXYZ"
ov add-resource "https://example.feishu.cn/sheets/shtcn456"
ov add-resource "https://example.feishu.cn/base/bascn789"

# Incremental update
ov add-resource "https://example.feishu.cn/docx/xxx" -to viking://resources/my_doc

Test Plan

URL parsing for all document types (docx, wiki, sheets, base, larksuite.com)
Attribute-driven block → markdown conversion (headings, lists, code, tables, images, etc.)
Text style formatting (bold, italic, inline code, links, strikethrough)
Embedded sheet extraction and empty column trimming
Bitable field formatting (list, dict, None, string)
End-to-end: Feishu URL → parse → VikingFS → L0/L1 → vectorization → semantic search

CLAassistant · 2026-03-21T01:22:02Z

All committers have signed the CLA.

github-actions · 2026-03-21T01:22:49Z

Failed to generate code suggestions for PR

qin-ctx

Good feature addition overall. The convert-then-parse pattern and data-driven dispatch design are solid choices.

One blocking issue: synchronous lark-oapi SDK calls inside async parse() will block the event loop — needs asyncio.to_thread() wrapping, consistent with how other parsers in this project handle blocking I/O (e.g., code/code.py).

See inline comments for details.

qin-ctx · 2026-03-21T04:30:15Z

openviking/parse/parsers/feishu.py

+                doc_title = title
+
+            # Delegate to MarkdownParser
+            from openviking.parse.parsers.markdown import MarkdownParser


[Bug] (blocking) Synchronous API calls block the async event loop.

_resolve_wiki_node(), _parse_docx(), _parse_sheets(), and _parse_bitable() all make synchronous HTTP calls via the lark-oapi SDK, but they are called directly from async def parse() without asyncio.to_thread() wrapping.

For large documents with pagination, these calls can block the event loop for several seconds. This project already uses asyncio.to_thread() for blocking I/O in other parsers (e.g., code/code.py:379). The same pattern should be applied here:

markdown, doc_title = await asyncio.to_thread(getattr(self, handler_name), token)

Similarly for _resolve_wiki_node on line 172.
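
To make the effect of the suggested change concrete, here is a self-contained sketch. `SketchParser` and its handler are hypothetical stand-ins (a `time.sleep` replaces the synchronous SDK call); only `asyncio.to_thread` itself is taken from the comment above. The background ticker keeps advancing while the handler runs, which shows that the event loop is not blocked.

```python
import asyncio
import time


class SketchParser:
    """Hypothetical stand-in for a parser with blocking handlers."""

    def _parse_docx(self, token):
        time.sleep(0.05)  # stands in for a synchronous SDK/HTTP call
        return f"# doc {token}", f"title-{token}"

    async def parse(self, token, handler_name="_parse_docx"):
        handler = getattr(self, handler_name)
        # Run the blocking handler in a worker thread so the loop stays free.
        return await asyncio.to_thread(handler, token)


async def main():
    parser = SketchParser()
    ticks = 0

    async def ticker():
        # Counts how often the event loop gets to run during the parse.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    result = await parser.parse("abc")
    task.cancel()
    return result, ticks


result, ticks = asyncio.run(main())
```

Calling `handler(token)` directly inside `parse()` instead would leave `ticks` at zero for the duration of the call.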

qin-ctx · 2026-03-21T04:30:15Z

openviking/parse/parsers/feishu.py

+        raw_req = (
+            lark.BaseRequest.builder()
+            .http_method(lark.HttpMethod.GET)
+            .uri(f"/open-apis/docx/v1/documents/{doc_id}/blocks/{block_id}")


[Design] (non-blocking) doc_id = block.parent_id assumes the embedded sheet is a direct child of the page block.

If the embedded sheet is nested inside a container block (e.g., quote_container or table), parent_id would point to that container rather than the document. The API call to /open-apis/docx/v1/documents/{doc_id}/blocks/{block_id} would then fail or return unexpected data.

Consider passing document_id explicitly from _parse_docx() (where it's known) instead of inferring it from parent_id:

def _embedded_sheet_to_markdown(self, block, block_map=None, *, document_id=None, **_):
    doc_id = document_id or block.parent_id

qin-ctx · 2026-03-21T04:30:15Z

openviking/parse/parsers/feishu.py

+            val = getattr(block, attr, None)
+            if val is not None:
+                return attr
+        return None


[Design] (non-blocking) Using dir(block) for block type detection is fragile.

dir() returns all attributes including methods, properties, and class-level attributes. The skip set only covers currently known non-content names. If a future lark-oapi SDK version adds a non-underscore helper method or property that returns a truthy value (e.g., to_json, validate, serialize), it would be falsely detected as a content attribute.

Consider using block_type (the integer already available on every block) as the primary dispatch mechanism, with attribute inspection as a fallback for unknown types:

# Primary: known block_type dispatch
# Fallback: attribute inspection for unknown types

This preserves the auto-compat benefit for unknown types while being robust for known ones.
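
One way this two-tier approach could look is sketched below. The `block_type` codes in `_KNOWN_TYPES` are illustrative values that have not been verified against the Feishu API, `_METADATA` is an assumed set of non-content field names, and the fallback assumes block data lives in instance attributes (so `vars()` sees it while methods and class-level helpers are ignored).

```python
from types import SimpleNamespace

# Illustrative block_type codes; check the Feishu docx API reference for real values.
_KNOWN_TYPES = {2: "text", 3: "heading1", 12: "bullet"}

# Assumed names of non-content fields present on every block.
_METADATA = {"block_id", "block_type", "parent_id", "children"}


def detect(block):
    """Return the content attribute name for a block, or None."""
    # Primary: dispatch on the integer block_type for known types.
    attr = _KNOWN_TYPES.get(getattr(block, "block_type", None))
    if attr is not None and getattr(block, attr, None) is not None:
        return attr
    # Fallback: inspect instance data only, never methods found via dir().
    for name, value in vars(block).items():
        if name.startswith("_") or name in _METADATA:
            continue
        if value is not None:
            return name
    return None


known = SimpleNamespace(block_type=3, block_id="b1", heading1="T")
unknown = SimpleNamespace(block_type=999, block_id="b2", future_widget={"x": 1})
empty = SimpleNamespace(block_type=999, block_id="b3")
```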

qin-ctx · 2026-03-21T04:30:15Z

openviking/utils/media_processor.py

+        if self._is_feishu_url(url):
+            from openviking.parse.parsers.feishu import FeishuParser
+
+            parser = FeishuParser()


[Design] (non-blocking) Creates a new FeishuParser instance (and thus a new lark-oapi client with separate auth token lifecycle) on every URL call.

The ParserRegistry already registers a FeishuParser instance. Consider either:

Retrieving the parser from the registry, or

Caching the lark-oapi client at class level (as _get_client() already does per-instance)

The current approach works but creates unnecessary overhead for repeated imports.
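
The second option could be sketched roughly as follows. `SketchParser` and `_build_client` are hypothetical; the placeholder `object()` stands in for whatever the lark-oapi client builder returns, and the lock is only there because parsers may be constructed from several threads.

```python
import threading


class SketchParser:
    """Hypothetical parser that shares one client across all instances."""

    _client = None
    _lock = threading.Lock()

    @classmethod
    def _get_client(cls):
        # Double-checked locking: build the shared client at most once.
        if cls._client is None:
            with cls._lock:
                if cls._client is None:
                    cls._client = cls._build_client()
        return cls._client

    @staticmethod
    def _build_client():
        # Placeholder for constructing the real SDK client.
        return object()


first = SketchParser()._get_client()
second = SketchParser()._get_client()
```

With this shape, creating a new parser per URL no longer creates a new client or token lifecycle.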

qin-ctx · 2026-03-21T04:30:15Z

openviking_cli/utils/config/parser_config.py

+    domain: str = "https://open.feishu.cn"
+    max_rows_per_sheet: int = 1000
+    max_records_per_table: int = 1000
+    download_images: bool = True


[Suggestion] (non-blocking) download_images and request_timeout (next line) are declared but never read by FeishuParser.

Images currently generate feishu://image/{token} placeholder links without actual downloading, and request_timeout is not passed to the lark-oapi client builder.

If these are planned for future work, consider adding a comment noting that. Otherwise, removing them avoids misleading users into thinking they have an effect.

qin-ctx · 2026-03-21T04:30:15Z

openviking/utils/media_processor.py

+        host = parsed.hostname or ""
+        path = parsed.path
+        is_feishu_domain = host.endswith(".feishu.cn") or host.endswith(".larksuite.com")
+        has_doc_path = any(path.startswith(f"/{t}") for t in ("docx", "wiki", "sheets", "base"))


[Suggestion] (non-blocking) Prefix matching is slightly too broad.

path.startswith(f"/{t}") would match unintended paths like /docx-editor or /base-pricing (hypothetical non-document Feishu pages). Using segment-level matching would be more precise:

has_doc_path = any(
    path == f"/{t}" or path.startswith(f"/{t}/")
    for t in ("docx", "wiki", "sheets", "base")
)
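
A standalone version of the same check, comparing the first path segment exactly, might look like the sketch below. The function name `is_feishu_doc_url` and the module-level constants are assumptions for this example; the domain suffixes and document path names are the ones shown in the diff above.

```python
from urllib.parse import urlparse

_DOC_SEGMENTS = ("docx", "wiki", "sheets", "base")
_DOMAINS = (".feishu.cn", ".larksuite.com")


def is_feishu_doc_url(url):
    """Return True if the URL points at a Feishu/Lark document path."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if not host.endswith(_DOMAINS):
        return False
    # "/docx/abc" splits into ["", "docx", "abc"]; compare the first segment exactly.
    parts = parsed.path.split("/")
    return len(parts) > 1 and parts[1] in _DOC_SEGMENTS
```

Here `/docx-editor` is rejected because its first segment is `docx-editor`, not `docx`.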

qin-ctx · 2026-03-21T04:30:15Z

openviking/parse/parsers/feishu.py

+
+        # Code block (needs language from style)
+        if attr == "code":
+            lang = ""


[Suggestion] (non-blocking) Ordered list counter doesn't reset between separate ordered lists under the same parent.

The counter is keyed by parent_id and monotonically incremented. If a document has two independent ordered lists under the same parent block, the second list continues numbering from where the first ended (e.g., 1-3 then 4-6 instead of 1-3 then 1-3).

A possible fix is to reset the counter when a non-ordered block is encountered for a given parent, or to use (parent_id, list_index) as the key.
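
The first option (reset on a non-ordered sibling) can be sketched as below. The block shape (`kind`, `parent_id`, `text`) and the function `number_blocks` are hypothetical simplifications, not the parser's real data model.

```python
from types import SimpleNamespace


def number_blocks(blocks):
    """Render blocks to lines, restarting numbering after a non-ordered block."""
    counters = {}  # parent_id -> last number used in the current ordered run
    lines = []
    for block in blocks:
        if block.kind == "ordered":
            counters[block.parent_id] = counters.get(block.parent_id, 0) + 1
            lines.append(f"{counters[block.parent_id]}. {block.text}")
        else:
            # A non-ordered sibling ends the current list for this parent.
            counters.pop(block.parent_id, None)
            lines.append(block.text)
    return lines


blocks = [
    SimpleNamespace(kind="ordered", parent_id="p", text="a"),
    SimpleNamespace(kind="ordered", parent_id="p", text="b"),
    SimpleNamespace(kind="text", parent_id="p", text="break"),
    SimpleNamespace(kind="ordered", parent_id="p", text="c"),
]
lines = number_blocks(blocks)
```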

qin-ctx · 2026-03-21T04:30:15Z

tests/parse/test_feishu_parser.py

+
+    def test_all_empty(self):
+        rows = [["", ""], ["", ""]]
+        assert FeishuParser._trim_empty_columns(rows) == []


[Suggestion] (non-blocking) Good unit test coverage for utility functions, but no tests for the core parse methods (_parse_docx, _parse_sheets, _parse_bitable).

These methods contain pagination logic, error handling, and API response processing that would benefit from tests with mocked lark-oapi responses. For example:

@patch('lark_oapi.Client')
def test_parse_docx_with_pagination(mock_client):
    # Mock multi-page block list response
    ...

This would catch regressions in the API interaction layer without requiring live credentials.
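
A runnable sketch of such a test is shown below. To avoid guessing at lark-oapi internals, it does not patch the SDK. Instead it tests a hypothetical pagination helper, `collect_blocks`, that takes the page-fetching callable as an argument. The page fields (`items`, `has_more`, `page_token`) are assumed names for this example.

```python
from types import SimpleNamespace
from unittest.mock import Mock


def collect_blocks(list_page, doc_id):
    """Hypothetical pagination loop: list_page(doc_id, page_token) -> page."""
    items, token = [], None
    while True:
        page = list_page(doc_id, token)
        items.extend(page.items)
        if not page.has_more:
            return items
        token = page.page_token


def test_collect_blocks_follows_page_tokens():
    # Two mocked pages: the first points at the second via its page token.
    list_page = Mock(side_effect=[
        SimpleNamespace(items=["b1", "b2"], has_more=True, page_token="t1"),
        SimpleNamespace(items=["b3"], has_more=False, page_token=None),
    ])
    assert collect_blocks(list_page, "doc") == ["b1", "b2", "b3"]
    calls = [c.args for c in list_page.call_args_list]
    assert calls == [("doc", None), ("doc", "t1")]
```

Injecting the fetch callable keeps the test independent of SDK class layout, at the cost of not exercising the request-building code.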

Add FeishuParser that supports importing Feishu cloud documents (docx, wiki, sheets, bitable) into the knowledge base via URL.

Supports:
- Docx documents via Blocks API with attribute-driven block detection
- Wiki pages (auto-resolve to underlying document type)
- Spreadsheets via Sheets API
- Bitable (multi-dimensional tables) via Bitable API
- Embedded sheet views inside docx documents
- Generic text extraction fallback for unknown block types

Design:
- Uses lark-oapi SDK for all API calls (auth, pagination, etc.)
- Attribute-driven block detection: inspects which SDK attribute is populated rather than hard-coding 50+ block type integer constants
- Follows convert-then-parse pattern: Feishu -> Markdown -> MarkdownParser
- Lazy imports to avoid breaking when lark-oapi is not installed
- FeishuConfig for credentials (env vars or ov.conf)

Integration:
- URL routing in media_processor.py for feishu.cn/larksuite.com
- Parser registration with ImportError guard
- parse-feishu optional dependency group in pyproject.toml

Tested end-to-end: Feishu URL -> parse -> VikingFS -> L0/L1 generation -> vectorization -> semantic search, all working.

github-project-automation bot added this to OpenViking project Mar 21, 2026

github-project-automation bot moved this to Backlog in OpenViking project Mar 21, 2026

ryzn0518 force-pushed the feat/feishu-parser branch from d1e51e0 to f48db10 Compare March 21, 2026 01:26

qin-ctx self-assigned this Mar 21, 2026

qin-ctx requested changes Mar 21, 2026

View reviewed changes

ryzn0518 force-pushed the feat/feishu-parser branch 2 times, most recently from 5450430 to 35f45ba Compare March 21, 2026 15:39

ryzn0518 force-pushed the feat/feishu-parser branch from 35f45ba to 594b424 Compare March 21, 2026 15:53

ryzn0518 wants to merge 1 commit into volcengine:main from ryzn0518:feat/feishu-parser
