feat: Use LLM components in Docling #9770

Merged
erichare merged 8 commits into langflow-ai:main from dolfim-ibm:feat-docling-bundle-using-lcllm
Sep 11, 2025

Conversation

@dolfim-ibm
Contributor

@dolfim-ibm dolfim-ibm commented Sep 9, 2025

Add options for the Docling component:

  • Picture classification
  • Picture description

When vision-language models are used, they are configured by connecting LLM components.

The component has been tested with:

  • LM Studio: langchain_openai.chat_models.base.ChatOpenAI
  • Ollama: langchain_ollama.chat_models.ChatOllama
  • watsonx.ai: langchain_ibm.chat_models.ChatWatsonx

Summary by CodeRabbit

  • New Features
    • Added optional picture classification and picture description stages.
    • New inputs: enable picture classification, select an LLM handle for picture description, and set a custom prompt.
  • Bug Fixes
    • More robust behavior when Docling isn’t installed or the worker shuts down, with clearer error handling and safe fallbacks.
  • Chores
    • Expanded optional “docling” install to include langchain-docling (>=1.1.0).

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
@coderabbitai
Contributor

coderabbitai bot commented Sep 9, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds optional picture classification and picture description stages to the Docling workflow. Introduces Pydantic-based secret-safe (de)serialization for LLM configs passed to the worker. Updates the UI component to accept new inputs, serialize the LLM, and call the worker with keyword-only parameters. Adds langchain-docling to optional dependencies.

Changes

Cohort / File(s) Summary
Dependencies
pyproject.toml
Adds langchain-docling>=1.1.0 to the docling optional dependency group.
Worker & serialization utilities
src/lfx/src/lfx/base/data/docling_utils.py
Adds secret-unwrapping and Pydantic model (de)serialization helpers. Extends docling_worker to keyword-only params: do_picture_classification, pic_desc_config, pic_desc_prompt. Enables optional picture-description stage with dynamic LLM deserialization and PictureDescriptionLangChainOptions. Imports adjusted for dynamic loading and typing.
UI component integration
src/lfx/src/lfx/components/docling/docling_inline.py
Adds inputs: do_picture_classification (Bool), pic_desc_llm (Handle), pic_desc_prompt (Str). Serializes pic_desc_llm via _serialize_pydantic_model and passes as pic_desc_config. Updates docling_worker invocation to keyword args. Simplifies in-process pipeline configuration and improves optional Docling handling.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant UI as DoclingInlineComponent
  participant W as docling_worker
  participant D as Docling Pipeline
  participant L as LLM (deserialized)

  User->>UI: Provide files + options<br/>(pipeline, ocr_engine,<br/>do_picture_classification,<br/>pic_desc_llm?, pic_desc_prompt)
  UI->>UI: Serialize pic_desc_llm -> pic_desc_config
  UI->>W: process(files, pipeline, ocr_engine,<br/>do_picture_classification, pic_desc_config, pic_desc_prompt)

  rect rgba(220,240,255,0.4)
    note over W: Worker setup
    W->>W: If pic_desc_config: deserialize LLM
    W->>D: Configure pipeline options<br/>(picture classification flag)
    alt pic_desc_config provided
      W->>D: Enable picture description<br/>allow external plugins
      W->>D: Set PictureDescriptionLangChainOptions<br/>(llm=L, prompt)
    end
  end

  W->>D: Run extraction pipeline
  D-->>W: Results
  W-->>UI: Results via queue
  UI-->>User: Converted Data objects

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

enhancement, size:XXL

Suggested reviewers

  • italojohnny
  • ogabrielluiz
  • jordanrfrazier
  • erichare

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly highlights the primary enhancement of integrating LLM components into the Docling workflow, which aligns directly with the core changes around dynamic LLM injection, picture classification, and description stages without extraneous detail.

@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Sep 9, 2025
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Sep 9, 2025
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
@dolfim-ibm dolfim-ibm marked this pull request as ready for review September 10, 2025 08:46
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Sep 10, 2025
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pyproject.toml (1)

207-211: Constrain langchain-docling major version
Add an upper bound so future releases can’t introduce breaking changes to the import paths/contracts:

 docling = [
     "docling>=2.36.1",
-    "langchain-docling>=1.1.0",
+    "langchain-docling>=1.1.0,<2.0.0",
     "tesserocr>=2.8.0",
     "rapidocr-onnxruntime>=1.4.4",
     "ocrmac>=1.0.0; sys_platform == 'darwin'",
 ]
🧹 Nitpick comments (4)
src/lfx/src/lfx/base/data/docling_utils.py (1)

153-154: Surface a clearer error when langchain-docling is missing.

Right now all ModuleNotFoundErrors become “Docling is an optional dependency…”. Distinguish langchain_docling for better UX.

# Apply outside this hunk: tweak the except at Lines ~159-166
except ModuleNotFoundError as e:
    missing = e.name or "a Docling dependency"
    if missing in {"docling", "langchain_docling"}:
        msg = (
            f"{missing} is not installed. "
            "Install with `uv pip install 'langflow[docling]'` or see docs."
        )
    else:
        msg = f"Failed to import a Docling dependency: {missing}"
    queue.put({"error": msg})
    return
src/lfx/src/lfx/components/docling/docling_inline.py (3)

168-171: Nil guard looks good; consider early import check for better UX.

If pic_desc_llm is set, you could early-check import langchain_docling here to fail fast before spawning a process.

# Optional preflight (outside the hunk):
try:
    if self.pic_desc_llm is not None:
        import langchain_docling  # noqa: F401
except ModuleNotFoundError:
    raise ImportError("langchain-docling is required for picture description. Install 'langflow[docling]'.")

214-221: Normalize worker error mapping for missing optional deps and shutdown.

Match the worker’s messages so users get ImportError for missing extras.

-        if isinstance(result, dict) and "error" in result:
+        if isinstance(result, dict) and "error" in result:
             msg = result["error"]
-            if msg.startswith("Docling is not installed"):
+            if (
+                msg.startswith("Docling is not installed")
+                or "optional dependency of Langflow" in msg
+                or "No module named 'langchain_docling'" in msg
+            ):
                 raise ImportError(msg)
             # Handle interrupt gracefully - return empty result instead of raising error
-            if "Worker interrupted by SIGINT" in msg or "shutdown" in result:
+            if "Worker interrupted by SIGINT" in msg or result.get("shutdown"):
                 self.log("Docling process cancelled by user")
                 result = []
             else:
                 raise RuntimeError(msg)

77-82: Validate chat-capable model at runtime
LangFlow doesn’t support input_types=["ChatModel"], so keep input_types=["LanguageModel"] and add a runtime check in PictureDescriptionLangChainOptions to reject non-chat LLMs, e.g.

if not llm.is_chat_model:
    raise ValueError("Picture description requires a chat-capable model")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4144d76 and 1df71aa.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
  • pyproject.toml (1 hunks)
  • src/lfx/src/lfx/base/data/docling_utils.py (4 hunks)
  • src/lfx/src/lfx/components/docling/docling_inline.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/lfx/src/lfx/components/docling/docling_inline.py (1)
src/lfx/src/lfx/base/data/docling_utils.py (2)
  • _serialize_pydantic_model (80-84)
  • docling_worker (95-304)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Validate PR
  • GitHub Check: Update Starter Projects
  • GitHub Check: Ruff Style Check (3.13)
  • GitHub Check: Run Ruff Check and Format
🔇 Additional comments (3)
src/lfx/src/lfx/base/data/docling_utils.py (2)

95-104: No direct positional calls found—please verify dynamic/indirect invocations
Ripgrep found only the definition of docling_worker and no call sites; ensure any scheduler, CLI or reflection-based usages are updated to keyword-only.


192-193: Verify PdfPipelineOptions supports do_picture_classification
At src/lfx/src/lfx/base/data/docling_utils.py:192-193, ensure your installed Docling (≥2.36.1) defines do_picture_classification on PdfPipelineOptions. Wrap its assignment in an hasattr guard or provide a sensible default if the attribute is missing.

src/lfx/src/lfx/components/docling/docling_inline.py (1)

176-184: Nice switch to kwargs.

Clearer and future-proof; aligns with worker’s keyword-only signature.

Comment on lines +87 to +93
def _deserialize_pydantic_model(data: dict):
module_name, class_name = data["__class_path__"].rsplit(".", 1)
module = importlib.import_module(module_name)
cls = getattr(module, class_name)
adapter = TypeAdapter(cls)
return adapter.validate_python(data["config"])
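
For illustration, a payload this helper consumes might look like the following; the class path matches one of the tested integrations listed in the PR description, but the config values are hypothetical, not captured from a real run.

```python
# Hypothetical serialized-LLM payload; splitting the class path is the
# first step the deserializer above performs.
payload = {
    "__class_path__": "langchain_ollama.chat_models.ChatOllama",
    "config": {"model": "granite3.2-vision", "temperature": 0.0},
}
module_name, class_name = payload["__class_path__"].rsplit(".", 1)
```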


🛠️ Refactor suggestion

Harden deserialization: validate class path + robust errors.

Dynamic imports from unvalidated class paths are risky and brittle. Gate the module prefix, validate keys, and avoid KeyError/ValueError crashes.

-def _deserialize_pydantic_model(data: dict):
-    module_name, class_name = data["__class_path__"].rsplit(".", 1)
-    module = importlib.import_module(module_name)
-    cls = getattr(module, class_name)
-    adapter = TypeAdapter(cls)
-    return adapter.validate_python(data["config"])
+def _deserialize_pydantic_model(data: dict):
+    try:
+        class_path = data["__class_path__"]
+        module_name, class_name = class_path.rsplit(".", 1)
+    except (KeyError, ValueError) as e:
+        raise ValueError("Invalid serialized model: missing or malformed '__class_path__'.") from e
+    # Allow only known-safe prefixes to avoid arbitrary imports
+    ALLOWED_CLASS_PREFIXES = ("langchain_", "langchain", "langchain_core")
+    if not module_name.startswith(ALLOWED_CLASS_PREFIXES):
+        raise ValueError(f"Deserialization blocked for class path {class_path!r}")
+    module = importlib.import_module(module_name)
+    try:
+        cls = getattr(module, class_name)
+    except AttributeError as e:
+        raise ValueError(f"Class {class_name!r} not found in module {module_name!r}.") from e
+    adapter = TypeAdapter(cls)
+    return adapter.validate_python(data.get("config", {}))
🤖 Prompt for AI Agents
In src/lfx/src/lfx/base/data/docling_utils.py around lines 87 to 93, the
deserializer trusts data["__class_path__"] and other keys and does an
unrestricted dynamic import; validate presence and types of required keys (e.g.
ensure "__class_path__" and "config" exist and are dict-like), restrict allowed
module prefixes (maintain an allowlist/prefix whitelist and reject imports
outside it), wrap importlib.import_module/getattr/TypeAdapter.validate_python in
try/except and raise clear ValueError/TypeError with contextual messages on
failure (invalid path format, disallowed module, import error, missing class,
validation error) instead of letting KeyError/ValueError propagate; ensure
returned value is the validated model or raise a descriptive exception.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Sep 10, 2025
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Sep 10, 2025
@erichare erichare self-requested a review September 10, 2025 15:34
@github-actions github-actions bot added enhancement New feature or request and removed enhancement New feature or request labels Sep 10, 2025

name="do_picture_classification",
display_name="Picture classification",
info="If enabled, the Docling pipeline will classify the pictures type.",
value=False,
Collaborator


NIT: I believe this is the default value for a BoolInput, so could be omitted if desired.

@erichare erichare added lgtm This PR has been approved by a maintainer fast-track Skip tests and sends PR into the merge queue labels Sep 10, 2025
@erichare erichare added this pull request to the merge queue Sep 10, 2025
Merged via the queue into langflow-ai:main with commit 061449b Sep 11, 2025
50 of 65 checks passed
lucaseduoli pushed a commit that referenced this pull request Sep 16, 2025
* configure Docling bundle with LC LLM objects

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* cleanup old not used code

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* pin released docling plugin

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* use generic pydantic serialization

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

* Apply suggestions from code review

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* [autofix.ci] apply automated fixes

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Hare <ericrhare@gmail.com>

Labels

enhancement New feature or request fast-track Skip tests and sends PR into the merge queue lgtm This PR has been approved by a maintainer
