feat: Use LLM components in Docling #9770
Conversation
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

Walkthrough

Adds optional picture classification and picture description stages to the Docling workflow. Introduces Pydantic-based secret-safe (de)serialization for LLM configs passed to the worker. Updates the UI component to accept the new inputs, serialize the LLM, and call the worker with keyword-only parameters. Adds langchain-docling to the optional dependencies.
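The (de)serialization mentioned above ships the LLM to the worker as a fully qualified class path plus a plain config dict. A minimal sketch of that envelope, using a plain class as a stand-in for the real Pydantic LLM model (all names here are illustrative, not the actual lfx API):

```python
def serialize_model(model) -> dict:
    """Pack an object into a class-path + config envelope for the worker."""
    cls = type(model)
    return {
        "__class_path__": f"{cls.__module__}.{cls.__qualname__}",
        "config": dict(vars(model)),  # real code would use pydantic's model_dump()
    }


class FakeChatModel:
    """Stand-in for e.g. langchain_openai's ChatOpenAI."""

    def __init__(self, model_name: str, temperature: float = 0.0):
        self.model_name = model_name
        self.temperature = temperature


envelope = serialize_model(FakeChatModel("gpt-4o-mini"))
# envelope["__class_path__"] ends with "FakeChatModel";
# envelope["config"] == {"model_name": "gpt-4o-mini", "temperature": 0.0}
```

The envelope is plain data, so it can cross the process boundary to the worker without pickling a live LLM object.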
Sequence Diagram(s)

sequenceDiagram
autonumber
actor User
participant UI as DoclingInlineComponent
participant W as docling_worker
participant D as Docling Pipeline
participant L as LLM (deserialized)
User->>UI: Provide files + options<br/>(pipeline, ocr_engine,<br/>do_picture_classification,<br/>pic_desc_llm?, pic_desc_prompt)
UI->>UI: Serialize pic_desc_llm -> pic_desc_config
UI->>W: process(files, pipeline, ocr_engine,<br/>do_picture_classification, pic_desc_config, pic_desc_prompt)
rect rgba(220,240,255,0.4)
note over W: Worker setup
W->>W: If pic_desc_config: deserialize LLM
W->>D: Configure pipeline options<br/>(picture classification flag)
alt pic_desc_config provided
W->>D: Enable picture description<br/>allow external plugins
W->>D: Set PictureDescriptionLangChainOptions<br/>(llm=L, prompt)
end
end
W->>D: Run extraction pipeline
D-->>W: Results
W-->>UI: Results via queue
UI-->>User: Converted Data objects
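The handoff in the diagram can be sketched with stdlib stand-ins (a plain `queue.Queue` instead of the real multiprocessing setup; function and field names are illustrative, not the real worker API):

```python
from queue import Queue


def docling_worker(*, files, queue, pic_desc_config=None, pic_desc_prompt=None):
    """Stand-in worker: picture description is enabled only if a config arrived."""
    use_llm = pic_desc_config is not None  # real code would deserialize the LLM here
    results = [{"file": f, "picture_description": use_llm} for f in files]
    queue.put(results)  # results travel back to the UI via the queue


q = Queue()
docling_worker(
    files=["report.pdf"],
    queue=q,
    pic_desc_config={"__class_path__": "langchain_openai.ChatOpenAI", "config": {}},
)
results = q.get()
# results == [{"file": "report.pdf", "picture_description": True}]
```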
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing Touches
🧪 Generate unit tests
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pyproject.toml (1)
207-211: Constrain langchain-docling major version
Add an upper bound so future releases can’t introduce breaking changes to the import paths/contracts:

```diff
 docling = [
     "docling>=2.36.1",
-    "langchain-docling>=1.1.0",
+    "langchain-docling>=1.1.0,<2.0.0",
     "tesserocr>=2.8.0",
     "rapidocr-onnxruntime>=1.4.4",
     "ocrmac>=1.0.0; sys_platform == 'darwin'",
 ]
```
🧹 Nitpick comments (4)
src/lfx/src/lfx/base/data/docling_utils.py (1)
153-154: Surface a clearer error when langchain-docling is missing.

Right now all ModuleNotFoundErrors become “Docling is an optional dependency…”. Distinguish langchain_docling for better UX.

```python
# Apply outside this hunk: tweak the except at Lines ~159-166
except ModuleNotFoundError as e:
    missing = e.name or "a Docling dependency"
    if missing in {"docling", "langchain_docling"}:
        msg = (
            f"{missing} is not installed. "
            "Install with `uv pip install 'langflow[docling]'` or see docs."
        )
    else:
        msg = f"Failed to import a Docling dependency: {missing}"
    queue.put({"error": msg})
    return
```

src/lfx/src/lfx/components/docling/docling_inline.py (3)
168-171: Nil guard looks good; consider an early import check for better UX.

If pic_desc_llm is set, you could early-check `import langchain_docling` here to fail fast before spawning a process.

```python
# Optional preflight (outside the hunk):
try:
    if self.pic_desc_llm is not None:
        import langchain_docling  # noqa: F401
except ModuleNotFoundError:
    raise ImportError(
        "langchain-docling is required for picture description. "
        "Install 'langflow[docling]'."
    )
```
214-221: Normalize worker error mapping for missing optional deps and shutdown.

Match the worker’s messages so users get ImportError for missing extras.

```diff
         if isinstance(result, dict) and "error" in result:
             msg = result["error"]
-            if msg.startswith("Docling is not installed"):
+            if (
+                msg.startswith("Docling is not installed")
+                or "optional dependency of Langflow" in msg
+                or "No module named 'langchain_docling'" in msg
+            ):
                 raise ImportError(msg)
             # Handle interrupt gracefully - return empty result instead of raising error
-            if "Worker interrupted by SIGINT" in msg or "shutdown" in result:
+            if "Worker interrupted by SIGINT" in msg or result.get("shutdown"):
                 self.log("Docling process cancelled by user")
                 result = []
             else:
                 raise RuntimeError(msg)
```
77-82: Validate chat-capable model at runtime

LangFlow doesn’t support input_types=["ChatModel"], so keep input_types=["LanguageModel"] and add a runtime check in PictureDescriptionLangChainOptions to reject non-chat LLMs, e.g.

```python
if not llm.is_chat_model:
    raise ValueError("Picture description requires a chat-capable model")
```
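A runtime guard along those lines could look like this (the stub classes stand in for langchain_core's `BaseLanguageModel`/`BaseChatModel` hierarchy; real code would `isinstance`-check against the imported LangChain classes rather than a hypothetical `is_chat_model` flag):

```python
class BaseLanguageModel:
    """Stand-in for langchain_core.language_models.BaseLanguageModel."""


class BaseChatModel(BaseLanguageModel):
    """Stand-in for langchain_core.language_models.BaseChatModel."""


def validate_chat_capable(llm: BaseLanguageModel) -> BaseLanguageModel:
    # Picture description drives the model through chat messages,
    # so reject completion-only LLMs before the pipeline starts.
    if not isinstance(llm, BaseChatModel):
        raise ValueError("Picture description requires a chat-capable model")
    return llm
```

Failing here, at configuration time, gives a clear message instead of an opaque error deep inside the Docling worker.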
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
- pyproject.toml (1 hunks)
- src/lfx/src/lfx/base/data/docling_utils.py (4 hunks)
- src/lfx/src/lfx/components/docling/docling_inline.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/lfx/src/lfx/components/docling/docling_inline.py (1)
src/lfx/src/lfx/base/data/docling_utils.py (2)
- _serialize_pydantic_model (80-84)
- docling_worker (95-304)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Validate PR
- GitHub Check: Update Starter Projects
- GitHub Check: Ruff Style Check (3.13)
- GitHub Check: Run Ruff Check and Format
🔇 Additional comments (3)
src/lfx/src/lfx/base/data/docling_utils.py (2)
95-104: No direct positional calls found; please verify dynamic/indirect invocations

Ripgrep found only the definition of docling_worker and no call sites; ensure any scheduler, CLI, or reflection-based usages are updated to keyword-only.
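Keyword-only parameters make any stale positional call fail loudly instead of silently misbinding arguments. A sketch of the pattern (this parameter list mirrors the review's description but is illustrative, not the guaranteed real signature):

```python
def docling_worker(*, files, queue, pipeline, ocr_engine,
                   do_picture_classification=False,
                   pic_desc_config=None, pic_desc_prompt=None):
    # The bare `*` forces every caller to name each argument.
    return {"files": files, "pipeline": pipeline}


# Keyword call binds correctly:
docling_worker(files=[], queue=None, pipeline="standard", ocr_engine="")

# A leftover positional call now raises TypeError instead of
# binding `pipeline` or `ocr_engine` to the wrong parameter:
try:
    docling_worker([], None, "standard", "")
except TypeError:
    print("positional call rejected")
```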
192-193: Verify PdfPipelineOptions supports do_picture_classification

At src/lfx/src/lfx/base/data/docling_utils.py:192-193, ensure your installed Docling (≥2.36.1) defines do_picture_classification on PdfPipelineOptions. Wrap the assignment in a hasattr guard or provide a sensible default if the attribute is missing.

src/lfx/src/lfx/components/docling/docling_inline.py (1)
176-184: Nice switch to kwargs.Clearer and future-proof; aligns with worker’s keyword-only signature.
```python
def _deserialize_pydantic_model(data: dict):
    module_name, class_name = data["__class_path__"].rsplit(".", 1)
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    adapter = TypeAdapter(cls)
    return adapter.validate_python(data["config"])
```
🛠️ Refactor suggestion
Harden deserialization: validate class path + robust errors.
Dynamic imports from unvalidated class paths are risky and brittle. Gate the module prefix, validate keys, and avoid KeyError/ValueError crashes.
```diff
-def _deserialize_pydantic_model(data: dict):
-    module_name, class_name = data["__class_path__"].rsplit(".", 1)
-    module = importlib.import_module(module_name)
-    cls = getattr(module, class_name)
-    adapter = TypeAdapter(cls)
-    return adapter.validate_python(data["config"])
+def _deserialize_pydantic_model(data: dict):
+    try:
+        class_path = data["__class_path__"]
+        module_name, class_name = class_path.rsplit(".", 1)
+    except (KeyError, ValueError) as e:
+        raise ValueError("Invalid serialized model: missing or malformed '__class_path__'.") from e
+    # Allow only known-safe prefixes to avoid arbitrary imports
+    ALLOWED_CLASS_PREFIXES = ("langchain_", "langchain", "langchain_core")
+    if not module_name.startswith(ALLOWED_CLASS_PREFIXES):
+        raise ValueError(f"Deserialization blocked for class path {class_path!r}")
+    module = importlib.import_module(module_name)
+    try:
+        cls = getattr(module, class_name)
+    except AttributeError as e:
+        raise ValueError(f"Class {class_name!r} not found in module {module_name!r}.") from e
+    adapter = TypeAdapter(cls)
+    return adapter.validate_python(data.get("config", {}))
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def _deserialize_pydantic_model(data: dict):
    try:
        class_path = data["__class_path__"]
        module_name, class_name = class_path.rsplit(".", 1)
    except (KeyError, ValueError) as e:
        raise ValueError("Invalid serialized model: missing or malformed '__class_path__'.") from e
    # Allow only known-safe prefixes to avoid arbitrary imports
    ALLOWED_CLASS_PREFIXES = ("langchain_", "langchain", "langchain_core")
    if not module_name.startswith(ALLOWED_CLASS_PREFIXES):
        raise ValueError(f"Deserialization blocked for class path {class_path!r}")
    module = importlib.import_module(module_name)
    try:
        cls = getattr(module, class_name)
    except AttributeError as e:
        raise ValueError(f"Class {class_name!r} not found in module {module_name!r}.") from e
    adapter = TypeAdapter(cls)
    return adapter.validate_python(data.get("config", {}))
```
🤖 Prompt for AI Agents
In src/lfx/src/lfx/base/data/docling_utils.py around lines 87 to 93, the
deserializer trusts data["__class_path__"] and other keys and does an
unrestricted dynamic import; validate presence and types of required keys (e.g.
ensure "__class_path__" and "config" exist and are dict-like), restrict allowed
module prefixes (maintain an allowlist/prefix whitelist and reject imports
outside it), wrap importlib.import_module/getattr/TypeAdapter.validate_python in
try/except and raise clear ValueError/TypeError with contextual messages on
failure (invalid path format, disallowed module, import error, missing class,
validation error) instead of letting KeyError/ValueError propagate; ensure
returned value is the validated model or raise a descriptive exception.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
```python
name="do_picture_classification",
display_name="Picture classification",
info="If enabled, the Docling pipeline will classify the pictures type.",
value=False,
```
NIT: I believe this is the default value for a BoolInput, so could be omitted if desired.
* configure Docling bundle with LC LLM objects
* cleanup old not used code
* pin released docling plugin
* use generic pydantic serialization
* Apply suggestions from code review
* [autofix.ci] apply automated fixes

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Hare <ericrhare@gmail.com>



Add options for the Docling component:
When vision-language models are used, configure the models by connecting LLM components.

The component has been tested with:

- langchain_openai.chat_models.base.ChatOpenAI
- langchain_ollama.chat_models.ChatOllama
- langchain_ibm.chat_models.ChatWatsonx