feat: Use LLM components in Docling #9770
Conversation
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>

Walkthrough

Adds optional picture classification and picture description stages to the Docling workflow. Introduces Pydantic-based secret-safe (de)serialization for LLM configs passed to the worker. Updates the UI component to accept the new inputs, serialize the LLM, and call the worker with keyword-only parameters. Adds langchain-docling to the optional dependencies.
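The (de)serialization mentioned above ships the LLM to the worker as a fully qualified class path plus a plain config dict. A minimal sketch of that envelope, using a plain class as a stand-in for the real Pydantic LLM model (all names here are illustrative, not the actual lfx API):

```python
def serialize_model(model) -> dict:
    """Pack an object into a class-path + config envelope for the worker."""
    cls = type(model)
    return {
        "__class_path__": f"{cls.__module__}.{cls.__qualname__}",
        "config": dict(vars(model)),  # real code would use pydantic's model_dump()
    }


class FakeChatModel:
    """Stand-in for e.g. langchain_openai's ChatOpenAI."""

    def __init__(self, model_name: str, temperature: float = 0.0):
        self.model_name = model_name
        self.temperature = temperature


envelope = serialize_model(FakeChatModel("gpt-4o-mini"))
# envelope["__class_path__"] ends with "FakeChatModel";
# envelope["config"] == {"model_name": "gpt-4o-mini", "temperature": 0.0}
```

The envelope is plain data, so it can cross the process boundary to the worker without pickling a live LLM object.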
Sequence Diagram(s)

sequenceDiagram
autonumber
actor User
participant UI as DoclingInlineComponent
participant W as docling_worker
participant D as Docling Pipeline
participant L as LLM (deserialized)
User->>UI: Provide files + options<br/>(pipeline, ocr_engine,<br/>do_picture_classification,<br/>pic_desc_llm?, pic_desc_prompt)
UI->>UI: Serialize pic_desc_llm -> pic_desc_config
UI->>W: process(files, pipeline, ocr_engine,<br/>do_picture_classification, pic_desc_config, pic_desc_prompt)
rect rgba(220,240,255,0.4)
note over W: Worker setup
W->>W: If pic_desc_config: deserialize LLM
W->>D: Configure pipeline options<br/>(picture classification flag)
alt pic_desc_config provided
W->>D: Enable picture description<br/>allow external plugins
W->>D: Set PictureDescriptionLangChainOptions<br/>(llm=L, prompt)
end
end
W->>D: Run extraction pipeline
D-->>W: Results
W-->>UI: Results via queue
UI-->>User: Converted Data objects
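The handoff in the diagram can be sketched with stdlib stand-ins (a plain `queue.Queue` instead of the real multiprocessing setup; function and field names are illustrative, not the real worker API):

```python
from queue import Queue


def docling_worker(*, files, queue, pic_desc_config=None, pic_desc_prompt=None):
    """Stand-in worker: picture description is enabled only if a config arrived."""
    use_llm = pic_desc_config is not None  # real code would deserialize the LLM here
    results = [{"file": f, "picture_description": use_llm} for f in files]
    queue.put(results)  # results travel back to the UI via the queue


q = Queue()
docling_worker(
    files=["report.pdf"],
    queue=q,
    pic_desc_config={"__class_path__": "langchain_openai.ChatOpenAI", "config": {}},
)
results = q.get()
# results == [{"file": "report.pdf", "picture_description": True}]
```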
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing Touches
🧪 Generate unit tests
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pyproject.toml (1)
207-211: Constrain langchain-docling major version
Add an upper bound so future releases can’t introduce breaking changes to the import paths/contracts:

```diff
 docling = [
     "docling>=2.36.1",
-    "langchain-docling>=1.1.0",
+    "langchain-docling>=1.1.0,<2.0.0",
     "tesserocr>=2.8.0",
     "rapidocr-onnxruntime>=1.4.4",
     "ocrmac>=1.0.0; sys_platform == 'darwin'",
 ]
```
🧹 Nitpick comments (4)
src/lfx/src/lfx/base/data/docling_utils.py (1)
153-154: Surface a clearer error when langchain-docling is missing.

Right now all ModuleNotFoundErrors become “Docling is an optional dependency…”. Distinguish langchain_docling for better UX.

```python
# Apply outside this hunk: tweak the except at Lines ~159-166
except ModuleNotFoundError as e:
    missing = e.name or "a Docling dependency"
    if missing in {"docling", "langchain_docling"}:
        msg = (
            f"{missing} is not installed. "
            "Install with `uv pip install 'langflow[docling]'` or see docs."
        )
    else:
        msg = f"Failed to import a Docling dependency: {missing}"
    queue.put({"error": msg})
    return
```

src/lfx/src/lfx/components/docling/docling_inline.py (3)
168-171: Nil guard looks good; consider an early import check for better UX.

If pic_desc_llm is set, you could early-check `import langchain_docling` here to fail fast before spawning a process.

```python
# Optional preflight (outside the hunk):
try:
    if self.pic_desc_llm is not None:
        import langchain_docling  # noqa: F401
except ModuleNotFoundError:
    raise ImportError(
        "langchain-docling is required for picture description. "
        "Install 'langflow[docling]'."
    )
```
214-221: Normalize worker error mapping for missing optional deps and shutdown.

Match the worker’s messages so users get ImportError for missing extras.

```diff
         if isinstance(result, dict) and "error" in result:
             msg = result["error"]
-            if msg.startswith("Docling is not installed"):
+            if (
+                msg.startswith("Docling is not installed")
+                or "optional dependency of Langflow" in msg
+                or "No module named 'langchain_docling'" in msg
+            ):
                 raise ImportError(msg)
             # Handle interrupt gracefully - return empty result instead of raising error
-            if "Worker interrupted by SIGINT" in msg or "shutdown" in result:
+            if "Worker interrupted by SIGINT" in msg or result.get("shutdown"):
                 self.log("Docling process cancelled by user")
                 result = []
             else:
                 raise RuntimeError(msg)
```
77-82: Validate chat-capable model at runtime

LangFlow doesn’t support input_types=["ChatModel"], so keep input_types=["LanguageModel"] and add a runtime check in PictureDescriptionLangChainOptions to reject non-chat LLMs, e.g.

```python
if not llm.is_chat_model:
    raise ValueError("Picture description requires a chat-capable model")
```
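A runtime guard along those lines could look like this (the stub classes stand in for langchain_core's `BaseLanguageModel`/`BaseChatModel` hierarchy; real code would `isinstance`-check against the imported LangChain classes rather than a hypothetical `is_chat_model` flag):

```python
class BaseLanguageModel:
    """Stand-in for langchain_core.language_models.BaseLanguageModel."""


class BaseChatModel(BaseLanguageModel):
    """Stand-in for langchain_core.language_models.BaseChatModel."""


def validate_chat_capable(llm: BaseLanguageModel) -> BaseLanguageModel:
    # Picture description drives the model through chat messages,
    # so reject completion-only LLMs before the pipeline starts.
    if not isinstance(llm, BaseChatModel):
        raise ValueError("Picture description requires a chat-capable model")
    return llm
```

Failing here, at configuration time, gives a clear message instead of an opaque error deep inside the Docling worker.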
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
uv.lock is excluded by !**/*.lock
📒 Files selected for processing (3)
- pyproject.toml (1 hunks)
- src/lfx/src/lfx/base/data/docling_utils.py (4 hunks)
- src/lfx/src/lfx/components/docling/docling_inline.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/lfx/src/lfx/components/docling/docling_inline.py (1)
src/lfx/src/lfx/base/data/docling_utils.py (2)
- _serialize_pydantic_model (80-84)
- docling_worker (95-304)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Validate PR
- GitHub Check: Update Starter Projects
- GitHub Check: Ruff Style Check (3.13)
- GitHub Check: Run Ruff Check and Format
🔇 Additional comments (3)
src/lfx/src/lfx/base/data/docling_utils.py (2)
95-104: No direct positional calls found; please verify dynamic/indirect invocations

Ripgrep found only the definition of docling_worker and no call sites; ensure any scheduler, CLI, or reflection-based usages are updated to keyword-only.
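Keyword-only parameters make any stale positional call fail loudly instead of silently misbinding arguments. A sketch of the pattern (this parameter list mirrors the review's description but is illustrative, not the guaranteed real signature):

```python
def docling_worker(*, files, queue, pipeline, ocr_engine,
                   do_picture_classification=False,
                   pic_desc_config=None, pic_desc_prompt=None):
    # The bare `*` forces every caller to name each argument.
    return {"files": files, "pipeline": pipeline}


# Keyword call binds correctly:
docling_worker(files=[], queue=None, pipeline="standard", ocr_engine="")

# A leftover positional call now raises TypeError instead of
# binding `pipeline` or `ocr_engine` to the wrong parameter:
try:
    docling_worker([], None, "standard", "")
except TypeError:
    print("positional call rejected")
```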
192-193: Verify PdfPipelineOptions supports do_picture_classification

At src/lfx/src/lfx/base/data/docling_utils.py:192-193, ensure your installed Docling (≥2.36.1) defines do_picture_classification on PdfPipelineOptions. Wrap the assignment in a hasattr guard or provide a sensible default if the attribute is missing.

src/lfx/src/lfx/components/docling/docling_inline.py (1)
176-184: Nice switch to kwargs.Clearer and future-proof; aligns with worker’s keyword-only signature.
```python
def _deserialize_pydantic_model(data: dict):
    module_name, class_name = data["__class_path__"].rsplit(".", 1)
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    adapter = TypeAdapter(cls)
    return adapter.validate_python(data["config"])
```
🛠️ Refactor suggestion
Harden deserialization: validate class path + robust errors.
Dynamic imports from unvalidated class paths are risky and brittle. Gate the module prefix, validate keys, and avoid KeyError/ValueError crashes.
```diff
-def _deserialize_pydantic_model(data: dict):
-    module_name, class_name = data["__class_path__"].rsplit(".", 1)
-    module = importlib.import_module(module_name)
-    cls = getattr(module, class_name)
-    adapter = TypeAdapter(cls)
-    return adapter.validate_python(data["config"])
+def _deserialize_pydantic_model(data: dict):
+    try:
+        class_path = data["__class_path__"]
+        module_name, class_name = class_path.rsplit(".", 1)
+    except (KeyError, ValueError) as e:
+        raise ValueError("Invalid serialized model: missing or malformed '__class_path__'.") from e
+    # Allow only known-safe prefixes to avoid arbitrary imports
+    ALLOWED_CLASS_PREFIXES = ("langchain_", "langchain", "langchain_core")
+    if not module_name.startswith(ALLOWED_CLASS_PREFIXES):
+        raise ValueError(f"Deserialization blocked for class path {class_path!r}")
+    module = importlib.import_module(module_name)
+    try:
+        cls = getattr(module, class_name)
+    except AttributeError as e:
+        raise ValueError(f"Class {class_name!r} not found in module {module_name!r}.") from e
+    adapter = TypeAdapter(cls)
+    return adapter.validate_python(data.get("config", {}))
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def _deserialize_pydantic_model(data: dict):
    try:
        class_path = data["__class_path__"]
        module_name, class_name = class_path.rsplit(".", 1)
    except (KeyError, ValueError) as e:
        raise ValueError("Invalid serialized model: missing or malformed '__class_path__'.") from e
    # Allow only known-safe prefixes to avoid arbitrary imports
    ALLOWED_CLASS_PREFIXES = ("langchain_", "langchain", "langchain_core")
    if not module_name.startswith(ALLOWED_CLASS_PREFIXES):
        raise ValueError(f"Deserialization blocked for class path {class_path!r}")
    module = importlib.import_module(module_name)
    try:
        cls = getattr(module, class_name)
    except AttributeError as e:
        raise ValueError(f"Class {class_name!r} not found in module {module_name!r}.") from e
    adapter = TypeAdapter(cls)
    return adapter.validate_python(data.get("config", {}))
```
🤖 Prompt for AI Agents
In src/lfx/src/lfx/base/data/docling_utils.py around lines 87 to 93, the
deserializer trusts data["__class_path__"] and other keys and does an
unrestricted dynamic import; validate presence and types of required keys (e.g.
ensure "__class_path__" and "config" exist and are dict-like), restrict allowed
module prefixes (maintain an allowlist/prefix whitelist and reject imports
outside it), wrap importlib.import_module/getattr/TypeAdapter.validate_python in
try/except and raise clear ValueError/TypeError with contextual messages on
failure (invalid path format, disallowed module, import error, missing class,
validation error) instead of letting KeyError/ValueError propagate; ensure
returned value is the validated model or raise a descriptive exception.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
```python
name="do_picture_classification",
display_name="Picture classification",
info="If enabled, the Docling pipeline will classify the pictures type.",
value=False,
```
NIT: I believe this is the default value for a BoolInput, so could be omitted if desired.
* configure Docling bundle with LC LLM objects
* cleanup old not used code
* pin released docling plugin
* use generic pydantic serialization
* Apply suggestions from code review
* [autofix.ci] apply automated fixes

---------

Signed-off-by: Michele Dolfi <dol@zurich.ibm.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Hare <ericrhare@gmail.com>



Add options for the Docling component:
When vision-language models are used, configure the models by connecting LLM components.

The component has been tested with:

- langchain_openai.chat_models.base.ChatOpenAI
- langchain_ollama.chat_models.ChatOllama
- langchain_ibm.chat_models.ChatWatsonx