Skip to content

Changes in ingestion flow to support docling chunking component.#900

Open
ricofurtado wants to merge 4 commits intomainfrom
docling-based-chunkers-support
Open

Changes in ingestion flow to support docling chunking component.#900
ricofurtado wants to merge 4 commits intomainfrom
docling-based-chunkers-support

Conversation

@ricofurtado
Copy link
Collaborator

Changes in default ingestion flow so it can support docling-based chunking.

Copy link
Collaborator

@mpawlow mpawlow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ricofurtado

Code Review 1

  • See PR comments: (a) to (f)
  • Note: No functional review was performed as part of this code review

"show": true,
"title_case": false,
"type": "code",
"value": "import json\n\nimport tiktoken\nfrom docling_core.transforms.chunker import BaseChunker, DocMeta\nfrom docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker\n\nfrom lfx.base.data.docling_utils import extract_docling_documents\nfrom lfx.custom import Component\nfrom lfx.io import DropdownInput, HandleInput, IntInput, MessageTextInput, Output, StrInput\nfrom lfx.schema import Data, DataFrame\n\n\nclass ChunkDoclingDocumentComponent(Component):\n display_name: str = \"Chunk DoclingDocument\"\n description: str = \"Use the DocumentDocument chunkers to split the document into chunks.\"\n documentation = \"https://docling-project.github.io/docling/concepts/chunking/\"\n icon = \"Docling\"\n name = \"ChunkDoclingDocument\"\n\n inputs = [\n HandleInput(\n name=\"data_inputs\",\n display_name=\"Data or DataFrame\",\n info=\"The data with documents to split in chunks.\",\n input_types=[\"Data\", \"DataFrame\"],\n required=True,\n ),\n DropdownInput(\n name=\"chunker\",\n display_name=\"Chunker\",\n options=[\"HybridChunker\", \"HierarchicalChunker\"],\n info=(\"Which chunker to use.\"),\n value=\"HybridChunker\",\n real_time_refresh=True,\n ),\n DropdownInput(\n name=\"provider\",\n display_name=\"Provider\",\n options=[\"Hugging Face\", \"OpenAI\"],\n info=(\"Which tokenizer provider.\"),\n value=\"Hugging Face\",\n show=True,\n real_time_refresh=True,\n advanced=True,\n dynamic=True,\n ),\n StrInput(\n name=\"hf_model_name\",\n display_name=\"HF model name\",\n info=(\n \"Model name of the tokenizer to use with the HybridChunker when Hugging Face is chosen as a tokenizer.\"\n ),\n value=\"sentence-transformers/all-MiniLM-L6-v2\",\n show=True,\n advanced=True,\n dynamic=True,\n ),\n StrInput(\n name=\"openai_model_name\",\n display_name=\"OpenAI model name\",\n info=(\"Model name of the tokenizer to use with the HybridChunker when OpenAI is chosen as a tokenizer.\"),\n value=\"gpt-4o\",\n show=False,\n advanced=True,\n dynamic=True,\n ),\n IntInput(\n name=\"max_tokens\",\n display_name=\"Maximum tokens\",\n info=(\"Maximum number of tokens for the HybridChunker.\"),\n show=True,\n required=False,\n advanced=True,\n dynamic=True,\n ),\n MessageTextInput(\n name=\"doc_key\",\n display_name=\"Doc Key\",\n info=\"The key to use for the DoclingDocument column.\",\n value=\"doc\",\n advanced=True,\n ),\n ]\n\n outputs = [\n Output(display_name=\"DataFrame\", name=\"dataframe\", method=\"chunk_documents\"),\n ]\n\n def update_build_config(self, build_config: dict, field_value: str, field_name: str | None = None) -> dict:\n if field_name == \"chunker\":\n provider_type = build_config[\"provider\"][\"value\"]\n is_hf = provider_type == \"Hugging Face\"\n is_openai = provider_type == \"OpenAI\"\n if field_value == \"HybridChunker\":\n build_config[\"provider\"][\"show\"] = True\n build_config[\"hf_model_name\"][\"show\"] = is_hf\n build_config[\"openai_model_name\"][\"show\"] = is_openai\n build_config[\"max_tokens\"][\"show\"] = True\n else:\n build_config[\"provider\"][\"show\"] = False\n build_config[\"hf_model_name\"][\"show\"] = False\n build_config[\"openai_model_name\"][\"show\"] = False\n build_config[\"max_tokens\"][\"show\"] = False\n elif field_name == \"provider\" and build_config[\"chunker\"][\"value\"] == \"HybridChunker\":\n if field_value == \"Hugging Face\":\n build_config[\"hf_model_name\"][\"show\"] = True\n build_config[\"openai_model_name\"][\"show\"] = False\n elif field_value == \"OpenAI\":\n build_config[\"hf_model_name\"][\"show\"] = False\n build_config[\"openai_model_name\"][\"show\"] = True\n\n return build_config\n\n def _docs_to_data(self, docs) -> list[Data]:\n return [Data(text=doc.page_content, data=doc.metadata) for doc in docs]\n\n def chunk_documents(self) -> DataFrame:\n documents = extract_docling_documents(self.data_inputs, self.doc_key)\n\n chunker: BaseChunker\n if self.chunker == \"HybridChunker\":\n try:\n from docling_core.transforms.chunker.hybrid_chunker import HybridChunker\n except ImportError as e:\n msg = (\n \"HybridChunker is not installed. Please install it with `uv pip install docling-core[chunking] \"\n \"or `uv pip install transformers`\"\n )\n raise ImportError(msg) from e\n max_tokens: int | None = self.max_tokens if self.max_tokens else None\n if self.provider == \"Hugging Face\":\n try:\n from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\n except ImportError as e:\n msg = (\n \"HuggingFaceTokenizer is not installed.\"\n \" Please install it with `uv pip install docling-core[chunking]`\"\n )\n raise ImportError(msg) from e\n tokenizer = HuggingFaceTokenizer.from_pretrained(\n model_name=self.hf_model_name,\n max_tokens=max_tokens,\n )\n elif self.provider == \"OpenAI\":\n try:\n from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer\n except ImportError as e:\n msg = (\n \"OpenAITokenizer is not installed.\"\n \" Please install it with `uv pip install docling-core[chunking]`\"\n \" or `uv pip install transformers`\"\n )\n raise ImportError(msg) from e\n if max_tokens is None:\n max_tokens = 128 * 1024 # context window length required for OpenAI tokenizers\n tokenizer = OpenAITokenizer(\n tokenizer=tiktoken.encoding_for_model(self.openai_model_name), max_tokens=max_tokens\n )\n chunker = HybridChunker(\n tokenizer=tokenizer,\n )\n elif self.chunker == \"HierarchicalChunker\":\n chunker = HierarchicalChunker()\n\n results: list[Data] = []\n try:\n for doc in documents:\n for chunk in chunker.chunk(dl_doc=doc):\n enriched_text = chunker.contextualize(chunk=chunk)\n meta = DocMeta.model_validate(chunk.meta)\n\n results.append(\n Data(\n data={\n \"text\": enriched_text,\n \"document_id\": f\"{doc.origin.binary_hash}\",\n \"doc_items\": json.dumps([item.self_ref for item in meta.doc_items]),\n }\n )\n )\n\n except Exception as e:\n msg = f\"Error splitting text: {e}\"\n raise TypeError(msg) from e\n\n return DataFrame(results)\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(b) [Normal] tokenizer variable unbound in HybridChunker path

  • Similar to (a)
  • Within the HybridChunker branch, UnboundLocalError if provider is neither value
  if self.provider == "Hugging Face":
      tokenizer = HuggingFaceTokenizer.from_pretrained(...)
  elif self.provider == "OpenAI":
      tokenizer = OpenAITokenizer(...)

  chunker = HybridChunker(tokenizer=tokenizer)  # <<<<<<<<<<<
  • If self.provider has an unexpected value, tokenizer is unbound.
  • Potential Solution: Add an else branch with an error message

"show": true,
"title_case": false,
"type": "code",
"value": "import json\n\nimport tiktoken\nfrom docling_core.transforms.chunker import BaseChunker, DocMeta\nfrom docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker\n\nfrom lfx.base.data.docling_utils import extract_docling_documents\nfrom lfx.custom import Component\nfrom lfx.io import DropdownInput, HandleInput, IntInput, MessageTextInput, Output, StrInput\nfrom lfx.schema import Data, DataFrame\n\n\nclass ChunkDoclingDocumentComponent(Component):\n display_name: str = \"Chunk DoclingDocument\"\n description: str = \"Use the DocumentDocument chunkers to split the document into chunks.\"\n documentation = \"https://docling-project.github.io/docling/concepts/chunking/\"\n icon = \"Docling\"\n name = \"ChunkDoclingDocument\"\n\n inputs = [\n HandleInput(\n name=\"data_inputs\",\n display_name=\"Data or DataFrame\",\n info=\"The data with documents to split in chunks.\",\n input_types=[\"Data\", \"DataFrame\"],\n required=True,\n ),\n DropdownInput(\n name=\"chunker\",\n display_name=\"Chunker\",\n options=[\"HybridChunker\", \"HierarchicalChunker\"],\n info=(\"Which chunker to use.\"),\n value=\"HybridChunker\",\n real_time_refresh=True,\n ),\n DropdownInput(\n name=\"provider\",\n display_name=\"Provider\",\n options=[\"Hugging Face\", \"OpenAI\"],\n info=(\"Which tokenizer provider.\"),\n value=\"Hugging Face\",\n show=True,\n real_time_refresh=True,\n advanced=True,\n dynamic=True,\n ),\n StrInput(\n name=\"hf_model_name\",\n display_name=\"HF model name\",\n info=(\n \"Model name of the tokenizer to use with the HybridChunker when Hugging Face is chosen as a tokenizer.\"\n ),\n value=\"sentence-transformers/all-MiniLM-L6-v2\",\n show=True,\n advanced=True,\n dynamic=True,\n ),\n StrInput(\n name=\"openai_model_name\",\n display_name=\"OpenAI model name\",\n info=(\"Model name of the tokenizer to use with the HybridChunker when OpenAI is chosen as a tokenizer.\"),\n value=\"gpt-4o\",\n show=False,\n advanced=True,\n dynamic=True,\n ),\n IntInput(\n name=\"max_tokens\",\n display_name=\"Maximum tokens\",\n info=(\"Maximum number of tokens for the HybridChunker.\"),\n show=True,\n required=False,\n advanced=True,\n dynamic=True,\n ),\n MessageTextInput(\n name=\"doc_key\",\n display_name=\"Doc Key\",\n info=\"The key to use for the DoclingDocument column.\",\n value=\"doc\",\n advanced=True,\n ),\n ]\n\n outputs = [\n Output(display_name=\"DataFrame\", name=\"dataframe\", method=\"chunk_documents\"),\n ]\n\n def update_build_config(self, build_config: dict, field_value: str, field_name: str | None = None) -> dict:\n if field_name == \"chunker\":\n provider_type = build_config[\"provider\"][\"value\"]\n is_hf = provider_type == \"Hugging Face\"\n is_openai = provider_type == \"OpenAI\"\n if field_value == \"HybridChunker\":\n build_config[\"provider\"][\"show\"] = True\n build_config[\"hf_model_name\"][\"show\"] = is_hf\n build_config[\"openai_model_name\"][\"show\"] = is_openai\n build_config[\"max_tokens\"][\"show\"] = True\n else:\n build_config[\"provider\"][\"show\"] = False\n build_config[\"hf_model_name\"][\"show\"] = False\n build_config[\"openai_model_name\"][\"show\"] = False\n build_config[\"max_tokens\"][\"show\"] = False\n elif field_name == \"provider\" and build_config[\"chunker\"][\"value\"] == \"HybridChunker\":\n if field_value == \"Hugging Face\":\n build_config[\"hf_model_name\"][\"show\"] = True\n build_config[\"openai_model_name\"][\"show\"] = False\n elif field_value == \"OpenAI\":\n build_config[\"hf_model_name\"][\"show\"] = False\n build_config[\"openai_model_name\"][\"show\"] = True\n\n return build_config\n\n def _docs_to_data(self, docs) -> list[Data]:\n return [Data(text=doc.page_content, data=doc.metadata) for doc in docs]\n\n def chunk_documents(self) -> DataFrame:\n documents = extract_docling_documents(self.data_inputs, self.doc_key)\n\n chunker: BaseChunker\n if self.chunker == \"HybridChunker\":\n try:\n from docling_core.transforms.chunker.hybrid_chunker import HybridChunker\n except ImportError as e:\n msg = (\n \"HybridChunker is not installed. Please install it with `uv pip install docling-core[chunking] \"\n \"or `uv pip install transformers`\"\n )\n raise ImportError(msg) from e\n max_tokens: int | None = self.max_tokens if self.max_tokens else None\n if self.provider == \"Hugging Face\":\n try:\n from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\n except ImportError as e:\n msg = (\n \"HuggingFaceTokenizer is not installed.\"\n \" Please install it with `uv pip install docling-core[chunking]`\"\n )\n raise ImportError(msg) from e\n tokenizer = HuggingFaceTokenizer.from_pretrained(\n model_name=self.hf_model_name,\n max_tokens=max_tokens,\n )\n elif self.provider == \"OpenAI\":\n try:\n from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer\n except ImportError as e:\n msg = (\n \"OpenAITokenizer is not installed.\"\n \" Please install it with `uv pip install docling-core[chunking]`\"\n \" or `uv pip install transformers`\"\n )\n raise ImportError(msg) from e\n if max_tokens is None:\n max_tokens = 128 * 1024 # context window length required for OpenAI tokenizers\n tokenizer = OpenAITokenizer(\n tokenizer=tiktoken.encoding_for_model(self.openai_model_name), max_tokens=max_tokens\n )\n chunker = HybridChunker(\n tokenizer=tokenizer,\n )\n elif self.chunker == \"HierarchicalChunker\":\n chunker = HierarchicalChunker()\n\n results: list[Data] = []\n try:\n for doc in documents:\n for chunk in chunker.chunk(dl_doc=doc):\n enriched_text = chunker.contextualize(chunk=chunk)\n meta = DocMeta.model_validate(chunk.meta)\n\n results.append(\n Data(\n data={\n \"text\": enriched_text,\n \"document_id\": f\"{doc.origin.binary_hash}\",\n \"doc_items\": json.dumps([item.self_ref for item in meta.doc_items]),\n }\n )\n )\n\n except Exception as e:\n msg = f\"Error splitting text: {e}\"\n raise TypeError(msg) from e\n\n return DataFrame(results)\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(c) [Normal] Dead code: _docs_to_data method in ChunkDoclingDocument

  • The component defines a helper method that is never called
  def _docs_to_data(self, docs) -> list[Data]:
      return [Data(text=doc.page_content, data=doc.metadata) for doc in docs]

"show": true,
"title_case": false,
"type": "code",
"value": "import json\n\nimport tiktoken\nfrom docling_core.transforms.chunker import BaseChunker, DocMeta\nfrom docling_core.transforms.chunker.hierarchical_chunker import HierarchicalChunker\n\nfrom lfx.base.data.docling_utils import extract_docling_documents\nfrom lfx.custom import Component\nfrom lfx.io import DropdownInput, HandleInput, IntInput, MessageTextInput, Output, StrInput\nfrom lfx.schema import Data, DataFrame\n\n\nclass ChunkDoclingDocumentComponent(Component):\n display_name: str = \"Chunk DoclingDocument\"\n description: str = \"Use the DocumentDocument chunkers to split the document into chunks.\"\n documentation = \"https://docling-project.github.io/docling/concepts/chunking/\"\n icon = \"Docling\"\n name = \"ChunkDoclingDocument\"\n\n inputs = [\n HandleInput(\n name=\"data_inputs\",\n display_name=\"Data or DataFrame\",\n info=\"The data with documents to split in chunks.\",\n input_types=[\"Data\", \"DataFrame\"],\n required=True,\n ),\n DropdownInput(\n name=\"chunker\",\n display_name=\"Chunker\",\n options=[\"HybridChunker\", \"HierarchicalChunker\"],\n info=(\"Which chunker to use.\"),\n value=\"HybridChunker\",\n real_time_refresh=True,\n ),\n DropdownInput(\n name=\"provider\",\n display_name=\"Provider\",\n options=[\"Hugging Face\", \"OpenAI\"],\n info=(\"Which tokenizer provider.\"),\n value=\"Hugging Face\",\n show=True,\n real_time_refresh=True,\n advanced=True,\n dynamic=True,\n ),\n StrInput(\n name=\"hf_model_name\",\n display_name=\"HF model name\",\n info=(\n \"Model name of the tokenizer to use with the HybridChunker when Hugging Face is chosen as a tokenizer.\"\n ),\n value=\"sentence-transformers/all-MiniLM-L6-v2\",\n show=True,\n advanced=True,\n dynamic=True,\n ),\n StrInput(\n name=\"openai_model_name\",\n display_name=\"OpenAI model name\",\n info=(\"Model name of the tokenizer to use with the HybridChunker when OpenAI is chosen as a tokenizer.\"),\n value=\"gpt-4o\",\n show=False,\n advanced=True,\n dynamic=True,\n ),\n IntInput(\n name=\"max_tokens\",\n display_name=\"Maximum tokens\",\n info=(\"Maximum number of tokens for the HybridChunker.\"),\n show=True,\n required=False,\n advanced=True,\n dynamic=True,\n ),\n MessageTextInput(\n name=\"doc_key\",\n display_name=\"Doc Key\",\n info=\"The key to use for the DoclingDocument column.\",\n value=\"doc\",\n advanced=True,\n ),\n ]\n\n outputs = [\n Output(display_name=\"DataFrame\", name=\"dataframe\", method=\"chunk_documents\"),\n ]\n\n def update_build_config(self, build_config: dict, field_value: str, field_name: str | None = None) -> dict:\n if field_name == \"chunker\":\n provider_type = build_config[\"provider\"][\"value\"]\n is_hf = provider_type == \"Hugging Face\"\n is_openai = provider_type == \"OpenAI\"\n if field_value == \"HybridChunker\":\n build_config[\"provider\"][\"show\"] = True\n build_config[\"hf_model_name\"][\"show\"] = is_hf\n build_config[\"openai_model_name\"][\"show\"] = is_openai\n build_config[\"max_tokens\"][\"show\"] = True\n else:\n build_config[\"provider\"][\"show\"] = False\n build_config[\"hf_model_name\"][\"show\"] = False\n build_config[\"openai_model_name\"][\"show\"] = False\n build_config[\"max_tokens\"][\"show\"] = False\n elif field_name == \"provider\" and build_config[\"chunker\"][\"value\"] == \"HybridChunker\":\n if field_value == \"Hugging Face\":\n build_config[\"hf_model_name\"][\"show\"] = True\n build_config[\"openai_model_name\"][\"show\"] = False\n elif field_value == \"OpenAI\":\n build_config[\"hf_model_name\"][\"show\"] = False\n build_config[\"openai_model_name\"][\"show\"] = True\n\n return build_config\n\n def _docs_to_data(self, docs) -> list[Data]:\n return [Data(text=doc.page_content, data=doc.metadata) for doc in docs]\n\n def chunk_documents(self) -> DataFrame:\n documents = extract_docling_documents(self.data_inputs, self.doc_key)\n\n chunker: BaseChunker\n if self.chunker == \"HybridChunker\":\n try:\n from docling_core.transforms.chunker.hybrid_chunker import HybridChunker\n except ImportError as e:\n msg = (\n \"HybridChunker is not installed. Please install it with `uv pip install docling-core[chunking] \"\n \"or `uv pip install transformers`\"\n )\n raise ImportError(msg) from e\n max_tokens: int | None = self.max_tokens if self.max_tokens else None\n if self.provider == \"Hugging Face\":\n try:\n from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer\n except ImportError as e:\n msg = (\n \"HuggingFaceTokenizer is not installed.\"\n \" Please install it with `uv pip install docling-core[chunking]`\"\n )\n raise ImportError(msg) from e\n tokenizer = HuggingFaceTokenizer.from_pretrained(\n model_name=self.hf_model_name,\n max_tokens=max_tokens,\n )\n elif self.provider == \"OpenAI\":\n try:\n from docling_core.transforms.chunker.tokenizer.openai import OpenAITokenizer\n except ImportError as e:\n msg = (\n \"OpenAITokenizer is not installed.\"\n \" Please install it with `uv pip install docling-core[chunking]`\"\n \" or `uv pip install transformers`\"\n )\n raise ImportError(msg) from e\n if max_tokens is None:\n max_tokens = 128 * 1024 # context window length required for OpenAI tokenizers\n tokenizer = OpenAITokenizer(\n tokenizer=tiktoken.encoding_for_model(self.openai_model_name), max_tokens=max_tokens\n )\n chunker = HybridChunker(\n tokenizer=tokenizer,\n )\n elif self.chunker == \"HierarchicalChunker\":\n chunker = HierarchicalChunker()\n\n results: list[Data] = []\n try:\n for doc in documents:\n for chunk in chunker.chunk(dl_doc=doc):\n enriched_text = chunker.contextualize(chunk=chunk)\n meta = DocMeta.model_validate(chunk.meta)\n\n results.append(\n Data(\n data={\n \"text\": enriched_text,\n \"document_id\": f\"{doc.origin.binary_hash}\",\n \"doc_items\": json.dumps([item.self_ref for item in meta.doc_items]),\n }\n )\n )\n\n except Exception as e:\n msg = f\"Error splitting text: {e}\"\n raise TypeError(msg) from e\n\n return DataFrame(results)\n"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(d) [Normal] Large value 128 * 1024 set as default max_tokens for OpenAI tokenizer

  • i.e.
  if max_tokens is None:
      max_tokens = 128 * 1024  # context window length required for OpenAI tokenizers
  • A 128k-token default for chunk max size is enormous and would result in very few or no splits for most documents.
  • Question: Is this value indeed required? (according to the comment)
  • Note: The HuggingFaceTokenizer default from sentence-transformers/all-MiniLM-L6-v2 is 256 tokens (which might be more appropriate for RAG)
  • Recommend verifying the value against what Docling's OpenAITokenizer actually requires vs. what makes sense for retrieval.

"_input_type": "SecretStrInput",
"advanced": false,
"display_name": "IBM watsonx.ai API Key",
"display_name": "OpenAI API Key",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(e) [Minor] DoclingRemote field label change are a breaking rename: "IBM watsonx.ai API Key" -> "OpenAI API Key"

  • Note: This isn't directly authored in this PR (it's an upstream lfx component update), but the effect for any existing WatsonX users of OpenRAG is that their configured field names would mismatch.
  • Recommend checking if FlowsService needs to handle this migration.

"advanced": true,
"display_name": "API Base URL",
"advanced": false,
"display_name": "OpenAI API Base URL",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(f) [Minor] DoclingRemote field label change are a breaking rename: "API Base URL" -> "OpenAI API Base URL"

  • Similar to (e)
  • Note: This isn't directly authored in this PR (it's an upstream lfx component update), but the effect for any existing WatsonX users of OpenRAG is that their configured field names would mismatch.
  • Recommend checking if FlowsService needs to handle this migration.

@ricofurtado
Copy link
Collaborator Author

This pull request was substituted by #1113

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants