8 changes: 5 additions & 3 deletions docs/docs/integrations/document_loaders/pebblo.ipynb
@@ -6,17 +6,19 @@
"source": [
"# Pebblo Safe DocumentLoader\n",
"\n",
- "> [Pebblo](https://github.com/daxa-ai/pebblo) enables developers to safely load data and promote their Gen AI app to deployment without worrying about the organization’s compliance and security requirements. The project identifies semantic topics and entities found in the loaded data and summarizes them on the UI or a PDF report.\n",
+ "> [Pebblo](https://daxa-ai.github.io/pebblo/) enables developers to safely load data and promote their Gen AI app to deployment without worrying about the organization’s compliance and security requirements. The project identifies semantic topics and entities found in the loaded data and summarizes them on the UI or a PDF report.\n",
"\n",
"Pebblo has two components.\n",
"\n",
"1. Pebblo Safe DocumentLoader for Langchain\n",
- "1. Pebblo Daemon\n",
+ "1. Pebblo Server\n",
"\n",
- "This document describes how to augment your existing Langchain DocumentLoader with Pebblo Safe DocumentLoader to get deep data visibility on the types of Topics and Entities ingested into the Gen-AI Langchain application. For details on `Pebblo Daemon` see this [pebblo daemon](https://daxa-ai.github.io/pebblo-docs/daemon.html) document.\n",
+ "This document describes how to augment your existing Langchain DocumentLoader with Pebblo Safe DocumentLoader to get deep data visibility on the types of Topics and Entities ingested into the Gen-AI Langchain application. For details on `Pebblo Server` see this [pebblo server](https://daxa-ai.github.io/pebblo/daemon) document.\n",
"\n",
"Pebblo Safeloader enables safe data ingestion for Langchain `DocumentLoader`. This is done by wrapping the document loader call with `Pebblo Safe DocumentLoader`.\n",
"\n",
+ "Note: To point the loader at a Pebblo server running at a URL other than the default (localhost:8000), set the `PEBBLO_CLASSIFIER_URL` environment variable to the correct URL. The URL can also be passed through the `classifier_url` keyword argument. Ref: [server-configurations](https://daxa-ai.github.io/pebblo/config)\n",
+ "\n",
"#### How to Pebblo enable Document Loading?\n",
"\n",
"Assume a Langchain RAG application snippet using `CSVLoader` to read a CSV document for inference.\n",
6 changes: 4 additions & 2 deletions libs/community/langchain_community/document_loaders/pebblo.py
@@ -45,6 +45,7 @@ def __init__(
description: str = "",
api_key: Optional[str] = None,
load_semantic: bool = False,
+ classifier_url: Optional[str] = None,
):
if not name or not isinstance(name, str):
raise NameError("Must specify a valid name.")
@@ -63,6 +64,7 @@ def __init__(
self.source_type = get_loader_type(loader_name)
self.source_path_size = self.get_source_size(self.source_path)
self.source_aggregate_size = 0
+ self.classifier_url = classifier_url or CLASSIFIER_URL
self.loader_details = {
"loader": loader_name,
"source_path": self.source_path,
@@ -210,7 +212,7 @@ def _classify_doc(self, loaded_docs: list, loading_end: bool = False) -> list:
self.source_aggregate_size
)
payload = Doc(**payload).dict(exclude_unset=True)
- load_doc_url = f"{CLASSIFIER_URL}{LOADER_DOC_URL}"
+ load_doc_url = f"{self.classifier_url}{LOADER_DOC_URL}"
classified_docs = []
try:
pebblo_resp = requests.post(
@@ -296,7 +298,7 @@ def _send_discover(self) -> None:
"Content-Type": "application/json",
}
payload = self.app.dict(exclude_unset=True)
- app_discover_url = f"{CLASSIFIER_URL}{APP_DISCOVER_URL}"
+ app_discover_url = f"{self.classifier_url}{APP_DISCOVER_URL}"
try:
pebblo_resp = requests.post(
app_discover_url, headers=headers, json=payload, timeout=20
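The change in `pebblo.py` makes the server URL a per-loader setting: an explicit `classifier_url` argument wins, otherwise the module-level `CLASSIFIER_URL` is used, and both request URLs are built from the resolved value. The standalone sketch below illustrates that precedence only. The helper names `resolve_classifier_url` and `endpoint` are hypothetical, and the `http://` scheme and the two path constants are assumed values, since the diff shows the constant names but not their definitions. It also assumes that `CLASSIFIER_URL` is derived from the `PEBBLO_CLASSIFIER_URL` environment variable, as the docs note describes.

```python
import os
from typing import Mapping, Optional

# Assumed values for illustration only; the diff shows these names
# but not how they are defined.
DEFAULT_CLASSIFIER_URL = "http://localhost:8000"
LOADER_DOC_URL = "/v1/loader/doc"
APP_DISCOVER_URL = "/v1/app/discover"


def resolve_classifier_url(
    classifier_url: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper mirroring `classifier_url or CLASSIFIER_URL`."""
    env = os.environ if env is None else env
    # Stand-in for the module-level CLASSIFIER_URL constant (assumed to
    # come from PEBBLO_CLASSIFIER_URL with a localhost default).
    module_default = env.get("PEBBLO_CLASSIFIER_URL", DEFAULT_CLASSIFIER_URL)
    # `or` falls back for both None and the empty string.
    return classifier_url or module_default


def endpoint(base_url: str, path: str) -> str:
    """Join the base URL and a path the same way the diff's f-strings do."""
    return f"{base_url}{path}"


if __name__ == "__main__":
    base = resolve_classifier_url("http://pebblo.internal:9000")
    print(endpoint(base, LOADER_DOC_URL))
    print(endpoint(resolve_classifier_url(), APP_DISCOVER_URL))
```

Because the fallback uses `or`, an empty string passed as `classifier_url` behaves like `None`. Note also that the f-string concatenation does not normalize slashes, so a base URL with a trailing `/` would produce a double slash in the request URL.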