## Walkthrough

The recent changes enhance the codebase's functionality and clarity by introducing a structured approach to managing various source types and streamlining the document ingestion process. New functions for policy extraction and sitemap parsing have been added, while existing methods have been updated for improved flexibility. The integration of enums and modifications to parameter types contribute to better data integrity and extensibility, setting the stage for future enhancements.
## Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Ingest
    participant Extract
    participant Crawl
    User->>Ingest: Initiate policy ingestion
    Ingest->>Crawl: Retrieve policies via get_source_policy_list
    Crawl->>Ingest: Return list of PolicyDetails
    Ingest->>Extract: Extract text from policies
    Extract->>Ingest: Return extracted text
    Ingest->>User: Provide extracted policy data
```
Actionable comments posted: 4
Outside diff range, codebase verification and nitpick comments (10)
**background/crawl.py (1)**

`22-25`: **Add TODO comments for unimplemented source types.**

The function currently returns empty lists for `SITEMAP` and `RECURSIVE` types. Adding TODO comments will help track future implementation.

```python
if source.type == SourceType.SITEMAP:
    return []  # TODO: implement sitemap crawling
elif source.type == SourceType.RECURSIVE:
    return []  # TODO: implement recursive crawling
```

**background/extract.py (3)**
`108-112`: **Ensure metadata consistency.**

Ensure that the metadata (title and keywords) are consistently updated in the `policy` object even if they are empty.

```diff
- if title:
-     policy.title = title
- if keywords:
-     policy.keywords = keywords
+ policy.title = title if title else policy.title
+ policy.keywords = keywords if keywords else policy.keywords
```
Line range hint `118-140`: **Ensure proper handling of empty text extraction.**

The function should handle cases where text extraction from the PDF fails, even after attempting OCR.

```diff
+ if not text:
+     logger.error(f"Failed to extract text from {policy.url} even after OCR")
  return text
```
Line range hint `22-22`: **Remove unused import.**

The import `extract_text_from_pdf` is not used in this file and should be removed.

```diff
- from background.extract import extract_text_from_pdf, extract_text_from_policy_file
+ from background.extract import extract_text_from_policy_file
```

**background/update.py (3)**
`49-53`: **Remove outdated TODO comment.**

The TODO comment about removing policies from the database is outdated and should be removed or updated to reflect the current state of the implementation.

```diff
- ## TODO: eventually, we could add a check to see if the policy has been removed from the source, and if so, remove it from the db
```
Line range hint `56-87`: **Improve error handling for unrecognized sources.**

The function should handle cases where `get_source_policy_list` returns `None` more gracefully by providing a more informative error message and possibly retrying the operation.

```diff
  if policy_details is None:
-     logger.error(f"Source {source.name} not recognized")
+     logger.error(f"Source {source.name} not recognized or no policies found")
```
`83-84`: **Clarify comment on JSON storage.**

The comment about moving the JSON file to remote storage should state the future plan more explicitly.

```diff
- ## TODO: move to remote storage of JSON file
+ ## TODO: Plan to move JSON file to remote storage for better scalability and access
```

**background/ingest.py (3)**
Line range hint `92-123`: **Handle unsupported content types.**

The function should handle unsupported content types more gracefully by providing a more informative error message and possibly retrying the operation.

```diff
- if "Content-Type" in response.headers:
-     content_type = response.headers["Content-Type"]
-     if "application/pdf" in content_type:
-         file_type = "pdf"
+ content_type = response.headers.get("Content-Type", "")
+ if "application/pdf" in content_type:
+     file_type = "pdf"
+ elif "text/plain" not in content_type:
+     logger.error(f"Unsupported content type: {content_type}")
+     return None
```
Line range hint `138-185`: **Ensure proper handling of missing text extraction.**

The function should handle cases where text extraction fails more gracefully by providing a more informative error message and possibly retrying the operation.

```diff
- logger.warning(f"No text extracted from {local_policy_path}")
+ logger.warning(f"No text extracted from {local_policy_path}. Skipping policy.")
```

**Tools: Ruff**

`157-157`: f-string without any placeholders. Remove extraneous `f` prefix. (F541)
Line range hint `138-185`: **Fix variable scope issue.**

The variables `num_docs_indexed` and `num_new_docs` should be updated within the loop to reflect the correct counts.

```diff
  logger.info(f"Indexed {num_docs_indexed} documents from source {source.name}")
+ num_docs_indexed += 1
+ if not document:
+     num_new_docs += 1
```

**Tools: Ruff**

`157-157`: f-string without any placeholders. Remove extraneous `f` prefix. (F541)
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (6)
- background/crawl.py (2 hunks)
- background/db.py (1 hunks)
- background/extract.py (3 hunks)
- background/ingest.py (5 hunks)
- background/sources/sitemap.py (1 hunks)
- background/update.py (6 hunks)
Additional context used
Ruff
background/ingest.py
`22-22`: `background.extract.extract_text_from_pdf` imported but unused. Remove unused import: `background.extract.extract_text_from_pdf` (F401)
Additional comments not posted (6)
background/db.py (2)
`42-46`: **LGTM! Well-defined `SourceType` enum.**

The enum clearly categorizes different source types, improving the structure and clarity of the code.
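For reference, a minimal sketch of what such an enum might look like. Only `SITEMAP` and `RECURSIVE` are mentioned elsewhere in this review; the `CUSTOM` member and the string values are assumptions:

```python
from enum import Enum

# Hypothetical sketch of the SourceType enum the review praises;
# the real definition in background/db.py may differ.
class SourceType(str, Enum):
    SITEMAP = "sitemap"      # parse a sitemap.xml for policy URLs
    RECURSIVE = "recursive"  # crawl pages recursively (not yet implemented)
    CUSTOM = "custom"        # source-specific handling
```

Subclassing `str` lets members serialize naturally and compare against raw strings coming from the database.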
`52-52`: **LGTM! Addition of `type` field in `Source` class.**

The `type` field enforces the use of the `SourceType` enum, enhancing data integrity and clarity.

**background/sources/sitemap.py (1)**
`73-76`: **LGTM! `_ensure_absolute_url` function is well-implemented.**

The function correctly ensures that a URL is absolute, handling relative URLs appropriately.
background/crawl.py (1)
Line range hint `33-47`: **LGTM! `get_custom_policies` function is well-implemented.**

The function correctly handles specific sources that require unique processing and delegates to the appropriate functions.
background/update.py (2)
`216-216`: **Ensure consistency in `Source` object initialization.**

Ensure that the `type` attribute is consistently initialized across all instances of the `Source` object.
`197-197`: **Ensure consistency in `Source` object initialization.**

Ensure that the `type` attribute is consistently initialized across all instances of the `Source` object.
```python
response.raise_for_status()

if response is None:
    logger.error(f"Failed to fetch sitemap {sitemap_url}")
    return []
```
**Handle potential `None` response before calling `raise_for_status`.**

The `response.raise_for_status()` call should be placed after checking whether the response is `None`; otherwise a failed fetch raises an `AttributeError` before the guard can run.

```diff
- response.raise_for_status()
-
  if response is None:
      logger.error(f"Failed to fetch sitemap {sitemap_url}")
      return []
+ response.raise_for_status()
```
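A stripped-down illustration of why the ordering matters — `FakeResponse` and `check_response` are generic stand-ins, not names from the PR:

```python
# Attribute access on None raises AttributeError, so the None guard
# must come before any call on the response object.
class FakeResponse:
    def __init__(self, status: int):
        self.status = status

    def raise_for_status(self):
        if self.status >= 400:
            raise RuntimeError(f"HTTP {self.status}")

def check_response(response):
    # Correct order: guard against None first, then validate status.
    if response is None:
        return []
    response.raise_for_status()
    return [response]
```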
```python
def extract_text_from_text_file(input_path: str, policy: PolicyDetails) -> str:
    """
    Extract text from a text file. Sounds simple.
    But we do want to check if it looks like HTML, and if so we should attempt to extract metadata too (like title)
    """

    try:
        with open(input_path, "r") as file:
            file_contents = file.read()
    except Exception as e:
        logger.error(f"Error extracting text from {policy.url}: {e}")
        return ""

    # Create a BeautifulSoup object
    soup = BeautifulSoup(file_contents, "lxml")

    # If the document does not have an 'html' tag, it is not a webpage
    if not soup.find("html"):
        return file_contents

    # Extract title (og:title or title tag)
    title = None
    og_title = soup.find("meta", property="og:title")
    if og_title and og_title.get("content"):
        title = og_title["content"]
    else:
        title_tag = soup.find("title")
        if title_tag:
            title = title_tag.string

    # Extract keywords
    keywords = []
    meta_keywords = soup.find("meta", attrs={"name": "keywords"})
    if meta_keywords and meta_keywords.get("content"):
        keywords_content = meta_keywords["content"]
        keywords = [keyword.strip() for keyword in keywords_content.split(",")]

    # Extract content from <main> or <body>
    content = ""
    main_content = soup.find("main")
    if main_content:
        content = main_content.get_text(separator="\n").strip()
    else:
        body_content = soup.find("body")
        if body_content:
            content = body_content.get_text(separator="\n").strip()

    # modify the policy if we have a title or keywords
    if title:
        policy.title = title
    if keywords:
        policy.keywords = keywords

    # return the content
    return content
```
**Improve error handling for file reading.**

The function should handle specific exceptions related to file reading, such as `FileNotFoundError` and `IOError`, to provide more precise error messages.

```diff
- except Exception as e:
+ except (FileNotFoundError, IOError) as e:
```
```python
# Create a BeautifulSoup object
soup = BeautifulSoup(file_contents, "lxml")

# If the document does not have an 'html' tag, it is not a webpage
if not soup.find("html"):
    return file_contents
```
**Optimize HTML check.**

The current check can be optimized by looking for an `html` tag directly in the file contents before constructing a BeautifulSoup object, avoiding an unnecessary parse of plain-text files.

```diff
- # Create a BeautifulSoup object
- soup = BeautifulSoup(file_contents, "lxml")
- # If the document does not have an 'html' tag, it is not a webpage
- if not soup.find("html"):
+ # If the document does not contain an 'html' tag, it is not a webpage
+ if "<html" not in file_contents.lower():
      return file_contents
+
+ soup = BeautifulSoup(file_contents, "lxml")
```
```python
def extract_text_from_policy_file(input_path: str, policy: PolicyDetails) -> str:
    """
    Determine which extractor to use based on the file extension.
    """
    ext = os.path.splitext(input_path)[1].lower()

    if ext == ".pdf":
        return extract_text_from_pdf(input_path, policy)
    else:
        return extract_text_from_text_file(input_path, policy)
```
**Ensure proper handling of unsupported file types.**

The function currently handles only PDF and text files. Consider adding a default case to handle unsupported file types gracefully.

```diff
- else:
+ elif ext == ".txt":
      return extract_text_from_text_file(input_path, policy)
+ else:
+     logger.error(f"Unsupported file type: {ext}")
+     return ""
```

Committable suggestion was skipped due to low confidence.
Actionable comments posted: 2
Outside diff range, codebase verification and nitpick comments (2)
background/ingest.py (2)
Line range hint `92-123`: **Enhance docstring and improve error handling.**

The docstring should be more descriptive, and error handling for the file download should be improved.

```python
def download_policy(url: str, dir: str) -> str:
    """
    Download a policy from the given URL and save it to the specified directory.

    Determines if the file is a PDF or text file based on the Content-Type header
    and adds the appropriate extension.

    Args:
        url (str): The URL of the policy to download.
        dir (str): The directory to save the downloaded policy.

    Returns:
        str: The file path of the downloaded policy.
    """
    headers = {"User-Agent": user_agent}

    response = request_with_retry(
        url, headers=headers, allow_redirects=True, timeout=60
    )

    if not response:
        logger.error(f"Failed to download {url}")
        return None

    try:
        response.raise_for_status()
    except requests.exceptions.HTTPError as e:
        logger.error(f"HTTP error occurred: {e}")
        return None

    file_type = "txt"  # default to text

    # check if the response is a PDF
    if "Content-Type" in response.headers:
        content_type = response.headers["Content-Type"]
        if "application/pdf" in content_type:
            file_type = "pdf"

    unique_filename = f"{uuid.uuid4()}.{file_type}"
    file_path = os.path.join(dir, unique_filename)

    with open(file_path, "wb") as file:
        file.write(response.content)

    return file_path
```
Line range hint `138-197`: **Enhance logging and error handling.**

Improve logging for better traceability and add error handling for potential issues during the ingestion process.

```python
def ingest_policies(source: Source, policies: List[PolicyDetails]) -> IngestResult:
    start_time = datetime.now(timezone.utc)
    num_docs_indexed = 0
    num_new_docs = 0

    with tempfile.TemporaryDirectory() as temp_dir:
        for policy in policies:
            if not policy:
                logger.warning(f"Policy is None, skipping")
                continue

            logger.info(f"Processing policy {policy.url}")
            log_memory_usage(logger)

            try:
                local_policy_path = download_policy(policy.url, temp_dir)
                if not local_policy_path:
                    logger.error(f"Failed to download policy at {policy.url}. Skipping.")
                    continue

                policy_file_hash = calculate_file_hash(local_policy_path)
                document = get_document_by_url(policy.url)

                if document and document.metadata.get("hash") == policy_file_hash:
                    logger.info(f"Document {policy.url} has not changed, skipping")
                    wait_before_next_request()
                    continue

                extracted_text = extract_text_from_policy_file(local_policy_path, policy)
                if not extracted_text:
                    logger.warning(f"No text extracted from {local_policy_path}")
                    continue

                vectorized_document = policy.to_vectorized_document(extracted_text)
                vectorized_document.metadata.hash = policy_file_hash
                vectorized_document.metadata.content_length = len(extracted_text)
                vectorized_document.metadata.scope = source.name

                num_new_docs += 1 if not document else 0
                num_docs_indexed += 1

                result = vectorize_text(vectorized_document)
                update_document(source, policy, document, vectorized_document, result)
            except Exception as e:
                logger.error(f"Error processing policy {policy.url}: {e}")

    logger.info(f"Indexed {num_docs_indexed} documents from source {source.name}")

    end_time = datetime.now(timezone.utc)

    ## TODO: somewhere remove old documents that are no longer in the source

    return IngestResult(
        num_docs_indexed=num_docs_indexed,
        num_new_docs=num_new_docs,
        source_id=source._id,
        start_time=start_time,
        end_time=end_time,
        duration=(end_time - start_time).total_seconds(),
    )
```

**Tools: Ruff**

`157-157`: f-string without any placeholders. Remove extraneous `f` prefix. (F541)
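The skip-if-unchanged step above hinges on a stable content hash. The project's `calculate_file_hash` is not shown in this review; a minimal stdlib sketch of such a helper (algorithm and chunk size are assumptions) could be:

```python
import hashlib

# Illustrative SHA-256 file hasher; the real calculate_file_hash may
# differ in algorithm or chunking. Reading in chunks keeps large PDFs
# from being loaded fully into memory.
def calculate_file_hash(path: str, chunk_size: int = 65536) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Re-ingestion can then compare this value against the stored `document.metadata["hash"]` and skip unchanged policies.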
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (3)
- background/extract.py (3 hunks)
- background/ingest.py (7 hunks)
- background/update.py (7 hunks)
Files skipped from review as they are similar to previous changes (2)
- background/extract.py
- background/update.py
Additional context used
Ruff
background/ingest.py
`22-22`: `background.extract.extract_text_from_pdf` imported but unused. Remove unused import: `background.extract.extract_text_from_pdf` (F401)
Additional comments not posted (1)
background/ingest.py (1)
Line range hint `198-231`: **LGTM!**

The function `update_document` handles the update process correctly.
**background/ingest.py** (outdated)
```diff
  import requests
- from background.extract import extract_text_from_pdf
+ from background.extract import extract_text_from_pdf, extract_text_from_policy_file
```
**Remove unused import.**

The import `extract_text_from_pdf` is not used in the code and should be removed to clean up the code.

```diff
- from background.extract import extract_text_from_pdf, extract_text_from_policy_file
+ from background.extract import extract_text_from_policy_file
```
**Tools: Ruff**

`22-22`: `background.extract.extract_text_from_pdf` imported but unused. Remove unused import: `background.extract.extract_text_from_pdf` (F401)
```python
# if we haven't seen this document before, increment the count
num_new_docs += 1 if not document else 0
num_docs_indexed += 1  # record the indexing either way

result = vectorize_text(vectorized_document)

update_document(
    source,
    num_docs_indexed,
    num_new_docs,
    policy,
    document,
    vectorized_document,
```
**Enhance logging and error handling.**

Improve logging for better traceability and add error handling for potential issues during the ingestion process.
```python
def ingest_kb_documents(
    source: Source, policy_details_with_text: List[Tuple[PolicyDetails, str]]
) -> IngestResult:
    # KB is a special case, we already have the content
    # eventually it'd be nice to either scrape the site or get API access instead
    start_time = datetime.now(timezone.utc)
    num_docs_indexed = 0
    num_new_docs = 0

    for policy, text in policy_details_with_text:
        if not policy:
            logger.warning(f"Policy is None, skipping")
            continue

        logger.info(f"Processing document {policy.url}")
        log_memory_usage(logger)

        try:
            hash = hashlib.sha256(text.encode()).hexdigest()
            document = get_document_by_url(policy.url)

            if document and document.metadata.get("hash") == hash:
                logger.info(f"Document {policy.url} has not changed, skipping")
                continue

            if not text:
                logger.warning(f"No text extracted from {policy.url}")
                continue

            vectorized_document = policy.to_vectorized_document(text)
            vectorized_document.metadata.hash = hash
            vectorized_document.metadata.content_length = len(text)
            vectorized_document.metadata.scope = source.name

            num_new_docs += 1 if not document else 0
            num_docs_indexed += 1

            result = vectorize_text(vectorized_document)
            update_document(source, policy, document, vectorized_document, result)
        except Exception as e:
            logger.error(f"Error processing document {policy.url}: {e}")

    logger.info(f"Indexed {num_docs_indexed} documents from source {source.name}")

    end_time = datetime.now(timezone.utc)

    ## TODO: somewhere remove old documents that are no longer in the source

    return IngestResult(
        num_docs_indexed=num_docs_indexed,
        num_new_docs=num_new_docs,
        source_id=source._id,
        start_time=start_time,
        end_time=end_time,
        duration=(end_time - start_time).total_seconds(),
    )
```
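The try/except-per-item pattern above is what keeps one bad document from aborting the whole run. A stripped-down, stdlib-only illustration of that design choice — the names here are generic, not from the PR:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ingest")

# Each item is processed in its own try/except, so a single failure
# is logged and counted instead of aborting the batch.
def process_batch(items, handler):
    succeeded, failed = 0, 0
    for item in items:
        try:
            handler(item)
            succeeded += 1
        except Exception as e:
            failed += 1
            logger.error(f"Error processing {item!r}: {e}")
    logger.info(f"Processed {succeeded} items, {failed} failures")
    return succeeded, failed
```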
Actionable comments posted: 0
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (5)
- background/crawl.py (2 hunks)
- background/extract.py (3 hunks)
- background/ingest.py (7 hunks)
- background/sources/sitemap.py (1 hunks)
- requirements.txt (1 hunks)
Files skipped from review as they are similar to previous changes (4)
- background/crawl.py
- background/extract.py
- background/ingest.py
- background/sources/sitemap.py
Additional comments not posted (1)
requirements.txt (1)
`10-10`: **Acknowledged: addition of `lxml==5.2.2`.**

The addition of `lxml` is appropriate for XML processing tasks, which aligns with the PR's focus on sitemap ingestion.
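For context, sitemap parsing largely reduces to namespace-aware extraction of `<loc>` elements. The sketch below uses the stdlib `xml.etree.ElementTree` (the `lxml.etree` API is nearly identical); the project's actual parser in `background/sources/sitemap.py` is not shown here, so the function name and shape are assumptions:

```python
import xml.etree.ElementTree as ET

# The sitemap protocol fixes this namespace URI.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical parser sketch; a production version would likely add
# retries, absolute-URL normalization, and nested-sitemap handling.
def parse_sitemap_urls(xml_text: str) -> list[str]:
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.iter(f"{{{SITEMAP_NS}}}loc")
        if loc.text
    ]
```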
## Summary by CodeRabbit

- **New Features**
- **Bug Fixes**
- **Documentation**
- **Refactor**