Skip to content

Conversation

@neliebi
Copy link
Collaborator

@neliebi neliebi commented Nov 14, 2025

No description provided.

Copilot AI review requested due to automatic review settings November 14, 2025 12:47
@neliebi neliebi merged commit 913548f into main Nov 14, 2025
7 checks passed
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds batch processing functionality to split EUPMC link generation into multiple XML files, each containing a configurable number of links (5000 by default). This enables more manageable file sizes and incremental uploads to the FTP server.

Key Changes:

  • Introduced BATCH_SIZE constant and split_into_batches() function to divide data into chunks
  • Modified main workflow to generate and upload multiple XML files (one per batch) instead of a single file
  • Removed duplicate import os statement and simplified some docstrings

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

return


def split_into_batches(data: List[Dict[str, str]], batch_size: int):
Copy link

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is missing a return type hint. For consistency with other functions in this file (which use type hints), consider adding -> Generator[List[Dict[str, str]], None, None] as the return type. You'll also need to import Generator from the typing module.

Copilot uses AI. Check for mistakes.

input_tsv = "/nfs/ftp/public/databases/em_ebi/emdb_related/emicss/resources/emdb_pubmed.tsv"
output_xml = "/hps/nobackup/gerard/emdb/annotations/output/EMDB_linkFile_providerID_2057.xml"
output_dir = "/hps/nobackup/gerard/emdb/annotations/output/EPMC"
Copy link

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output directory is not validated or created before use. If the directory doesn't exist, the script will fail when attempting to write the first batch XML file. Consider adding directory creation logic (e.g., os.makedirs(output_dir, exist_ok=True)) before the batch processing loop.

Suggested change
output_dir = "/hps/nobackup/gerard/emdb/annotations/output/EPMC"
output_dir = "/hps/nobackup/gerard/emdb/annotations/output/EPMC"
os.makedirs(output_dir, exist_ok=True)

Copilot uses AI. Check for mistakes.
@@ -49,12 +50,10 @@ def build_xml(data: List[Dict[str, str]]) -> ET.Element:


def prettify_xml(elem: ET.Element) -> str:
Copy link

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring was removed from this function. While the function is simple, it's a best practice to maintain docstrings for all functions, especially in a codebase that previously had them. Consider adding a brief docstring like """Return a pretty-printed XML string for the Element."""

Suggested change
def prettify_xml(elem: ET.Element) -> str:
def prettify_xml(elem: ET.Element) -> str:
"""Return a pretty-printed XML string for the Element."""

Copilot uses AI. Check for mistakes.
return minidom.parseString(ET.tostring(elem, 'utf-8')).toprettyxml(indent=" ")


def write_xml(xml_root: ET.Element, output_file: str) -> None:
Copy link

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docstring was removed from this function. Consider restoring it or adding a brief one like """Write the XML tree to a file with pretty formatting.""" to maintain consistency with the codebase's documentation standards.

Suggested change
def write_xml(xml_root: ET.Element, output_file: str) -> None:
def write_xml(xml_root: ET.Element, output_file: str) -> None:
"""Write the XML tree to a file with pretty formatting."""

Copilot uses AI. Check for mistakes.
@@ -69,17 +68,6 @@ def upload_file_via_ftp(
remote_dir: str = ".",
remote_filename: str = None
) -> None:
Copy link

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comprehensive docstring was removed from this function. Given that this function has multiple parameters with specific purposes and interacts with external FTP servers, it would be beneficial to restore the docstring to help future maintainers understand the function's purpose, parameters, and behavior.

Suggested change
) -> None:
) -> None:
"""
Upload a local file to a remote FTP server.
Connects to the specified FTP server using the provided credentials,
changes to the specified remote directory, and uploads the given local file.
Parameters:
server (str): The FTP server address.
username (str): Username for FTP authentication.
password (str): Password for FTP authentication.
local_file_path (str): Path to the local file to upload.
remote_dir (str, optional): Remote directory to upload the file to. Defaults to ".".
remote_filename (str, optional): Name to use for the file on the server. Defaults to the local file's basename.
Returns:
None
Logs errors if the local file does not exist or if the upload fails.
"""

Copilot uses AI. Check for mistakes.
Comment on lines +122 to 142

xml_root = build_xml(batch)

output_file = os.path.join(
output_dir,
f"EMDB_linkFile_providerID_2057_part{part}.xml"
)

write_xml(xml_root, output_file)

upload_file_via_ftp(
server=ftp_server,
username=ftp_user,
password=ftp_pass,
local_file_path=output_file,
remote_dir=ftp_dir
)

logging.info(f"Batch {part} completed.")
part += 1

Copy link

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The batch processing loop lacks error handling. If an exception occurs during XML generation, file writing, or FTP upload for any batch, the entire script will fail and subsequent batches won't be processed. Consider adding try-except blocks around the batch processing operations to log failures and continue with remaining batches, or at minimum, provide clear error context about which batch failed.

Suggested change
xml_root = build_xml(batch)
output_file = os.path.join(
output_dir,
f"EMDB_linkFile_providerID_2057_part{part}.xml"
)
write_xml(xml_root, output_file)
upload_file_via_ftp(
server=ftp_server,
username=ftp_user,
password=ftp_pass,
local_file_path=output_file,
remote_dir=ftp_dir
)
logging.info(f"Batch {part} completed.")
part += 1
try:
xml_root = build_xml(batch)
output_file = os.path.join(
output_dir,
f"EMDB_linkFile_providerID_2057_part{part}.xml"
)
write_xml(xml_root, output_file)
upload_file_via_ftp(
server=ftp_server,
username=ftp_user,
password=ftp_pass,
local_file_path=output_file,
remote_dir=ftp_dir
)
logging.info(f"Batch {part} completed.")
except Exception as e:
logging.error(f"Error processing batch {part}: {e}", exc_info=True)
part += 1

Copilot uses AI. Check for mistakes.
Comment on lines +132 to +138
upload_file_via_ftp(
server=ftp_server,
username=ftp_user,
password=ftp_pass,
local_file_path=output_file,
remote_dir=ftp_dir
)
Copy link

Copilot AI Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The FTP connection is established and torn down for each batch. For better performance, consider establishing a single FTP connection before the loop and reusing it for all uploads, then closing it after all batches are processed. This would significantly reduce connection overhead, especially when processing many batches.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant