feat: Implement auto-renaming for file uploads to prevent S3 filename conflicts #11

Open
abuzarmahmood wants to merge 1 commit into master from 8-auto-rename-if-filename-already-present

@abuzarmahmood
Owner

No description provided.

@abuzarmahmood abuzarmahmood linked an issue Feb 8, 2025 that may be closed by this pull request
@abuzarmahmood
Owner Author

Generated edit command:

  • Summary of User's Issues and Requests:
    The user has requested an enhancement to the file upload functionality in the ocr_database repository. Specifically, they want to implement an auto-renaming feature for files if the filename already exists in the S3 bucket before uploading. This involves modifying the upload logic to check for existing filenames and rename the file if necessary. Additionally, tests should be added to ensure this functionality works as expected.

  • Overview of Plan to Address the Issues:

    1. Modify the Upload.py file to include logic that checks for existing filenames in the S3 bucket and renames the file by appending a unique identifier if a conflict is detected.
    2. Update the s3_test.py file to include test cases that verify the auto-renaming functionality works correctly when a filename conflict occurs.
  • Specific Details of Changes to be Made:

    1. Modifications in Upload.py:

      • File Path: /home/exouser/Desktop/blech_github_bot/src/repos/abuzarmahmood/ocr_database/src/pages/Upload.py
      • Changes:
        • Add a function get_unique_filename to generate a unique filename if a conflict is detected.
        • Modify the file upload logic to use the unique filename.
      import os
      import time
      
      def get_unique_filename(s3, path, filename):
          """Return filename, or a renamed variant if it already exists under path."""
          base, ext = os.path.splitext(filename)
          counter = 1
          new_filename = filename
          # Keep generating candidates until one is not present in the bucket
          while s3.exists(os.path.join(path, new_filename)):
              new_filename = f"{base}_{int(time.time())}_{counter}{ext}"
              counter += 1
          return new_filename
      
      # Inside the form submission logic
      if submit_button:
          if uploaded_file is not None:
              # Get a unique filename
              unique_filename = get_unique_filename(s3, save_path, uploaded_file.name)
              
              # Use unique_filename instead of uploaded_file.name;
              # `i` is defined by the existing upload logic
              save_page_path = os.path.join(
                  save_path,
                  unique_filename.split('.')[0] + '_' + str(i) + '.pdf'
              )
              
              # Continue with the existing upload logic
    2. Modifications in s3_test.py:

      • File Path: /home/exouser/Desktop/blech_github_bot/src/repos/abuzarmahmood/ocr_database/src/_tests/s3_test.py
      • Changes:
        • Add a test case test_auto_rename_on_conflict to verify that files are renamed correctly when a filename conflict occurs.
      def test_auto_rename_on_conflict():
          # Setup: upload a file with a known name
          # (s3.put reads 'test_file.pdf' from the local working directory, so it must exist there)
          original_filename = 'test_file.pdf'
          s3.put(original_filename, f's3://ocr-database-s3/{original_filename}')
          
          # Attempt to upload the same file again under a conflict-free name
          new_filename = get_unique_filename(s3, 's3://ocr-database-s3', original_filename)
          s3.put(original_filename, f's3://ocr-database-s3/{new_filename}')
          
          # Assert: the new object exists and has a different name
          assert s3.exists(f's3://ocr-database-s3/{new_filename}')
          assert new_filename != original_filename
          
          # Cleanup: remove both test objects from the bucket
          s3.rm(f's3://ocr-database-s3/{original_filename}')
          s3.rm(f's3://ocr-database-s3/{new_filename}')

These changes should implement the auto-renaming feature and add a test that covers the filename-conflict case.
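The planned test talks to the real bucket, so it needs S3 credentials and a local `test_file.pdf`. As a complement, the renaming loop itself could be checked offline. The sketch below is one possible way to do that, assuming the helper has exactly the body shown in the plan. `check_two_collisions` and `fake_s3` are hypothetical names, not part of the repository, and the frozen timestamp is an arbitrary value.

```python
import os
import time
from unittest import mock


def get_unique_filename(s3, path, filename):
    """Same logic as the helper proposed for Upload.py in this PR."""
    base, ext = os.path.splitext(filename)
    counter = 1
    new_filename = filename
    while s3.exists(os.path.join(path, new_filename)):
        new_filename = f"{base}_{int(time.time())}_{counter}{ext}"
        counter += 1
    return new_filename


def check_two_collisions():
    # Hypothetical stand-in for the s3fs object: only exists() is used.
    # The first two lookups report a conflict, the third does not.
    fake_s3 = mock.Mock()
    fake_s3.exists.side_effect = [True, True, False]

    # Freeze the clock so the generated name is predictable (arbitrary value).
    with mock.patch("time.time", return_value=1700000000):
        name = get_unique_filename(fake_s3, "bucket/test", "test_file.pdf")

    assert name == "test_file_1700000000_2.pdf"
    assert fake_s3.exists.call_count == 3
    return name
```

Because nothing here touches the network, a check like this can run without `st.secrets` being configured. It does not replace the integration test against the bucket.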

Aider output:

Found ANTHROPIC_API_KEY so using sonnet since no --model was specified.
Aider v0.74.1
Main model: claude-3-5-sonnet-20241022 with diff edit format, infinite output
Weak model: claude-3-5-haiku-20241022
Git repo: .git with 9 files
Repo-map: using 4096 tokens, auto refresh


src/_tests/s3_test.py

src/pages/Upload.py
I'll help implement the auto-renaming feature using proper SEARCH/REPLACE blocks. Let's make the changes step by step:                                                                                                                           

 1. First, let's add the new imports and helper function to Upload.py:

src/pages/Upload.py                                                                                                                                                                                                                              

                                                                                                                                                                                                                                                 
 <<<<<<< SEARCH                                                                                                                                                                                                                                  
 import streamlit as st                                                                                                                                                                                                                          
 import pandas as pd                                                                                                                                                                                                                             
 from io import StringIO                                                                                                                                                                                                                         
 import os                                                                                                                                                                                                                                       
 import pandas as pd                                                                                                                                                                                                                             
 from datetime import datetime                                                                                                                                                                                                                   
 from PyPDF2 import PdfWriter, PdfReader                                                                                                                                                                                                         
 import s3fs                                                                                                                                                                                                                                     
 =======                                                                                                                                                                                                                                         
 import streamlit as st                                                                                                                                                                                                                          
 import pandas as pd                                                                                                                                                                                                                             
 from io import StringIO                                                                                                                                                                                                                         
 import os                                                                                                                                                                                                                                       
 import pandas as pd                                                                                                                                                                                                                             
 from datetime import datetime                                                                                                                                                                                                                   
 from PyPDF2 import PdfWriter, PdfReader                                                                                                                                                                                                         
 import s3fs                                                                                                                                                                                                                                     
 import time                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                 
 def get_unique_filename(s3, path, filename):                                                                                                                                                                                                    
     """Generate a unique filename if a conflict is detected"""                                                                                                                                                                                  
     base, ext = os.path.splitext(filename)                                                                                                                                                                                                      
     counter = 1                                                                                                                                                                                                                                 
     new_filename = filename                                                                                                                                                                                                                     
     while s3.exists(os.path.join(path, new_filename)):                                                                                                                                                                                          
         new_filename = f"{base}_{int(time.time())}_{counter}{ext}"                                                                                                                                                                              
         counter += 1                                                                                                                                                                                                                            
     return new_filename                                                                                                                                                                                                                         
 >>>>>>> REPLACE                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                 

 2. Now modify the file upload logic to use the unique filename:

src/pages/Upload.py                                                                                                                                                                                                                              

                                                                                                                                                                                                                                                 
 <<<<<<< SEARCH                                                                                                                                                                                                                                  
             save_page_path = os.path.join(                                                                                                                                                                                                      
                     save_path,                                                                                                                                                                                                                  
                     uploaded_file.name.split('.')[0] + '_' + str(i) + '.pdf'                                                                                                                                                                    
                     )                                                                                                                                                                                                                           
 =======                                                                                                                                                                                                                                         
             base_filename = get_unique_filename(s3, save_path, uploaded_file.name)                                                                                                                                                              
             save_page_path = os.path.join(                                                                                                                                                                                                      
                     save_path,                                                                                                                                                                                                                  
                     base_filename.split('.')[0] + '_' + str(i) + '.pdf'                                                                                                                                                                         
                     )                                                                                                                                                                                                                           
 >>>>>>> REPLACE                                                                                                                                                                                                                                 
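
Two details of the replacement above may be worth a second look. First, the existence check looks for `uploaded_file.name` under `save_path`, while the keys actually written have the form `<stem>_<i>.pdf`. As far as this diff shows, the check may therefore never see a conflict. Second, `split('.')[0]` cuts the name at the first dot, so a name such as `my.report.pdf` is reduced to `my`, and any rename suffix appended after that dot is lost as well. The sketch below illustrates one possible alternative, not the PR's implementation. `page_key_stem`, `stem_is_taken`, `get_unique_stem` and the `existing_names` argument are hypothetical. The names would have to come from some listing of `save_path`, which is left out here. A plain `_v<counter>` suffix replaces the PR's timestamp only to keep the output deterministic.

```python
import os
import re


def page_key_stem(filename):
    """Stem used for per-page keys; dots inside the name are kept."""
    return os.path.splitext(filename)[0]


def stem_is_taken(existing_names, stem):
    """True if any existing object name looks like '<stem>_<page>.pdf'."""
    pattern = re.compile(re.escape(stem) + r"_\d+\.pdf")
    return any(pattern.fullmatch(name) for name in existing_names)


def get_unique_stem(existing_names, filename):
    """Return a stem whose per-page keys do not collide with existing ones."""
    stem = page_key_stem(filename)
    candidate = stem
    counter = 1
    while stem_is_taken(existing_names, candidate):
        # Hypothetical naming scheme, chosen for determinism in this sketch
        candidate = f"{stem}_v{counter}"
        counter += 1
    return candidate


# Assumed listing of object basenames already present under save_path
existing = ["scan_0.pdf", "scan_1.pdf", "my.report_0.pdf"]

print(get_unique_stem(existing, "scan.pdf"))       # -> scan_v1
print(get_unique_stem(existing, "my.report.pdf"))  # -> my.report_v1
print(get_unique_stem(existing, "new.pdf"))        # -> new
print("my.report.pdf".split(".")[0])               # -> my
```

The per-page path would then be built from the returned stem plus `'_' + str(i) + '.pdf'`, as in the existing code.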
                                                                                                                                                                                                                                                 

 3. Let's update the test file with new test cases:

src/_tests/s3_test.py                                                                                                                                                                                                                            

                                                                                                                                                                                                                                                 
 <<<<<<< SEARCH                                                                                                                                                                                                                                  
 import s3fs                                                                                                                                                                                                                                     
 import os                                                                                                                                                                                                                                       
 import streamlit as st                                                                                                                                                                                                                          
 import pandas as pd                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                 
 s3 = s3fs.S3FileSystem(                                                                                                                                                                                                                         
         anon=False,                                                                                                                                                                                                                             
         key = st.secrets["S3_KEY"],                                                                                                                                                                                                             
         secret = st.secrets["S3_SECRET"]                                                                                                                                                                                                        
         )                                                                                                                                                                                                                                       
 s3.ls('s3://ocr-database-s3')                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                 
 wanted_files = [x for x in os.listdir('.') if x.endswith('.csv')][0]                                                                                                                                                                            
                                                                                                                                                                                                                                                 
 # Upload file                                                                                                                                                                                                                                   
 s3.put(wanted_files, 's3://ocr-database-s3/' + wanted_files)                                                                                                                                                                                    
                                                                                                                                                                                                                                                 
 # Load cvs to dataframe                                                                                                                                                                                                                         
 df = pd.read_csv(s3.open('s3://ocr-database-s3/' + wanted_files))                                                                                                                                                                               
 =======                                                                                                                                                                                                                                         
 import s3fs                                                                                                                                                                                                                                     
 import os                                                                                                                                                                                                                                       
 import streamlit as st                                                                                                                                                                                                                          
 import pandas as pd                                                                                                                                                                                                                             
 import pytest                                                                                                                                                                                                                                   
 import time                                                                                                                                                                                                                                     
 from pages.Upload import get_unique_filename                                                                                                                                                                                                    
                                                                                                                                                                                                                                                 
 s3 = s3fs.S3FileSystem(                                                                                                                                                                                                                         
         anon=False,                                                                                                                                                                                                                             
         key = st.secrets["S3_KEY"],                                                                                                                                                                                                             
         secret = st.secrets["S3_SECRET"]                                                                                                                                                                                                        
         )                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                 
 def test_get_unique_filename():                                                                                                                                                                                                                 
     # Setup test environment                                                                                                                                                                                                                    
     test_path = f's3://{st.secrets["S3_BUCKET_NAME"]}/test'                                                                                                                                                                                     
     test_filename = 'test_file.pdf'                                                                                                                                                                                                             
                                                                                                                                                                                                                                                 
     # Create a test file                                                                                                                                                                                                                        
     with s3.open(os.path.join(test_path, test_filename), 'wb') as f:                                                                                                                                                                            
         f.write(b'test content')                                                                                                                                                                                                                
                                                                                                                                                                                                                                                 
     # Test unique filename generation                                                                                                                                                                                                           
     unique_name = get_unique_filename(s3, test_path, test_filename)                                                                                                                                                                             
     assert unique_name != test_filename                                                                                                                                                                                                         
     assert unique_name.startswith('test_file_')                                                                                                                                                                                                 
     assert unique_name.endswith('.pdf')                                                                                                                                                                                                         
                                                                                                                                                                                                                                                 
     # Cleanup                                                                                                                                                                                                                                   
     s3.rm(os.path.join(test_path, test_filename))                                                                                                                                                                                               
                                                                                                                                                                                                                                                 
 def test_file_upload():                                                                                                                                                                                                                         
     wanted_files = [x for x in os.listdir('.') if x.endswith('.csv')][0]                                                                                                                                                                        
                                                                                                                                                                                                                                                 
     # Upload file                                                                                                                                                                                                                               
     s3.put(wanted_files, 's3://ocr-database-s3/' + wanted_files)                                                                                                                                                                                
                                                                                                                                                                                                                                                 
     # Load csv to dataframe                                                                                                                                                                                                                     
     df = pd.read_csv(s3.open('s3://ocr-database-s3/' + wanted_files))                                                                                                                                                                           
     assert not df.empty                                                                                                                                                                                                                         
 >>>>>>> REPLACE                                                                                                                                                                                                                                 
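
In `test_get_unique_filename` above, the cleanup step runs only if every assertion before it passes, so a failing run would leave `test_file.pdf` in the bucket. One way to make the cleanup unconditional is a small context manager. This is a minimal sketch under stated assumptions: `InMemoryFS` and `temporary_object` are hypothetical names. The fake's `pipe`, `exists` and `rm` methods are meant to mirror the fsspec-style methods of the same names, and whether the real `s3` object should be used this way is an assumption to verify.

```python
from contextlib import contextmanager


class InMemoryFS:
    """Hypothetical stand-in exposing only the methods this sketch needs."""

    def __init__(self):
        self.objects = {}

    def pipe(self, key, data):
        # Store bytes under a key
        self.objects[key] = data

    def exists(self, key):
        return key in self.objects

    def rm(self, key):
        del self.objects[key]


@contextmanager
def temporary_object(fs, key, data=b"test content"):
    """Create an object for the duration of a test and always remove it."""
    fs.pipe(key, data)
    try:
        yield key
    finally:
        # Runs even when an assertion inside the with-block fails
        if fs.exists(key):
            fs.rm(key)


fs = InMemoryFS()
try:
    with temporary_object(fs, "bucket/test/test_file.pdf") as key:
        assert fs.exists(key)
        raise AssertionError("simulated failing assertion")
except AssertionError:
    pass

# The object is gone although the body of the with-block failed
leftover = fs.exists("bucket/test/test_file.pdf")
```

A pytest fixture with a `yield` would achieve the same effect. The context manager is used here only to keep the sketch free of third-party imports.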
                                                                                                                                                                                                                                                 

These changes will:                                                                                                                                                                                                                              

 1. Add a function to generate unique filenames
 2. Modify the upload process to use unique filenames
 3. Add tests to verify the functionality

To run the tests, you can use:                                                                                                                                                                                                                   

                                                                                                                                                                                                                                                 
 python -m pytest src/_tests/s3_test.py -v                                                                                                                                                                                                       
                                                                                                                                                                                                                                                 

Tokens: 4.2k sent, 904 received. Cost: $0.03 message, $0.03 session.
Applied edit to src/_tests/s3_test.py
Applied edit to src/pages/Upload.py
Commit 9e99592 feat: Implement auto-renaming for file uploads to prevent S3 filename conflicts

This response was automatically generated by blech_bot

Development

Successfully merging this pull request may close these issues.

auto-rename if filename already present

1 participant