Skip to content

feat: Add AWS backend#43

Open
clemlesne wants to merge 5 commits intodevelopfrom
feat/aws-peristence
Open

feat: Add AWS backend#43
clemlesne wants to merge 5 commits intodevelopfrom
feat/aws-peristence

Conversation

@clemlesne
Copy link
Owner

No description provided.

@clemlesne clemlesne added the enhancement New feature or request label Jan 15, 2025
@clemlesne clemlesne requested a review from Copilot January 15, 2025 09:52
@clemlesne clemlesne self-assigned this Jan 15, 2025
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 8 out of 23 changed files in this pull request and generated 1 comment.

Files not reviewed (15)
  • .env.example: Language not supported
  • .vscode/settings.json: Language not supported
  • Makefile: Language not supported
  • cicd/test-unit-ci.sh: Language not supported
  • .github/workflows/pipeline.yaml: Evaluated as low risk
  • src/scrape_it_now/models/message.py: Evaluated as low risk
  • src/scrape_it_now/index.py: Evaluated as low risk
  • src/scrape_it_now/persistence/local_disk.py: Evaluated as low risk
  • src/scrape_it_now/persistence/azure_queue_storage.py: Evaluated as low risk
  • src/scrape_it_now/scrape.py: Evaluated as low risk
  • src/scrape_it_now/helpers/persistence.py: Evaluated as low risk
  • tests/blob.py: Evaluated as low risk
  • src/scrape_it_now/cli.py: Evaluated as low risk
  • src/scrape_it_now/persistence/iblob.py: Evaluated as low risk
  • src/scrape_it_now/persistence/iqueue.py: Evaluated as low risk
Comments suppressed due to low confidence (2)

src/scrape_it_now/persistence/aws_s3.py:286

  • The __aenter__ method tries to create a bucket using get_session().create_client(**client_kwargs).create_bucket(Bucket=self._config.name), which might lead to race conditions or unnecessary bucket creation attempts. Consider handling bucket creation separately or ensuring it is only done when necessary.
get_session().create_client(**client_kwargs).create_bucket(Bucket=self._config.name)

src/scrape_it_now/persistence/aws_s3.py:80

  • The lease_blob method has a recursive call to itself which might lead to potential issues if not handled carefully. Consider refactoring to avoid recursion or ensure proper handling of recursive calls.
async with self.lease_blob(blob=blob, lease_duration=lease_duration) as lease_id:

@clemlesne clemlesne requested a review from Copilot January 15, 2025 09:57
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 8 out of 23 changed files in this pull request and generated 3 comments.

Files not reviewed (15)
  • .env.example: Language not supported
  • .vscode/settings.json: Language not supported
  • Makefile: Language not supported
  • cicd/test-unit-ci.sh: Language not supported
  • .github/workflows/pipeline.yaml: Evaluated as low risk
  • src/scrape_it_now/index.py: Evaluated as low risk
  • src/scrape_it_now/models/message.py: Evaluated as low risk
  • src/scrape_it_now/scrape.py: Evaluated as low risk
  • src/scrape_it_now/persistence/azure_queue_storage.py: Evaluated as low risk
  • src/scrape_it_now/persistence/local_disk.py: Evaluated as low risk
  • src/scrape_it_now/helpers/persistence.py: Evaluated as low risk
  • tests/blob.py: Evaluated as low risk
  • src/scrape_it_now/persistence/iqueue.py: Evaluated as low risk
  • pyproject.toml: Evaluated as low risk
  • src/scrape_it_now/persistence/iblob.py: Evaluated as low risk
Comments suppressed due to low confidence (4)

src/scrape_it_now/persistence/aws_s3.py:114

  • [nitpick] The argument name 'lease_id' could be more descriptive, such as 'existing_lease_id', to clarify its purpose.
lease_id: str | None = None

src/scrape_it_now/persistence/aws_s3.py:78

  • The retry mechanism in the 'lease_blob' method should be covered by tests to ensure it works as expected under different scenarios.
await asyncio.sleep(0.1)

src/scrape_it_now/persistence/aws_s3.py:224

  • The timeout handling in the '_file_lock' method should be covered by tests to ensure it works as expected under different scenarios.
seconds=timeout

src/scrape_it_now/persistence/aws_sqs.py:65

  • [nitpick] The variable name 'pulled_now' is ambiguous. It should be renamed to 'messages_pulled_now'.
pulled_now = min(max_messages - total_pulled, max_per_batch)

@click.option(
"--aws-sqs-region",
"-asr",
default="eu-west-1",
Copy link

Copilot AI Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value for 'aws_sqs_region' might lead to unintended behavior if the user is not aware of it. Consider removing the default value to ensure the user explicitly sets the region.

Suggested change
default="eu-west-1",
envvar="AWS_SQS_REGION",

Copilot uses AI. Check for mistakes.
@clemlesne clemlesne force-pushed the feat/aws-peristence branch 5 times, most recently from ffb1ef5 to 3f11f80 Compare January 15, 2025 11:35
@clemlesne clemlesne requested a review from Copilot January 15, 2025 11:36
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 9 out of 24 changed files in this pull request and generated no comments.

Files not reviewed (15)
  • .env.example: Language not supported
  • .vscode/settings.json: Language not supported
  • Makefile: Language not supported
  • cicd/test-unit-ci.sh: Language not supported
  • .github/workflows/pipeline.yaml: Evaluated as low risk
  • src/scrape_it_now/models/message.py: Evaluated as low risk
  • src/scrape_it_now/index.py: Evaluated as low risk
  • src/scrape_it_now/persistence/local_disk.py: Evaluated as low risk
  • src/scrape_it_now/scrape.py: Evaluated as low risk
  • src/scrape_it_now/helpers/http.py: Evaluated as low risk
  • src/scrape_it_now/persistence/azure_queue_storage.py: Evaluated as low risk
  • src/scrape_it_now/helpers/persistence.py: Evaluated as low risk
  • src/scrape_it_now/cli.py: Evaluated as low risk
  • pyproject.toml: Evaluated as low risk
  • src/scrape_it_now/persistence/iblob.py: Evaluated as low risk
Comments suppressed due to low confidence (4)

src/scrape_it_now/persistence/aws_s3.py:273

  • The _file_lock method should handle stale lock files to prevent deadlock situations. Consider adding a mechanism to handle stale lock files.
while await self._exists(lock_file):

src/scrape_it_now/persistence/aws_s3.py:341

  • The __aenter__ method should check if the S3 bucket creation is successful. Add a check to ensure that the bucket is created successfully.
get_session().create_client(**client_kwargs).create_bucket(

src/scrape_it_now/persistence/aws_sqs.py:130

  • The error message should be more descriptive. Consider changing it to 'Receipt handle for message "{message.message_id}" is invalid'.
f'Message "{message.message_id}" not found'

src/scrape_it_now/persistence/aws_sqs.py:60

  • The send_message method should have a test case to verify its behavior.
async def send_message(

@clemlesne clemlesne force-pushed the feat/aws-peristence branch 2 times, most recently from d897f69 to 08f150b Compare January 15, 2025 11:54
@clemlesne clemlesne force-pushed the feat/aws-peristence branch from 08f150b to f07045e Compare January 16, 2025 11:09
@clemlesne clemlesne force-pushed the develop branch 4 times, most recently from 7de0253 to d03ab37 Compare February 8, 2025 18:11
@clemlesne clemlesne force-pushed the feat/aws-peristence branch from 355c43a to b5bc866 Compare February 8, 2025 18:49
@clemlesne clemlesne force-pushed the develop branch 23 times, most recently from a2c41d3 to 0b8145e Compare February 10, 2025 10:19
@clemlesne clemlesne requested a review from Copilot March 30, 2025 18:54
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds AWS backend support to the application by integrating AWS S3 for blob storage and AWS SQS for queue management. Key changes include adding AWS credential and endpoint parameters in tests and CLI commands, implementing AWS S3 and SQS clients, and updating documentation and configurations.

Reviewed Changes

Copilot reviewed 18 out of 21 changed files in this pull request and generated no comments.

File Description
tests/scrape.py, tests/queue.py, tests/blob.py Include AWS credential parameters for AWS-based tests
src/scrape_it_now/* Add AWS backend support in worker, persistence, and client helper logic
src/scrape_it_now/cli.py Add new CLI options for AWS credentials and endpoints
pyproject.toml, README.md, pipeline.yaml Update dependencies, documentation, and CI configuration for AWS usage
Files not reviewed (3)
  • .env.example: Language not supported
  • Makefile: Language not supported
  • cicd/test-unit-ci.sh: Language not supported
Comments suppressed due to low confidence (1)

src/scrape_it_now/cli.py:122

  • [nitpick] The short option '-ase' for the AWS SQS endpoint in aws_sqs_params conflicts with the one used for the AWS S3 endpoint in aws_s3_params, which can lead to ambiguity. Consider using a distinct short option (e.g. '-asq') for AWS SQS endpoint to avoid confusion.
        "-ase",

@clemlesne clemlesne requested a review from Copilot May 31, 2025 13:16
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces AWS S3 and AWS SQS support throughout the application, extending persistence layers, tests, CLI, and documentation.

  • Added new aws_s3 and aws_sqs persistence implementations with configuration and retry logic.
  • Extended CLI decorators and helper factories to accept AWS parameters.
  • Updated tests, documentation, CI, and example env files to cover AWS providers.

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/scrape.py Pass AWS credentials/endpoints into blob_client and queue_client tests
tests/queue.py Include AWS_SQS in test parameterization via PROVIDERS
tests/blob.py Include AWS_S3 in test parameterization via PROVIDERS
tests/conftest.py Updated install() flag comment to reflect manual deps
src/scrape_it_now/persistence/local_disk.py Removed duplicate LeaseModel definition, imported shared model
src/scrape_it_now/persistence/iqueue.py Added AWS_SQS to Provider enum
src/scrape_it_now/persistence/iblob.py Added AWS_S3 to Provider enum
src/scrape_it_now/persistence/aws_sqs.py New AWS SQS client implementation with retries
src/scrape_it_now/persistence/aws_s3.py New AWS S3 client implementation with leasing
src/scrape_it_now/models/lease.py Introduced LeaseModel for cross-provider leases
src/scrape_it_now/helpers/persistence.py Extended blob_client and queue_client to handle AWS backends
src/scrape_it_now/cli.py Added AWS CLI options and decorators for S3 and SQS
src/scrape_it_now/scrape.py Propagated AWS parameters into scrape workers
src/scrape_it_now/index.py Propagated AWS parameters into index workers
pyproject.toml Added aiobotocore and botocore dependencies
cicd/test-unit-ci.sh Added AWS mock and cleanup logic in CI script
.github/workflows/pipeline.yaml Ensure tests run on all branches and set up Python
.env.example Documented AWS env vars for local testing
README.md Documented AWS usage examples and options
Makefile Added Playwright install reordering and AWS mock target



def aws_s3_params(func):
@click.option(
Copy link

Copilot AI May 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The shorthand -ase for both --aws-s3-endpoint and --aws-sqs-endpoint conflicts. Use distinct flags (e.g. -ase3 for S3) to avoid ambiguous CLI behavior.

Copilot uses AI. Check for mistakes.
self._client = await self._service.__aenter__()
# Create if it does not exist
try:
get_session().create_client(**client_kwargs).create_bucket(
Copy link

Copilot AI May 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling the synchronous create_bucket blocks the async event loop. Consider using the async S3 client or offloading this to a thread to avoid blocking.

Suggested change
get_session().create_client(**client_kwargs).create_bucket(
await self._client.create_bucket(

Copilot uses AI. Check for mistakes.
yield Message(
content=content,
delete_token=message["ReceiptHandle"],
dequeue_count=message["Attributes"]["ApproximateReceiveCount"],
Copy link

Copilot AI May 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ApproximateReceiveCount attribute is returned as a string; this should be cast to int before passing into Message.deque_count to match the expected type.

Suggested change
dequeue_count=message["Attributes"]["ApproximateReceiveCount"],
dequeue_count=int(message["Attributes"]["ApproximateReceiveCount"]),

Copilot uses AI. Check for mistakes.
```

Most frequent options are:
Frequent general options are:
Copy link

Copilot AI May 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The options sections are duplicated in multiple places. Consolidate into a single reference table to reduce redundancy and maintenance burden.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants