Conversation
Copilot reviewed 8 out of 23 changed files in this pull request and generated 1 comment.
Files not reviewed (15)
- .env.example: Language not supported
- .vscode/settings.json: Language not supported
- Makefile: Language not supported
- cicd/test-unit-ci.sh: Language not supported
- .github/workflows/pipeline.yaml: Evaluated as low risk
- src/scrape_it_now/models/message.py: Evaluated as low risk
- src/scrape_it_now/index.py: Evaluated as low risk
- src/scrape_it_now/persistence/local_disk.py: Evaluated as low risk
- src/scrape_it_now/persistence/azure_queue_storage.py: Evaluated as low risk
- src/scrape_it_now/scrape.py: Evaluated as low risk
- src/scrape_it_now/helpers/persistence.py: Evaluated as low risk
- tests/blob.py: Evaluated as low risk
- src/scrape_it_now/cli.py: Evaluated as low risk
- src/scrape_it_now/persistence/iblob.py: Evaluated as low risk
- src/scrape_it_now/persistence/iqueue.py: Evaluated as low risk
Comments suppressed due to low confidence (2)
src/scrape_it_now/persistence/aws_s3.py:286
- The `__aenter__` method tries to create a bucket using `get_session().create_client(**client_kwargs).create_bucket(Bucket=self._config.name)`, which might lead to race conditions or unnecessary bucket creation attempts. Consider handling bucket creation separately or ensuring it is only done when necessary.
get_session().create_client(**client_kwargs).create_bucket(Bucket=self._config.name)
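The idempotent creation pattern the reviewer hints at can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: `ensure_bucket` and `FakeS3Client` are hypothetical stand-ins, and the `ClientError` class here only mimics the shape of botocore's exception (real botocore raises `botocore.exceptions.ClientError` with the code under `response["Error"]["Code"]`).

```python
# Sketch: tolerate concurrent or repeated bucket creation instead of
# treating "already exists" as a failure. All names are illustrative.

class ClientError(Exception):
    """Minimal stand-in for a botocore-style client error."""
    def __init__(self, code):
        super().__init__(code)
        self.code = code

def ensure_bucket(client, name):
    """Create the bucket only if needed; tolerate concurrent creators."""
    try:
        client.create_bucket(Bucket=name)
    except ClientError as e:
        # Another worker (or a previous run) already created it: not an error.
        if e.code not in ("BucketAlreadyOwnedByYou", "BucketAlreadyExists"):
            raise

class FakeS3Client:
    """In-memory fake, so the pattern can be exercised without AWS."""
    def __init__(self):
        self.buckets = set()
    def create_bucket(self, Bucket):
        if Bucket in self.buckets:
            raise ClientError("BucketAlreadyOwnedByYou")
        self.buckets.add(Bucket)

client = FakeS3Client()
ensure_bucket(client, "scrape-it-now")
ensure_bucket(client, "scrape-it-now")  # second call is a no-op, not a crash
```

With the real async client the same `try/except` would wrap an awaited `create_bucket` call, keyed on the error code in the raised `ClientError`.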
src/scrape_it_now/persistence/aws_s3.py:80
- The `lease_blob` method has a recursive call to itself, which might lead to potential issues if not handled carefully. Consider refactoring to avoid recursion or ensure proper handling of recursive calls.
async with self.lease_blob(blob=blob, lease_duration=lease_duration) as lease_id:
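One way to address this comment is to replace the self-call with an explicit retry loop, which bounds the number of attempts and avoids growing the call stack. The sketch below is an assumption about the intent, not the PR's code: `LeaseAlreadyExistsError`, `lease_with_retry`, and the stub `try_acquire` are all illustrative names.

```python
import asyncio

class LeaseAlreadyExistsError(Exception):
    """Raised when another worker currently holds the lease (illustrative)."""

async def lease_with_retry(try_acquire, attempts=5, delay=0.01):
    # An explicit loop makes the retry budget visible and avoids unbounded
    # recursion; after the last attempt the error propagates to the caller.
    last_exc = None
    for _ in range(attempts):
        try:
            return await try_acquire()
        except LeaseAlreadyExistsError as exc:
            last_exc = exc
            await asyncio.sleep(delay)  # back off before the next attempt
    raise last_exc

calls = {"n": 0}

async def try_acquire():
    # Stand-in for the real acquisition: fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise LeaseAlreadyExistsError()
    return "lease-123"

lease_id = asyncio.run(lease_with_retry(try_acquire))
```

In the real method the loop body would attempt the S3 lease write; the structure stays the same.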
Copilot reviewed 8 out of 23 changed files in this pull request and generated 3 comments.
Files not reviewed (15)
- .env.example: Language not supported
- .vscode/settings.json: Language not supported
- Makefile: Language not supported
- cicd/test-unit-ci.sh: Language not supported
- .github/workflows/pipeline.yaml: Evaluated as low risk
- src/scrape_it_now/index.py: Evaluated as low risk
- src/scrape_it_now/models/message.py: Evaluated as low risk
- src/scrape_it_now/scrape.py: Evaluated as low risk
- src/scrape_it_now/persistence/azure_queue_storage.py: Evaluated as low risk
- src/scrape_it_now/persistence/local_disk.py: Evaluated as low risk
- src/scrape_it_now/helpers/persistence.py: Evaluated as low risk
- tests/blob.py: Evaluated as low risk
- src/scrape_it_now/persistence/iqueue.py: Evaluated as low risk
- pyproject.toml: Evaluated as low risk
- src/scrape_it_now/persistence/iblob.py: Evaluated as low risk
Comments suppressed due to low confidence (4)
src/scrape_it_now/persistence/aws_s3.py:114
- [nitpick] The argument name 'lease_id' could be more descriptive, such as 'existing_lease_id', to clarify its purpose.
lease_id: str | None = None
src/scrape_it_now/persistence/aws_s3.py:78
- The retry mechanism in the 'lease_blob' method should be covered by tests to ensure it works as expected under different scenarios.
await asyncio.sleep(0.1)
src/scrape_it_now/persistence/aws_s3.py:224
- The timeout handling in the '_file_lock' method should be covered by tests to ensure it works as expected under different scenarios.
seconds=timeout
src/scrape_it_now/persistence/aws_sqs.py:65
- [nitpick] The variable name 'pulled_now' is ambiguous. It should be renamed to 'messages_pulled_now'.
pulled_now = min(max_messages - total_pulled, max_per_batch)
```python
@click.option(
    "--aws-sqs-region",
    "-asr",
    default="eu-west-1",
```
The default value for 'aws_sqs_region' might lead to unintended behavior if the user is not aware of it. Consider removing the default value to ensure the user explicitly sets the region.
Suggested change:

```diff
-    default="eu-west-1",
+    envvar="AWS_SQS_REGION",
```
Force-pushed from ffb1ef5 to 3f11f80
Copilot reviewed 9 out of 24 changed files in this pull request and generated no comments.
Files not reviewed (15)
- .env.example: Language not supported
- .vscode/settings.json: Language not supported
- Makefile: Language not supported
- cicd/test-unit-ci.sh: Language not supported
- .github/workflows/pipeline.yaml: Evaluated as low risk
- src/scrape_it_now/models/message.py: Evaluated as low risk
- src/scrape_it_now/index.py: Evaluated as low risk
- src/scrape_it_now/persistence/local_disk.py: Evaluated as low risk
- src/scrape_it_now/scrape.py: Evaluated as low risk
- src/scrape_it_now/helpers/http.py: Evaluated as low risk
- src/scrape_it_now/persistence/azure_queue_storage.py: Evaluated as low risk
- src/scrape_it_now/helpers/persistence.py: Evaluated as low risk
- src/scrape_it_now/cli.py: Evaluated as low risk
- pyproject.toml: Evaluated as low risk
- src/scrape_it_now/persistence/iblob.py: Evaluated as low risk
Comments suppressed due to low confidence (4)
src/scrape_it_now/persistence/aws_s3.py:273
- The `_file_lock` method should handle stale lock files to prevent deadlock situations. Consider adding a mechanism to detect and clear stale lock files.
while await self._exists(lock_file):
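A common way to handle the stale-lock concern is to give the lock a time-to-live and treat any lock older than that as abandoned. The sketch below assumes the lock records its creation time (an assumption; the PR's actual lock-file format is not shown here), and the `is_stale` helper is a hypothetical name.

```python
# Sketch: a lock older than its TTL is considered abandoned, so a worker
# that crashed while holding it cannot deadlock the others.
from datetime import datetime, timedelta, timezone

LOCK_TTL = timedelta(seconds=30)  # illustrative value

def is_stale(lock_written_at, now=None, ttl=LOCK_TTL):
    """Return True when the lock's age exceeds the TTL."""
    now = now or datetime.now(timezone.utc)
    return now - lock_written_at > ttl

now = datetime.now(timezone.utc)
fresh = is_stale(now - timedelta(seconds=5), now=now)    # held recently
stale = is_stale(now - timedelta(seconds=60), now=now)   # abandoned
```

In the `while await self._exists(lock_file)` loop, a stale lock would then be deleted (or overwritten) instead of waited on indefinitely.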
src/scrape_it_now/persistence/aws_s3.py:341
- The `__aenter__` method should check whether the S3 bucket creation is successful. Add a check to ensure that the bucket is created successfully.
get_session().create_client(**client_kwargs).create_bucket(
src/scrape_it_now/persistence/aws_sqs.py:130
- The error message should be more descriptive. Consider changing it to 'Receipt handle for message "{message.message_id}" is invalid'.
f'Message "{message.message_id}" not found'
src/scrape_it_now/persistence/aws_sqs.py:60
- The `send_message` method should have a test case to verify its behavior.
async def send_message(
Force-pushed from d897f69 to 08f150b
Force-pushed from 08f150b to f07045e
Force-pushed from 7de0253 to d03ab37
Force-pushed from f07045e to 355c43a
Force-pushed from 355c43a to b5bc866
Force-pushed from a2c41d3 to 0b8145e
Pull Request Overview
This PR adds AWS backend support to the application by integrating AWS S3 for blob storage and AWS SQS for queue management. Key changes include adding AWS credential and endpoint parameters in tests and CLI commands, implementing AWS S3 and SQS clients, and updating documentation and configurations.
Reviewed Changes
Copilot reviewed 18 out of 21 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/scrape.py, tests/queue.py, tests/blob.py | Include AWS credential parameters for AWS-based tests |
| src/scrape_it_now/* | Add AWS backend support in worker, persistence, and client helper logic |
| src/scrape_it_now/cli.py | Add new CLI options for AWS credentials and endpoints |
| pyproject.toml, README.md, pipeline.yaml | Update dependencies, documentation, and CI configuration for AWS usage |
Files not reviewed (3)
- .env.example: Language not supported
- Makefile: Language not supported
- cicd/test-unit-ci.sh: Language not supported
Comments suppressed due to low confidence (1)
src/scrape_it_now/cli.py:122
- [nitpick] The short option '-ase' for the AWS SQS endpoint in aws_sqs_params conflicts with the one used for the AWS S3 endpoint in aws_s3_params, which can lead to ambiguity. Consider using a distinct short option (e.g. '-asq') for AWS SQS endpoint to avoid confusion.
"-ase",
Pull Request Overview
This PR introduces AWS S3 and AWS SQS support throughout the application, extending persistence layers, tests, CLI, and documentation.
- Added new `aws_s3` and `aws_sqs` persistence implementations with configuration and retry logic.
- Extended CLI decorators and helper factories to accept AWS parameters.
- Updated tests, documentation, CI, and example env files to cover AWS providers.
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/scrape.py | Pass AWS credentials/endpoints into blob_client and queue_client tests |
| tests/queue.py | Include AWS_SQS in test parameterization via PROVIDERS |
| tests/blob.py | Include AWS_S3 in test parameterization via PROVIDERS |
| tests/conftest.py | Updated install() flag comment to reflect manual deps |
| src/scrape_it_now/persistence/local_disk.py | Removed duplicate LeaseModel definition, imported shared model |
| src/scrape_it_now/persistence/iqueue.py | Added AWS_SQS to Provider enum |
| src/scrape_it_now/persistence/iblob.py | Added AWS_S3 to Provider enum |
| src/scrape_it_now/persistence/aws_sqs.py | New AWS SQS client implementation with retries |
| src/scrape_it_now/persistence/aws_s3.py | New AWS S3 client implementation with leasing |
| src/scrape_it_now/models/lease.py | Introduced LeaseModel for cross-provider leases |
| src/scrape_it_now/helpers/persistence.py | Extended blob_client and queue_client to handle AWS backends |
| src/scrape_it_now/cli.py | Added AWS CLI options and decorators for S3 and SQS |
| src/scrape_it_now/scrape.py | Propagated AWS parameters into scrape workers |
| src/scrape_it_now/index.py | Propagated AWS parameters into index workers |
| pyproject.toml | Added aiobotocore and botocore dependencies |
| cicd/test-unit-ci.sh | Added AWS mock and cleanup logic in CI script |
| .github/workflows/pipeline.yaml | Ensure tests run on all branches and set up Python |
| .env.example | Documented AWS env vars for local testing |
| README.md | Documented AWS usage examples and options |
| Makefile | Added Playwright install reordering and AWS mock target |
```python
def aws_s3_params(func):
    @click.option(
```
[nitpick] The shorthand `-ase` is used for both `--aws-s3-endpoint` and `--aws-sqs-endpoint`, which conflicts. Use distinct flags (e.g. `-ase3` for S3) to avoid ambiguous CLI behavior.
```python
self._client = await self._service.__aenter__()
# Create if it does not exist
try:
    get_session().create_client(**client_kwargs).create_bucket(
```
Calling the synchronous create_bucket blocks the async event loop. Consider using the async S3 client or offloading this to a thread to avoid blocking.
Suggested change:

```diff
-            get_session().create_client(**client_kwargs).create_bucket(
+            await self._client.create_bucket(
```
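When an async client is not available for a given call, another option implied by this comment is to offload the blocking call to a thread so the event loop stays responsive. The sketch below uses the standard-library `asyncio.to_thread`; `blocking_create_bucket` is a stand-in for the synchronous botocore call, not the PR's code.

```python
# Sketch: run a synchronous (blocking) call off the event loop.
import asyncio
import time

def blocking_create_bucket(name):
    time.sleep(0.05)  # stands in for a synchronous network round-trip
    return f"created:{name}"

async def main():
    # asyncio.to_thread runs the sync call in a worker thread, keeping the
    # loop free for other coroutines in the meantime.
    return await asyncio.to_thread(blocking_create_bucket, "scrape-it-now")

result = asyncio.run(main())
```

Using the async S3 client directly, as the suggestion above proposes, is the cleaner fix when the client supports it; the thread offload is the fallback.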
```python
yield Message(
    content=content,
    delete_token=message["ReceiptHandle"],
    dequeue_count=message["Attributes"]["ApproximateReceiveCount"],
```
The `ApproximateReceiveCount` attribute is returned as a string; it should be cast to `int` before being passed into `Message.dequeue_count` to match the expected type.
Suggested change:

```diff
-    dequeue_count=message["Attributes"]["ApproximateReceiveCount"],
+    dequeue_count=int(message["Attributes"]["ApproximateReceiveCount"]),
```
Suggested change:

```diff
-Most frequent options are:
+Frequent general options are:
```
[nitpick] The options sections are duplicated in multiple places. Consolidate into a single reference table to reduce redundancy and maintenance burden.
No description provided.