Scraper Refactor #45
Conversation
Dependency Review: ✅ No vulnerabilities, license issues, or OpenSSF Scorecard issues found. Scanned files: none.
""" WalkthroughThe changes refactor the BlueSky post search logic by removing the Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant mission_blue
participant scraper
participant BlueSkyAPI
User->>mission_blue: Initiate post search
mission_blue->>scraper: search_posts(params, token)
loop While more posts and limit not reached
scraper->>BlueSkyAPI: GET /search/posts (with params, token)
BlueSkyAPI-->>scraper: JSON response (posts, cursor)
scraper->>scraper: Update progress bar, accumulate posts
end
scraper-->>mission_blue: Return list of posts
mission_blue-->>User: Present posts
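The loop in the diagram can be sketched in Python. This is a minimal illustration, not the PR's actual implementation: `fetch_page` stands in for the real BlueSky API call, and the response keys (`posts`, `cursor`) and the `posts_limit` default of 500 are assumptions taken from the review discussion below.

```python
def search_posts(fetch_page, params, posts_limit=500):
    """Accumulate posts across pages until the cursor runs out or the limit is hit.

    fetch_page(params, cursor) is a stand-in for the GET /search/posts call
    and must return a dict with "posts" and, optionally, "cursor".
    """
    posts = []
    cursor = None
    while True:
        page = fetch_page(params, cursor)      # one GET /search/posts round trip
        posts.extend(page.get("posts", []))
        if len(posts) >= posts_limit:
            return posts[:posts_limit]         # truncate to the requested limit
        cursor = page.get("cursor")
        if not cursor:                         # no more pages available
            return posts
```

Injecting the fetch callable keeps the pagination logic testable without hitting the network.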
search_posts now ensures that query is within params (required), and test cases validate this new logic.
Verified that search_posts() logic handles Client Error 400 correctly and gracefully.
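A 400-handling test along these lines could back up that claim. This is a hedged sketch: the stub below mirrors what `scraper.search_posts` appears to do, but the URL, function signature, and return-empty-list behavior are assumptions, not the PR's confirmed code.

```python
from unittest import mock
import requests

def search_posts_stub(params, token):
    # Stand-in for scraper.search_posts: returns [] on a client error.
    resp = requests.get(
        "https://example.invalid/search/posts",  # hypothetical endpoint
        params=params,
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    try:
        resp.raise_for_status()
    except requests.HTTPError:
        return []  # degrade gracefully on 4xx/5xx
    return resp.json().get("posts", [])

def test_client_error_400_returns_empty():
    fake = mock.Mock()
    fake.raise_for_status.side_effect = requests.HTTPError("400 Client Error")
    with mock.patch("requests.get", return_value=fake):
        assert search_posts_stub({"q": "bad"}, "token") == []
```

Stubbing `requests.get` keeps the test fast and deterministic, with no live API call.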
@Lementknight turning this over to you to review. Anything else you think I should add to test?
Actionable comments posted: 2
🧹 Nitpick comments (7)
scraper.py (1)

83-83: Fix typo in comment. "enxt" should be "next".

```diff
- # Move to the enxt page if available
+ # Move to the next page if available
```

mission_blue.py (1)

1-1: Fix typo in module docstring. "conatins" should be "contains".

```diff
-"""This module conatins the BlueSky Web Scrapper."""
+"""This module contains the BlueSky Web Scrapper."""
```

tests/scraper_test.py (5)

1-1: Fix module reference in docstring. The docstring refers to "mission_blue module" but this file tests the scraper module.

```diff
-"""Testing suite for the mission_blue module."""
+"""Testing suite for the scraper module."""
```

28-28: Fix typo in docstring. "it not provided" should be "is not provided".

```diff
-"""Test if the function raises ValueError when a token it not provided."""
+"""Test if the function raises ValueError when a token is not provided."""
```

38-39: Remove or complete the incomplete comment. The comment appears to be incomplete or unnecessary.

```diff
-# Ensure that the function returns an empty list when no posts are found
-
```

95-95: Fix typo in comment. "Redircting" should be "Redirecting".

```diff
-# Redircting stdout to StringIO
+# Redirecting stdout to StringIO
```
11-104: Consider adding test coverage for pagination and posts_limit functionality. The current tests cover basic scenarios well, but are missing coverage for:
- Pagination handling (when cursor is returned)
- posts_limit functionality (when total posts exceed the limit)
- Progress bar behavior (though this might be mocked)
Would you like me to generate additional test cases to improve coverage?
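A pagination test of the kind suggested above could look like this. It is a sketch, not the PR's test code: `fetch_all` is a minimal stand-in for `scraper.search_posts`, and the mocked response shape (`posts`, `cursor`) is assumed from the walkthrough.

```python
from unittest import mock

def fetch_all(get, limit=500):
    # Minimal paginating fetcher standing in for scraper.search_posts;
    # get(cursor) stands in for the HTTP call and returns a mock response.
    posts, cursor = [], None
    while len(posts) < limit:
        data = get(cursor).json()
        posts.extend(data["posts"])
        cursor = data.get("cursor")
        if not cursor:  # no further pages
            break
    return posts[:limit]

def test_pagination_follows_cursor():
    # First page returns a cursor, second page does not,
    # so the loop should make exactly two requests.
    page1 = mock.Mock()
    page1.json.return_value = {"posts": ["a", "b"], "cursor": "c1"}
    page2 = mock.Mock()
    page2.json.return_value = {"posts": ["c"]}
    get = mock.Mock(side_effect=[page1, page2])
    assert fetch_all(get) == ["a", "b", "c"]
    assert get.call_count == 2

def test_posts_limit_truncates():
    page = mock.Mock()
    page.json.return_value = {"posts": ["a", "b", "c"]}
    get = mock.Mock(return_value=page)
    assert fetch_all(get, limit=2) == ["a", "b"]
```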
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- mission_blue.py (2 hunks)
- scraper.py (1 hunks)
- tests/scraper_test.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
mission_blue.py (1)
scraper.py (1)
search_posts(10-96)
🔇 Additional comments (1)
mission_blue.py (1)
6-6: LGTM! The refactoring correctly delegates the search functionality to the new scraper module.
Also applies to: 361-361
Actionable comments posted: 1
♻️ Duplicate comments (2)
scraper.py (2)
35-35: Docstring is consistent with implementation. Contrary to the previous review comment, the docstring correctly states the default as 500, which matches the implementation at line 59. No changes needed.
90-96: Improve error handling pattern. Using `'response' in locals()` is a code smell, and the previous review comment correctly identified this issue.
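One conventional fix for the `'response' in locals()` smell is to initialize `response` to `None` before the `try` block. This is a sketch of the pattern, not the PR's actual code; the URL and function signature are placeholders.

```python
import requests

def fetch_page(url, params, token):
    """Fetch one page, handling request failures without probing locals()."""
    response = None  # defined up front, so it always exists in the except block
    try:
        response = requests.get(
            url,
            params=params,
            headers={"Authorization": f"Bearer {token}"},
            timeout=10,
        )
        response.raise_for_status()
        return response.json()
    except requests.RequestException as exc:
        # response is None only if the request never completed
        status = response.status_code if response is not None else "no response"
        print(f"Request failed ({status}): {exc}")
        return {}
```

The explicit `None` sentinel makes the error path statically obvious, where `locals()` checks hide it from both readers and linters.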
🧹 Nitpick comments (3)
scraper.py (3)
10-13: Remove redundant pylint disable comments. The pylint disable comments appear unnecessary without clear justification for their specific violations.

```diff
 def search_posts(params, token):
-    # pylint: disable=E1102
-    # pylint: disable=C0301
-
```
77-81: Optimize condition and list slicing. The condition `if posts_limit` is redundant since `posts_limit` is guaranteed to be a number. Also, slicing large lists can be memory-intensive.

```diff
-    if posts_limit and total_fetched >= posts_limit:
+    if total_fetched >= posts_limit:
         print(
             f"Fetched {total_fetched} posts, total: {total_fetched}/{posts_limit}"
         )
-        return posts[:posts_limit]
+        # Truncate only if we exceeded the limit
+        if len(posts) > posts_limit:
+            posts = posts[:posts_limit]
+        return posts
```
83-83: Fix typo in comment. There's a typo "nextt" instead of "next".

```diff
- # Move to the nextt page if available
+ # Move to the next page if available
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- mission_blue.py (2 hunks)
- scraper.py (1 hunks)
- tests/scraper_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- mission_blue.py
- tests/scraper_test.py
🔇 Additional comments (4)
scraper.py (4)
1-8: LGTM! Clean module structure and appropriate imports. The module docstring is clear and the imports are well-organized for the functionality needed (HTTP requests and progress tracking).
45-49: LGTM! Proper input validation. The validation correctly checks for required parameters and provides clear error messages.
51-56: LGTM! Standard API request setup. The URL, headers, and authorization are properly configured for the BlueSky API.
58-61: LGTM! Creative progress bar implementation. The butterfly-themed progress bar adds a nice touch while maintaining functionality.
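For illustration, a butterfly-themed progress bar can be rendered with pure stdlib code; this is a hypothetical sketch, and the PR's actual implementation (and any progress library it uses) may differ.

```python
def butterfly_bar(done, total, width=20):
    """Render a text progress bar that fills with butterflies."""
    filled = int(width * done / total) if total else width
    bar = "🦋" * filled + "." * (width - filled)
    return f"[{bar}] {done}/{total}"

# Typical use inside a fetch loop:
# print(butterfly_bar(total_fetched, posts_limit), end="\r")
```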
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Keeping the code clean ensures that our scraper remains functional when scrapes get large.
Lementknight
left a comment
LGTM
@andewmark even though you wrote the logic for this refactor, since I created the PR, you have to be the one to approve it and merge it.
posts_limit still defaulted to 1000 even though the docstring said 500, and other places had the opposite mismatch. Everything is now set to a default of 500.
andewmark
left a comment
Approving myself as per @Lementknight