35 commits
e53a496
Match non recursive element texts
alirezamika Oct 26, 2020
dad82b7
Update README.md
alirezamika Oct 26, 2020
3f953ea
Update version to 1.1.8
alirezamika Oct 26, 2020
93af234
replace fuzzywuzzy with difflib
maxbachmann Oct 30, 2020
fc346bb
Merge pull request #36 from maxbachmann/master
alirezamika Oct 31, 2020
f09d395
Merge pull request #37 from alirezamika/dev
alirezamika Oct 31, 2020
17ea783
Resolve a backward compatibility issue
alirezamika Nov 5, 2020
10153e8
Add ability to set fuzziness ratio for matching the wanted items
alirezamika Nov 5, 2020
b34c72e
Update version to 1.1.9
alirezamika Nov 5, 2020
b2e3391
Update README.md
alirezamika Nov 27, 2020
5d5f390
Add support for regular expressions for wanted items
alirezamika Nov 29, 2020
64a190e
Update README.md
alirezamika Dec 15, 2020
5b9c0a0
apply fuzziness ratio on full url matching
alirezamika Jan 10, 2021
3a00473
Update setup.py
alirezamika Jan 10, 2021
0c5922c
Update README.md
alirezamika Jan 11, 2021
48048ff
Create FUNDING.yml
alirezamika Jan 12, 2021
0b64c8b
Fix fetching result from root element
alirezamika Jan 18, 2021
a4ab3ed
refactor fetch html method
alirezamika Jan 18, 2021
731f625
Fix requests encoding
alirezamika Jan 23, 2021
1193d13
Update version to 1.1.12
alirezamika Jan 23, 2021
d33dbf0
Update README.md
alirezamika Jan 28, 2021
973ba6a
Update README.md
alirezamika Feb 3, 2021
3901d69
Add keep_blank option
gsakkis Jul 9, 2022
ea74e39
Merge pull request #73 from gsakkis/master
alirezamika Jul 17, 2022
26bc6bf
update version to 1.1.14
Jul 17, 2022
f209c3d
Update README.md
alirezamika Sep 24, 2024
e261605
Create stale-issues.yml
alirezamika Oct 8, 2024
4429311
Update stale-issues.yml
alirezamika Oct 9, 2024
348c355
Update stale-issues.yml
alirezamika Oct 9, 2024
e95999c
Update stale-issues.yml
alirezamika Oct 12, 2024
621779b
Refine complex tests and reorganize suite
alirezamika Jun 8, 2025
ea9e90c
Add CI workflows for tests
alirezamika Jun 8, 2025
996f06e
Merge pull request #106 from alirezamika/codex/create-tests-directory…
alirezamika Jun 8, 2025
eec3339
Remove unused get_random_str
alirezamika Jun 8, 2025
eb72f5d
Merge pull request #108 from alirezamika/codex/make-rule-id-determini…
alirezamika Jun 8, 2025
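Commits 93af234 and 10153e8 above replace fuzzywuzzy with difflib and add a fuzziness ratio for matching wanted items. The idea can be illustrated with a minimal sketch built on the standard library's `difflib.SequenceMatcher`. The helper names `similarity` and `fuzzy_match` and the `min_ratio` parameter are hypothetical, chosen for illustration only; this is not the library's actual implementation.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] using difflib's SequenceMatcher."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_match(candidate: str, wanted: str, min_ratio: float = 1.0) -> bool:
    """Hypothetical helper: accept `candidate` if it is similar enough to `wanted`.

    A ratio of 1.0 is treated as an exact-match requirement; lower values
    tolerate small differences between the two strings.
    """
    if min_ratio >= 1.0:
        return candidate == wanted
    return similarity(candidate, wanted) >= min_ratio


# "abcd" and "abce" share the block "abc", so the ratio is 2 * 3 / 8 = 0.75.
print(fuzzy_match("abcd", "abce", min_ratio=0.7))  # True
print(fuzzy_match("abcd", "abce", min_ratio=0.8))  # False
```

A threshold below 1.0 trades precision for recall: text that differs slightly from the sample can still match, at the cost of occasional false positives.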
12 changes: 12 additions & 0 deletions .github/FUNDING.yml
@@ -0,0 +1,12 @@
+# These are supported funding model platforms
+
+github: [alirezamika] # Replace with up to 4 GitHub Sponsors-enabled usernames e.g., [user1, user2]
+patreon: # Replace with a single Patreon username
+open_collective: # Replace with a single Open Collective username
+ko_fi: # Replace with a single Ko-fi username
+tidelift: # Replace with a single Tidelift platform-name/package-name e.g., npm/babel
+community_bridge: # Replace with a single Community Bridge project-name e.g., cloud-foundry
+liberapay: # Replace with a single Liberapay username
+issuehunt: # Replace with a single IssueHunt username
+otechie: # Replace with a single Otechie username
+custom: # Replace with up to 4 custom sponsorship URLs e.g., ['link1', 'link2']
6 changes: 5 additions & 1 deletion .github/workflows/python-publish.yml
@@ -21,7 +21,11 @@ jobs:
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
-        pip install setuptools wheel twine
+        pip install setuptools wheel twine pytest
+        pip install .
+    - name: Run tests
+      run: |
+        pytest -q
     - name: Build and publish
       env:
         TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
22 changes: 22 additions & 0 deletions .github/workflows/stale-issues.yml
@@ -0,0 +1,22 @@
+name: Close inactive issues
+on:
+  schedule:
+    - cron: "30 1 * * *"
+
+jobs:
+  close-issues:
+    runs-on: ubuntu-latest
+    permissions:
+      issues: write
+      pull-requests: write
+    steps:
+      - uses: actions/stale@v5
+        with:
+          days-before-issue-stale: 30
+          days-before-issue-close: 14
+          stale-issue-label: "stale"
+          stale-issue-message: "This issue is stale because it has been open for 30 days with no activity."
+          close-issue-message: "This issue was closed because it has been inactive for 14 days since being marked as stale."
+          days-before-pr-stale: 30
+          days-before-pr-close: 14
+          repo-token: ${{ secrets.GITHUB_TOKEN }}
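The schedule in this workflow can be read as a timeline: the job runs daily at 01:30 UTC, an issue becomes stale after 30 days without activity, and it is closed 14 days after that. The sketch below estimates those two dates for a given last-activity time. It is a rough model under stated assumptions: the function names are hypothetical, scheduled runs are assumed to start exactly on time (GitHub may delay them), and no activity is assumed to occur after the stale label is applied.

```python
from datetime import datetime, timedelta, timezone

DAYS_BEFORE_STALE = 30  # days-before-issue-stale in the workflow
DAYS_BEFORE_CLOSE = 14  # days-before-issue-close in the workflow


def next_run(after: datetime) -> datetime:
    """Return the first 01:30 UTC run at or after `after` (cron: "30 1 * * *")."""
    run = after.replace(hour=1, minute=30, second=0, microsecond=0)
    if run < after:
        run += timedelta(days=1)
    return run


def stale_timeline(last_activity: datetime) -> tuple[datetime, datetime]:
    """Estimate when an issue is labeled stale and when it is closed."""
    marked_stale = next_run(last_activity + timedelta(days=DAYS_BEFORE_STALE))
    closed = next_run(marked_stale + timedelta(days=DAYS_BEFORE_CLOSE))
    return marked_stale, closed


# Hypothetical issue whose last activity was on 12 Oct 2024 at 09:00 UTC.
last = datetime(2024, 10, 12, 9, 0, tzinfo=timezone.utc)
stale_at, closed_at = stale_timeline(last)
print(stale_at.isoformat())   # 2024-11-12T01:30:00+00:00
print(closed_at.isoformat())  # 2024-11-26T01:30:00+00:00
```

Because the job only runs once a day, an issue can sit up to about 24 hours past each threshold before the label or the close is applied.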
23 changes: 23 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,23 @@
+name: Run Tests
+
+on:
+  push:
+  release:
+    types: [created]
+
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+      - name: Set up Python
+        uses: actions/setup-python@v2
+        with:
+          python-version: '3.x'
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install pytest
+          pip install .
+      - name: Run tests
+        run: pytest -q
6 changes: 4 additions & 2 deletions README.md
@@ -5,6 +5,7 @@
 This project is made for automatic web scraping to make scraping easy.
 It gets a url or the html content of a web page and a list of sample data which we want to scrape from that page. **This data can be text, url or any html tag value of that page.** It learns the scraping rules and returns the similar elements. Then you can use this learned object with new urls to get similar content or the exact same element of those new pages.

+
 ## Installation

 It's compatible with python 3.
@@ -37,7 +38,7 @@ url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

 # We can add one or multiple candidates here.
 # You can also put urls here to retrieve urls.
-wanted_list = ["How to call an external command?"]
+wanted_list = ["What are metaclasses in Python?"]

 scraper = AutoScraper()
 result = scraper.build(url, wanted_list)
@@ -108,7 +109,7 @@ from autoscraper import AutoScraper

 url = 'https://github.com/alirezamika/autoscraper'

-wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '2.2k', 'https://github.com/alirezamika/autoscraper/issues']
+wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '6.2k', 'https://github.com/alirezamika/autoscraper/issues']

 scraper = AutoScraper()
 scraper.build(url, wanted_list)
@@ -140,6 +141,7 @@ scraper.load('yahoo-finance')
 ## Issues
 Feel free to open an issue if you have any problem using the module.

+
 ## Support the project

 <a href="https://www.buymeacoffee.com/alirezam" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-black.png" alt="Buy Me A Coffee" height="45" width="163" ></a>