Merged
52 commits
a4472ab  version [prerelease]: 0.6.7a0 (gitronald, Dec 5, 2025)
e4fd5c3  refactor: modernize type hints to Python 3.10+ syntax (gitronald, Dec 5, 2025)
3ddaff2  fix: update version in init (gitronald, Dec 5, 2025)
280b69a  refactor: modernize type hints in models/ (gitronald, Dec 5, 2025)
f3b9338  Merge pull request #91 from gitronald/master (gitronald, Jan 19, 2026)
6a5f6bb  Merge branch 'dev' into refactor/typehints-and-docs (gitronald, Jan 19, 2026)
6a8a468  add type hints and fix proxy_port bug in webutils (gitronald, Jan 19, 2026)
ee3b435  format webutils with black and refine type hints (gitronald, Jan 20, 2026)
2ddfa20  update poetry files and gitignore (gitronald, Jan 20, 2026)
333b4c3  Merge branch 'refactor/typehints-and-docs' into dev (gitronald, Jan 29, 2026)
94d729d  bump chrome version to 144 (gitronald, Jan 29, 2026)
a8466a4  refactor ad parser and add details item model (gitronald, Jan 29, 2026)
3491b7a  build(deps): bump protobuf from 6.33.1 to 6.33.5 (dependabot[bot], Feb 5, 2026)
11003f3  format ad parser pep 8 spacing and dict literals (gitronald, Feb 5, 2026)
a363037  add to_dict method to DetailsItem and use across parsers (gitronald, Feb 5, 2026)
61d7019  add misc field to details item and refactor available on parser (gitronald, Feb 5, 2026)
e0d54bc  add details list model and use in available on parser (gitronald, Feb 5, 2026)
00010bd  use details list across component parsers (gitronald, Feb 5, 2026)
9188503  Merge branch 'component-details' into dev (gitronald, Feb 5, 2026)
befa7c7  version [prerelease]: 0.6.7a1 (gitronald, Feb 5, 2026)
2422278  resolve poetry.lock merge conflict (gitronald, Feb 5, 2026)
bd41a0b  Merge pull request #92 from gitronald/dependabot/pip/protobuf-6.33.5 (gitronald, Feb 5, 2026)
b3dce38  update init version to 0.6.7a1 (gitronald, Feb 5, 2026)
a17185f  version [prerelease]: 0.6.7a2 (gitronald, Feb 5, 2026)
bf6f215  update classifiers: tag-agnostic heading search, tighten discussions_… (gitronald, Feb 6, 2026)
f5298f6  extract find_subcomponents() in general parser, add video format (gitronald, Feb 6, 2026)
70e8ba9  deduplicate urls in knowledge panel parser (gitronald, Feb 6, 2026)
4035d19  add get_title() helper to top stories parser for perspectives titles (gitronald, Feb 6, 2026)
c34d5e5  rewrite test_parse_serp with snapshot and structural tests (gitronald, Feb 6, 2026)
6bec20a  add demo screenshot script for visual serp inspection (gitronald, Feb 6, 2026)
076dd8e  Merge pull request #93 from gitronald/parser-updates (gitronald, Feb 6, 2026)
67f4dfb  add sub_type to perspectives parser from header text (gitronald, Feb 6, 2026)
6501f02  update readme: add poetry install, move updates to log (gitronald, Feb 6, 2026)
699dbb3  version [prerelease]: 0.6.7a3 (gitronald, Feb 6, 2026)
cce9964  add "latest posts from" to recent_posts classifier (gitronald, Feb 6, 2026)
e370c7a  version [prerelease]: 0.6.7a4 (gitronald, Feb 6, 2026)
8054d25  add get_text_by_selectors to webutils, update 7 parsers (gitronald, Feb 6, 2026)
904602e  fix serp_rank test to check per-serp sequentiality (gitronald, Feb 6, 2026)
cafbb84  update dependency lower bounds for security patches (gitronald, Feb 6, 2026)
5c89631  version [prerelease]: 0.6.7a5 (gitronald, Feb 6, 2026)
d9e554d  update checkout and setup-python actions to v6 (gitronald, Feb 6, 2026)
3f1ab6b  add test workflow on push to dev (gitronald, Feb 6, 2026)
409786e  add compressed test fixtures and condense script for ci (gitronald, Feb 6, 2026)
f6b7330  track test snapshots for ci (gitronald, Feb 6, 2026)
2cce8b1  use orjson in test and condense scripts (gitronald, Feb 6, 2026)
7d66c5d  version [prerelease]: 0.6.7a6 (gitronald, Feb 6, 2026)
adf8fba  add recent updates section with 0.6.7 changes to readme (gitronald, Feb 6, 2026)
ccbbf69  update testing section in readme with poetry and fixtures (gitronald, Feb 6, 2026)
9ee532a  update github actions section in readme with test workflow (gitronald, Feb 6, 2026)
2a18e93  update license year to 2026 (gitronald, Feb 6, 2026)
91788a8  fix test workflow to install dev dependencies (gitronald, Feb 6, 2026)
295d4e9  version [patch]: 0.6.7 (gitronald, Feb 6, 2026)
4 changes: 2 additions & 2 deletions .github/workflows/publish.yml
Original file line number Diff line number Diff line change
@@ -22,10 +22,10 @@ jobs:
       id-token: write
     steps:
       - name: Checkout code
-        uses: actions/checkout@v4
+        uses: actions/checkout@v6

       - name: Set up Python 3.12
-        uses: actions/setup-python@v5
+        uses: actions/setup-python@v6
         with:
           python-version: "3.12"

35 changes: 35 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,35 @@
+name: Test
+
+on:
+  push:
+    branches:
+      - dev
+
+permissions:
+  contents: read
+
+jobs:
+  test:
+    name: Run test suite
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v6
+
+      - name: Set up Python 3.12
+        uses: actions/setup-python@v6
+        with:
+          python-version: "3.12"
+
+      - name: Install Poetry
+        run: |
+          curl -sSL https://install.python-poetry.org | python - -y
+
+      - name: Update PATH
+        run: echo "$HOME/.local/bin" >> $GITHUB_PATH
+
+      - name: Install dependencies
+        run: poetry install --no-interaction --with dev
+
+      - name: Run tests
+        run: poetry run pytest tests/ -q
5 changes: 3 additions & 2 deletions .gitignore
@@ -1,13 +1,14 @@
.venv
.archive
.claude

build
data
notebooks
scripts/ads-no-subtype

*.egg-info
*__pycache__

# Ignore test data
# Ignore test cache
.pytest_cache
tests/__snapshots__/*
105 changes: 60 additions & 45 deletions README.md
@@ -8,53 +8,45 @@ and saving searches. It also includes a modular parser built on `BeautifulSoup`
for decomposing a SERP into list of components with categorical classifications
and position-based specifications.

## Recent Updates

Below are some details about recent updates. For a longer list, see the [Update Log](#update-log).

`0.6.6`
- Update packages with dependabot alerts (brotli, urllib3)

`0.6.5`
- Add GitHub Actions section to README

`0.6.0`
- Method for collecting data with selenium; requests no longer works without a redirect
- Pull request [#72](https://github.com/gitronald/WebSearcher/pull/72)

## Table of Contents

- [WebSearcher](#websearcher)
- [Tools for conducting and parsing web searches](#tools-for-conducting-and-parsing-web-searches)
- [Recent Updates](#recent-updates)
- [Table of Contents](#table-of-contents)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [Example Search Script](#example-search-script)
- [Step by Step](#step-by-step)
- [1. Initialize Collector](#1-initialize-collector)
- [2. Conduct a Search](#2-conduct-a-search)
- [3. Parse Search Results](#3-parse-search-results)
- [4. Save HTML and Metadata](#4-save-html-and-metadata)
- [5. Save Parsed Results](#5-save-parsed-results)
- [Localization](#localization)
- [Contributing](#contributing)
- [Repair or Enhance a Parser](#repair-or-enhance-a-parser)
- [Add a Parser](#add-a-parser)
- [Testing](#testing)
- [GitHub Actions](#github-actions)
- [Recent Updates](#recent-updates)
- [Update Log](#update-log)
- [Similar Packages](#similar-packages)
- [License](#license)

---
---
## Recent Updates

### 0.6.7

- Added `get_text_by_selectors()` to `webutils` -- centralizes multi-selector fallback pattern across 7 component parsers
- Added `perspectives`, `recent_posts`, and `latest_from` component classifiers
- Added `sub_type` to perspectives parser from header text
- Added CI test workflow on push to dev branch
- Added compressed test fixtures with `condense_fixtures.py` script
- Updated dependency lower bounds for security patches (protobuf, orjson)
- Updated GitHub Actions to checkout v6 and setup-python v6
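
For reviewers unfamiliar with the multi-selector fallback pattern named in the first bullet, here is a minimal stdlib-only sketch. It is not the `webutils` implementation: the name `first_text_by_paths`, its signature, the sample markup, and the use of `xml.etree.ElementTree` paths in place of BeautifulSoup CSS selectors are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def first_text_by_paths(root, paths, default=None):
    """Try each path in order and return the first non-empty text found.

    Hypothetical helper: stands in for a selector-based lookup, using
    ElementTree's limited XPath syntax instead of CSS selectors.
    """
    for path in paths:
        elem = root.find(path)
        if elem is not None:
            text = "".join(elem.itertext()).strip()
            if text:
                return text
    return default


# Hypothetical markup: the title lives in a <span>, not the preferred <h3>.
snippet = "<div><span class='new-title'>Example title</span><cite>example.com</cite></div>"
root = ET.fromstring(snippet)

# The first path misses, so the second path supplies the title.
title = first_text_by_paths(root, [".//h3", ".//span[@class='new-title']"])

# No path matches, so the caller-supplied default is returned.
missing = first_text_by_paths(root, [".//h4"], default="")
```

Keeping the ordered list of candidates in one helper means each parser only declares which selectors to try, rather than repeating the same chain of `if` checks.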

---
 ## Getting Started

 ```bash
-# Install pip version
+# Install from PyPI
 pip install WebSearcher

-# Install Github development version - less stable, more fun!
+# Or install with Poetry
+poetry add WebSearcher
+
+# Install development version from GitHub
 pip install git+https://github.com/gitronald/WebSearcher@dev
 ```

@@ -229,45 +221,68 @@ Happy to have help! If you see a component that we aren't covering yet, please a
 3. Add new parser to imports and catalogue in `/component_parsers/__init__.py`

 ### Testing

 Run tests:
-```
-pytest
+```bash
+poetry run pytest tests/ -q
 ```

 Update snapshots:
-```
-pytest --snapshot-update
+```bash
+poetry run pytest tests/ --snapshot-update
 ```

-Running pytest with the `-vv` flag will show a diff of the snapshots that have changed:
-```
-pytest -vv
+Show snapshot diffs with `-vv`:
+```bash
+poetry run pytest tests/ -vv
 ```

-With the `-k` flag you can run a test for a specific html file:
+Run a specific snapshot test by serp_id prefix:
+```bash
+poetry run pytest tests/ -k "45b6e019bfa2"
 ```
-pytest -k "1684837514.html"
+
+### Test Fixtures
+
+Tests load from compressed fixtures in `tests/fixtures/`. To update fixtures after collecting new demo data:
+
+```bash
+poetry run python scripts/condense_fixtures.py 0.6.7
+poetry run pytest tests/ --snapshot-update
 ```

 ---
 ## GitHub Actions

-This repository uses GitHub Actions for automated publishing:
+**Test Workflow** (`.github/workflows/test.yml`)
+Runs the test suite on every push to `dev`.

 **Release Workflow** (`.github/workflows/publish.yml`)
-Automatically publishes to PyPI when a pull request is merged into `master`. The workflow:
-- Triggers on merged PRs to `master`
+Publishes to PyPI when a pull request is merged into `master`:
 - Builds the package using Poetry
-- Publishes to PyPI using trusted publishing (no API tokens required)
+- Publishes using trusted publishing (no API tokens required)

 To release a new version:
-1. Update the version in `pyproject.toml`
-2. Create a PR to `master`
-3. Once merged, the package is automatically published to PyPI
+1. Merge `dev` into `master` via PR
+2. Once merged, the package is automatically published to PyPI

 ---
 ## Update Log

+`0.6.7`
+- Add `get_text_by_selectors()` utility, CI test workflow, compressed test fixtures
+- Add `perspectives`, `recent_posts`, `latest_from` classifiers and `sub_type` for perspectives
+- Update dependency bounds for security patches, GitHub Actions to v6
+
+`0.6.6`
+- Update packages with dependabot alerts (brotli, urllib3)
+
+`0.6.5`
+- Add GitHub Actions section to README
+
+`0.6.0`
+- Method for collecting data with selenium; requests no longer works without a redirect
+- Pull request [#72](https://github.com/gitronald/WebSearcher/pull/72)
+
 `0.5.2`
 - Added support for Spanish component headers by text
@@ -376,7 +391,7 @@ Many of the packages I've found for collecting web search data via python are no
 ---
 ## License

-Copyright (C) 2017-2024 Ronald E. Robertson <rer@acm.org>
+Copyright (C) 2017-2026 Ronald E. Robertson <rer@acm.org>

 This program is free software: you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
2 changes: 1 addition & 1 deletion WebSearcher/__init__.py
@@ -1,4 +1,4 @@
-__version__ = "0.6.6"
+__version__ = "0.6.7"
 from .searchers import SearchEngine
 from .parsers import parse_serp, FeatureExtractor
 from .extractors import Extractor
12 changes: 6 additions & 6 deletions WebSearcher/classifiers/header_text.py
@@ -28,8 +28,7 @@ def _classify_header(cmpt: bs4.element.Tag, level: int) -> str:
     header_list = []
     header_list.extend(cmpt.find_all(f"h{level}", {"role":"heading"}))
     header_list.extend(cmpt.find_all(f"h{level}", {"class":["O3JH7", "q8U8x", "mfMhoc"]}))
-    header_list.extend(cmpt.find_all("div", {"aria-level":f"{level}", "role":"heading"}))
-    header_list.extend(cmpt.find_all("div", {"aria-level":f"{level}", "class":"XmmGVd"}))
+    header_list.extend(cmpt.find_all(attrs={"aria-level": f"{level}", "role": "heading"}))

     # Check header text for known title matches
     for header in filter(None, header_list):
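
Review note: the switch from `find_all("div", {...})` to `find_all(attrs={...})` drops the tag-name filter, so any element carrying `role="heading"` and the matching `aria-level` is collected. The sketch below shows the difference using only the standard library; `xml.etree.ElementTree` paths stand in for BeautifulSoup here, and the sample markup and the helper name `heading_texts` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical component markup: one heading is a <div>, another a <span>.
snippet = """
<div>
  <div role="heading" aria-level="2">Top stories</div>
  <span role="heading" aria-level="2">Perspectives</span>
  <div role="heading" aria-level="3">Ignored level</div>
</div>
""".strip()
root = ET.fromstring(snippet)


def heading_texts(root, level, tag="*"):
    """Return the text of headings at `level`; tag="*" matches any element."""
    path = f".//{tag}[@role='heading'][@aria-level='{level}']"
    return ["".join(el.itertext()).strip() for el in root.findall(path)]


# Tag-restricted search (old behaviour): only <div> headings are found.
div_only = heading_texts(root, 2, tag="div")

# Tag-agnostic search (new behaviour): the <span> heading is found as well.
any_tag = heading_texts(root, 2)
```

One consequence worth checking in the snapshots: the removed `class: XmmGVd` lookup did not require `role="heading"`, so elements that matched only by that class are no longer collected.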
Expand Down Expand Up @@ -83,7 +82,6 @@ def _get_header_level_mapping(level) -> dict:
"Artworks", "Obras de arte",
"Songs", "Canciones",
"Albums", "Álbumes",
"What people are saying",
"About", "Información",
"Profiles", "Perfiles"],
"local_news": ["Local news", "Noticias Locales"],
Expand All @@ -101,8 +99,9 @@ def _get_header_level_mapping(level) -> dict:
"Hotel"],
"omitted_notice": ["Notices about Filtered Results"],
"people_also_ask": ["People also ask", "Más preguntas"],
"perspectives": ["Perspectives & opinions",
"Perspectives"],
"perspectives": ["Perspectives & opinions",
"Perspectives",
"What people are saying"],
"searches_related": ["Additional searches",
"More searches", "Ver más",
"Other searches",
Expand All @@ -117,7 +116,8 @@ def _get_header_level_mapping(level) -> dict:
"News",
"Noticias",
"Market news"],
"recent_posts": ["Recent posts"],
"recent_posts": ["Recent posts",
"Latest posts from"],
"twitter": ["Twitter Results"],
"videos": ["Videos"]
}
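
Review note: a small sketch of how a header-text mapping like the one above can drive classification. The matching rule is an assumption: the diff does not show how titles are compared, and prefix matching is used here only because an entry such as "Latest posts from" is presumably followed by an account name. The names `HEADER_TEXT_TO_TYPE` and `classify_header_text` are hypothetical, and only three entries are reproduced.

```python
# Subset of the mapping from the diff; the dict name is hypothetical.
HEADER_TEXT_TO_TYPE = {
    "perspectives": ["Perspectives & opinions", "Perspectives", "What people are saying"],
    "recent_posts": ["Recent posts", "Latest posts from"],
    "videos": ["Videos"],
}


def classify_header_text(text, mapping=HEADER_TEXT_TO_TYPE):
    """Return the component type whose known titles prefix-match `text`.

    Assumption: prefix matching. The real classifier may compare differently.
    """
    text = text.strip()
    for cmpt_type, titles in mapping.items():
        if any(text.startswith(title) for title in titles):
            return cmpt_type
    return "unknown"


examples = {
    header: classify_header_text(header)
    for header in ["Latest posts from NASA", "What people are saying", "Weather"]
}
```

Under this rule, moving "What people are saying" into the `perspectives` list is enough to reclassify those components, with no parser change needed.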
8 changes: 4 additions & 4 deletions WebSearcher/classifiers/main.py
@@ -43,10 +43,10 @@ def classify(cmpt: bs4.element.Tag) -> str:

     @staticmethod
     def discussions_and_forums(cmpt: bs4.element.Tag) -> str:
-        conditions = [
-            cmpt.find("div", {"class": "IFnjPb", "role": "heading"}),
-        ]
-        return 'discussions_and_forums' if all(conditions) else "unknown"
+        heading = cmpt.find("div", {"class": "IFnjPb", "role": "heading"})
+        if heading and heading.get_text(strip=True).startswith("Discussions and forums"):
+            return 'discussions_and_forums'
+        return "unknown"

     @staticmethod
     def available_on(cmpt: bs4.element.Tag) -> str:
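
Review note: the old check classified any component containing a `div.IFnjPb` heading as `discussions_and_forums`; the new one also requires the heading text to start with "Discussions and forums". A stdlib-only sketch of the tightened logic follows, using `xml.etree.ElementTree` as a stand-in for BeautifulSoup. The function name `classify_discussions` and the sample markup (including a second component that reuses the same heading class) are hypothetical.

```python
import xml.etree.ElementTree as ET


def classify_discussions(cmpt):
    """Match only when the heading exists AND its text has the expected prefix."""
    heading = cmpt.find(".//div[@class='IFnjPb'][@role='heading']")
    if heading is not None:
        text = "".join(heading.itertext()).strip()
        if text.startswith("Discussions and forums"):
            return "discussions_and_forums"
    return "unknown"


# Hypothetical components sharing the same heading class.
match = ET.fromstring(
    '<div><div class="IFnjPb" role="heading">Discussions and forums</div></div>'
)
other = ET.fromstring(
    '<div><div class="IFnjPb" role="heading">Some other header</div></div>'
)

match_label = classify_discussions(match)
other_label = classify_discussions(other)
```

With the class-only check, both components would have received the same label; the text test is what separates them.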