A Python-based tool for scraping and analyzing ETF holdings data from Vanguard.
- Complete Holdings Scraper: Automatically scrapes all holdings from paginated Vanguard ETF pages
- Pagination Handling: Intelligently navigates through hundreds of pages to collect complete datasets
- Flexible Options: Support for testing with limited pages or single-page scraping
- CSV Export: Saves all holdings data (ticker symbols and fund percentages) to CSV format
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt

By default, the scraper will collect all holdings from all pages:
# Scrape all holdings from VT ETF
python src/main.py
# Scrape from a different ETF
python src/main.py --url "https://investor.vanguard.com/investment-products/etfs/profile/vti"
# Specify custom output file
python src/main.py --out my_etf_holdings.csv
# Test with limited pages
python src/main.py --max-pages 5
# Original single-page behavior
python src/main.py --single-page
# Show browser window while scraping
python src/main.py --headful

- --url: Vanguard ETF profile URL (default: VT ETF)
- --out: Output CSV file path (default: vt_holdings.csv)
- --headful: Show browser window instead of headless mode
- --single-page: Only scrape the first page (for testing)
- --max-pages N: Limit scraping to first N pages (for testing)
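
As a rough illustration of how these options could be declared, here is a minimal `argparse` sketch. It is not taken from `src/main.py`; the `build_parser` name and the `DEFAULT_URL` value are assumptions (the VT URL is guessed from the pattern of the VTI URL shown earlier), and the real script may define its options differently.

```python
import argparse

# Assumption: the VT profile URL follows the same pattern as the VTI URL above.
DEFAULT_URL = "https://investor.vanguard.com/investment-products/etfs/profile/vt"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(description="Scrape Vanguard ETF holdings to CSV.")
    parser.add_argument("--url", default=DEFAULT_URL, help="Vanguard ETF profile URL")
    parser.add_argument("--out", default="vt_holdings.csv", help="Output CSV file path")
    parser.add_argument("--headful", action="store_true",
                        help="Show browser window instead of headless mode")
    parser.add_argument("--single-page", action="store_true",
                        help="Only scrape the first page")
    parser.add_argument("--max-pages", type=int, default=None, metavar="N",
                        help="Limit scraping to the first N pages")
    return parser
```

With this layout, `build_parser().parse_args(["--max-pages", "5"])` yields `max_pages == 5` and leaves the other options at their defaults.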
The scraper generates a CSV file with the following columns:
- Ticker: Stock ticker symbol
- % of fund: Percentage weight in the ETF
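
The percentage column is stored as text with a trailing percent sign (for example `4.11 %`), so later analysis usually needs it converted to a number. The sketch below shows one way to read the file back in; `load_holdings` is a hypothetical helper, not part of this tool, and it assumes every value has the `number %` form.

```python
import csv
import io


def load_holdings(csv_file):
    """Return a list of (ticker, weight) pairs, with weight as a float percentage."""
    holdings = []
    for row in csv.DictReader(csv_file):
        # "4.11 %" -> 4.11: drop the percent sign and surrounding whitespace.
        weight = float(row["% of fund"].replace("%", "").strip())
        holdings.append((row["Ticker"], weight))
    return holdings


# In-memory sample using the documented column names. For a real file, pass
# open("vt_holdings.csv", newline="") instead.
sample = io.StringIO("Ticker,% of fund\nNVDA,4.11 %\nMSFT,3.78 %\n")
holdings = load_holdings(sample)
```
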
Example output:
Ticker,% of fund
NVDA,4.11 %
MSFT,3.78 %
AAPL,3.45 %
...

- Uses Selenium WebDriver with Chrome for robust web scraping
- Automatically handles cookie banners and navigation
- Implements smart pagination detection using dropdown selectors
- Includes retry logic and error handling for reliable data collection
- Processes approximately 10 holdings per page with 1-second delays between pages
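
The page loop with retries and delays described above can be sketched roughly as follows. This is a simplified illustration, not the tool's actual code: `scrape_all_pages` and the `fetch_page` callable are hypothetical, the browser interaction is left out entirely, and the loop stops when a page comes back empty rather than using the dropdown-based detection mentioned above.

```python
import time
from typing import Callable, List, Optional, Tuple

Row = Tuple[str, str]  # (ticker, "% of fund" text)


def scrape_all_pages(
    fetch_page: Callable[[int], List[Row]],
    max_pages: Optional[int] = None,
    delay: float = 1.0,
    retries: int = 3,
) -> List[Row]:
    """Collect rows page by page until a page is empty or max_pages is reached.

    fetch_page is a hypothetical callable that returns the rows for a
    1-based page number. retries must be at least 1.
    """
    rows: List[Row] = []
    page = 1
    while max_pages is None or page <= max_pages:
        for attempt in range(1, retries + 1):
            try:
                page_rows = fetch_page(page)
                break
            except Exception:
                # Re-raise once the final attempt has failed.
                if attempt == retries:
                    raise
                time.sleep(delay)
        if not page_rows:
            break  # An empty page is treated as the end of the holdings list.
        rows.extend(page_rows)
        page += 1
        time.sleep(delay)  # Pause between pages (1 second by default).
    return rows
```

Passing the page fetcher in as a callable keeps the retry and delay logic separate from Selenium, so it can be exercised with a stub function and `delay=0`.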