ETF Modeling

A Python-based tool for scraping and analyzing ETF holdings data from Vanguard.

Features

Complete Holdings Scraper: Automatically scrapes all holdings from paginated Vanguard ETF pages
Pagination Handling: Steps through every page of the holdings table (often hundreds of pages) to collect the complete dataset
Flexible Options: Support for testing with limited pages or single-page scraping
CSV Export: Saves all holdings data (ticker symbols and fund percentages) to CSV format

Setup

Create a virtual environment

python3 -m venv .venv

Activate the virtual environment

source .venv/bin/activate

Install dependencies

pip3 install -r requirements.txt

Usage

Scrape Complete ETF Holdings

By default, the scraper will collect all holdings from all pages:

# Scrape all holdings from VT ETF
python src/main.py

# Scrape from a different ETF
python src/main.py --url "https://investor.vanguard.com/investment-products/etfs/profile/vti"

# Specify custom output file
python src/main.py --out my_etf_holdings.csv

Testing and Development Options

# Test with limited pages
python src/main.py --max-pages 5

# Scrape only the first page (original single-page behavior)
python src/main.py --single-page

# Show browser window while scraping
python src/main.py --headful

Command Line Options

--url: Vanguard ETF profile URL (default: VT ETF)
--out: Output CSV file path (default: vt_holdings.csv)
--headful: Show browser window instead of headless mode
--single-page: Only scrape the first page (for testing)
--max-pages N: Limit scraping to first N pages (for testing)
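
The options above map naturally onto an argparse parser. The following is a minimal sketch, not the actual code in src/main.py: the names build_parser and page_limit are hypothetical, the default URL is an assumption inferred from the VTI URL pattern shown earlier, and the rule that --single-page takes precedence over --max-pages is also an assumption.

```python
import argparse
from typing import Optional

# Assumption: the VT profile URL follows the same pattern as the VTI example.
DEFAULT_URL = "https://investor.vanguard.com/investment-products/etfs/profile/vt"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command line options."""
    parser = argparse.ArgumentParser(description="Scrape Vanguard ETF holdings to CSV")
    parser.add_argument("--url", default=DEFAULT_URL, help="Vanguard ETF profile URL")
    parser.add_argument("--out", default="vt_holdings.csv", help="Output CSV file path")
    parser.add_argument("--headful", action="store_true", help="Show the browser window")
    parser.add_argument("--single-page", action="store_true", help="Only scrape the first page")
    parser.add_argument("--max-pages", type=int, default=None, metavar="N",
                        help="Limit scraping to the first N pages")
    return parser


def page_limit(args: argparse.Namespace) -> Optional[int]:
    """Return the page cap implied by the flags, or None for 'all pages'."""
    if args.single_page:  # assumed to take precedence over --max-pages
        return 1
    return args.max_pages


if __name__ == "__main__":
    print(page_limit(build_parser().parse_args()))
```

Returning None for "no limit" keeps the default behavior (scrape everything) distinct from any explicit cap.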

Output Format

The scraper generates a CSV file with the following columns:

Ticker: Stock ticker symbol
% of fund: Percentage weight in the ETF, stored as text with a trailing percent sign (for example, 4.11 %)

Example output:

Ticker,% of fund
NVDA,4.11 %
MSFT,3.78 %
AAPL,3.45 %
...
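
Because the weights are stored as text such as "4.11 %", downstream analysis usually needs them converted to numbers first. The sketch below shows one way to do that with the standard library; parse_percent and load_holdings are hypothetical helper names, not part of this project, and the sample rows are copied from the example output above.

```python
import csv
import io
from typing import Dict, TextIO


def parse_percent(text: str) -> float:
    """Convert a string like '4.11 %' to the float 4.11."""
    return float(text.replace("%", "").strip())


def load_holdings(csv_file: TextIO) -> Dict[str, float]:
    """Read a holdings CSV and map each ticker to its numeric weight."""
    reader = csv.DictReader(csv_file)
    return {row["Ticker"]: parse_percent(row["% of fund"]) for row in reader}


# In-memory sample using the rows from the example output.
sample = io.StringIO("Ticker,% of fund\nNVDA,4.11 %\nMSFT,3.78 %\nAAPL,3.45 %\n")
weights = load_holdings(sample)
print(weights)
```

To read a real export, pass an open file instead, for example open("vt_holdings.csv", newline="").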

Technical Details

Uses Selenium WebDriver with Chrome to drive the browser
Automatically handles cookie banners and navigation
Detects pagination using dropdown selectors
Includes retry logic and error handling so that transient page failures do not abort a run
Processes approximately 10 holdings per page with 1-second delays between pages
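
The pagination, delay, and retry behavior described above can be sketched independently of Selenium. This is an illustration under stated assumptions, not the project's implementation: fetch_page stands in for whatever browser-driven function reads one page of holdings, and the names with_retries and scrape_all_pages, the retry count of 3, and the injectable sleep parameter are all hypothetical.

```python
import time
from typing import Callable, List, Optional, Tuple

Row = Tuple[str, str]  # (ticker, "% of fund" text)


def with_retries(action: Callable[[], List[Row]], attempts: int = 3,
                 wait_seconds: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> List[Row]:
    """Call action, retrying on any exception up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            sleep(wait_seconds)
    return []  # unreachable; keeps type checkers satisfied


def scrape_all_pages(fetch_page: Callable[[int], List[Row]], total_pages: int,
                     max_pages: Optional[int] = None, delay_seconds: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> List[Row]:
    """Collect rows from pages 1..limit, pausing between pages."""
    limit = total_pages if max_pages is None else min(total_pages, max_pages)
    rows: List[Row] = []
    for page in range(1, limit + 1):
        rows.extend(with_retries(lambda: fetch_page(page), sleep=sleep))
        if page < limit:
            sleep(delay_seconds)  # be polite: 1-second pause between pages
    return rows
```

Passing sleep as a parameter lets tests replace the real delay with a no-op, so the loop can be exercised without waiting.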
