pxseek

Query, filter, and retrieve proteomics dataset metadata from ProteomeXchange.

Overview

pxseek replaces the original Selenium-based web scraper with an API-driven approach built on the ProteomeCentral bulk TSV and per-dataset XML endpoints. No browser or ChromeDriver is required.

Commands

| Command | Status | Description |
| --- | --- | --- |
| `pxseek fetch` | Available | Download the full dataset listing from ProteomeCentral |
| `pxseek filter` | Available | Filter datasets by species, repository, keywords, dates, etc. |
| `pxseek lookup` | Available | Fetch detailed metadata for specific PXD identifiers |

Installation

Requires Python 3.12+ and uv for package management.

git clone https://github.com/LangeLab/pxseek.git
cd pxseek
uv sync

Usage

Fetch all datasets

# Download full ProteomeXchange listing (~50k datasets)
uv run pxseek fetch

# Custom output path
uv run pxseek fetch -o my_datasets.tsv

# Force re-download (bypass cache)
uv run pxseek fetch --refresh

# Verbose output
uv run pxseek fetch -v

The output TSV has the following columns:

| Column | Description |
| --- | --- |
| `dataset_id` | ProteomeXchange identifier (e.g. PXD063194) |
| `title` | Dataset title |
| `repository` | Hosting repository (PRIDE, MassIVE, jPOST, iProX, etc.) |
| `species` | Species name(s) |
| `instrument` | Instrument type(s) |
| `publication` | Associated publication(s) |
| `lab_head` | Lab head / PI |
| `announce_date` | Date the dataset was announced |
| `keywords` | Dataset keywords |
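
As a rough illustration of working with this layout from Python, the sketch below reads a TSV with the documented column names using only the standard library and counts datasets per repository. The two sample rows are made up for the example (they are not real ProteomeXchange records), and the presence of a header row with exactly these names is an assumption based on the column list above.

```python
import csv
import io
from collections import Counter

# Hypothetical two-row sample in the documented column layout (not real data).
SAMPLE_TSV = (
    "dataset_id\ttitle\trepository\tspecies\tinstrument\t"
    "publication\tlab_head\tannounce_date\tkeywords\n"
    "PXD000001\tExample dataset A\tPRIDE\tHomo sapiens\tOrbitrap\t\t\t2024-01-15\texample\n"
    "PXD000002\tExample dataset B\tMassIVE\tMus musculus\ttimsTOF\t\t\t2024-03-02\texample\n"
)


def count_by_repository(handle):
    """Count rows per value of the `repository` column in a pxseek-style TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["repository"] for row in reader)


counts = count_by_repository(io.StringIO(SAMPLE_TSV))
print(dict(counts))
```

To run the same function on a real download, pass an open file instead of the in-memory sample, for example `open("my_datasets.tsv", newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption.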

Caching

Fetched data is cached locally in .pxseek_cache/ (in the current directory) for 24 hours. Runs within that window read from the cache instead of downloading again. Use --refresh to force a fresh download, or --cache-dir to specify an alternative cache location.
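
The 24-hour rule can be pictured as a simple file-age check. The sketch below is one possible way to express it and is not taken from pxseek's `cache.py`; the function name `is_stale`, its parameters, and the use of the file modification time are all assumptions made for illustration.

```python
import os
import tempfile
import time
from pathlib import Path


def is_stale(path, max_age_hours=24.0, now=None):
    """Return True if `path` is missing or older than `max_age_hours`."""
    path = Path(path)
    if not path.exists():
        return True
    now = time.time() if now is None else now
    age_seconds = now - path.stat().st_mtime
    return age_seconds > max_age_hours * 3600


# Small demonstration using a temporary file (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "listing.tsv"
    missing_is_stale = is_stale(cached)   # no file yet
    cached.write_text("dataset_id\n")
    fresh_is_stale = is_stale(cached)     # just written
    two_days_ago = time.time() - 48 * 3600
    os.utime(cached, (two_days_ago, two_days_ago))
    old_is_stale = is_stale(cached)       # mtime pushed back 48 h

print(missing_is_stale, fresh_is_stale, old_is_stale)
```

In this sketch, a `--refresh`-style override would simply skip the check and download regardless.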

Filter datasets

# Filter by species (regex)
uv run pxseek filter -s "Homo sapiens"

# Filter by repository
uv run pxseek filter -r "PRIDE,MassIVE"

# Filter by keywords (searched in title and keywords columns)
uv run pxseek filter -k "cancer,proteomics"

# Filter by date range
uv run pxseek filter --after 2024-01-01 --before 2024-12-31

# Filter by instrument (regex)
uv run pxseek filter --instrument "Orbitrap|timsTOF"

# Combine multiple filters
uv run pxseek filter -s "Homo sapiens" -r PRIDE -k "cancer" --after 2024-01-01

# Use a keyword file (one keyword per line)
uv run pxseek filter -k keywords.txt

# Filter from a previously fetched file
uv run pxseek filter -i px_datasets.tsv -s "Mus musculus" -o mouse_datasets.tsv

# Search specific columns for keywords
uv run pxseek filter -k "brain" --keyword-columns "title"

# Deep search — also search within dataset descriptions/abstracts (fetches XML)
uv run pxseek filter -k "phosphoproteomics" --deep

# Deep search with species pre-filter to minimise XML requests
uv run pxseek filter -s "Homo sapiens" -k "ubiquitylation" --deep

# Deep search with confirmation prompt skipped
uv run pxseek filter -k "glycoproteomics" --deep --yes

When no --input is given, filter automatically uses cached data or downloads fresh data from ProteomeCentral.
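
To make the combination of filters more concrete, here is a minimal sketch of row-level matching in plain Python. It is not pxseek's `filter.py`. It assumes that all supplied filters must pass (AND), that any one keyword is enough (OR), that matching is case-insensitive, and that date bounds are inclusive ISO dates; none of these details is confirmed by this README. The function name `matches` and the sample rows are hypothetical.

```python
import re
from datetime import date


def matches(row, species=None, repositories=None, keywords=None,
            keyword_columns=("title", "keywords"), after=None, before=None):
    """Return True if `row` (a dict of column -> str) passes every given filter."""
    # Species is treated as a regular expression (assumed case-insensitive).
    if species and not re.search(species, row["species"], re.IGNORECASE):
        return False
    if repositories and row["repository"] not in repositories:
        return False
    if keywords:
        # Search only the selected columns; any single keyword is enough.
        haystack = " ".join(row[c] for c in keyword_columns).lower()
        if not any(k.lower() in haystack for k in keywords):
            return False
    if after or before:
        announced = date.fromisoformat(row["announce_date"])
        if after and announced < after:
            return False
        if before and announced > before:
            return False
    return True


# Hypothetical rows for illustration only.
rows = [
    {"title": "Cancer proteome study", "repository": "PRIDE",
     "species": "Homo sapiens", "announce_date": "2024-05-01", "keywords": "tumour"},
    {"title": "Liver atlas", "repository": "MassIVE",
     "species": "Mus musculus", "announce_date": "2023-11-20", "keywords": "cancer"},
]

selected = [
    r for r in rows
    if matches(r, species="Homo sapiens", repositories={"PRIDE"},
               keywords=["cancer"], after=date(2024, 1, 1))
]
print([r["title"] for r in selected])
```

Restricting `keyword_columns` to `("title",)` mirrors the idea behind `--keyword-columns "title"`: the second sample row then no longer matches "cancer", because the word only appears in its keywords column.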

Look up datasets

# Look up one or more IDs by flag
uv run pxseek lookup --ids PXD000001

# Multiple IDs (comma-separated)
uv run pxseek lookup --ids PXD000001,PXD000002,PXD000003

# Read IDs from a file (one per line)
uv run pxseek lookup --ids-file my_ids.txt

# Pipeline: write filter output to a file, then pass it to lookup
uv run pxseek filter -s "Homo sapiens" -o filtered.tsv
uv run pxseek lookup --input filtered.tsv -o detailed.tsv

# Skip confirmation prompt (useful in scripts)
uv run pxseek lookup --ids PXD000001 --yes

# Custom request delay (default: 1.0 s)
uv run pxseek lookup --ids PXD000001 --delay 2.0

# Custom cache directory
uv run pxseek lookup --ids PXD000001 --cache-dir /data/cache

lookup outputs a TSV with one row per dataset containing 19 fields: dataset_id, title, description, species, instruments, modifications, keywords, review_level, announce_date, repository, submitter_name, submitter_email, submitter_affiliation, lab_head_name, lab_head_email, lab_head_affiliation, pubmed_ids, dois, and ftp_location.
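
On the input side, the comma-separated and one-per-line ID forms shown in the commands above could be normalised with something like the following sketch. It assumes identifiers consist of `PXD` followed by six digits, as in the examples; the upper-casing, de-duplication, and the function name `parse_ids` are illustrative choices, not documented pxseek behaviour.

```python
import re

# Assumption: identifiers are "PXD" followed by six digits, as in PXD000001.
PXD_PATTERN = re.compile(r"PXD\d{6}")


def parse_ids(text):
    """Split comma- or whitespace-separated text into unique, valid PXD IDs."""
    ids = []
    for token in re.split(r"[,\s]+", text.strip()):
        token = token.upper()
        if not token:
            continue  # empty input or stray separator
        if not PXD_PATTERN.fullmatch(token):
            raise ValueError(f"not a PXD identifier: {token!r}")
        if token not in ids:  # keep first occurrence, preserve order
            ids.append(token)
    return ids


print(parse_ids("PXD000001,PXD000002, pxd000001"))
# → ['PXD000001', 'PXD000002']
```

The same function handles the contents of an ID file, since newlines are treated as separators.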

XML files are cached on disk so repeated lookups do not re-download data. Remove .pxseek_cache/PXD*.xml to force a fresh fetch.
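
The interplay between the on-disk XML cache and the request delay can be sketched as a cache-first loop. This is an illustration under stated assumptions rather than pxseek's actual client: `fetch` is a hypothetical callable standing in for the network request, the `<id>.xml` file naming is inferred from the `PXD*.xml` pattern above, and pausing only between real network requests is a design choice of this sketch.

```python
import tempfile
import time
from pathlib import Path


def lookup_all(ids, cache_dir, fetch, delay=1.0, sleep=time.sleep):
    """Return {id: xml_text}, reading from cache_dir and fetching only misses."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    fetched_any = False
    for dataset_id in ids:
        cached = cache_dir / f"{dataset_id}.xml"
        if cached.exists():
            results[dataset_id] = cached.read_text(encoding="utf-8")
            continue
        if fetched_any:
            sleep(delay)  # pause between network requests only
        xml_text = fetch(dataset_id)
        cached.write_text(xml_text, encoding="utf-8")
        results[dataset_id] = xml_text
        fetched_any = True
    return results


# Demonstration with a fake fetcher and a recorded (not real) sleep.
calls, pauses = [], []


def fake_fetch(dataset_id):
    calls.append(dataset_id)
    return f"<dataset id='{dataset_id}'/>"


with tempfile.TemporaryDirectory() as tmp:
    lookup_all(["PXD000001", "PXD000002"], tmp, fake_fetch, sleep=pauses.append)
    # Second run is served entirely from the cache: no fetches, no pauses.
    lookup_all(["PXD000001", "PXD000002"], tmp, fake_fetch, sleep=pauses.append)

print(calls, pauses)
# → ['PXD000001', 'PXD000002'] [1.0]
```

Deleting a cached file in this sketch has the same effect the README describes: the next run treats that ID as a miss and fetches it again.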

Development

# Install with dev dependencies
uv sync --extra dev

# Run tests (228 tests)
uv run pytest

# Run tests with coverage
uv run pytest --cov=pxseek --cov-report=term-missing

# Lint
uv run ruff check src/ tests/

# Format check
uv run ruff format --check src/ tests/

Project structure

src/pxseek/
├── __init__.py      # Package version
├── cli.py           # Click CLI entry point
├── api.py           # ProteomeCentral API client (polite User-Agent, rate-limited)
├── parse.py         # TSV + XML parsing (HTML stripping, column mapping)
├── cache.py         # Local caching with staleness check
├── models.py        # Column names, constants, configuration
└── filter.py        # DataFrame filtering logic (Phase 2)

Legacy

The original single-file Selenium scraper is preserved in legacy/proteomeXchange_scraper.py for reference.

License

MIT License. See LICENSE for details.
