LangeLab/pxseek

pxseek

Query, filter, and retrieve proteomics dataset metadata from ProteomeXchange.

Overview

pxseek replaces the original Selenium-based web scraper with a clean, API-driven approach using the ProteomeCentral bulk TSV and per-dataset XML endpoints. No browser or ChromeDriver required.

Commands

| Command | Status | Description |
| --- | --- | --- |
| pxseek fetch | Available | Download the full dataset listing from ProteomeCentral |
| pxseek filter | Available | Filter datasets by species, repository, keywords, dates, etc. |
| pxseek lookup | Available | Fetch detailed metadata for specific PXD identifiers |

Installation

Requires Python 3.12+ and uv for package management.

git clone https://github.com/LangeLab/pxseek.git
cd pxseek
uv sync

Usage

Fetch all datasets

# Download full ProteomeXchange listing (~50k datasets)
uv run pxseek fetch

# Custom output path
uv run pxseek fetch -o my_datasets.tsv

# Force re-download (bypass cache)
uv run pxseek fetch --refresh

# Verbose output
uv run pxseek fetch -v

The output TSV has the following columns:

| Column | Description |
| --- | --- |
| dataset_id | ProteomeXchange identifier (e.g. PXD063194) |
| title | Dataset title |
| repository | Hosting repository (PRIDE, MassIVE, jPOST, iProX, etc.) |
| species | Species name(s) |
| instrument | Instrument type(s) |
| publication | Associated publication(s) |
| lab_head | Lab head / PI |
| announce_date | Date the dataset was announced |
| keywords | Dataset keywords |
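
As a quick way to inspect a fetched listing, the sketch below counts datasets per hosting repository using only the Python standard library. It relies on the `repository` column documented above; the helper name `count_by_repository` and the sample rows are made up for illustration and are not part of pxseek.

```python
import csv
from collections import Counter
from typing import Iterable


def count_by_repository(lines: Iterable[str]) -> Counter:
    """Count rows per value of the `repository` column in a fetch-style TSV."""
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(row["repository"] for row in reader)


# Tiny in-memory sample using the documented header (values are invented).
header = "\t".join(
    [
        "dataset_id", "title", "repository", "species", "instrument",
        "publication", "lab_head", "announce_date", "keywords",
    ]
)
sample = [
    header + "\n",
    "PXD000001\tExample A\tPRIDE\tHomo sapiens\tOrbitrap\t\t\t2024-01-15\tcancer\n",
    "PXD000002\tExample B\tMassIVE\tMus musculus\ttimsTOF\t\t\t2024-02-01\tbrain\n",
    "PXD000003\tExample C\tPRIDE\tHomo sapiens\tOrbitrap\t\t\t2024-03-10\tplasma\n",
]
counts = count_by_repository(sample)
for repository, n in counts.most_common():
    print(repository, n)
```

For a real file, pass an open file handle instead of the sample list, for example `open("my_datasets.tsv", newline="", encoding="utf-8")`.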

Caching

Fetched data is cached locally in .pxseek_cache/ (in the current directory) for 24 hours. Subsequent runs use the cache for instant results. Use --refresh to force a fresh download, or --cache-dir to specify an alternative cache location.
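
The 24-hour rule can be pictured as a check on the cached file's modification time. The following is a minimal sketch under that assumption; `is_stale` is a hypothetical helper, and the actual logic in cache.py may differ (for example, it could store a timestamp separately).

```python
import os
import tempfile
import time
from pathlib import Path
from typing import Optional

MAX_AGE_SECONDS = 24 * 60 * 60  # the 24-hour window described above


def is_stale(path: Path, max_age: float = MAX_AGE_SECONDS,
             now: Optional[float] = None) -> bool:
    """Return True if the file is missing or older than max_age seconds."""
    if not path.exists():
        return True
    current = time.time() if now is None else now
    return (current - path.stat().st_mtime) > max_age


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp) / "listing.tsv"  # hypothetical cache file name
    cached.write_text("dataset_id\n")
    fresh_result = is_stale(cached)  # file was just written

    # Backdate the file by 25 hours to simulate an expired cache entry.
    old = time.time() - 25 * 60 * 60
    os.utime(cached, (old, old))
    stale_result = is_stale(cached)

    missing_result = is_stale(Path(tmp) / "missing.tsv")

print(fresh_result, stale_result, missing_result)
```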

Filter datasets

# Filter by species (regex)
uv run pxseek filter -s "Homo sapiens"

# Filter by repository
uv run pxseek filter -r "PRIDE,MassIVE"

# Filter by keywords (searched in title and keywords columns)
uv run pxseek filter -k "cancer,proteomics"

# Filter by date range
uv run pxseek filter --after 2024-01-01 --before 2024-12-31

# Filter by instrument (regex)
uv run pxseek filter --instrument "Orbitrap|timsTOF"

# Combine multiple filters
uv run pxseek filter -s "Homo sapiens" -r PRIDE -k "cancer" --after 2024-01-01

# Use a keyword file (one keyword per line)
uv run pxseek filter -k keywords.txt

# Filter from a previously fetched file
uv run pxseek filter -i px_datasets.tsv -s "Mus musculus" -o mouse_datasets.tsv

# Search specific columns for keywords
uv run pxseek filter -k "brain" --keyword-columns "title"

# Deep search: also search within dataset descriptions/abstracts (fetches XML)
uv run pxseek filter -k "phosphoproteomics" --deep

# Deep search with species pre-filter to minimise XML requests
uv run pxseek filter -s "Homo sapiens" -k "ubiquitylation" --deep

# Deep search with confirmation prompt skipped
uv run pxseek filter -k "glycoproteomics" --deep --yes

When no --input is given, filter uses the cached listing if one is available and otherwise downloads fresh data from ProteomeCentral.
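
To make the filter semantics concrete, here is a rough standard-library sketch of how species, keyword, and date filters could combine on rows of the listing. It is not the code in filter.py. Several behaviours are assumptions: matching is case-insensitive, a row is kept if any keyword matches, `announce_date` is in ISO `YYYY-MM-DD` form, and rows with unparseable dates are dropped when a date filter is active. The function name `filter_rows` and the sample rows are hypothetical.

```python
import re
from datetime import date
from typing import Dict, Iterable, List, Optional, Sequence

Row = Dict[str, str]


def filter_rows(
    rows: Iterable[Row],
    species: Optional[str] = None,
    keywords: Sequence[str] = (),
    keyword_columns: Sequence[str] = ("title", "keywords"),
    after: Optional[date] = None,
    before: Optional[date] = None,
) -> List[Row]:
    """Keep rows that pass every filter that was supplied."""
    species_re = re.compile(species, re.IGNORECASE) if species else None
    lowered = [k.lower() for k in keywords]
    kept: List[Row] = []
    for row in rows:
        # Species is treated as a regular expression, as in `-s`.
        if species_re and not species_re.search(row.get("species", "")):
            continue
        # Assumption: any single keyword hit in the chosen columns is enough.
        if lowered:
            haystack = " ".join(row.get(c, "") for c in keyword_columns).lower()
            if not any(k in haystack for k in lowered):
                continue
        if after or before:
            try:
                announced = date.fromisoformat(row.get("announce_date", ""))
            except ValueError:
                continue  # assumption: drop rows whose date cannot be parsed
            if after and announced < after:
                continue
            if before and announced > before:
                continue
        kept.append(row)
    return kept


rows = [
    {"dataset_id": "PXD000001", "title": "Breast cancer phosphoproteome",
     "species": "Homo sapiens", "announce_date": "2024-03-01",
     "keywords": "cancer, phospho"},
    {"dataset_id": "PXD000002", "title": "Mouse brain atlas",
     "species": "Mus musculus", "announce_date": "2024-05-10",
     "keywords": "brain"},
    {"dataset_id": "PXD000003", "title": "Plasma proteomics",
     "species": "Homo sapiens", "announce_date": "2023-11-20",
     "keywords": "cancer"},
]
hits = filter_rows(rows, species="Homo sapiens", keywords=["cancer"],
                   after=date(2024, 1, 1))
title_hits = filter_rows(rows, keywords=["brain"], keyword_columns=("title",))
print([r["dataset_id"] for r in hits])
print([r["dataset_id"] for r in title_hits])
```

Pre-filtering on cheap columns first, as in the `--deep` example with `-s`, keeps the number of rows that need a per-dataset XML request small.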

Look up dataset details

# Look up a single ID
uv run pxseek lookup --ids PXD000001

# Multiple IDs (comma-separated)
uv run pxseek lookup --ids PXD000001,PXD000002,PXD000003

# Read IDs from a file (one per line)
uv run pxseek lookup --ids-file my_ids.txt

# Pipeline: feed filter output directly into lookup
uv run pxseek filter -s "Homo sapiens" -o filtered.tsv
uv run pxseek lookup --input filtered.tsv -o detailed.tsv

# Skip confirmation prompt (useful in scripts)
uv run pxseek lookup --ids PXD000001 --yes

# Custom request delay (default: 1.0 s)
uv run pxseek lookup --ids PXD000001 --delay 2.0

# Custom cache directory
uv run pxseek lookup --ids PXD000001 --cache-dir /data/cache

lookup outputs a TSV with one row per dataset containing 19 fields: dataset_id, title, description, species, instruments, modifications, keywords, review_level, announce_date, repository, submitter_name, submitter_email, submitter_affiliation, lab_head_name, lab_head_email, lab_head_affiliation, pubmed_ids, dois, and ftp_location.
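
When lookup output feeds a downstream script, it can help to confirm that the header contains the fields listed above before processing. The sketch below does this with the standard library; `missing_lookup_fields` is a hypothetical helper, and the column order in the real file is not assumed.

```python
import csv
import io
from typing import List

# The 19 field names listed in the paragraph above.
LOOKUP_FIELDS = [
    "dataset_id", "title", "description", "species", "instruments",
    "modifications", "keywords", "review_level", "announce_date",
    "repository", "submitter_name", "submitter_email",
    "submitter_affiliation", "lab_head_name", "lab_head_email",
    "lab_head_affiliation", "pubmed_ids", "dois", "ftp_location",
]


def missing_lookup_fields(tsv_text: str) -> List[str]:
    """Return the documented field names that are absent from the TSV header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    return [name for name in LOOKUP_FIELDS if name not in header]


complete = "\t".join(LOOKUP_FIELDS) + "\n"
partial = "dataset_id\ttitle\n"

print(missing_lookup_fields(complete))
print(len(missing_lookup_fields(partial)))
```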

XML files are cached on disk so repeated lookups do not re-download data. Remove .pxseek_cache/PXD*.xml to force a fresh fetch.
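
Clearing the cached XML can also be scripted. This is a small sketch that assumes the `PXD*.xml` naming mentioned above; `clear_cached_xml` is a hypothetical helper, and it defaults to a dry run so nothing is deleted by accident.

```python
import tempfile
from pathlib import Path
from typing import List


def clear_cached_xml(cache_dir: Path, dry_run: bool = True) -> List[Path]:
    """List cached PXD*.xml files and delete them unless dry_run is set."""
    targets = sorted(cache_dir.glob("PXD*.xml"))
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets


# Demonstration in a throwaway directory rather than a real cache.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    (cache / "PXD000001.xml").write_text("<xml/>")
    (cache / "PXD000002.xml").write_text("<xml/>")
    (cache / "listing.tsv").write_text("dataset_id\n")  # hypothetical file name

    preview = [p.name for p in clear_cached_xml(cache)]  # dry run
    count_after_preview = len(list(cache.iterdir()))

    removed = [p.name for p in clear_cached_xml(cache, dry_run=False)]
    remaining = sorted(p.name for p in cache.iterdir())

print(preview, removed, remaining)
```

For a real cache, call it with `Path(".pxseek_cache")` or whatever directory was passed to `--cache-dir`.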

Development

# Install with dev dependencies
uv sync --extra dev

# Run tests (228 tests)
uv run pytest

# Run tests with coverage
uv run pytest --cov=pxseek --cov-report=term-missing

# Lint
uv run ruff check src/ tests/

# Format check
uv run ruff format --check src/ tests/

Project structure

src/pxseek/
├── __init__.py      # Package version
├── cli.py           # Click CLI entry point
├── api.py           # ProteomeCentral API client (polite User-Agent, rate-limited)
├── parse.py         # TSV + XML parsing (HTML stripping, column mapping)
├── cache.py         # Local caching with staleness check
├── models.py        # Column names, constants, configuration
└── filter.py        # DataFrame filtering logic (Phase 2)

Legacy

The original single-file Selenium scraper is preserved in legacy/proteomeXchange_scraper.py for reference.

License

MIT License. See LICENSE for details.
