lib-prep-visualization

This project is a visualization tool for gene expression data across different library preparation (or other) methods using UMAP. Use this repository to download gene expression and clinical files, process them into a unified format, run UMAP for dimensionality reduction, and create cluster visualizations of the results. While this code was written to take advantage of RNA-Seq compendia generated by Treehouse, it can be generalized.

This project includes a Python package, lib_prep_tools, along with several scripts that perform the following tasks:

  1. Download expression and metadata files from specified URLs.
  2. Process and merge the downloaded files into a single AnnData object and run UMAP.
  3. Create visualizations from the processed data.

Development Environment Setup

To create a development environment using conda:

conda env create -f environment.yaml
conda activate lib-prep-visualization

If you do not already have conda installed, download it from https://www.anaconda.com/download

Project scripts

This repository includes three main scripts:

  • scripts/download_data.py — downloads gzipped expression and metadata files.
  • scripts/process_data.py — merges expression/metadata and runs UMAP.
  • scripts/plot_data.py — creates figures from the processed data.

Each script accepts a --config config_file.json argument. Example configs are provided in the config/ directory.

General usage:

python scripts/script_name.py --config config_file.json

Download data

The scripts/download_data.py script retrieves expression and metadata files from specified URLs and organizes them into a structured directory layout. These files are typically large, so the script is designed to handle interrupted downloads gracefully. The resulting directory layout is the structure that the downstream processing scripts expect as input.

Configuration format

The download script expects a JSON array of compendia objects. Example:

[
    {
        "compendia_id": "Tumor_Compendium_25.01_PolyA_01_2025",
        "expression_url": "https://example.org/path/to/expression.tsv.gz",
        "metadata_url": "https://example.org/path/to/metadata.tsv.gz"
    },
    {
        "compendia_id": "another_compendia",
        "expression_url": "https://example.org/path/to/expression2.tsv.gz",
        "metadata_url": "https://example.org/path/to/metadata2.tsv.gz"
    }
]
  • compendia_id: unique, directory-name-safe identifier used for the output folder.
  • expression_url: URL to the expression file (.tsv or .tsv.gz).
  • metadata_url: URL to the metadata file (.tsv or .tsv.gz).

Download script methods

When run, the script creates the following directory layout under data/:

data/
    <version>/
        <compendia_id1>/
        <compendia_id2>/
        ...
        download_manifest.json
        download.log

The script tracks all of its file downloads in the data/<version>/download_manifest.json file.

Here is an example of the download manifest structure:

{
    "Tumor_Compendium_25.01_PolyA_01_2025": {
        "expression": {
            "last_download": "0000-00-00T00:00:00.000000",
            "md5checksum": "",
            "file_size": 0,
            "status": "success",
            "software_version": "x.x.x"
        },
        "metadata": {
            "last_download": "0000-00-00T00:00:00.000000",
            "md5checksum": "",
            "file_size": 0,
            "status": "failed",
            "software_version": "x.x.x"
        }
    }, 
    "another_compendia": {...}
}

Each file in a compendium has its own manifest entry with the following fields:

  • last_download: Timestamp of the last download attempt.
  • md5checksum: Not yet implemented.
  • file_size: Not yet implemented.
  • status: Status of the download attempt: one of "success", "failed", or "incomplete".
  • software_version: Version of the download script used.

To avoid redundant downloads of large files, previously downloaded files are skipped. An existing file is skipped only when all of the following conditions are met (see the sketch below):

  • The compendium has an entry in the download_manifest.json file.
  • The status field for the specific target file is "success".
  • The software_version field matches the current version of the download script.

Expression files and metadata files are treated separately. Downloading a new expression file does not automatically trigger a download of the corresponding metadata file.
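
The following is a minimal sketch of that skip check, assuming the manifest has been loaded from download_manifest.json into a dictionary; the function name and exact field access are illustrative rather than the package's actual API.

def should_skip(manifest: dict, compendia_id: str, file_type: str, current_version: str) -> bool:
    """Return True if a previously downloaded file can be reused."""
    entry = manifest.get(compendia_id, {}).get(file_type)
    if entry is None:
        return False  # no manifest record for this compendium/file yet
    # Reuse the file only if the last download succeeded with the current script version
    return entry.get("status") == "success" and entry.get("software_version") == current_version

For the example manifest above, should_skip(manifest, "Tumor_Compendium_25.01_PolyA_01_2025", "expression", "x.x.x") would return True, while the corresponding metadata file (status "failed") would be downloaded again.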

Files are first downloaded into a temporary directory because compendia files are often large and downloads can be interrupted. Downloading into a temporary directory helps prevent overwriting existing files with incomplete downloads. Once a file is fully downloaded, it is moved from the temporary directory to the final destination.
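
Below is a hedged sketch of that pattern using a streaming download with requests; the function name, chunk size, and timeout are assumptions made for illustration, not the script's actual implementation.

import shutil
import tempfile
from pathlib import Path

import requests


def download_to(url: str, dest: Path) -> None:
    """Stream a file into a temporary directory, then move it into place."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = Path(tmp_dir) / dest.name
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(tmp_path, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    fh.write(chunk)
        # Only a completely downloaded file ever reaches the final destination
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(tmp_path), str(dest))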

Log entries for downloads are appended to data/<version>/download.log.

Process data

The scripts/process_data.py script merges multiple compendia expression and metadata files into a single AnnData object and runs UMAP for dimensionality reduction. This script prepares the data for plotting. The output is stored in an .h5ad file, which is compatible with the Scanpy library and the UCSC Cell Browser.

The script accepts the following command-line arguments:

  • --config <config_file>.json: Path to a configuration file specifying compendia to merge and processing parameters.
  • --data-dir <path> (optional): Root directory where compendia data files are stored. Defaults to data/<version>/ if not provided.
    This argument can be used to run the script with the included pilot dataset or with any other locally stored dataset.
    The directory must follow the same structure produced by the download_data script.

Configuration format

The process script expects a JSON object specifying the compendia to merge and the processing parameters. Example:

{
    "out_dir_name": "data/Tumor_Compendium_25.01_01_2025/",
    "sample_subset": "path/to/subset.tsv",
    "compendia_list": [
        {
            "compendia_id": "Tumor_Compendium_25.01_PolyA_01_2025",
            "lib_prep_type": "polya"
        },
        {
            "compendia_id": "Tumor_Compendium_25.01_RiboD_01_2025",
            "lib_prep_type": "ribodepletion"
        }
    ],
    "seed": 20
}
  • out_dir_name: output sub-directory of the processed/ directory where the processed data will be saved.
  • sample_subset: Optional. Path to a TSV file containing a list of sample IDs (one per line) to include in the final processed data. If provided, only these samples will be retained in the output AnnData object.
  • compendia_list: list of compendia to merge and process. Each entry is a JSON object with the following fields:
    • compendia_id: unique identifier matching a downloaded compendia.
    • lib_prep_type: library preparation type, either "polya" or "ribodepletion".
  • seed: Optional. Integer random seed for UMAP reproducibility.

Process script methods

For each compendium in the list, the process script will:

  1. Load the expression and metadata files from the corresponding data/<version>/<compendia_id>/ directory into pandas DataFrames.
  2. Transpose the expression DataFrame so that genes are in columns and samples are in rows.
  3. Add a "compendia_type" column to the metadata DataFrame, populated with the lib_prep_type value from the config.
  4. Create an AnnData object from the expression and metadata DataFrames.

Read more about AnnData here: https://anndata.readthedocs.io/en/stable/
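
The snippet below illustrates those four steps for a single compendium; the file paths, read options, and reindexing of metadata to the expression samples are assumptions made for the sketch rather than the package's exact implementation.

import anndata as ad
import pandas as pd


def load_compendium(expression_path: str, metadata_path: str, lib_prep_type: str) -> ad.AnnData:
    # 1. Load the expression (genes x samples) and metadata tables
    expr = pd.read_csv(expression_path, sep="\t", index_col=0)
    meta = pd.read_csv(metadata_path, sep="\t", index_col=0)
    # 2. Transpose so that rows are samples and columns are genes
    expr = expr.T
    # 3. Record the library preparation type from the config
    meta["compendia_type"] = lib_prep_type
    # 4. Build the AnnData object, aligning metadata rows to the expression samples
    meta = meta.reindex(expr.index)
    return ad.AnnData(X=expr.to_numpy(), obs=meta, var=pd.DataFrame(index=expr.columns))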

Once each compendium from the config has been loaded into an AnnData object, the script concatenates all of the AnnData objects into a single AnnData object. This is performed using the anndata.concat() function from the anndata library, which concatenates row-wise (i.e., samples are stacked, genes are aligned).

If the sample_subset option is provided in the config, the script filters the merged AnnData object to include only samples whose IDs are present in the specified TSV file.
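
A sketch of the concatenation and optional subsetting, assuming adatas is the list of per-compendium AnnData objects and that the subset TSV contains one sample ID per line with no header (the join strategy shown is an assumption):

import anndata as ad
import pandas as pd

# Stack samples row-wise; genes (columns) are aligned across compendia
merged = ad.concat(adatas, join="inner")

# Optional sample_subset filter from the config
sample_ids = pd.read_csv("path/to/subset.tsv", sep="\t", header=None)[0].astype(str)
merged = merged[merged.obs_names.isin(sample_ids)].copy()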

The script then computes the neighbor graph and UMAP embedding using the scanpy library and stores the results in the AnnData object. Both steps are seeded with the seed value from the config for reproducibility; if no seed is provided in the config, a default value of 42 is used.

The following parameters are used to generate the neighbor graph with scanpy.pp.neighbors():

  • n_neighbors=15
  • n_pcs=None
  • use_rep="X"
  • knn=True
  • method="umap"
  • transformer=None
  • metric="euclidean"
  • metric_kwds={}
  • random_state=<seed from config, default 42>
  • key_added=None
  • copy=False

The following parameters are used to generate the UMAP embedding with scanpy.tl.umap():

  • min_dist=0.5
  • spread=1.0
  • n_components=2
  • maxiter=None
  • alpha=1.0
  • gamma=1.0
  • negative_sample_rate=5
  • init_pos="spectral"
  • random_state=<seed from config, default 42>
  • a=None
  • b=None
  • method="umap"
  • key_added=None
  • neighbors_key="neighbors"
  • copy=False

Finally, the processed AnnData object is saved to processed/{out_dir_name}/merged_compendia.h5ad.
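
The sketch below shows how those parameters translate into scanpy calls, followed by the save. Only the non-default values listed above are spelled out; the seed and the output sub-directory are placeholder examples.

import scanpy as sc

seed = 42  # seed value from the config, defaulting to 42

# Build the neighbor graph directly on the expression matrix (use_rep="X")
sc.pp.neighbors(merged, n_neighbors=15, use_rep="X", random_state=seed)

# Compute the 2D UMAP embedding from the neighbor graph
sc.tl.umap(merged, min_dist=0.5, random_state=seed)

# Save the processed object for the plot script
merged.write_h5ad("processed/sub_dir_name/merged_compendia.h5ad")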

Logs for the script are saved to processed/process.log.

Plot data

The scripts/plot_data.py script creates matplotlib figures from the processed AnnData object produced by the process script. Like the other scripts, it accepts a --config <config_file>.json argument. Figures are saved to the plots/ directory as .png files.

Configuration format

This script expects a JSON object specifying plotting parameters. Example:

{
    "src_adata_path": "processed/sub_dir_name/merged_compendia.h5ad",
    "plot_title": "Plot Title",
    "out_dir_name": "fig_sub_dir",
    "custom_metadata": "table.tsv",
    "color_by": {
        "meta_key": "metadata column name",
        "categories": ["a", "b", "custom_category"],
        "color_map": {  
            "a": "#d4c5b0",
            "b": "#b7595f",
            "custom_category": "#d5d5d5",
            "other": "#808080"
        }
    },
    "legend": {
        "title": "Legend Title",
        "alignment": "left",
        "frameon": false,
        "vertical_position": "top"
    }
}
  • src_adata_path: Path to the processed AnnData file.
  • plot_title: Title for the plot.
  • out_dir_name: Subdirectory within the plots/ directory where the figure will be saved.
  • custom_metadata: Optional. Path to a TSV file with additional metadata to merge into the AnnData object. See below for the required format.
  • color_by: JSON object specifying how to assign legend colors to points in the plot.
    • meta_key: Column name in the AnnData metadata used to color points. This column can optionally come from the custom_metadata file.
    • categories: Optional. List of values within the meta_key column of the metadata DataFrame to assign colors to; these values will appear in the figure legend. Any values present but not specified in this list will be assigned the "other" color. If omitted, all unique values in the meta_key column will be assigned colors and included in the legend.
    • color_map: Optional. Dictionary mapping each legend label to a hex color code. The "other" key is used to color any categories not explicitly listed. Legend labels not present in the color_map will be assigned default colors from a seaborn color palette.
  • legend: Optional. JSON object specifying legend parameters.
    • title: Optional. Title for the legend. If omitted, no title will be displayed.
    • alignment: Optional. Alignment of the legend within the figure. Default is "left".
    • frameon: Optional. Boolean indicating whether to draw a frame around the legend. Default is false.
    • vertical_position: Optional. Vertical alignment of the legend within the figure: "top", "center", or "bottom". Default is "top".

Custom metadata format

If you use the custom_metadata option in the plot config, the TSV file must be parseable by pandas into a DataFrame (for example, with pandas.read_csv(..., sep='\t', index_col=0)). The first column should contain sample IDs and will be used as the DataFrame index; the remaining columns must have headers. Example:

    attr1    attr2
sample_1    a    one
sample_2    b    one
sample_3    a    three
sample_4    c    four
...

The TSV is merged into the AnnData object's obs DataFrame by aligning sample IDs (index). Samples missing from the TSV will receive NaN for the added columns. This merged metadata is used only for plotting and is not written back to disk. The plot script currently uses a single metadata column, specified by color_by.meta_key, to color points, so include and verify only the column(s) you need for plotting.
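
A minimal sketch of that merge, assuming adata is the loaded AnnData object and "table.tsv" is the custom_metadata path from the config (the left join shown is one way to get index-aligned columns with NaN for missing samples; the script's exact call may differ):

import pandas as pd

custom = pd.read_csv("table.tsv", sep="\t", index_col=0)

# Align on sample IDs; samples absent from the TSV get NaN in the new columns
adata.obs = adata.obs.join(custom, how="left")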

Plot script methods

The plot script performs the following steps; a sketch of the plotting step follows the list:

  1. Load the processed AnnData object from the specified src_adata_path.
  2. If custom_metadata is specified, load the custom metadata TSV file into a pandas DataFrame and merge it into the AnnData object's obs DataFrame.
  3. Validate that the color_by.meta_key exists in the AnnData obs DataFrame.
  4. Generate a color map for the legend labels based on the color_by configuration.
  5. Create a matplotlib figure plotting UMAP coordinates colored by the specified metadata.
  6. Save the figure to the plots/ directory as a .png file named after the plot_title.
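
The following is an illustrative sketch of step 5: scattering the UMAP coordinates stored by scanpy in adata.obsm["X_umap"], colored by the color_by.meta_key column. The meta_key value, the color_map dictionary (built in step 4), and the output path are assumptions for the example; the script's actual figure styling may differ.

import matplotlib.pyplot as plt

meta_key = "compendia_type"    # example color_by.meta_key
coords = adata.obsm["X_umap"]  # 2D UMAP coordinates computed by the process script

fig, ax = plt.subplots(figsize=(8, 8))
for category, color in color_map.items():
    mask = (adata.obs[meta_key] == category).to_numpy()
    ax.scatter(coords[mask, 0], coords[mask, 1], s=4, color=color, label=category)

ax.set_xlabel("UMAP 1")
ax.set_ylabel("UMAP 2")
ax.set_title("Plot Title")
ax.legend(title="Legend Title", frameon=False)
fig.savefig("plots/fig_sub_dir/Plot Title.png", dpi=300)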

Matched subsampling

The scripts/matched_subsampling.py script performs a count-matched subsampling of samples across multiple compendia based on a specified metadata label. This is useful for creating balanced datasets for comparison, such as matching disease distributions across different library preparation method compendia.

The script accepts the following command-line arguments:

  • --config <config_file>.json: Path to a configuration file specifying subsampling parameters. Example configs are provided in the config/matched_subsampling/ directory.
  • --data-dir <path> (optional): Root directory where compendia data files are stored. Defaults to data/<version>/ if not provided.
    This argument can be used to run the script with the included pilot dataset or with any other locally stored dataset.
    The directory must follow the same structure produced by the download_data script.

Configuration format

The matched subsampling script expects a JSON object specifying subsampling parameters. Example:

{
    "out_dir_name": "dir_name",
    "metadata_label": "disease",
    "compendia_set": ["compendia_id1", "compendia_id2"]
}
  • out_dir_name: output sub-directory of the matched_subsamples/ directory where the subset sample IDs will be saved.
  • metadata_label: Column name in the metadata files used for matching sample counts across compendia.
  • compendia_set: List of compendia IDs to include in the matched subsampling. These IDs must correspond to downloaded compendia in the data directory.

Matched subsampling script methods

The matched subsampling script performs the following steps:

  1. For each compendium in the compendia_set, load the corresponding metadata file into a pandas DataFrame.
  2. Determine the unique values and their counts for the specified metadata_label across all compendia.
  3. Identify the minimum count for each label value across the compendia.
  4. For each compendium, randomly sample IDs for each label value according to the minimum counts determined in the previous step.
  5. Combine the sampled IDs from all compendia into a single list.
  6. Save the combined list of sample IDs to a TSV file at matched_subsamples/{out_dir_name}/subset_samples.tsv, with one sample ID per line.

Note: Metadata column entries that are not present in all compendia will be ignored during subsampling to ensure consistent matching across datasets.
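
The core of that logic might look like the sketch below, where metadata maps each compendia_id to its metadata DataFrame and label is the metadata_label from the config; the variable names and fixed random seed are assumptions for illustration.

import pandas as pd

# Counts of each label value per compendium (rows: label values, columns: compendia)
counts = pd.DataFrame({cid: df[label].value_counts() for cid, df in metadata.items()})

# Keep only label values present in every compendium, then take the per-label minimum
min_counts = counts.dropna().min(axis=1).astype(int)

subset_ids = []
for cid, df in metadata.items():
    for value, n in min_counts.items():
        candidates = pd.Series(df.index[df[label] == value])
        subset_ids.extend(candidates.sample(n=n, random_state=0).tolist())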

Testing

This repository includes a pytest-based test suite. Make sure to activate the conda environment, then run the tests from the repository root:

pytest

This will execute all the test cases defined in the tests/ directory.
