Add sce by leawlb · Pull Request #3 · odomlab2/snakemake-cellranger

leawlb · 2022-10-12T11:18:11Z

- Added a rule and script for constructing SingleCellExperiment objects from cellranger output.
- Added functionality to automatically derive individual sample names from custom identifiers, for use as wildcards.
- Made minor changes to the cellranger output paths and to the extra and log paths, and used a separate, specific config file for remote functionality.

config/config-example.yaml workflow/Snakefile

fritjoflammers · 2022-10-13T10:59:50Z

workflow/scripts/construct_sce_objects.R

+# r and library modules must be installed in snakemake conda environment
+# https://anaconda.org/conda-forge/r-base
+# https://anaconda.org/bioconda/bioconductor-dropletutils
+# https://anaconda.org/r/r-tidyverse


It would be great if the R dependencies were resolved automatically by snakemake using a dedicated conda environment. See the example here: https://snakemake.readthedocs.io/en/v3.10.0/snakefiles/deployment.html#integrated-package-management

I implemented this using the construct_sce_objects.yaml environment with the required R packages. I still recommend installing the packages directly into the snakemake environment because it is more stable and faster. snakemake frequently rebuilds the environment, and old environments must be removed manually.

Interesting that it frequently rebuilds the env. This should only happen if dependencies (not listed in the environment.yaml) are updated. The separation is definitely preferable. One way to "fix" the env more firmly is to store the full env by activating it and exporting all definitions with conda env export.

fritjoflammers · 2022-10-13T11:03:37Z

workflow/scripts/construct_sce_objects.R

+                     as.is=TRUE, 
+                     colClasses = "character")
+
+wildcard_curr <- snakemake@wildcards[["individual"]] # currently loaded individual sample 


Why is this variable called wildcard_curr? The code reads as if it contains the individual label.

fritjoflammers · 2022-10-13T11:06:40Z

workflow/scripts/samples.py

+# list identifiers identically to config identifiers and in same order
+IDENTIFIERS = ["Species_ID", "Age_ID", "Fraction_ID", "Sample_NR"] # find a way to lift directly from config


The config is passed as an argument to the Samples class below, so every element of the config can be accessed via config[key1][key2].
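A minimal sketch of that lookup, assuming the identifiers are stored under a hypothetical `metadata` / `identifiers` key pair in the config. The key names and the reduced `Samples` class are illustrative and not taken from the repository:

```python
# Hypothetical config layout; in the workflow this dict would be loaded from the YAML config.
config = {
    "metadata": {
        "identifiers": ["Species_ID", "Age_ID", "Fraction_ID", "Sample_NR"],
    }
}


class Samples:
    """Reduced stand-in for the Samples class, showing only the config lookup."""

    def __init__(self, config: dict):
        # Nested keys are reached as config[key1][key2], so no hard-coded
        # module-level IDENTIFIERS constant is needed.
        self.identifiers = list(config["metadata"]["identifiers"])


samples = Samples(config)
print(samples.identifiers)
```

Reading the list inside `__init__` keeps the order of identifiers tied to the config file rather than to a second copy in the script.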

fritjoflammers · 2022-10-13T11:08:04Z

workflow/scripts/samples.py

+        """Make "individual" column
+
+        A single identifier col is copied and renamed to "individual" but 
+        entries from multiple identifier cols are concatenated to "individual".
+
+        Entries in the "individual" col are used as wildcards.
+
+        For metadata_full it is used for downstream functions (?).
+
+        This may cause a warning that I still have to take care of but that doesn't seem critical.
+        """       
+        if not "individual" in metadata_full.columns: 
+            print('establishing "individual" column') 
+            for i in IDENTIFIERS:
+                if i == IDENTIFIERS[0]:
+                    metadata_full["individual"] = metadata_full[i].map(str)
+                else: 
+                    metadata_full["individual"] = metadata_full["individual"] + '_' + metadata_full[i].map(str)
+


I would suggest making this a function inside the class to improve readability.

I shortened this part a lot for better readability but without making a function.
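For illustration, the concatenation could be extracted into a helper along these lines. This is a sketch assuming pandas and a hypothetical function name `add_individual_column`; it is not the code from the later commits:

```python
import pandas as pd

# Illustrative subset of identifier columns; the real list comes from the config.
IDENTIFIERS = ["Species_ID", "Age_ID"]


def add_individual_column(metadata: pd.DataFrame, identifiers: list) -> pd.DataFrame:
    """Return the metadata with an "individual" column built from the identifier columns.

    A single identifier is copied as-is; several identifiers are joined with "_".
    An existing "individual" column is left untouched.
    """
    if "individual" in metadata.columns:
        return metadata
    result = metadata.copy()
    # Cast every identifier to str, then join the values of each row.
    result["individual"] = (
        result[identifiers].astype(str).apply(lambda row: "_".join(row), axis=1)
    )
    return result


metadata_full = pd.DataFrame(
    {"Species_ID": ["mmus", "mcas"], "Age_ID": ["old", "yng"], "Sample_NR": [1, 2]}
)
metadata_full = add_individual_column(metadata_full, IDENTIFIERS)
print(metadata_full["individual"].tolist())
```

Working on a copy avoids modifying a slice of another DataFrame in place, which is one common source of pandas warnings in this kind of code.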

fritjoflammers · 2022-10-13T11:09:29Z

config/config-test-lea.yaml

+# Enable / Disable rules and specify rule-specific parameters
+rules:
+  cellranger_count:
+  #  extra: ""  # set additional arguments for cellranger count


Why is this commented out? The empty string should function as no extra arguments given.
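A small sketch of why an empty string is harmless, using a hypothetical helper `build_cellranger_args` that mimics how the rule assembles its arguments. Note that in YAML a key whose only child is commented out parses as null, which is why the sketch also guards against `None`:

```python
from typing import Optional


def build_cellranger_args(rule_config: Optional[dict], threads: int) -> str:
    """Assemble an argument string; a missing or empty `extra` adds nothing."""
    # `cellranger_count:` with all children commented out loads as None from YAML.
    extra = (rule_config or {}).get("extra", "")
    parts = [f"--localcores={threads}", extra]
    # Empty parts are dropped, so an empty `extra` leaves no stray whitespace.
    return " ".join(part for part in parts if part)


print(build_cellranger_args({"extra": ""}, threads=4))
print(build_cellranger_args({"extra": "--nosecondary"}, threads=4))
print(build_cellranger_args(None, threads=4))
```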

fritjoflammers · 2022-10-13T11:10:24Z

workflow/Snakefile

        "--localcores={threads} "
-        "{params.extra} "
-        "--sample {wildcards.individual}_{wildcards.sample_type} "
+        #"{params.extra} "


not sure if necessary to comment out

fritjoflammers

I made some minor comments. I hope they make sense and help to clarify some things to future maintainers.

One thing I would like to do is make changes to scripts/samples.py. Specifically, I would like to extract the code that adjusts metadata (i.e. DATE_OF_BIRTH). At the same time, we should make clear what the specifications of the pipeline's metadata sheets are. In general, these are Species, Sample, and FastQ Path. In your case, Age and Fraction are required in addition. Is that correct?
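One way to make such a specification explicit is a small check that runs before the metadata sheet is used. The sketch below is an assumption throughout: the column spellings are taken from the comment above and may not match the real sheet headers, and `missing_columns` is a hypothetical helper:

```python
# Column names as written in the review comment; the exact header spellings are assumed.
REQUIRED_COLUMNS = ("Species", "Sample", "FastQ Path")


def missing_columns(available, required=REQUIRED_COLUMNS, extra_required=()):
    """Return the required column names that are absent from `available`, in order."""
    present = set(available)
    return [name for name in (*required, *extra_required) if name not in present]


header = ["Species", "Sample", "FastQ Path", "Age"]
# Project-specific columns (here Age and Fraction) are passed in addition to the defaults.
print(missing_columns(header, extra_required=("Age", "Fraction")))
```

Failing early with the list of missing columns gives a clearer error than a `KeyError` raised deep inside the workflow.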

fritjoflammers · 2023-02-02T15:10:19Z

README.md

+```
+
+to the snakemake command.
+For multiple runs, it is recommended to install these packages directly into the snakemake environment as well.


What is meant by "multiple runs"? Repeated runs? Why would you install them into the base snakemake environment, too? In general, I'm in favour of smaller, isolated environments. The README should state why you recommend this approach.

fritjoflammers · 2023-02-02T15:13:18Z

config/config-interspecies-bonemarrow.yaml

+references:
+  all_masked: "/omics/groups/OE0538/internal/shared_data/CellRangerReferences/GRCm38_masked_allStrains/"
+metadata:
+  raw:


If we say that the pipeline reads a defined set of metadata columns, naming it raw is not really accurate anymore. Could we omit it and just state metadata: path/to/file.yaml? This would also apply to config/config-example.yaml and the (multiple) occurrences in the code.

I changed it to table to keep it easily separated from identifiers and single_cell_object_metadata_fields below

fritjoflammers · 2023-02-02T15:13:40Z

workflow/Snakefile

 samples = Samples(config)

-REFERENCES = config["references"]
+METADATA_PATH = config["metadata"]["raw"]


Possibly omit raw. See above.

fritjoflammers · 2023-02-02T15:19:15Z

workflow/envs/snakemake_cellranger.yaml

+# this env is required, please install and activate for running snakemake
+name: snakemake-cellranger
+channels:
+- conda-forge
+- bioconda
+- defaults
+dependencies:
+- libsqlite=3.39.4=h753d276_0
+- toposort=1.7=pyhd8ed1ab_0
+- importlib-metadata=5.0.0=pyha770c72_1
+- jinja2=3.1.2=pyhd8ed1ab_1
+- exceptiongroup=1.0.1=pyhd8ed1ab_0
+- veracitools=0.1.3=py_0
+- google-cloud-core=2.3.2=pyhd8ed1ab_0
+- coincbc=2.10.8=0_metapackage
+- oauth2client=4.1.3=py_0
+- ubiquerg=0.6.2=pyhd8ed1ab_0
+- google-auth=2.14.0=pyh1a96a4e_0
+- python-dateutil=2.8.2=pyhd8ed1ab_0
+- liblapacke=3.9.0=16_linux64_openblas
+- micromamba=1.0.0=1
+- coin-or-clp=1.17.7=hc56784d_2
+- aioeasywebdav=2.4.0=pyha770c72_0
+- libgomp=12.2.0=h65d4601_19
+- ply=3.11=py_1
+- importlib_resources=5.10.0=pyhd8ed1ab_0
+- libgfortran5=12.2.0=h337968e_19
+- datrie=0.8.2=py310h5764c6d_6
+- pip=22.3.1=pyhd8ed1ab_0
+- packaging=21.3=pyhd8ed1ab_0
+- dataclasses=0.8=pyhc8e2a94_3
+- pycparser=2.21=pyhd8ed1ab_0
+- configargparse=1.5.3=pyhd8ed1ab_0
+- urllib3=1.26.11=pyhd8ed1ab_0
+- colorama=0.4.6=pyhd8ed1ab_0
+- yarl=1.8.1=py310h5764c6d_0
+- psutil=5.9.4=py310h5764c6d_0
+- plac=1.3.5=pyhd8ed1ab_0
+- certifi=2022.9.24=pyhd8ed1ab_0
+- markupsafe=2.1.1=py310h5764c6d_2
+- toolz=0.12.0=pyhd8ed1ab_0
+- cachetools=5.2.0=pyhd8ed1ab_0
+- google-crc32c=1.1.2=py310he8fe98e_4
+- amply=0.1.5=pyhd8ed1ab_0
+- reretry=0.11.1=pyhd8ed1ab_0
+- c-ares=1.18.1=h7f98852_0
+- libsodium=1.0.18=h36c2ea0_1
+- peppy=0.35.2=pyhd8ed1ab_0
+- snakemake=7.18.1=hdfd78af_0
+- pytz=2022.6=pyhd8ed1ab_0
+- pytest=7.2.0=pyhd8ed1ab_2
+- libcblas=3.9.0=16_linux64_openblas
+- httplib2=0.21.0=pyhd8ed1ab_0
+- yte=1.5.1=py310hff52083_1
+- pyrsistent=0.19.2=py310h5764c6d_0
+- libgrpc=1.49.1=h30feacc_1
+- pulp=2.7.0=py310hff52083_0
+- attrs=22.1.0=pyh71513ae_1
+- pandas=1.5.1=py310h769672d_1
+- multidict=6.0.2=py310h5764c6d_2
+- connection_pool=0.0.3=pyhd3deb0d_0
+- _libgcc_mutex=0.1=conda_forge
+- google-api-core=2.10.2=pyhd8ed1ab_0
+- smmap=3.0.5=pyh44b312d_0
+- pygments=2.13.0=pyhd8ed1ab_0
+- wheel=0.38.3=pyhd8ed1ab_0
+- docutils=0.19=py310hff52083_1
+- ftputil=5.0.4=pyhd8ed1ab_0
+- conda=22.9.0=py310hff52083_2
+- gitdb=4.0.9=pyhd8ed1ab_0
+- aiohttp=3.8.3=py310h5764c6d_1
+- libgfortran-ng=12.2.0=h69a702a_19
+- stopit=1.1.2=py_0
+- defusedxml=0.7.1=pyhd8ed1ab_0
+- liblapack=3.9.0=16_linux64_openblas
+- backports=1.0=py_2
+- numpy=1.23.4=py310h53a5b5f_1
+- iniconfig=1.1.1=pyh9f0ad1d_0
+- snakemake-minimal=7.18.1=pyhdfd78af_0
+- coin-or-cbc=2.10.8=h3786ebc_0
+- tzdata=2022f=h191b570_0
+- readline=8.1.2=h0f457ee_0
+- frozenlist=1.3.3=py310h5764c6d_0
+- filelock=3.8.0=pyhd8ed1ab_0
+- ruamel_yaml=0.15.80=py310h5764c6d_1008
+- pycosat=0.6.4=py310h5764c6d_1
+- logmuse=0.2.6=pyh8c360ce_0
+- boto3=1.26.5=pyhd8ed1ab_0
+- pyparsing=3.0.9=pyhd8ed1ab_0
+- backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0
+- pyu2f=0.1.5=pyhd8ed1ab_0
+- protobuf=4.21.9=py310hd8f1fbe_0
+- conda-package-handling=1.9.0=py310h5764c6d_1
+- pysftp=0.2.9=py_1
+- pkgutil-resolve-name=1.3.10=pyhd8ed1ab_0
+- python-fastjsonschema=2.16.2=pyhd8ed1ab_0
+- ld_impl_linux-64=2.39=hc81fddc_0
+- ncurses=6.3=h27087fc_1
+- googleapis-common-protos=1.56.4=py310hff52083_1
+- google-cloud-storage=2.6.0=pyh1a96a4e_0
+- slacker=0.14.0=py_0
+- libffi=3.4.2=h7f98852_5
+- _openmp_mutex=4.5=2_gnu
+- libzlib=1.2.13=h166bdaf_4
+- coin-or-utils=2.11.6=h202d8b1_2
+- libabseil=20220623.0=cxx17_h48a1fff_5
+- prettytable=3.4.1=pyhd8ed1ab_0
+- cffi=1.15.1=py310h255011f_2
+- typing-extensions=4.4.0=hd8ed1ab_0
+- pyyaml=6.0=py310h5764c6d_5
+- dropbox=11.35.0=pyhd8ed1ab_0
+- python_abi=3.10=2_cp310
+- nbformat=5.7.0=pyhd8ed1ab_0
+- libnsl=2.0.0=h7f98852_0
+- stone=3.3.1=pyhd8ed1ab_0
+- libblas=3.9.0=16_linux64_openblas
+- cryptography=38.0.3=py310h600f1e7_0
+- google-auth-httplib2=0.1.0=pyhd8ed1ab_1
+- tk=8.6.12=h27826a3_0
+- libopenblas=0.3.21=pthreads_h78a6416_3
+- pyasn1-modules=0.2.7=py_0
+- jsonschema=4.17.0=pyhd8ed1ab_0
+- coin-or-cgl=0.60.6=h6f57e76_2
+- dpath=2.0.6=py310hff52083_2
+- google-api-python-client=2.65.0=pyhd8ed1ab_0
+- setuptools-scm=7.0.5=pyhd8ed1ab_1
+- rsa=4.9=pyhd8ed1ab_0
+- pyasn1=0.4.8=py_0
+- wcwidth=0.2.5=pyh9f0ad1d_2
+- tqdm=4.64.1=pyhd8ed1ab_0
+- traitlets=5.5.0=pyhd8ed1ab_0
+- wrapt=1.14.1=py310h5764c6d_1
+- zipp=3.10.0=pyhd8ed1ab_0
+- botocore=1.29.5=pyhd8ed1ab_0
+- idna=3.4=pyhd8ed1ab_0
+- google-resumable-media=2.4.0=pyhd8ed1ab_0
+- bcrypt=3.2.2=py310h5764c6d_1
+- attmap=0.13.2=pyhd8ed1ab_0
+- requests=2.28.1=pyhd8ed1ab_1
+- xz=5.2.6=h166bdaf_0
+- grpcio=1.49.1=py310hc32fa93_1
+- libprotobuf=3.21.9=h6239696_0
+- gitpython=3.1.29=pyhd8ed1ab_0
+- s3transfer=0.6.0=pyhd8ed1ab_0
+- yaml=0.2.5=h7f98852_2
+- ratelimiter=1.2.0=pyhd8ed1ab_1003
+- bzip2=1.0.8=h7f98852_4
+- ca-certificates=2022.9.24=ha878542_0
+- uritemplate=4.1.1=pyhd8ed1ab_0
+- future=0.18.2=pyhd8ed1ab_6
+- jupyter_core=4.11.2=py310hff52083_0
+- pluggy=1.0.0=pyhd8ed1ab_5
+- brotlipy=0.7.0=py310h5764c6d_1005
+- jmespath=1.0.1=pyhd8ed1ab_0
+- setuptools=65.5.1=pyhd8ed1ab_0
+- libcrc32c=1.1.2=h9c3ff4c_0
+- libstdcxx-ng=12.2.0=h46fd767_19
+- commonmark=0.9.1=py_0
+- zlib=1.2.13=h166bdaf_4
+- tabulate=0.9.0=pyhd8ed1ab_1
+- re2=2022.06.01=h27087fc_0
+- appdirs=1.4.4=pyh9f0ad1d_0
+- aiosignal=1.3.1=pyhd8ed1ab_0
+- paramiko=2.12.0=pyhd8ed1ab_0
+- filechunkio=1.8=py_2
+- charset-normalizer=2.1.1=pyhd8ed1ab_0
+- python-irodsclient=1.1.5=pyhd8ed1ab_0
+- rich=12.6.0=pyhd8ed1ab_0
+- async-timeout=4.0.2=pyhd8ed1ab_0
+- six=1.16.0=pyh6c4a22f_0
+- coin-or-osi=0.108.7=h2720bb7_2
+- pyopenssl=22.1.0=pyhd8ed1ab_0
+- libuuid=2.32.1=h7f98852_1000
+- pysocks=1.7.1=pyha2e5f31_6
+- openssl=3.0.7=h166bdaf_0
+- typing_extensions=4.4.0=pyha770c72_0
+- python=3.10.6=ha86cf86_0_cpython
+- pynacl=1.5.0=py310h5764c6d_2
+- smart_open=6.2.0=pyha770c72_0
+- tomli=2.0.1=pyhd8ed1ab_0
+- libgcc-ng=12.2.0=h65d4601_19


I think this file should be in the project's root directory because it is not activated by snakemake's --use-conda option, but is used to start snakemake in the first place.

fritjoflammers · 2023-02-02T15:20:59Z

workflow/scripts/construct_sce_objects.R

+individual_curr <- snakemake@wildcards[["individual"]] # currently loaded individual sample 
+IDENTIFIERS <- snakemake@params[["identifiers"]] 


I would suggest to move the definitions of constants to the top of the script to improve readability.

fritjoflammers · 2023-02-02T15:22:41Z

workflow/scripts/construct_sce_objects.R

+IDENTIFIERS <- snakemake@params[["identifiers"]] 
+
+# if necessary, concatenate identifiers again to obtain all possible wildcards 
+metadata_curr <- metadata


Is this reassignment necessary? :)

fritjoflammers · 2023-02-02T15:28:27Z

workflow/scripts/construct_sce_objects.R

+}
+
+# subset data as specified by wildcard and single_cell_object_metadata_fields
+metadata_curr <- metadata_curr[which(metadata_curr$individual == individual_curr),]


Why not stick to the tidyverse? (Sorry for being nitpicky.)

If you're using R >= 4.1, it is also possible to use the native pipe (|> instead of magrittr's %>%).

Suggested change

metadata_curr <- metadata_curr[which(metadata_curr$individual == individual_curr),]

# not tested

metadata_curr <- metadata_curr |> filter(individual == individual_curr)

implemented in commit 47c9b94

fritjoflammers · 2023-02-17T11:08:38Z

workflow/scripts/samples.py


-        self.metadata = self.select_columns(metadata_full)
-        self.metadata = self.metadata.rename(self.columns_map, axis="columns")
+        self.metadata = self.select_columns(metadata_full, identifiers = IDENTIFIERS)


Suggested change

self.metadata = self.select_columns(metadata_full, identifiers = IDENTIFIERS)

self.metadata = self.select_columns(metadata_full, custom_columns = IDENTIFIERS)

fritjoflammers · 2023-02-17T11:08:50Z

workflow/scripts/samples.py

    def select_columns(self,
                       df: pd.DataFrame,
-                       columns: list = None):
+                       columns: list = None,
+                       identifiers: str = None):
        """Select/Subset columns from DataFrame to reduce
        DataFrame dimensions. """
+
        if not columns:
-            columns = self.columns
+            columns = self.columns + identifiers
        return df[columns]
-
+            


The purpose of this method is not entirely clear to me. What should it do?
Currently, it subsets the dataframe to columns (defined as class attributes inside Samples).
If no columns argument is provided, the class-level list of column names plus the contents of identifiers is used for subsetting. If identifiers is not set (it defaults to None), combining columns + identifiers raises a TypeError.

Wouldn't it be clearer to define the method as follows? Combining columns and identifiers would then need to be done when calling the method, i.e. in line 43.

    def select_columns(self, df: pd.DataFrame, custom_columns: list = None):
        """Select/Subset columns from DataFrame to reduce DataFrame dimensions.

        A list of column names can be provided by `custom_columns`
        """
        if custom_columns:
            df_subset = df[custom_columns]
        else:
            df_subset = df[self.columns]
        return df_subset

Implemented in commit b2d5b8c. It works if the column definitions are moved below def __init__(self, config).
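The following sketch puts the suggestion into a runnable form. It assumes pandas and a heavily reduced `Samples` class whose constructor takes the default columns directly (the real class takes the config), so the names and constructor signature are illustrative only:

```python
import pandas as pd


class Samples:
    """Reduced stand-in for the Samples class, showing only column selection."""

    def __init__(self, columns: list):
        # Defining the default columns on the instance makes them reachable as self.columns.
        self.columns = columns

    def select_columns(self, df: pd.DataFrame, custom_columns: list = None) -> pd.DataFrame:
        """Subset `df` to `custom_columns` if given, otherwise to the default columns."""
        if custom_columns:
            return df[custom_columns]
        return df[self.columns]


df = pd.DataFrame({"Species": ["mmus"], "Sample": ["s1"], "Age_ID": ["old"]})
samples = Samples(columns=["Species", "Sample"])

default_subset = samples.select_columns(df)
# Defaults and identifiers are combined by the caller, not inside the method.
combined_subset = samples.select_columns(df, custom_columns=samples.columns + ["Age_ID"])

# The failure mode described in the review: a list cannot be concatenated with None.
try:
    samples.columns + None
    raised_type_error = False
except TypeError:
    raised_type_error = True

print(list(default_subset.columns), list(combined_subset.columns), raised_type_error)
```

Keeping the combination at the call site means the method has a single optional argument and no hidden dependency on a second one being set.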

leawlb added 19 commits October 4, 2022 16:45

most minimal changes that still work in dry run

6351c4d

customized for one object_ID as identifier

18d078d

WIP small addition for customization later

ded0ee9

WIP trying to run rule cellranger_count

028af44

path changes to try successful run

17e1d0d

minor change for successful run with test dataset

31cc8e7

added R script for constructing SCE objects

1598700

added R script for SCE object construction

cf7af69

WIP trying to change 'individual' to identifier wildcards

2fa037f

WIP going back to individual

224a296

addition of metadata to SCE objects during construction

135c3c1

minor fixes

e969133

Conflicts:

26b7d6f

config/config-example.yaml workflow/Snakefile

restored modified versions after merge

9276f95

minor fixes

a289532

minor adjustments

0b12e34

SCE construction now functional with metadata_full

9a869c7

concatenate multiple identifiers

cf8fb83

minor changes for functionality

f415304

fritjoflammers reviewed Oct 13, 2022

leawlb and others added 7 commits October 21, 2022 17:48

changed subfolder structure, added second wildcard again

1489930

add automatic resolution of required conda env for SCE construction

fd312f0

removed manual addition of identifiers to samples.py

3bdeeec

various corrections

32a5d55

altered individual column generation and renamed config

7931239

Update README.md

91401fa

added complete env files and small improvements

28fc106

leawlb added 2 commits January 31, 2023 13:44

Merge branch 'add_SCE' of https://github.com/odomlab2/snakemake-cellranger into add_SCE

1764867

Merge branch 'add_SCE' of https://github.com/odomlab2/snakemake-cellranger into add_SCE

36d3654

fritjoflammers reviewed Feb 2, 2023

extract method to rename BIRTH columns

ed54ac1

fritjoflammers reviewed Feb 17, 2023

leawlb and others added 6 commits March 27, 2023 15:30

improved readability

5d9d69f

Merge branch 'add_SCE' of https://github.com/odomlab2/snakemake-cellranger into add_SCE

7e7a3fe

readability improvements

47c9b94

Update workflow/scripts/samples.py

c5033c6

Co-authored-by: Fritjof Lammers <25619971+mobilegenome@users.noreply.github.com>

adjusted select_columns function plus minor corrections

b2d5b8c

minor corrections

cdcda74
