Conversation
config/config-example.yaml workflow/Snakefile
| # r and library modules must be installed in snakemake conda environment | ||
| # https://anaconda.org/conda-forge/r-base | ||
| # https://anaconda.org/bioconda/bioconductor-dropletutils | ||
| # https://anaconda.org/r/r-tidyverse |
There was a problem hiding this comment.
It would be great if the R dependencies would be automatically resolved by snakemake using an own conda environment. See the example here https://snakemake.readthedocs.io/en/v3.10.0/snakefiles/deployment.html#integrated-package-management
There was a problem hiding this comment.
I implemented using the construct_sce_objects.yaml environment with the required R packages. I still recommend directly installing the packages into the snakemake environment because it is more stable and faster. snakemake frequently rebuilds the environment and old environments must be removed manually.
There was a problem hiding this comment.
Interesting that it frequently rebuilds the env. This should only happen if dependencies (not listed in the environment.yaml) are updated. The separation is definitively preferable. One way to "fix" the env better is to store the full env by activating it and export all definitions with conda env export.
| as.is=TRUE, | ||
| colClasses = "character") | ||
|
|
||
| wildcard_curr <- snakemake@wildcards[["individual"]] # currently loaded individual sample |
There was a problem hiding this comment.
why is this variable called wildcard_curr, the code reads like it contains the individual label
workflow/scripts/samples.py
Outdated
| # list identifiers identically to config identifiers and in same order | ||
| IDENTIFIERS = ["Species_ID", "Age_ID", "Fraction_ID", "Sample_NR"] # find a way to lift directly from config |
There was a problem hiding this comment.
The config is an argument to the Samples-class below. Thereby every element of the config can be accessed via config[key1][key2]
workflow/scripts/samples.py
Outdated
| """Make "individual" column | ||
|
|
||
| A single identifier col is copied and renamed to "individual" but | ||
| entries from multiple identifier cols are concatenated to "individual". | ||
|
|
||
| Entries in the "individual" col are used as wildcards. | ||
|
|
||
| For metadata_full it is used for downstream functions (?). | ||
|
|
||
| This may cause a warning that I still have to take care of but that doesn't seem critical. | ||
| """ | ||
| if not "individual" in metadata_full.columns: | ||
| print('establishing "individual" column') | ||
| for i in IDENTIFIERS: | ||
| if i == IDENTIFIERS[0]: | ||
| metadata_full["individual"] = metadata_full[i].map(str) | ||
| else: | ||
| metadata_full["individual"] = metadata_full["individual"] + '_' + metadata_full[i].map(str) | ||
|
|
There was a problem hiding this comment.
I would suggest to make this a function side the class to improve readability
There was a problem hiding this comment.
I shortened this part a lot for better readability but without making a function.
config/config-test-lea.yaml
Outdated
| # Enable / Disable rules and specifiy rule-specific parameters | ||
| rules: | ||
| cellranger_count: | ||
| # extra: "" # set additional arguments for cellranger count |
There was a problem hiding this comment.
why the comment here? the empty string should function as no extra arguments given.
workflow/Snakefile
Outdated
| "--localcores={threads} " | ||
| "{params.extra} " | ||
| "--sample {wildcards.individual}_{wildcards.sample_type} " | ||
| #"{params.extra} " |
There was a problem hiding this comment.
not sure if necessary to comment out
fritjoflammers
left a comment
There was a problem hiding this comment.
I made some minor comments. I hope they make sense and help to clarify some things to future maintainers.
One thing, that I would like to do is make changes to scripts/samples.py. Specifically, I would like to extract the code to adjust metadata (i.e. DATE_OF_BIRTH). At the same time we should make clear what the specifications of the pipeline's metadata sheets is. In general this is Species , Sample, and FastQ Path. For your case in addition Age and Fraction are required. Is that correct?
README.md
Outdated
| ``` | ||
|
|
||
| to the snakemake command. | ||
| For multiple runs, it is recommended to install these packages directly into the snakemake environment as well. |
There was a problem hiding this comment.
What is meant by "multipe runs" - repeated runs? Why would you install them into the base snakemake environment, too? In general, I'm in favour of smaller, isolated environments. The readme here should state why you recommend this approach.
| references: | ||
| all_masked: "/omics/groups/OE0538/internal/shared_data/CellRangerReferences/GRCm38_masked_allStrains/" | ||
| metadata: | ||
| raw: |
There was a problem hiding this comment.
If we say that the pipeline reads a defined set of metadata columns, naming it raw is not really accurate anymore. Could be omit it and just state metadata: path/to/file.yaml ? This would also apply to the the config/config-example.yaml and (multiple) occurrences in the code.
There was a problem hiding this comment.
I changed it to table to keep it easily separated from identifiers and single_cell_object_metadata_fields below
workflow/Snakefile
Outdated
| samples = Samples(config) | ||
|
|
||
| REFERENCES = config["references"] | ||
| METADATA_PATH = config["metadata"]["raw"] |
There was a problem hiding this comment.
Possibly omit raw. See above.
| # this env is required, please install and activate for running snakemake | ||
| name: snakemake-cellranger | ||
| channels: | ||
| - conda-forge | ||
| - bioconda | ||
| - defaults | ||
| dependencies: | ||
| - libsqlite=3.39.4=h753d276_0 | ||
| - toposort=1.7=pyhd8ed1ab_0 | ||
| - importlib-metadata=5.0.0=pyha770c72_1 | ||
| - jinja2=3.1.2=pyhd8ed1ab_1 | ||
| - exceptiongroup=1.0.1=pyhd8ed1ab_0 | ||
| - veracitools=0.1.3=py_0 | ||
| - google-cloud-core=2.3.2=pyhd8ed1ab_0 | ||
| - coincbc=2.10.8=0_metapackage | ||
| - oauth2client=4.1.3=py_0 | ||
| - ubiquerg=0.6.2=pyhd8ed1ab_0 | ||
| - google-auth=2.14.0=pyh1a96a4e_0 | ||
| - python-dateutil=2.8.2=pyhd8ed1ab_0 | ||
| - liblapacke=3.9.0=16_linux64_openblas | ||
| - micromamba=1.0.0=1 | ||
| - coin-or-clp=1.17.7=hc56784d_2 | ||
| - aioeasywebdav=2.4.0=pyha770c72_0 | ||
| - libgomp=12.2.0=h65d4601_19 | ||
| - ply=3.11=py_1 | ||
| - importlib_resources=5.10.0=pyhd8ed1ab_0 | ||
| - libgfortran5=12.2.0=h337968e_19 | ||
| - datrie=0.8.2=py310h5764c6d_6 | ||
| - pip=22.3.1=pyhd8ed1ab_0 | ||
| - packaging=21.3=pyhd8ed1ab_0 | ||
| - dataclasses=0.8=pyhc8e2a94_3 | ||
| - pycparser=2.21=pyhd8ed1ab_0 | ||
| - configargparse=1.5.3=pyhd8ed1ab_0 | ||
| - urllib3=1.26.11=pyhd8ed1ab_0 | ||
| - colorama=0.4.6=pyhd8ed1ab_0 | ||
| - yarl=1.8.1=py310h5764c6d_0 | ||
| - psutil=5.9.4=py310h5764c6d_0 | ||
| - plac=1.3.5=pyhd8ed1ab_0 | ||
| - certifi=2022.9.24=pyhd8ed1ab_0 | ||
| - markupsafe=2.1.1=py310h5764c6d_2 | ||
| - toolz=0.12.0=pyhd8ed1ab_0 | ||
| - cachetools=5.2.0=pyhd8ed1ab_0 | ||
| - google-crc32c=1.1.2=py310he8fe98e_4 | ||
| - amply=0.1.5=pyhd8ed1ab_0 | ||
| - reretry=0.11.1=pyhd8ed1ab_0 | ||
| - c-ares=1.18.1=h7f98852_0 | ||
| - libsodium=1.0.18=h36c2ea0_1 | ||
| - peppy=0.35.2=pyhd8ed1ab_0 | ||
| - snakemake=7.18.1=hdfd78af_0 | ||
| - pytz=2022.6=pyhd8ed1ab_0 | ||
| - pytest=7.2.0=pyhd8ed1ab_2 | ||
| - libcblas=3.9.0=16_linux64_openblas | ||
| - httplib2=0.21.0=pyhd8ed1ab_0 | ||
| - yte=1.5.1=py310hff52083_1 | ||
| - pyrsistent=0.19.2=py310h5764c6d_0 | ||
| - libgrpc=1.49.1=h30feacc_1 | ||
| - pulp=2.7.0=py310hff52083_0 | ||
| - attrs=22.1.0=pyh71513ae_1 | ||
| - pandas=1.5.1=py310h769672d_1 | ||
| - multidict=6.0.2=py310h5764c6d_2 | ||
| - connection_pool=0.0.3=pyhd3deb0d_0 | ||
| - _libgcc_mutex=0.1=conda_forge | ||
| - google-api-core=2.10.2=pyhd8ed1ab_0 | ||
| - smmap=3.0.5=pyh44b312d_0 | ||
| - pygments=2.13.0=pyhd8ed1ab_0 | ||
| - wheel=0.38.3=pyhd8ed1ab_0 | ||
| - docutils=0.19=py310hff52083_1 | ||
| - ftputil=5.0.4=pyhd8ed1ab_0 | ||
| - conda=22.9.0=py310hff52083_2 | ||
| - gitdb=4.0.9=pyhd8ed1ab_0 | ||
| - aiohttp=3.8.3=py310h5764c6d_1 | ||
| - libgfortran-ng=12.2.0=h69a702a_19 | ||
| - stopit=1.1.2=py_0 | ||
| - defusedxml=0.7.1=pyhd8ed1ab_0 | ||
| - liblapack=3.9.0=16_linux64_openblas | ||
| - backports=1.0=py_2 | ||
| - numpy=1.23.4=py310h53a5b5f_1 | ||
| - iniconfig=1.1.1=pyh9f0ad1d_0 | ||
| - snakemake-minimal=7.18.1=pyhdfd78af_0 | ||
| - coin-or-cbc=2.10.8=h3786ebc_0 | ||
| - tzdata=2022f=h191b570_0 | ||
| - readline=8.1.2=h0f457ee_0 | ||
| - frozenlist=1.3.3=py310h5764c6d_0 | ||
| - filelock=3.8.0=pyhd8ed1ab_0 | ||
| - ruamel_yaml=0.15.80=py310h5764c6d_1008 | ||
| - pycosat=0.6.4=py310h5764c6d_1 | ||
| - logmuse=0.2.6=pyh8c360ce_0 | ||
| - boto3=1.26.5=pyhd8ed1ab_0 | ||
| - pyparsing=3.0.9=pyhd8ed1ab_0 | ||
| - backports.functools_lru_cache=1.6.4=pyhd8ed1ab_0 | ||
| - pyu2f=0.1.5=pyhd8ed1ab_0 | ||
| - protobuf=4.21.9=py310hd8f1fbe_0 | ||
| - conda-package-handling=1.9.0=py310h5764c6d_1 | ||
| - pysftp=0.2.9=py_1 | ||
| - pkgutil-resolve-name=1.3.10=pyhd8ed1ab_0 | ||
| - python-fastjsonschema=2.16.2=pyhd8ed1ab_0 | ||
| - ld_impl_linux-64=2.39=hc81fddc_0 | ||
| - ncurses=6.3=h27087fc_1 | ||
| - googleapis-common-protos=1.56.4=py310hff52083_1 | ||
| - google-cloud-storage=2.6.0=pyh1a96a4e_0 | ||
| - slacker=0.14.0=py_0 | ||
| - libffi=3.4.2=h7f98852_5 | ||
| - _openmp_mutex=4.5=2_gnu | ||
| - libzlib=1.2.13=h166bdaf_4 | ||
| - coin-or-utils=2.11.6=h202d8b1_2 | ||
| - libabseil=20220623.0=cxx17_h48a1fff_5 | ||
| - prettytable=3.4.1=pyhd8ed1ab_0 | ||
| - cffi=1.15.1=py310h255011f_2 | ||
| - typing-extensions=4.4.0=hd8ed1ab_0 | ||
| - pyyaml=6.0=py310h5764c6d_5 | ||
| - dropbox=11.35.0=pyhd8ed1ab_0 | ||
| - python_abi=3.10=2_cp310 | ||
| - nbformat=5.7.0=pyhd8ed1ab_0 | ||
| - libnsl=2.0.0=h7f98852_0 | ||
| - stone=3.3.1=pyhd8ed1ab_0 | ||
| - libblas=3.9.0=16_linux64_openblas | ||
| - cryptography=38.0.3=py310h600f1e7_0 | ||
| - google-auth-httplib2=0.1.0=pyhd8ed1ab_1 | ||
| - tk=8.6.12=h27826a3_0 | ||
| - libopenblas=0.3.21=pthreads_h78a6416_3 | ||
| - pyasn1-modules=0.2.7=py_0 | ||
| - jsonschema=4.17.0=pyhd8ed1ab_0 | ||
| - coin-or-cgl=0.60.6=h6f57e76_2 | ||
| - dpath=2.0.6=py310hff52083_2 | ||
| - google-api-python-client=2.65.0=pyhd8ed1ab_0 | ||
| - setuptools-scm=7.0.5=pyhd8ed1ab_1 | ||
| - rsa=4.9=pyhd8ed1ab_0 | ||
| - pyasn1=0.4.8=py_0 | ||
| - wcwidth=0.2.5=pyh9f0ad1d_2 | ||
| - tqdm=4.64.1=pyhd8ed1ab_0 | ||
| - traitlets=5.5.0=pyhd8ed1ab_0 | ||
| - wrapt=1.14.1=py310h5764c6d_1 | ||
| - zipp=3.10.0=pyhd8ed1ab_0 | ||
| - botocore=1.29.5=pyhd8ed1ab_0 | ||
| - idna=3.4=pyhd8ed1ab_0 | ||
| - google-resumable-media=2.4.0=pyhd8ed1ab_0 | ||
| - bcrypt=3.2.2=py310h5764c6d_1 | ||
| - attmap=0.13.2=pyhd8ed1ab_0 | ||
| - requests=2.28.1=pyhd8ed1ab_1 | ||
| - xz=5.2.6=h166bdaf_0 | ||
| - grpcio=1.49.1=py310hc32fa93_1 | ||
| - libprotobuf=3.21.9=h6239696_0 | ||
| - gitpython=3.1.29=pyhd8ed1ab_0 | ||
| - s3transfer=0.6.0=pyhd8ed1ab_0 | ||
| - yaml=0.2.5=h7f98852_2 | ||
| - ratelimiter=1.2.0=pyhd8ed1ab_1003 | ||
| - bzip2=1.0.8=h7f98852_4 | ||
| - ca-certificates=2022.9.24=ha878542_0 | ||
| - uritemplate=4.1.1=pyhd8ed1ab_0 | ||
| - future=0.18.2=pyhd8ed1ab_6 | ||
| - jupyter_core=4.11.2=py310hff52083_0 | ||
| - pluggy=1.0.0=pyhd8ed1ab_5 | ||
| - brotlipy=0.7.0=py310h5764c6d_1005 | ||
| - jmespath=1.0.1=pyhd8ed1ab_0 | ||
| - setuptools=65.5.1=pyhd8ed1ab_0 | ||
| - libcrc32c=1.1.2=h9c3ff4c_0 | ||
| - libstdcxx-ng=12.2.0=h46fd767_19 | ||
| - commonmark=0.9.1=py_0 | ||
| - zlib=1.2.13=h166bdaf_4 | ||
| - tabulate=0.9.0=pyhd8ed1ab_1 | ||
| - re2=2022.06.01=h27087fc_0 | ||
| - appdirs=1.4.4=pyh9f0ad1d_0 | ||
| - aiosignal=1.3.1=pyhd8ed1ab_0 | ||
| - paramiko=2.12.0=pyhd8ed1ab_0 | ||
| - filechunkio=1.8=py_2 | ||
| - charset-normalizer=2.1.1=pyhd8ed1ab_0 | ||
| - python-irodsclient=1.1.5=pyhd8ed1ab_0 | ||
| - rich=12.6.0=pyhd8ed1ab_0 | ||
| - async-timeout=4.0.2=pyhd8ed1ab_0 | ||
| - six=1.16.0=pyh6c4a22f_0 | ||
| - coin-or-osi=0.108.7=h2720bb7_2 | ||
| - pyopenssl=22.1.0=pyhd8ed1ab_0 | ||
| - libuuid=2.32.1=h7f98852_1000 | ||
| - pysocks=1.7.1=pyha2e5f31_6 | ||
| - openssl=3.0.7=h166bdaf_0 | ||
| - typing_extensions=4.4.0=pyha770c72_0 | ||
| - python=3.10.6=ha86cf86_0_cpython | ||
| - pynacl=1.5.0=py310h5764c6d_2 | ||
| - smart_open=6.2.0=pyha770c72_0 | ||
| - tomli=2.0.1=pyhd8ed1ab_0 | ||
| - libgcc-ng=12.2.0=h65d4601_19 No newline at end of file |
There was a problem hiding this comment.
I think this file should be in the projects root directory because it is not activated by snakemake's --use-conda directive, but used to start snakemake in the first place.
| individual_curr <- snakemake@wildcards[["individual"]] # currently loaded individual sample | ||
| IDENTIFIERS <- snakemake@params[["identifiers"]] |
There was a problem hiding this comment.
I would suggest to move the definitions of constants to the top of the script to improve readability.
| IDENTIFIERS <- snakemake@params[["identifiers"]] | ||
|
|
||
| # if necessary, concatenate identifiers again to obtain all possible wildcards | ||
| metadata_curr <- metadata |
There was a problem hiding this comment.
Is this reassignment necessary? :)
| } | ||
|
|
||
| # subset data as specified by wildcard and single_cell_object_metadata_fields | ||
| metadata_curr <- metadata_curr[which(metadata_curr$individual == individual_curr),] |
There was a problem hiding this comment.
Why not stick to the tidyverse? (sorry for being nitty gritty?
If you're using >= R 4.1, it is also possible to use native pipes (|> instead of magrittrs %>%)
| metadata_curr <- metadata_curr[which(metadata_curr$individual == individual_curr),] | |
| # not tested | |
| metadata_curr <- metadata_curr |> filter(individual == individual_curr) |
workflow/scripts/samples.py
Outdated
|
|
||
| self.metadata = self.select_columns(metadata_full) | ||
| self.metadata = self.metadata.rename(self.columns_map, axis="columns") | ||
| self.metadata = self.select_columns(metadata_full, identifiers = IDENTIFIERS) |
There was a problem hiding this comment.
| self.metadata = self.select_columns(metadata_full, identifiers = IDENTIFIERS) | |
| self.metadata = self.select_columns(metadata_full, custom_columns = IDENTIFIERS) |
| def select_columns(self, | ||
| df: pd.DataFrame, | ||
| columns: list = None): | ||
| columns: list = None, | ||
| identifiers: str = None): | ||
| """Select/Subset columns from DataFrame to reduce | ||
| DataFrame dimensions. """ | ||
|
|
||
| if not columns: | ||
| columns = self.columns | ||
| columns = self.columns + identifiers | ||
| return df[columns] | ||
|
|
||
There was a problem hiding this comment.
The functionality of this method is not super clear to me. What should it do?
Currently, it subsets the dataframe to columns (defined as class attributes inside Samples).
If a columns attributes is provided, this list of column names plus the contents of identifiers will be used for subsetting. If individual is not set (defaults to None), combining columns + identifiers will raise a TypeError.
Wouldn't a more clearer functionality be use define the method as follows? Then combining columns and identifiers need to be done when calling the method, i.e. in line 43
def select_columns(self,
df: pd.DataFrame,
custom_columns: list = None):
"""Select/Subset columns from DataFrame to reduce
DataFrame dimensions. A list of column names can be provided by `custom_columns` """
if custom_columns:
df_subset = df[custom_columns]
else:
df_subset = df[columns]
return df_subset
There was a problem hiding this comment.
implemented in commit b2d5b8c. It works if the defined columns are moved below def init(self, config)
Co-authored-by: Fritjof Lammers <25619971+mobilegenome@users.noreply.github.com>
-added rule and script for construction of Single Cell Experiment objects from cellranger output
-added functionality to automatically obtain individual sample names from custom identifiers, to be used as wildcards
-minor changes to cellranger output paths, extra and log paths and used separate, specific config file for remote functionality