Merged
2 changes: 1 addition & 1 deletion .github/workflows/test_pipeline.yml
@@ -42,7 +42,7 @@ jobs:
run: |
cd test
NXF_VER=24.10.5 nextflow run ../main.nf --test -profile docker --config config_files/config_elaeis.txt --expert config_files/config_expert.txt -with-docker ksrates

- name: Visualize output files
if: ${{ always() }}
run: ls -l test/rate_adjustment/elaeis
6 changes: 4 additions & 2 deletions docs/configuration.rst
@@ -241,7 +241,8 @@ The following can be used as a template (default values)::
     max_gene_family_size = 200
     distribution_peak_estimate = mode
     min_ks_anchor_pairs = 0.05
-    top_reciprocally_retained_gfs = 2000
+    num_reciprocally_retained_gfs = 2000
+    use_bottom_gfs_instead_of_top = no
     use_original_orthomcl_version = no

 * **logging_level**: the lowest logging/verbosity level of messages printed to the console/logs (increasing severity levels: *notset*, *debug*, *info*, *warning*, *error*, *critical*). Messages less severe than *level* will be ignored; *notset* causes all messages to be processed. [Default: "info"]
@@ -256,5 +257,6 @@ The following can be used as a template (default values)::
 * **max_gene_family_size**: maximum number of members that any paralog gene family can have to be included in *K*:sub:`S` estimation. Large gene families increase the run time and are often composed of unrelated sequences grouped together by shared protein domains or repetitive sequences. This is not always the case, however, so it may be worth manually checking the gene families in file ``paralog_distributions/wgd_species/species.mcl.tsv`` and increasing (or even decreasing) this number. [Default: 200]
 * **distribution_peak_estimate**: the statistical method used to obtain a single ortholog *K*:sub:`S` estimate for the divergence time of a species pair from its ortholog distribution or to obtain a single paralog *K*:sub:`S` estimate from an anchor *K*:sub:`S` cluster or from lognormal components in mixture models (options: "mode" or "median"). [Default: "mode"]
 * **min_ks_anchor_pairs**: lower limit for the *K*:sub:`S` range used to build the anchor pair *K*:sub:`S` distribution. By default, *K*:sub:`S` values smaller than 0.05 in the anchor distribution are ignored (weighted as 0) to remove noise at very recent *K*:sub:`S` ages that could be mistaken for a very recent WGM. Set this value to 0 to keep all *K*:sub:`S` values when you are interested in very young WGMs. [Default: 0.05]
-* **top_reciprocally_retained_gfs**: number of gene families at the top of the reciprocal retention ranking that will be used to build the related *K*:sub:`S` distribution. [Default: 2000]
+* **num_reciprocally_retained_gfs**: number of gene families taken from the reciprocal retention ranking (from the top by default, or from the bottom if **use_bottom_gfs_instead_of_top** is set to "yes") that will be used to build the related *K*:sub:`S` distribution. [Default: 2000]
+* **use_bottom_gfs_instead_of_top**: use the bottom-ranked reciprocally retained GFs instead of the top-ranked ones (not recommended; only meant for comparison with the top-ranked GFs). [Default: "no"]
 * **use_original_orthomcl_version**: allows compatibility with the original OrthoMCL v1.4 version; by default, a modified, faster version called OrthoMCLight is used. [Default: "no"]
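As a rough illustration of how the two reciprocal retention fields interact, the following sketch selects gene families from a ranked list. It is not taken from ksrates: ``select_ranked_gfs`` and the ``GF…`` identifiers are hypothetical, and the ranking is assumed to be a plain list ordered from the best-ranked to the worst-ranked gene family.

```python
def select_ranked_gfs(ranked_gfs, num_gfs=2000, use_bottom=False):
    """Return num_gfs entries from the top (default) or bottom of a ranking.

    Hypothetical helper for illustration only; ranked_gfs is assumed to be
    ordered from rank 1 (most reciprocally retained) to the last rank.
    """
    if num_gfs <= 0:
        # Guard: ranked_gfs[-0:] would otherwise return the whole list
        return []
    if use_bottom:
        return ranked_gfs[-num_gfs:]
    return ranked_gfs[:num_gfs]


# Placeholder identifiers standing in for the 9178 ranked gene families
ranking = [f"GF{rank}" for rank in range(1, 9179)]

top = select_ranked_gfs(ranking)                      # ranks 1..2000
bottom = select_ranked_gfs(ranking, use_bottom=True)  # ranks 7179..9178
print(top[0], top[-1], bottom[0], bottom[-1])
```

With 1-based ranks, the bottom 2000 of 9178 families are ranks 7179 to 9178, which corresponds to the 0-based slice ``[7178:9178]``.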
3 changes: 2 additions & 1 deletion example/config_files/config_expert.txt
@@ -10,6 +10,7 @@ extra_paralogs_analyses_methods = yes
 max_mixture_model_components = 5
 max_mixture_model_ks = 5
 max_gene_family_size = 200
-top_reciprocally_retained_gfs = 2000
+num_reciprocally_retained_gfs = 2000
+use_bottom_gfs_instead_of_top = no
 use_original_orthomcl_version = no
 min_ks_anchor_pairs = 0.05
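The renamed field can be read with the standard-library `configparser`, mirroring the fallback logic this PR uses in `ksrates/fc_configfile.py`. The sketch below is a simplified stand-in, not the project's code: `read_num_gfs` is a hypothetical helper, and the config text is built inline rather than read from the example file.

```python
import configparser
import logging


def read_num_gfs(config_text):
    """Parse num_reciprocally_retained_gfs from expert-config text (illustrative)."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # A missing field falls back to the string "2000"
    raw = parser.get("EXPERT PARAMETERS", "num_reciprocally_retained_gfs", fallback="2000")
    try:
        return int(raw)
    except ValueError:
        # A field that is present but empty (or not an integer) also ends up at 2000
        logging.warning("num_reciprocally_retained_gfs is not an integer: assuming 2000")
        return 2000


header = "[EXPERT PARAMETERS]\nmax_gene_family_size = 200\n"
print(read_num_gfs(header + "num_reciprocally_retained_gfs = 500\n"))  # explicit value
print(read_num_gfs(header + "num_reciprocally_retained_gfs =\n"))      # empty value
print(read_num_gfs(header))                                             # field missing
```

One consequence of this lookup pattern is that a config file still using the old name `top_reciprocally_retained_gfs` is treated as if the field were missing, so a custom value set under the old name would be replaced by 2000 without any warning.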
2 changes: 1 addition & 1 deletion ksrates/_version.py
@@ -1 +1 @@
-__version__ = "2.0.0"
+__version__ = "2.0.1"
51 changes: 45 additions & 6 deletions ksrates/fc_configfile.py
@@ -872,31 +872,70 @@ def get_max_gene_family_size(self):
         return max_size


-    def get_reciprocal_retention_top(self, reciprocal_retention):
+    def get_num_reciprocal_retention_gfs(self, reciprocal_retention):
         """
-        Gets the number of top reciprocally retained gene families to be considered out of the total ranked 9178 ones.
+        Gets the number of ranked reciprocally retained gene families to be considered out of the total 9178 ranked ones.
         Default: 2000 GFs are used.
+
+        Note: The normal use-case scenario gets the *TOP*-ranked GFs, e.g. from 1 to 2000.
+        However, the user can also get the *BOTTOM*-ranked GFs (toggling "use_bottom_gfs_instead_of_top"), for comparison purposes.
+        In this case, the selected GFs are instead the last ones in the ranking, e.g. from 7179 to 9178.
+
         :return top: integer (number of GFs), or None if the reciprocal retention analysis is not requested
         """
         if reciprocal_retention:
             if self.expert_config is not None:
-                # Get user-defined top value in field "top_reciprocally_retained_gfs"
-                # NOTE: If FIELD "top_reciprocally_retained_gfs" is missing in config file, "top" variable falls back to 2000
+                # Get user-defined top value in field "num_reciprocally_retained_gfs"
+                # NOTE: If FIELD "num_reciprocally_retained_gfs" is missing in config file, "top" variable falls back to 2000
                 # NOTE: If instead only the related VALUE is not present, "top" variable is an empty string
-                top = self.expert_config.get("EXPERT PARAMETERS", "top_reciprocally_retained_gfs", fallback="2000")
+                top = self.expert_config.get("EXPERT PARAMETERS", "num_reciprocally_retained_gfs", fallback="2000")
                 # Convert to integer the user-defined value or the fallback value
                 try:
                     top = int(top)
                 # If the VALUE was left empty by the user ("top" variable is an empty string), assume again 2000
                 except ValueError:
-                    logging.warning("Field [top_reciprocally_retained_gfs] in expert config file was left empty: assuming top 2000 GFs")
+                    logging.warning("Field [num_reciprocally_retained_gfs] in expert config file was left empty: assuming top 2000 GFs")
                     top = 2000
             else:
                 top = 2000
         else:
             top = None
         return top
+
+    def use_bottom_gfs_instead_of_top(self, reciprocal_retention):
+        """
+        Checks whether the BOTTOM rec.ret. GFs should be used instead of the TOP rec.ret. GFs, as per expert configuration file.
+        If set to "yes", BOTTOM reciprocally retained gene families will be used, with number decided by "num_reciprocally_retained_gfs" (default 2000)
+        Default is "no", i.e. using TOP rec.ret.
+
+        This parameter builds Ks distributions from the BOTTOM GFs instead of from the TOP GFs (default bottom 2000 in ranking)
+        This is done to test whether bottom-derived Ks distributions are less informative about WGD peaks than top-derived Ks distributions.
+
+        :return boolean: boolean deciding whether to use the BOTTOM rec.ret GFs (True), or to use the standard TOP rec.ret. GFs (False); default: False
+        """
+        if reciprocal_retention:
+            if self.expert_config is not None:
+                try:
+                    bottom = self.expert_config.get("EXPERT PARAMETERS", "use_bottom_gfs_instead_of_top").lower()
+                    if bottom not in ["yes", "no"]:
+                        logging.warning(f'Unrecognized field in expert configuration file [use_bottom_gfs_instead_of_top = {bottom}]. Please choose between "yes" and "no". Default choice will be applied [no]')
+                        bottom = False
+                    else:
+                        if bottom == "yes":
+                            bottom = True
+                        elif bottom == "no":
+                            bottom = False
+                except Exception:
+                    logging.warning('Missing field [use_bottom_gfs_instead_of_top] in expert configuration file. Please choose between "yes" and "no". Default choice will be applied [no]')
+                    bottom = False
+            else:
+                # Default: DOESN'T use bottom (False) -> so uses TOP
+                bottom = False
+        else:
+            # If rec.ret. pipeline is not required, then this bottom parameter is not needed (None)
+            bottom = None
+        return bottom


     def get_reciprocal_retention_rank_type(self, reciprocal_retention):
         """
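The yes/no handling added above could also be factored into a small reusable helper. The sketch below is one possible shape for it, not the PR's implementation: `read_yes_no` is a hypothetical name, and it catches only `configparser.NoSectionError` and `configparser.NoOptionError` rather than a bare `Exception`, so that unrelated errors are not reported as a missing field.

```python
import configparser
import logging


def read_yes_no(parser, section, field, default=False):
    """Return True/False for a "yes"/"no" field, or default if missing or unrecognized."""
    try:
        value = parser.get(section, field).strip().lower()
    except (configparser.NoSectionError, configparser.NoOptionError):
        logging.warning("Missing field [%s]: using default", field)
        return default
    if value not in ("yes", "no"):
        logging.warning("Unrecognized value [%s = %s]: using default", field, value)
        return default
    return value == "yes"


parser = configparser.ConfigParser()
parser.read_string("[EXPERT PARAMETERS]\nuse_bottom_gfs_instead_of_top = Yes\n")
use_bottom = read_yes_no(parser, "EXPERT PARAMETERS", "use_bottom_gfs_instead_of_top")
print(use_bottom)
```

`ConfigParser.getboolean` is a built-in alternative, but it also accepts values such as "true", "on" and "1", which is looser than the strict "yes"/"no" choice described in the documentation for this field.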
6 changes: 3 additions & 3 deletions ksrates/fc_plotting.py
@@ -54,7 +54,7 @@ def create_artists(self, legend, text ,xdescent, ydescent, width, height, fontsi

 def generate_mixed_plot_figure(species, x_max_lim, y_max_lim, corrected_or_not, correction_table_available,
                                plot_correction_arrows, paranome_data=False, colinearity_data=False,
-                               reciprocal_retention_data=False, top=None, rank_type=None):
+                               reciprocal_retention_data=False, num_gfs=None, rank_type=None):
     """
     Initializes a figure with a single empty plot for the mixed distribution.

@@ -67,7 +67,7 @@ def generate_mixed_plot_figure(species, x_max_lim, y_max_lim, corrected_or_not,
     :param paranome_data: boolean to include or not whole-paranome Ks values
     :param colinearity_data: boolean to include or not anchor pair Ks values
     :param reciprocal_retention_data: boolean to include or not reciprocal retention Ks values
-    :param top: cut-off for the reciprocal retention ranking of the 9178 core-angiosperm gene families
+    :param num_gfs: cut-off for the reciprocal retention ranking of the 9178 core-angiosperm gene families
     :param rank_type: Type of reciprocal retention ranking ('lambda' by default or 'combined')
     :return: figure and axis objects
     """
@@ -96,7 +96,7 @@ def generate_mixed_plot_figure(species, x_max_lim, y_max_lim, corrected_or_not,

     # TODO: will this subtitle stay in the final version of the Ks plot?
     # if reciprocal_retention_data:
-    #     ax.set_title(f"Top {top} GFs from {rank_type} ranking")
+    #     ax.set_title(f"Top {num_gfs} GFs from {rank_type} ranking")

     seaborn.despine(offset=10)
     ax.set_xlabel(r"$K_\mathregular{S}$")