-
Notifications
You must be signed in to change notification settings - Fork 0
Implementing lusSTR workflow for Amelogenin locus #85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…quences [skip ci]
| if not filt_df[filt_df["SampleID"] == sample_id].empty: | ||
| make_plot(filt_df, sample_id, filters=True, at=False) | ||
| pdf.savefig() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accounting for samples with no sequences that pass filters
|
@standage this is ready for review. |
| ax = fig.add_subplot(6, 5, n) | ||
| if not marker_df.empty: | ||
| if marker == "AMELOGENIN": | ||
| for i, row in marker_df.iterrows(): | ||
| marker_df.loc[i, "CE_Allele"] = ( | ||
| 0 if marker_df.loc[i, "CE_Allele"] == "X" else 1 | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Creates empty subplot if no alleles/sequences exist for marker
standage
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions and suggestions:
lusSTR/cli/gui.py
Outdated
| for i, row in sample_df.iterrows(): | ||
| if sample_df.loc[i, "Locus"] == "AMELOGENIN": | ||
| sample_df.loc[i, "CE_Allele"] = 0 if sample_df.loc[i, "CE_Allele"] == "X" else 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little numpy action could clean this code up a bit.
sample_df = np.where(
sample_df["Locus"] == "AMELOGENIN",
np.where(sample_df["CE_Allele"] == "X", 0, 1),
sample_df["CE_Allele"],
)If you're not a fan of the functional approach and want to keep the row iteration, you may consider using the row object that you're currently ignoring.
for i, row in sample_df.iterrows():
if row["Locus"] == "AMELOGENIN":
sample_df.loc[i, "CE_Allele"] = 0 of row.CE_Allele == "X" else 1I can understand using .loc for all access and assignment operations, since it keeps the syntax consistent. But calling .iterrows without using the row objects is confusing on first read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
| plot_df = sample_df | ||
| for i, row in plot_df.iterrows(): | ||
| if plot_df.loc[i, "Locus"] == "AMELOGENIN": | ||
| plot_df.loc[i, "CE_Allele"] = 0 if plot_df.loc[i, "CE_Allele"] == "X" else 1 | ||
| plot_df["CE_Allele"] = pd.to_numeric(plot_df["CE_Allele"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
lusSTR/cli/gui.py
Outdated
| increase_value = int(math.ceil((max_yvalue / 5) / n)) * n | ||
| n = 0 | ||
| for marker in sample_df["Locus"].unique(): | ||
| all_loci = f_strs if st.session_state.kit == "forenseq" else p_strs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p_strs = powerseq and f_strs = forenseq? I could only figure that out by finding where they're called in the code. You could consider renaming the variables to be more self explanatory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
| if len(locus_allele_info) == 1: | ||
| locus_allele_info = single_allele_thresholds(metadata, locus_reads, locus_allele_info) | ||
| if locus == "AMELOGENIN": | ||
| locus_allele_info = filter_amel(metadata, locus_allele_info, locus_reads) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's nothing necessarily wrong with how you've updated this code, but we could keep the nesting complexity to a minimum with something like this, right?
if locus == "AMELOGENIN":
# ...
elif len(locus_allele_info) == 1:
# ...
else:
# ...There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was just trying to avoid have to repeat code (line 34 would have to be in both blocks).
lusSTR/wrappers/convert.py
Outdated
| if ( | ||
| len(sequence) <= (remove_5p + remove_3p + len(metadata["LUS"])) | ||
| and software != "uas" | ||
| and locus != "AMELOGENIN" | ||
| ) or ( | ||
| software != "uas" and locus == "AMELOGENIN" and len(sequence) < (remove_5p + remove_3p) | ||
| ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conditional is dizzying!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, it's complicated. The Amelogenin true X chr sequences were getting thrown out because it's 0 bases after removal... but that first statement is still necessary for all the other markers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't doubt it's necessary. I'll see if I can come up with a way to make the syntax a little less inscrutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returning to this after myriad distractions.
Here's a step closer to something more manageable. The variable names are still too vague, but I think this breaks the complex conditional down into a structure that can actually be understood by mere mortals.
short1 = len(sequence) <= (remove_5p + remove_3p + len(metadata["LUS"]))
short2 = len(sequence) < (remove_5p + remove_3p)
cond1 = short1 and software != "uas" and locus != "AMELOGENIN"
cond2 = short2 and software != "uas" and locus == "AMELOGENIN"
if cond1 or cond2:
flank_summary = [...]How would we improve the variable names short1, short2, cond1, and cond2 to communicate their purpose?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about:
flanks_removed_len = remove_5p + remove_3p
amel_remove = len(sequence) < flanks_removed_len
otherloci_remove = len(sequence) <= (flanks_removed_len + len(metadata["LUS"]))
if (
software != "uas"
and locus != "AMELOGENIN"
and otherloci_remove
) or (
software != "uas" and locus == "AMELOGENIN" and amel_remove
):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, results from our Teams call just now.
locus_min_length = remove_5p + remove_3p + len(metadata["LUS"])
if locus == "AMELOGENIN":
locus_min_length += 1
if software != "uas" and len(sequence) < locus_min_length:
flank_summary = [...]We may need to fiddle with += 1 vs -= 1 and < vs <=, but once those are in agreement this should work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It ended up being -=1 but is updated and working!
lusSTR/wrappers/filter.py
Outdated
| p_strs = [ | ||
| "AMELOGENIN", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way we could define this list in a single place and reference it as necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roger that. I created json file storing this information.
| plot = interactive_plots(marker_df, marker, max_yvalue, increase_value, all=True) | ||
| if marker in missing_loci: | ||
| marker = f"⚠️{marker}⚠️" | ||
| plot = go.Figure() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this plotly import pattern is an elaborate ruse so people can put "go figure" in their code 😆
| strs = str_lists["powerseq_strs"] if kit == "powerseq" else str_lists["forenseq_strs"] | ||
| ystrs = str_lists["powerseq_ystrs"] if kit == "powerseq" else str_lists["forenseq_ystrs"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! This avoids a scenario where we update a list in one place but forget to in another place. Not that this ever happens...
lusSTR/wrappers/convert.py
Outdated
| if ( | ||
| len(sequence) <= (remove_5p + remove_3p + len(metadata["LUS"])) | ||
| and software != "uas" | ||
| and locus != "AMELOGENIN" | ||
| ) or ( | ||
| software != "uas" and locus == "AMELOGENIN" and len(sequence) < (remove_5p + remove_3p) | ||
| ): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't doubt it's necessary. I'll see if I can come up with a way to make the syntax a little less inscrutable.
Previously, we've excluded the amelogenin locus from processing, as there is no need to produce an alternate sequence format (no use for bracketed sequence format, etc.) and it is not used in the prob gen software. However, now that visualizations is now implemented in lusSTR, it is important to be able to evaluate the Amelogenin results for sex determination and overall profile quality. This PR will include the Amelogenin locus through lusSTR but remove it before producing prob gen files.