Skip to content

Conversation

@rnmitchell
Copy link
Contributor

Previously, we've excluded the amelogenin locus from processing, as there is no need to produce an alternate sequence format (no use for bracketed sequence format, etc.) and it is not used in the prob gen software. However, now that visualizations is now implemented in lusSTR, it is important to be able to evaluate the Amelogenin results for sex determination and overall profile quality. This PR will include the Amelogenin locus through lusSTR but remove it before producing prob gen files.

Comment on lines 372 to 374
if not filt_df[filt_df["SampleID"] == sample_id].empty:
make_plot(filt_df, sample_id, filters=True, at=False)
pdf.savefig()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accounting for samples with no sequences that pass filters

@rnmitchell rnmitchell marked this pull request as ready for review June 2, 2025 10:06
@rnmitchell rnmitchell requested a review from standage June 2, 2025 10:06
@rnmitchell
Copy link
Contributor Author

@standage this is ready for review.

Comment on lines +452 to +458
ax = fig.add_subplot(6, 5, n)
if not marker_df.empty:
if marker == "AMELOGENIN":
for i, row in marker_df.iterrows():
marker_df.loc[i, "CE_Allele"] = (
0 if marker_df.loc[i, "CE_Allele"] == "X" else 1
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creates empty subplot if no alleles/sequences exist for marker

Copy link
Member

@standage standage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions and suggestions:

Comment on lines 264 to 266
for i, row in sample_df.iterrows():
if sample_df.loc[i, "Locus"] == "AMELOGENIN":
sample_df.loc[i, "CE_Allele"] = 0 if sample_df.loc[i, "CE_Allele"] == "X" else 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little numpy action could clean this code up a bit.

sample_df = np.where(
    sample_df["Locus"] == "AMELOGENIN",
    np.where(sample_df["CE_Allele"] == "X", 0, 1),
    sample_df["CE_Allele"],
)

If you're not a fan of the functional approach and want to keep the row iteration, you may consider using the row object that you're currently ignoring.

for i, row in sample_df.iterrows():
    if row["Locus"] == "AMELOGENIN":
        sample_df.loc[i, "CE_Allele"] = 0 of row.CE_Allele == "X" else 1

I can understand using .loc for all access and assignment operations, since it keeps the syntax consistent. But calling .iterrows without using the row objects is confusing on first read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 386 to 390
plot_df = sample_df
for i, row in plot_df.iterrows():
if plot_df.loc[i, "Locus"] == "AMELOGENIN":
plot_df.loc[i, "CE_Allele"] = 0 if plot_df.loc[i, "CE_Allele"] == "X" else 1
plot_df["CE_Allele"] = pd.to_numeric(plot_df["CE_Allele"])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

increase_value = int(math.ceil((max_yvalue / 5) / n)) * n
n = 0
for marker in sample_df["Locus"].unique():
all_loci = f_strs if st.session_state.kit == "forenseq" else p_strs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p_strs = powerseq and f_strs = forenseq? I could only figure that out by finding where they're called in the code. You could consider renaming the variables to be more self explanatory.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines -31 to +32
if len(locus_allele_info) == 1:
locus_allele_info = single_allele_thresholds(metadata, locus_reads, locus_allele_info)
if locus == "AMELOGENIN":
locus_allele_info = filter_amel(metadata, locus_allele_info, locus_reads)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing necessarily wrong with how you've updated this code, but we could keep the nesting complexity to a minimum with something like this, right?

if locus == "AMELOGENIN":
   # ...
elif len(locus_allele_info) == 1:
   # ...
else:
   # ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was just trying to avoid have to repeat code (line 34 would have to be in both blocks).

Comment on lines 73 to 79
if (
len(sequence) <= (remove_5p + remove_3p + len(metadata["LUS"]))
and software != "uas"
and locus != "AMELOGENIN"
) or (
software != "uas" and locus == "AMELOGENIN" and len(sequence) < (remove_5p + remove_3p)
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conditional is dizzying!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, it's complicated. The Amelogenin true X chr sequences were getting thrown out because it's 0 bases after removal... but that first statement is still necessary for all the other markers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't doubt it's necessary. I'll see if I can come up with a way to make the syntax a little less inscrutable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning to this after myriad distractions.

Here's a step closer to something more manageable. The variable names are still too vague, but I think this breaks the complex conditional down into a structure that can actually be understood by mere mortals.

        short1 = len(sequence) <= (remove_5p + remove_3p + len(metadata["LUS"]))
        short2 = len(sequence) < (remove_5p + remove_3p)
        cond1 = short1 and software != "uas" and locus != "AMELOGENIN"
        cond2 = short2 and software != "uas" and locus == "AMELOGENIN"
        if cond1 or cond2:
            flank_summary = [...]

How would we improve the variable names short1, short2, cond1, and cond2 to communicate their purpose?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

flanks_removed_len = remove_5p + remove_3p
amel_remove = len(sequence) < flanks_removed_len
otherloci_remove = len(sequence) <= (flanks_removed_len + len(metadata["LUS"]))
if (
      software != "uas"
      and locus != "AMELOGENIN"
      and otherloci_remove
) or (
      software != "uas" and locus == "AMELOGENIN" and amel_remove
):

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, results from our Teams call just now.

locus_min_length = remove_5p + remove_3p + len(metadata["LUS"])
if locus == "AMELOGENIN":
    locus_min_length += 1
if software != "uas" and len(sequence) < locus_min_length:
    flank_summary = [...]

We may need to fiddle with += 1 vs -= 1 and < vs <=, but once those are in agreement this should work!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It ended up being -=1 but is updated and working!

Comment on lines 30 to 31
p_strs = [
"AMELOGENIN",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way we could define this list in a single place and reference it as necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roger that. I created json file storing this information.

plot = interactive_plots(marker_df, marker, max_yvalue, increase_value, all=True)
if marker in missing_loci:
marker = f"⚠️{marker}⚠️"
plot = go.Figure()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this plotly import pattern is an elaborate ruse so people can put "go figure" in their code 😆

Comment on lines +49 to +50
strs = str_lists["powerseq_strs"] if kit == "powerseq" else str_lists["forenseq_strs"]
ystrs = str_lists["powerseq_ystrs"] if kit == "powerseq" else str_lists["forenseq_ystrs"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! This avoids a scenario where we update a list in one place but forget to in another place. Not that this ever happens...

Comment on lines 73 to 79
if (
len(sequence) <= (remove_5p + remove_3p + len(metadata["LUS"]))
and software != "uas"
and locus != "AMELOGENIN"
) or (
software != "uas" and locus == "AMELOGENIN" and len(sequence) < (remove_5p + remove_3p)
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't doubt it's necessary. I'll see if I can come up with a way to make the syntax a little less inscrutable.

@standage standage merged commit 2425112 into master Jun 5, 2025
2 checks passed
@standage standage deleted the amelogenin branch June 5, 2025 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants