
Inquiry About Character-Level Basis of Duplication Calculation #116

@Ping-Guo-UCAS

Description

Hi, thank you for the release. I've been reviewing the method used to calculate the repetition score for identifying duplicated content in documents, specifically the segment where the score is computed from the number of characters within duplicated n-grams:

import numpy as np

word_lengths = np.array(list(map(len, document.normalized_words)))  # character length of each normalized word
chars_duped = np.sum(word_lengths * duplicated_grams)               # characters covered by duplicated n-grams
total_chars = np.sum(word_lengths)                                  # characters in the whole document

I noticed that character counts (word_lengths) are used to measure the extent of duplication, so the metric operates at the granularity of characters rather than whole words. Could you help me understand the rationale for choosing character-level analysis for this metric instead of basing the calculation directly on word counts? Are there specific advantages, or scenarios where character-level detail gives better insight into data quality or model training effectiveness, that would not be as apparent with word-level analysis?
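To make the difference concrete, here is a minimal toy sketch of the two alternatives; words and duplicated below are hypothetical stand-ins for document.normalized_words and the per-word duplicate mask, not objects from this repository:

import numpy as np

# Toy document: one long word falls inside a duplicated n-gram,
# four short words do not.
words = ["internationalization", "a", "b", "c", "d"]
duplicated = np.array([1, 0, 0, 0, 0])  # 1 = word is part of a duplicated n-gram
word_lengths = np.array([len(w) for w in words])

# Word-level score: fraction of words inside duplicated n-grams.
word_score = duplicated.sum() / len(words)                           # 1/5 = 0.20

# Character-level score (the approach in the snippet above):
# fraction of characters inside duplicated n-grams.
char_score = (word_lengths * duplicated).sum() / word_lengths.sum()  # 20/24 ≈ 0.83

In this toy case the two scores diverge sharply, which, if I understand correctly, is the trade-off in question: weighting by characters makes the score reflect how much of the document's text is repeated, independent of how that text happens to split into words.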

Looking forward to your insights.
