DocumentScorer.kt stopwords.size > 2 seems to be wrong #9

@zaixiaguozhen

I'm not sure whether I've understood this correctly. In DocumentScorer.kt, I think the code here uses the wrong condition:

class DocumentScorer(private val stopWords: StopWords) : Scorer {

    override fun score(doc: Document): ScoredElement? {
        val nodesWithText = mutableListOf<Element>()
        val nodesToCheck = doc.select("p, pre, td")
        nodesToCheck.forEach { node ->
            val text = node.text()
            val wordStats = stopWords.statistics(text)
            val hasHighLinkDensity = NodeHeuristics.hasHighLinkDensity(node)
            // If stopWords.size is greater than 2, shouldn't this node be ignored rather than added to nodesWithText?
            // Should this be changed to: wordStats.stopWords.size <= 2
            if (wordStats.stopWords.size > 2 && !hasHighLinkDensity) {
                nodesWithText.add(node)
            }
        }
        // ......
    }
}

I think we meant to find the nodes with good text that do not contain a lot of stopwords, right?
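
To make the question concrete, here is a minimal sketch of how the quoted condition behaves on two inputs. It is not the library's code: sampleStopWords, countStopWords and keepsNode are hypothetical names, and the tiny stopword list is made up, since the real StopWords.statistics implementation is not shown in this issue. Under one possible reading, a high stopword count is a sign of ordinary prose rather than of bad text, in which case `> 2` would keep sentences and drop menu-like fragments.

```kotlin
// Hypothetical, tiny stopword list for illustration only.
val sampleStopWords = setOf("the", "a", "of", "and", "to", "in", "is", "it", "that", "was")

// Counts the whitespace-separated tokens that appear in the sample stopword set,
// after lowercasing and stripping leading/trailing punctuation.
fun countStopWords(text: String): Int =
    text.lowercase()
        .split(Regex("\\s+"))
        .map { token -> token.trim { ch -> !ch.isLetterOrDigit() } }
        .count { it in sampleStopWords }

// Mirrors the condition quoted above: keep a node when it has more than
// 2 stopwords and does not have high link density.
fun keepsNode(text: String, hasHighLinkDensity: Boolean): Boolean =
    countStopWords(text) > 2 && !hasHighLinkDensity

fun main() {
    val prose = "It was clear that the author of the report wanted to explain the result."
    val menu = "Home News Sports Weather"
    println("prose kept: ${keepsNode(prose, hasHighLinkDensity = false)}")
    println("menu kept: ${keepsNode(menu, hasHighLinkDensity = false)}")
}
```

With the proposed `<= 2` condition the outcome for these two inputs would be reversed: the menu-like fragment would be kept and the full sentence dropped.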
