This repository was archived by the owner on Apr 22, 2025. It is now read-only.

Issues in htmlparse Metro and Telegraaf #486

@mariekevh

Issues to be fixed in htmlparse in the Telegraaf and Metro RSS scrapers:

  • The Metro htmlparser for text also catches some 'invisible' HTML that is not part of the main article text. (These elements likely have CSS display: none applied?)
  • The Telegraaf htmlparser is unable to parse some texts, because they are not included in the HTML and only load after a script is run on the website. Possible solution: htmlsource is a string that has the text included in the script, as "articleBody": "HERE IS THE TEXT.","author":
```python
if text.strip() == "":
    logger.warning("Trying alternative method....")
    # parse the text from htmlsource
```
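
A minimal sketch of what the alternative method for Telegraaf could look like, assuming the script embeds the text as a JSON string directly after the `"articleBody":` key, as in the example above. The names `extract_article_body` and `text_or_fallback` are hypothetical and not part of the existing scraper; only `text`, `htmlsource` and `logger` come from this issue.

```python
import json
import logging
import re

logger = logging.getLogger(__name__)

# Matches the JSON string literal that follows "articleBody": and
# allows escaped characters (such as \" or \n) inside the string.
ARTICLE_BODY_RE = re.compile(r'"articleBody"\s*:\s*("(?:[^"\\]|\\.)*")')


def extract_article_body(htmlsource):
    """Return the articleBody text found in htmlsource, or "" if none is found."""
    match = ARTICLE_BODY_RE.search(htmlsource)
    if match is None:
        return ""
    try:
        # json.loads decodes JSON escapes in the captured string literal.
        return json.loads(match.group(1)).strip()
    except json.JSONDecodeError:
        return ""


def text_or_fallback(text, htmlsource):
    """Use the regularly parsed text; fall back to the script content if it is empty."""
    if text.strip() == "":
        logger.warning("Trying alternative method....")
        text = extract_article_body(htmlsource)
    return text
```

Decoding the captured literal with `json.loads` rather than slicing the raw string means escaped quotes and newlines in the article text come out as normal characters. If the page ever stores the text differently, the function returns an empty string, so the caller can still log the article as unparsed.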
