Encoding issues #133

@maaaaz

Command that causes the issue

When scanning a (randomly found) open directory whose listing contains French accented characters:

$ dirhunt "http://freeit.free.fr/"
Welcome to Dirhunt v1.0.0 using Python 3.12.3
[ERROR] Error on CommonCrawl source: 503 Server Error: Service Temporarily Unavailable for url: https://index.commoncrawl.org/collinfo.json
◐ Started now
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/dirhunt/exceptions.py", line 47, in wrapped
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/dirhunt/crawler_url.py", line 84, in start
    processor.process(text, soup)
  File "/usr/local/lib/python3.12/dist-packages/dirhunt/processors.py", line 351, in process
    self.search_keywords(text)
  File "/usr/local/lib/python3.12/dist-packages/dirhunt/processors.py", line 102, in search_keywords
    text = text.decode('utf-8')
           ^^^^^^^^^^^^^^^^^^^^
◐ Started 2 seconds ago
  File "/usr/local/lib/python3.12/dist-packages/dirhunt/exceptions.py", line 47, in wrapped
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/dirhunt/crawler_url.py", line 84, in start
    processor.process(text, soup)
  File "/usr/local/lib/python3.12/dist-packages/dirhunt/processors.py", line 351, in process
    self.search_keywords(text)
  File "/usr/local/lib/python3.12/dist-packages/dirhunt/processors.py", line 102, in search_keywords
    text = text.decode('utf-8')
           ^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 522: invalid continuation byte
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/dirhunt/exceptions.py", line 47, in wrapped
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/dirhunt/crawler_url.py", line 84, in start
[200] http://freeit.free.fr/Elasticity/  (Index Of) (Nothing interesting)
  File "/usr/local/lib/python3.12/dist-packages/dirhunt/processors.py", line 351, in process
    self.search_keywords(text)
  File "/usr/local/lib/python3.12/dist-packages/dirhunt/processors.py", line 102, in search_keywords
    text = text.decode('utf-8')
           ^^^^^^^^^^^^^^^^^^^^

Expected behavior

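Pages that are not valid UTF-8 should be handled without an error. The failing byte here is 0xe9, which is "é" in ISO-8859-1, so the page is probably served in a Latin-1 style encoding.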

Actual behavior

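Dirhunt prints UnicodeDecodeError tracebacks: search_keywords (processors.py, line 102) calls text.decode('utf-8') on a response body that is not valid UTF-8.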
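
For illustration, here is a minimal sketch of a tolerant decode that would avoid the error above. It is not Dirhunt's code: `decode_body` and its `declared_encoding` parameter are hypothetical names, and the fallback order (declared encoding, then UTF-8, then Latin-1) is an assumption about what would be reasonable here.

```python
def decode_body(data, declared_encoding=None):
    """Decode a response body without raising on non-UTF-8 bytes.

    Hypothetical helper, not part of Dirhunt. `declared_encoding` stands in
    for whatever charset the server or page announces, if any.
    """
    if isinstance(data, str):
        # Already text: nothing to decode.
        return data
    for encoding in (declared_encoding, "utf-8"):
        if not encoding:
            continue
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding: try the next candidate.
            continue
    # Latin-1 maps every byte to a code point, so this cannot fail.
    return data.decode("latin-1")


# The byte from the traceback (0xe9) decodes to "é" via the Latin-1 fallback.
print(decode_body(b"D\xe9mo"))
```

The Latin-1 fallback can produce wrong characters for pages in other legacy encodings, but it never raises, which is enough for a keyword search over the body. Using `data.decode("utf-8", errors="replace")` would be a simpler alternative if losing the accented characters is acceptable.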

Traceback

No response

Dirhunt version

v1.0.0

Operating system (including distribution name and version)

Linux Ubuntu

Other details

No response

Checklist

  • The error is in the project's code, and not in my own.
  • I have searched for this issue before posting it and there isn't an open duplicate.
  • I ran pip install -U dirhunt and triggered the bug in the latest version.
