Description
Describe the bug
The ported_string utility function raises a TypeError when it receives an instance of email.header.Header.
In certain edge cases (specifically involving "dirty" Outlook conversions or specific email library configurations), the parser extracts an email.header.Header object instead of a string. Because ported_string does not check for this object type, the parser crashes when it attempts to sanitize these complex headers.
To Reproduce
- Create a minimal Python script that simulates the edge case by creating a Header object manually.
- Pass this object to ported_string.
from email.header import Header
from mailparser.utils import ported_string
# This simulates the return value of p.get() in complex parsing scenarios
h = Header("Test Filename.pdf", charset="utf-8")
# This causes a TypeError because ported_string expects str or bytes
ported_string(h)
This raises the following TypeError, because ported_string expects str or bytes:
TypeError: decoding to str: need a bytes-like object, Header found
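The failing call in the traceback is six.text_type(raw_data, encoding). On Python 3, six.text_type is the built-in str, so the root cause can be shown without mailparser installed. The following is a minimal sketch of that, not mailparser's own code:

```python
from email.header import Header

# Same object as in the reproduction steps above
h = Header("Test Filename.pdf", charset="utf-8")

try:
    # str(obj, encoding) only accepts bytes-like objects, so a Header is rejected
    str(h, "utf-8")
except TypeError as exc:
    error = exc
else:
    error = None

print(type(error).__name__, error)
```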
Expected behavior
ported_string should detect that the input is an email.header.Header object and convert it to a string (using str(obj) or six.text_type(obj)) before returning.
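A rough sketch of the expected behavior is shown here. The helper name to_text and its fallback rules are hypothetical and written only for illustration; this is not mailparser's actual ported_string, and the real fix may need to keep the existing decorator and encoding handling:

```python
from email.header import Header


def to_text(raw_data, encoding="utf-8"):
    """Hypothetical stand-in for ported_string that tolerates Header objects."""
    if raw_data is None:
        return ""
    if isinstance(raw_data, Header):
        # Header.__str__ returns the unicode value of the header
        return str(raw_data)
    if isinstance(raw_data, bytes):
        # Assumption: undecodable bytes are dropped rather than raising
        return raw_data.decode(encoding, errors="ignore")
    return str(raw_data)


print(to_text(Header("Test Filename.pdf", charset="utf-8")))
```

The isinstance check runs before any decoding, so the str(raw_data, encoding) path is never reached with a Header.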
Raw mail
I cannot provide the full original .eml file due to GDPR restrictions. Additionally, creating a synthetic file that forces the Python standard email library to return a Header object is non-deterministic, as it often depends on specific system locales and Python versions.
Relevant Header Context: The crash happens specifically when parsing the Content-Disposition header (see traceback). In my environment, the header contained mixed encoding/dirty bytes (likely from an Outlook conversion) similar to: Content-Disposition: attachment; filename="Report \r\n\t\x96\x96\x96 Final.pdf"
While mailparser usually receives a string here, the traceback confirms that in this specific case, the email library returned an email.header.Header object, which mailparser then failed to handle.
Looking at the docstring for decode_header from the Python standard email library, it seems that the library explicitly supports Header objects in this flow:
header may be a string that may or may not contain RFC2047 encoded words,
or it may be a Header object.
This indicates that mailparser (specifically ported_string) needs to handle Header objects defensively, as they are a valid state within the email library ecosystem.
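As a quick check of that docstring, the standard library's decode_header can be called directly on a Header object. This sketch only exercises the standard library; how mailparser should combine the decoded chunks is left open, and the "ascii" fallback below is an assumption:

```python
from email.header import Header, decode_header

h = Header("Test Filename.pdf", charset="utf-8")

# decode_header accepts a Header and returns (decoded_part, charset) pairs
chunks = decode_header(h)

# Join the chunks back into text; parts may be bytes or str
text = "".join(
    part.decode(charset or "ascii", errors="replace") if isinstance(part, bytes) else part
    for part, charset in chunks
)
print(text)
```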
Environment:
- OS: macOS 15.7.3 (Python 3.12.8)
- Docker: no
- mail-parser version 4.1.4
Traceback
return mailparser.parse_from_bytes(inner_email_bytes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/.virtualenvs/project/lib/python3.11/site-packages/mailparser/core.py", line 113, in parse_from_bytes
return MailParser.from_bytes(bt)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/.virtualenvs/venv/lib/python3.11/site-packages/mailparser/core.py", line 236, in from_bytes
return cls(message)
^^^^^^^^^^^^
File "/Users/user/.virtualenvs/venv/lib/python3.11/site-packages/mailparser/core.py", line 132, in __init__
self.parse()
File "/Users/user/.virtualenvs/venv/lib/python3.11/site-packages/mailparser/core.py", line 395, in parse
content_disposition = ported_string(p.get("content-disposition"))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/.virtualenvs/venv/lib/python3.11/site-packages/mailparser/utils.py", line 88, in wrapper
return normalize("NFC", func(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/.virtualenvs/venv/lib/python3.11/site-packages/mailparser/utils.py", line 122, in ported_string
return six.text_type(raw_data, encoding)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: decoding to str: need a bytes-like object, Header found