iso-8859-8-i.LOG
The htmlbody of a TNEF object containing Hebrew text (iso-8859-8-i encoding) is decoded incorrectly. An example email, iso-8859-8-i.LOG, is attached to demonstrate the issue (it has a .LOG extension only because GitHub does not allow .eml attachments).
>>> fname = 'iso-8859-8-i.LOG'
>>> import email
>>> mime_msg = email.message_from_file(open(fname))
>>> tnef_parsed_content = mime_msg.get_payload()[-1].get_payload(decode=True)
>>> from tnefparse import TNEF
>>> tnefobj = TNEF(tnef_parsed_content)
>>> htmlbody = tnefobj.htmlbody
>>> htmlbody
u'<html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-8-i">\r\n<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>\r\n</head>\r\n<body dir="ltr">\r\n<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);">\r\n<a href="https://www.walla.co.il/" id="LPlnk">https://www.walla.co.il/</a><br>\r\n</div>\r\n<div class="_Entity _EType_OWALinkPreview _EId_OWALinkPreview _EReadonly_1">\r\n<div id="LPBorder_GTaHR0cHM6Ly93d3cud2FsbGEuY28uaWwv" class="LPBorder510319" style="width: 100%; margin-top: 16px; margin-bottom: 16px; position: relative; max-width: 800px; min-width: 424px;">\r\n<table id="LPContainer510319" role="presentation" style="padding: 12px 36px 12px 12px; width: 100%; border-width: 1px; border-style: solid; border-color: rgb(200, 200, 200); border-radius: 2px;">\r\n<tbody>\r\n<tr valign="top" style="border-spacing: 0px;">\r\n<td>\r\n<div id="LPImageContainer510319" style="position: relative; margin-right: 12px; height: 135px; overflow: hidden; width: 240px;">\r\n<a target="_blank" id="LPImageAnchor510319" href="https://www.walla.co.il/"><img id="LPThumbnailImageId510319" alt="" height="135" width="240" style="display: block;" src="https://img.wcdn.co.il/f_auto,q_auto,w_1200,t_54/3/1/3/6/3136860-46.jpg"></a></div>\r\n</td>\r\n<td style="width: 100%;">\r\n<div id="LPTitle510319" style="font-size: 21px; font-weight: 300; margin-right: 8px; font-family: wf_segoe-ui_light, "Segoe UI Light", "Segoe WP Light", "Segoe UI", "Segoe WP", Tahoma, Arial, sans-serif; margin-bottom: 12px;">\r\n<a target="_blank" id="LPUrlAnchor510319" href="https://www.walla.co.il/" style="text-decoration: none; color: var(--themePrimary);">\xe5\xe5\xe0\xec\xe4! 
- \xe4\xe0\xfa\xf8 \xe4\xee\xe5\xe1\xe9\xec \xe1\xe9\xf9\xf8\xe0\xec - \xf2\xe3\xeb\xe5\xf0\xe9\xed \xee\xf1\xe1\xe9\xe1 \xec\xf9\xf2\xe5\xef</a></div>\r\n<div id="LPDescription510319" style="font-size: 14px; max-height: 100px; color: rgb(102, 102, 102); font-family: wf_segoe-ui_normal, "Segoe UI", "Segoe WP", Tahoma, Arial, sans-serif; margin-bottom: 12px; margin-right: 8px; overflow: hidden;">\r\n\xe5\xe5\xe0\xec\xe4!- \xe4\xe0\xfa\xf8 \xe4\xf4\xe5\xf4\xe5\xec\xf8\xe9 \xe1\xe9\xf9\xf8\xe0\xec. \xe7\xe3\xf9\xe5\xfa \xf2\xe3\xeb\xf0\xe9\xe5\xfa 24/7, \xf2\xf9\xf8\xe5\xfa \xf2\xf8\xe5\xf6\xe9 \xfa\xe5\xeb\xef \xe5\xee\xe9\xe3\xf2 \xee\xe5\xe1\xe9\xec\xe9\xed, \xf9\xe9\xf8\xe5\xfa \xe3\xe5\xe0\xf8 \xe0\xec\xf7\xe8\xf8\xe5\xf0\xe9 \xec\xec\xe0 \xe4\xe2\xe1\xec\xfa \xf0\xf4\xe7, \xf9\xe9\xf8\xe5\xfa\xe9 \xf7\xf0\xe9\xe5\xfa \xe5\xfa\xe9\xe9\xf8\xe5\xfa \xe1\xe0\xfa\xf8 walla!</div>\r\n<div id="LPMetadata510319" style="font-size: 14px; font-weight: 400; color: rgb(166, 166, 166); font-family: wf_segoe-ui_normal, "Segoe UI", "Segoe WP", Tahoma, Arial, sans-serif;">\r\nwww.walla.co.il</div>\r\n</td>\r\n</tr>\r\n</tbody>\r\n</table>\r\n</div>\r\n</div>\r\n<br>\r\n</body>\r\n</html>\r\n'
>>> htmlbody[1795:1799]
u'\xe5\xe5\xe0\xec'
>>>
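This seems to be decoded wrongly while constructing the htmlbody: the part at 1795:1799 is a unicode string, but it holds the raw iso-8859-8 byte values (u'\xe5\xe5\xe0\xec') instead of the Hebrew code points. Using this htmlbody to create a text/html part fails when the payload is set with the content charset: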
>>> from email.charset import Charset, QP
>>> from email.mime.nonmultipart import MIMENonMultipart
>>> charset_name = 'iso-8859-8' # text/plain content type is 'iso-8859-8-i' which is mapped to 'iso-8859-8'
>>> cs = Charset(charset_name)
>>> cs.body_encoding = QP
>>> text_body_part = MIMENonMultipart('text', 'html', charset=charset_name)
>>> text_body_part.set_payload(htmlbody, charset=cs)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/email/message.py", line 226, in set_payload
    self.set_charset(charset)
  File "/usr/local/lib/python2.7/email/message.py", line 262, in set_charset
    self._payload = self._payload.encode(charset.output_charset)
  File "/usr/local/lib/python2.7/encodings/iso8859_8.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1795-1799: character maps to <undefined>
>>> htmlbody[1795:1799]
u'\xe5\xe5\xe0\xec'
>>>
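The mismatch can be reproduced without the attachment. The sketch below is a minimal illustration, assuming the Hebrew bytes were decoded as Latin-1 instead of ISO-8859-8 (an assumption about the cause, not something confirmed from the tnefparse source). It uses only the four characters from the slice shown in the session.

```python
# Assumption: the raw bytes were decoded as Latin-1 rather than ISO-8859-8.
garbled = u'\xe5\xe5\xe0\xec'  # value of htmlbody[1795:1799] in the session

# Latin-1 maps code points 0-255 one-to-one onto bytes,
# so encoding with it recovers the original raw bytes.
raw = garbled.encode('latin-1')

# Decoding those bytes with the declared charset yields Hebrew letters.
hebrew = raw.decode('iso-8859-8')

# The garbled string itself cannot be encoded to ISO-8859-8,
# which is consistent with the UnicodeEncodeError in the traceback.
try:
    garbled.encode('iso-8859-8')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
```

If the assumption holds, `hebrew` contains code points in the Hebrew block (U+05D0 and up), while `garbled` contains Latin-1 letters that have no slot in ISO-8859-8.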
The code snippets above were run with version 1.31 on an Ubuntu machine.
The issue seems to exist in versions 1.31 and 1.40 (I did not check earlier ones, but I think they have the bug too).
It seems the bug is here: https://github.com/koodaamo/tnefparse/blob/master/tnefparse/mapi.py#L152-L155
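Until this is fixed in the library, a caller-side workaround might look like the sketch below. `redecode_html` is a hypothetical helper, not part of tnefparse, and it assumes the body was mis-decoded as Latin-1; if a different single-byte codec was used, the `encode` step may raise or produce wrong bytes.

```python
def redecode_html(htmlbody, charset='iso-8859-8'):
    """Hypothetical helper: undo an assumed Latin-1 mis-decode.

    htmlbody: unicode string whose non-ASCII characters are really
              raw bytes of `charset` (assumption, see lead-in).
    charset:  the codec the bytes should have been decoded with.
    """
    # Recover the raw bytes, then decode them with the intended charset.
    return htmlbody.encode('latin-1').decode(charset)


# Small stand-in for the affected fragment of the htmlbody.
sample = u'<a>\xe5\xe5\xe0\xec\xe4!</a>'
fixed = redecode_html(sample)

# The repaired text can now be encoded with the content charset.
encoded = fixed.encode('iso-8859-8')
```

The charset name is passed in rather than hard-coded so that the value from the message's own Content-Type header can be used, after mapping 'iso-8859-8-i' to 'iso-8859-8' as in the session above.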