email package: UnicodeEncodeError in as_bytes() when folding malformed Subject header mixing raw UTF‐8 and encoded‐words #143712

Status: Open
Labels: pending (the issue will be closed if no feedback is provided), stdlib (Standard Library Python modules in the Lib/ directory), topic-email, type-feature (a feature request or enhancement)
Author: @DRSpalding

Description

Bug report

Bug description:

While parsing a recent spam message, my script raised a UnicodeEncodeError from inside the email package. I reduced the problem to a minimal test case based on the real-world (non-compliant) Subject header. The header mixes raw UTF-8 characters with RFC 2047 encoded-words, which appears to cause the parser to produce surrogateescape code points when utf8=True is enabled.
Calling as_bytes() then fails during header folding.
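The traceback in the script below ends in string.encode('ascii', 'surrogateescape'). The following sketch is not a trace of the email package internals. It only illustrates, assuming the string being folded ends up holding both surrogate-escaped raw bytes and characters decoded from an encoded-word, why that call would fail. The variable names are made up for the illustration.

```python
import base64

# One character from the Subject as raw UTF-8 bytes (copied from the reproducer).
raw_part = b"\xf0\x9d\x90\xa5"

# Decoding 8-bit bytes as ASCII with surrogateescape maps each byte to a lone
# surrogate in U+DC80..U+DCFF, and encoding the same way restores the bytes.
escaped = raw_part.decode("ascii", "surrogateescape")
assert escaped.encode("ascii", "surrogateescape") == raw_part

# The payload of the first encoded-word, =?UTF-8?B?4puU77iP?=, decoded to text.
decoded_ew = base64.b64decode("4puU77iP").decode("utf-8")

# A string holding both kinds of character cannot take the same path back:
# the decoded characters are neither ASCII nor escaped bytes.
mixed = "xbox " + decoded_ew + " " + escaped
try:
    mixed.encode("ascii", "surrogateescape")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(type(error).__name__, error)
```

If this assumption holds, the surrogateescape handler can restore the escaped bytes but has no way to represent the already-decoded characters in ASCII.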

I think the package is combining two strategies that do not fit together: with utf8=True it handles the raw UTF-8 directly, yet during folding it also tries to wrap part of the header in =?unknown-8bit? sequences. It should probably treat the entire header line as malformed and defensively fence it all off.
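As a rough sketch of what "treat the whole header as malformed" could key on, the hypothetical helper below flags a raw header value that contains both an encoded-word and raw 8-bit bytes. It is not part of the email package, the function name and the simplified encoded-word pattern are assumptions made for illustration, and it says nothing about how a real fix should be implemented.

```python
import re

# Rough pattern for an RFC 2047 encoded-word: =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(rb"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def mixes_raw_8bit_and_encoded_words(header_value: bytes) -> bool:
    """Return True if the value has encoded-words and raw non-ASCII bytes."""
    has_encoded_word = ENCODED_WORD.search(header_value) is not None
    # Encoded-words are pure ASCII, so any byte >= 0x80 is raw 8-bit data.
    has_raw_8bit = any(byte >= 0x80 for byte in header_value)
    return has_encoded_word and has_raw_8bit


# Shortened variants of the Subject value from this report.
print(mixes_raw_8bit_and_encoded_words(b"xbox =?UTF-8?B?4puU77iP?= \xf0\x9d\x90\xa5"))
print(mixes_raw_8bit_and_encoded_words(b"xbox =?UTF-8?B?4puU77iP?= last"))
print(mixes_raw_8bit_and_encoded_words(b"plain subject"))
```

A caller could run a check like this on the raw bytes before parsing to decide whether a message needs special handling.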

Below is a complete, self‐contained Python script that reproduces the error:

# This is a real-world spam subject line, and the testcase is distilled down
# to the form you see below with simply a subject, mime-version and short body.
# 
# This script reproduces a UnicodeEncodeError in the email package
# when parsing a header containing mixed encoded-words and raw UTF-8
# under policy.default.clone(utf8=True). NB: the email is malformed
# because it mixes raw UTF-8 with encoded-words in the same header.
#
# Expected traceback:
#
expected_traceback = r"""
Traceback (most recent call last):
  File "C:\Tmp\PyTest\email_package_corrupted.py", line 32, in <module>
    print(msg.as_bytes())
          ~~~~~~~~~~~~^^
  File "C:\Program Files\Python\Lib\email\message.py", line 208, in as_bytes
    g.flatten(self, unixfrom=unixfrom)
    ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\Python\Lib\email\generator.py", line 117, in flatten
    self._write(msg)
    ~~~~~~~~~~~^^^^^
  File "C:\Program Files\Python\Lib\email\generator.py", line 200, in _write
    self._write_headers(msg)
    ~~~~~~~~~~~~~~~~~~~^^^^^
  File "C:\Program Files\Python\Lib\email\generator.py", line 432, in _write_headers
    self._fp.write(self.policy.fold_binary(h, v))
                   ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "C:\Program Files\Python\Lib\email\policy.py", line 207, in fold_binary
    folded = self._fold(name, value, refold_binary=self.cte_type=='7bit')
  File "C:\Program Files\Python\Lib\email\policy.py", line 228, in _fold
    return self.header_factory(name, ''.join(lines)).fold(policy=self)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "C:\Program Files\Python\Lib\email\headerregistry.py", line 253, in fold
    return header.fold(policy=policy)
           ~~~~~~~~~~~^^^^^^^^^^^^^^^
  File "C:\Program Files\Python\Lib\email\_header_value_parser.py", line 166, in fold
    return _refold_parse_tree(self, policy=policy)
  File "C:\Program Files\Python\Lib\email\_header_value_parser.py", line 2874, in _refold_parse_tree
    last_ew = _fold_as_ew(tstr, lines, maxlen, last_ew,
                          part.ew_combine_allowed, charset, leading_whitespace)
  File "C:\Program Files\Python\Lib\email\_header_value_parser.py", line 3006, in _fold_as_ew
    encoded_word = _ew.encode(to_encode_word, charset=encode_as)
  File "C:\Program Files\Python\Lib\email\_encoded_words.py", line 222, in encode
    bstring = string.encode('ascii', 'surrogateescape')
UnicodeEncodeError: 'ascii' codec can't encode characters in position 4-5: ordinal not in range(128)
"""
from email import policy
from email import message_from_bytes
# Raw email with NO Content-Type header
raw = (
	b'Subject: '
	b'xbox =?UTF-8?B?4puU77iP?= \xf0\x9d\x90\xa5\xf0\x9d\x90'
	b'\x9a\xf0\x9d\x90\xac\xf0\x9d\x90\xad \xf0\x9d\x90\xab\xf0\x9d\x90'
	b'\x9e\xf0\x9d\x90\xa6\xf0\x9d\x90\xa2\xf0\x9d\x90\xa7\xf0\x9d\x90'
	b'\x9d\xf0\x9d\x90\x9e\xf0\x9d\x90\xab! \xf0\x9d\x90\x98\xf0\x9d\x90'
	b'\xa8\xf0\x9d\x90\xae\xf0\x9d\x90\xab \xf0\x9d\x90\xa5\xf0\x9d\x90'
	b'\xa2\xf0\x9d\x90\x9c\xf0\x9d\x90\x9e\xf0\x9d\x90\xa7\xf0\x9d\x90\xac'
	b'\xf0\x9d\x90\x9e \xf0\x9d\x90\xa1\xf0\x9d\x90\x9a\xf0\x9d\x90\xac '
	b'\xf0\x9d\x90\x9e\xf0\x9d\x90\xb1\xf0\x9d\x90\xa9\xf0\x9d\x90\xa2\xf0'
	b'\x9d\x90\xab\xf0\x9d\x90\x9e\xf0\x9d\x90\x9d \xf0\x9d\x90\x93\xf0\x9d'
	b'\x90\xa8\xf0\x9d\x90\x9d\xf0\x9d\x90\x9a\xf0\x9d\x90\xb2 '
	b'=?UTF-8?B?4pqg77iP?=... \xf0\x9d\x90\xb2\xf0\x9d\x90\xa8\xf0\x9d\x90'
	b'\xae\xf0\x9d\x90\xab \xf0\x9d\x90\x9c\xf0\x9d\x90\xa8\xf0\x9d\x90\xa6'
	b'\xf0\x9d\x90\xa9\xf0\x9d\x90\xae\xf0\x9d\x90\xad\xf0\x9d\x90\x9e\xf0'
	b'\x9d\x90\xab \xf0\x9d\x90\xa2\xf0\x9d\x90\xac \xf0\x9d\x90\xa2\xf0\x9d'
	b'\x90\xa7 \xf0\x9d\x90\x9d\xf0\x9d\x90\x9a\xf0\x9d\x90\xa7\xf0\x9d\x90'
	b'\xa0\xf0\x9d\x90\x9e\xf0\x9d\x90\xab\r\n'
	b'MIME-Version: 1.0'
	b'\r\n\r\n'
	b'Hello, world!')
msg = message_from_bytes(raw, policy=policy.default.clone(utf8=True))
print('This call to as_bytes() should trigger the UnicodeEncodeError')
print(msg.as_bytes())

CPython versions tested on:

3.14

Operating systems tested on:

Windows
