Message 81068 - Python tracker

Author	tlynn
Recipients	jafo, kael, tlynn
Date	2009-02-03.17:01:59
Message-id	<1233680522.35.0.022264315991.issue1079@psf.upfronthosting.co.za>

Content

The only difference between the two regexps is that the email/header.py
version ends with a lookahead::
 (?=[ \t]|$) # whitespace or the end of the string
(compiled with re.MULTILINE, so $ also matches just before a '\n').
To expand on "There is nothing about that thing in RFC 2047": RFC 2047
itself says::
 IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
 by an RFC 822 parser.
RFC 822 says::
 atom = 1*<any CHAR except specials, SPACE and CTLs>
 ...
 specials = "(" / ")" / "<" / ">" / "@" ; Must be in quoted-
 / "," / ";" / ":" / "\" / <"> ; string, to use
 / "." / "[" / "]" ; within a word.
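The atom rule quoted above can be sketched as a small check. This is only an illustration for Python 3: the names SPECIALS, ATOM_CHARS, ATOM and leading_atom are made up for this example and are not part of the email package. It shows that an RFC 822 tokenizer would end the atom at the '(' because '(' is a special.

```python
import re

# RFC 822 specials, as listed in the grammar quoted above.
SPECIALS = '()<>@,;:\\".[]'

# An atom is one or more CHARs (7-bit ASCII) excluding specials, SPACE
# and CTLs, i.e. printable ASCII 33..126 minus the specials.
ATOM_CHARS = ''.join(chr(c) for c in range(33, 127) if chr(c) not in SPECIALS)
ATOM = re.compile('[' + re.escape(ATOM_CHARS) + ']+')


def leading_atom(text):
    """Return the atom at the start of text, or '' if there is none.

    Hypothetical helper, for illustration only.
    """
    m = ATOM.match(text)
    return m.group(0) if m else ''


h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'
print(leading_atom(h))  # =?utf-8?q?=E2=98=BA?=
```

The encoded-word is a complete atom on its own here, so by the RFC 2047 statement above it should be recognized even with no whitespace before the comment.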
So an example of mis-parsing is::
 >>> import email.header
 >>> h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'
 >>> email.header.decode_header(h)
 [('=?utf-8?q?=E2=98=BA?=(unicode white smiling face)', None)]
The correct result would be::
 >>> email.header.decode_header(h)
 [('\xe2\x98\xba', 'utf-8'), ('(unicode white smiling face)', None)]
which is what you get if you insert a space before the '(' in h.
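The effect of the trailing lookahead can be reproduced without the email package. The sketch below is for Python 3 and assumes two hypothetical patterns, WITH_LOOKAHEAD and WITHOUT_LOOKAHEAD, modelled on the description in this message; they are not copied from email/header.py. decode_q is likewise a made-up helper that only handles the "q" encoding.

```python
import binascii
import re

# Encoded-word shape: =?charset?encoding?encoded-text?=
# (an approximation written for this example, not the library's pattern)
BASE = r'=\?(?P<charset>[^?]*?)\?(?P<encoding>[qb])\?(?P<encoded>.*?)\?='
FLAGS = re.IGNORECASE | re.MULTILINE

# Same pattern with and without the trailing lookahead discussed above.
WITH_LOOKAHEAD = re.compile(BASE + r'(?=[ \t]|$)', FLAGS)
WITHOUT_LOOKAHEAD = re.compile(BASE, FLAGS)


def decode_q(match):
    """Decode a matched "q" encoded-word to text (hypothetical helper)."""
    raw = binascii.a2b_qp(match.group('encoded').encode('ascii'), header=True)
    return raw.decode(match.group('charset'))


h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'
h_spaced = h.replace('?=(', '?= (')  # insert a space before the '('

print(WITH_LOOKAHEAD.search(h))                       # None
print(ascii(decode_q(WITHOUT_LOOKAHEAD.search(h))))   # '\u263a'
print(ascii(decode_q(WITH_LOOKAHEAD.search(h_spaced))))  # '\u263a'
```

With the lookahead, the encoded-word is only found once a space separates it from the comment; without it, the same word is found in both strings.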

History
Date	User	Action	Args
2009-02-03 17:02:02	tlynn	set	recipients: + tlynn, jafo, kael
2009-02-03 17:02:02	tlynn	set	messageid: <1233680522.35.0.022264315991.issue1079@psf.upfronthosting.co.za>
2009-02-03 17:02:00	tlynn	link	issue1079 messages
2009-02-03 17:01:59	tlynn	create
