Created on 2009-01-15 23:33 by oxij, last changed 2022-04-11 14:56 by admin. This issue is now closed.
| Messages (4) | |||
|---|---|---|---|
| msg79927 - (view) | Author: Jan Malakhovski (oxij) | Date: 2009-01-15 23:33 | |
Hello.
I have a dedicated mail server at home
that holds about 1 GB of mail.
Most of the mail is in a non-UTF-8 code page, so today
I wrote a little script to recode
all messages to UTF-8. But I found that
email.header.decode_header parses some headers incorrectly.
For example, the header
Content-Type: application/x-msword; name="2008
=?windows-1251?B?wu7v8O7x+w==?= 2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="
is parsed as
[('application/x-msword; name="2008', None),
('\xc2\xee\xef\xf0\xee\xf1\xfb', 'windows-1251'), ('2
=?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="', None)]
which is obviously wrong.
I am now experimenting with the email/header.py file from
the Debian python 2.5 package
(it is the same in version 2.6.1, except that every <> was changed to !=).
If it is patched with
==================BEGIN CUT==================
--- oldheader.py 2009-01-16 01:47:32.553130030 +0300
+++ header.py 2009-01-16 01:47:16.783119846 +0300
@@ -39,7 +39,6 @@
\? # literal ?
(?P<encoded>.*?) # non-greedy up to the next ?= is the encoded string
\?= # literal ?=
- (?=[ \t]|$) # whitespace or the end of the string
''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)
# Field name regexp, including trailing colon, but not separating whitespace,
==================END CUT==================
it works fine.
So I wonder whether this
(?=[ \t]|$) # whitespace or the end of the string
is really needed. After all, if there is only
whitespace after the encoded word, it is just
appended to the list by
parts = ecre.split(line)
--
Also, there is a related mailing list thread:
http://mail.python.org/pipermail/python-dev/2009-January/085088.html
|
|||
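The effect of the lookahead discussed above can be reproduced outside the email package. The following is a minimal sketch, not the module's actual source: the pattern is rebuilt from the fragment quoted in the patch, the names `strict`, `lenient` and `HEADER` are made up for illustration, and the sample header is joined onto one line with a single space, which is an assumption about how the original was folded.

```python
import base64
import re

# Pattern body rebuilt from the lines quoted in the patch (an approximation
# of the module's encoded-word regex, not a copy of it).
_BODY = r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # non-greedy up to the next ? is the charset
  \?                    # literal ?
  (?P<encoding>[qb])    # either a "q" or a "b", case insensitive
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
'''
_FLAGS = re.VERBOSE | re.IGNORECASE | re.MULTILINE

# With the lookahead: the encoded-word must be followed by whitespace or EOL.
strict = re.compile(_BODY + r'(?=[ \t]|$)', _FLAGS)
# Without the lookahead: any closing ?= ends the encoded-word.
lenient = re.compile(_BODY, _FLAGS)

# Sample header from the report, joined onto one line (assumption).
HEADER = ('Content-Type: application/x-msword; name="2008 '
          '=?windows-1251?B?wu7v8O7x+w==?= 2 '
          '=?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="')

for label, pattern in (("strict", strict), ("lenient", lenient)):
    words = [m.group(0) for m in pattern.finditer(HEADER)]
    print(label, len(words), words)

# Decode the first encoded-word by hand to show what the payload contains.
first = lenient.search(HEADER)
print(base64.b64decode(first.group("encoded")).decode(first.group("charset")))
```

In this sketch the second encoded-word is followed directly by a double quote, so the strict pattern skips it while the relaxed one matches it, which mirrors the split shown in the report.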
| msg79938 - (view) | Author: Gabriel Genellina (ggenellina) | Date: 2009-01-16 07:32 | |
Your example header is invalid. Excerpt from RFC 2047 <http://www.ietf.org/rfc/rfc2047.txt>, section 5:
+ An 'encoded-word' MUST NOT be used in parameter of a MIME Content-Type or Content-Disposition field, or in any structured field body except within a 'comment' or 'phrase'.
Even in the places where an "encoded word" (the sequence =?...?=) is allowed, it must always be surrounded by whitespace -- this is by design in the RFC.
If you have many of those invalid headers, you'll have to "cook" the output of decode_header, possibly detecting malformed sequences and calling decode_header again with just the offending substring.
I don't think that Python should accept malformed headers, but if you come to a good solution you may publish the recipe in the Python cookbook <http://www.activestate.com/ASPN/Python/Cookbook/>.
I'd close this report as invalid. |
|||
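One way to approach the "cooking" suggested above is sketched below. This is an illustration only, not a recipe from the thread: instead of calling decode_header again on each offending substring, it decodes every =?...?= sequence directly with base64 and quopri, and the names `LENIENT_WORD` and `decode_lenient` are hypothetical. It deliberately ignores the RFC 2047 rule about dropping whitespace between adjacent encoded-words.

```python
import base64
import quopri
import re

# Relaxed encoded-word pattern: no requirement for surrounding whitespace.
LENIENT_WORD = re.compile(r'=\?([^?]*?)\?([qb])\?(.*?)\?=', re.IGNORECASE)


def decode_lenient(value):
    """Replace every =?charset?enc?payload?= sequence with decoded text.

    Sequences that cannot be decoded (unknown charset, bad base64,
    non-ASCII payload) are left untouched.
    """
    def _decode(match):
        charset, encoding, payload = match.groups()
        try:
            if encoding.lower() == 'b':
                raw = base64.b64decode(payload)
            else:
                # header=True makes "_" decode to a space, as in Q encoding.
                raw = quopri.decodestring(payload.encode('ascii'), header=True)
            return raw.decode(charset, errors='replace')
        except (LookupError, ValueError):
            return match.group(0)

    return LENIENT_WORD.sub(_decode, value)


if __name__ == '__main__':
    # Encoded-word glued to a closing quote, as in the reported header.
    print(decode_lenient('name="2008 =?windows-1251?B?wu7v8O7x+w==?="'))
```

Because the sketch accepts input that RFC 2047 forbids, it is better suited to a one-off migration script than to general header parsing.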
| msg81069 - (view) | Author: Tom Lynn (tlynn) | Date: 2009-02-03 17:05 | |
Duplicates issue 1047. |
|||
| msg81070 - (view) | Author: Tom Lynn (tlynn) | Date: 2009-02-03 17:06 | |
Oops, duplicates issue 1079 even. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:56:44 | admin | set | github: 49208 |
| 2009-03-27 20:50:57 | amaury.forgeotdarc | set | status: open -> closed; resolution: duplicate; superseder: decode_header does not follow RFC 2047 |
| 2009-02-03 17:06:36 | tlynn | set | messages: + msg81070 |
| 2009-02-03 17:05:27 | tlynn | set | nosy: + tlynn; messages: + msg81069 |
| 2009-01-16 07:32:07 | ggenellina | set | nosy: + ggenellina; messages: + msg79938 |
| 2009-01-15 23:33:08 | oxij | create | |