This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: email/header.py ecre regular expression issue
Type: behavior
Stage:
Components: Library (Lib)
Versions: Python 2.6, Python 2.5
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: decode_header does not follow RFC 2047 (View: 1079)
Assigned To:
Nosy List: ggenellina, oxij, tlynn
Priority: normal
Keywords:

Created on 2009-01-15 23:33 by oxij, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg79927 - (view) Author: Jan Malakhovski (oxij) Date: 2009-01-15 23:33
Hello.
I have a dedicated mail server at home,
and it holds about 1G of mail.
Most of the mail is in a non-UTF-8 code page, so today
I wrote a little script that should recode
all letters to UTF-8. But I found that
email.header.decode_header parses some headers incorrectly.
For example, the header
Content-Type: application/x-msword; name="2008
=?windows-1251?B?wu7v8O7x+w==?= 2 =?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="
is parsed as
[('application/x-msword; name="2008', None),
('\xc2\xee\xef\xf0\xee\xf1\xfb', 'windows-1251'), ('2
=?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="', None)]
which is obviously wrong.
I am now experimenting with the email/header.py file from the
Python 2.5 Debian package
(it is the same in version 2.6.1, except that every <> was changed to !=).
If it is patched with
==================BEGIN CUT==================
--- oldheader.py	2009-01-16 01:47:32.553130030 +0300
+++ header.py	2009-01-16 01:47:16.783119846 +0300
@@ -39,7 +39,6 @@
 \? # literal ?
 (?P<encoded>.*?) # non-greedy up to the next ?= is the encoded string
 \?= # literal ?=
- (?=[ \t]|$) # whitespace or the end of the string
 ''', re.VERBOSE | re.IGNORECASE | re.MULTILINE)
 
 # Field name regexp, including trailing colon, but not separating whitespace,
==================END CUT==================
it works fine.
So I wonder whether
 (?=[ \t]|$) # whitespace or the end of the string
is really needed. After all, if there is only
whitespace after an encoded word, it is just
appended to the list by
parts = ecre.split(line)
--
Also, there is a related mailing list thread:
http://mail.python.org/pipermail/python-dev/2009-January/085088.html
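To illustrate the question above, here is a minimal Python 3 sketch that compares the pattern with and without the trailing lookahead on the example header value. It does not import the real email/header.py. The last four pattern lines are taken from the diff; the charset and encoding lines above them are not shown in the hunk and are assumptions, as are the names PREFIX, LOOKAHEAD, strict and relaxed. The header value is joined onto one line with a single space, which is also an assumption.

```python
import re

# Pattern body. Only the lines from "literal ?" down to "literal ?=" appear
# in the diff hunk; the charset and encoding groups are assumed.
PREFIX = r'''
  =\?                   # literal =?
  (?P<charset>[^?]*?)   # charset (assumed, not shown in the hunk)
  \?                    # literal ?
  (?P<encoding>[qb])    # "q" or "b" (assumed, not shown in the hunk)
  \?                    # literal ?
  (?P<encoded>.*?)      # non-greedy up to the next ?= is the encoded string
  \?=                   # literal ?=
'''
# The line that the proposed patch removes.
LOOKAHEAD = r'''
  (?=[ \t]|$)           # whitespace or the end of the string
'''
FLAGS = re.VERBOSE | re.IGNORECASE | re.MULTILINE

strict = re.compile(PREFIX + LOOKAHEAD, FLAGS)   # with the lookahead
relaxed = re.compile(PREFIX, FLAGS)              # patched variant

# Example header value from the report, joined onto one line.
value = ('application/x-msword; name="2008 '
         '=?windows-1251?B?wu7v8O7x+w==?= 2 '
         '=?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="')

strict_words = [m.group('encoded') for m in strict.finditer(value)]
relaxed_words = [m.group('encoded') for m in relaxed.finditer(value)]
```

With the lookahead, only the first encoded word is found, because the second one is followed by a double quote rather than whitespace; without it, both are found.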
msg79938 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-01-16 07:32
Your example header is invalid. Excerpt from RFC 2047
<http://www.ietf.org/rfc/rfc2047.txt>, section 5:
 + An 'encoded-word' MUST NOT be used in parameter of a MIME
 Content-Type or Content-Disposition field, or in any structured
 field body except within a 'comment' or 'phrase'.
Even in the places where an "encoded word" (the sequence =?...?=) is 
allowed, it must always be surrounded by whitespace -- this is by 
design in the RFC.
If you have many of those invalid headers, you'll have to "cook" the
output of decode_header, possibly detecting malformed sequences and
calling decode_header again with just the offending substring.
I don't think that Python should accept malformed headers - but if you 
come to a good solution you may publish the recipe in the Python 
cookbook <http://www.activestate.com/ASPN/Python/Cookbook/>
I'd close this report as invalid.
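The "cooking" approach suggested above could look roughly like the following Python 3 sketch. It is not part of the email package: decode_lenient and ENCODED_WORD are hypothetical names, and the relaxed pattern is an assumption about what counts as an encoded word. Each candidate substring is passed to decode_header on its own, so surrounding characters such as quotes no longer matter. The sketch targets Python 3, whose decode_header may behave differently from the 2.5/2.6 versions discussed here.

```python
import re
from email.header import decode_header

# Hypothetical relaxed pattern: matches =?charset?b-or-q?text?= anywhere,
# without requiring whitespace around it.
ENCODED_WORD = re.compile(r'=\?[^?\s]+\?[bq]\?[^?\s]*\?=', re.IGNORECASE)


def decode_lenient(value):
    """Split value into (chunk, charset) pairs.

    Literal text is returned as (str, None); every encoded word is handed
    to decode_header by itself and comes back as (bytes, charset).
    """
    parts = []
    pos = 0
    for match in ENCODED_WORD.finditer(value):
        if match.start() > pos:
            # Literal text before this encoded word.
            parts.append((value[pos:match.start()], None))
        # Decode just the offending substring.
        parts.extend(decode_header(match.group(0)))
        pos = match.end()
    if pos < len(value):
        # Literal text after the last encoded word.
        parts.append((value[pos:], None))
    return parts


value = ('application/x-msword; name="2008 '
         '=?windows-1251?B?wu7v8O7x+w==?= 2 '
         '=?windows-1251?B?4+7kIDgwONUwMC5kb2M=?="')
parts = decode_lenient(value)
```

Decoding each bytes chunk with its reported charset and joining the results is left to the caller, since a malformed header may also name a charset that does not match its contents.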
msg81069 - (view) Author: Tom Lynn (tlynn) Date: 2009-02-03 17:05
Duplicates issue1047.
msg81070 - (view) Author: Tom Lynn (tlynn) Date: 2009-02-03 17:06
Oops, duplicates issue 1079 even.
History
Date                 User                Action  Args
2022-04-11 14:56:44  admin               set     github: 49208
2009-03-27 20:50:57  amaury.forgeotdarc  set     status: open -> closed
                                                 resolution: duplicate
                                                 superseder: decode_header does not follow RFC 2047
2009-02-03 17:06:36  tlynn               set     messages: + msg81070
2009-02-03 17:05:27  tlynn               set     nosy: + tlynn
                                                 messages: + msg81069
2009-01-16 07:32:07  ggenellina          set     nosy: + ggenellina
                                                 messages: + msg79938
2009-01-15 23:33:08  oxij                create
