
This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: xmllib unable to parse "german scharfes ß" in UTF8 format
Type: Stage:
Components: XML Versions:
process
Status: closed Resolution: wont fix
Dependencies: Superseder:
Assigned To: sjoerd Nosy List: ajung, nobody, sjoerd
Priority: normal Keywords:

Created on 2000-11-16 14:10 by ajung, last changed 2022-04-10 16:03 by admin. This issue is now closed.

Messages (3)
msg2420 - Author: Andreas Jung (ajung) Date: 2000-11-16 14:10
The xmllib.XMLParser seems to be unable to parse
an XML file that contains 0xc3 0x9f (the UTF-8 representation
of the German ß).
Python 2.0 (Linux i386) always gives me the following
traceback:
suxlap@/tmp/xx(2)% python test.py test.xml
<?xml version="1.0" encoding="UTF-8" ?>
<test>Ãxüöä</test>
Traceback (most recent call last):
  File "test.py", line 20, in ?
    pp.feed(data)
  File "/opt/python-2.0/lib/python2.0/xmllib.py", line 165, in feed
    self.goahead(0)
  File "/opt/python-2.0/lib/python2.0/xmllib.py", line 261, in goahead
    self.syntax_error('illegal character in content')
  File "/opt/python-2.0/lib/python2.0/xmllib.py", line 786, in syntax_error
    raise RuntimeError, 'Syntax error at line %d: %s' % (self.lineno, message)
RuntimeError: Syntax error at line 3: illegal character in content 
Other UTF-8 characters seem to work.
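The byte pair named in the report can be checked with a short sketch. This is an illustration in modern Python 3, not the reporter's test.py (which is not shown in the thread), and the variable names are made up for the example. It shows that ß encodes to 0xc3 0x9f in UTF-8, and that reading those two bytes as ISO-8859-1 yields "Ã" followed by a control character, which may explain why the echoed line above starts with "Ã".

```python
# Illustrative sketch only (Python 3); names here are hypothetical.
eszett = "\u00df"  # LATIN SMALL LETTER SHARP S, i.e. "ß"

# UTF-8 encodes this single character as two bytes.
utf8_bytes = eszett.encode("utf-8")
print(utf8_bytes)  # the two bytes 0xc3 0x9f

# A Latin-1-only reader sees two separate characters instead of one:
# U+00C3 ("Ã") and U+009F (a C1 control character).
misread = utf8_bytes.decode("latin-1")
print([hex(ord(ch)) for ch in misread])
```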
msg2421 - Author: Nobody/Anonymous (nobody) Date: 2000-11-16 18:58
Works with accept_utf=1 as an additional parameter,
but using the xml package instead.
msg2422 - Author: Sjoerd Mullender (sjoerd) * (Python committer) Date: 2000-11-24 09:47
The problem here is the character reference x. xmllib predates Python's support for Unicode, so it doesn't support any characters that are not representable in 8 bits. It only really supports ISO-8859-1 (Latin-1), and not even the UTF-8 encoding of Latin-1. It also doesn't do the right thing for character references outside the ASCII range, although it will accept character references in the range 0-255 (decimal).
It is too much work to fix this.
I will make available a rewrite of xmllib that has full Unicode support and, what's more, is a validating XML parser. The main problem with this rewrite is that it is pretty slow: it uses many large regular expressions, and compiling them is time-consuming. Mail me if you want a copy.
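For readers on a current Python, here is a minimal sketch of parsing the same kind of document with a Unicode-aware parser from the xml package. It uses xml.etree.ElementTree from the Python 3 standard library, which is an assumption on my part: it is neither the rewrite mentioned above nor something that shipped with Python 2.0. The sample document is reconstructed from the report's description, not copied from the original test.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a UTF-8 encoded document containing "ßüöä".
document = (
    '<?xml version="1.0" encoding="UTF-8" ?>\n'
    "<test>\u00df\u00fc\u00f6\u00e4</test>"
).encode("utf-8")

# fromstring() accepts bytes and honours the declared encoding.
root = ET.fromstring(document)
print(root.tag, root.text)

# A numeric character reference beyond ASCII is resolved as well.
ref_root = ET.fromstring("<test>&#xDF;</test>")
print(ref_root.text)
```

Both calls return the text as a Unicode string, so no 8-bit limit applies to the element content.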
History
Date                 User   Action  Args
2022-04-10 16:03:29  admin  set     github: 33481
2000-11-16 14:10:42  ajung  create
