ElementTree XML parsing problem

Mike Mike at invalid.invalid
Wed Apr 27 14:26:05 EDT 2011


I'm using ElementTree to parse an XML file, but it stops at the second 
record (id = 002), which contains a non-ASCII character, ä. 
Here's the XML:

<?xml version="1.0"?>
<snapshot time="Mon Apr 25 08:47:23 PDT 2011">
<records>
<record id="001" education="High School" employment="7 yrs" />
<record id="002" education="Universität Bremen" employment="3 years" />
<record id="003" education="River College" employment="5 yrs" />
</records>
</snapshot>

The complaint offered up by the parser is

Unexpected error opening simple_fail.xml: not well-formed (invalid token): line 5, column 40

and if I change the line to eliminate the ä, everything is wonderful. 
The parser is perfectly happy with this modification:

<record id="002" education="University Bremen" employment="3 yrs" />

I can't find anything in the ElementTree docs about accepting non-ASCII 
characters or converting such text to Unicode.

Is there a way to coerce the text so it doesn't cause the parser to 
raise an exception?

Here's my test script (simple_fail.xml contains the offending line, and 
simple_pass.xml contains the line that passes).

import sys
import xml.etree.ElementTree as ET

def main():
    xml_files = ['simple_fail.xml', 'simple_pass.xml']
    for xml_file in xml_files:
        print
        print 'XML file: %s' % (xml_file)
        try:
            tree = ET.parse(xml_file)
        except Exception, inst:
            print "Unexpected error opening %s: %s" % (xml_file, inst)
            continue
        root = tree.getroot()
        records = root.find('records')
        for record in records:
            print record.attrib['id'], record.attrib['education']

if __name__ == "__main__":
    main()
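
One possible explanation, which is an assumption and not something verified against the actual file: simple_fail.xml may have been saved in a single-byte encoding such as ISO-8859-1, while the XML declaration names no encoding, so the parser falls back to UTF-8 and rejects the byte used for ä. The sketch below reproduces that situation with in-memory bytes. The helper name parse_education and the shortened record are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the file on disk is Latin-1, so 'ä' is the single byte 0xE4,
# which is not a valid UTF-8 sequence on its own.
BODY = (u'<records><record id="002" education="Universit\xe4t Bremen" />'
        u'</records>').encode('latin-1')


def parse_education(xml_bytes):
    """Parse the bytes and return the education attribute of the first record."""
    root = ET.fromstring(xml_bytes)
    return root.find('record').attrib['education']


# No encoding declared: the parser assumes UTF-8 and rejects the 0xE4 byte.
try:
    parse_education(b'<?xml version="1.0"?>' + BODY)
    failed = False
except ET.ParseError:
    failed = True

# Declaring the encoding the bytes are really in lets the parser decode them.
education = parse_education(
    b'<?xml version="1.0" encoding="ISO-8859-1"?>' + BODY)
```

If that assumption holds, either adding encoding="ISO-8859-1" to the XML declaration or re-saving the file as UTF-8 should let ET.parse read the record. If the file is in some other encoding, the declaration would need to name that one instead.
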
Thanks,
-- Mike --

