I am very new to Python, and also this is my first time trying to parse XML.
I am interested in information within str elements. I can identify that information using the str@name attribute value.
def get_cg_resources(pref_label, count=10):
    r = request_that_has_the_xml
    ns = {'ns':"http://www.loc.gov/zing/srw/"}
    tree = ET.ElementTree(ET.fromstring(r.text))
    records =[]
    for elem in tree.iter(tag='{http://www.loc.gov/zing/srw/}record'):
        record = {
            'title':'',
            'source': '',
            'snippet': '',
            'link':'',
            'image':'',
            'adapter':'CG'
        }
        for value in elem.iter(tag='str'):
            attr = value.attrib['name']
            if(attr == 'dc.title'):
                record['title'] = value.text
            elif(attr == 'authority_name'):
                record['source'] = value.text
            elif(attr == 'dc.description'):
                record['snippet'] = value.text
            elif(attr == 'dc.related.link' ):
                record['link'] = value.text
            elif(attr == 'cached_thumbnail'):
                img_part = value.text
                record['image'] = "http://urlbase%s" % img_part
        records.append(record)
    return records
Is this approach correct/efficient for extracting the information I need? Should I be searching for the str elements differently?
Any suggestions for improvements are welcome.
- Is request_that_has_the_xml a global variable? Why isn't it a parameter? (Attilio, Mar 31, 2015 at 20:08)
- You can ignore that line, just know that it gives the XML string. (latusaki, Apr 1, 2015 at 8:04)
1 Answer
def get_cg_resources(pref_label, count=10):
    r = request_that_has_the_xml
    ns = {'ns':"http://www.loc.gov/zing/srw/"}
    tree = ET.ElementTree(ET.fromstring(r.text))
You don't need an ElementTree to extract data from XML; the Element returned by ET.fromstring is enough.
root = ET.fromstring(r.text)
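As a small illustration of why the wrapper adds nothing here, the sketch below parses a made-up two-element document (the XML string is purely hypothetical) and shows that iterating the Element and iterating an ElementTree built around it visit the same nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document, only for illustration.
root = ET.fromstring("<a><b>1</b><b>2</b></a>")

# Wrapping the root in an ElementTree is optional for read-only access.
tree = ET.ElementTree(root)

from_root = [e.text for e in root.iter('b')]
from_tree = [e.text for e in tree.iter('b')]

print(tree.getroot() is root)   # the tree just holds a reference to root
print(from_root == from_tree)   # both iterations see the same elements
```

The ElementTree class is mainly useful when you need to parse from or write to a file.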
If 'str' tags appear only inside 'record' tags, you don't have to find the 'record' tags first; you can look for the 'str' tags directly. A recursive search visits all descendants of an element, not just its direct children.
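You already have a dict that maps a prefix to the namespace, so you don't have to spell out the full URI in every tag; 'prefix:tag' plus the dict is enough. Note that the dict is accepted by find, findall and iterfind, but not by iter, which takes only a tag name. A recursive search with the prefix is therefore written with iterfind and a './/' path:
    for elem in root.iterfind('.//ns:str', ns):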
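To make the two namespace spellings concrete, here is a minimal sketch on a heavily simplified, hypothetical response (the element layout in SAMPLE_XML is an assumption, not the real service output). In xml.etree.ElementTree, iter() accepts only a tag, so the namespace has to be written in braces, while iterfind() and findall() accept the prefix dict.

```python
import xml.etree.ElementTree as ET

NS = {'ns': 'http://www.loc.gov/zing/srw/'}

# Hypothetical, simplified response; the real layout may differ.
SAMPLE_XML = """\
<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <records>
    <record><recordData>a</recordData></record>
    <record><recordData>b</recordData></record>
  </records>
</searchRetrieveResponse>"""

root = ET.fromstring(SAMPLE_XML)

# iter() takes only a tag, so the namespace URI goes inside braces.
via_iter = list(root.iter('{http://www.loc.gov/zing/srw/}record'))

# iterfind() accepts the prefix dict; './/' searches all descendants.
via_iterfind = list(root.iterfind('.//ns:record', NS))

print(len(via_iter), len(via_iterfind))
```

Both calls find the same elements in document order; the prefix form is mostly a readability choice.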
If there are 'str' tags that are contained in other tags that you don't want, then you have to first find 'record' tags.
    records =[]
    for elem in root.iterfind('.//ns:record', ns):
        record = {
            'title':'',
            'source': '',
            'snippet': '',
            'link':'',
            'image':'',
            'adapter':'CG'
        }
record can be initialized as follows:
        record = dict.fromkeys(['title','source','snippet','link','image'],'')
        record['adapter']='CG'
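A quick check, as a sketch, that the dict.fromkeys form builds the same dict as the literal. It also shows one caveat worth knowing: fromkeys stores the same value object under every key, which is harmless for an immutable default like '' but surprising for a mutable one such as a list.

```python
literal = {
    'title': '', 'source': '', 'snippet': '',
    'link': '', 'image': '', 'adapter': 'CG',
}

record = dict.fromkeys(['title', 'source', 'snippet', 'link', 'image'], '')
record['adapter'] = 'CG'
print(record == literal)

# Caveat: every key shares one value object, so avoid mutable defaults.
shared = dict.fromkeys(['a', 'b'], [])
shared['a'].append(1)
print(shared['b'])   # the list under 'b' changed too
```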
        # Use elem.iter('str') instead if the str elements are not in the srw namespace.
        for value in elem.iterfind('.//ns:str', ns):
            attr = value.attrib['name']
            if(attr == 'dc.title'):
                record['title'] = value.text
            elif(attr == 'authority_name'):
                record['source'] = value.text
            elif(attr == 'dc.description'):
                record['snippet'] = value.text
            elif(attr == 'dc.related.link' ):
                record['link'] = value.text
            elif(attr == 'cached_thumbnail'):
                img_part = value.text
                record['image'] = "http://urlbase%s" % img_part
        records.append(record)
    return records
The above code extracts the text of the 'str' tags contained in 'record' tags, using the value of each tag's 'name' attribute to decide which field of the record it fills.
If you want a generator, you can simply replace records.append(record) with yield record and delete return records and records = []. This is more memory-efficient when the list is huge, because each record is produced only when the caller asks for it.
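The generator variant might look roughly like the sketch below. Several things here are assumptions for illustration: the function name iter_cg_records, passing the XML text in as a parameter, the sample document, and the str elements living in the srw namespace (use elem.iter('str') if they do not).

```python
import xml.etree.ElementTree as ET

NS = {'ns': 'http://www.loc.gov/zing/srw/'}

# Hypothetical sample; assumes the str elements share the srw namespace.
SAMPLE_XML = """\
<searchRetrieveResponse xmlns="http://www.loc.gov/zing/srw/">
  <records>
    <record>
      <str name="dc.title">First</str>
      <str name="cached_thumbnail">/a.png</str>
    </record>
    <record>
      <str name="dc.title">Second</str>
    </record>
  </records>
</searchRetrieveResponse>"""


def iter_cg_records(xml_text):
    """Yield one record dict at a time instead of building a list."""
    root = ET.fromstring(xml_text)
    for elem in root.iterfind('.//ns:record', NS):
        record = dict.fromkeys(
            ['title', 'source', 'snippet', 'link', 'image'], '')
        record['adapter'] = 'CG'
        for value in elem.iterfind('.//ns:str', NS):
            attr = value.attrib['name']
            if attr == 'dc.title':
                record['title'] = value.text
            elif attr == 'authority_name':
                record['source'] = value.text
            elif attr == 'dc.description':
                record['snippet'] = value.text
            elif attr == 'dc.related.link':
                record['link'] = value.text
            elif attr == 'cached_thumbnail':
                record['image'] = "http://urlbase%s" % value.text
        yield record


records = iter_cg_records(SAMPLE_XML)
first = next(records)   # only the first record has been built so far
titles = [first['title']] + [r['title'] for r in records]
print(titles)
```

Callers that do need a list can still write list(iter_cg_records(xml_text)). Note that ET.fromstring parses the whole document up front, so the saving is in the record dicts, not in the parsed tree.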