\$\begingroup\$

I am very new to Python, and also this is my first time trying to parse XML.
I am interested in information within str elements. I can identify that information using the str@name attribute value.

    def get_cg_resources(pref_label, count=10):
        r = request_that_has_the_xml
        ns = {'ns':"http://www.loc.gov/zing/srw/"}
        tree = ET.ElementTree(ET.fromstring(r.text))
        records =[]
        for elem in tree.iter(tag='{http://www.loc.gov/zing/srw/}record'):
            record = {
                'title':'',
                'source': '',
                'snippet': '',
                'link':'',
                'image':'',
                'adapter':'CG'
            }
            for value in elem.iter(tag='str'):
                attr = value.attrib['name']
                if(attr == 'dc.title'):
                    record['title'] = value.text
                elif(attr == 'authority_name'):
                    record['source'] = value.text
                elif(attr == 'dc.description'):
                    record['snippet'] = value.text
                elif(attr == 'dc.related.link' ):
                    record['link'] = value.text
                elif(attr == 'cached_thumbnail'):
                    img_part = value.text
                    record['image'] = "http://urlbase%s" % img_part
            records.append(record)
        return records

Is this approach correct/efficient for extracting the information I need? Should I be searching for the str elements differently?

Any suggestions for improvements are welcome.

Veedrac
asked Mar 31, 2015 at 12:09
\$\endgroup\$
  • Is request_that_has_the_xml a global variable? Why isn't it a parameter? Commented Mar 31, 2015 at 20:08
  • You can ignore that line, just know that it gives the XML string. Commented Apr 1, 2015 at 8:04

1 Answer

\$\begingroup\$
    def get_cg_resources(pref_label, count=10):
        r = request_that_has_the_xml
        ns = {'ns': "http://www.loc.gov/zing/srw/"}
        tree = ET.ElementTree(ET.fromstring(r.text))

You don't need an ElementTree to extract data from XML; the Element returned by ET.fromstring is enough.

        root = ET.fromstring(r.text)
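
As a small illustration of that point, the sketch below parses a made-up XML string (a stand-in for r.text, which the question does not show) and checks that iterating from the Element gives the same elements as iterating from an ElementTree wrapper. The sample document and variable names are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for r.text; the real response is not shown in the question.
xml_text = "<root><child>hello</child></root>"

root = ET.fromstring(xml_text)   # fromstring() already returns an Element
tree = ET.ElementTree(root)      # optional wrapper, mostly useful for parse()/write()

# Both objects expose iter(), and both walk the same elements in document order.
tags_from_root = [el.tag for el in root.iter()]
tags_from_tree = [el.tag for el in tree.iter()]
```

Since both lists come out the same, the wrapper adds nothing when all you do is read data.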

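If 'str' elements occur only inside 'record' elements, you don't have to find the 'record' elements first. You can search for 'str' directly, because the search below descends recursively through all descendants.

You already have a dict that maps a prefix to the namespace, so you don't have to spell out the full URI in every tag name: 'prefix:tag' plus the dict is enough. Note that iter() does not accept a namespace dict; findall() and iterfind() do, and a leading './/' makes them search all descendants.

        for elem in root.iterfind('.//ns:str', ns):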
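
One caveat about namespaces in the standard library: Element.iter() takes only a tag argument, while findall() and iterfind() accept a prefix-to-URI dict. The sketch below shows both spellings side by side on a made-up document. It assumes that 'str' lives in the same namespace as 'record', which the question does not confirm.

```python
import xml.etree.ElementTree as ET

NS = {'ns': 'http://www.loc.gov/zing/srw/'}

# Hypothetical sample; assumes <str> is in the same namespace as <record>.
xml_text = """
<ns:searchRetrieveResponse xmlns:ns="http://www.loc.gov/zing/srw/">
  <ns:record><ns:str name="dc.title">A title</ns:str></ns:record>
</ns:searchRetrieveResponse>
"""
root = ET.fromstring(xml_text)

# iterfind()/findall() take a prefix map; './/' searches all descendants.
via_prefix = [el.text for el in root.iterfind('.//ns:str', NS)]

# iter() only understands the fully qualified tag in {uri}local form.
via_clark = [el.text for el in root.iter('{%s}str' % NS['ns'])]
```

If the real 'str' elements turn out to have no namespace, as the question's iter(tag='str') suggests, the plain tag name 'str' is the one to search for.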

If there are 'str' tags that are contained in other tags that you don't want, then you have to first find 'record' tags.

        records = []
        # iterfind() accepts the namespace dict; iter() does not.
        for elem in root.iterfind('.//ns:record', ns):
            record = {
                'title': '',
                'source': '',
                'snippet': '',
                'link': '',
                'image': '',
                'adapter': 'CG'
            }

record can be initialized as follows,

            # Every field starts out as an empty string; 'adapter' is fixed.
            record = dict.fromkeys(['title', 'source', 'snippet', 'link', 'image'], '')
            record['adapter'] = 'CG'
            # iterfind() accepts the namespace dict; iter() does not.
            for value in elem.iterfind('.//ns:str', ns):
                attr = value.attrib['name']
                if attr == 'dc.title':
                    record['title'] = value.text
                elif attr == 'authority_name':
                    record['source'] = value.text
                elif attr == 'dc.description':
                    record['snippet'] = value.text
                elif attr == 'dc.related.link':
                    record['link'] = value.text
                elif attr == 'cached_thumbnail':
                    img_part = value.text
                    record['image'] = "http://urlbase%s" % img_part
            records.append(record)
        return records

The code above collects the text of the 'str' elements inside each 'record' element, using each element's 'name' attribute to decide which field of the record it fills.
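
To make the whole flow concrete, here is a self-contained sketch that runs against a made-up response. It swaps the if/elif chain for a lookup dict, which is my own variation rather than part of the original code. The names parse_records, FIELD_FOR_ATTR and SAMPLE are hypothetical, and the sample XML assumes that 'str' shares the namespace of 'record'.

```python
import xml.etree.ElementTree as ET

NS = {'ns': 'http://www.loc.gov/zing/srw/'}

# Maps the value of the name attribute to the key used in the result dict.
FIELD_FOR_ATTR = {
    'dc.title': 'title',
    'authority_name': 'source',
    'dc.description': 'snippet',
    'dc.related.link': 'link',
}


def parse_records(xml_text):
    """Return one dict per <record>, filled from its <str name="..."> children."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind('.//ns:record', NS):
        record = dict.fromkeys(['title', 'source', 'snippet', 'link', 'image'], '')
        record['adapter'] = 'CG'
        for value in rec.iterfind('.//ns:str', NS):
            attr = value.get('name')   # None if the attribute is missing
            text = value.text or ''    # empty elements have text None
            if attr in FIELD_FOR_ATTR:
                record[FIELD_FOR_ATTR[attr]] = text
            elif attr == 'cached_thumbnail':
                record['image'] = 'http://urlbase%s' % text
        records.append(record)
    return records


# Hypothetical response body, invented for this example.
SAMPLE = """
<ns:searchRetrieveResponse xmlns:ns="http://www.loc.gov/zing/srw/">
  <ns:records>
    <ns:record>
      <ns:str name="dc.title">First title</ns:str>
      <ns:str name="authority_name">Some source</ns:str>
      <ns:str name="cached_thumbnail">/img/1.png</ns:str>
    </ns:record>
    <ns:record>
      <ns:str name="dc.title">Second title</ns:str>
    </ns:record>
  </ns:records>
</ns:searchRetrieveResponse>
"""

results = parse_records(SAMPLE)
```

With the lookup dict, supporting another field means adding one entry instead of another elif branch. The thumbnail stays a special case because it needs the URL prefix.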

If you want a generator, replace records.append(record) with yield record and delete both records = [] and return records. This avoids building the whole list in memory, which helps when there are many records.
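
A minimal sketch of the generator variant follows, trimmed to the title field to keep it short. The function name iter_records and the sample XML are assumptions for illustration, and 'str' is again assumed to share the namespace of 'record'.

```python
import xml.etree.ElementTree as ET

NS = {'ns': 'http://www.loc.gov/zing/srw/'}


def iter_records(xml_text):
    """Yield one dict per <record> instead of collecting them in a list."""
    root = ET.fromstring(xml_text)
    for rec in root.iterfind('.//ns:record', NS):
        record = {'title': '', 'adapter': 'CG'}
        for value in rec.iterfind('.//ns:str', NS):
            if value.get('name') == 'dc.title':
                record['title'] = value.text or ''
        yield record   # hand each record to the caller as soon as it is built


# Hypothetical response body, invented for this example.
SAMPLE = """
<ns:searchRetrieveResponse xmlns:ns="http://www.loc.gov/zing/srw/">
  <ns:record><ns:str name="dc.title">First</ns:str></ns:record>
  <ns:record><ns:str name="dc.title">Second</ns:str></ns:record>
</ns:searchRetrieveResponse>
"""

gen = iter_records(SAMPLE)
first = next(gen)    # only the first record has been built at this point
rest = list(gen)     # consume the remaining records
```

Note that ET.fromstring still parses the whole document into memory first, so the generator only saves the cost of the result list, not of the XML tree itself.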

answered Apr 1, 2015 at 14:41
\$\endgroup\$
