Python XML - extracting information

Question 1

I am very new to Python, and also this is my first time trying to parse XML.
I am interested in information within str elements. I can identify that information using the str@name attribute value.

def get_cg_resources(pref_label, count=10):
    r = request_that_has_the_xml
    ns = {'ns':"http://www.loc.gov/zing/srw/"}
    tree = ET.ElementTree(ET.fromstring(r.text))
    records =[]
    for elem in tree.iter(tag='{http://www.loc.gov/zing/srw/}record'):
        record = {
            'title':'',
            'source': '',
            'snippet': '',
            'link':'',
            'image':'',
            'adapter':'CG'
        }
        for value in elem.iter(tag='str'):
            attr = value.attrib['name']
            if(attr == 'dc.title'):
                record['title'] = value.text
            elif(attr == 'authority_name'):
                record['source'] = value.text
            elif(attr == 'dc.description'):
                record['snippet'] = value.text
            elif(attr == 'dc.related.link' ):
                record['link'] = value.text
            elif(attr == 'cached_thumbnail'):
                img_part = value.text
                record['image'] = "http://urlbase%s" % img_part
        records.append(record)
    return records

Is this approach correct/efficient for extracting the information I need? Should I be searching for the str elements differently?

Any suggestions for improvements are welcome.

Question 2

Is request_that_has_the_xml a global variable? Why isn't it a parameter?

Question 3

You can ignore that line, just know that it gives the XML string

Question 4

def get_cg_resources(pref_label, count=10):
    r = request_that_has_the_xml
    ns = {'ns':"http://www.loc.gov/zing/srw/"}
    tree = ET.ElementTree(ET.fromstring(r.text))

You don't need an ElementTree to extract data from XML; the Element returned by ET.fromstring is enough.

    root = ET.fromstring(r.text)
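
As a small illustration, the sketch below shows that the Element returned by ET.fromstring supports the same iteration as the ElementTree wrapper. The XML string is made up for the example and stands in for r.text from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML, standing in for r.text in the question.
xml_text = "<root><str name='dc.title'>A title</str></root>"

root = ET.fromstring(xml_text)   # an Element
tree = ET.ElementTree(root)      # the wrapper used in the question

# Both walk the same elements; the wrapper is mainly useful for
# file-level operations such as tree.write().
titles_from_root = [e.text for e in root.iter('str')]
titles_from_tree = [e.text for e in tree.iter('str')]
```

For read-only extraction the wrapper can therefore be dropped without changing the result.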

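If 'str' elements occur only inside 'record' elements, you don't have to find the 'record' elements first; you can look for 'str' directly. The iter method walks the element and all of its descendants recursively.

The ns dict maps a prefix to the namespace URI, so you don't have to spell out the URI in every tag: write 'prefix:tag' and pass the dict. Note that only find, findall and iterfind accept this mapping; iter takes a single tag argument, which must be in the full '{uri}tag' form.

    for elem in root.iterfind('.//ns:str', ns):

This assumes the 'str' elements are in that namespace. The code in the question matches the bare tag 'str', which suggests they may not be; in that case root.iter('str') is enough.

If there are 'str' elements inside other elements that you don't want, then you have to find the 'record' elements first.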
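
To make the two lookup styles concrete, here is a minimal sketch on a made-up response. Only the namespace URI is taken from the question. The element layout is hypothetical, including the assumption that the str elements carry no namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical response; only the namespace URI comes from the question.
xml_text = """
<srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/">
  <srw:records>
    <srw:record><srw:recordData><doc><str name="dc.title">First</str></doc></srw:recordData></srw:record>
    <srw:record><srw:recordData><doc><str name="dc.title">Second</str></doc></srw:recordData></srw:record>
  </srw:records>
</srw:searchRetrieveResponse>
"""
ns = {'ns': 'http://www.loc.gov/zing/srw/'}
root = ET.fromstring(xml_text)

# iter() takes one tag argument in the fully qualified '{uri}tag' form ...
via_iter = list(root.iter('{http://www.loc.gov/zing/srw/}record'))
# ... while findall()/iterfind() accept 'prefix:tag' plus the mapping.
via_findall = root.findall('.//ns:record', ns)

# Elements without a namespace (here: str) are matched by their bare name.
titles = [s.text for s in root.iter('str')]
```

The prefix used in the mapping ('ns') does not have to match the prefix used in the document ('srw'); only the URI matters.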

    records = []
    for elem in root.iterfind('.//ns:record', ns):
        record = {
            'title': '',
            'source': '',
            'snippet': '',
            'link': '',
            'image': '',
            'adapter': 'CG'
        }

record can also be initialized as follows, with the rest of the loop unchanged:

        record = dict.fromkeys(['title', 'source', 'snippet', 'link', 'image'], '')
        record['adapter'] = 'CG'
        # Use elem.iter('str') instead if the 'str' elements have no namespace.
        for value in elem.iterfind('.//ns:str', ns):
            attr = value.attrib['name']
            if attr == 'dc.title':
                record['title'] = value.text
            elif attr == 'authority_name':
                record['source'] = value.text
            elif attr == 'dc.description':
                record['snippet'] = value.text
            elif attr == 'dc.related.link':
                record['link'] = value.text
            elif attr == 'cached_thumbnail':
                img_part = value.text
                record['image'] = "http://urlbase%s" % img_part
        records.append(record)
    return records

The code above extracts the text of the 'str' elements contained in each 'record' element, using the value of the 'name' attribute to decide which field of the record it fills.
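
Putting the pieces together, one possible self-contained version is sketched below. It takes the XML text as a parameter instead of reading request_that_has_the_xml, replaces the if/elif chain with a lookup table, and assumes the str elements are not namespaced, as the question's own code suggests. The names parse_cg_records and FIELD_MAP and the sample XML are hypothetical; "http://urlbase" is the placeholder from the question.

```python
import xml.etree.ElementTree as ET

NS = {'ns': 'http://www.loc.gov/zing/srw/'}

# Maps the name attribute of a str element to the key in the result dict.
FIELD_MAP = {
    'dc.title': 'title',
    'authority_name': 'source',
    'dc.description': 'snippet',
    'dc.related.link': 'link',
    'cached_thumbnail': 'image',
}

def parse_cg_records(xml_text):
    """Return one dict per record element found in xml_text."""
    root = ET.fromstring(xml_text)
    records = []
    for elem in root.iterfind('.//ns:record', NS):
        record = dict.fromkeys(FIELD_MAP.values(), '')
        record['adapter'] = 'CG'
        # Assumption: str elements have no namespace.
        for value in elem.iter('str'):
            key = FIELD_MAP.get(value.get('name'))
            if key is None:
                continue  # not a field we are interested in
            # Unlike the original, empty elements give '' instead of None.
            text = value.text or ''
            if key == 'image':
                text = 'http://urlbase%s' % text
            record[key] = text
        records.append(record)
    return records

# Hypothetical sample input.
sample = """
<srw:searchRetrieveResponse xmlns:srw="http://www.loc.gov/zing/srw/">
  <srw:records>
    <srw:record>
      <srw:recordData>
        <doc>
          <str name="dc.title">A title</str>
          <str name="authority_name">A source</str>
          <str name="cached_thumbnail">/thumb/1.png</str>
          <str name="ignored">x</str>
        </doc>
      </srw:recordData>
    </srw:record>
  </srw:records>
</srw:searchRetrieveResponse>
"""
records = parse_cg_records(sample)
```

With a lookup table, supporting a new field means adding one entry to FIELD_MAP rather than another elif branch.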

If you want a generator, replace records.append(record) with yield record and delete both records = [] and return records. This avoids building the whole list in memory, which helps if there are many records.
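
A minimal sketch of the generator variant follows, reduced to a single field to keep it short. The function name iter_cg_titles and the sample XML are hypothetical, and the str elements are again assumed to have no namespace. Note that ET.fromstring still parses the whole document up front; the generator only avoids holding the list of result dicts.

```python
import xml.etree.ElementTree as ET

NS = {'ns': 'http://www.loc.gov/zing/srw/'}

def iter_cg_titles(xml_text):
    """Yield one small dict per record instead of building a list."""
    root = ET.fromstring(xml_text)
    for elem in root.iterfind('.//ns:record', NS):
        record = {'title': '', 'adapter': 'CG'}
        for value in elem.iter('str'):
            if value.get('name') == 'dc.title':
                record['title'] = value.text
        yield record  # replaces records.append(record)

# Hypothetical sample input.
sample = (
    '<r xmlns:s="http://www.loc.gov/zing/srw/">'
    '<s:record><str name="dc.title">First</str></s:record>'
    '<s:record><str name="dc.title">Second</str></s:record>'
    '</r>'
)
gen = iter_cg_titles(sample)
first = next(gen)   # records are produced one at a time, on demand
rest = list(gen)    # consume whatever is left
```

Callers that need a list can still write list(iter_cg_titles(xml_text)).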

Nizam Mohamed · Answer 1 · 2015-04-01 14:41:47Z
