Changed title in accordance with other user request

edited Sep 8, 2019 at 16:48

Pawel Kam

Nested for-loops in xml parsing

Added links to articles mentioned in post

edited Sep 6, 2019 at 20:21

Pawel Kam

What I wrote was partly justified by being in a hurry, but to be honest I have no idea how to handle this more elegantly. I went through the methods proposed in several articles (for example this one or that one), but nothing came to mind. Maybe this can be done with a 'while True...break' statement or a getchildren() method hidden in a generator function...

asked Sep 6, 2019 at 20:14

Pawel Kam

Iterating over xml element children, and their children, and their children, etc. without nested for-loops

I wrote a program that reads and processes XML files submitted by external users. The part of the code in question iterates over an XML node's children, appends a dictionary (a future row in a table) to a list, and repeats the process for every child, its children, and so on.

An example XML node looks like this:

xmlstr = '<Root_Level><Level1><Foo_A>1065106.46</Foo_A><Foo_B>675706.31</Foo_B><Foo_B1>0.00</Foo_B1><Level1_A><Foo_A>23750.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_A><Level1_B><Foo_A>1041356.46</Foo_A><Foo_B>675706.31</Foo_B><Foo_B1>0.00</Foo_B1><Level1_B_1><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_B_1><Level1_B_2><Foo_A>466158.93</Foo_A><Foo_B>59838.40</Foo_B><Foo_B1>0.00</Foo_B1><Level1_B_2_1><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_B_2_1><Level1_B_2_2><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_B_2_2></Level1_B_2></Level1_B><Level1_C><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_C><Level1_D><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_D></Level1><Level2><Foo_A>1065106.46</Foo_A><Foo_B>675706.31</Foo_B><Foo_B1>0.00</Foo_B1><Level2_A><Foo_A>556001.19</Foo_A><Foo_B>138410.82</Foo_B><Foo_B1>0.00</Foo_B1><Level2_A_1><Foo_A>50000.00</Foo_A><Foo_B>50000.00</Foo_B><Foo_B1>0.00</Foo_B1></Level2_A_1></Level2_A><Level2_B><Foo_A>509105.27</Foo_A><Foo_B>537295.49</Foo_B><Foo_B1>0.00</Foo_B1><Level2_B_1><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level2_B_1><Level2_B_2><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level2_B_2></Level2_B></Level2></Root_Level>'

Code I’ve written (brace yourself, it's pretty ugly):

# code to help you get started
from lxml import etree
tree = etree.fromstring(xmlstr)
tree.xpath('.')[0].getchildren()
xml_node = tree.xpath('.')[0]
# real use case starts here
table = []
tmp = {}
for b in xml_node:
    lvl = 0
    tag = b.tag
    txt = b.text
    table.append(tmp)
    tmp = {}
    tmp['lbl'] = tag
    tmp['lvl'] = lvl

    for elem in b.getchildren():
        lvl = 1
        tag = elem.tag
        txt = elem.text
        if tag.lower().strip().startswith('foo'):
            tmp[tag] = txt
        else:
            table.append(tmp)
            tmp = {}
            tmp['lbl'] = tag
            tmp['lvl'] = lvl
            for el in elem.getchildren():
                lvl = 2
                tag = el.tag
                txt = el.text
                if tag.lower().strip().startswith('foo'):
                    tmp[tag] = txt
                else:
                    table.append(tmp)
                    tmp = {}
                    tmp['lbl'] = tag
                    tmp['lvl'] = lvl
                    for e in el.getchildren():
                        lvl = 3
                        tag = e.tag
                        txt = e.text
                        if tag.lower().strip().startswith('foo'):
                            tmp[tag] = txt
                        else:
                            table.append(tmp)
                            tmp = {}
                            tmp['lbl'] = tag
                            tmp['lvl'] = lvl
                            for e in e.getchildren():
                                lvl = 4
                                tag = e.tag
                                txt = e.text
                                if tag.lower().strip().startswith('foo'):
                                    tmp[tag] = txt
                                else:
                                    table.append(tmp)
                                    tmp = {}
                                    tmp['lbl'] = tag
                                    tmp['lvl'] = lvl
                                    for e in e.getchildren():
                                        lvl = 5
                                        tag = e.tag
                                        txt = e.text
                                        if tag.lower().strip().startswith('foo'):
                                            tmp[tag] = txt
                                        else:
                                            table.append(tmp)
                                            tmp = {}
                                            tmp['lbl'] = tag
                                            tmp['lvl'] = lvl
                                            for e in e.getchildren():
                                                lvl = 6
                                                tag = e.tag
                                                txt = e.text
                                                if tag.lower().strip().startswith('foo'):
                                                    tmp[tag] = txt
                                                else:
                                                    table.append(tmp)
                                                    tmp = {}
                                                    tmp['lbl'] = tag
                                                    tmp['lvl'] = lvl
                                                    for e in e.getchildren():
                                                        lvl = 7
                                                        tag = e.tag
                                                        txt = e.text
                                                        if tag.lower().strip().startswith('foo'):
                                                            tmp[tag] = txt
                                                        else:
                                                            table.append(tmp)
                                                            tmp = {}
                                                            tmp['lbl'] = tag
                                                            tmp['lvl'] = lvl

                                                            if e.getchildren():
                                                                raise NotImplementedError
# make_sure_last_row_is_appended
try:
    if table[-1]['lbl'] != tag:
        table.append(tmp)
except KeyError:
    # probably table has only 1 row
    table.append(tmp)

# remove_technical_first_row
p = table.pop(0)

Output looks like this (as it should, there is no leeway regarding its structure):

[ {'lbl': 'Level1', 'lvl': 0, 'Foo_A': '1065106.46', 'Foo_B': '675706.31', 'Foo_B1': '0.00'}, {'lbl': 'Level1_A', 'lvl': 1, 'Foo_A': '23750.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level1_B', 'lvl': 1, 'Foo_A': '1041356.46', 'Foo_B': '675706.31', 'Foo_B1': '0.00'}, {'lbl': 'Level1_B_1', 'lvl': 2, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level1_B_2', 'lvl': 2, 'Foo_A': '466158.93', 'Foo_B': '59838.40', 'Foo_B1': '0.00'}, {'lbl': 'Level1_B_2_1', 'lvl': 3, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level1_B_2_2', 'lvl': 3, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level1_C', 'lvl': 1, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level1_D', 'lvl': 1, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level2', 'lvl': 0, 'Foo_A': '1065106.46', 'Foo_B': '675706.31', 'Foo_B1': '0.00'}, {'lbl': 'Level2_A', 'lvl': 1, 'Foo_A': '556001.19', 'Foo_B': '138410.82', 'Foo_B1': '0.00'}, {'lbl': 'Level2_A_1', 'lvl': 2, 'Foo_A': '50000.00', 'Foo_B': '50000.00', 'Foo_B1': '0.00'}, {'lbl': 'Level2_B', 'lvl': 1, 'Foo_A': '509105.27', 'Foo_B': '537295.49', 'Foo_B1': '0.00'}, {'lbl': 'Level2_B_1', 'lvl': 2, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level2_B_2', 'lvl': 2, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}]

The main problem with this code is that I don't control the input XML, so in theory it can be nested more deeply than my code anticipates. So far I have written 'n'+3 for-loops, with 'n' being the maximum nesting level observed in a test sample, but an input XML can easily be nested more deeply than 'n'+3.
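To illustrate the kind of depth-independent traversal being asked about, here is a minimal sketch of a recursive generator. It is not taken from the post: the name `iter_rows`, the `prefix` parameter, and the small `sample` document are hypothetical, and the standard library's `xml.etree.ElementTree` stands in for lxml (elements are iterated and expose `.tag` and `.text` the same way). It also assumes the 'Foo' value elements always belong to their direct parent's row.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def iter_rows(node, lvl=0, prefix='foo'):
    """Yield one row dict per structural child of `node`, depth-first."""
    for child in node:
        if child.tag.lower().strip().startswith(prefix):
            continue  # a value element; it is collected into its parent's row
        row = {'lbl': child.tag, 'lvl': lvl}
        for value in child:
            if value.tag.lower().strip().startswith(prefix):
                row[value.tag] = value.text
        yield row
        # Recurse into the same child to emit rows for its own sub-levels.
        yield from iter_rows(child, lvl + 1, prefix)


# Hypothetical, shortened sample with three nesting levels.
sample = ('<Root><Level1><Foo_A>1</Foo_A><Level1_A><Foo_A>2</Foo_A>'
          '<Level1_A_1><Foo_A>3</Foo_A></Level1_A_1></Level1_A></Level1>'
          '<Level2><Foo_A>4</Foo_A></Level2></Root>')
table = list(iter_rows(ET.fromstring(sample)))
```

Because each call handles one level and delegates the next level to itself, the nesting depth is bounded only by Python's recursion limit rather than by the number of hand-written loops.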

Also, it's important to preserve the data structure, so using the lxml.etree._Element.iter() method instead of lxml.etree._Element.getchildren() seems like a bad idea.
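For comparison, a flat traversal does not necessarily lose the level information if start and end events are tracked. The sketch below is an assumption-laden illustration, not the post's method: it uses the standard library's `ET.iterparse` on raw bytes (lxml offers a similar event interface, including `etree.iterwalk` for an already parsed tree), and the names `rows_from_events`, `open_rows`, and `sample` are hypothetical.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree (assumption)


def rows_from_events(xml_bytes, prefix='foo'):
    """Build the row table from a flat stream of start/end events."""
    table = []
    open_rows = []  # rows of the structural elements that are currently open
    depth = -1      # the root sits at -1 so that its children get lvl 0
    events = ET.iterparse(io.BytesIO(xml_bytes), events=('start', 'end'))
    for event, elem in events:
        is_value = elem.tag.lower().strip().startswith(prefix)
        if event == 'start' and not is_value:
            if depth >= 0:  # skip the root itself
                row = {'lbl': elem.tag, 'lvl': depth}
                table.append(row)
                open_rows.append(row)
            depth += 1
        elif event == 'end':
            if is_value:
                # Text is only guaranteed to be complete at the 'end' event.
                if open_rows:
                    open_rows[-1][elem.tag] = elem.text
            else:
                depth -= 1
                if open_rows:
                    open_rows.pop()
    return table


# Hypothetical, shortened sample with three nesting levels.
sample = (b'<Root><Level1><Foo_A>1</Foo_A><Level1_A><Foo_A>2</Foo_A>'
          b'<Level1_A_1><Foo_A>3</Foo_A></Level1_A_1></Level1_A></Level1>'
          b'<Level2><Foo_A>4</Foo_A></Level2></Root>')
rows = rows_from_events(sample)
```

The depth counter plays the role of the hard-coded `lvl` values, and the stack of open rows ensures each value element is attached to its direct parent.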

So far, performance hasn't been an issue. At the last stage of refactoring I usually improve variable names and split the code into the smallest possible functions with descriptive names, as per the 'Clean Code' methodology, so suggestions in this area, while welcome, are secondary to the main problem.

How can I improve my code?