Nested for-loops in xml parsing

Question 1

I wrote a program that reads and processes XML files submitted by external users. The code in question iterates over an XML node's children, builds a dictionary for each one (a future row in a table), appends it to a list, and repeats the process for every child, its children, and so on.

An example XML node looks like this:

xmlstr = '<Root_Level><Level1><Foo_A>1065106.46</Foo_A><Foo_B>675706.31</Foo_B><Foo_B1>0.00</Foo_B1><Level1_A><Foo_A>23750.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_A><Level1_B><Foo_A>1041356.46</Foo_A><Foo_B>675706.31</Foo_B><Foo_B1>0.00</Foo_B1><Level1_B_1><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_B_1><Level1_B_2><Foo_A>466158.93</Foo_A><Foo_B>59838.40</Foo_B><Foo_B1>0.00</Foo_B1><Level1_B_2_1><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_B_2_1><Level1_B_2_2><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_B_2_2></Level1_B_2></Level1_B><Level1_C><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_C><Level1_D><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_D></Level1><Level2><Foo_A>1065106.46</Foo_A><Foo_B>675706.31</Foo_B><Foo_B1>0.00</Foo_B1><Level2_A><Foo_A>556001.19</Foo_A><Foo_B>138410.82</Foo_B><Foo_B1>0.00</Foo_B1><Level2_A_1><Foo_A>50000.00</Foo_A><Foo_B>50000.00</Foo_B><Foo_B1>0.00</Foo_B1></Level2_A_1></Level2_A><Level2_B><Foo_A>509105.27</Foo_A><Foo_B>537295.49</Foo_B><Foo_B1>0.00</Foo_B1><Level2_B_1><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level2_B_1><Level2_B_2><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level2_B_2></Level2_B></Level2></Root_Level>'

Code I’ve written (brace yourself, it's pretty ugly):

# code to help you get started
from lxml import etree
tree = etree.fromstring(xmlstr)
tree.xpath('.')[0].getchildren()
xml_node = tree.xpath('.')[0]
# real use case starts here
table = []
tmp = {}
for b in xml_node:
    lvl = 0
    tag = b.tag
    txt = b.text
    table.append(tmp)
    tmp = {}
    tmp['lbl'] = tag
    tmp['lvl'] = lvl
    for elem in b.getchildren():
        lvl = 1
        tag = elem.tag
        txt = elem.text
        if tag.lower().strip().startswith('foo'):
            tmp[tag] = txt
        else:
            table.append(tmp)
            tmp = {}
            tmp['lbl'] = tag
            tmp['lvl'] = lvl
            for el in elem.getchildren():
                lvl = 2
                tag = el.tag
                txt = el.text
                if tag.lower().strip().startswith('foo'):
                    tmp[tag] = txt
                else:
                    table.append(tmp)
                    tmp = {}
                    tmp['lbl'] = tag
                    tmp['lvl'] = lvl
                    for e in el.getchildren():
                        lvl = 3
                        tag = e.tag
                        txt = e.text
                        if tag.lower().strip().startswith('foo'):
                            tmp[tag] = txt
                        else:
                            table.append(tmp)
                            tmp = {}
                            tmp['lbl'] = tag
                            tmp['lvl'] = lvl
                            for e in e.getchildren():
                                lvl = 4
                                tag = e.tag
                                txt = e.text
                                if tag.lower().strip().startswith('foo'):
                                    tmp[tag] = txt
                                else:
                                    table.append(tmp)
                                    tmp = {}
                                    tmp['lbl'] = tag
                                    tmp['lvl'] = lvl
                                    for e in e.getchildren():
                                        lvl = 5
                                        tag = e.tag
                                        txt = e.text
                                        if tag.lower().strip().startswith('foo'):
                                            tmp[tag] = txt
                                        else:
                                            table.append(tmp)
                                            tmp = {}
                                            tmp['lbl'] = tag
                                            tmp['lvl'] = lvl
                                            for e in e.getchildren():
                                                lvl = 6
                                                tag = e.tag
                                                txt = e.text
                                                if tag.lower().strip().startswith('foo'):
                                                    tmp[tag] = txt
                                                else:
                                                    table.append(tmp)
                                                    tmp = {}
                                                    tmp['lbl'] = tag
                                                    tmp['lvl'] = lvl
                                                    for e in e.getchildren():
                                                        lvl = 7
                                                        tag = e.tag
                                                        txt = e.text
                                                        if tag.lower().strip().startswith('foo'):
                                                            tmp[tag] = txt
                                                        else:
                                                            table.append(tmp)
                                                            tmp = {}
                                                            tmp['lbl'] = tag
                                                            tmp['lvl'] = lvl
                                                            if e.getchildren():
                                                                raise NotImplementedError
# make_sure_last_row_is_appended
try:
    if table[-1]['lbl'] != tag:
        table.append(tmp)
except KeyError:
    # probably table has only 1 row
    table.append(tmp)
# remove_technical_first_row
p = table.pop(0)

The output looks like this (as it should; there is no leeway regarding its structure):

[
 {'lbl': 'Level1', 'lvl': 0, 'Foo_A': '1065106.46', 'Foo_B': '675706.31', 'Foo_B1': '0.00'},
 {'lbl': 'Level1_A', 'lvl': 1, 'Foo_A': '23750.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'},
 {'lbl': 'Level1_B', 'lvl': 1, 'Foo_A': '1041356.46', 'Foo_B': '675706.31', 'Foo_B1': '0.00'},
 {'lbl': 'Level1_B_1', 'lvl': 2, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'},
 {'lbl': 'Level1_B_2', 'lvl': 2, 'Foo_A': '466158.93', 'Foo_B': '59838.40', 'Foo_B1': '0.00'},
 {'lbl': 'Level1_B_2_1', 'lvl': 3, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'},
 {'lbl': 'Level1_B_2_2', 'lvl': 3, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'},
 {'lbl': 'Level1_C', 'lvl': 1, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'},
 {'lbl': 'Level1_D', 'lvl': 1, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'},
 {'lbl': 'Level2', 'lvl': 0, 'Foo_A': '1065106.46', 'Foo_B': '675706.31', 'Foo_B1': '0.00'},
 {'lbl': 'Level2_A', 'lvl': 1, 'Foo_A': '556001.19', 'Foo_B': '138410.82', 'Foo_B1': '0.00'},
 {'lbl': 'Level2_A_1', 'lvl': 2, 'Foo_A': '50000.00', 'Foo_B': '50000.00', 'Foo_B1': '0.00'},
 {'lbl': 'Level2_B', 'lvl': 1, 'Foo_A': '509105.27', 'Foo_B': '537295.49', 'Foo_B1': '0.00'},
 {'lbl': 'Level2_B_1', 'lvl': 2, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'},
 {'lbl': 'Level2_B_2', 'lvl': 2, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}
]

The main problem with this code is that I don't control the input XML, so in theory it can be nested more deeply than my code anticipates. So far I have written 'n'+3 for-loops, with 'n' being the maximum nesting level observed in a test sample, but an input XML could easily be nested more deeply than 'n'+3 levels.

What I wrote was partly justified by being in a hurry, but to be honest I have no idea how to handle this more elegantly. I went through the methods proposed in several articles (for example this one or that one), but nothing came to mind. Maybe this can be done with a 'while True...break' statement, or with the getchildren() method hidden in a generator function...

Also, it's important to preserve the data structure, so using the lxml._Element.iter() method instead of lxml._Element.getchildren() seems like a bad idea.

So far performance hasn't been an issue. In the last stage of refactoring I usually improve variable names and divide the code into the smallest possible functions with descriptive names, as per the ‘Clean Code’ methodology, so suggestions in this area, while welcome, are secondary to the main problem.

How can I improve my code?

Comment 1

"How can I improve my code?" Do some XSLT processing first.

Comment 2

Your code doesn't do what the question says it does. Is this because it is not implemented?

Comment 3

Quick suggestion: get all matching nodes with tree.xpath("//*[starts-with(name(), 'Foo')]") and then use getparent() to find the height of the node (or perhaps that's possible with XPath too).
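
A rough sketch of this suggestion in Python, assuming lxml is installed. The sample document and the rows dictionary are made up for illustration; xpath(), getparent() and iterancestors() are the lxml calls doing the work. Two differences from the code in the question are worth keeping in mind: name() matching in XPath is case-sensitive, and an element without any Foo child produces no row here.

```python
from lxml import etree

# Small made-up sample with the same shape as the XML in the question.
xmlstr = (
    '<Root_Level>'
    '<Level1><Foo_A>1</Foo_A>'
    '<Level1_A><Foo_A>2</Foo_A><Foo_B>3</Foo_B></Level1_A>'
    '</Level1>'
    '<Level2><Foo_A>4</Foo_A></Level2>'
    '</Root_Level>'
)
tree = etree.fromstring(xmlstr)

rows = {}  # parent element -> row dict; dicts keep insertion order
for node in tree.xpath("//*[starts-with(name(), 'Foo')]"):
    parent = node.getparent()
    if parent not in rows:
        # Level1 has exactly one ancestor (the root), so subtract 1 to get lvl 0.
        depth = sum(1 for _ in parent.iterancestors()) - 1
        rows[parent] = {'lbl': parent.tag, 'lvl': depth}
    rows[parent][node.tag] = node.text

table = list(rows.values())
print(table)
```

Because xpath() returns nodes in document order, the rows come out in the same parent-before-child order as in the question, at least when the Foo elements precede the nested children as they do in the sample.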

Comment 4

Thanks @ferada, I'll try that. Can I use XPath() instead of xpath()? (see: ibm.com/developerworks/xml/library/x-hiperfparse)

Comment 5

I don't know, I meant XPath the standard, not the method. Seems to work fine for me with tree.xpath, but of course you can see if there's a more convenient/fast/... implementation.
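
For what it's worth, lxml offers both: the xpath() method on elements and a compiled-expression class, etree.XPath, whose instances are called with an element. A minimal sketch, assuming lxml is installed; the tiny document and the name find_foo are made up for illustration, and no claim is made here about which variant is faster.

```python
from lxml import etree

tree = etree.fromstring(
    '<Root_Level><Level1><Foo_A>1</Foo_A><Bar>2</Bar></Level1></Root_Level>'
)
expr = "//*[starts-with(name(), 'Foo')]"

# Compile the expression once, then evaluate it against any element.
find_foo = etree.XPath(expr)
compiled_tags = [n.tag for n in find_foo(tree)]

# The same expression evaluated through the element's xpath() method.
method_tags = [n.tag for n in tree.xpath(expr)]

print(compiled_tags, method_tags)
```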

Answer

The way to process recursively nested data is with recursively nested code.

I don't know Python, and I would do this in XQuery or XSLT by preference, but the pseudo-code is the same whatever the language:

function deep_process (parent, table, level) {
    for child in parent.children() {
        shallow_process(child, table, level);
        deep_process(child, table, level+1);
    }
}

where shallow_process() is whatever local processing you do at every level.
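
One possible Python rendering of the pseudo-code, offered as a sketch and not as the answerer's own code. It uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies; lxml elements can be iterated and expose .tag and .text in the same way. The helper is_value() and the small sample document are assumptions made for illustration. Unlike the loops in the question, each Foo value is attached to its actual parent element regardless of where it appears among the children.

```python
import xml.etree.ElementTree as ET


def is_value(elem):
    # Hypothetical helper: same test as in the question, tags starting
    # with "foo" hold the values of a row.
    return elem.tag.lower().strip().startswith('foo')


def shallow_process(elem, table, level):
    # Build one row from the element's tag and its direct Foo children.
    row = {'lbl': elem.tag, 'lvl': level}
    for child in elem:
        if is_value(child):
            row[child.tag] = child.text
    table.append(row)


def deep_process(parent, table, level=0):
    for child in parent:
        if is_value(child):
            continue  # already stored in the parent's row
        shallow_process(child, table, level)
        deep_process(child, table, level + 1)


# Small made-up sample with the same shape as the XML in the question.
xmlstr = (
    '<Root_Level>'
    '<Level1><Foo_A>1</Foo_A>'
    '<Level1_A><Foo_A>2</Foo_A>'
    '<Level1_A_1><Foo_A>3</Foo_A></Level1_A_1>'
    '</Level1_A>'
    '</Level1>'
    '<Level2><Foo_A>4</Foo_A></Level2>'
    '</Root_Level>'
)
table = []
deep_process(ET.fromstring(xmlstr), table)
print(table)
```

The nesting depth is no longer hard-coded; it is bounded only by Python's recursion limit (1000 frames by default), which is far deeper than the documents shown in the question.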

The key message to take away is that processing recursive data requires recursive code.

Michael Kay · Accepted Answer · 2019-09-06 21:51:08Z
