4
\$\begingroup\$

I wrote a program that reads and processes XML files submitted by external users. The part of the code in question iterates over an XML node's children, appends a dictionary (a future row in a table), and repeats the process for every child, its children, and so on.

An example XML node looks like this:

xmlstr = '<Root_Level><Level1><Foo_A>1065106.46</Foo_A><Foo_B>675706.31</Foo_B><Foo_B1>0.00</Foo_B1><Level1_A><Foo_A>23750.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_A><Level1_B><Foo_A>1041356.46</Foo_A><Foo_B>675706.31</Foo_B><Foo_B1>0.00</Foo_B1><Level1_B_1><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_B_1><Level1_B_2><Foo_A>466158.93</Foo_A><Foo_B>59838.40</Foo_B><Foo_B1>0.00</Foo_B1><Level1_B_2_1><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_B_2_1><Level1_B_2_2><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_B_2_2></Level1_B_2></Level1_B><Level1_C><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_C><Level1_D><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level1_D></Level1><Level2><Foo_A>1065106.46</Foo_A><Foo_B>675706.31</Foo_B><Foo_B1>0.00</Foo_B1><Level2_A><Foo_A>556001.19</Foo_A><Foo_B>138410.82</Foo_B><Foo_B1>0.00</Foo_B1><Level2_A_1><Foo_A>50000.00</Foo_A><Foo_B>50000.00</Foo_B><Foo_B1>0.00</Foo_B1></Level2_A_1></Level2_A><Level2_B><Foo_A>509105.27</Foo_A><Foo_B>537295.49</Foo_B><Foo_B1>0.00</Foo_B1><Level2_B_1><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level2_B_1><Level2_B_2><Foo_A>0.00</Foo_A><Foo_B>0.00</Foo_B><Foo_B1>0.00</Foo_B1></Level2_B_2></Level2_B></Level2></Root_Level>'

The code I’ve written (brace yourself, it's pretty ugly):

# code to help you get started
from lxml import etree
tree = etree.fromstring(xmlstr)
tree.xpath('.')[0].getchildren()
xml_node = tree.xpath('.')[0]
# real use case starts here
table = []
tmp = {}
for b in xml_node:
    lvl = 0
    tag = b.tag
    txt = b.text
    table.append(tmp)
    tmp = {}
    tmp['lbl'] = tag
    tmp['lvl'] = lvl
    for elem in b.getchildren():
        lvl = 1
        tag = elem.tag
        txt = elem.text
        if tag.lower().strip().startswith('foo'):
            tmp[tag] = txt
        else:
            table.append(tmp)
            tmp = {}
            tmp['lbl'] = tag
            tmp['lvl'] = lvl
            for el in elem.getchildren():
                lvl = 2
                tag = el.tag
                txt = el.text
                if tag.lower().strip().startswith('foo'):
                    tmp[tag] = txt
                else:
                    table.append(tmp)
                    tmp = {}
                    tmp['lbl'] = tag
                    tmp['lvl'] = lvl
                    for e in el.getchildren():
                        lvl = 3
                        tag = e.tag
                        txt = e.text
                        if tag.lower().strip().startswith('foo'):
                            tmp[tag] = txt
                        else:
                            table.append(tmp)
                            tmp = {}
                            tmp['lbl'] = tag
                            tmp['lvl'] = lvl
                            for e in e.getchildren():
                                lvl = 4
                                tag = e.tag
                                txt = e.text
                                if tag.lower().strip().startswith('foo'):
                                    tmp[tag] = txt
                                else:
                                    table.append(tmp)
                                    tmp = {}
                                    tmp['lbl'] = tag
                                    tmp['lvl'] = lvl
                                    for e in e.getchildren():
                                        lvl = 5
                                        tag = e.tag
                                        txt = e.text
                                        if tag.lower().strip().startswith('foo'):
                                            tmp[tag] = txt
                                        else:
                                            table.append(tmp)
                                            tmp = {}
                                            tmp['lbl'] = tag
                                            tmp['lvl'] = lvl
                                            for e in e.getchildren():
                                                lvl = 6
                                                tag = e.tag
                                                txt = e.text
                                                if tag.lower().strip().startswith('foo'):
                                                    tmp[tag] = txt
                                                else:
                                                    table.append(tmp)
                                                    tmp = {}
                                                    tmp['lbl'] = tag
                                                    tmp['lvl'] = lvl
                                                    for e in e.getchildren():
                                                        lvl = 7
                                                        tag = e.tag
                                                        txt = e.text
                                                        if tag.lower().strip().startswith('foo'):
                                                            tmp[tag] = txt
                                                        else:
                                                            table.append(tmp)
                                                            tmp = {}
                                                            tmp['lbl'] = tag
                                                            tmp['lvl'] = lvl
                                                            if e.getchildren():
                                                                raise NotImplementedError
# make_sure_last_row_is_appended
try:
    if table[-1]['lbl'] != tag:
        table.append(tmp)
except KeyError:
    # probably table has only 1 row
    table.append(tmp)
# remove_technical_first_row
p = table.pop(0)

Output looks like this (as it should, there is no leeway regarding its structure):

[ {'lbl': 'Level1', 'lvl': 0, 'Foo_A': '1065106.46', 'Foo_B': '675706.31', 'Foo_B1': '0.00'}, {'lbl': 'Level1_A', 'lvl': 1, 'Foo_A': '23750.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level1_B', 'lvl': 1, 'Foo_A': '1041356.46', 'Foo_B': '675706.31', 'Foo_B1': '0.00'}, {'lbl': 'Level1_B_1', 'lvl': 2, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level1_B_2', 'lvl': 2, 'Foo_A': '466158.93', 'Foo_B': '59838.40', 'Foo_B1': '0.00'}, {'lbl': 'Level1_B_2_1', 'lvl': 3, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level1_B_2_2', 'lvl': 3, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level1_C', 'lvl': 1, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level1_D', 'lvl': 1, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level2', 'lvl': 0, 'Foo_A': '1065106.46', 'Foo_B': '675706.31', 'Foo_B1': '0.00'}, {'lbl': 'Level2_A', 'lvl': 1, 'Foo_A': '556001.19', 'Foo_B': '138410.82', 'Foo_B1': '0.00'}, {'lbl': 'Level2_A_1', 'lvl': 2, 'Foo_A': '50000.00', 'Foo_B': '50000.00', 'Foo_B1': '0.00'}, {'lbl': 'Level2_B', 'lvl': 1, 'Foo_A': '509105.27', 'Foo_B': '537295.49', 'Foo_B1': '0.00'}, {'lbl': 'Level2_B_1', 'lvl': 2, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}, {'lbl': 'Level2_B_2', 'lvl': 2, 'Foo_A': '0.00', 'Foo_B': '0.00', 'Foo_B1': '0.00'}]

The main problem with this code is that I don't control the input XML, so in theory it can be nested more deeply than my code anticipates. So far I have written 'n'+3 for-loops, with 'n' being the maximum nesting level observed in a test sample, but an input XML can easily be nested more deeply than 'n'+3.

What I wrote was partly justified by being in a hurry, but to be honest I have no idea how to handle this more elegantly. I went through the methods proposed in several articles (for example this one or that one), but nothing came to mind. Maybe this can be done with a 'while True...break' statement, or with the getchildren() method hidden in a generator function...

Also, it's important to maintain the data structure, so using the lxml._Element.iter() method instead of lxml._Element.getchildren() seems like a bad idea.

So far performance hasn't been an issue. At the last stage of refactoring I usually improve variable names and divide the code into the smallest possible functions with descriptive names, as per the ‘Clean Code’ methodology, so suggestions in this area, while welcome, are secondary to the main problem.

How can I improve my code?

asked Sep 6, 2019 at 20:14
\$\endgroup\$
9
  • \$\begingroup\$ "How can I improve my code?" Do some XSLT processing beforehand, in the first place. \$\endgroup\$ Commented Sep 6, 2019 at 20:35
  • \$\begingroup\$ Your code doesn't do what the question says it does. Is this because it is not implemented? \$\endgroup\$ Commented Sep 6, 2019 at 20:55
  • \$\begingroup\$ Quick suggestion: get all matching nodes with tree.xpath("//*[starts-with(name(), 'Foo')]") and then use getparent() to find the height of the node (or perhaps that's possible with XPath too). \$\endgroup\$ Commented Sep 6, 2019 at 20:56
  • \$\begingroup\$ Thanks @ferada, I'll try that. Can I use XPath() instead of xpath()? (see:ibm.com/developerworks/xml/library/x-hiperfparse) \$\endgroup\$ Commented Sep 6, 2019 at 21:02
  • \$\begingroup\$ I don't know, I meant XPath the standard, not the method. Seems to work fine for me with tree.xpath, but of course you can see if there's a more convenient/fast/... implementation. \$\endgroup\$ Commented Sep 6, 2019 at 21:04

1 Answer

3
\$\begingroup\$

The way to process recursively nested data is with recursively nested code.

I don't know Python, and I would do this in XQuery or XSLT by preference, but the pseudo-code is the same whatever the language:

function deep_process (parent, table, level) {
    for child in parent.children() {
        shallow_process(child, table, level);
        deep_process(child, table, level+1);
    }
}

where shallow_process() is whatever local processing you do at every level.

The key message to take away is that processing recursive data requires recursive code.
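
In Python, the pseudo-code above might be sketched roughly as follows. This is only an illustration under some assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml (iterating over an element yields its direct children in both), the helper name is_value_tag is hypothetical, and the input is a shortened, made-up sample rather than the XML from the question. Unlike the nested loops in the question, it attaches every Foo_* child to its own parent's row, even if that child appears after a nested element.

```python
import xml.etree.ElementTree as ET


def is_value_tag(tag):
    """Hypothetical helper: the same 'foo' prefix test the question uses."""
    return tag.lower().strip().startswith('foo')


def shallow_process(node, table, level):
    """Append one row for node: its label, its level and its Foo_* children."""
    row = {'lbl': node.tag, 'lvl': level}
    for child in node:
        if is_value_tag(child.tag):
            row[child.tag] = child.text
    table.append(row)


def deep_process(parent, table, level=0):
    """Visit the non-value children of parent depth first, in document order."""
    for child in parent:
        if is_value_tag(child.tag):
            continue  # already stored as a column of the parent's row
        shallow_process(child, table, level)
        deep_process(child, table, level + 1)


# Shortened, made-up sample with the same shape as the question's input.
xmlstr = (
    '<Root_Level>'
    '<Level1><Foo_A>1</Foo_A><Foo_B>2</Foo_B>'
    '<Level1_A><Foo_A>3</Foo_A>'
    '<Level1_A_1><Foo_A>4</Foo_A></Level1_A_1>'
    '</Level1_A>'
    '<Level1_B><Foo_A>5</Foo_A></Level1_B>'
    '</Level1>'
    '<Level2><Foo_A>6</Foo_A></Level2>'
    '</Root_Level>'
)

table = []
deep_process(ET.fromstring(xmlstr), table)
for row in table:
    print(row)
```

The nesting depth is no longer hard-coded: each recursive call handles one more level. Python's default recursion limit (1000 frames) is the only remaining bound, which is probably far deeper than any realistic input document.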

answered Sep 6, 2019 at 21:51
\$\endgroup\$
