\$\begingroup\$

I need to extract some information from a large XML file:

  • 0 or 1 attribute per tag
  • Trigger on a leaf element

So I wrote this Python code:

def ascend(element):
    parent = element.getparent()
    return [parent] + ascend(parent) if parent is not None else []

def parsexml(filename, tag, attrs):
    from lxml import etree
    for el in etree.ElementTree(file=filename).iter(tag):
        nodes = ascend(el)[::-1] + [el]
        yield [n.attrib[a] for n, a in zip(nodes, attrs) if a]

This snippet can be used as follows (with simplified data):

In [1]: xml = StringIO.StringIO("""
<data>
  <level1 a="1">
    <level2 b="1">
      <level3 c="1"/>
      <level3 c="2"/>
    </level2>
    <level2 b="2">
      <level3 c="3"/>
      <level3 c="4"/>
    </level2>
  </level1>
  <level1 a="2">
    <level2 b="3">
      <level3 c="5"/>
      <level3 c="6"/>
    </level2>
    <level2 b="4">
      <level3 c="7"/>
      <level3 c="8"/>
    </level2>
  </level1>
</data>""")
In [2]: list(parsexml(xml, 'level3', [None, 'a', None, 'c']))
Out[2]:
[['1', '1'],
 ['1', '2'],
 ['1', '3'],
 ['1', '4'],
 ['2', '5'],
 ['2', '6'],
 ['2', '7'],
 ['2', '8']]

My goal is to import the extracted data into pandas:

ds = pd.DataFrame(data, columns=['level1', 'level3'])

Is my implementation correct, or can I do better? I think the code is short and elegant, but I suspect it spends most of its time looking up parents, which may be inefficient (I haven't measured it, and for my application it is still fast enough).

asked Apr 4, 2017 at 7:22
\$\endgroup\$

1 Answer

\$\begingroup\$

You say that the XML file is large, but it's unclear how large. As far as I know, building an ElementTree the way you do loads the whole file into memory, which (a) isn't entirely safe for a large input, and (b) isn't necessary if you switch to SAX.

Normally I would suggest moving to pandas.read_xml, but it doesn't seem to handle your use case very well (assuming a pipe in the XPath expression to select both elements, the output is sparse, with only one non-NaN value per row). It's also not very careful about performance because it's a non-vectorised wrapper around lxml.

Things I like about your implementation: it's short and correct. Things I don't like: it uses recursion, which Python handles poorly (no tail-call optimisation, limited stack depth); it isn't very careful in generating a column index; and the attrs argument has to spell out the entire ancestry path, even for positions that have no attribute of interest.
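
To make the recursion and attrs points concrete, here is a rough sketch rather than a drop-in replacement. It assumes the standard library's xml.etree.ElementTree instead of lxml, so a child-to-parent map stands in for getparent() (with lxml, el.iterancestors() should give the same walk without the map). The names ancestors, extract and select are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def ancestors(element, parent_of):
    """Return the chain root..element, built with a loop instead of recursion."""
    chain = [element]
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    return chain[::-1]


def extract(source, leaf, select):
    """Yield one row per `leaf` element; `select` maps tag name -> attribute name."""
    root = ET.parse(source).getroot()
    # The stdlib has no getparent(), so build a child -> parent map once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in root.iter(leaf):
        # Look nodes up by tag, so only the interesting levels need naming.
        by_tag = {node.tag: node for node in ancestors(el, parent_of)}
        yield [by_tag[tag].get(attr) for tag, attr in select.items()]


xml_text = """<data>
  <level1 a="1"><level2 b="1"><level3 c="1"/><level3 c="2"/></level2></level1>
  <level1 a="2"><level2 b="3"><level3 c="5"/></level2></level1>
</data>"""

rows = list(extract(io.StringIO(xml_text), 'level3', {'level1': 'a', 'level3': 'c'}))
print(rows)
```

Keying the selection by tag name means the caller no longer has to pad the list with None for levels it does not care about. This version still builds the whole tree, so it does nothing for memory use.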

Unfortunately, there's no silver bullet here. I'll demonstrate a SAX method that will probably use less memory, though I'm not confident that it will be particularly fast. To get faster, you could write some C (still around a SAX parser).

import io
import xml.sax

import pandas as pd


class Handler(xml.sax.handler.ContentHandler):
    def __init__(
        self,
        select: dict[str, str],
        leaf: str,
    ) -> None:
        super().__init__()
        self.leaf = leaf
        self.select = select
        self.data: list[list[str]] = []
        self.row = [None]*len(select)
        self.index = {
            elm: i
            for (i, elm) in enumerate(select.keys())
        }

    def startElement(self, name: str, attrs: xml.sax.xmlreader.AttributesImpl) -> None:
        attr = self.select.get(name)
        if attr is not None:
            value = attrs.getValue(attr)
            self.row[self.index[name]] = value
            if value is not None and name == self.leaf:
                self.data.append(list(self.row))

    def get_frame(self) -> pd.DataFrame:
        cols = pd.MultiIndex.from_tuples(
            tuples=tuple(self.select.items()),
            names=('element', 'attr'),
        )
        return pd.DataFrame(columns=cols, data=self.data)


handler = Handler(
    select={
        'level1': 'a',
        'level3': 'c',
    },
    leaf='level3',
)

with io.BytesIO(b"""
<data>
  <level1 a="1">
    <level2 b="1">
      <level3 c="1"/>
      <level3 c="2"/>
    </level2>
    <level2 b="2">
      <level3 c="3"/>
      <level3 c="4"/>
    </level2>
  </level1>
  <level1 a="2">
    <level2 b="3">
      <level3 c="5"/>
      <level3 c="6"/>
    </level2>
    <level2 b="4">
      <level3 c="7"/>
      <level3 c="8"/>
    </level2>
  </level1>
</data>""") as f:
    xml.sax.parse(f, handler=handler)

print(handler.get_frame())

element level1 level3
attr         a      c
0            1      1
1            1      2
2            1      3
3            1      4
4            2      5
5            2      6
6            2      7
7            2      8
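
If you would rather stay close to the ElementTree API, incremental parsing with iterparse is another option. This is a minimal sketch, assuming the standard library's xml.etree.ElementTree; the function name stream_rows and its parameters are made up for illustration, and I have not measured its speed or memory use against the SAX handler.

```python
import io
import xml.etree.ElementTree as ET


def stream_rows(source, leaf, select):
    """Yield one row per `leaf` element while parsing incrementally."""
    stack = []  # currently open elements, root first
    for event, el in ET.iterparse(source, events=('start', 'end')):
        if event == 'start':
            stack.append(el)
            if el.tag == leaf:
                # Attributes are already available on a 'start' event.
                current = {node.tag: node for node in stack}
                yield [current[tag].get(attr) for tag, attr in select.items()]
        else:
            stack.pop()
            # Drop the finished element's children, text and attributes.
            # The emptied element itself stays attached to its parent.
            el.clear()


xml_bytes = b"""<data>
  <level1 a="1"><level2 b="1"><level3 c="1"/><level3 c="2"/></level2></level1>
  <level1 a="2"><level2 b="3"><level3 c="5"/></level2></level1>
</data>"""

rows = list(stream_rows(io.BytesIO(xml_bytes), leaf='level3',
                        select={'level1': 'a', 'level3': 'c'}))
print(rows)
```

The rows can be passed to pd.DataFrame in the same way as self.data in the handler. The open-element stack replaces the parent lookups, so no ancestor walk is needed per leaf.
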
answered Dec 17, 2024 at 1:13
\$\endgroup\$
