Parse an XML and retrieve zero or one attribute per level starting from a leaf

Question 1

I need to extract some information from a large XML file:

zero or one attribute per level (tag) of the hierarchy
one output row for each leaf element of a given tag

So I wrote this Python code:

def ascend(element):
    parent = element.getparent()
    return [parent] + ascend(parent) if parent is not None else []

def parsexml(filename, tag, attrs):
    from lxml import etree
    for el in etree.ElementTree(file=filename).iter(tag):
        nodes = ascend(el)[::-1] + [el]
        yield [n.attrib[a] for n, a in zip(nodes, attrs) if a]

This snippet can be used as follows (with simplified data):

In [1]: import io

In [2]: xml = io.StringIO("""
<data>
  <level1 a="1">
    <level2 b="1">
      <level3 c="1"/>
      <level3 c="2"/>
    </level2>
    <level2 b="2">
      <level3 c="3"/>
      <level3 c="4"/>
    </level2>
  </level1>
  <level1 a="2">
    <level2 b="3">
      <level3 c="5"/>
      <level3 c="6"/>
    </level2>
    <level2 b="4">
      <level3 c="7"/>
      <level3 c="8"/>
    </level2>
  </level1>
</data>""")

In [3]: list(parsexml(xml, 'level3', [None, 'a', None, 'c']))
Out[3]:
[['1', '1'],
 ['1', '2'],
 ['1', '3'],
 ['1', '4'],
 ['2', '5'],
 ['2', '6'],
 ['2', '7'],
 ['2', '8']]

My goal is to import the extracted data into pandas:

ds = pd.DataFrame(data, columns=['level1', 'level3'])

Is my implementation correct, or can I do better? I think the code is short and elegant, but I suspect it spends most of its time looking up parents, which may be inefficient (I haven't measured it, and for my application it is still fast enough).

Answer

You say that the XML file is large, but it's unclear how large. I'm pretty sure that all etree implementations fully load the file into memory, but (a) that isn't entirely safe, and (b) it isn't necessary if you switch to SAX.

Normally I would suggest moving to pandas.read_xml, but it doesn't seem to do a very good job of capturing your use case (the output is sparse with only one non-NaN per row assuming the use of a pipe in the xpath expression). It's also not very careful about performance because it's a non-vectorised wrapper around lxml.

Things I like about your implementation - it's short and correct. Things I don't like - it uses recursion, which doesn't work well with Python (no tail recursion, limited stack depth); it isn't very careful in generating a column index; and the attrs argument has to care about the entire ancestry path even for positions that have no important attribute.
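To make the recursion and attrs points concrete, here is a minimal sketch rather than a drop-in replacement. It assumes the standard-library xml.etree.ElementTree instead of lxml (so there is no getparent(), and a child-to-parent dictionary stands in for it), and the names iter_rows and parent_of are my own. It still loads the whole tree; it only shows a plain loop in place of recursion and a tag-keyed dict in place of the positional attrs list.

```python
import io
import xml.etree.ElementTree as ET


def iter_rows(source, leaf, select):
    """Yield one list per `leaf` element; `select` maps tag name -> attribute name."""
    root = ET.parse(source).getroot()
    # Assumption: stdlib ElementTree, which has no getparent(), so build a lookup once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in root.iter(leaf):
        found = {}
        node = el
        while node is not None:  # plain loop instead of recursion
            attr = select.get(node.tag)
            if attr is not None:
                # setdefault keeps the value from the nearest element with this tag
                found.setdefault(node.tag, node.get(attr))
            node = parent_of.get(node)  # None once we step above the root
        yield [found.get(tag) for tag in select]


sample = """<data>
  <level1 a="1"><level2 b="1"><level3 c="1"/><level3 c="2"/></level2></level1>
  <level1 a="2"><level2 b="3"><level3 c="5"/></level2></level1>
</data>"""

rows = list(iter_rows(io.StringIO(sample), "level3", {"level1": "a", "level3": "c"}))
print(rows)  # [['1', '1'], ['1', '2'], ['2', '5']]
```

Only the levels that matter appear in select, so level2 and the root never need a None placeholder.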

Unfortunately, there's no silver bullet here. I'll demonstrate a SAX method that will probably use less memory, though I'm not confident that it will be particularly fast. To go faster, you could write some C (still built around a SAX parser).

import io
import xml.sax

import pandas as pd


class Handler(xml.sax.handler.ContentHandler):
    def __init__(
        self,
        select: dict[str, str],
        leaf: str,
    ) -> None:
        super().__init__()
        self.leaf = leaf
        self.select = select
        self.data: list[list[str]] = []
        self.row = [None]*len(select)
        self.index = {
            elm: i
            for (i, elm) in enumerate(select.keys())
        }

    def startElement(self, name: str, attrs: xml.sax.xmlreader.AttributesImpl) -> None:
        attr = self.select.get(name)
        if attr is not None:
            value = attrs.getValue(attr)
            self.row[self.index[name]] = value
            if value is not None and name == self.leaf:
                self.data.append(list(self.row))

    def get_frame(self) -> pd.DataFrame:
        cols = pd.MultiIndex.from_tuples(
            tuples=tuple(self.select.items()),
            names=('element', 'attr'),
        )
        return pd.DataFrame(columns=cols, data=self.data)


handler = Handler(
    select={
        'level1': 'a',
        'level3': 'c',
    },
    leaf='level3',
)

with io.BytesIO(b"""
<data>
  <level1 a="1">
    <level2 b="1">
      <level3 c="1"/>
      <level3 c="2"/>
    </level2>
    <level2 b="2">
      <level3 c="3"/>
      <level3 c="4"/>
    </level2>
  </level1>
  <level1 a="2">
    <level2 b="3">
      <level3 c="5"/>
      <level3 c="6"/>
    </level2>
    <level2 b="4">
      <level3 c="7"/>
      <level3 c="8"/>
    </level2>
  </level1>
</data>""") as f:
    xml.sax.parse(f, handler=handler)

print(handler.get_frame())

element level1 level3
attr         a      c
0            1      1
1            1      2
2            1      3
3            1      4
4            2      5
5            2      6
6            2      7
7            2      8
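
One caveat about the handler above: getValue raises KeyError when the attribute is absent, so the value is not None check never actually filters anything. If some elements may lack the selected attribute, attrs.get(attr) returns None instead. The following is a minimal sketch of that variant, with pandas left out to keep it short; the class name TolerantHandler and the tiny sample document are my own assumptions, not part of the code above.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler


class TolerantHandler(ContentHandler):
    """Hypothetical variant that skips leaf elements lacking the selected attribute."""

    def __init__(self, select, leaf):
        super().__init__()
        self.select = select
        self.leaf = leaf
        self.index = {tag: i for i, tag in enumerate(select)}
        self.row = [None] * len(select)
        self.data = []

    def startElement(self, name, attrs):
        attr = self.select.get(name)
        if attr is None:
            return
        # attrs.get() returns None for a missing attribute; getValue() would raise KeyError.
        value = attrs.get(attr)
        self.row[self.index[name]] = value
        if value is not None and name == self.leaf:
            self.data.append(list(self.row))


handler = TolerantHandler(select={"level1": "a", "level3": "c"}, leaf="level3")
# The second level3 has no "c" attribute and is skipped rather than raising.
sample = b'<data><level1 a="1"><level3 c="1"/><level3/></level1></data>'
xml.sax.parse(io.BytesIO(sample), handler)
print(handler.data)  # [['1', '1']]
```

Whether a missing attribute should be skipped, kept as None, or treated as an error depends on your data, so take this as one option.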

Reinderien · Accepted Answer · 2024-12-17 01:13:21Z
