I wrote a class to slightly customize the behavior of lxml.etree.ElementTree and I use it quite extensively. It works great, but there are a few methods that I'm not sure how I wrote, and a few others that seem awfully redundant. I'll elaborate on specific questions below, but first here's the code.

    import lxml.etree as old_etree
    from xml_utilities import clean_xml
    from datetime import datetime

    VERSION = 'X.X.X.X'
    TOOL = 'ExampleTool'


    class etree(old_etree._ElementTree):

        @staticmethod
        def parse(path):
            old_tdx = old_etree.parse(path)
            new_tdx = etree(old_tdx)
            new_tdx._setroot(old_tdx.getroot())
            return new_tdx

        @staticmethod
        def fromstring(string):
            old_tdx = old_etree.fromstring(string)
            new_tdx = etree(old_tdx.getroottree())
            new_tdx._setroot(old_tdx)
            return new_tdx

        @staticmethod
        def getISOTime():
            iso_time = datetime.now().isoformat()
            format_time = iso_time.split('.')[0] + 'Z'
            return format_time

        @staticmethod
        def SubElement(*args, **kargs):
            return old_etree.SubElement(*args, **kargs)

        @staticmethod
        def Element(*args, **kargs):
            return old_etree.Element(*args, **kargs)

        def getMetaNode(self, name, default=''):
            metas = self.findall('meta')
            for meta in metas:
                if meta.get('name') == name:
                    return meta
            return old_etree.SubElement(self.getroot(), 'meta',
                                        attrib={'name': name, 'value': default})

        def setMetaNode(self, name, value):
            node = self.getMetaNode(name)
            node.set('value', value)

        def write(self, path, updateTool=True):
            self.setMetaNode('saved', etree.getISOTime())
            if updateTool:
                self.setMetaNode('version', VERSION)
                self.setMetaNode('type', TOOL)
            super(etree, self).write(path)
            clean_xml(path)

The function clean_xml is custom written and largely irrelevant: it escapes troublesome characters that ElementTree.write doesn't by default (this is an issue with the file specification we use, rather than with lxml). For questions:
1. Is there a better way to extend the functionality of lxml.etree than inheriting from _ElementTree? Ultimately, I want to change the behavior of the method ElementTree.write and to add the methods ElementTree.getMetaNode and ElementTree.setMetaNode.
2. Is there a better way of incorporating the methods parse, fromstring, SubElement, and Element, rather than just creating thin wrappers that reference the original methods?
3. Does anyone understand what my methods parse and fromstring are doing? I know that my method fromstring has different functionality from lxml.etree.fromstring because I want it to return an ElementTree rather than an Element. These methods were written largely through guess-and-check (oops).
4. Is there a better way to write the return old_etree.SubElement(...) line so it doesn't have to take up two lines? This isn't major, but...
Originally, I tried to just overwrite lxml.etree.ElementTree.write directly, but that throws the error AttributeError: 'lxml.etree._ElementTree' object attribute 'write' is read-only. General critiques are also welcome. I know that I should have comments in here, but everything is pretty straightforward except for the two methods that I don't understand.
Also, I'm not sure if anyone is going to think that using the name etree in order to overwrite lxml.etree is a bad idea. The reason for this is so that I only have to change the import line at the top of any previously written file from from lxml import etree to from my_lxml import etree.
1 Answer
Boilerplate

    class etree(old_etree._ElementTree):

        @staticmethod
        def parse(path):
            old_tdx = old_etree.parse(path)
            new_tdx = etree(old_tdx)
            new_tdx._setroot(old_tdx.getroot())
            return new_tdx

        @staticmethod
        def fromstring(string):
            old_tdx = old_etree.fromstring(string)
            new_tdx = etree(old_tdx.getroottree())
            new_tdx._setroot(old_tdx)
            return new_tdx

In those boilerplate functions, I don't know what "tdx" means. I guess it's linked to the file specification you use.
Those functions are the main problem I have with your code. You mixed up the lxml.etree module with the lxml.etree.ElementTree class. Those four functions (parse, fromstring, SubElement and Element) belong to the module, not to the class! I think you should:
- have a real etree module
- put your current etree class (renamed as ElementTree) in it
- add those functions in the module, not the class
- fix the return type to return what lxml returns
This would make things way less confusing because it would be closer to how lxml.etree and the standard xml.etree work. Your coworkers would be able to pick up your code faster.
You would do from lxml.etree import * and only redefine the functions you want to redefine. You probably can't expect perfect compatibility, but at least the basic API would be the same.
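
A minimal sketch of that layout follows. To keep it runnable without lxml installed, it assumes the standard library's xml.etree.ElementTree as a stand-in for lxml.etree. The module name fileformat_etree.py and the method name get_meta_node are hypothetical, and with lxml the base class would be _ElementTree, so the construction inside parse would need to be adapted. Explicit names are imported here where the star import suggested above would re-export everything.

```python
# Hypothetical module file: fileformat_etree.py
# ASSUMPTION: xml.etree.ElementTree stands in for lxml.etree so that the
# sketch runs with the standard library only.
import io
import xml.etree.ElementTree as _base

# Re-export the unchanged module-level functions instead of wrapping them.
from xml.etree.ElementTree import Element, SubElement, fromstring


class ElementTree(_base.ElementTree):
    """Tree subclass carrying the extra meta-node helper."""

    def get_meta_node(self, name, default=''):
        # Return the matching <meta> child of the root, creating it if absent.
        for meta in self.getroot().findall('meta'):
            if meta.get('name') == name:
                return meta
        return SubElement(self.getroot(), 'meta',
                          {'name': name, 'value': default})


def parse(source):
    """Module-level parse(): returns a tree, like the base function does."""
    return ElementTree(file=source)


# Usage: callers keep the familiar module-level API.
tree = parse(io.StringIO('<doc><meta name="saved" value="old"/></doc>'))
saved = tree.get_meta_node('saved')      # existing node
author = tree.get_meta_node('author')    # created on demand
root = fromstring('<doc/>')              # still returns an Element
```

With this shape, the thin SubElement and Element wrappers disappear, and fromstring keeps the return type that users of the base library expect.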
I don't have much to say about the implementation of the functions apart from the return type: the implementation is ugly, but I did not find a way to make it better. And yes, the functions are quite easy to understand.

        @staticmethod
        def SubElement(*args, **kargs):
            return old_etree.SubElement(*args, **kargs)

        @staticmethod
        def Element(*args, **kargs):
            return old_etree.Element(*args, **kargs)

You would no longer need to do this with my proposal above.
getISOTime

        @staticmethod
        def getISOTime():
            iso_time = datetime.now().isoformat()
            format_time = iso_time.split('.')[0] + 'Z'
            return format_time

Adding the 'Z' means you're saying this is a UTC time, but it's not, since you've used datetime.now() instead of datetime.utcnow(). I live in UTC+4, and the format_time you're returning is 4 hours off for me.
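
One way to produce a timestamp that really is UTC is sketched below. It assumes Python 3 and uses the timezone-aware datetime.now(timezone.utc) rather than the datetime.utcnow() named above (utcnow is deprecated since Python 3.12). The function name get_iso_time is hypothetical.

```python
from datetime import datetime, timezone


def get_iso_time():
    """Return the current UTC time, e.g. '2016-02-24T19:56:07Z'."""
    # Take the current time in UTC and drop the fractional seconds.
    now = datetime.now(timezone.utc).replace(microsecond=0)
    # A literal 'Z' in the format avoids the '+00:00' suffix that
    # isoformat() would add to a timezone-aware datetime.
    return now.strftime('%Y-%m-%dT%H:%M:%SZ')


stamp = get_iso_time()
```

Dropping the microseconds explicitly also avoids relying on split('.'), since isoformat() omits the fractional part entirely when it is zero.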
getMetaNode

        def getMetaNode(self, name, default=''):
            metas = self.findall('meta')
            for meta in metas:
                if meta.get('name') == name:
                    return meta

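You can use lxml's XPath support here if you want to avoid the loop: self.xpath("meta[@name = $name]", name=name), which returns a list of the matching meta nodes. Note the @ in front of name, since name is an attribute of the meta node and not a child element.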
Write

        def write(self, path, updateTool=True):
            self.setMetaNode('saved', etree.getISOTime())
            if updateTool:
                self.setMetaNode('version', VERSION)
                self.setMetaNode('type', TOOL)
            super(etree, self).write(path)
            clean_xml(path)

Nitpick: cleaning at the ElementTree level, before writing, would prevent your filesystem from seeing two different versions of the file. Say at some point you decide to use watchdog: the callback will kick in before you have a chance to run clean_xml, which could cause subtle bugs.
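
If the cleaning has to remain a post-processing step on the written file, a possible alternative (not something the original code does) is to write and clean a temporary file and only then move it into place. The sketch below assumes two hypothetical callables, write and clean, that stand in for ElementTree.write and clean_xml and each take a file path. A watcher observing the whole directory would still see events for the temporary file.

```python
import os
import tempfile


def write_then_clean(path, write, clean):
    """Write and clean a temporary file, then move it over `path`.

    `write` and `clean` are hypothetical callables taking a file path;
    they stand in for ElementTree.write and clean_xml.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so that os.replace
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    os.close(fd)
    try:
        write(tmp_path)
        clean(tmp_path)
        # Readers of `path` only ever see the cleaned version.
        os.replace(tmp_path, path)
    except BaseException:
        # Something failed before the move: discard the temporary file.
        os.remove(tmp_path)
        raise
```

Inside the write method this could be called as write_then_clean(path, super(etree, self).write, clean_xml), assuming both accept a plain path argument as they do in the question's code.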
Answering your questions
- Is there a better way to extend the functionality of lxml.etree than inheriting from _ElementTree? Ultimately, I want to be changing the behavior of the method ElementTree.write and to add the methods ElementTree.getMetaNode and ElementTree.setMetaNode.
Since lxml is written in Cython, I think you can't monkeypatch lxml directly anyway, so subclassing is the way to go. lxml could have chosen to make things easier by using __new__, just like numpy does to facilitate ndarray subclassing.
- Is there a better way of incorporating the methods parse, fromstring, SubElement, and Element, rather than just creating thin wrappers that reference the original methods?
See "Boilerplate" above.
- Does anyone understand what my methods parse and fromstring are doing? I know that my method fromstring has different functionality from lxml.etree.fromstring because I want it to return an ElementTree rather than an Element. These methods were written largely through guess-and-check (oops).
Why do you want to get an ElementTree rather than an Element? I think it's a bad idea for two reasons:
- It's very easy to get an ElementTree from an Element.
- It's already hard enough to reason about the type of object you get when using lxml; if we can't rely on our knowledge of ElementTree/lxml, it's only going to get worse.
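
To illustrate the first point: with lxml, the question's own code already calls getroottree() on an Element to obtain its tree. The runnable sketch below assumes the standard library's xml.etree.ElementTree as a stand-in, where the equivalent step is wrapping the Element in an ElementTree.

```python
import xml.etree.ElementTree as ET

# fromstring() returns an Element, as in lxml.
root = ET.fromstring('<doc><meta name="type" value="ExampleTool"/></doc>')

# Wrapping the Element in a tree is a one-liner ...
tree = ET.ElementTree(root)

# ... and going back to the Element is just as short.
same_root = tree.getroot()
```

Since the conversion is this cheap in both directions, callers that need a tree can build one themselves, and fromstring can keep returning an Element.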
- Is there a better way to write the return old_etree.SubElement(...) line so it doesn't have to take up two lines? This isn't major, but...
If anything, I think I would use more than two lines. If you want to stop worrying about such things, I highly recommend yapf.
Originally, I tried to just overwrite lxml.etree.ElementTree.write directly, but that throws the error AttributeError: 'lxml.etree._ElementTree' object attribute 'write' is read-only. General critiques are also welcomed. I know that I should have comments in here, but everything is pretty straight-forward except for the two methods that I don't understand.
Yep, you can't monkeypatch it (as said in the answer to your first question), but that's not what you wanted to do anyway, since you wanted to keep the original implementation too. A decorator would not have improved things either.
Naming
Also, I'm not sure if anyone is going to think that using the name etree in order to overwrite lxml.etree is a bad idea. The reason for this is so that I only have to change the import line at the top of any previously written file from from lxml import etree to from my_lxml import etree.
Nope, I think using the name etree is a good idea. It's a common way to provide replacements in Python, and this is the way the standard library's xml.etree does things. However, you need to do things correctly. Again, see the "Boilerplate" section above.
Oh, and PEP 8 recommends against underscores in module names unless they improve readability, but I think mylxml is readable enough. However, simply adding my in front of a module name is not such a good idea, because you just lost an opportunity to explain what makes your lxml different. Since it seems specific to the file specification you use, why not use that in the name?
- Thanks for all that, I'll look into it more tomorrow when I'm on my work PC. As for the mylxml bit, the module is named after the file specification; I renamed it for posting here. Regarding the "cleaning at the element tree" comment, this doesn't seem possible to do beforehand as some characters get escaped "improperly" when writing the XML. (I haven't been able to figure out what is going on; however some, but not all, nodes want things like < escaped in one form rather than another, and updating prewriting has not solved this issue.) Definitely going to implement everything else though. – Jared Goguen, Feb 24, 2016 at 19:56
- etree name is not ultimately a bad idea, but the reason may be bad. You could just as well do from my_lxml import etree_wrap as etree.