Extending the functionality of lxml.etree

Question 1

I wrote a class to slightly customize the behavior of lxml.etree.ElementTree and I use it quite extensively. It works great, but there are a few methods that I'm not sure how I wrote, and there are a few other methods that seem awfully redundant. I'll elaborate on specific questions below, but first here's the code.

import lxml.etree as old_etree
from xml_utilities import clean_xml
from datetime import datetime
VERSION = 'X.X.X.X'
TOOL = 'ExampleTool'
class etree(old_etree._ElementTree):
 @staticmethod
 def parse(path): 
 old_tdx = old_etree.parse(path)
 new_tdx = etree(old_tdx)
 new_tdx._setroot(old_tdx.getroot())
 return new_tdx
 @staticmethod
 def fromstring(string):
 old_tdx = old_etree.fromstring(string)
 new_tdx = etree(old_tdx.getroottree())
 new_tdx._setroot(old_tdx)
 return new_tdx
 @staticmethod
 def getISOTime():
 iso_time = datetime.now().isoformat()
 format_time = iso_time.split('.')[0] + 'Z'
 return format_time
 @staticmethod
 def SubElement(*args, **kargs):
 return old_etree.SubElement(*args, **kargs)
 @staticmethod
 def Element(*args, **kargs):
 return old_etree.Element(*args, **kargs)
 def getMetaNode(self, name, default=''):
 metas = self.findall('meta')
 for meta in metas:
 if meta.get('name') == name:
 return meta
 return old_etree.SubElement(self.getroot(), 'meta', 
 attrib={'name': name, 'value': default})
 def setMetaNode(self, name, value):
 node = self.getMetaNode(name)
 node.set('value', value)
 def write(self, path, updateTool=True): 
 self.setMetaNode('saved', etree.getISOTime())
 if updateTool:
 self.setMetaNode('version', VERSION)
 self.setMetaNode('type', TOOL)
 super(etree, self).write(path)
 clean_xml(path)

The function clean_xml is custom written and largely irrelevant: It escapes troublesome characters that ElementTree.write doesn't by default (this is an issue with the file specification we use, rather than with lxml). For questions:

Is there a better way to extend the functionality of lxml.etree than inheriting from _ElementTree? Ultimately, I want to be changing the behavior of the method ElementTree.write and to add the methods ElementTree.getMetaNode and ElementTree.setMetaNode.
Is there a better way of incorporating the methods parse, fromstring, SubElement, and Element, rather than just creating thin wrappers that reference the original methods?
Does anyone understand what my methods parse and fromstring are doing? I know that my method fromstring has different functionality from lxml.etree.fromstring because I want it to return an ElementTree rather than an Element. These methods were written largely through guess-and-check (oops).
Is there a better way to write the return old_etree.SubElement(...) line so it doesn't have to take up two lines? This isn't major, but...

Originally, I tried to just overwrite lxml.etree.ElementTree.write directly, but that throws the error AttributeError: 'lxml.etree._ElementTree' object attribute 'write' is read-only. General critiques are also welcomed. I know that I should have comments in here, but everything is pretty straight-forward except for the two methods that I don't understand.

Also, I'm not sure if anyone is going to think that using the name etree in order to overwrite lxml.etree is a bad idea. The reason for this is so that I only have to change the import line at the top of any previously written file from from lxml import etree to from my_lxml import etree.

Question 2

The choice of the etree name is not ultimately a bad idea, but the reason may be bad. You could just as well do from my_lxml import etree_wrap as etree.

Question 3

Boilerplate

class etree(old_etree._ElementTree):
 @staticmethod
 def parse(path): 
 old_tdx = old_etree.parse(path)
 new_tdx = etree(old_tdx)
 new_tdx._setroot(old_tdx.getroot())
 return new_tdx
 @staticmethod
 def fromstring(string):
 old_tdx = old_etree.fromstring(string)
 new_tdx = etree(old_tdx.getroottree())
 new_tdx._setroot(old_tdx)
 return new_tdx

In those boilerplate functions, I don't know what "tdx" means. I guess it's linked to the file specification you use.

Those functions are the main problem I have with your code. You mixed the lxml.etree module with the lxml.etree.ElementTree class. Those four methods are methods of the module, not the class! I think you should:

have a real etree module
put your current etree class (renamed as ElementTree) in it
add those functions in the module, not the class
fix the return type to return what lxml returns

This would make things way less confusing because it would be closer to how lxml.etree and the standard xml.etree work. Your coworkers would be able to pick your code faster.

You would do from lxml.etree import * and only redefine the functions you want to redefine. Your probably can't expect a perfect compatibility, but at least the basic API would be the same.

I don't have much to say about the implementation of the functions apart from the return type: ugly but I did not find a way to make things better. And yes, it's quite easy to understand them.

 @staticmethod
 def SubElement(*args, **kargs):
 return old_etree.SubElement(*args, **kargs)
 @staticmethod
 def Element(*args, **kargs):
 return old_etree.Element(*args, **kargs)

You no longer would need do to this with my proposal above.

getISOTime

 @staticmethod
 def getISOTime():
 iso_time = datetime.now().isoformat()
 format_time = iso_time.split('.')[0] + 'Z'
 return format_time

Adding the 'Z' means you're saying this is an UTC time, but it's not since you've used datetime.now() instead of datetime.utcnow(). I live in UTC+4, and the format_time you're returning is 4 hours off for me.

getMetaNode

 def getMetaNode(self, name, default=''):
 metas = self.findall('meta')
 for meta in metas:
 if meta.get('name') == name:
 return meta

You can use lxml XPath support here if you want to avoid the loop: meta.xpath("meta[name = $name]", name=name).

Write

 def write(self, path, updateTool=True): 
 self.setMetaNode('saved', etree.getISOTime())
 if updateTool:
 self.setMetaNode('version', VERSION)
 self.setMetaNode('type', TOOL)
 super(etree, self).write(path)
 clean_xml(path)

nitpick: Cleaning at the ElementTree would prevent your filesystem to see two different versions. Say at some point you decide to use watchdog, the callback will kick in before you have a chance to run clean_xml, which could cause subtle bugs.

Answering your questions

Is there a better way to extend the functionality of lxml.etree than inheriting from _ElementTree? Ultimately, I want to be changing the behavior of the method ElementTree.write and to add the methods ElementTree.getMetaNode and ElementTree.setMetaNode.

Since lxml is written in Cython, I think you can't monkeypatch lxml directly anyway, so subclassing is the way to go. lxml could have chosen to make things easier by using __new__ just like numpy does for facilitating ndarray subclassing.

Is there a better way of incorporating the methods parse, fromstring, SubElement, and Element, rather than just creating thin wrappers that reference the original methods?

See "Boilerplate" above.

Does anyone understand what my methods parse and fromstring are doing? I know that my method fromstring has different functionality from lxml.etree.fromstring because I want it to return an ElementTree rather than an Element. These methods were written largely through guess-and-check (oops).

Why do you want to get an ElementTree rather than an Element? I think it's a bad idea for two reasons:

It's very easy to get an ElementTree from an Element.
It's already hard enough to reason about the type of object you get when using lxml, if we can't use our knowledge about ElementTree/lxml it's only going to get worse.

Is there a better way to write the return old_etree.SubElement(...) line so it doesn't have to take up two lines? This isn't major, but...

If anything, I think I would use more than two lines. If you want to stop worrying about such things, I highly recommand yapf.

Originally, I tried to just overwrite lxml.etree.ElementTree.write directly, but that throws the error AttributeError: 'lxml.etree._ElementTree' object attribute 'write' is read-only. General critiques are also welcomed. I know that I should have comments in here, but everything is pretty straight-forward except for the two methods that I don't understand.

Yep, you can't monkeypatch it (as said in answer to 1.), but that's not what you wanted to do, since you wanted to keep the original implementation too. And a decorator would not have improved things.

Naming

Also, I'm not sure if anyone is going to think that using the name etree in order to overwrite lxml.etree is a bad idea. The reason for this is so that I only have to change the import line at the top of any previously written file from from lxml import etree to from my_lxml import etree.

Nope, I think using the name etree is a good idea. It's a common way to provide replacements in Python, and this is way the standard library xml.etree does things. However, you need to do things correctly. Again, see the "Boilerplate" section above.

Oh, and PEP 8 recommends against underscores in module names unless it improves readability, but I think mylxml is readable enough. However, simply adding my in front of a module is not such a good idea, because you just lost an opportunity to explain what makes your lxml different. Since it seems specific to the file specification you use, why not use that in the name?

Question 4

Thanks for all that, I'll look into it more tomorrow when I'm on my work PC. As for the mylxml bit, the module is named after the file specification; I renamed it for posting here. Regarding the "cleaning at the element tree" comment, this doesn't seem possible to do beforehand as some character's get escaped "improperly" when writing the XML. (I haven't been able to figure out what is going on; however some, but not all, nodes want things like < escaped as < rather than < and updating prewriting has not solved this issue.) Definitely going to implement everything else though.

Quentin Pradet Quentin Pradet 7,0641 gold badge25 silver badges44 bronze badges · Accepted Answer · 2016-02-24 13:37:24Z

Boilerplate

class etree(old_etree._ElementTree):
 @staticmethod
 def parse(path): 
 old_tdx = old_etree.parse(path)
 new_tdx = etree(old_tdx)
 new_tdx._setroot(old_tdx.getroot())
 return new_tdx
 @staticmethod
 def fromstring(string):
 old_tdx = old_etree.fromstring(string)
 new_tdx = etree(old_tdx.getroottree())
 new_tdx._setroot(old_tdx)
 return new_tdx

In those boilerplate functions, I don't know what "tdx" means. I guess it's linked to the file specification you use.

Those functions are the main problem I have with your code. You mixed the lxml.etree module with the lxml.etree.ElementTree class. Those four methods are methods of the module, not the class! I think you should:

have a real etree module
put your current etree class (renamed as ElementTree) in it
add those functions in the module, not the class
fix the return type to return what lxml returns

This would make things way less confusing because it would be closer to how lxml.etree and the standard xml.etree work. Your coworkers would be able to pick your code faster.

You would do from lxml.etree import * and only redefine the functions you want to redefine. Your probably can't expect a perfect compatibility, but at least the basic API would be the same.

I don't have much to say about the implementation of the functions apart from the return type: ugly but I did not find a way to make things better. And yes, it's quite easy to understand them.

 @staticmethod
 def SubElement(*args, **kargs):
 return old_etree.SubElement(*args, **kargs)
 @staticmethod
 def Element(*args, **kargs):
 return old_etree.Element(*args, **kargs)

You no longer would need do to this with my proposal above.

getISOTime

 @staticmethod
 def getISOTime():
 iso_time = datetime.now().isoformat()
 format_time = iso_time.split('.')[0] + 'Z'
 return format_time

Adding the 'Z' means you're saying this is an UTC time, but it's not since you've used datetime.now() instead of datetime.utcnow(). I live in UTC+4, and the format_time you're returning is 4 hours off for me.

getMetaNode

 def getMetaNode(self, name, default=''):
 metas = self.findall('meta')
 for meta in metas:
 if meta.get('name') == name:
 return meta

You can use lxml XPath support here if you want to avoid the loop: meta.xpath("meta[name = $name]", name=name).

Write

 def write(self, path, updateTool=True): 
 self.setMetaNode('saved', etree.getISOTime())
 if updateTool:
 self.setMetaNode('version', VERSION)
 self.setMetaNode('type', TOOL)
 super(etree, self).write(path)
 clean_xml(path)

nitpick: Cleaning at the ElementTree would prevent your filesystem to see two different versions. Say at some point you decide to use watchdog, the callback will kick in before you have a chance to run clean_xml, which could cause subtle bugs.

Answering your questions

Is there a better way to extend the functionality of lxml.etree than inheriting from _ElementTree? Ultimately, I want to be changing the behavior of the method ElementTree.write and to add the methods ElementTree.getMetaNode and ElementTree.setMetaNode.

Since lxml is written in Cython, I think you can't monkeypatch lxml directly anyway, so subclassing is the way to go. lxml could have chosen to make things easier by using __new__ just like numpy does for facilitating ndarray subclassing.

Is there a better way of incorporating the methods parse, fromstring, SubElement, and Element, rather than just creating thin wrappers that reference the original methods?

See "Boilerplate" above.

Does anyone understand what my methods parse and fromstring are doing? I know that my method fromstring has different functionality from lxml.etree.fromstring because I want it to return an ElementTree rather than an Element. These methods were written largely through guess-and-check (oops).

Why do you want to get an ElementTree rather than an Element? I think it's a bad idea for two reasons:

It's very easy to get an ElementTree from an Element.
It's already hard enough to reason about the type of object you get when using lxml, if we can't use our knowledge about ElementTree/lxml it's only going to get worse.

Is there a better way to write the return old_etree.SubElement(...) line so it doesn't have to take up two lines? This isn't major, but...

If anything, I think I would use more than two lines. If you want to stop worrying about such things, I highly recommand yapf.

Originally, I tried to just overwrite lxml.etree.ElementTree.write directly, but that throws the error AttributeError: 'lxml.etree._ElementTree' object attribute 'write' is read-only. General critiques are also welcomed. I know that I should have comments in here, but everything is pretty straight-forward except for the two methods that I don't understand.

Yep, you can't monkeypatch it (as said in answer to 1.), but that's not what you wanted to do, since you wanted to keep the original implementation too. And a decorator would not have improved things.

Naming

Also, I'm not sure if anyone is going to think that using the name etree in order to overwrite lxml.etree is a bad idea. The reason for this is so that I only have to change the import line at the top of any previously written file from from lxml import etree to from my_lxml import etree.

Nope, I think using the name etree is a good idea. It's a common way to provide replacements in Python, and this is way the standard library xml.etree does things. However, you need to do things correctly. Again, see the "Boilerplate" section above.

Oh, and PEP 8 recommends against underscores in module names unless it improves readability, but I think mylxml is readable enough. However, simply adding my in front of a module is not such a good idea, because you just lost an opportunity to explain what makes your lxml different. Since it seems specific to the file specification you use, why not use that in the name?

Thanks for all that, I'll look into it more tomorrow when I'm on my work PC. As for the mylxml bit, the module is named after the file specification; I renamed it for posting here. Regarding the "cleaning at the element tree" comment, this doesn't seem possible to do beforehand as some character's get escaped "improperly" when writing the XML. (I haven't been able to figure out what is going on; however some, but not all, nodes want things like < escaped as < rather than < and updating prewriting has not solved this issue.) Definitely going to implement everything else though.

Stack Exchange Network

Extending the functionality of lxml.etree

1 Answer 1

Boilerplate

getISOTime

getMetaNode

Write

Answering your questions

Naming

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Extending the functionality of lxml.etree

1 Answer 1

Boilerplate

getISOTime

getMetaNode

Write

Answering your questions

Naming

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions