I'm working on pulling some information from a Word document. It contains multiple tables, and I want to get one specific table that has no fixed location in the document, so it cannot be referred to by a specific [x] index. What I need to do is search for it.
If I do:
import lxml.etree
root = lxml.etree.parse("document.xml")
element=root.xpath(".//*[contains(text(), 'some_searched_text')]")
This finds the data within the document, but I get only one element in the element variable and I cannot access the other elements in the table where that text is located. I need to extract the other text from the table in which some_searched_text is located. How do I get a parent element that I can browse? How do I get an XPath, starting from the location held in the element variable, that I can work with?
e.g.,
<w:tbl>
<w:tblPr>
...
</w:tblPr>
<w:tblGrid>
...
</w:tblGrid>
<w:tr w:rsidR="0042513B" w14:paraId="390A7EAC" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="4A577133" w14:textId="41AF034D" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>categories</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="179E9017" w14:textId="479091DC" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Opis</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
</w:tr>
<w:tr w:rsidR="0042513B" w14:paraId="3F2BE8B0" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="62AA6ECA" w14:textId="5CF66E6C" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Jakis</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>sobie</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>description of my tasks</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="3F692136" w14:textId="440E2FD5" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:r>
<w:t>some_searched_text</w:t>
</w:r>
<w:r>
<w:br/>
<w:t>some random text and description that i want to extract from this table</w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
<w:tr w:rsidR="0042513B" w14:paraId="6CCF2C45" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="128A7A23" w14:textId="4C1B124E" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:r>
<w:t>this is the other text that I want to pull out</w:t>
</w:r>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="657C4E27" w14:textId="38BF07A2" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:r>
<w:t>KAT1</w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
</w:tbl>
How do I get, based on that search, the w:tbl element so that I can iterate over it? Again, the table can be anywhere in the Word document, so its position will change, and I need to search for a position relative to text that is always there.
If I add
table=element.getpartent()
then I get an error on this line that says
AttributeError: 'list' object has no attribute 'getpartent'
1 Answer
First: xpath() always returns a list with all matching items (it can be a list with a single element, or an empty list if nothing matches), so you may need to use element[0] to work with the first item. To be safe, check first with if len(element) > 0: or, more simply, if element:.
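A minimal sketch of this behaviour, using a tiny made-up XML string (the tags root, a and b are placeholders for illustration, not taken from the Word document):

```python
import lxml.etree

# Tiny made-up document, used only to show what xpath() returns.
root = lxml.etree.fromstring("<root><a>foo</a><a>foo bar</a><b>baz</b></root>")

found = root.xpath(".//*[contains(text(), 'foo')]")
missing = root.xpath(".//*[contains(text(), 'nothing')]")

print(type(found), len(found))  # a list, even when several items match
print(missing)                  # an empty list when nothing matches

if found:                       # same idea as: if len(found) > 0:
    first = found[0]            # pick one element out of the list
    print(first.tag, first.text)
```

Methods such as .getparent() exist on the individual elements, not on the list, which is why the list has to be indexed or looped over first.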
Once you have a single element, you can use .getparent() to get its parent. Sometimes you need to call it several times, like .getparent().getparent().getparent()
You can also use .getnext() to get the next sibling element.
And this gives me code like this (ns is the namespace mapping defined in the full example below):
elements = root.xpath(".//*[contains(text(), 'some_searched_text')]")
# print(type(elements))
if not elements:
    print('no matching element(s)')
else:
    # parent of the matching w:t is its w:r run
    parent = elements[0].getparent()
    print('parent:', parent)
    next_element = parent.getnext()
    item = next_element.find('.//w:t', namespaces=ns)
    # print(' tag :', item.tag)
    print(' text:', item.text)
    # w:r -> w:p -> w:tc -> w:tr
    grandparent = parent.getparent().getparent().getparent()
    print('grandparent:', grandparent)
    next_element = grandparent.getnext()
    item = next_element.find('.//w:t', namespaces=ns)
    # print(' tag :', item.tag)
    print(' text:', item.text)
To make it more flexible, you could call .getparent() in a loop and check .tag until you reach the correct (grand)parent.
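The loop idea could be sketched roughly as follows. The helper name find_ancestor, the shortened sample XML and the placeholder namespace http://www.example.com/w are assumptions made for illustration; a real document.xml uses its own namespace URI.

```python
import lxml.etree

W = "http://www.example.com/w"  # placeholder namespace (assumption)

xml = """<w:body xmlns:w="http://www.example.com/w">
  <w:tbl>
    <w:tr><w:tc><w:p><w:r><w:t>header</w:t></w:r></w:p></w:tc></w:tr>
    <w:tr><w:tc><w:p><w:r><w:t>some_searched_text</w:t></w:r></w:p></w:tc></w:tr>
  </w:tbl>
</w:body>"""

root = lxml.etree.fromstring(xml)
elements = root.xpath(".//*[contains(text(), 'some_searched_text')]")


def find_ancestor(element, tag):
    """Hypothetical helper: climb with getparent() until the tag matches."""
    current = element.getparent()
    while current is not None and current.tag != tag:
        current = current.getparent()
    return current  # None if no such ancestor exists


# lxml stores namespaced tags as "{namespace-uri}localname".
table_tag = f"{{{W}}}tbl"
table = find_ancestor(elements[0], table_tag) if elements else None
print(table)

# lxml also offers iterancestors(), which performs the same climb.
same_table = next(elements[0].iterancestors(table_tag), None) if elements else None
```

With this approach the number of .getparent() calls no longer has to be hard-coded, so it keeps working if the text sits at a different nesting depth.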
Full working code used for tests - with example data directly in code.
text = """<w:body xmlns:w="http://www.example.com/w" xmlns:w14="http://www.example.com/w14">
<w:tbl>
<w:tblPr>
...
</w:tblPr>
<w:tblGrid>
...
</w:tblGrid>
<w:tr w:rsidR="0042513B" w14:paraId="390A7EAC" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="4A577133" w14:textId="41AF034D" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>categories</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="179E9017" w14:textId="479091DC" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Opis</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
</w:tr>
<w:tr w:rsidR="0042513B" w14:paraId="3F2BE8B0" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="62AA6ECA" w14:textId="5CF66E6C" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Jakis</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>sobie</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>description of my tasks</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="3F692136" w14:textId="440E2FD5" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:r>
<w:t>some_searched_text</w:t>
</w:r>
<w:r>
<w:br/>
<w:t>some random text and description that i want to extract from this table</w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
<w:tr w:rsidR="0042513B" w14:paraId="6CCF2C45" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="128A7A23" w14:textId="4C1B124E" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:r>
<w:t>this is the other text that I want to pull out</w:t>
</w:r>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="657C4E27" w14:textId="38BF07A2" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:r>
<w:t>KAT1</w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
</w:tbl>
</w:body>"""
import lxml.etree

# root = lxml.etree.parse("document.xml")
ns = {'w': "http://www.example.com/w", 'w14': "http://www.example.com/w14"}  # namespaces
root = lxml.etree.fromstring(text)

elements = root.xpath(".//*[contains(text(), 'some_searched_text')]")
# print(type(elements))
if not elements:
    print('no matching element(s)')
else:
    parent = elements[0].getparent()
    print('parent:', parent)
    next_element = parent.getnext()
    item = next_element.find('.//w:t', namespaces=ns)
    # print(' tag :', item.tag)
    print(' text:', item.text)
    grandparent = parent.getparent().getparent().getparent()
    print('grandparent:', grandparent)
    next_element = grandparent.getnext()
    item = next_element.find('.//w:t', namespaces=ns)
    # print(' tag :', item.tag)
    print(' text:', item.text)
Result:
parent: <Element {http://www.example.com/w}r at 0x76af127e27c0>
text: some random text and description that i want to extract from this table
grandparent: <Element {http://www.example.com/w}tr at 0x76af127e2700>
text: this is the other text that I want to pull out
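Since the question asks for the whole w:tbl element to iterate over, here is a hedged sketch that reaches the table in one step with the XPath ancestor axis and then collects the text of every cell. The shortened sample XML and the placeholder namespace are assumptions for illustration, in the same spirit as the test data above.

```python
import lxml.etree

ns = {"w": "http://www.example.com/w"}  # placeholder namespace (assumption)

xml = """<w:body xmlns:w="http://www.example.com/w">
<w:tbl>
<w:tr><w:tc><w:p><w:r><w:t>categories</w:t></w:r></w:p></w:tc><w:tc><w:p><w:r><w:t>Opis</w:t></w:r></w:p></w:tc></w:tr>
<w:tr><w:tc><w:p><w:r><w:t>some_searched_text</w:t></w:r></w:p></w:tc><w:tc><w:p><w:r><w:t>KAT</w:t></w:r><w:r><w:t>1</w:t></w:r></w:p></w:tc></w:tr>
</w:tbl>
</w:body>"""

root = lxml.etree.fromstring(xml)

# ancestor::w:tbl[1] selects the nearest enclosing table of the matching w:t.
tables = root.xpath(
    ".//w:t[contains(text(), 'some_searched_text')]/ancestor::w:tbl[1]",
    namespaces=ns,
)

rows = []
if tables:
    for tr in tables[0].findall("w:tr", namespaces=ns):
        cells = []
        for tc in tr.findall("w:tc", namespaces=ns):
            # A cell's text may be split across several w:t runs; join them.
            cells.append("".join(tc.xpath(".//w:t/text()", namespaces=ns)))
        rows.append(cells)

print(rows)
```

The result is still a list, so the same if tables: check applies before indexing it.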
I don't know whether BeautifulSoup can work with XML, but it has many more useful functions.
What text do you mean? Your example data doesn't have any text in tags, only attributes in tags.
lxml has the function .getparent() to get the parent element.
xpath() gives a list with all matching elements (even if it is only one element, or an empty list), and you may have to use a for-loop to work with every element on this list, or use element[0] to work only with the first element (but check whether there is any element on the list with if len(element) > 0: or, more simply, if element:).
Use print() (and print(type(...)), print(len(...)), etc.) to see which part of the code is executed and what you really have in variables. It is called "print debugging" and it helps to see what the code is really doing.