I'm working on pulling some information from a Word document. It contains multiple tables, and I want to get one specific table. The table has no fixed position in the document, so I can't refer to it by a specific [x] index; I need to search for it.

If I do:

import lxml.etree
root = lxml.etree.parse("document.xml")
element=root.xpath(".//*[contains(text(), 'some_searched_text')]")

This finds the data in the document, but the element variable holds only the single matching element, and I cannot access the other elements of the table in which that text is located. I need to extract the other text from the table where some_searched_text is located. How do I get a parent element that I can browse? How do I get an XPath location from the element variable that I can work with?

e.g.,

<w:tbl>
<w:tblPr>
...
</w:tblPr>
<w:tblGrid>
...
</w:tblGrid>
<w:tr w:rsidR="0042513B" w14:paraId="390A7EAC" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="4A577133" w14:textId="41AF034D" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>categories</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="179E9017" w14:textId="479091DC" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Opis</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
</w:tr>
<w:tr w:rsidR="0042513B" w14:paraId="3F2BE8B0" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="62AA6ECA" w14:textId="5CF66E6C" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Jakis</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>sobie</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>description of my tasks</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="3F692136" w14:textId="440E2FD5" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:r>
<w:t>some_searched_text</w:t>
</w:r>
<w:r>
<w:br/>
<w:t>some random text and description that i want to extract from this table</w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
<w:tr w:rsidR="0042513B" w14:paraId="6CCF2C45" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="128A7A23" w14:textId="4C1B124E" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:r>
<w:t>this is the other text that I want to pull out</w:t>
</w:r>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="657C4E27" w14:textId="38BF07A2" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:r>
<w:t>KAT1</w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
</w:tbl>

How do I get, based on that search, the enclosing w:tbl element so that I can iterate over it? Again, the table can be anywhere in the Word document, so its position will change, and I need to locate it relative to text that is always there.

If I add

table = element.getparent()

then I get an error on this line:

AttributeError: 'list' object has no attribute 'getparent'
  • What text do you mean? Your example data doesn't have any text inside the tags, only attributes. Commented Apr 9, 2025 at 19:36
  • It would be better to create minimal working code with example data, so we could simply copy and test it. Commented Apr 9, 2025 at 19:38
  • lxml has the function .getparent() to get the parent element. Commented Apr 9, 2025 at 19:41
  • xpath() returns a list of all matching elements (even if there is only one element, or none, in which case the list is empty). You may have to use a for-loop to work with every element in this list, or use element[0] to work only with the first element (but first check that the list is not empty with if len(element) > 0: or, more simply, if element:). Commented Apr 10, 2025 at 12:25
  • Maybe first use print() (and print(type(...)), print(len(...)), etc.) to see which part of the code is executed and what you really have in the variables. This is called "print debugging", and it helps to see what the code is really doing. Commented Apr 10, 2025 at 12:29

1 Answer

First: xpath() always returns a list of all matching items (it may contain a single element, or be empty if nothing matches), so you need element[0] to work with the first item. To be safe, first check if len(element) > 0 or, more simply, if element:

Once you have a single element, you can use .getparent() to get its parent. Sometimes you need to call it several times, like .getparent().getparent().getparent()

You can also use .getnext() to get the next sibling element.

This gives code like this:

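# ns is the namespace map for the w: prefix (defined in the full code below)
elements = root.xpath(".//*[contains(text(), 'some_searched_text')]")
# print(type(elements))
if not elements:
    print('no matching element(s)')
else:
    # <w:r> that holds the matching <w:t>
    parent = elements[0].getparent()
    print('parent:', parent)
    # next sibling <w:r> in the same paragraph
    next_run = parent.getnext()
    item = next_run.find('.//w:t', namespaces=ns)
    # print(' tag :', item.tag)
    print(' text:', item.text)
    # climb <w:r> -> <w:p> -> <w:tc> -> <w:tr>
    grandparent = parent.getparent().getparent().getparent()
    print('grandparent:', grandparent)
    # next table row
    next_row = grandparent.getnext()
    item = next_row.find('.//w:t', namespaces=ns)
    # print(' tag :', item.tag)
    print(' text:', item.text)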

To make it more flexible, you could call .getparent() in a loop and check .tag until you reach the correct (grand)parent.
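
A minimal sketch of that idea, assuming a small stand-in document that uses the same placeholder namespace URI as the test data. The helper name find_ancestor is hypothetical, and it uses a while-loop rather than a for-loop because the number of levels is not known in advance:

```python
import lxml.etree

# Assumption: placeholder namespace URI, same as in the test data of this answer.
W = "http://www.example.com/w"
xml = (
    f'<w:body xmlns:w="{W}">'
    "<w:tbl><w:tr><w:tc><w:p><w:r>"
    "<w:t>some_searched_text</w:t>"
    "</w:r></w:p></w:tc></w:tr></w:tbl>"
    "</w:body>"
)
root = lxml.etree.fromstring(xml)


def find_ancestor(element, tag):
    """Walk up with .getparent() until an element with the given tag is found.

    Returns None if the top of the tree is reached without a match.
    """
    node = element.getparent()
    while node is not None and node.tag != tag:
        node = node.getparent()
    return node


hits = root.xpath(".//*[contains(text(), 'some_searched_text')]")
table = None
if hits:
    # lxml stores tags in Clark notation: {namespace-uri}localname
    table = find_ancestor(hits[0], f"{{{W}}}tbl")
    print(table.tag)
```

Because the helper stops on the tag rather than on a fixed number of .getparent() calls, it should keep working if Word nests the text one level deeper or shallower.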


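Full working code used for tests, with example data directly in the code. The namespace URIs are placeholders, because the question doesn't show the namespace declarations; use the ones declared in your document.xml.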

text = """<w:body xmlns:w="http://www.example.com/w" xmlns:w14="http://www.example.com/w14">
<w:tbl>
<w:tblPr>
...
</w:tblPr>
<w:tblGrid>
...
</w:tblGrid>
<w:tr w:rsidR="0042513B" w14:paraId="390A7EAC" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="4A577133" w14:textId="41AF034D" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>categories</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="179E9017" w14:textId="479091DC" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Opis</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
</w:tr>
<w:tr w:rsidR="0042513B" w14:paraId="3F2BE8B0" w14:textId="77777777" w:rsidTr="0042513B">
 <w:tc>
 <w:tcPr>
 <w:tcW w:w="4698" w:type="dxa"/>
 </w:tcPr>
 <w:p w14:paraId="62AA6ECA" w14:textId="5CF66E6C" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
 <w:proofErr w:type="spellStart"/>
 <w:r>
 <w:t>Jakis</w:t>
 </w:r>
 <w:proofErr w:type="spellEnd"/>
 <w:r>
 <w:t xml:space="preserve"> </w:t>
 </w:r>
 <w:proofErr w:type="spellStart"/>
 <w:r>
 <w:t>sobie</w:t>
 </w:r>
 <w:proofErr w:type="spellEnd"/>
 <w:r>
 <w:t xml:space="preserve"> </w:t>
 </w:r>
 <w:proofErr w:type="spellStart"/>
 <w:r>
 <w:t>description of my tasks</w:t>
 </w:r>
 <w:proofErr w:type="spellEnd"/>
 </w:p>
 </w:tc>
 <w:tc>
 <w:tcPr>
 <w:tcW w:w="4698" w:type="dxa"/>
 </w:tcPr>
 <w:p w14:paraId="3F692136" w14:textId="440E2FD5" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
 <w:r>
 <w:t>some_searched_text</w:t>
 </w:r>
 <w:r>
 <w:br/>
 <w:t>some random text and description that i want to extract from this table</w:t>
 </w:r>
 </w:p>
 </w:tc>
</w:tr>
<w:tr w:rsidR="0042513B" w14:paraId="6CCF2C45" w14:textId="77777777" w:rsidTr="0042513B">
 <w:tc>
 <w:tcPr>
 <w:tcW w:w="4698" w:type="dxa"/>
 </w:tcPr>
 <w:p w14:paraId="128A7A23" w14:textId="4C1B124E" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
 <w:r>
 <w:t>this is the other text that I want to pull out</w:t>
 </w:r>
 </w:p>
 </w:tc>
 <w:tc>
 <w:tcPr>
 <w:tcW w:w="4698" w:type="dxa"/>
 </w:tcPr>
 <w:p w14:paraId="657C4E27" w14:textId="38BF07A2" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
 <w:r>
 <w:t>KAT1</w:t>
 </w:r>
 </w:p>
 </w:tc>
</w:tr>
</w:tbl>
</w:body>"""
import lxml.etree

# root = lxml.etree.parse("document.xml")
# namespaces: map each prefix used in find() to its URI
ns = {'w': "http://www.example.com/w", 'w14': "http://www.example.com/w14"}
root = lxml.etree.fromstring(text)

elements = root.xpath(".//*[contains(text(), 'some_searched_text')]")
# print(type(elements))
if not elements:
    print('no matching element(s)')
else:
    # <w:r> that holds the matching <w:t>
    parent = elements[0].getparent()
    print('parent:', parent)
    # next sibling <w:r> in the same paragraph
    next_run = parent.getnext()
    item = next_run.find('.//w:t', namespaces=ns)
    # print(' tag :', item.tag)
    print(' text:', item.text)
    # climb <w:r> -> <w:p> -> <w:tc> -> <w:tr>
    grandparent = parent.getparent().getparent().getparent()
    print('grandparent:', grandparent)
    # next table row
    next_row = grandparent.getnext()
    item = next_row.find('.//w:t', namespaces=ns)
    # print(' tag :', item.tag)
    print(' text:', item.text)

Result:

parent: <Element {http://www.example.com/w}r at 0x76af127e27c0>
 text: some random text and description that i want to extract from this table
grandparent: <Element {http://www.example.com/w}tr at 0x76af127e2700>
 text: this is the other text that I want to pull out
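
To get the whole table rather than a neighbouring row, one possible variation (a sketch, not tested against a real Word file) is to let XPath climb for you with the ancestor axis and then iterate over rows and cells. The shortened data and the placeholder namespace URI below are assumptions for illustration only:

```python
import lxml.etree

# Assumption: placeholder namespace URI, same as in the test data above.
W = "http://www.example.com/w"
ns = {"w": W}
xml = f"""<w:body xmlns:w="{W}">
<w:tbl>
<w:tr>
<w:tc><w:p><w:r><w:t>categories</w:t></w:r></w:p></w:tc>
<w:tc><w:p><w:r><w:t>Opis</w:t></w:r></w:p></w:tc>
</w:tr>
<w:tr>
<w:tc><w:p><w:r><w:t>some_searched_text</w:t></w:r><w:r><w:br/><w:t>more text</w:t></w:r></w:p></w:tc>
<w:tc><w:p><w:r><w:t>KAT1</w:t></w:r></w:p></w:tc>
</w:tr>
</w:tbl>
</w:body>"""
root = lxml.etree.fromstring(xml)

# ancestor::w:tbl[1] selects the nearest enclosing table of the matching <w:t>
tables = root.xpath(
    ".//w:t[contains(text(), 'some_searched_text')]/ancestor::w:tbl[1]",
    namespaces=ns,
)

rows = []
if tables:
    for tr in tables[0].findall("w:tr", namespaces=ns):
        # Word splits a cell's text across several runs, so join every <w:t> in the cell
        cells = [
            "".join(tc.xpath(".//w:t/text()", namespaces=ns))
            for tc in tr.findall("w:tc", namespaces=ns)
        ]
        rows.append(cells)

for row in rows:
    print(row)
```

Each entry of rows is then a list with one string per cell, which you can search or index as needed.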

BeautifulSoup can also parse XML (using its lxml-based "xml" parser), and it has many more useful navigation functions.

answered Apr 11, 2025 at 0:50