Python lxml: getting a table from a Word document

Question 1

I'm working on pulling some information from a Word document. It contains multiple tables, and I want to get one specific table. That table has no fixed position in the document, so I can't refer to it by a specific [x] index; I have to search for it.

If I do:

import lxml.etree
root = lxml.etree.parse("document.xml")
element=root.xpath(".//*[contains(text(), 'some_searched_text')]")

This finds the data in the document, but the element variable holds only one element, and I cannot access the other elements of the table where that text is located. I need to extract the other text from the table that contains some_searched_text. How do I get a parent element that I can browse? How do I get an XPath for the location held in the element variable, so that I can work with it?

e.g.,

<w:tbl>
<w:tblPr>
...
</w:tblPr>
<w:tblGrid>
...
</w:tblGrid>
<w:tr w:rsidR="0042513B" w14:paraId="390A7EAC" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="4A577133" w14:textId="41AF034D" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>categories</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="179E9017" w14:textId="479091DC" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Opis</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
</w:tr>
<w:tr w:rsidR="0042513B" w14:paraId="3F2BE8B0" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="62AA6ECA" w14:textId="5CF66E6C" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Jakis</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>sobie</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
<w:r>
<w:t xml:space="preserve"> </w:t>
</w:r>
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>description of my tasks</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="3F692136" w14:textId="440E2FD5" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:r>
<w:t>some_searched_text</w:t>
</w:r>
<w:r>
<w:br/>
<w:t>some random text and description that i want to extract from this table</w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
<w:tr w:rsidR="0042513B" w14:paraId="6CCF2C45" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="128A7A23" w14:textId="4C1B124E" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:r>
<w:t>this is the other text that I want to pull out</w:t>
</w:r>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="657C4E27" w14:textId="38BF07A2" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:r>
<w:t>KAT1</w:t>
</w:r>
</w:p>
</w:tc>
</w:tr>
</w:tbl>

Based on that search, how do I get the enclosing w:tbl element so that I can iterate over it? Again, the table can be anywhere in the Word document, so its position will change, and I need to locate it relative to text that is always there.

If I add

table = element.getparent()

then I get an error on this line:

AttributeError: 'list' object has no attribute 'getparent'

Question 2

What text do you mean? Your example data doesn't have any text in the tags, only attributes.

Question 3

It would be better to create minimal working code with example data, so we could simply copy and test it.

Question 4

lxml has the function .getparent() to get the parent element.

Question 5

xpath() returns a list with all matching elements (even if there is only one element, or an empty list if there are none), so you may have to use a for-loop to work with every element in the list, or use element[0] to work only with the first element (but first check that the list is not empty with if len(element) > 0: or, more simply, if element:).
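A small sketch of that behaviour, using a made-up two-element document (the root and item tags are illustrative only and do not come from the question):

```python
import lxml.etree

# Hypothetical miniature document, only to show what xpath() returns.
root = lxml.etree.fromstring("<root><item>alpha</item><item>beta</item></root>")

found = root.xpath(".//*[contains(text(), 'alpha')]")
missing = root.xpath(".//*[contains(text(), 'gamma')]")

print(type(found).__name__, len(found))      # a list, even for a single match
print(type(missing).__name__, len(missing))  # an empty list, not None

# Guard before indexing, then work with the first match.
if found:
    first = found[0]
    print(first.tag, first.getparent().tag)
```

Calling .getparent() on found itself would raise the AttributeError from the question, because found is a list and not an element.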

Question 6

Maybe first use print() (and print(type(...)), print(len(...)), etc.) to see which part of the code is executed and what you really have in the variables. This is called "print debugging", and it helps to see what the code is really doing.

Answer

First: xpath() always returns a list with all matching items (it can be a list with one element, or an empty list if nothing matches), so you need element[0] to work with the first item (and, to be safe, check if len(element) > 0 or, more simply, if element: beforehand).

Once you have a single element, you can use .getparent() to get its parent. Sometimes you need to call it several times, like .getparent().getparent().getparent().

You can also use .getnext() to get the next sibling element.

That gives me code like this:

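ns = {'w': "http://www.example.com/w"}  # must match the namespace URI declared in your document

elements = root.xpath(".//*[contains(text(), 'some_searched_text')]")
# print(type(elements))
if not elements:
    print('no matching element(s)')
else:
    parent = elements[0].getparent()  # the w:r that holds the matching w:t
    print('parent:', parent)
    next_run = parent.getnext()  # the following w:r in the same paragraph
    item = next_run.find('.//w:t', namespaces=ns)  # the w: prefix needs the namespace map
    # print(' tag :', item.tag)
    print(' text:', item.text)
    grandparent = parent.getparent().getparent().getparent()  # w:r -> w:p -> w:tc -> w:tr
    print('grandparent:', grandparent)
    next_row = grandparent.getnext()  # the following w:tr in the table
    item = next_row.find('.//w:t', namespaces=ns)
    # print(' tag :', item.tag)
    print(' text:', item.text)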

To make it more flexible, you could call .getparent() in a loop and check .tag until you reach the correct ancestor.
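One way to do that, sketched under some assumptions: the namespace URI is the placeholder used in the test data of this answer (a real Word file declares its own URI), the sample XML is a shortened stand-in, and find_ancestor is a hypothetical helper name, not part of lxml. It climbs with .getparent() until the tag matches, then collects the text of every cell in the table.

```python
import lxml.etree

# Placeholder namespace, matching the example data in this answer (assumption).
W = "http://www.example.com/w"
ns = {"w": W}

xml = f"""<w:body xmlns:w="{W}">
<w:tbl>
<w:tr><w:tc><w:p><w:r><w:t>categories</w:t></w:r></w:p></w:tc>
<w:tc><w:p><w:r><w:t>some_searched_text</w:t></w:r></w:p></w:tc></w:tr>
<w:tr><w:tc><w:p><w:r><w:t>other text</w:t></w:r></w:p></w:tc>
<w:tc><w:p><w:r><w:t>KAT1</w:t></w:r></w:p></w:tc></w:tr>
</w:tbl>
</w:body>"""


def find_ancestor(element, tag):
    """Climb with .getparent() until an element with the given tag is found."""
    current = element.getparent()
    while current is not None and current.tag != tag:
        current = current.getparent()
    return current  # None when no such ancestor exists


root = lxml.etree.fromstring(xml)
elements = root.xpath(".//*[contains(text(), 'some_searched_text')]")

rows = []
if elements:
    # lxml stores namespaced tags in "{uri}name" form.
    table = find_ancestor(elements[0], f"{{{W}}}tbl")
    if table is not None:
        for tr in table.findall("w:tr", namespaces=ns):
            # One string per cell: join every w:t inside the cell.
            rows.append([
                "".join(t.text or "" for t in tc.iterfind(".//w:t", namespaces=ns))
                for tc in tr.findall("w:tc", namespaces=ns)
            ])

for row in rows:
    print(row)
```

lxml elements also provide iterancestors(), which could replace the hand-written loop; the explicit version is shown because it mirrors the .getparent() calls above.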

Full working code used for tests - with example data directly in code.

text = """<w:body xmlns:w="http://www.example.com/w" xmlns:w14="http://www.example.com/w14">
<w:tbl>
<w:tblPr>
...
</w:tblPr>
<w:tblGrid>
...
</w:tblGrid>
<w:tr w:rsidR="0042513B" w14:paraId="390A7EAC" w14:textId="77777777" w:rsidTr="0042513B">
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="4A577133" w14:textId="41AF034D" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>categories</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
<w:tc>
<w:tcPr>
<w:tcW w:w="4698" w:type="dxa"/>
</w:tcPr>
<w:p w14:paraId="179E9017" w14:textId="479091DC" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
<w:proofErr w:type="spellStart"/>
<w:r>
<w:t>Opis</w:t>
</w:r>
<w:proofErr w:type="spellEnd"/>
</w:p>
</w:tc>
</w:tr>
<w:tr w:rsidR="0042513B" w14:paraId="3F2BE8B0" w14:textId="77777777" w:rsidTr="0042513B">
 <w:tc>
 <w:tcPr>
 <w:tcW w:w="4698" w:type="dxa"/>
 </w:tcPr>
 <w:p w14:paraId="62AA6ECA" w14:textId="5CF66E6C" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
 <w:proofErr w:type="spellStart"/>
 <w:r>
 <w:t>Jakis</w:t>
 </w:r>
 <w:proofErr w:type="spellEnd"/>
 <w:r>
 <w:t xml:space="preserve"> </w:t>
 </w:r>
 <w:proofErr w:type="spellStart"/>
 <w:r>
 <w:t>sobie</w:t>
 </w:r>
 <w:proofErr w:type="spellEnd"/>
 <w:r>
 <w:t xml:space="preserve"> </w:t>
 </w:r>
 <w:proofErr w:type="spellStart"/>
 <w:r>
 <w:t>description of my tasks</w:t>
 </w:r>
 <w:proofErr w:type="spellEnd"/>
 </w:p>
 </w:tc>
 <w:tc>
 <w:tcPr>
 <w:tcW w:w="4698" w:type="dxa"/>
 </w:tcPr>
 <w:p w14:paraId="3F692136" w14:textId="440E2FD5" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
 <w:r>
 <w:t>some_searched_text</w:t>
 </w:r>
 <w:r>
 <w:br/>
 <w:t>some random text and description that i want to extract from this table</w:t>
 </w:r>
 </w:p>
 </w:tc>
</w:tr>
<w:tr w:rsidR="0042513B" w14:paraId="6CCF2C45" w14:textId="77777777" w:rsidTr="0042513B">
 <w:tc>
 <w:tcPr>
 <w:tcW w:w="4698" w:type="dxa"/>
 </w:tcPr>
 <w:p w14:paraId="128A7A23" w14:textId="4C1B124E" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
 <w:r>
 <w:t>this is the other text that I want to pull out</w:t>
 </w:r>
 </w:p>
 </w:tc>
 <w:tc>
 <w:tcPr>
 <w:tcW w:w="4698" w:type="dxa"/>
 </w:tcPr>
 <w:p w14:paraId="657C4E27" w14:textId="38BF07A2" w:rsidR="0042513B" w:rsidRDefault="0042513B" w:rsidP="0042513B">
 <w:r>
 <w:t>KAT1</w:t>
 </w:r>
 </w:p>
 </w:tc>
</w:tr>
</w:tbl>
</w:body>"""
import lxml.etree

# root = lxml.etree.parse("document.xml")
ns = {'w': "http://www.example.com/w", 'w14': "http://www.example.com/w14"}  # namespaces

root = lxml.etree.fromstring(text)
elements = root.xpath(".//*[contains(text(), 'some_searched_text')]")
# print(type(elements))
if not elements:
    print('no matching element(s)')
else:
    parent = elements[0].getparent()  # the w:r that holds the matching w:t
    print('parent:', parent)
    next_run = parent.getnext()  # the following w:r in the same paragraph
    item = next_run.find('.//w:t', namespaces=ns)
    # print(' tag :', item.tag)
    print(' text:', item.text)
    grandparent = parent.getparent().getparent().getparent()  # w:r -> w:p -> w:tc -> w:tr
    print('grandparent:', grandparent)
    next_row = grandparent.getnext()  # the following w:tr in the table
    item = next_row.find('.//w:t', namespaces=ns)
    # print(' tag :', item.tag)
    print(' text:', item.text)

Result:

parent: <Element {http://www.example.com/w}r at 0x76af127e27c0>
 text: some random text and description that i want to extract from this table
grandparent: <Element {http://www.example.com/w}tr at 0x76af127e2700>
 text: this is the other text that I want to pull out
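As a possible alternative to chaining .getparent(), the XPath ancestor axis can return the enclosing table directly. The sketch below makes several assumptions: it uses the WordprocessingML main namespace that real .docx files declare for the w: prefix instead of the example.com placeholder, the sample XML is invented for the demo, and the in-memory zip stands in for a real .docx (which is a zip archive containing word/document.xml).

```python
import io
import zipfile

import lxml.etree

# WordprocessingML main namespace, bound to the w: prefix in real .docx files.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ns = {"w": W}

document_xml = f"""<w:document xmlns:w="{W}"><w:body>
<w:tbl><w:tr><w:tc><w:p><w:r><w:t>unrelated table</w:t></w:r></w:p></w:tc></w:tr></w:tbl>
<w:tbl>
<w:tr><w:tc><w:p><w:r><w:t>some_searched_text</w:t></w:r></w:p></w:tc></w:tr>
<w:tr><w:tc><w:p><w:r><w:t>KAT1</w:t></w:r></w:p></w:tc></w:tr>
</w:tbl>
</w:body></w:document>"""

# Stand-in for a real file: an in-memory zip holding only word/document.xml.
# This is NOT a valid Word document, just enough to show the reading step.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)
buffer.seek(0)

# With a real file, pass its path instead of `buffer`.
with zipfile.ZipFile(buffer) as archive:
    root = lxml.etree.fromstring(archive.read("word/document.xml"))

# ancestor::w:tbl[1] selects the nearest enclosing table of the matching w:t.
tables = root.xpath(
    ".//w:t[contains(., $needle)]/ancestor::w:tbl[1]",
    namespaces=ns,
    needle="some_searched_text",
)

cell_texts = []
if tables:
    for tc in tables[0].iterfind(".//w:tc", namespaces=ns):
        # itertext() yields every text fragment inside the cell.
        cell_texts.append("".join(tc.itertext()).strip())

print(cell_texts)
```

Because the search text is passed as an XPath variable ($needle), quotes inside it do not need escaping. If tables are nested, .//w:tc also visits the cells of inner tables, so the loop may need narrowing for such documents.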

I don't know whether BeautifulSoup can work with this XML, but it has many more useful functions.

furas · Accepted Answer · 2025-04-11 00:50:17Z

