I'm using ElementTree to extract data from HTML whose structure has evolved over time (see samples below).

I'm currently doing this by using iterfind to match each of the different structures (font/b, b/font, font).

But I've noticed a general pattern: regardless of the specific set of HTML elements in use, the innermost text of the div's first child is the color, that of the second child is the pet type, and that of the third child is the name.

Is there a generic way of doing this with ElementTree? That would make my code simpler, and possibly more future-proof.

<div>
 <font><b>Brown</b></font><a>Cat</a><font><b>Larry</b></font>
</div>
<div>
 <b><font>White</font></b><i><a>Poodle</a></i><b><font>Foxy</font></b>
</div>
<div>
 <font><i>Tabby</i></font><a><i>Cat</i></a><font>Tempi</font>
</div>
asked Dec 10, 2021 at 10:28

2 Answers


How about something like this:

import xml.etree.ElementTree as ET

pets = """<body><div>
 <font><b>Brown</b></font><a>Cat</a><font><b>Larry</b></font>
</div>
<div>
 <b><font>White</font></b><i><a>Poodle</a></i><b><font>Foxy</font></b>
</div>
<div>
 <font><i>Tabby</i></font><a><i>Cat</i></a><font>Tempi</font>
</div></body>"""
animals = []
doc = ET.fromstring(pets)
for pet in doc.findall('.//div'):
    # './/*' matches every descendant of the div, in document order;
    # keep the text of only those elements that directly contain text
    animals.append([animal.text for animal in pet.findall('.//*') if animal.text])
animals  # displayed when run in a REPL or notebook

Output:

[['Brown', 'Cat', 'Larry'],
 ['White', 'Poodle', 'Foxy'],
 ['Tabby', 'Cat', 'Tempi']]
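
If named fields are more convenient than a nested list, the same './/*' idea can be wrapped in a small helper. This is only a sketch: the function name pet_fields and the shortened sample string are made up for illustration, and it assumes every div yields at least three non-blank text values, in the order color, pet type, name.

```python
import xml.etree.ElementTree as ET

# Shortened sample (illustrative), same shape as the question's HTML
sample = """<body><div>
 <font><b>Brown</b></font><a>Cat</a><font><b>Larry</b></font>
</div>
<div>
 <font><i>Tabby</i></font><a><i>Cat</i></a><font>Tempi</font>
</div></body>"""

def pet_fields(div):
    """Return (color, pet_type, name) from the first three text-bearing descendants."""
    texts = [el.text.strip()
             for el in div.findall('.//*')
             if el.text and el.text.strip()]  # skip empty and whitespace-only text
    color, pet_type, name = texts[:3]  # raises ValueError if fewer than three
    return color, pet_type, name

records = [pet_fields(div) for div in ET.fromstring(sample).findall('.//div')]
print(records)
```

Stripping the text also guards against elements whose text is only whitespace, which a plain `if animal.text` check would let through.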
answered Dec 10, 2021 at 13:28

1 Comment

aah, './/*' - that's clever.

This code appears to work:

# div is one <div> Element from the parsed document
items = div.itertext()
textblocks = []
for item in items:
    trimmed = item.strip()
    if len(trimmed) > 0:
        textblocks.append(trimmed)
color = textblocks[0]
pet_type = textblocks[1]
name = textblocks[2]
print(color + ', ' + pet_type + ', ' + name)

I welcome any improvements to the code.
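
One possible tightening, as a sketch only: the strip-and-filter loop can be folded into a list comprehension. The helper name text_blocks is hypothetical, and the example parses a single sample div directly rather than a whole document.

```python
import xml.etree.ElementTree as ET

def text_blocks(element):
    """Return every non-blank text fragment under element, stripped, in document order."""
    return [t.strip() for t in element.itertext() if t.strip()]

# One of the sample divs from the question
div = ET.fromstring(
    "<div>\n <b><font>White</font></b><i><a>Poodle</a></i><b><font>Foxy</font></b>\n</div>"
)
color, pet_type, name = text_blocks(div)  # assumes exactly three text fragments
print(', '.join((color, pet_type, name)))
```

Note that itertext() also yields tail text (text that follows a child's closing tag), so stray text between elements would appear as extra fragments and break the three-way unpacking.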

answered Dec 10, 2021 at 11:46

1 Comment

It was a suggested answer. A better answer has since been provided by +jack-fleeting.
