6

im not a coder but i need to implement a simple HTML parser.

After a simple research i was able to implement as a given example:

from lxml import html
import requests
page = requests.get('https://URL.COM')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')
print 'Buyers: ', buyers
print 'Prices: ', prices

How can i use tree.xpath to parse all words ending with ".com.br" and starting with "://"

asked Aug 10, 2018 at 14:08
4
  • 1
    can you add html dummy snippet? Commented Aug 10, 2018 at 14:11
  • 1
    why do you need to implement your own? just use bs4 you need external libraries anyway so why don't use bs4 instead of lxml? Commented Aug 10, 2018 at 14:19
  • 1
    That is not how xpath parsing works - you parse using first the document structure, not contents! If the "words ending with .com.br and starting with ://" are actually links (<a href="..."> tags), you can use xpath to extract all links, then filter the ones you want. Commented Aug 10, 2018 at 14:19
  • stackoverflow.com/q/73937173/extract-data-from-html-in-python, stackoverflow.com/q/63455933/… Commented Jun 27, 2025 at 2:06

1 Answer 1

10

As @nosklo pointed out here, you are looking for href tags and the associated links. A parse tree will be organized by the html elements themselves, and you find text by searching those elements specifically. For urls, this would look like so (using the lxml library in python 3.6):

from lxml import etree
from io import StringIO
import requests
# Set explicit HTMLParser
parser = etree.HTMLParser()
page = requests.get('https://URL.COM')
# Decode the page content from bytes to string
html = page.content.decode("utf-8")
# Create your etree with a StringIO object which functions similarly
# to a fileHandler
tree = etree.parse(StringIO(html), parser=parser)
# Call this function and pass in your tree
def get_links(tree):
 # This will get the anchor tags <a href...>
 refs = tree.xpath("//a")
 # Get the url from the ref
 links = [link.get('href', '') for link in refs]
 # Return a list that only ends with .com.br
 return [l for l in links if l.endswith('.com.br')]
# Example call
links = get_links(tree)
answered Aug 10, 2018 at 18:47
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.