
I am a seller on Target.com and am trying to scrape the URL for every product in my catalog using Python (Python 3). When I try this I get an empty list for 'urllist', and when I print the variable 'soup', what BS4 has actually collected is the contents of "view page source" (forgive my naiveté here, definitely a novice at this still!). In reality I'd really like to be scraping the URLs from the content found in the "Elements" section of the DevTools panel. I can sift through the HTML on that page manually and find the links, so I know they're in there... I just don't know enough yet to tell BS4 that's the content I want to search. How can I do that?

import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

# Need this part below for HTTPS
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

# Needs that context=ctx argument to deal with HTTPS
url = input('Enter URL: ')
urllist = []
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a'):
    urllist.append(link.get('href'))
print(urllist)

If it helps, I found code that someone developed in JavaScript that can be run from the developer console, which works and grabbed all of my links. But my goal is to be able to do this in Python (Python 3).

var x = document.querySelectorAll("a");
var myarray = [];
for (var i = 0; i < x.length; i++) {
    var nametext = x[i].textContent;
    var cleantext = nametext.replace(/\s+/g, ' ').trim();
    var cleanlink = x[i].href;
    myarray.push([cleantext, cleanlink]);
}
function make_table() {
    var table = '<table><thead><th>Name</th><th>Links</th></thead><tbody>';
    for (var i = 0; i < myarray.length; i++) {
        table += '<tr><td>' + myarray[i][0] + '</td><td>' + myarray[i][1] + '</td></tr>';
    }
    table += '</tbody></table>';
    var w = window.open("");
    w.document.write(table);
}
make_table();
asked Dec 8, 2021 at 21:06

2 Answers


I suspect this is occurring because Target's website (at least, the main page) builds the page content via JavaScript. Your browser renders the page's source code and executes those scripts, but your Python code does no such thing: urllib only fetches the raw HTML the server sends. See this post for help in that regard.
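To make the diagnosis concrete, here is a minimal stdlib-only sketch (html.parser instead of BS4, no network needed, with a made-up HTML fragment): a parser that walks static markup only ever sees anchors present in the served HTML. Anything a script would inject at runtime simply is not there, which is why the asker's urllist comes back empty.

```python
from html.parser import HTMLParser

# Stand-in for what urlopen(...).read() returns: the *static* source.
# Product links on a JS-heavy site are added later, by the script tag.
html = '''<html><body>
<a href="/static-link">Static</a>
<script>/* product anchors injected here at runtime by JavaScript */</script>
</body></html>'''

class LinkCollector(HTMLParser):
    """Collect href attributes from every <a> tag in the static markup."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.append(dict(attrs).get('href'))

parser = LinkCollector()
parser.feed(html)
print(parser.links)  # only the static anchor; JS-injected ones never appear
```

A tool that actually executes JavaScript (a real browser driven by something like Selenium) sees the rendered DOM instead, which is what the "Elements" panel shows.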

answered Dec 8, 2021 at 21:53

1 Comment

This accurately describes and solves my issue, thank you!

Without going into the specifics of your code: fundamentally, if you can make a call to a URL, you already have that URL. If you use the script to scrape one entered URL at a time, each URL could be logged by appending the right entry to urllist (built from the object returned by each link.get('href')).
If you have some other original source (a list?) for the URLs to scrape, those could be added to the urllist object in a similar fashion.

The course of action chosen depends on the actual data structure returned by link.get('href'). Suggestions:

  • If it's a string containing HTML, put that string in a dict under the key 'html', and add another key 'url'
  • If it's already a dict object: just add a key-value pair 'url'.
  • If you want to enter one URL and extract the others from that URL's HTML document, retrieve the HTML and parse it with something like ElementTree

You can do this a number of ways.
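As a rough sketch of the ElementTree suggestion above (assuming the fragment is well-formed XHTML; real product pages usually are not, so an HTML-tolerant parser like BeautifulSoup is generally the safer choice), with an invented fragment standing in for the catalog page:

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for the fetched page.
xhtml = '''<div>
<a href="/p/item-1">Item 1</a>
<a href="/p/item-2">Item 2</a>
</div>'''

root = ET.fromstring(xhtml)
# iter('a') walks the tree and yields every <a> element at any depth.
hrefs = [a.get('href') for a in root.iter('a')]
print(hrefs)  # ['/p/item-1', '/p/item-2']
```

Note that ET.fromstring raises ParseError on the unclosed tags and bare ampersands common in real-world HTML, which is the main trade-off of this approach.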

answered Dec 8, 2021 at 21:28

2 Comments

Sorry, I think I explained my issue poorly! The URL I am putting into the url variable is a webpage that has about 50 URLs on it, and I want to get each of those URLs one at a time and append them to urllist. So I'm making a call to the website that contains my list of products, each of which could be accessed manually by clicking a link. I want that list of links.
I see. Then there are a number of options. You might want to look into using ElementTree to parse the XML and filter out the <a href> elements.
