I wrote a Python script for scraping a website. Please review my code and suggest any changes, or make me aware of my blunders/mistakes. I wrote almost the same script for several other websites as well, so can you also suggest a way to combine all of those scripts into one, so that I end up with one single merged output file?
```python
import requests
import json
import pandas as pd


def geturl():
    urls = [
        # List of URLs
    ]
    main = []
    id = 0
    for url in urls:
        r = requests.get(url)
        data = json.loads(r.content)
        print(r.status_code)
        items = data['items']
        baseurl = # URL
        for item in items:
            data = {}
            data['id'] = id
            data['Title'] = item['name']
            data["Price"] = item['price']
            data['Detai Page'] = baseurl + item['slug']
            data['Image'] = item['thumb_image']
            id += 1
            main.append(data)
    sr = pd.Series(main)
    sr.to_json('data.json', orient='records')


geturl()
```
- Are you able to share any real URLs, or at least markup from the real pages? – Reinderien, Mar 21, 2022 at 18:15
- This has nothing to do with functional programming, web scraping or BeautifulSoup. – Reinderien, Mar 21, 2022 at 21:58
2 Answers
- When using Requests to fetch JSON content, it's usually more convenient to use the `response.json()` method instead of manually passing `response.content` to `json.loads`.

- As far as I can tell, you're using Pandas to turn a dictionary into a JSON string and write it to a file, and the structure doesn't change along the way. That seems to me like a strange reason to use Pandas. It seems to me like you could get the same results in a simpler way by doing something like

  ```python
  with open("data.json", "w") as output_file:
      json.dump(main, output_file)
  ```

  That said, I somewhat dislike not having the ability to just fetch the data without also writing it to a file. Personally I'd have a separate function to fetch the data, which could in turn be called by the larger `geturl` function.

- Hard-coding URLs and output paths makes your function less reusable than it could be. Consider having it take such things as parameters instead.

- Right now, the last line of your script calls `geturl` not just whenever the script itself is run, but also whenever this file is imported as part of a larger program. That's awkward, since I'm sure this function might be useful elsewhere too. You can avoid having the code run when imported by putting it in an `if __name__ == '__main__'` block.

- Creating `data` as an empty `dict` and adding elements one by one is a fine approach. However, for `dict`s like this one, which are small and have consistent structures and simple contents, I often find it neater to just build it all at once with a single `dict` literal. But that's a matter of taste, and your current approach works just fine.

- Re-using the name of a built-in function (such as `id`) for another variable will work, but is generally not considered good practice, so I'd suggest renaming the `id` variable for that reason.
Putting that all together, you might end up with something like:
```python
import requests
import json


def get_item_data(urls, detail_base_url):
    main = []
    item_id = 0
    for url in urls:
        r = requests.get(url)
        data = r.json()
        for item in data['items']:
            item_data = {
                'id': item_id,
                'Title': item['name'],
                'Price': item['price'],
                'Detai Page': detail_base_url + item['slug'],
                'Image': item['thumb_image'],
            }
            item_id += 1
            main.append(item_data)
    return main


def save_item_data(urls, detail_base_url, output_filename):
    item_data = get_item_data(urls, detail_base_url)
    with open(output_filename, 'w') as output_file:
        json.dump(item_data, output_file)


if __name__ == '__main__':
    # TODO: We might want to get these from the command line.
    # The "argparse" module would be useful for that.
    urls = [...]
    detail_base_url = ...
    output_file_name = 'data.json'
    save_item_data(urls, detail_base_url, output_file_name)
```
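Building on `get_item_data` from the code above, one possible way to handle the question's other ask (combining several per-site scripts into a single merged output file) might look like the following sketch. The `SITES` configuration is a hypothetical placeholder; the real URL lists and base URLs would have to be filled in:

```python
# Continuing in the same module as get_item_data above,
# where json is already imported.

# Hypothetical per-site configuration; the URL lists and base URLs are
# placeholders, not real values.
SITES = [
    {'urls': [...], 'detail_base_url': ...},  # first site
    {'urls': [...], 'detail_base_url': ...},  # second site
]


def save_all_item_data(sites, output_filename):
    """Fetch items from every configured site and write one merged JSON file."""
    merged = []
    for site in sites:
        merged.extend(get_item_data(site['urls'], site['detail_base_url']))
    # get_item_data numbers items from 0 on every call, so renumber the
    # merged list to keep the ids unique across sites.
    for new_id, item in enumerate(merged):
        item['id'] = new_id
    with open(output_filename, 'w') as output_file:
        json.dump(merged, output_file)


if __name__ == '__main__':
    save_all_item_data(SITES, 'merged_data.json')
```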
When you work with requests and JSON transformations, you should always wrap that code in a try/except block.
```python
r = requests.get(url)
data = json.loads(r.content)
print(r.status_code)
```
should be rewritten as
```python
try:
    response = requests.get(url)
    if response.ok:
        data = response.json()
except (requests.exceptions.RequestException, json.decoder.JSONDecodeError) as exc:
    print(exc)
```
Those are the minimal changes you need to make when working with APIs using JSON transformations and the requests library.
Hope it helps
- This is... not right. Universally catching and printing exceptions is the opposite of good practice. – Reinderien, Mar 22, 2022 at 13:05
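Along the lines of that comment, a more conventional pattern is to surface failures rather than printing and swallowing them. A minimal sketch (the `fetch_items` helper is hypothetical, not from either answer) using `raise_for_status` and letting exceptions propagate to the caller:

```python
import requests


def fetch_items(url):
    """Hypothetical helper: fetch one page of items, raising on failure."""
    # raise_for_status() turns HTTP error responses (4xx/5xx) into exceptions,
    # and network problems already raise requests.exceptions.RequestException,
    # so the caller can decide whether to retry, skip the URL, or abort.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()['items']
```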