I wrote a Python script for scraping a website. Please review my code and suggest any changes, or make me aware of my blunders/mistakes. I wrote almost the same script for several other websites as well, so can you also suggest a way to combine all of those scripts into one, so that I end up with one single merged output file?
```python
import requests
import json
import pandas as pd


def geturl():
    urls = [
        # List of URLs
    ]
    main = []
    id = 0
    for url in urls:
        r = requests.get(url)
        data = json.loads(r.content)
        print(r.status_code)
        items = data['items']
        baseurl = # URL
        for item in items:
            data = {}
            data['id'] = id
            data['Title'] = item['name']
            data["Price"] = item['price']
            data['Detai Page'] = baseurl + item['slug']
            data['Image'] = item['thumb_image']
            id += 1
            main.append(data)
    sr = pd.Series(main)
    sr.to_json('data.json', orient='records')


geturl()
```
- Are you able to share any real URLs, or at least markup from the real pages? – Reinderien, Mar 21, 2022 at 18:15
- This has nothing to do with functional programming, web scraping or BeautifulSoup. – Reinderien, Mar 21, 2022 at 21:58
2 Answers
- When using Requests to fetch JSON content, it's usually more convenient to use the `response.json()` method instead of manually passing `response.content` to `json.loads`.

- As far as I can tell, you're using Pandas to turn a dictionary into a JSON string and write it to a file, and the structure doesn't change along the way. That seems to me like a strange reason to use Pandas. It seems to me like you could get the same results in a simpler way by doing something like

  ```python
  with open("data.json", "w") as output_file:
      json.dump(main, output_file)
  ```

  That said, I somewhat dislike not having the ability to just fetch the data without also writing it to a file. Personally I'd have a separate function to fetch the data, which could in turn be called by the larger `geturl` function.

- Hard-coding URLs and output paths makes your function less reusable than it could be. Consider having it take such things as parameters instead.

- Right now, the last line of your script calls `geturl` not just whenever the script itself is run, but also whenever this file is imported as part of a larger program. That's awkward, since I'm sure this function might be useful elsewhere too. You can avoid having the code run when imported by putting it in an `if __name__ == '__main__'` block.

- Creating `data` as an empty `dict` and adding elements one by one is a fine approach. However, for `dict`s like this one, which are small and have consistent structures and simple contents, I often find it neater to just build it all at once with a single `dict` literal. But that's a matter of taste, and your current approach works just fine.

- Re-using the name of a built-in function (such as `id`) for another variable will work, but is generally not considered good practice, so I'd suggest renaming the `id` variable for that reason.
Putting that all together, you might end up with something like:
```python
import requests
import json


def get_item_data(urls, detail_base_url):
    main = []
    item_id = 0
    for url in urls:
        r = requests.get(url)
        data = r.json()
        for item in data['items']:
            item_data = {
                'id': item_id,
                'Title': item['name'],
                'Price': item['price'],
                'Detai Page': detail_base_url + item['slug'],
                'Image': item['thumb_image'],
            }
            item_id += 1
            main.append(item_data)
    return main


def save_item_data(urls, detail_base_url, output_filename):
    item_data = get_item_data(urls, detail_base_url)
    with open(output_filename, 'w') as output_file:
        json.dump(item_data, output_file)


if __name__ == '__main__':
    # TODO: We might want to get these from the command line.
    # The "argparse" module would be useful for that.
    urls = [...]
    detail_base_url = ...
    output_file_name = 'data.json'
    save_item_data(urls, detail_base_url, output_file_name)
```
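Building on `get_item_data` from the code above, one possible way to handle the question's other ask (combining several per-site scripts into a single merged output file) might look like the following sketch. The `SITES` configuration is a hypothetical placeholder; the real URL lists and base URLs would have to be filled in:

```python
# Continuing in the same module as get_item_data above,
# where json is already imported.

# Hypothetical per-site configuration; the URL lists and base URLs are
# placeholders, not real values.
SITES = [
    {'urls': [...], 'detail_base_url': ...},  # first site
    {'urls': [...], 'detail_base_url': ...},  # second site
]


def save_all_item_data(sites, output_filename):
    """Fetch items from every configured site and write one merged JSON file."""
    merged = []
    for site in sites:
        merged.extend(get_item_data(site['urls'], site['detail_base_url']))
    # get_item_data numbers items from 0 on every call, so renumber the
    # merged list to keep the ids unique across sites.
    for new_id, item in enumerate(merged):
        item['id'] = new_id
    with open(output_filename, 'w') as output_file:
        json.dump(merged, output_file)


if __name__ == '__main__':
    save_all_item_data(SITES, 'merged_data.json')
```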
When you work with requests and JSON transformations, you should always wrap that code in a try/except block.
```python
r = requests.get(url)
data = json.loads(r.content)
print(r.status_code)
```
should be rewritten as
```python
try:
    response = requests.get(url)
    if response.ok:
        data = response.json()
except (requests.exceptions.RequestException, json.decoder.JSONDecodeError) as exc:
    print(exc)
```
Those are the minimal changes you need to make when working with APIs using JSON transformations and the requests library.
Hope it helps
- This is... not right. Universally catching and printing exceptions is the opposite of good practice. – Reinderien, Mar 22, 2022 at 13:05
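Along the lines of that comment, a more conventional pattern is to surface failures rather than printing and swallowing them. A minimal sketch (the `fetch_items` helper is hypothetical, not from either answer) using `raise_for_status` and letting exceptions propagate to the caller:

```python
import requests


def fetch_items(url):
    """Hypothetical helper: fetch one page of items, raising on failure."""
    # raise_for_status() turns HTTP error responses (4xx/5xx) into exceptions,
    # and network problems already raise requests.exceptions.RequestException,
    # so the caller can decide whether to retry, skip the URL, or abort.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()['items']
```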