I've written a Python 3 script that scrapes movie names and links to the movie posters from a web page, then downloads the posters and saves them to the local drive.
I have used two functions to accomplish the whole task and tried my best to keep the process clean. It is working fine now. Any suggestions for improving this script would be very helpful. Thanks in advance. Here is the working code:
import requests
from lxml import html
import os

url = "https://www.yify-torrent.org/search/1080p/"

def ImageScraper(link):
    response = requests.session().get(link).text
    tree = html.fromstring(response)
    for title in tree.xpath('//div[@class="mv"]'):
        movie_title = title.findtext('.//h3/a')
        image_url = title.xpath('.//img/@src')[0]
        image_url = "https:" + image_url
        image_name = image_url.split('/')[-1]
        SavingImages(movie_title, image_name, image_url)

def SavingImages(movie_name, item_name, item_link):
    response = requests.session().get(item_link, stream = True)
    if response.status_code == 200:
        os.chdir(r"C:\Users\ar\Desktop\mth")
        with open(item_name, 'wb') as f:
            for chunk in response.iter_content(1024):
                f.write(chunk)
        print(movie_name, item_link)

ImageScraper(url)
1 Answer
I would focus on the following things specifically:
- variable and function naming:
  - use the `lower_case_with_underscores` naming convention
  - what if we rename `title` to `movie` and `movie_title` to `title`? I think that would be a bit more descriptive
  - `response` should probably be named `page_source` since it is not a `Response` instance but already the `text` of the response
- use of spaces and line breaks:
  - according to the PEP8 coding style, you should have 2 blank lines between the functions
  - when passing a keyword argument to a function, don't put spaces around the `=`
- code organization:
  - I would use a class to share a web-scraping session and have it parameterized with a url and a download directory. I think that would be more modular.
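The naming and spacing points above can be sketched in isolation. This is only an illustration that runs offline: `FakeResponse`, `fetch_page` and `extract_title` are hypothetical names that are not part of the reviewed script, and the stub stands in for a real `requests` call.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical stand-in for a requests Response, for illustration only."""
    status_code: int
    text: str


def fetch_page(url, timeout=10):
    # A real script would call session.get(url, timeout=timeout) here;
    # this stub ignores its arguments and returns canned HTML.
    return FakeResponse(status_code=200, text='<h3><a>Some Movie</a></h3>')


def extract_title(page_source):
    # Deliberately naive string slicing; the reviewed code uses lxml instead.
    start = page_source.index('<a>') + len('<a>')
    end = page_source.index('</a>')
    return page_source[start:end]


response = fetch_page('https://example.org/', timeout=5)  # a response object
page_source = response.text                               # the text of that response
title = extract_title(page_source)
print(title)
```

Note the `lower_case_with_underscores` function names, the two blank lines between top-level definitions, and `timeout=5` written without spaces around the `=`.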
Improved code:
import os

import requests
from lxml import html


class ImageScraper:
    def __init__(self, url, download_path):
        self.url = url
        self.download_path = download_path
        # one session shared by the page request and all image downloads
        self.session = requests.Session()

    def scrape_images(self):
        page_source = self.session.get(self.url).text
        tree = html.fromstring(page_source)

        for movie in tree.xpath('//div[@class="mv"]'):
            title = movie.findtext('.//h3/a')
            image_url = "https:" + movie.xpath('.//img/@src')[0]
            image_name = image_url.split('/')[-1]

            self.save_image(title, image_name, image_url)

    def save_image(self, movie_name, file_name, item_link):
        response = self.session.get(item_link, stream=True)

        if response.status_code == 200:
            # write the image to the download directory in 1024-byte chunks
            with open(os.path.join(self.download_path, file_name), 'wb') as image_file:
                for chunk in response.iter_content(1024):
                    image_file.write(chunk)

            print(movie_name, file_name)


if __name__ == '__main__':
    scraper = ImageScraper(url="https://www.yify-torrent.org/search/1080p/",
                           download_path=r"C:\Users\ar\Desktop\mth")
    scraper.scrape_images()
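As a possible follow-up, the two URL steps in the code above (prefixing `"https:"` and taking the last path segment as the file name) could be isolated in small helpers built on the standard library. This is a sketch, not part of the reviewed script: the helper names `absolute_image_url` and `image_file_name` and the sample `src` value are made up for illustration, and it assumes the page may serve either scheme-relative or absolute image links.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def absolute_image_url(page_url, src):
    """Resolve an <img src> value against the page it came from."""
    return urljoin(page_url, src)


def image_file_name(image_url):
    """Return the last path segment of the URL, ignoring any query string."""
    return posixpath.basename(urlsplit(image_url).path)


page_url = "https://www.yify-torrent.org/search/1080p/"
src = "//img.example.org/posters/some-movie.jpg?v=2"  # made-up sample value

image_url = absolute_image_url(page_url, src)
file_name = image_file_name(image_url)
print(image_url)
print(file_name)
```

Unlike plain string concatenation, `urljoin` leaves an already absolute `src` unchanged, and splitting on the parsed path keeps a query string out of the saved file name.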
Thanks sir alecxe for such a great review. It's very nice to get a new working script from your end every time you make a review. Next time I'll be more careful about the mistakes I'm making monotonously. Thanks once again. – SIM, Jul 13, 2017 at 20:47