Scraper for downloading and saving images from web page

Question

I've written some Python 3 code that scrapes movie names and links to the movie posters from a web page, then downloads the posters and saves them to the local drive.

I have used two functions to accomplish the whole task, and I've tried my best to keep the process clean. It is working great now. Any suggestions for improving this script would be very helpful. Thanks in advance. Here is the working code:

import requests
from lxml import html
import os

url = "https://www.yify-torrent.org/search/1080p/"

def ImageScraper(link):
    response = requests.session().get(link).text
    tree = html.fromstring(response)
    for title in tree.xpath('//div[@class="mv"]'):
        movie_title = title.findtext('.//h3/a')
        image_url = title.xpath('.//img/@src')[0]
        image_url = "https:" + image_url
        image_name = image_url.split('/')[-1]
        SavingImages(movie_title, image_name, image_url)

def SavingImages(movie_name, item_name, item_link):
    response = requests.session().get(item_link, stream = True)
    if response.status_code == 200:
        os.chdir(r"C:\Users\ar\Desktop\mth")
        with open(item_name, 'wb') as f:
            for chunk in response.iter_content(1024):
                f.write(chunk)
        print(movie_name, item_link)

ImageScraper(url)

Answer

I would focus on the following things specifically:

Variable and function naming:
- use the lower_case_with_underscores naming convention for functions and variables
- consider renaming title to movie and movie_title to title; I think that would be a bit more descriptive
- response should probably be named page_source, since it is not a Response instance but the text of the response

Use of spaces and line breaks:
- according to the PEP 8 style guide, top-level function definitions should be separated by two blank lines
- when passing a keyword argument to a function, don't put spaces around the =

Code organization:
- I would use a class to share a web-scraping session and have it parameterized with a URL and a download directory. I think that would be more modular.
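
To make the naming points concrete, here is a minimal sketch of just the parsing step with the suggested names (page_source, movie, title). It is an illustration, not the poster's code: the markup is a made-up, well-formed stand-in for the real page, the function name extract_movies is hypothetical, and the standard library's xml.etree.ElementTree is swapped in for lxml so the snippet runs without third-party packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in markup shaped like the div.mv blocks being scraped.
# ElementTree needs well-formed XML; real pages usually need lxml's HTML parser.
page_source = """
<html><body>
  <div class="mv">
    <h3><a href="/movie/1">First Movie (2017) 1080p</a></h3>
    <img src="//img.example.com/posters/first.jpg"/>
  </div>
  <div class="mv">
    <h3><a href="/movie/2">Second Movie (2016) 1080p</a></h3>
    <img src="//img.example.com/posters/second.jpg"/>
  </div>
</body></html>
"""


def extract_movies(page_source):
    """Return a list of (title, image_url) pairs, one per movie block."""
    tree = ET.fromstring(page_source)
    movies = []
    for movie in tree.findall(".//div[@class='mv']"):
        title = movie.findtext('.//h3/a')
        # src is protocol-relative here, so the scheme is prepended
        image_url = "https:" + movie.find('.//img').get('src')
        movies.append((title, image_url))
    return movies


for title, image_url in extract_movies(page_source):
    print(title, image_url)
```

With these names, each loop variable says what it holds: movie is the element for one block, and title is the text pulled out of it.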

Improved code:

import os

import requests
from lxml import html


class ImageScraper:
    def __init__(self, url, download_path):
        self.url = url
        self.download_path = download_path
        # one session shared by the page request and all image downloads
        self.session = requests.Session()

    def scrape_images(self):
        # .text is the page markup, not a Response object, hence the name
        page_source = self.session.get(self.url).text
        tree = html.fromstring(page_source)
        for movie in tree.xpath('//div[@class="mv"]'):
            title = movie.findtext('.//h3/a')
            image_url = "https:" + movie.xpath('.//img/@src')[0]
            image_name = image_url.split('/')[-1]
            self.save_image(title, image_name, image_url)

    def save_image(self, movie_name, file_name, item_link):
        response = self.session.get(item_link, stream=True)
        if response.status_code == 200:
            # write into the configured directory instead of calling os.chdir
            with open(os.path.join(self.download_path, file_name), 'wb') as image_file:
                for chunk in response.iter_content(1024):
                    image_file.write(chunk)
            print(movie_name, file_name)


if __name__ == '__main__':
    scraper = ImageScraper(url="https://www.yify-torrent.org/search/1080p/",
                           download_path=r"C:\Users\ar\Desktop\mth")
    scraper.scrape_images()
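
As an optional follow-up, the URL and file handling could be pulled into small helpers that can be tested without touching the network. This is only a sketch under stated assumptions: the helper names (resolve_image_url, image_file_name, save_chunks) and the example src value are hypothetical, and urljoin is used here as an alternative to prepending "https:" by hand.

```python
import os
import posixpath
import tempfile
from urllib.parse import urljoin, urlsplit


def resolve_image_url(page_url, src):
    """Resolve an <img src> value against the URL of the page it came from."""
    return urljoin(page_url, src)


def image_file_name(image_url):
    """Return the last path segment of the URL, ignoring any query string."""
    return posixpath.basename(urlsplit(image_url).path)


def save_chunks(chunks, download_path, file_name):
    """Write an iterable of byte chunks to download_path/file_name."""
    os.makedirs(download_path, exist_ok=True)  # create the directory if missing
    target = os.path.join(download_path, file_name)
    with open(target, 'wb') as image_file:
        for chunk in chunks:
            image_file.write(chunk)
    return target


# Hypothetical values, for illustration only.
page_url = "https://www.yify-torrent.org/search/1080p/"
src = "//img.example.com/posters/first.jpg?size=large"

image_url = resolve_image_url(page_url, src)
file_name = image_file_name(image_url)

with tempfile.TemporaryDirectory() as tmp_dir:
    # The list of bytes stands in for response.iter_content(1024).
    saved_path = save_chunks([b"fake ", b"image ", b"bytes"],
                             os.path.join(tmp_dir, "posters"), file_name)
    print(image_url, file_name, os.path.exists(saved_path))
```

Inside the class, save_image could then pass response.iter_content(1024) and self.download_path to a helper like save_chunks, which keeps the file-writing logic separate from the HTTP call.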

Comment

Thanks, sir alecxe, for such a great review. It's very nice to get a new working script from you every time you review my code. Next time I'll be more careful about the mistakes I keep repeating. Thanks once again.

Answer by alecxe · Accepted Answer · 2017-07-13 20:30:07Z

