3
\$\begingroup\$

I've written some code using Python 3 to scrape movie names, links to the movie posters, and finally save the pictures on the local drive after downloading them from a web page.

I have used two functions to accomplish the whole task. I've tried my best to make the process clean. It is working great now. Any suggestion as to the betterment of this script will be very helpful for me. Thanks in advance. Here is the working code:

import requests
from lxml import html
import os
url = "https://www.yify-torrent.org/search/1080p/"
def ImageScraper(link):
 response = requests.session().get(link).text
 tree = html.fromstring(response)
 for title in tree.xpath('//div[@class="mv"]'):
 movie_title = title.findtext('.//h3/a')
 image_url = title.xpath('.//img/@src')[0]
 image_url = "https:" + image_url
 image_name = image_url.split('/')[-1]
 SavingImages(movie_title, image_name, image_url)
def SavingImages(movie_name, item_name, item_link):
 response = requests.session().get(item_link, stream = True)
 if response.status_code == 200:
 os.chdir(r"C:\Users\ar\Desktop\mth")
 with open(item_name, 'wb') as f:
 for chunk in response.iter_content(1024):
 f.write(chunk)
 print(movie_name, item_link)
ImageScraper(url)
Phrancis
20.5k6 gold badges69 silver badges155 bronze badges
asked Jul 13, 2017 at 20:14
\$\endgroup\$

1 Answer 1

3
\$\begingroup\$

I would focus on the following things specifically:

  • variable and function naming:
    • use lower_case_with_underscores naming convention
    • what if we rename title to movie and movie_title to title - I think that would be a bit more descriptive
    • response should probably be named page_source since it is not a Response instance but already the text of the response
  • use of spaces and line breaks:
    • according to PEP8 coding style, you should have 2 line breaks between the functions
    • when passing a keyword argument to a function, don't put spaces around the =
  • code organization:
    • I would use a class to share a web-scraping session and have it parameterized with a url and a download directory. I think that would be more modular.

Improved code:

import os
import requests
from lxml import html
class ImageScraper:
 def __init__(self, url, download_path):
 self.url = url
 self.download_path = download_path
 self.session = requests.Session()
 def scrape_images(self):
 response = self.session.get(self.url).text
 tree = html.fromstring(response)
 for movie in tree.xpath('//div[@class="mv"]'):
 title = movie.findtext('.//h3/a')
 image_url = "https:" + movie.xpath('.//img/@src')[0]
 image_name = image_url.split('/')[-1]
 self.save_image(title, image_name, image_url)
 def save_image(self, movie_name, file_name, item_link):
 response = self.session.get(item_link, stream=True)
 if response.status_code == 200:
 with open(os.path.join(self.download_path, file_name), 'wb') as image_file:
 for chunk in response.iter_content(1024):
 image_file.write(chunk)
 print(movie_name, file_name)
if __name__ == '__main__':
 scraper = ImageScraper(url="https://www.yify-torrent.org/search/1080p/",
 download_path=r"C:\Users\ar\Desktop\mth")
 scraper.scrape_images()
answered Jul 13, 2017 at 20:30
\$\endgroup\$
1
  • \$\begingroup\$ Thanks sir alecxe for such a great review. It's very nice to get a new working script from your end every time you make a review. Next time I'll be more careful about the mistakes I'm making monotonously. Thanks once again. \$\endgroup\$ Commented Jul 13, 2017 at 20:47

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.