This code takes a website URL and downloads all the .jpg images on the page. It only supports pages that contain `<img>` elements whose `src` attribute contains a .jpg link.
import random
import urllib.request
import requests
from bs4 import BeautifulSoup


def Download_Image_from_Web(url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    raw_text = r'links.txt'
    with open(raw_text, 'w') as fw:
        for link in soup.findAll('img'):
            image_links = link.get('src')
            if '.jpg' in image_links:
                for i in image_links.split("\\n"):
                    fw.write(i + '\n')
    num_lines = sum(1 for line in open('links.txt'))
    if num_lines == 0:
        print("There is 0 photo in this web page.")
    elif num_lines == 1:
        print("There is", num_lines, "photo in this web page:")
    else:
        print("There are", num_lines, "photos in this web page:")
    k = 0
    while k <= (num_lines-1):
        name = random.randrange(1, 1000)
        fullName = str(name) + ".jpg"
        with open('links.txt', 'r') as f:
            lines = f.readlines()[k]
            urllib.request.urlretrieve(lines, fullName)
            print(lines+fullName+'\n')
        k += 1


Download_Image_from_Web("https://pixabay.com")
- Welcome to Code Review! Please do not update the code in your question to incorporate feedback from answers; doing so goes against the Question + Answer style of Code Review. This is not a forum where you should keep the most updated version in your question. Please see what you may and may not do after receiving answers. – Simon Forsberg, Apr 29, 2017 at 19:27
3 Answers
Unnecessary file operations
This is horribly inefficient:
k = 0
while k <= (num_lines-1):
    name = random.randrange(1, 1000)
    fullName = str(name) + ".jpg"
    with open('links.txt', 'r') as f:
        lines = f.readlines()[k]
        urllib.request.urlretrieve(lines, fullName)
        print(lines+fullName+'\n')
    k += 1
It re-reads the same file `num_lines` times, just to pick out the k-th line on each iteration!
Btw, do you really need to write the list of urls to a file? Why not just keep them in a list? Even if you want the urls in a file, you could keep them in a list in memory and never read that file, only write.
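A minimal sketch of that idea (the variable names are illustrative, not from the question): read the file once up front, or skip the file entirely and keep the list from the scraping step.

import urllib.request

# read all the links once instead of re-opening the file on every iteration
with open('links.txt') as f:
    links = [line.strip() for line in f]

for k, link in enumerate(links):
    full_name = str(k) + ".jpg"
    urllib.request.urlretrieve(link, full_name)
    print(link, '--->', full_name)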
Code organization
Instead of having all the code in a single function that does multiple things, it would be better to organize your program into smaller functions, each with a single responsibility.
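As a sketch of what that split could look like (the function names here are my own suggestions, not from the original code):

import urllib.request
import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    """Fetch the HTML of the page at url."""
    return requests.get(url).text

def extract_jpg_links(html):
    """Return the src of every img element that links to a .jpg."""
    soup = BeautifulSoup(html, "html.parser")
    return [img["src"] for img in soup.find_all("img", src=True)
            if ".jpg" in img["src"]]

def download_image(link, filename):
    """Download a single image to the given file."""
    urllib.request.urlretrieve(link, filename)

def download_images_from_web(url):
    """Orchestrate the steps: fetch, extract, download."""
    for k, link in enumerate(extract_jpg_links(fetch_page(url))):
        download_image(link, "{}.jpg".format(k))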
Python conventions
Python has a well-defined set of coding conventions in PEP8, many of which are violated here. I suggest reading through that document and following it as much as possible.
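For instance, PEP8 calls for snake_case function names (`download_image_from_web` rather than `Download_Image_from_Web`), snake_case variables (`full_name` rather than `fullName`), 4-space indentation, and two blank lines between top-level definitions.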
- A list doesn't work for me; it separates each link into its own list and I don't know how to merge those links into one list. – Salah Eddine, Apr 29, 2017 at 17:16
- @SalahEddine perhaps you're looking for the `extend` function, for example `all_links.extend(links)`. – janos, Apr 29, 2017 at 17:18
- Please look at it now, I have solved some problems. – Salah Eddine, Apr 29, 2017 at 19:25
- codereview.stackexchange.com/questions/162160/… – Salah Eddine, Apr 30, 2017 at 5:24
Aside from the things others mentioned, you can also improve the way you locate the `img` elements whose `src` attribute ends with `.jpg`. Instead of using `findAll` and `if` conditions, you can do it in one go with a CSS selector:
for img in soup.select("img[src$=jpg]"):
    print(img["src"])
How about the following?
import random
import requests
from bs4 import BeautifulSoup


# got from http://stackoverflow.com/a/16696317
def download_file(url):
    local_filename = url.split('/')[-1]
    print("Downloading {} ---> {}".format(url, local_filename))
    # NOTE the stream=True parameter
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
    return local_filename


def Download_Image_from_Web(url):
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    for link in soup.findAll('img'):
        image_links = link.get('src')
        if not image_links.startswith('http'):
            image_links = url + '/' + image_links
        download_file(image_links)


Download_Image_from_Web("https://pixabay.com")
- It works pretty well, I feel bad about my code lol. – Salah Eddine, Apr 30, 2017 at 5:43
- I learned quite a lot from this code, thanks for sharing it. – Salah Eddine, Apr 30, 2017 at 5:53