I've created a script in Python that makes proxied requests by picking working proxies from a list scraped from a free proxy site. The bot traverses a few links to parse the URLs of the different posts on a website. However, the script uses a new working proxy every time it makes a new request, as there are multiple requests to make.
I've now reworked the logic so that the script first checks whether the existing proxy is still working before making a new request. If it is still a working proxy, the script sticks with it; otherwise it picks a random one from the list and carries on.
The logic to reuse the same working proxy across multiple requests (until it becomes invalid) is defined within the start_script() function.
The script has ended up looking awkward, and I suppose there is room for improvement to make it more concise and less verbose.

This is what I've created so far (a working version):
import random
import requests
from random import choice
from bs4 import BeautifulSoup
from urllib.parse import urljoin

test_url = 'https://stackoverflow.com/'  # It is for testing proxied requests
base_url = 'https://stackoverflow.com'
main_urls = ['https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50'.format(page) for page in range(2, 5)]

cbool = False
usable_proxy = None

def get_proxies():
    response = requests.get("https://www.sslproxies.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tr") if "yes" in item.text]
    return proxies

def get_random_proxy(proxy_vault):
    while proxy_vault:
        print("trying to get one----")
        random.shuffle(proxy_vault)
        proxy_url = proxy_vault.pop()
        proxy_dict = {
            'http': proxy_url,
            'https': proxy_url
        }
        try:
            res = requests.get(test_url, proxies=proxy_dict, timeout=10)
            res.raise_for_status()
            return proxy_url
        except:
            continue

def start_script(url):
    global cbool
    global usable_proxy
    if not cbool:
        proxy = get_proxies()
        random_proxy = get_random_proxy(proxy)
        if random_proxy:
            usable_proxy = {'https': f'http://{random_proxy}'}
            urls = make_requests(url, usable_proxy)
            cbool = True
            return urls
        else:
            return start_script(url)
    else:
        urls = make_requests(url, usable_proxy)
        if urls:
            return urls
        else:
            cbool = False

def make_requests(url, proxy):
    try:
        res = requests.get(url, proxies=proxy, timeout=10)
    except Exception:
        return start_script(url)
    print("proxy used in requests:", proxy)
    if res.status_code != 200:
        return start_script(url)
    soup = BeautifulSoup(res.text, "lxml")
    return [urljoin(base_url, item.get("href")) for item in soup.select(".summary .question-hyperlink")]

if __name__ == '__main__':
    for url in main_urls:
        print(start_script(url))
1 Answer
Do not recurse if an iterative solution is readily available. In Python this is particularly important: Python does not optimize tail recursion, and there is a serious chance of hitting the stack limit.
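To see the limit in action, here is a minimal illustration (added for clarity; it is not part of the original answer). CPython caps the interpreter stack, so a retry strategy that recurses once per failed request eventually raises RecursionError instead of retrying forever:

import sys

def retry(attempt=0):
    # Stand-in for a retry-by-recursion pattern such as
    # start_script() and make_requests() calling each other:
    # every failed attempt adds another stack frame.
    return retry(attempt + 1)

print(sys.getrecursionlimit())  # 1000 by default in CPython

try:
    retry()
except RecursionError as err:
    print("hit the stack limit:", err)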
For example, make_requests should look like:

while True:
    try:
        res = requests.get(url, proxies=proxy, timeout=10)
    except Exception:
        continue
    print("proxy used in requests:", proxy)
    if res.status_code != 200:
        continue
    soup = BeautifulSoup(res.text, "lxml")
    return [urljoin(base_url, item.get("href")) for item in soup.select(".summary .question-hyperlink")]
Similarly, start_script shall also be converted into a loop. As a side benefit, there would be no need for the very alarming usable_proxy and cbool globals.
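One possible shape for that loop (a sketch under my own assumptions, not code from the answer): it reuses the question's get_proxies() and get_random_proxy(), and it assumes a hypothetical single-attempt helper fetch_urls(url, proxy) that returns None on failure instead of recursing (one such helper is sketched after the next point):

def start_script(url, proxy=None):
    # Keep reusing `proxy` until a request through it fails,
    # then draw a replacement: no recursion, no global state.
    while True:
        if proxy is None:
            candidate = get_random_proxy(get_proxies())
            if candidate is None:
                continue  # scraped list exhausted; rescrape and retry
            proxy = {'https': f'http://{candidate}'}
        urls = fetch_urls(url, proxy)  # hypothetical helper, sketched below
        if urls is not None:
            return urls, proxy  # hand the still-working proxy back for reuse
        proxy = None  # the proxy went stale; pick a fresh one next pass

The caller threads the proxy through the loop, which preserves the reuse-the-working-proxy behaviour without the cbool and usable_proxy globals:

proxy = None
for url in main_urls:
    urls, proxy = start_script(url, proxy)
    print(urls)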
You shall not blindly retry on res.status_code != 200. Some status codes (e.g. the 400 family) guarantee that you will get the same error over and over again, resulting in an infinite loop. Ditto for exceptions.
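As a hedged sketch of the fetch_urls() helper mentioned above (my own illustration, reusing base_url and the imports from the question), the loop could retry only failures that can plausibly go away, give up immediately on 4xx responses, and cap the number of attempts:

from requests.exceptions import ConnectionError, Timeout

def fetch_urls(url, proxy, max_attempts=3):
    # Retry transient failures (connection errors, timeouts, 5xx) a bounded
    # number of times; treat 4xx as permanent and stop immediately.
    for _ in range(max_attempts):
        try:
            res = requests.get(url, proxies=proxy, timeout=10)
        except (ConnectionError, Timeout):
            continue  # transient network trouble: worth another attempt
        if 400 <= res.status_code < 500:
            return None  # client error: retrying cannot help
        if res.status_code != 200:
            continue  # 5xx and the like: may recover on retry
        soup = BeautifulSoup(res.text, "lxml")
        return [urljoin(base_url, item.get("href"))
                for item in soup.select(".summary .question-hyperlink")]
    return None  # attempts exhausted; let the caller swap the proxy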
Thanks for your invaluable suggestions @vnp. I'm not quite sure if you expected the start_script() function to be like this. If this is what you meant, then the function will definitely create new proxies to supply with each new request, whereas my intention is to use the same working proxy until it becomes invalid, the way my existing script works. Thanks again. – MITHU, Jun 20, 2019 at 19:00