I've created a script in Python that makes proxied requests by picking working proxies from a list scraped from a free proxy site. The bot traverses a few links to parse the URLs of the different posts on a website. However, the script uses a new working proxy every time it makes a new request, as there are multiple requests to make.
I've now reworked the logic so that the script first checks whether the existing proxy is still working before making a new request. If it is still a working proxy, the script sticks with it; otherwise it picks a random one from the list and carries on.
The logic to reuse the same working proxy across multiple requests (until it becomes invalid) is defined within the start_script() function.
The script has ended up looking awkward, and I suppose there is room for improvement to make it more concise and less verbose.

This is what I've created so far (a working version):
import random
import requests
from random import choice
from bs4 import BeautifulSoup
from urllib.parse import urljoin

test_url = 'https://stackoverflow.com/'  # It is for testing proxied requests
base_url = 'https://stackoverflow.com'
main_urls = ['https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50'.format(page) for page in range(2, 5)]

cbool = False
usable_proxy = None

def get_proxies():
    response = requests.get("https://www.sslproxies.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tr") if "yes" in item.text]
    return proxies

def get_random_proxy(proxy_vault):
    while proxy_vault:
        print("trying to get one----")
        random.shuffle(proxy_vault)
        proxy_url = proxy_vault.pop()
        proxy_dict = {
            'http': proxy_url,
            'https': proxy_url
        }
        try:
            res = requests.get(test_url, proxies=proxy_dict, timeout=10)
            res.raise_for_status()
            return proxy_url
        except:
            continue

def start_script(url):
    global cbool
    global usable_proxy
    if not cbool:
        proxy = get_proxies()
        random_proxy = get_random_proxy(proxy)
        if random_proxy:
            usable_proxy = {'https': f'http://{random_proxy}'}
            urls = make_requests(url, usable_proxy)
            cbool = True
            return urls
        else:
            return start_script(url)
    else:
        urls = make_requests(url, usable_proxy)
        if urls:
            return urls
        else:
            cbool = False

def make_requests(url, proxy):
    try:
        res = requests.get(url, proxies=proxy, timeout=10)
    except Exception:
        return start_script(url)
    print("proxy used in requests:", proxy)
    if res.status_code != 200:
        return start_script(url)
    soup = BeautifulSoup(res.text, "lxml")
    return [urljoin(base_url, item.get("href")) for item in soup.select(".summary .question-hyperlink")]

if __name__ == '__main__':
    for url in main_urls:
        print(start_script(url))
1 Answer
Do not recurse if an iterative solution is readily available. In Python this is particularly important: Python does not optimize tail recursion, and there is a serious chance of hitting the stack limit.
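To see the limit in action, here is a minimal illustration (added for clarity; it is not part of the original answer). CPython caps the interpreter stack, so a retry strategy that recurses once per failed request eventually raises RecursionError instead of retrying forever:

import sys

def retry(attempt=0):
    # Stand-in for a retry-by-recursion pattern such as
    # start_script() and make_requests() calling each other:
    # every failed attempt adds another stack frame.
    return retry(attempt + 1)

print(sys.getrecursionlimit())  # 1000 by default in CPython

try:
    retry()
except RecursionError as err:
    print("hit the stack limit:", err)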
For example, make_requests should look like:

while True:
    try:
        res = requests.get(url, proxies=proxy, timeout=10)
    except Exception:
        continue
    print("proxy used in requests:", proxy)
    if res.status_code != 200:
        continue
    soup = BeautifulSoup(res.text, "lxml")
    return [urljoin(base_url, item.get("href")) for item in soup.select(".summary .question-hyperlink")]
Similarly, start_script shall also be converted into a loop. As a side benefit, there would be no need for the very alarming usable_proxy and cbool globals.
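One possible shape for that loop (a sketch under my own assumptions, not code from the answer): it reuses the question's get_proxies() and get_random_proxy(), and it assumes a hypothetical single-attempt helper fetch_urls(url, proxy) that returns None on failure instead of recursing (one such helper is sketched after the next point):

def start_script(url, proxy=None):
    # Keep reusing `proxy` until a request through it fails,
    # then draw a replacement: no recursion, no global state.
    while True:
        if proxy is None:
            candidate = get_random_proxy(get_proxies())
            if candidate is None:
                continue  # scraped list exhausted; rescrape and retry
            proxy = {'https': f'http://{candidate}'}
        urls = fetch_urls(url, proxy)  # hypothetical helper, sketched below
        if urls is not None:
            return urls, proxy  # hand the still-working proxy back for reuse
        proxy = None  # the proxy went stale; pick a fresh one next pass

The caller threads the proxy through the loop, which preserves the reuse-the-working-proxy behaviour without the cbool and usable_proxy globals:

proxy = None
for url in main_urls:
    urls, proxy = start_script(url, proxy)
    print(urls)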
You shall not blindly retry on res.status_code != 200. Some status codes (e.g. the 400 family) guarantee that you will get the same error over and over again, resulting in an infinite loop. Ditto for exceptions.
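As a hedged sketch of the fetch_urls() helper mentioned above (my own illustration, reusing base_url and the imports from the question), the loop could retry only failures that can plausibly go away, give up immediately on 4xx responses, and cap the number of attempts:

from requests.exceptions import ConnectionError, Timeout

def fetch_urls(url, proxy, max_attempts=3):
    # Retry transient failures (connection errors, timeouts, 5xx) a bounded
    # number of times; treat 4xx as permanent and stop immediately.
    for _ in range(max_attempts):
        try:
            res = requests.get(url, proxies=proxy, timeout=10)
        except (ConnectionError, Timeout):
            continue  # transient network trouble: worth another attempt
        if 400 <= res.status_code < 500:
            return None  # client error: retrying cannot help
        if res.status_code != 200:
            continue  # 5xx and the like: may recover on retry
        soup = BeautifulSoup(res.text, "lxml")
        return [urljoin(base_url, item.get("href"))
                for item in soup.select(".summary .question-hyperlink")]
    return None  # attempts exhausted; let the caller swap the proxy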
Thanks for your invaluable suggestions @vnp. I'm not quite sure if you expected the start_script() function to be like this. If this is what you meant, then the function will definitely create new proxies to supply with each new request, whereas my intention is to use the same working proxy until it becomes invalid, the way my existing script works. Thanks again. – MITHU, Jun 20, 2019 at 19:00