Sticking to a working proxy generated by a rotating proxy script
I've created a script in Python that makes proxied requests by picking working proxies from a list scraped from a free proxy site. The bot traverses a few pages to parse the URLs of the different posts on a website. However, the script currently uses a new working proxy every time it makes a new request, as there are multiple requests to make.
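For reference, each proxied call just passes the chosen proxy through the proxies argument of requests.get(), roughly like this (the address below is only a placeholder, not a real proxy):

import requests

# One proxied request: the same placeholder address is used for both schemes
proxy = {'http': '203.0.113.10:8080', 'https': '203.0.113.10:8080'}
res = requests.get('https://stackoverflow.com/', proxies=proxy, timeout=10)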
I've now reworked the logic within my script so that, before making a new request, it first checks whether the existing proxy is still working. If it is, the script sticks to it; otherwise it picks a random one from the list and carries on.
The logic to reuse the same working proxy across multiple requests (until it becomes invalid) is defined within the start_script() function.
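In rough outline, this is the flow I'm aiming for (a simplified sketch of the real functions below, not the exact code):

def start_script(url):
    global usable_proxy, cbool
    if not cbool:
        # no validated proxy yet: scrape a list and test until one works
        usable_proxy = {'https': f'http://{get_random_proxy(get_proxies())}'}
        cbool = True
    urls = make_requests(url, usable_proxy)  # reuse the same proxy
    if not urls:
        # the proxy stopped working: forget it and start over
        cbool = False
        return start_script(url)
    return urls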
The script has ended up looking a bit awkward, though. I suppose there is room for improvement to make it more concise and less verbose.
This is what I've created so far (a working one):
import random
import requests
from random import choice
from bs4 import BeautifulSoup
from urllib.parse import urljoin
test_url = 'https://stackoverflow.com/' #It is for testing proxied requests
base_url = 'https://stackoverflow.com'
main_urls = ['https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50'.format(page) for page in range(2,5)]
cbool = False
usable_proxy = None
def get_proxies():
    response = requests.get("https://www.sslproxies.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tr") if "yes" in item.text]
    return proxies
def get_random_proxy(proxy_vault):
    while proxy_vault:
        print("trying to get one----")
        random.shuffle(proxy_vault)
        proxy_url = proxy_vault.pop()
        proxy_dict = {
            'http': proxy_url,
            'https': proxy_url
        }
        try:
            res = requests.get(test_url, proxies=proxy_dict, timeout=10)
            res.raise_for_status()
            return proxy_url
        except:
            continue
def start_script(url):
    global cbool
    global usable_proxy
    if not cbool:
        # no validated proxy yet: scrape a fresh list and test until one works
        proxy = get_proxies()
        random_proxy = get_random_proxy(proxy)
        if random_proxy:
            usable_proxy = {'https': f'http://{random_proxy}'}
            urls = make_requests(url, usable_proxy)
            cbool = True
            return urls
        else:
            return start_script(url)
    else:
        # a proxy is already validated: keep reusing it until a request fails
        urls = make_requests(url, usable_proxy)
        if urls:
            return urls
        else:
            cbool = False
def make_requests(url, proxy):
    try:
        res = requests.get(url, proxies=proxy, timeout=10)
    except Exception:
        return start_script(url)
    print("proxy used in requests:", proxy)
    if res.status_code != 200:
        return start_script(url)
    soup = BeautifulSoup(res.text, "lxml")
    return [urljoin(base_url, item.get("href")) for item in soup.select(".summary .question-hyperlink")]
if __name__ == '__main__':
    for url in main_urls:
        print(start_script(url))