Sticking to a working proxy generated by a rotating proxy script
I've created a script in Python that makes proxied requests by picking working proxies from a list scraped from a free proxy site. The bot traverses a few pages to parse the URLs of the different posts on a website. However, the script currently uses a new working proxy every time it makes a new request, as there are multiple requests to make.
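For reference, each proxied call just passes the chosen proxy through the proxies argument of requests.get(), roughly like this (the address below is only a placeholder, not a real proxy):

import requests

# One proxied request: the same placeholder address is used for both schemes
proxy = {'http': '203.0.113.10:8080', 'https': '203.0.113.10:8080'}
res = requests.get('https://stackoverflow.com/', proxies=proxy, timeout=10)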
I've now reworked the logic within my script so that, before making a new request, it first checks whether the existing proxy is still working. If it is, the script sticks to it; otherwise it picks a random one from the list and carries on.
The logic to reuse the same working proxy across multiple requests (until it becomes invalid) is defined within the start_script() function.
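In rough outline, this is the flow I'm aiming for (a simplified sketch of the real functions below, not the exact code):

def start_script(url):
    global usable_proxy, cbool
    if not cbool:
        # no validated proxy yet: scrape a list and test until one works
        usable_proxy = {'https': f'http://{get_random_proxy(get_proxies())}'}
        cbool = True
    urls = make_requests(url, usable_proxy)  # reuse the same proxy
    if not urls:
        # the proxy stopped working: forget it and start over
        cbool = False
        return start_script(url)
    return urls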
The script has ended up looking a bit awkward, though. I suppose there is room for improvement to make it more concise and less verbose.
This is what I've created so far (a working one):
import random
import requests
from random import choice
from bs4 import BeautifulSoup
from urllib.parse import urljoin
test_url = 'https://stackoverflow.com/' #It is for testing proxied requests
base_url = 'https://stackoverflow.com'
main_urls = ['https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&page={}&pagesize=50'.format(page) for page in range(2,5)]
cbool = False
usable_proxy = None
def get_proxies():
    response = requests.get("https://www.sslproxies.org/")
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tr") if "yes" in item.text]
    return proxies
def get_random_proxy(proxy_vault):
    while proxy_vault:
        print("trying to get one----")
        random.shuffle(proxy_vault)
        proxy_url = proxy_vault.pop()
        proxy_dict = {
            'http': proxy_url,
            'https': proxy_url
        }
        try:
            res = requests.get(test_url, proxies=proxy_dict, timeout=10)
            res.raise_for_status()
            return proxy_url
        except:
            continue
def start_script(url):
    global cbool
    global usable_proxy
    if not cbool:
        # no validated proxy yet: scrape a fresh list and test until one works
        proxy = get_proxies()
        random_proxy = get_random_proxy(proxy)
        if random_proxy:
            usable_proxy = {'https': f'http://{random_proxy}'}
            urls = make_requests(url, usable_proxy)
            cbool = True
            return urls
        else:
            return start_script(url)
    else:
        # a proxy is already validated: keep reusing it until a request fails
        urls = make_requests(url, usable_proxy)
        if urls:
            return urls
        else:
            cbool = False
def make_requests(url, proxy):
    try:
        res = requests.get(url, proxies=proxy, timeout=10)
    except Exception:
        return start_script(url)
    print("proxy used in requests:", proxy)
    if res.status_code != 200:
        return start_script(url)
    soup = BeautifulSoup(res.text, "lxml")
    return [urljoin(base_url, item.get("href")) for item in soup.select(".summary .question-hyperlink")]
if __name__ == '__main__':
    for url in main_urls:
        print(start_script(url))