Example "examples/cdp_mode/raw_xhr_async.py" but using proxies #3666

andredourado started this conversation in General

Hello,

I took the example "examples/cdp_mode/raw_xhr_async.py" and changed it to work with SCRIPT requests. It worked very well. Now I would like to use rotating proxies, but I couldn't find information on how to make that work. Ideally there would be a similar example where I can filter requests using CDP mode, but I couldn't find one either. I tried the CDP mode examples that use the log, but that approach is very slow.

Here is my code:

"""CDP.network.ResponseReceived with CDP.network.ResourceType.SCRIPT."""
import ast
import os
import random
import asyncio
import mycdp
import time
import re
import json
from seleniumbase import cdp_driver
from dotenv import load_dotenv

load_dotenv()
packet_requests = []
last_packet_request = None
last_request_type = None
PROXY_USER = os.getenv("PROXY_USER")
PROXY_PASS = os.getenv("PROXY_PASS")
PROXY_HOST = os.getenv("PROXY_HOST")


def get_random_proxy():
    port = random.randint(10001, 10050)
    return f"https://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{port}"


def listenAjaxRequests(page):
    async def handler(evt):
        # Get AJAX requests
        # mycdp.network.ResourceType:
        # XHR,FETCH,SCRIPT,IMAGE,FONT,STYLESHEET,PING,PREFLIGHT,DOCUMENT
        if evt.type_ is mycdp.network.ResourceType.SCRIPT:
            packet_requests.append([evt.response.url, evt.request_id])
            global last_packet_request
            global last_request_type
            last_packet_request = time.time()
            last_request_type = evt.type_

    page.add_handler(mycdp.network.ResponseReceived, handler)


async def receivePacket(page, requests):
    responses = []
    retries = 0
    max_retries = 5
    # Wait at least 2 seconds after last XHR request for more
    while True:
        if last_packet_request is None or retries > max_retries:
            break
        if time.time() - last_packet_request <= 2:
            retries = retries + 1
            time.sleep(2)
            continue
        else:
            break
    await page
    # Loop through gathered requests and get response body
    for request in requests:
        try:
            res = await page.send(mycdp.network.get_response_body(request[1]))
            if res is None:
                continue
            responses.append({
                "url": request[0],
                "body": res[0],
                "is_base64": res[1],
            })
        except Exception as e:
            print("Error getting response:", e)
    return responses


async def crawl():
    proxy = get_random_proxy()
    driver = await cdp_driver.start_async()
    tab = await driver.get("about:blank")
    listenAjaxRequests(tab)
    tab = await driver.get("https://pt.aliexpress.com/item/33015656888.html")
    time.sleep(5)
    for i in range(4):
        await tab.scroll_down(4)
        time.sleep(0.04)
    request_responses = await receivePacket(tab, packet_requests)
    for response in request_responses:
        if 'mtop.aliexpress.pdp.pc.query' in response["url"]:
            print("\n*** ==> Request URL <== ***")
            print(f'{response["url"]}')
            is_base64 = response["is_base64"]
            b64_data = "Base64 encoded data"
            try:
                headers = ast.literal_eval(response["body"])["headers"]
                print("*** ==> Response Headers <== ***")
                print(headers if not is_base64 else b64_data)
            except Exception:
                response_body = response["body"]
                print("*** ==> Response Body <== ***")
                match = re.search(r'mtopjsonp2\((\{.*\})\)', response_body if not is_base64 else b64_data)
                if match:
                    json_str = match.group(1)
                    json_data = json.loads(json_str)
                    print(json.dumps(json_data, indent=2))
                else:
                    print("Invalid JSONP response")
                    print(response_body if not is_base64 else b64_data)


if __name__ == "__main__":
    print("<============= START =============>")
    asyncio.run(crawl())
    print("<============== END ==============>")
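A side note on the wait loop in receivePacket: it calls time.sleep() inside a coroutine, which blocks the event loop while it sleeps, so the response handler may not get a chance to run during that time. Below is a minimal sketch of a non-blocking variant using only the standard library. The name wait_for_quiet and its parameters are made up for illustration and are not part of SeleniumBase.

```python
import asyncio
import time


async def wait_for_quiet(get_last_event_time, quiet_seconds=2.0,
                         poll_seconds=0.1, max_wait=12.0):
    """Return True once no event was seen for quiet_seconds.

    get_last_event_time is a hypothetical callable that returns the
    time.time() stamp of the most recent event, or None if there was none.
    Returns False if max_wait elapses before the traffic goes quiet.
    """
    deadline = time.time() + max_wait
    while time.time() < deadline:
        last = get_last_event_time()
        if last is None or time.time() - last > quiet_seconds:
            return True  # nothing recorded, or the quiet period has passed
        # Yield to the event loop so registered handlers can keep running.
        await asyncio.sleep(poll_seconds)
    return False
```

In the script above this could be called as `await wait_for_quiet(lambda: last_packet_request)` in place of the while loop. The time.sleep() calls in crawl() could likewise become `await asyncio.sleep(...)`.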

Replies: 1 comment 4 replies


Set the proxy arg when calling cdp_driver.start_async():

proxy: Optional[str] = None, # "host:port" or "user:pass@host:port"

Format: "host:port" or "user:pass@host:port"

(Don't include the URL Protocol in your string.)
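As a rough illustration of that format, the sketch below builds the proxy string without a URL protocol and passes it to start_async(). The helper names to_proxy_arg and start_with_proxy, and the example host, are hypothetical. Only the proxy argument itself comes from the signature quoted above.

```python
def to_proxy_arg(host, port, user=None, password=None):
    """Build "host:port" or "user:pass@host:port" with no URL protocol."""
    # Drop a protocol prefix such as "https://" if one was included in the host.
    if "://" in host:
        host = host.split("://", 1)[1]
    host = host.strip("/")
    if user and password:
        return f"{user}:{password}@{host}:{port}"
    return f"{host}:{port}"


async def start_with_proxy(proxy):
    # Imported here so the helper above works without SeleniumBase installed.
    from seleniumbase import cdp_driver
    return await cdp_driver.start_async(proxy=proxy)
```

For example, `to_proxy_arg("https://proxy.example.com", 10001, "user", "pass")` gives `"user:pass@proxy.example.com:10001"`.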


First of all thank you very much Michael for your amazing work. I really love your videos and all stuff you publish.

I think I tried that before, removing the https part. In any case, when I pass the proxy argument, the routine stops working. It doesn't return any data.

...
def get_random_proxy():
    port = random.randint(10001, 10050)
    return f"{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{port}"
...
    proxy = get_random_proxy()
    driver = await cdp_driver.start_async(proxy=proxy)
...

It returns:

<============= START =============>
<============== END ==============>

The proxy server address, credentials, and ports must be valid. You can't just pick any port number you want. There has to already be a running server at the address and port specified.
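One quick way to check the "running server" part before launching the browser is a plain TCP connection test. This is a sketch using only the standard library, and the function name proxy_port_is_open is made up for illustration. It only shows that something is listening at that address and port. It does not show that the credentials are valid or that the server is actually a proxy.

```python
import socket


def proxy_port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False
```

A randomly chosen port could be checked this way and skipped if nothing answers.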


I adapted the code to show the proxy results:

...
async def crawl():
    proxy = get_random_proxy()
    driver = await cdp_driver.start_async(proxy=proxy)

    tab = await driver.get("https://ipinfo.io/json")
    html = await tab.get_content()

    soup = BeautifulSoup(html, "html.parser")
    pre_text = soup.find("pre").get_text()
    print(pre_text)
    tab = await driver.get("about:blank")
    listenAjaxRequests(tab)
...

Result:

<============= START =============>
{
 "ip": "189.68.23.18",
 "hostname": "189-68-23-18.dsl.telesp.net.br",
 "city": "São Paulo",
 "region": "São Paulo",
 "country": "BR",
 "loc": "-23.5475,-46.6361",
 "org": "AS27699 TELEFÔNICA BRASIL S.A",
 "postal": "01000-000",
 "timezone": "America/Sao_Paulo",
 "readme": "https://ipinfo.io/missingauth"
}
<============== END ==============>
<============= START =============>
{
 "ip": "200.110.205.239",
 "hostname": "siconect.com.br",
 "city": "Vitória",
 "region": "Espírito Santo",
 "country": "BR",
 "loc": "-20.3194,-40.3378",
 "org": "AS270253 SICONECT TELECOMUNICACOES EIRELI",
 "postal": "29000-000",
 "timezone": "America/Sao_Paulo",
 "readme": "https://ipinfo.io/missingauth"
}
<============== END ==============>

Any help with this? Instead of using the driver, is there any way to use CDP mode? I tried the log-based approach, but it is quite different from this example.
