Example "examples/cdp_mode/raw_xhr_async.py" but using proxies #3666
-
Hello,
I took the example "examples/cdp_mode/raw_xhr_async.py" and changed it to capture SCRIPT requests, which worked very well. Now I would like to use rotating proxies, but I couldn't find information on how to make that work. Ideally there would be a similar example that filters requests in CDP Mode while using a proxy, but I couldn't find one either. I also tried the CDP Mode examples that use the log, but that approach is very slow.
Here my code:
"""CDP.network.ResponseReceived with CDP.network.ResourceType.SCRIPT."""
import ast
import os
import random
import asyncio
import mycdp
import time
import re
import json
from seleniumbase import cdp_driver
from dotenv import load_dotenv
load_dotenv()
packet_requests = []
last_packet_request = None
last_request_type = None
PROXY_USER = os.getenv("PROXY_USER")
PROXY_PASS = os.getenv("PROXY_PASS")
PROXY_HOST = os.getenv("PROXY_HOST")
def get_random_proxy():
    port = random.randint(10001, 10050)
    return f"https://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{port}"
def listenAjaxRequests(page):
    async def handler(evt):
        # Get AJAX requests
        # mycdp.network.ResourceType:
        # XHR,FETCH,SCRIPT,IMAGE,FONT,STYLESHEET,PING,PREFLIGHT,DOCUMENT
        if evt.type_ is mycdp.network.ResourceType.SCRIPT:
            packet_requests.append([evt.response.url, evt.request_id])
            global last_packet_request
            global last_request_type
            last_packet_request = time.time()
            last_request_type = evt.type_
    page.add_handler(mycdp.network.ResponseReceived, handler)
async def receivePacket(page, requests):
    responses = []
    retries = 0
    max_retries = 5
    # Wait at least 2 seconds after last XHR request for more
    while True:
        if last_packet_request is None or retries > max_retries:
            break
        if time.time() - last_packet_request <= 2:
            retries = retries + 1
            time.sleep(2)
            continue
        else:
            break
    await page
    # Loop through gathered requests and get response body
    for request in requests:
        try:
            res = await page.send(mycdp.network.get_response_body(request[1]))
            if res is None:
                continue
            responses.append({
                "url": request[0],
                "body": res[0],
                "is_base64": res[1],
            })
        except Exception as e:
            print("Error getting response:", e)
    return responses
async def crawl():
    proxy = get_random_proxy()
    driver = await cdp_driver.start_async()
    tab = await driver.get("about:blank")
    listenAjaxRequests(tab)
    tab = await driver.get("https://pt.aliexpress.com/item/33015656888.html")
    time.sleep(5)
    for i in range(4):
        await tab.scroll_down(4)
        time.sleep(0.04)
    request_responses = await receivePacket(tab, packet_requests)
    for response in request_responses:
        if 'mtop.aliexpress.pdp.pc.query' in response["url"]:
            print("\n*** ==> Request URL <== ***")
            print(f'{response["url"]}')
            is_base64 = response["is_base64"]
            b64_data = "Base64 encoded data"
            try:
                headers = ast.literal_eval(response["body"])["headers"]
                print("*** ==> Response Headers <== ***")
                print(headers if not is_base64 else b64_data)
            except Exception:
                response_body = response["body"]
                print("*** ==> Response Body <== ***")
                match = re.search(r'mtopjsonp2\((\{.*\})\)', response_body if not is_base64 else b64_data)
                if match:
                    json_str = match.group(1)
                    json_data = json.loads(json_str)
                    print(json.dumps(json_data, indent=2))
                else:
                    print("Invalid JSONP response")
                    print(response_body if not is_base64 else b64_data)
if __name__ == "__main__":
    print("<============= START =============>")
    asyncio.run(crawl())
    print("<============== END ==============>")
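The JSONP-unwrapping step in the script above can be tried on its own, without a browser. This is only a minimal sketch: `unwrap_jsonp` is a hypothetical helper, the sample payload is made up, and it assumes the callback name may differ from `mtopjsonp2` and that the payload may span several lines (hence `re.DOTALL`).

```python
import json
import re

# Matches callback({...}) for any identifier-like callback name.
# re.DOTALL lets "." span newlines in multi-line payloads.
JSONP_RE = re.compile(
    r"^\s*[\w$.]+\s*\(\s*(\{.*\})\s*\)\s*;?\s*$", re.DOTALL
)


def unwrap_jsonp(text: str):
    """Return the decoded JSON object inside a JSONP wrapper, or None."""
    match = JSONP_RE.match(text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# Made-up sample payload, only for illustration.
sample = 'mtopjsonp2({"ret": ["SUCCESS"], "data": {"id": 1}})'
print(unwrap_jsonp(sample))
```

Returning None for both "no wrapper found" and "invalid JSON" keeps the calling code to a single check before printing.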
-
Set the proxy arg when calling cdp_driver.start_async():
Format: "host:port" or "user:pass@host:port"
(Don't include the URL protocol in your string.)
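As a minimal sketch of the format rule above: the helper below strips any URL scheme so that what remains is "host:port" or "user:pass@host:port". `normalize_proxy` and `launch` are hypothetical names (not part of SeleniumBase), and the host and credentials are placeholders. The only library call used is `cdp_driver.start_async(proxy=...)`, as described in this reply.

```python
import re


def normalize_proxy(proxy: str) -> str:
    """Drop a leading URL scheme such as "https://" (hypothetical helper)."""
    return re.sub(r"^[a-zA-Z][a-zA-Z0-9+.-]*://", "", proxy.strip())


async def launch(proxy: str):
    # Imported lazily so normalize_proxy() works without SeleniumBase.
    from seleniumbase import cdp_driver

    # Pass "host:port" or "user:pass@host:port", with no protocol prefix.
    return await cdp_driver.start_async(proxy=normalize_proxy(proxy))


# Placeholder credentials and host, only for illustration.
print(normalize_proxy("https://user:pass@proxy.example.com:10001"))
```

A string that already has no scheme passes through unchanged, so the helper can be applied to every proxy string before launch.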
-
First of all, thank you very much, Michael, for your amazing work. I really love your videos and everything you publish.
I think I tried that before, removing the https part. However, when I pass the proxy arg, the routine stops working and doesn't return any data.
...
def get_random_proxy():
    port = random.randint(10001, 10050)
    return f"{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{port}"
...
    proxy = get_random_proxy()
    driver = await cdp_driver.start_async(proxy=proxy)
...
It returns only:
<============= START =============>
<============== END ==============>
-
The proxy server address, credentials, and ports must be valid. You can't just pick any port number you want. There has to already be a running server at the address and port specified.
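One way to check the point above before launching the browser is to test whether anything accepts TCP connections on the chosen host and port. This is a rough sketch using only the standard library; `proxy_is_listening` and `pick_live_port` are hypothetical helpers. A successful connection only shows that a server is listening there; it does not prove that the server is a proxy or that the credentials are valid.

```python
import socket
from typing import Iterable, Optional


def proxy_is_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def pick_live_port(host: str, ports: Iterable[int]) -> Optional[int]:
    """Return the first port that accepts connections, else None."""
    for port in ports:
        if proxy_is_listening(host, port):
            return port
    return None
```

For the port range used in this thread, a call such as `pick_live_port(PROXY_HOST, range(10001, 10051))` would return the first reachable port, or None if no server is running on any of them.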
-
I adapted the code to show the proxy results:
...
async def crawl():
    proxy = get_random_proxy()
    driver = await cdp_driver.start_async(proxy=proxy)
    tab = await driver.get("https://ipinfo.io/json")
    html = await tab.get_content()
    soup = BeautifulSoup(html, "html.parser")
    pre_text = soup.find("pre").get_text()
    print(pre_text)
    tab = await driver.get("about:blank")
    listenAjaxRequests(tab)
...
Result (two runs):
<============= START =============>
{
  "ip": "189.68.23.18",
  "hostname": "189-68-23-18.dsl.telesp.net.br",
  "city": "São Paulo",
  "region": "São Paulo",
  "country": "BR",
  "loc": "-23.5475,-46.6361",
  "org": "AS27699 TELEFÔNICA BRASIL S.A",
  "postal": "01000-000",
  "timezone": "America/Sao_Paulo",
  "readme": "https://ipinfo.io/missingauth"
}
<============== END ==============>
<============= START =============>
{
  "ip": "200.110.205.239",
  "hostname": "siconect.com.br",
  "city": "Vitória",
  "region": "Espírito Santo",
  "country": "BR",
  "loc": "-20.3194,-40.3378",
  "org": "AS270253 SICONECT TELECOMUNICACOES EIRELI",
  "postal": "29000-000",
  "timezone": "America/Sao_Paulo",
  "readme": "https://ipinfo.io/missingauth"
}
<============== END ==============>
-
Any help with this? Instead of using the driver, is there any way to use CDP Mode? I tried using the log, but it is pretty different from this example.