So I do something very similar here: http://www.steelrankings.com/
I too ran into the HTTP 429 issue. Cloudflare is checking IP addresses, so I ended up using zenrows.com. I process 17,000 USPSA numbers for Steel Challenge classification rankings 4x per week. I use 25 concurrent threads and it averages about .07 per request. Happy to discuss further. I did take a look at your data files and am happy to help if you want to rebuild them on demand.
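For anyone curious what the 25-thread fan-out could look like, here is a minimal sketch using only the standard library. It is not the actual steelrankings code: `process_all` and the `fetch_one` callable are hypothetical names, and `fetch_one` stands in for whatever function fetches and parses a single member number.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(member_numbers, fetch_one, max_workers=25):
    """Run fetch_one(member_number) across a pool of worker threads.

    fetch_one is a hypothetical callable supplied by the caller.
    Returns (results, failures), both dicts keyed by member number.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted future back to the member number it belongs to.
        futures = {pool.submit(fetch_one, n): n for n in member_numbers}
        for future in as_completed(futures):
            number = futures[future]
            try:
                results[number] = future.result()
            except Exception as exc:  # one failed lookup should not stop the batch
                failures[number] = exc
    return results, failures
```

Collecting failures separately makes it easy to re-queue only the numbers that hit a 429 or a timeout instead of rerunning the whole batch.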
-
Hey, this is awesome! I didn't even think of checking whether there's a SaaS for scraping around rate limits.
I just use the mobile API, which doesn't have rate limiting (well, it didn't back in January).
Seeing how similar the USPSA and SCSA mobile apps are, maybe you can use their mobile API too and save money on the scraper (I don't know if it costs you anything). Let me know if you need help with that.
The link you posted didn't open for me for some reason.
-
Could you paste a sample mobile API link? I wish they had a Swagger doc, but from what I see they (USPSA / SCSA) don't really have genuine APIs (maybe you can prove me wrong).
Fat finger on my part, I typed the name of my own site wrong!
http://www.steelrankings.com/
-
I used a sniffer on the iOS app to see what it's hitting. Then I just reused the same request from the browser's snippets tab using this:
https://github.com/CodeHowlerMonkey/hitfactorlol/blob/main/scripts/uspsaScript.js
All HHFs come from a single endpoint, so I just took that thing right off the sniffer app (it can save files).
For classifications and classifiers, the API URLs are these:
https://api.uspsa.org/api/app/classification/A100099
https://api.uspsa.org/api/app/classifiers/A100099
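As a rough sketch of how those two URLs could be generated for a list of member numbers, something like the following might work. The helper name `mobile_api_urls` is made up for illustration, and the block only builds URLs: whatever credentials or headers the app sends have to be taken from your own sniffer capture and are not shown here.

```python
from urllib.parse import quote

# Base path taken from the two example URLs above.
API_BASE = "https://api.uspsa.org/api/app"


def mobile_api_urls(member_number):
    """Return the classification and classifiers URLs for one member number."""
    member = quote(member_number.strip(), safe="")  # escape anything unexpected
    return {
        "classification": f"{API_BASE}/classification/{member}",
        "classifiers": f"{API_BASE}/classifiers/{member}",
    }


print(mobile_api_urls("A100099"))
```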
-
OK, this is getting good!
This is the page that is rate limited on SCSA (likely USPSA also):
https://scsa.org/classification/FY105260
https://uspsa.org/classification/fy105260
So, I'm scraping via Python URL requests and then parsing the DOM with BeautifulSoup. I run this on either macOS or AWS Linux, depending on whether I'm busy with my machine. It seems like you are running something close to a Selenium approach in your browser. I'll need to play with the sniffer to track the endpoints. I'm using HTTP GET without an API key (I'm familiar with how to use them), but I don't see how I would get an API key to consume those mobile API endpoints. Could you explain whether there is an API I call to create my token?
Here is my sample HTTP request code (I feed it 17k URLs, which I retrieve from AWS RDS):
import time

import requests
from bs4 import BeautifulSoup
from zenrows import ZenRowsClient

client_key = "xxx"
client = ZenRowsClient(client_key)

# get_expiration_date(soup) is a parsing helper defined elsewhere in the script.

def make_request_with_retry(url, retries=9, backoff_factor=1.9, timeout=50):
    # Note: timeout is accepted but not yet passed to client.get().
    for attempt in range(1, retries + 1):
        print(f"URL Request Attempt {attempt}/{retries} for {url}")
        request_start_time = time.time()  # Start timing the request
        try:
            # params = {"premium_proxy": "true"}
            # response = client.get(url, params=params)
            response = client.get(url)
            elapsed_time = time.time() - request_start_time  # Calculate elapsed time
            print(f"Elapsed Time for {url}: {elapsed_time:.2f} seconds")
            if response.status_code == 200:
                soup = BeautifulSoup(response.text, 'html.parser')
                result = get_expiration_date(soup)
                print(f"\t\n{url} expiration date {result}")
                return result  # Return the parsed value instead of None
            print(f"Request for {url} returned a non-success status code: {response.status_code}")
        except requests.exceptions.RequestException as e:
            print(f"Request for {url} failed: {e}")
        except Exception as e:
            print(f"An unexpected error occurred for {url}: {e}")
        if attempt < retries:
            # Exponential backoff before the next attempt, for any kind of failure
            time.sleep(backoff_factor * (2 ** (attempt - 1)))
    print(f"Maximum retry attempts reached for {url}. Request failed.")
    return None
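
To get a feel for what the backoff formula in that snippet does with the default arguments, here is a small standalone sketch that just tabulates the delays. The function name `backoff_delays` is made up for illustration; the formula and the defaults (`retries=9`, `backoff_factor=1.9`) are the ones from the code above.

```python
def backoff_delays(retries=9, backoff_factor=1.9):
    """Delay in seconds that the formula yields after each attempt."""
    return [backoff_factor * (2 ** (attempt - 1)) for attempt in range(1, retries + 1)]


delays = backoff_delays()
print([round(d, 1) for d in delays])
# [1.9, 3.8, 7.6, 15.2, 30.4, 60.8, 121.6, 243.2, 486.4]
```

The later delays grow quickly (the eighth is about four minutes), so with 17k URLs it may be worth capping the delay or lowering the retry count.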
-
I took the API key from the sniffer. For iOS I used this app: https://apps.apple.com/us/app/storm-sniffer-packet-capture/id1610958307
It comes with instructions on how to install the MITM certificate for sniffing HTTPS traffic.
-
Hey, it looks like you are using ZenRows. How do you like it? Have you taken any different directions as a result of SaaS scraping?
-
Congrats on the launch of your site. Your UI work is outstanding. I have been trying various scraping methods with and without ZenRows.