cdp.is_element_present seems to be slow with large number of elements #3929
I'm seeing a slowdown when using cdp.is_element_present on a page with many elements. The delay seems to increase exponentially after a certain point, I'm guessing around 1,000 elements? I'm scraping a grocery website which can have many products. My script is working, but it is taking a long time to complete.
From the attached trace you can see that a category with 891 products takes under 2 minutes to scrape (circled in yellow). This is an acceptable speed. Most categories on this site have under 500 products, and they only take about 1 minute to scrape. However, when a page has 2,000+ products, it starts to get very slow. E.g. for a category with 2100 products, the scrape time balloons to nearly 21 minutes!
I have timed the load more clicks, as well as each category's scrape time. You can see that each click takes longer and longer. I have also scraped this site using another method (a browser extension); there is no delay when clicking Load More.
Any tips on improving the speed? This site uses Datadome, so CDP mode is needed.
Internet speed should not be a factor, since I am running this on a datacenter server.
This is my load_more function:
```python
import time

# max_clicks is defined elsewhere in my script (module-level setting)

def click_load_more_if_present(sb_in):
    load_more_button = "button:contains('LOAD MORE')"
    clicks_made = 0
    while True:
        if sb_in.cdp.is_element_present(load_more_button):
            start = time.perf_counter()
            if clicks_made >= max_clicks:
                print(f"Reached maximum clicks ({max_clicks}). Stopping.")
                break  # Exit the loop if max_clicks is reached
            sb_in.cdp.click(load_more_button)
            time_to_lmore_click = time.perf_counter() - start
            clicks_made += 1
            print(f"time_to_lmore_click: {time_to_lmore_click:.2f}")  # Formatted to 2 decimal places
            if time_to_lmore_click < 5.0:
                sb_in.sleep(4.25)
                time_after_delay = time.perf_counter() - start
                print(f"time_after_delay: {time_after_delay:.2f}")
        else:
            print("\nLoad More button not found. All content loaded or button disappeared.")
            sb_in.sleep(3.8)
            break
```
Using the 'TAG:contains("TEXT")' selector is not efficient, as it's not a real CSS Selector. It has to go through every element with that tag, and then check the text to see if it's a match. If you need the speed, you'll have to use a standard CSS Selector.
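As a minimal sketch of that change (the `data-testid` attribute and the `max_clicks` parameter below are placeholders; use whatever stable id, class, or attribute the real button exposes):

```python
# Sketch: the same load-more loop, but with a native CSS selector instead of
# the ':contains()' pseudo-selector. The attribute selector is hypothetical --
# inspect the real button and substitute its actual id/class/attribute.
import time

def click_load_more_if_present(sb_in, max_clicks=50):
    load_more_button = "button[data-testid='load-more']"  # hypothetical selector
    clicks_made = 0
    while sb_in.cdp.is_element_present(load_more_button):
        if clicks_made >= max_clicks:
            print(f"Reached maximum clicks ({max_clicks}). Stopping.")
            break
        start = time.perf_counter()
        sb_in.cdp.click(load_more_button)
        print(f"click took {time.perf_counter() - start:.2f}s")
        clicks_made += 1
        sb_in.sleep(4.25)  # give the next batch of products time to render
```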
Thanks Michael for the tip. I switched to plain CSS and a more direct selector. The running time has improved, but it is still a bit slow, e.g. a category that previously took 21 minutes now scrapes in about 12 minutes. The CSS selector is now something like this: `div#page-content > div > div > div > div > div > div:nth-child(3) > div > div > button`. I've also tried changing it to an XPath selector, but the performance is the same.
The scrape time is not too bad now, and I should be able to mitigate it with concurrency and other tricks.
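A rough sketch of the concurrency idea, assuming each category URL can be scraped independently: one worker process per category, each with its own `SB()` session with CDP Mode activated (needed here because of Datadome). `scrape_category()` and the example URLs are placeholders standing in for the existing per-category logic:

```python
# Sketch: run category scrapes in parallel, one SB() session per process.
from concurrent.futures import ProcessPoolExecutor
from seleniumbase import SB

def scrape_category(url):
    with SB(uc=True) as sb:
        sb.activate_cdp_mode(url)
        # ... run the load-more loop and collect product data here ...
        return url

if __name__ == "__main__":
    category_urls = [
        "https://example.com/category-a",  # placeholder URLs
        "https://example.com/category-b",
    ]
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = list(pool.map(scrape_category, category_urls))
```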