cdp.is_element_present seems to be slow with large number of elements #3929

Answered by mdmintz
LeeMeng2020 asked this question in Q&A

I'm seeing a slowdown when using cdp.is_element_present on a page with many elements. The delay seems to increase exponentially after a certain point, I'm guessing around 1,000 elements. I'm scraping a grocery website that can have many products. My script works, but it is taking a long time to complete.

From the attached trace you can see that a category with 891 products takes under 2 minutes to scrape (circled in yellow). This is an acceptable speed. Most categories on this site have under 500 products, and they only take about 1 minute to scrape. However, when a page has 2,000+ products, it starts to get very slow. E.g., for a category with 2,100 products, the scrape time balloons to nearly 21 minutes!

I have timed the Load More clicks, as well as each category's scrape time. You can see that each click takes longer and longer. I have also scraped this site using another method (a browser extension), and there is no delay when clicking Load More that way.

Any tips on improving the speed? This site uses Datadome, so CDP mode is needed. Internet speed should not be a factor, because I am running this on a datacenter server.

This is my load_more function:

```python
import time

def click_load_more_if_present(sb_in, max_clicks=100):
    # max_clicks was a module-level variable in the original script;
    # it is a parameter here so the function is self-contained.
    load_more_button = "button:contains('LOAD MORE')"
    clicks_made = 0
    while True:
        if sb_in.cdp.is_element_present(load_more_button):
            start = time.perf_counter()
            if clicks_made >= max_clicks:
                print(f"Reached maximum clicks ({max_clicks}). Stopping.")
                break  # Exit the loop once max_clicks is reached
            sb_in.cdp.click(load_more_button)
            time_to_lmore_click = time.perf_counter() - start
            clicks_made += 1
            print(f"time_to_lmore_click: {time_to_lmore_click:.2f}")
            if time_to_lmore_click < 5.0:
                sb_in.sleep(4.25)  # let the newly loaded products render
            time_after_delay = time.perf_counter() - start
            print(f"time_after_delay: {time_after_delay:.2f}")
        else:
            print("\nLoad More button not found. All content loaded or button disappeared.")
            sb_in.sleep(3.8)
            break
```
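
For context, a minimal sketch of how this helper is driven, assuming the script uses SeleniumBase's `SB` context manager with CDP mode activated (the URL is a placeholder):

```python
from seleniumbase import SB  # SeleniumBase's context-manager entry point

category_url = "https://example.com/category/snacks"  # placeholder URL

with SB(uc=True) as sb:  # undetected mode, needed for Datadome-protected sites
    sb.activate_cdp_mode(category_url)  # enables the sb.cdp.* methods used above
    click_load_more_if_present(sb)      # run the load-more loop
    # ... extract product data here ...
```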
[Attached image: timing trace showing per-click and per-category scrape times]

Using the 'TAG:contains("TEXT")' selector is not efficient, as it's not a real CSS Selector. It has to go through every element with that tag, and then find the text to see if it's a match. If you need the speed, you'll have to use a standard CSS Selector.
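
To make that concrete, here is a minimal sketch of the one-line change in `click_load_more_if_present`. The `data-testid` attribute is hypothetical; inspect the real LOAD MORE button in DevTools and use whatever stable class or attribute it actually exposes:

```python
# Before: ':contains()' is a SeleniumBase convenience, not real CSS. Matching
# it means scanning every <button> on the page and comparing its text, which
# gets slower as thousands of product elements accumulate.
load_more_button = "button:contains('LOAD MORE')"

# After: a standard CSS selector is resolved natively by the browser's
# selector engine in a single query, regardless of page size.
# (The attribute below is a hypothetical example.)
load_more_button = 'button[data-testid="load-more"]'
```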

LeeMeng2020 replied:
Thanks Michael for the tip. I switched to plain CSS and a more direct selector. The running time has improved, but it is still a bit slow, e.g., a category now scrapes in about 12 minutes, compared to 21 minutes previously. The CSS selector is now something like this: `div#page-content > div > div > div > div > div > div:nth-child(3) > div > div > button`. I've also tried changing it to an XPath selector, but the performance is the same.
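
For illustration, a trimmed selector along these lines would be shorter and less brittle, assuming the LOAD MORE button is the only `<button>` under `#page-content`:

```python
# Assumption: the LOAD MORE button is the only <button> inside #page-content.
# Anchoring on the stable id with a descendant selector avoids nine chained
# '>' hops, each of which breaks if the site's markup shifts by one <div>.
load_more_button = "div#page-content button"
```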

The scrape time is not too bad now, and I should be able to mitigate it with concurrency and other tricks.
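
As a rough sketch of the concurrency idea, each category could get its own browser process so scrapes run in parallel (`scrape_category` and the URL list are placeholders for the real logic):

```python
from concurrent.futures import ProcessPoolExecutor

from seleniumbase import SB

CATEGORY_URLS = [
    "https://example.com/category/snacks",  # placeholder URLs --
    "https://example.com/category/dairy",   # substitute real category pages
]

def scrape_category(url):
    # Each worker owns a separate browser, so Datadome sessions stay isolated.
    with SB(uc=True) as sb:
        sb.activate_cdp_mode(url)       # CDP mode, as this site requires
        click_load_more_if_present(sb)  # the load-more loop from the question
        # ... extract and save product data here ...

if __name__ == "__main__":
    # A few workers at a time; too many parallel browsers may trip bot defenses.
    with ProcessPoolExecutor(max_workers=3) as pool:
        list(pool.map(scrape_category, CATEGORY_URLS))
```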

Answer selected by mdmintz