Scrape an infinite-scroll page

Question 1

My algorithm scrapes an infinite-scroll page, but it takes too long. It scrolls three times, and I'm wondering whether there is something like a ScrollBottom() so that the repeated code isn't needed.

Regarding the site from the example: scrolling is handled by jQuery ScrollExtend (goo.gl/Sq4vVx), which is triggered when the user scrolls beyond a particular tag. When that happens, a particular class is added to the tag and removed after the pagination is done.

I think there's room for improvement, both in the code and in performance.

"use strict";
var Xray = require('x-ray');
var phantom = require('x-ray-phantom');
var phantom_opts = {
    webSecurity: false,
    images: false,
    weak: false
};
var x = Xray().driver(phantom(phantom_opts, function (nightmare, done) {
    done
        .useragent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36")
        .goto(nightmare.request.req.url)
        .scrollTo(4000, 0)
        .wait()
        .scrollTo(8000, 0)
        .wait()
        .scrollTo(12000, 0)
        .wait()
}));
x('https://www.compraonline.grupoeroski.com/es/supermercado/2059698-Alimentos-Frescos/2059746-Carnes-y-aves/2059753-Pollo/', '.product_list li',
    [{
        name: '.description_1',
        unitPrice: '.description_2',
        image: '.image_line img@src',
        price: '.product_price_cont p',
        url: '.image_line a@href',
        volumen: '.description_1',
        medida: '.description_1'
    }])(function (err, products) {
        if (err) console.log(err);
        console.log(products.length);
        process.exit(0);
    });

Question 2

If the page is infinite scroll, where do you suppose scrollBottom() goes to?

Question 3

The ideal scenario would replace all those '.wait().scrollTo()' calls. The solution will probably involve some sort of calculation based on the page height.

Question 4

The initial page likely won't be 4000 pixels high. scrollTo(4000, 0) will only go as far as the bottom, and so will the calls with 8000 and 12000. All three will have been called, and the page might load more content once in this time, maybe twice if you are lucky. This approach will never work.
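To make the clamping concrete, here is a small sketch that simulates the three calls. It assumes the usual browser behaviour that a scroll offset is clamped to the page height minus the viewport height. The 3000 px page and 800 px viewport are made-up numbers, and clampScroll is a hypothetical helper, not part of Nightmare.

```javascript
"use strict";

// Hypothetical helper: the offset a browser ends up at when asked
// to scroll to `requested` on a page of the given size.
function clampScroll(requested, pageHeight, viewportHeight) {
    const maxOffset = Math.max(0, pageHeight - viewportHeight);
    return Math.min(Math.max(0, requested), maxOffset);
}

// Made-up dimensions, for illustration only.
const pageHeight = 3000;
const viewportHeight = 800;

// The three scroll targets from the question, with no new content loaded in between.
const offsets = [4000, 8000, 12000].map(function (y) {
    return clampScroll(y, pageHeight, viewportHeight);
});
console.log(offsets); // [ 2200, 2200, 2200 ]
```

Unless new content arrives between the calls, all three requests land on the same offset, so the second and third add nothing.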

Question 5

It would be much easier to replicate the AJAX call to get the results. If you scroll down and check the network requests, you will see that it's always the same URL with a pageNumber parameter being sent.
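A minimal sketch of that idea, assuming the endpoint returns an array of items per page, that pages start at 1, and that a page past the end comes back empty. The fetchPage function, the example.com URL, and the item shape are placeholders; only the pageNumber parameter name comes from the observation above.

```javascript
"use strict";

// Build the URL for one page. `baseUrl` is a placeholder, not the site's real endpoint.
function pageUrl(baseUrl, pageNumber) {
    const url = new URL(baseUrl);
    url.searchParams.set("pageNumber", String(pageNumber));
    return url.toString();
}

// Collect items page by page until a page comes back empty
// (the assumed end-of-list signal) or `maxPages` is reached.
// `fetchPage(url)` is injected so that any HTTP client can be plugged in.
async function fetchAllPages(baseUrl, fetchPage, maxPages = 50) {
    const items = [];
    for (let page = 1; page <= maxPages; page++) {
        const batch = await fetchPage(pageUrl(baseUrl, page));
        if (!batch || batch.length === 0) break;
        items.push(...batch);
    }
    return items;
}

// Stand-in for the network: three pages of fake products, then nothing.
const fakePages = { 1: ["a", "b"], 2: ["c", "d"], 3: ["e"] };
async function fakeFetchPage(url) {
    const n = Number(new URL(url).searchParams.get("pageNumber"));
    return fakePages[n] || [];
}

fetchAllPages("https://example.com/products", fakeFetchPage)
    .then(function (items) { console.log(items.length); }); // 5
```

With a real site, fakeFetchPage would be replaced by a function that performs the request and extracts the product list from the response.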

Question 6

I would want to understand how the infinite scroll is actually being applied.

Do you understand what JavaScript events actually trigger new items to be added?
Does it make more sense to simply trigger those events rather than worry about physically scrolling the browser?
Is the content being delivered via AJAX? Can you query the AJAX endpoint directly to get the data you want?
Is there anything in the AJAX response that tells you when you have reached the end of the list (no more items to be added)?

When you think through these you might find you have a better way to approach the problem.

Question 7

Scrolling is handled by jQuery ScrollExtend compraonline.grupoeroski.com/assets/1.18.4/ctx/js/… which is triggered when the user scrolls beyond a particular tag. When that happens, a particular class is added to the tag and removed after the pagination is done. I'm not aware of any AJAX calls to external APIs.

Question 8

@tribet So, is there something there you can use that you think would be better than scrolling the browser?

Question 9

We might trigger the scrolling mechanism somehow, either by adding cheerio to the equation or by using nightmare's evaluate() function...

Question 10

Perhaps something in Nightmare using .wait(selector) and scrolling in a loop until a condition is met, or waiting between scroll operations until something appears (perhaps the class). I just have to think there is a better way than scrolling X number of times.
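One way to sketch the "scroll, then wait until something appears" idea, independent of any particular driver. Here waitFor is a hypothetical polling helper and the page object is a fake; with a real driver the predicate would check for the selector or class instead.

```javascript
"use strict";

// Poll `predicate` every `intervalMs` until it returns true or `timeoutMs` elapses.
// Resolves to true if the condition was met, false on timeout.
function waitFor(predicate, timeoutMs = 2000, intervalMs = 50) {
    const deadline = Date.now() + timeoutMs;
    return new Promise(function (resolve) {
        (function check() {
            if (predicate()) return resolve(true);
            if (Date.now() >= deadline) return resolve(false);
            setTimeout(check, intervalMs);
        })();
    });
}

// Fake page: a "loading" flag is set on scroll and cleared 120 ms later,
// mimicking a class that is added and then removed once pagination is done.
const fakePage = {
    loading: false,
    scroll: function () {
        const self = this;
        self.loading = true;
        setTimeout(function () { self.loading = false; }, 120);
    }
};

fakePage.scroll();
waitFor(function () { return !fakePage.loading; })
    .then(function (done) { console.log("pagination finished:", done); });
```

The timeout matters: if the class never shows up or never goes away, the loop still terminates instead of hanging.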

Question 11

Agreed, there must be a better way. I'm not sure about the class approach, as it is not reliable. I'd rather use some kind of loop like: scroll(currentHeight); newHeight = calculatePageHeight();
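The loop described here could look roughly like the sketch below: scroll to the current bottom, give the page time to load, and stop once the height no longer grows. The page object with getHeight(), scrollTo(), and settle() is a hypothetical adapter (with Nightmare these would presumably wrap evaluate() and scrollTo()), and the fake page exists only to make the sketch runnable.

```javascript
"use strict";

// Scroll to the bottom repeatedly until the page height stops growing.
// `page` is a hypothetical adapter exposing async getHeight(), scrollTo(y) and settle().
async function scrollToEnd(page, maxRounds = 100) {
    let rounds = 0;
    let height = await page.getHeight();
    while (rounds < maxRounds) {
        await page.scrollTo(height);
        await page.settle();              // wait for any newly triggered content
        const newHeight = await page.getHeight();
        rounds++;
        if (newHeight <= height) break;   // nothing was added: assume the end was reached
        height = newHeight;
    }
    return { height: height, rounds: rounds };
}

// Fake page: starts at 1000 px and grows by 1000 px per scroll to the bottom, up to 4000 px.
function makeFakePage() {
    let height = 1000;
    return {
        getHeight: async function () { return height; },
        scrollTo: async function (y) { if (y >= height && height < 4000) height += 1000; },
        settle: async function () {}
    };
}

scrollToEnd(makeFakePage()).then(function (result) { console.log(result); });
// { height: 4000, rounds: 4 }
```

The final round is the one that detects no growth, so the loop always costs one scroll more than the number of content loads. The maxRounds cap guards against pages that really never end.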

Mike Brant · Answer 1 · 2016-11-09 20:00:42Z




Comments on this answer were posted on Nov 9, 2016 by tribet (20:18), Mike Brant (20:33), tribet (20:35), Mike Brant (20:49), and tribet (20:59).
