My algorithm scrapes an infinite-scroll page, but it takes too long. It scrolls three times, and I'm wondering whether there is a way to do something like a ScrollBottom() so that I don't need the repeated code.
Regarding the site in the example: scrolling is handled by jQuery ScrollExtend (goo.gl/Sq4vVx), which is triggered when the user scrolls beyond a particular tag. When that happens, a particular class is added to the tag and removed once the pagination is done.
I think there's room for improvement, both code- and performance-wise.
"use strict";
var Xray = require('x-ray');
var phantom = require('x-ray-phantom');

var phantom_opts = {
    webSecurity: false,
    images: false,
    weak: false
};

var x = Xray().driver(phantom(phantom_opts, function (nightmare, done) {
    // The chained browser calls are made on the second argument, named `done` here.
    done
        .useragent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36")
        .goto(nightmare.request.req.url)
        // Three fixed scroll steps, each followed by a wait for new items.
        .scrollTo(4000, 0)
        .wait()
        .scrollTo(8000, 0)
        .wait()
        .scrollTo(12000, 0)
        .wait();
}));

x('https://www.compraonline.grupoeroski.com/es/supermercado/2059698-Alimentos-Frescos/2059746-Carnes-y-aves/2059753-Pollo/', '.product_list li', [{
    name: '.description_1',
    unitPrice: '.description_2',
    image: '.image_line img@src',
    price: '.product_price_cont p',
    url: '.image_line a@href',
    volumen: '.description_1',
    medida: '.description_1'
}])(function (err, products) {
    if (err) {
        // Exit early: products is not usable when the scrape fails.
        console.error(err);
        process.exit(1);
    }
    console.log(products.length);
    process.exit(0);
});
1 Answer
I would want to understand how the infinite scroll is actually implemented.
- Do you understand which JavaScript events actually trigger new items to be added?
- Does it make more sense to simply trigger those events rather than worry about physically scrolling the browser?
- Is the content being delivered via AJAX? If so, can you query the AJAX endpoint directly to get the data you want?
- Is there anything in the AJAX response that tells you when you have reached the end of the list (no more items to be added)?
Once you think these through, you might find a better way to approach the problem.
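If the items do turn out to come from a paginated request, the loop could skip the browser entirely. The following is only a minimal sketch: `fetchPage` is a hypothetical function you would have to write after finding the real request in the browser's network tab, and the "empty batch means end of list" rule is an assumption, not something confirmed for this site.

```javascript
// Sketch only: fetchPage(pageNumber) is a hypothetical function that resolves
// to an array of items. It is not part of x-ray or of the site's API.
async function fetchAllItems(fetchPage, maxPages = 50) {
    const items = [];
    for (let page = 1; page <= maxPages; page++) {
        const batch = await fetchPage(page);
        // Assumption: an empty (or missing) batch signals the end of the list.
        if (!batch || batch.length === 0) break;
        items.push(...batch);
    }
    return items;
}
```

The `maxPages` cap is there so that a feed which never returns an empty batch cannot keep the scraper running forever.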
- Scroll is done by jQuery ScrollExtend (compraonline.grupoeroski.com/assets/1.18.4/ctx/js/…), triggered when the user scrolls beyond a particular tag. When that happens, a particular class is added to the tag and removed after the pagination is done. I'm not aware of AJAX calls to external APIs. – tribet, Nov 9, 2016 at 20:18
- @tribet So, is there something there you can use that you think would be better than scrolling the browser? – Mike Brant, Nov 9, 2016 at 20:33
- We might trigger the scrolling mechanism somehow, either by adding cheerio to the equation or by using Nightmare's evaluate() function... – tribet, Nov 9, 2016 at 20:35
- Perhaps something in Nightmare using .wait(selector) and scrolling in a loop until a condition is met, or waiting between scroll operations until something appears (perhaps the class). I just have to think there is a better way than scrolling X number of times. – Mike Brant, Nov 9, 2016 at 20:49
- Agreed, there must be a better way. I'm not sure about the class approach, as it is not reliable. I'd rather use some kind of loop like: scroll(currentHeight); newHeight = calculatePageHeight(); – tribet, Nov 9, 2016 at 20:59
- scrollTo(4000,0) will only go as far as the bottom, and so will 8000 and 12000. All three will have been called, and the page might load more content once in this time, maybe twice if you are lucky. This approach will never work.
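The loop suggested in the comments (scroll to the current height, then measure the height again) could look roughly like the sketch below. It is deliberately driver-agnostic: `page` is a hypothetical adapter with `getHeight()`, `scrollTo(y)` and `wait(ms)` methods that you would have to implement on top of whatever browser driver you use; none of these names come from x-ray, x-ray-phantom or Nightmare. The fixed pause and the "height stopped growing" stop condition are assumptions too.

```javascript
// Sketch only: `page` is a hypothetical adapter with three async methods:
//   page.getHeight()  -> resolves to the current document height in pixels
//   page.scrollTo(y)  -> scrolls the browser to vertical offset y
//   page.wait(ms)     -> pauses so the site can append new items
async function scrollToBottom(page, { pauseMs = 1000, maxRounds = 30 } = {}) {
    let previousHeight = -1;
    let rounds = 0;
    while (rounds < maxRounds) {
        const height = await page.getHeight();
        // No growth since the last scroll: assume everything is loaded.
        if (height === previousHeight) break;
        previousHeight = height;
        // Scroll to the measured height instead of a guessed offset, so each
        // scroll actually reaches the current bottom of the page.
        await page.scrollTo(height);
        await page.wait(pauseMs);
        rounds++;
    }
    return rounds; // number of scrolls performed
}
```

In a real browser, `getHeight()` would typically read something like `document.body.scrollHeight` inside the page. Note that a fixed pause can still be too short on a slow connection, in which case the loop would stop early; waiting for the loading class to disappear, as discussed above, is one possible alternative signal.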