This function scrapes data from a webpage by spawning a process that executes a CasperJS web scraping script. The spawned process outputs data to stdout. The function has an event listener for stdout, and this is how the parent process (Node) gets the scraped data back. How can I make this more modular? Also, would creating a child_process class make more sense?
function scrapeLinks(location, callback) {
    // stores any data emitted from the stdout stream of spawned casper process
    var processData = "";
    // stores any errors emitted from the stderror stream of spawned casper process
    var processError = "";
    // initialises casperjs link scraping script as spawned process
    var linkScrapeChild = child_process.spawn(casperjsPath, ['casperLinkScript.js ' + location]);
    linkScrapeChild.stdout.on('data', function onScrapeProcessStdout(data) {
        processData += data.toString();
        console.log(data.toString())
    });
    linkScrapeChild.stderr.on('data', function onScrapeProcessError(err) {
        processError += err.toString();
    });
    linkScrapeChild.on("error", function onScrapeProcessError(err) {
        processError = err.toString();
    });
    //once spawned casper process finishes execution call the callback
    linkScrapeChild.on('close', function onScrapeProcessExit(code) {
        console.log('Child process - Location Scrape: ' + location + ' - closed with code: ' + code);
        processData = convertToArray(processData);
        // filter out non valid listing links
        listingLinks = filterLinks(processData);
        //console.log(listingLinks);
        // filter duplicates
        var uniqueLinks = [ ...new Set(listingLinks) ];
        if(uniqueLinks.length === 0){
            processError += 'No valid listings found for ' + location
        }
        logScrapeResults(processError, uniqueLinks, location);
        callback(processError || null, uniqueLinks);
    });
}
1 Answer
First of all, I think spawning a child process just to scrape for links is overkill. If you're scraping links from multiple online pages, the bottleneck isn't the processing speed but your network latency. Once you grab the page contents, scraping a page is a breeze. A single node process written asynchronously can do this easily.
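
To illustrate the single-process approach, here is a minimal sketch. It assumes the links are already present in the static HTML (a page that builds its links with client-side JavaScript would still need a headless browser such as CasperJS) and that Node 18+ is available for the global fetch. The names extractLinks and fetchLinks are hypothetical, and the regular expression is a rough stand-in for a proper HTML parser.

```javascript
// Sketch only: extractLinks and fetchLinks are illustrative names,
// not part of the question's code.

// Pull href values out of anchor tags and resolve them against the page URL.
function extractLinks(html, baseUrl) {
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  const links = new Set(); // a Set drops duplicate links
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      links.add(new URL(match[1], baseUrl).href);
    } catch (err) {
      // Skip href values that cannot be parsed as URLs.
    }
  }
  return [...links];
}

// Download one page and return its unique links.
// Assumes Node 18+ where fetch is a global.
async function fetchLinks(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }
  return extractLinks(await response.text(), url);
}
```

Because fetchLinks returns a promise, several pages can be requested concurrently with Promise.all from the same process.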
Now if you really need to spawn that child process, then I'd suggest using Promises, arrow functions and the latest version of Node to make this code a bit more compact. Use template strings to construct strings instead of concatenation. Don't forget to declare your variables with var/let/const (listingLinks is assigned without a declaration).
function scrapeLinks(location) {
    return new Promise((resolve, reject) => {
        var processData = "";
        var errors = "";
        // spawn() takes its arguments as an array, one entry per argument
        var args = ['casperLinkScript.js', location];
        var linkScrapeChild = child_process.spawn(casperjsPath, args);
        linkScrapeChild.stdout.on('data', (data) => processData += data.toString());
        linkScrapeChild.stderr.on('data', (err) => errors += err.toString());
        linkScrapeChild.on("error", (err) => errors = err.toString());
        linkScrapeChild.on('close', function onScrapeProcessExit(code) {
            var uniqueLinks = [...new Set(filterLinks(convertToArray(processData)))];
            if (!uniqueLinks.length) errors += `No valid listings found for ${location}`;
            if (errors)
                reject({ code, errors });
            else
                resolve({ code, uniqueLinks });
        });
    });
}
// Usage
scrapeLinks('http://yahoo.com').then((result) => {
// result.code
// result.uniqueLinks
}, (result) => {
// result.code
// result.errors
});
- Thanks a lot! However, what happens if the spawned process fails to launch, perhaps because casperjs is not installed? Would the close event never fire, leaving the promise pending indefinitely? – therewillbecode, Mar 10, 2016 at 11:52
- @therewillbecode I'm not sure what happens. But you're free to call reject whenever you have the chance. – Joseph, Mar 10, 2016 at 13:02
- So is it best practice to call reject whenever an error is encountered by the child process? – therewillbecode, Mar 10, 2016 at 13:09
- @therewillbecode I wouldn't call it best practice. Given that your caller is only handed a promise, how else is it going to know it failed? – Joseph, Mar 10, 2016 at 13:32
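
One way to address both the launch-failure concern from the comments and the original modularity question is to move the process handling into a small reusable helper that rejects as soon as the child emits 'error'. The sketch below is not the answer's code: runProcess is a hypothetical name, and treating a non-zero exit code (rather than any stderr output) as failure is an assumption that may not match how the CasperJS script reports problems.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: run a command and collect its output in a promise.
function runProcess(command, args) {
  return new Promise((resolve, reject) => {
    let stdout = '';
    let stderr = '';
    const child = spawn(command, args);

    child.stdout.on('data', (chunk) => { stdout += chunk.toString(); });
    child.stderr.on('data', (chunk) => { stderr += chunk.toString(); });

    // 'error' is emitted when the process could not be spawned (for
    // example, the executable is missing), so reject straight away
    // instead of waiting for 'close'.
    child.on('error', reject);

    child.on('close', (code) => {
      if (code === 0) {
        resolve({ code, stdout, stderr });
      } else {
        // Assumption: a non-zero (or null) exit code means failure.
        const err = new Error(`${command} exited with code ${code}`);
        reject(Object.assign(err, { code, stderr }));
      }
    });
  });
}

// Usage sketch with the question's own names (not defined here):
// runProcess(casperjsPath, ['casperLinkScript.js', location])
//   .then(({ stdout }) => [...new Set(filterLinks(convertToArray(stdout)))]);
```

A promise settles only once, so a later call to reject from the 'close' handler is harmless if 'error' has already rejected it. The scraping-specific steps (convertToArray, filterLinks, logging) then stay outside the helper, which can be reused for other CasperJS scripts.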