\$\begingroup\$

This function scrapes data from a webpage by spawning a process that executes a CasperJS web scraping script. The spawned process outputs data to stdout.

This function has an event listener for stdout and this is how the parent process (Node) gets the scraped data back. How can I make this more modular? Also, would creating a child_process class make more sense?

function scrapeLinks(location, callback) {
  // stores any data emitted from the stdout stream of spawned casper process
  var processData = "";
  // stores any errors emitted from the stderr stream of spawned casper process
  var processError = "";
  // initialises casperjs link scraping script as spawned process
  var linkScrapeChild = child_process.spawn(casperjsPath, ['casperLinkScript.js ' + location]);
  linkScrapeChild.stdout.on('data', function onScrapeProcessStdout(data) {
    processData += data.toString();
    console.log(data.toString())
  });
  linkScrapeChild.stderr.on('data', function onScrapeProcessError(err) {
    processError += err.toString();
  });
  linkScrapeChild.on("error", function onScrapeProcessError(err) {
    processError = err.toString();
  });
  //once spawned casper process finishes execution call the callback
  linkScrapeChild.on('close', function onScrapeProcessExit(code) {
    console.log('Child process - Location Scrape: ' + location + ' - closed with code: ' + code);
    processData = convertToArray(processData);
    // filter out non valid listing links
    listingLinks = filterLinks(processData);
    //console.log(listingLinks);
    // filter duplicates
    var uniqueLinks = [ ...new Set(listingLinks) ];
    if(uniqueLinks.length === 0){
      processError += 'No valid listings found for ' + location
    }
    logScrapeResults(processError, uniqueLinks, location);
    callback(processError || null, uniqueLinks);
  });
}
Jamal
asked Mar 9, 2016 at 22:15
\$\endgroup\$

1 Answer

\$\begingroup\$

First of all, I think spawning a child process just to scrape for links is overkill. If you're scraping links from multiple online pages, the bottleneck isn't the processing speed but your network latency. Once you grab the page contents, scraping a page is a breeze. A single node process written asynchronously can do this easily.
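As a rough illustration of the single-process approach, here is a minimal sketch. It assumes the links are present in the static HTML; if the page builds them with client-side JavaScript (one reason to reach for CasperJS), this will not see them. The names `extractLinks` and `scrapeAll` are hypothetical, the regex is a stand-in for a proper HTML parser, and the default fetcher relies on the global `fetch` available in Node 18+.

```javascript
// Pull href values out of anchor tags. A regex is enough for a sketch;
// a real HTML parser would be more robust.
function extractLinks(html) {
  const links = [];
  const anchorHref = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = anchorHref.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Start every request at once so the network waits overlap, then
// collect the links from all pages and drop duplicates.
// fetchPage is injectable (hypothetical parameter) so it can be swapped out.
async function scrapeAll(urls, fetchPage = (url) => fetch(url).then((res) => res.text())) {
  const pages = await Promise.all(urls.map((url) => fetchPage(url)));
  const allLinks = pages.flatMap((html) => extractLinks(html));
  return [...new Set(allLinks)];
}
```

Because all requests are in flight together, the total time is roughly that of the slowest page rather than the sum of all of them.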

If you really do need to spawn that child process, I'd suggest using Promises, arrow functions and a recent version of Node to make the code more compact. Use template strings to build strings instead of concatenation. Don't forget to declare variables with var/let/const: listingLinks in your code is never declared, so it becomes an implicit global (or throws in strict mode).

function scrapeLinks(location) {
  return new Promise((resolve, reject) => {
    var processData = "";
    var errors = "";
    // spawn() takes its arguments as an array, one entry per argument
    var args = ['casperLinkScript.js', location];
    var linkScrapeChild = child_process.spawn(casperjsPath, args);
    linkScrapeChild.stdout.on('data', (data) => processData += data.toString());
    linkScrapeChild.stderr.on('data', (err) => errors += err.toString());
    linkScrapeChild.on("error", (err) => errors = err.toString());
    linkScrapeChild.on('close', function onScrapeProcessExit(code) {
      var uniqueLinks = [...new Set(filterLinks(convertToArray(processData)))];
      if (!uniqueLinks.length) errors += `No valid listings found for ${location}`;
      if (errors)
        reject({ code, errors });
      else
        resolve({ code, uniqueLinks });
    });
  });
}
// Usage
scrapeLinks('http://yahoo.com').then((result) => {
  // result.code
  // result.uniqueLinks
}, (result) => {
  // result.code
  // result.errors
});
answered Mar 9, 2016 at 23:56
\$\endgroup\$
  • \$\begingroup\$ Thanks a lot! However what happens if the spawned process is not launched perhaps because casperjs is not installed. The close event would not fire and therefore the promise would be left pending indefinitely? \$\endgroup\$ Commented Mar 10, 2016 at 11:52
  • \$\begingroup\$ @therewillbecode I'm not sure what happens. But you're free to call reject whenever you have the chance. \$\endgroup\$ Commented Mar 10, 2016 at 13:02
  • \$\begingroup\$ So is it best practice to call reject whenever an error is encountered by the child process? \$\endgroup\$ Commented Mar 10, 2016 at 13:09
  • \$\begingroup\$ @therewillbecode I wouldn't call it the best practice. Given that your caller is only handed a promise, how else is it going to know it failed? \$\endgroup\$ Commented Mar 10, 2016 at 13:32
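
To make the point in these comments concrete, here is one possible sketch of a wrapper that rejects as soon as the child emits `error` (for example when the binary is not installed), without relying on `close` firing afterwards. The helper name `runProcess` is hypothetical and not part of the original answer. It relies on the fact that a promise settles only once, so a later `reject` or `resolve` call is simply ignored.

```javascript
const { spawn } = require('child_process');

// Hypothetical helper: run a command, resolve with its stdout, and reject
// if the process cannot be started or exits with a non-zero code.
function runProcess(command, args) {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let stdout = '';
    let stderr = '';

    // The streams may be missing if the spawn failed, so guard before use.
    if (child.stdout) child.stdout.on('data', (chunk) => { stdout += chunk; });
    if (child.stderr) child.stderr.on('data', (chunk) => { stderr += chunk; });

    // Emitted when the process could not be spawned at all.
    child.on('error', reject);

    // Emitted once the process has ended and its stdio streams are closed.
    child.on('close', (code) => {
      if (code === 0) resolve(stdout);
      else reject(new Error(`exited with code ${code}: ${stderr}`));
    });
  });
}
```

The same `error` listener could be added inside `scrapeLinks`, so the caller's rejection handler runs whether the failure comes from spawning the process or from the scrape itself.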
