5

There appear to be several ways to scrape product data from a Magento site, but each has its own upsides and downsides.

We work with sites that have little to no technical resource, but that have given us permission to scrape their product catalog. There appear to be three ways of doing this, none of which really works:

  • Manual web scraping - developer-intensive, and needs updating whenever the theme changes.
  • Magento Web API - requires setting up an API user, which is too technical for many users.
  • Magento Plugin - too technical for many users, and exposes sensitive business data, so many companies won't do this.

Are we missing something? Is there a better alternative, or is there a way of changing any of the three options above to make it better suited to scraping?

For example, is it possible to provide a link to a 'one-click setup' style process for API access? Shopify do this in a nice way using OAuth and permission scopes, so we can give our partners a link that grants us read-only access to just their product catalog, in a way that non-technical users can manage.

asked Aug 17, 2015 at 17:20
1
  • Most merchants that are using Magento are also connected to Google Merchant Center through an XML feed. You can ask your partners to provide you with that URL. We do more or less what you are looking for but we use Google Merchant Center feed instead of developing our own extension or API. Commented Aug 21, 2021 at 0:44
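
A rough sketch of what consuming such a feed could look like, assuming the partner's feed is the RSS 2.0 flavour of a Google Merchant Center feed (the `g:` namespace is the one Google documents; the function name and the sample feed are made up for illustration, and real feeds may use Atom or different fields):

```python
import xml.etree.ElementTree as ET

# Namespace used by Google Merchant Center product attributes.
G_NS = {"g": "http://base.google.com/ns/1.0"}


def _text(item, name):
    """Return <g:name> if present, else plain <name>, else an empty string."""
    value = item.findtext(f"g:{name}", namespaces=G_NS) or item.findtext(name)
    return (value or "").strip()


def parse_merchant_feed(xml_text):
    """Turn an RSS 2.0 Merchant Center feed into a list of product dicts."""
    root = ET.fromstring(xml_text)
    return [
        {field: _text(item, field) for field in ("id", "title", "link", "price")}
        for item in root.iter("item")
    ]


# Hypothetical sample feed, only for demonstration.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
  <channel>
    <title>Example shop</title>
    <item>
      <g:id>SKU-1</g:id>
      <title>Blue Mug</title>
      <link>https://shop.example/blue-mug.html</link>
      <g:price>9.99 GBP</g:price>
    </item>
  </channel>
</rss>"""

products = parse_merchant_feed(SAMPLE_FEED)
```

Because the feed format is the same regardless of theme, the same parser could in principle be shared across partners, which is the main attraction over per-site scraping code.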

1 Answer

0

I'm not sure why a Magento plugin in and of itself would be too technical, especially if the user is instructed to install it via Magento Connect.

A plugin could build an accessible XML feed for you, so you could retrieve the feed via HTTP without worrying about a changing theme layer.

I don't think this is the one click answer you're looking for, but an 'alternative' solution could be to have clients upload a custom script that you provide.

That script could be run via cron, and would perform periodic dumps of specified DB tables only (i.e. none of the tables that contain 'sensitive business data').

Each dump could be retrieved via SSH/SFTP if you have access to that, or from a public-facing folder or by email if not. Setting up a cron task via cPanel would be pretty easy for the average user.

That would give you the most complete dataset, although not without its glaring downsides.
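
To make the cron idea a bit more concrete, here is a minimal sketch of such a script. It only builds and runs a `mysqldump` command restricted to a whitelist of catalog tables. The table names are the standard Magento 1 catalog/EAV tables; the database name, credentials file path, table prefix and output path are all hypothetical and would differ per install.

```python
import shlex
import subprocess

# Whitelist of product-related tables (Magento 1 naming). Order, customer
# and admin tables are deliberately left out.
PRODUCT_TABLES = [
    "catalog_product_entity",
    "catalog_product_entity_varchar",
    "catalog_product_entity_text",
    "catalog_product_entity_decimal",
    "catalog_product_entity_int",
    "eav_attribute",
]


def build_dump_command(database, defaults_file, tables=PRODUCT_TABLES, prefix=""):
    """Build a mysqldump argument list that dumps only the given tables.

    --defaults-extra-file must be the first option; it keeps the DB password
    out of the process list. `prefix` covers installs with a table prefix.
    """
    return [
        "mysqldump",
        f"--defaults-extra-file={defaults_file}",
        "--single-transaction",
        database,
        *[prefix + table for table in tables],
    ]


def run_dump(command, out_path):
    """Run the dump and write the SQL to out_path (raises if mysqldump fails)."""
    with open(out_path, "wb") as out:
        subprocess.run(command, stdout=out, check=True)


# Hypothetical values; nothing is executed here, the command is only built.
command = build_dump_command("magento", "/home/shop/.dump.cnf", prefix="mg_")
printable = shlex.join(command)
```

A cron entry would then call a small wrapper that invokes `run_dump(command, ...)` on whatever schedule suits, and the resulting file could be collected over SFTP.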

As a side note, an XPath parser is an elegant tool for web scraping, and could be implemented in a way that is mostly theme-agnostic if it comes to that.
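
One way to get closer to theme-agnostic is to select on schema.org `itemprop` attributes rather than theme-specific classes or IDs. The sketch below is not XPath proper: to stay dependency-free it uses Python's built-in `HTMLParser`, whereas with lxml the equivalent would be an expression like `//*[@itemprop="name"]`. It also assumes the theme emits schema.org microdata at all, which not every Magento theme does; the class name, function name and sample markup are made up.

```python
from html.parser import HTMLParser


class ItempropExtractor(HTMLParser):
    """Collect the content= value or inner text of elements with itemprop."""

    VOID = {"meta", "link", "img", "br", "hr", "input"}  # tags with no end tag

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}
        self._current = None  # itemprop whose text is being collected
        self._depth = 0       # nesting depth inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is not None:
            if tag not in self.VOID:
                self._depth += 1
            return
        prop = attrs.get("itemprop")
        if prop in self.wanted and prop not in self.found:
            if "content" in attrs:
                self.found[prop] = attrs["content"]
            elif tag not in self.VOID:
                self._current, self._depth = prop, 1
                self.found[prop] = ""

    def handle_endtag(self, tag):
        if self._current is not None and tag not in self.VOID:
            self._depth -= 1
            if self._depth == 0:
                self.found[self._current] = self.found[self._current].strip()
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.found[self._current] += data


def extract_product(html, wanted=("name", "price", "priceCurrency", "sku")):
    """Return the first value seen for each wanted itemprop."""
    parser = ItempropExtractor(wanted)
    parser.feed(html)
    return parser.found


# Hypothetical product markup; class names stand in for any theme.
SAMPLE_PAGE = (
    '<div class="theme-a"><h1 itemprop="name"> Blue <b>Mug</b></h1>'
    '<span itemprop="offers"><meta itemprop="price" content="9.99"/>'
    '<meta itemprop="priceCurrency" content="GBP"></span></div>'
)
product = extract_product(SAMPLE_PAGE)
```

Sites whose themes carry no structured data would still need per-theme selectors, so this reduces the duplicated scraping code rather than eliminating it.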

Teja Bhagavan Kollepara
3,827 reputation, 5 gold badges, 33 silver badges, 69 bronze badges
answered Aug 17, 2015 at 18:20
1
  • Thanks for your reply! Unfortunately, Magento Connect looks too complicated for some of our partners: they often use contractors to set up Magento, and aren't able to do things like this themselves. It also doesn't solve the permissions issue, namely that plugins can read anything they want. We already use XPath, sitemaps, and lots of other ways to scrape data from the pages, but Magento themes differ enough across the sites we already do this for that we can't share much, if any, scraping code between them. Commented Aug 18, 2015 at 8:37
