\$\begingroup\$

I have created a scraper for yell.com in VBA. It pulls data from the site regardless of the search parameter: given any search-results link from the site, it scrapes all the records, however many pages they are spread across. The pagination block has no "a" tag for the first page, so the scraper previously collected every record except those on the first page. I have since fixed that issue, and it now works flawlessly, pulling all the available records. I tried to make it accurate, but there is always room for improvement.

Sub Yell_parser()
    Const mlink = "https://www.yell.com"
    Dim http As New XMLHTTP60
    Dim html As New HTMLDocument, html2 As New HTMLDocument
    Dim page As Object, newlink As String
    Dim I As Long, x As Long

    With http
        .Open "GET", "https://www.yell.com/ucs/UcsSearchAction.do?keywords=coffee&location=United+Kingdom&scrambleSeed=1370600159", False
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
        .send
        html.body.innerHTML = .responseText
    End With

    Set page = html.getElementsByClassName("row pagination")(0).getElementsByTagName("a")

    ' First page first, selected already, 'row pagination' doesn't have 'a' for it
    GetPageData x, html

    For I = 0 To page.Length - 2
        newlink = mlink & Replace(page(I).href, "about:", "")
        With http
            .Open "GET", newlink, False
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
            .send
            html2.body.innerHTML = .responseText
        End With
        ' Next pages start from here
        GetPageData x, html2
    Next I
End Sub

Sub GetPageData(ByRef x, ByRef html As HTMLDocument)
    Dim post As HTMLHtmlElement

    For Each post In html.getElementsByClassName("js-LocalBusiness")
        x = x + 1
        With post.getElementsByClassName("row businessCapsule--title")(0).getElementsByTagName("a")
            If .Length Then Cells(x + 1, 1) = .item(0).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 1 Then Cells(x + 1, 2) = .item(1).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 2 Then Cells(x + 1, 3) = .item(2).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 3 Then Cells(x + 1, 4) = .item(3).innerText
        End With
        With post.getElementsByClassName("businessCapsule--tel")
            If .Length > 1 Then Cells(x + 1, 5) = .item(1).innerText
        End With
    Next post
End Sub
asked Jun 14, 2017 at 16:22
\$\endgroup\$

1 Answer

\$\begingroup\$

I would focus on the following improvements:

  • Avoid code duplication. For instance, the User-Agent string is specified twice; extract it into a constant and reuse it. GetPageData also has duplicated code: the same businessCapsule--address lookup is repeated three times.
  • Some of your locators are layout-oriented, which makes them less reliable and less readable. Bootstrap classes such as col-lg-12 or col-md-11 carry layout/design meaning and are likely to change. row businessCapsule--title can become businessCapsule--title, and col-sm-10 col-md-11 col-lg-12 businessCapsule--address can become businessCapsule--address.
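
The two points above could be combined roughly as follows. This is only a sketch, not a tested rewrite: the names BASE_URL, USER_AGENT, FetchDocument and rowIndex are made up for illustration, and it assumes the single class names businessCapsule--title and businessCapsule--address match the same elements as the compound ones on the live page. Like the original, it needs references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library".

```vba
Option Explicit

' Hypothetical constant names; the values are copied from the question.
Private Const BASE_URL As String = "https://www.yell.com"
Private Const USER_AGENT As String = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"

' Hypothetical helper: one place for the request logic that was written twice.
Private Function FetchDocument(ByVal url As String) As HTMLDocument
    Dim http As New XMLHTTP60
    Dim doc As New HTMLDocument

    With http
        .Open "GET", url, False
        .setRequestHeader "User-Agent", USER_AGENT
        .send
        doc.body.innerHTML = .responseText
    End With
    Set FetchDocument = doc
End Function

Sub Yell_parser()
    Dim html As HTMLDocument
    Dim links As Object
    Dim i As Long, rowIndex As Long

    Set html = FetchDocument(BASE_URL & "/ucs/UcsSearchAction.do?keywords=coffee&location=United+Kingdom&scrambleSeed=1370600159")

    ' Locator kept exactly as in the question.
    Set links = html.getElementsByClassName("row pagination")(0).getElementsByTagName("a")

    ' The first page has no "a" tag in the pagination block, so handle it separately.
    GetPageData rowIndex, html

    ' Same bounds as the original loop (the last link is skipped).
    For i = 0 To links.Length - 2
        GetPageData rowIndex, FetchDocument(BASE_URL & Replace(links(i).href, "about:", ""))
    Next i
End Sub

Private Sub GetPageData(ByRef rowIndex As Long, ByVal html As HTMLDocument)
    Dim post As Object, spans As Object
    Dim col As Long

    For Each post In html.getElementsByClassName("js-LocalBusiness")
        rowIndex = rowIndex + 1

        ' Assumption: the single data-oriented class is enough to find the title.
        With post.getElementsByClassName("businessCapsule--title")(0).getElementsByTagName("a")
            If .Length Then Cells(rowIndex + 1, 1) = .Item(0).innerText
        End With

        ' Look the address spans up once, then write spans 1..3 to columns 2..4.
        Set spans = post.getElementsByClassName("businessCapsule--address")(0).getElementsByTagName("span")
        For col = 1 To 3
            If spans.Length > col Then Cells(rowIndex + 1, col + 1) = spans.Item(col).innerText
        Next col

        With post.getElementsByClassName("businessCapsule--tel")
            If .Length > 1 Then Cells(rowIndex + 1, 5) = .Item(1).innerText
        End With
    Next post
End Sub
```

The output columns and the row offset are intended to stay the same as in the original; only the repeated request and address lookups are folded together.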
answered Jul 26, 2017 at 15:16
\$\endgroup\$
  • @SMth80 Right, leaving a single class name, businessCapsule--address, would do the trick. Thanks. Commented Jul 26, 2017 at 16:08
  • @SMth80 Unfortunately, the VBA bindings lack many of the Selenium-specific features that are available in other language bindings. Do you have to use VBA? Commented Jul 26, 2017 at 16:13
  • Thanks, sir alecxe, for your review and suggestion. I just checked and noticed that selecting a single class makes the script look better and brings the same result. By the way, is there any hard and fast rule for choosing the best fit among compound classes? Commented Jul 26, 2017 at 16:54
  • @SMth80 Other Selenium bindings actually don't allow compound classes in "by class name" locators. Try to use a single data-oriented class for a "class name" locator. Thanks. Commented Jul 26, 2017 at 17:00
