I have written a scraper for yell.com in VBA. It pulls data from the site whatever the search parameters are: given any search-results link from the site, it scrapes all the records, however many pages they are spread across. The pagination block has no "a" tag for the first page (the page that is already selected), so an earlier version scraped every page except the first. I have fixed that, and the scraper now pulls all the available records. I tried to make it accurate, but there is always room for improvement.
Sub Yell_parser()
    Const mlink = "https://www.yell.com"
    Dim http As New XMLHTTP60
    Dim html As New HTMLDocument, html2 As New HTMLDocument
    Dim page As Object, newlink As String
    Dim I As Long, x As Long

    With http
        .Open "GET", "https://www.yell.com/ucs/UcsSearchAction.do?keywords=coffee&location=United+Kingdom&scrambleSeed=1370600159", False
        .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
        .send
        html.body.innerHTML = .responseText
    End With

    Set page = html.getElementsByClassName("row pagination")(0).getElementsByTagName("a")

    ' First page first, selected already, 'row pagination' doesn't have 'a' for it
    GetPageData x, html

    For I = 0 To page.Length - 2
        newlink = mlink & Replace(page(I).href, "about:", "")
        With http
            .Open "GET", newlink, False
            .setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"
            .send
            html2.body.innerHTML = .responseText
        End With
        ' Next pages start from here
        GetPageData x, html2
    Next I
End Sub

Sub GetPageData(ByRef x, ByRef html As HTMLDocument)
    Dim post As HTMLHtmlElement

    For Each post In html.getElementsByClassName("js-LocalBusiness")
        x = x + 1
        With post.getElementsByClassName("row businessCapsule--title")(0).getElementsByTagName("a")
            If .Length Then Cells(x + 1, 1) = .item(0).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 1 Then Cells(x + 1, 2) = .item(1).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 2 Then Cells(x + 1, 3) = .item(2).innerText
        End With
        With post.getElementsByClassName("col-sm-10 col-md-11 col-lg-12 businessCapsule--address")(0).getElementsByTagName("span")
            If .Length > 3 Then Cells(x + 1, 4) = .item(3).innerText
        End With
        With post.getElementsByClassName("businessCapsule--tel")
            If .Length > 1 Then Cells(x + 1, 5) = .item(1).innerText
        End With
    Next post
End Sub
1 Answer
I would focus on the following improvements:
- avoid code duplication - for instance, you have the User-Agent string specified twice - extract it as a constant and re-use. GetPageData also has duplicated code
- some of your locators are layout-oriented, which makes them less reliable and less readable - Bootstrap classes like "col-lg-12" or "col-md-11" have a layout/design meaning and have a high probability of change. "row businessCapsule--title" can become "businessCapsule--title"; "col-sm-10 col-md-11 col-lg-12 businessCapsule--address" would become "businessCapsule--address".
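The two suggestions above could be sketched roughly as follows. This is only an illustration, not a drop-in replacement: USER_AGENT, FetchDocument, SpanText and WriteAddressColumns are hypothetical names that do not appear in the original code, and the sketch assumes the same library references the question already relies on (Microsoft XML v6.0 and the Microsoft HTML Object Library) and that the single class "businessCapsule--address" matches the same elements as the compound one.

```vba
' Assumed constant: the User-Agent string from the question, defined once.
Const USER_AGENT As String = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36"

' Hypothetical helper: fetch a URL and load the response into a fresh HTMLDocument.
Function FetchDocument(ByVal url As String) As HTMLDocument
    Dim http As New XMLHTTP60
    Dim doc As New HTMLDocument

    With http
        .Open "GET", url, False
        .setRequestHeader "User-Agent", USER_AGENT
        .send
        doc.body.innerHTML = .responseText
    End With

    Set FetchDocument = doc
End Function

' Hypothetical helper: text of the idx-th <span> under the first element
' carrying className, or an empty string when the element or span is missing.
Function SpanText(ByVal post As Object, ByVal className As String, ByVal idx As Long) As String
    Dim matches As Object

    Set matches = post.getElementsByClassName(className)
    If matches.Length = 0 Then Exit Function

    With matches(0).getElementsByTagName("span")
        If .Length > idx Then SpanText = .Item(idx).innerText
    End With
End Function

' Hypothetical replacement for the three repeated address blocks:
' spans 1 to 3 go to columns 2 to 4 of the given worksheet row.
Sub WriteAddressColumns(ByVal post As Object, ByVal rowIndex As Long)
    Dim spanIndex As Long

    For spanIndex = 1 To 3
        Cells(rowIndex, spanIndex + 1) = SpanText(post, "businessCapsule--address", spanIndex)
    Next spanIndex
End Sub
```

Inside the For Each loop of GetPageData, the three address blocks would then collapse to a single call such as WriteAddressColumns post, x + 1. One difference from the original is worth noting: when a span is missing, this sketch writes an empty string to the cell rather than leaving it untouched.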
- @SMth80 right, leaving a single class name "businessCapsule--address" would do the trick. Thanks. – alecxe, Jul 26, 2017 at 16:08
- @SMth80 unfortunately, VBA bindings are very short on selenium-specific features that are available in other language bindings. Do you have to use VBA? – alecxe, Jul 26, 2017 at 16:13
- Thanks sir alecxe, for your review and suggestion. Basically, I just checked and noticed that selecting a single class makes the script look better and brings the same result. Btw, is there any hard and fast rule concerning which one to choose as the best fit among compound classes? – SIM, Jul 26, 2017 at 16:54
- @SMth80 other selenium bindings actually don't allow you to specify compound classes in the "by class name" locators. Try to use single data-oriented classes for a "class name" choice. Thanks. – alecxe, Jul 26, 2017 at 17:00